Ever felt frustrated with tables stuck in PDFs, wishing you could just pop them out and use them directly? That's exactly what we're going to tackle here.
In today's tutorial, we'll extract tables from PDF files using Python. With the help of the tabula-py and tkinter libraries, we'll build a small GUI that lets you pick a PDF and displays every table found in it. Ready to get started? Let's jump in!
Table of Contents
- Necessary Libraries
- Imports
- Extract Tables from PDF Function
- Creating the Main Window
- Creating the Text Widget
- Extract Tables Button
- Starting the Main Loop
- Example
- Full Code
Necessary Libraries
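tkinter ships with the standard Python installers, so it does not need to be installed with pip (on some Linux distributions it is packaged separately, for example as python3-tk on Debian and Ubuntu). The only package to install is tabula-py, which wraps the Java library tabula-java and therefore also needs a Java runtime available on your system. Install it from the terminal or command prompt:
$ pip install tabula-py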
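Before moving on, it can help to confirm that everything the script relies on is actually reachable. tabula-py is a wrapper around a Java library, so a Java runtime matters as much as the Python packages. The snippet below is only a rough sketch: check_environment is a hypothetical helper name, and it assumes that finding a java executable on the PATH is a good enough sign that Java is installed.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether the pieces this tutorial relies on can be found."""
    return {
        # tkinter normally comes with Python itself
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        # the tabula-py package is imported under the name "tabula"
        "tabula": importlib.util.find_spec("tabula") is not None,
        # assumption: a "java" executable on the PATH means Java is usable
        "java": shutil.which("java") is not None,
    }


status = check_environment()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If any entry is reported as missing, it is worth fixing that first, since the extraction step later on depends on all three.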
Imports
As usual, we start by importing the necessary modules and libraries for our script. To let the user interact with the script through a graphical user interface (GUI), we import the tkinter library.
import tkinter as tk
Next, we want the user to be able to select the PDF file from which to extract tables, so we import the filedialog module. We also import messagebox to display a message if an error occurs while the code runs, and scrolledtext to get a text widget with a built-in scrollbar.
from tkinter import filedialog, messagebox, scrolledtext
Last but not least, we import tabula, which we will use to extract tables from the PDF file.
import tabula
Extract Tables from PDF Function
Now that we have imported the necessary modules and libraries, let's define the extract_tables function. It starts by opening a file dialog so the user can choose the PDF file containing the tables. Once a file is selected, it extracts the tables with tabula.read_pdf(), asking for JSON output so that each table arrives as a dictionary whose "data" entry holds the rows of cells.
Before displaying anything, the function clears the text widget of any previous content. It then iterates through the extracted tables and hands each one to the nested helper write_table_to_text_widget(), which formats the rows and inserts them into the text widget. If all of this succeeds, a success message is displayed; if something goes wrong, an error message appears instead.
def extract_tables():
    # Open file dialog to select PDF file
    file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
    if file_path:
        try:
            # Read tables from the selected PDF
            tables = tabula.read_pdf(file_path, pages="all", output_format="json", stream=True)
            # Clear previous content from the text widget
            text_widget.delete("1.0", tk.END)

            # Function to write table data to text widget with aligned columns
            def write_table_to_text_widget(table_data):
                # Calculate maximum width for each column
                max_widths = [max(len(str(cell.get("text", ""))) for cell in column) for column in zip(*table_data)]
                # Format and insert rows with aligned columns
                for row in table_data:
                    row_str = "\t".join(str(cell.get("text", "")).center(width) for cell, width in zip(row, max_widths))
                    text_widget.insert(tk.END, row_str + "\n")
                text_widget.insert(tk.END, "\n")

            # Write tables to text widget
            for i, table in enumerate(tables):
                text_widget.insert(tk.END, f"Table {i + 1}:\n")
                if "data" in table:
                    write_table_to_text_widget(table["data"])
                else:
                    text_widget.insert(tk.END, "No table data found.\n\n")
            messagebox.showinfo("Success", "Tables extracted successfully!")
        except Exception as e:
            messagebox.showerror("Error", f"An error occurred: {e}")
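The column-alignment step can be tried on its own, without a PDF or a window. The sketch below assumes the shape the function above reads from: each table's "data" is a list of rows, and each row is a list of cell dictionaries with a "text" key. The format_table helper and the sample rows are made up for illustration, and it pads with spaces and left-aligns instead of centering and joining with tabs, which is a swapped-in variation rather than the tutorial's exact formatting.

```python
def format_table(table_data, gap=2):
    """Return the rows of one table as left-aligned, space-padded lines."""
    # Pull the text out of every cell; a missing "text" key becomes "".
    rows = [[str(cell.get("text", "")) for cell in row] for row in table_data]
    if not rows:
        return []
    # Pad short rows so every row has the same number of columns.
    n_cols = max(len(row) for row in rows)
    rows = [row + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(n_cols)]
    sep = " " * gap
    return [
        sep.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


# Hypothetical sample data in the assumed JSON-like shape
sample = [
    [{"text": "Name"}, {"text": "Qty"}],
    [{"text": "Widget"}, {"text": "3"}],
    [{"text": "Bolt"}, {"text": "12"}],
]
lines = format_table(sample)
print("\n".join(lines))
```

Because the helper returns plain strings, the same lines could be inserted into the text widget one by one or written to a file, and the logic can be checked without launching the GUI.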
Creating the Main Window
This is the part where we create the main window, which will house our text widget, and give it a title.
# Create main Tkinter window
root = tk.Tk()
root.title("PDF Table Extractor - The Pycodes")
Creating the Text Widget
After setting up the main window, it's time to create the text widget. We specify its width and height, make it scrollable by using ScrolledText, and improve readability with wrap=tk.WORD. With this option, tkinter breaks long lines at word boundaries instead of splitting a word in the middle when it reaches the edge of the widget.
# Create text widget to display tables
text_widget = scrolledtext.ScrolledText(root, wrap=tk.WORD, width=80, height=20)
text_widget.pack(padx=10, pady=10)
Extract Tables Button
Now that we have the interface, the text widget, and the function set up, all we need is a trigger to launch the operation. To achieve this, we create a button labeled "Extract Tables" and pass the extract_tables function to it as its command, so that clicking the button calls the function.
# Create button to trigger table extraction
extract_button = tk.Button(root, text="Extract Tables", command=extract_tables)
extract_button.pack(pady=10)
Starting the Main Loop
Lastly, we start the event loop, which keeps the main window open and responsive until the user closes it.
# Run the Tkinter event loop
root.mainloop()
Example
Full Code
import tkinter as tk
from tkinter import filedialog, messagebox, scrolledtext
import tabula


def extract_tables():
    # Open file dialog to select PDF file
    file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
    if file_path:
        try:
            # Read tables from the selected PDF
            tables = tabula.read_pdf(file_path, pages="all", output_format="json", stream=True)
            # Clear previous content from the text widget
            text_widget.delete("1.0", tk.END)

            # Function to write table data to text widget with aligned columns
            def write_table_to_text_widget(table_data):
                # Calculate maximum width for each column
                max_widths = [max(len(str(cell.get("text", ""))) for cell in column) for column in zip(*table_data)]
                # Format and insert rows with aligned columns
                for row in table_data:
                    row_str = "\t".join(str(cell.get("text", "")).center(width) for cell, width in zip(row, max_widths))
                    text_widget.insert(tk.END, row_str + "\n")
                text_widget.insert(tk.END, "\n")

            # Write tables to text widget
            for i, table in enumerate(tables):
                text_widget.insert(tk.END, f"Table {i + 1}:\n")
                if "data" in table:
                    write_table_to_text_widget(table["data"])
                else:
                    text_widget.insert(tk.END, "No table data found.\n\n")
            messagebox.showinfo("Success", "Tables extracted successfully!")
        except Exception as e:
            messagebox.showerror("Error", f"An error occurred: {e}")


# Create main Tkinter window
root = tk.Tk()
root.title("PDF Table Extractor - The Pycodes")

# Create text widget to display tables
text_widget = scrolledtext.ScrolledText(root, wrap=tk.WORD, width=80, height=20)
text_widget.pack(padx=10, pady=10)

# Create button to trigger table extraction
extract_button = tk.Button(root, text="Extract Tables", command=extract_tables)
extract_button.pack(pady=10)

# Run the Tkinter event loop
root.mainloop()
Happy Coding!