How to Extract Tables from PDF in Python

Ever felt frustrated with tables stuck in PDFs, wishing you could just pop them out and use them directly? In today’s tutorial, we’ll do exactly that with Python. With the help of the tabula-py and tkinter libraries, we’ll build a small GUI app that lets you pick a PDF file, extracts its tables, and displays them in a scrollable text box. Ready to get started? Let’s jump in!

Learn also: How to Create a Simple PDF File Viewer in Python

Necessary Libraries

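tkinter ships with the standard Python installers for Windows and macOS, so it normally needs no separate installation (the tk package on PyPI is an unrelated project and does not provide tkinter). On Debian or Ubuntu it is packaged separately as python3-tk. The only package to install with pip is tabula-py. Keep in mind that tabula-py is a wrapper around tabula-java, so a Java runtime must also be installed and available on your PATH:

$ sudo apt install python3-tk   # Debian/Ubuntu only, if tkinter is missing
$ pip install tabula-py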
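
Since tabula-py depends on a Java runtime in addition to the Python packages, it can help to check everything before running the app. The snippet below is a minimal sketch of such a check; the function name check_prerequisites is our own and not part of any library.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        # tkinter is in the standard library but may be packaged separately on Linux
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        # tabula-py is imported under the module name "tabula"
        "tabula": importlib.util.find_spec("tabula") is not None,
        # tabula-py runs tabula-java, so a `java` executable must be on PATH
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, available in check_prerequisites().items():
        print(f"{name}: {'found' if available else 'MISSING'}")
```

If any entry is reported as missing, install it before moving on to the rest of the tutorial.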

Imports

As usual, we start by importing the necessary modules and libraries for our script. To facilitate interaction between the user and the script through a graphical user interface (GUI), we will import the tkinter library.

import tkinter as tk

Next, we want the user to be able to select the PDF file to extract tables from, so we import the filedialog module. We also import messagebox to report success or errors, and scrolledtext to get a text widget with a built-in scrollbar.

from tkinter import filedialog, messagebox, scrolledtext

Last but not least, we import tabula (the module provided by tabula-py), which we will use to extract tables from the PDF file.

import tabula

Extract Tables from PDF Function

Now that we have imported the necessary modules and libraries, let’s define the extract_tables function. It starts by opening a file dialog so the user can choose the PDF file containing the tables. If a file is selected (the dialog returns an empty value when the user cancels), the function reads its tables with tabula.read_pdf(). Here pages="all" scans every page, stream=True tells tabula to detect columns from the whitespace between them rather than from ruling lines, and output_format="json" returns each table as a dictionary whose "data" entry is a list of rows, with each cell’s content stored under the "text" key.

Before displaying anything, the function clears the text widget of any previous content. It then iterates through the extracted tables and passes each one to the nested helper write_table_to_text_widget(), which centers every cell to the width of its column’s longest entry and inserts the rows into the text widget. If everything goes well, a success message is displayed; if an exception is raised, an error message appears instead.

def extract_tables():
   # Open file dialog to select PDF file
   file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
   if file_path:
       try:
           # Read tables from the selected PDF
           tables = tabula.read_pdf(file_path, pages="all", output_format="json", stream=True)


           # Clear previous content from the text widget
           text_widget.delete("1.0", tk.END)


           # Function to write table data to text widget with aligned columns
           def write_table_to_text_widget(table_data):
               # Calculate maximum width for each column
               max_widths = [max(len(str(cell.get("text", ""))) for cell in column) for column in zip(*table_data)]
               # Format and insert rows with aligned columns
               for row in table_data:
                   row_str = "\t".join(str(cell.get("text", "")).center(width) for cell, width in zip(row, max_widths))
                   text_widget.insert(tk.END, row_str + "\n")
               text_widget.insert(tk.END, "\n")


           # Write tables to text widget
           for i, table in enumerate(tables):
               text_widget.insert(tk.END, f"Table {i + 1}:\n")
               if "data" in table:
                   write_table_to_text_widget(table["data"])
               else:
                   text_widget.insert(tk.END, "No table data found.\n\n")


           messagebox.showinfo("Success", "Tables extracted successfully!")


       except Exception as e:
           messagebox.showerror("Error", f"An error occurred: {str(e)}")
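
To see how the column alignment behaves without opening a GUI or a PDF, here is a minimal sketch that applies the same centering logic to plain data. The format_table function and the sample list are hypothetical stand-ins of our own; the sample only imitates the shape of one table’s "data" list as used in the code above.

```python
def format_table(table_data):
    """Return the rows of one table as strings with centered, equal-width columns."""
    # Pull the text out of every cell; a missing "text" key becomes an empty string
    rows = [[str(cell.get("text", "")) for cell in row] for row in table_data]
    # zip(*rows) regroups the cells by column so each column's widest entry can be measured
    widths = [max(len(text) for text in column) for column in zip(*rows)]
    # Center each cell within its column width and join the cells with tabs
    return ["\t".join(text.center(width) for text, width in zip(row, widths)) for row in rows]


# Hypothetical sample shaped like one table's "data" list
sample = [
    [{"text": "Name"}, {"text": "Qty"}],
    [{"text": "Apple"}, {"text": "3"}],
    [{"text": "Kiwi"}, {"text": "12"}],
]

for line in format_table(sample):
    print(repr(line))
```

Note that zip() stops at the shortest row, so if the rows of a table have different lengths, the extra cells of the longer rows are silently dropped, both in this sketch and in the function above.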

Creating the Main Window

This is the part where we create the main window, which will house our text widget, and give it a title.

# Create main Tkinter window
root = tk.Tk()
root.title("PDF Table Extractor - The Pycodes")

Creating the Text Widget

After setting up the main window, it’s time to create the text widget. We set its width and height (in characters and lines), use ScrolledText so that it comes with a vertical scrollbar, and pass wrap=tk.WORD so that lines too long for the widget break at word boundaries instead of in the middle of a word.

# Create text widget to display tables
text_widget = scrolledtext.ScrolledText(root, wrap=tk.WORD, width=80, height=20)
text_widget.pack(padx=10, pady=10)

Extract Tables Button

Now that we have the interface, the text widget, and the function set up, all we need is a trigger to launch the operation. To achieve this, we will create a button labeled “Extract Tables” and pass extract_tables (without parentheses) as its command, so the function is called each time the button is clicked.

# Create button to trigger table extraction
extract_button = tk.Button(root, text="Extract Tables", command=extract_tables)
extract_button.pack(pady=10)

Starting the Main Loop

Lastly, we start the Tkinter event loop, which keeps the main window running and responsive until the user closes it.

# Run the Tkinter event loop
root.mainloop()

Full Code

import tkinter as tk
from tkinter import filedialog, messagebox, scrolledtext
import tabula


def extract_tables():
   # Open file dialog to select PDF file
   file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
   if file_path:
       try:
           # Read tables from the selected PDF
           tables = tabula.read_pdf(file_path, pages="all", output_format="json", stream=True)


           # Clear previous content from the text widget
           text_widget.delete("1.0", tk.END)


           # Function to write table data to text widget with aligned columns
           def write_table_to_text_widget(table_data):
               # Calculate maximum width for each column
               max_widths = [max(len(str(cell.get("text", ""))) for cell in column) for column in zip(*table_data)]
               # Format and insert rows with aligned columns
               for row in table_data:
                   row_str = "\t".join(str(cell.get("text", "")).center(width) for cell, width in zip(row, max_widths))
                   text_widget.insert(tk.END, row_str + "\n")
               text_widget.insert(tk.END, "\n")


           # Write tables to text widget
           for i, table in enumerate(tables):
               text_widget.insert(tk.END, f"Table {i + 1}:\n")
               if "data" in table:
                   write_table_to_text_widget(table["data"])
               else:
                   text_widget.insert(tk.END, "No table data found.\n\n")


           messagebox.showinfo("Success", "Tables extracted successfully!")


       except Exception as e:
           messagebox.showerror("Error", f"An error occurred: {str(e)}")


# Create main Tkinter window
root = tk.Tk()
root.title("PDF Table Extractor - The Pycodes")


# Create text widget to display tables
text_widget = scrolledtext.ScrolledText(root, wrap=tk.WORD, width=80, height=20)
text_widget.pack(padx=10, pady=10)


# Create button to trigger table extraction
extract_button = tk.Button(root, text="Extract Tables", command=extract_tables)
extract_button.pack(pady=10)


# Run the Tkinter event loop
root.mainloop()

Happy Coding!
