
How to Build a Text Classification App with Machine Learning in Python

Building a text classification app can be an exciting journey into the world of machine learning. In our digital age, understanding and categorizing text data has never been more important, whether it’s filtering emails, analyzing sentiments, or organizing news articles. Today, you’ll learn how to harness the power of machine learning to create your own text classification application using Python.

In this tutorial, we’ll guide you through downloading and preparing the dataset, exploring text vectorization with TfidfVectorizer, and tuning our model with GridSearchCV. You’ll also get hands-on experience with popular classification models like Naive Bayes, Logistic Regression, and SVM, all while creating an interactive interface using Tkinter. So, let’s dive in and start building something amazing together!

Installing the Required Libraries

Before we jump into the code, let’s make sure we have all the libraries we need. Open your command line and run the following commands:

$ pip install requests 
$ pip install tqdm 
$ pip install nltk 
$ pip install scikit-learn
$ pip install tk 

Imports

You can’t build anything without a strong foundation, so let’s gather our tools to do just that:

  • We’ll use os and tarfile to manage file paths and extraction.
  • For the dataset, requests will handle downloading, and tqdm will provide a progress bar for tracking.
  • To make our program user-friendly, we’ll create a graphical interface using tkinter and its themed widgets from ttk, along with messagebox to display information.
  • For machine learning models, we’ll rely on our go-to library, sklearn.
  • Finally, since we’re classifying texts, we’ll preprocess them with natural language processing tools from nltk, an ideal choice for this task.
import os
import requests
from tqdm import tqdm
import tarfile
import tkinter as tk
from tkinter import messagebox, ttk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk

Downloading Stopwords

Even with these powerful libraries, raw text needs some cleanup before analysis. First, we need to remove common words that don’t add meaning to the classification, such as ‘the’, ‘or’, and ‘and’. To achieve this, we will use NLTK’s stopwords.

nltk.download("stopwords")
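If you’re curious what gets filtered out, you can peek at the list once it’s downloaded, using the stopwords module we imported earlier (a quick, optional check):

english_stops = stopwords.words("english")
print(len(english_stops))   # around 180 words in recent NLTK versions
print(english_stops[:5])    # e.g. ['i', 'me', 'my', 'myself', 'we']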

Define Dataset Paths and Variables

# Download path and dataset variables
DATA_URL = "https://ndownloader.figshare.com/files/5975967"
DATA_DIR = "20_newsgroups"
ARCHIVE_FILE = "20_newsgroups.tar.gz"
DATA_DIR_TRAIN = os.path.join(DATA_DIR, "20news-bydate-train")

To keep track of where the dataset is downloaded, stored, and located, we define a few constants. Each one plays a specific role:

  • DATA_URL: This is the link to the origin of the dataset we’ll be downloading.
  • DATA_DIR and ARCHIVE_FILE: Together, these specify the locations where the dataset will be stored and the name of the archive file.
  • DATA_DIR_TRAIN: Finally, this constant indicates where the training data will go after we extract it.

Dataset Download, Extraction, and Loading

Now, all that’s left is to download the dataset. That’s the job of the download_dataset() function, which uses requests.get() with stream=True to fetch the dataset in small, memory-efficient chunks. With tqdm, we also wrap the download in a progress bar, allowing us to follow the process in real time.

# Download dataset with progress bar
def download_dataset(url, output_path):
   response = requests.get(url, stream=True)
   total_size = int(response.headers.get('content-length', 0))
   with open(output_path, "wb") as file, tqdm(
       desc="Downloading 20 Newsgroups dataset",
       total=total_size,
       unit="B",
       unit_scale=True,
       unit_divisor=1024,
   ) as bar:
       for data in response.iter_content(chunk_size=1024):
           file.write(data)
           bar.update(len(data))

Since we’ve downloaded the dataset, it’s time to move on to extraction. The extract_dataset() function handles this step, checking first if the dataset has already been extracted to avoid redoing it. If not, it extracts the dataset to our specified directory.

# Extract the dataset if not done already
def extract_dataset(archive_path, extract_to):
   if not os.path.isdir(extract_to):
       print("Extracting dataset...")
       with tarfile.open(archive_path, "r:gz") as tar:
           tar.extractall(path=extract_to)

Checking and Preparing the Dataset

Though everything seems ready, it’s always wise to double-check. Here, we verify that the data directory exists. If it doesn’t, we download and extract the dataset, then list its contents to confirm everything is set.

# Check and prepare dataset
if not os.path.isdir(DATA_DIR):
   if not os.path.isfile(ARCHIVE_FILE):
       download_dataset(DATA_URL, ARCHIVE_FILE)
   extract_dataset(ARCHIVE_FILE, DATA_DIR)

# Verify dataset directory content
if os.path.isdir(DATA_DIR):
   print(f"Contents of '{DATA_DIR}':", os.listdir(DATA_DIR))

Loading and Verifying Dataset Categories

Now, we load specific categories from our dataset: general politics (talk.politics.misc), sports (rec.sport.baseball), graphics (comp.graphics), and medicine (sci.med). Once the data is loaded, we check if any data exists; if it doesn’t, an error message will be displayed, and the program will exit.

# Load dataset with selective categories
try:
   newsgroups = load_files(DATA_DIR_TRAIN, categories=["talk.politics.misc", "rec.sport.baseball", "sci.med", "comp.graphics"])
   if len(newsgroups.data) == 0:
       raise ValueError("Dataset is empty. Check if the files were extracted correctly.")
except Exception as e:
   print(f"Error loading dataset: {e}")
   exit()
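
Once the load succeeds, it’s worth printing a quick summary to confirm which categories and how many documents were actually picked up:

# Optional: confirm what was loaded
print("Categories:", newsgroups.target_names)
print("Number of documents:", len(newsgroups.data))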

Preprocessing Data and Training the Models

After extracting and loading the data, we need to clean and stem it. The preprocess_documents() function handles this by using PorterStemmer to reduce words to their root forms, making the analysis more consistent. It also filters out stopwords and any additional unhelpful words that could introduce bias in the analysis. After processing, the data is ready for the next phase.

# Decode documents and apply preprocessing
def preprocess_documents(documents):
    ps = PorterStemmer()
    
    # Define custom stop words for the categories
    custom_stop_words = stopwords.words("english") + [
        "politics", "political", "gun", "guns", "sports", "baseball", 
        "graphics", "graphic", "medicine", "medical", "health", "sci", 
        "science", "discussion", "topic", "newsgroup", "forum"
    ]
    
    processed_docs = []
    for doc in documents:
        try:
            decoded = doc.decode("utf-8", errors="ignore")
            stemmed = " ".join([ps.stem(word) for word in decoded.split() if word.lower() not in custom_stop_words])
            processed_docs.append(stemmed)
        except UnicodeDecodeError:
            continue
    return processed_docs
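
To get a feel for what the preprocessing does, you can run a single made-up document through it (the byte string below is just an illustration):

sample = [b"Doctors are studying new treatments for patients"]
print(preprocess_documents(sample)[0])
# Stop words such as "are" and "for" are dropped, and the remaining words
# are reduced to stems like "doctor", "studi" and "treatment".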

Data Splitting and Vectorizing

Once the data is cleaned, we use train_test_split() to divide it into a training set and a testing set. Next, TfidfVectorizer converts the documents into a TF-IDF matrix that gives more weight to distinctive terms and less to terms that appear in most documents (with max_df=0.7, terms found in more than 70% of documents are ignored entirely). This matrix forms the foundation for our classifiers, helping to highlight meaningful word patterns.

# Preprocess and split the dataset
X_data = preprocess_documents(newsgroups.data)
X_train, X_test, y_train, y_test = train_test_split(X_data, newsgroups.target, test_size=0.2, random_state=42)

# TF-IDF vectorizer (terms in more than 70% of documents are ignored)
vectorizer = TfidfVectorizer(max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
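
A quick sanity check at this point is to look at the matrix shapes; the rows are documents and the columns are the terms the vectorizer kept:

print(X_train_tfidf.shape)   # (number of training documents, vocabulary size)
print(X_test_tfidf.shape)    # same vocabulary, fewer documents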

Tuning Parameters for Logistic Regression

Now we move to the model tuning phase, specifically for Logistic Regression. We use GridSearchCV to search over the C parameter, which controls the strength of regularization (smaller values mean a simpler, more constrained model), and keep the configuration with the best cross-validated accuracy.

# Parameter tuning for Logistic Regression
logreg = LogisticRegression(max_iter=1000)
param_grid = {'C': [0.1, 1, 10, 100]}
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train_tfidf, y_train)
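
If you want to see which value of C wins, GridSearchCV stores the best parameters and the corresponding cross-validated score after fitting:

print("Best parameters:", logreg_cv.best_params_)        # e.g. {'C': 10}
print(f"Best CV accuracy: {logreg_cv.best_score_:.2%}")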

Defining and Training Models

With the data ready, we set up three models, each with unique strengths:

  • Naive Bayes: Fast and effective with word-frequency features.
  • Logistic Regression: Uses the C value tuned above.
  • SVM: Can learn more complex decision boundaries; we set probability=True so it can report prediction probabilities in the interface.
# Models with tuned parameters
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": logreg_cv.best_estimator_,
    "SVM": SVC(probability=True)
}

Training and Evaluating the Models

Lastly, we train each model on the training set and evaluate its performance on the testing set. Each model is wrapped in a pipeline together with the TF-IDF vectorizer, so later on we can feed raw text straight in for predictions. We record the accuracies to compare how each classifier performs.

# Training models and storing accuracies
model_pipelines = {}
accuracies = {}
for model_name, model in models.items():
    pipeline = make_pipeline(vectorizer, model)
    pipeline.fit(X_train, y_train)
    model_pipelines[model_name] = pipeline
    y_pred = pipeline.predict(X_test)
    accuracies[model_name] = accuracy_score(y_test, y_pred)
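
To compare the classifiers at a glance, you can print the stored accuracies (the exact numbers will vary with the split and preprocessing):

for model_name, acc in accuracies.items():
    print(f"{model_name}: {acc:.2%}")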

Implementing Text Classification and Reset Functionality

Creating the Classification Function

Before we dive into the visual part of our program, we need a way to connect the GUI with the classification logic. That’s where the classify_text() function comes in. This function retrieves the user input using input_text.get(), stripping any whitespace with the strip() function.

First, it checks if there’s any input; if not, it prompts the user to enter some text through a message box. Once valid input is provided, it grabs the selected model from the GUI and uses it to predict the category, along with the associated probabilities, using pipeline.predict() and pipeline.predict_proba(). Finally, it displays the results to the user.

# Function to classify text based on selected model
def classify_text():
    user_text = input_text.get("1.0", "end-1c").strip()
    if not user_text:
        messagebox.showwarning("Input Required", "Please enter some text to classify.")
        return

    selected_model = model_choice.get()
    pipeline = model_pipelines[selected_model]
    prediction = pipeline.predict([user_text])[0]
    probabilities = pipeline.predict_proba([user_text])[0]

    # Display results
    result_text.set(f"Predicted Category: {newsgroups.target_names[prediction]}")
    prob_text = "\n".join([f"{newsgroups.target_names[i]}: {prob:.2%}" for i, prob in enumerate(probabilities)])
    prob_label.config(text=f"Prediction Probabilities:\n{prob_text}")

Adding the Reset Function

To enable classification of new texts, we need a way to reset the program. This is accomplished through the clear_text() function, which clears the input text box using input_text.delete(), resets the results label with result_text.set(""), and clears out the probability label with prob_label.config().

# Function to clear input and output fields
def clear_text():
    input_text.delete("1.0", tk.END)
    result_text.set("")
    prob_label.config(text="")

Build and Run Text Classifier UI

To kick things off, we set up our main window using Tk, giving it a title and a comfortable size. This creates a friendly space for users to interact with our text classifier. We add a label to guide them on what to do, followed by a spacious text box where they can enter the text they want to classify. To make it even more user-friendly, we include a drop-down menu that allows users to select from the different models we’ve implemented, with Naive Bayes set as the default option.

# Tkinter UI setup
app = tk.Tk()
app.title("Text Classifier - The Pycodes")
app.geometry("600x600")

# Input Text Label
tk.Label(app, text="Enter text to classify:", font=("Arial", 12)).pack(pady=10)

# Text Box for Input
input_text = tk.Text(app, height=8, width=60, font=("Arial", 10))
input_text.pack(pady=10)

# Classifier Choice Dropdown
model_choice = ttk.Combobox(app, values=list(models.keys()), font=("Arial", 10))
model_choice.set("Naive Bayes")
model_choice.pack(pady=10)

Next, we give users some useful context by displaying the test accuracy of each model right in the interface. Using StringVar(), we store this text in a Tkinter variable bound to a label, so all three accuracies are visible at once. To trigger the classification process, we create a button that users can click to classify the text they’ve entered. The results are then displayed in a designated label, giving instant feedback on the predicted category.

# Show test accuracy of each model
accuracy_text = tk.StringVar()
accuracy_text.set("Test Accuracies:\n" + "\n".join([f"{model}: {acc:.2%}" for model, acc in accuracies.items()]))
accuracy_label = tk.Label(app, textvariable=accuracy_text, font=("Arial", 10), fg="blue")
accuracy_label.pack(pady=5)

# Classify Button
classify_button = tk.Button(app, text="Classify Text", command=classify_text, font=("Arial", 12), bg="lightgreen")
classify_button.pack(pady=5)

# Result Label
result_text = tk.StringVar()
result_label = tk.Label(app, textvariable=result_text, font=("Arial", 14), fg="green")
result_label.pack(pady=10)

To enhance usability, we add a label for showing the prediction probabilities for each category, helping users understand the confidence level of the predictions. Additionally, a “Clear Text” button allows users to easily reset their input and start fresh.

# Prediction Probabilities Label
prob_label = tk.Label(app, text="", font=("Arial", 10), fg="purple")
prob_label.pack(pady=10)

# Clear Button
clear_button = tk.Button(app, text="Clear Text", command=clear_text, font=("Arial", 12), bg="lightcoral")
clear_button.pack(pady=5)

To wrap things up, we kick off the main event loop using mainloop(). This keeps our window active and responsive, ensuring that users have a smooth and enjoyable experience as they interact with the application.

# Run the Tkinter main loop
app.mainloop()

Example

As shown in the app interface, if we input the text ‘Recent studies show promising results for new treatments in cancer therapy’, the classifier should ideally categorize it under sci.med due to the medical and scientific context of the statement.

Then, I tried a politics-related text: ‘Political discussions about the upcoming elections are dominating the news cycle’, which should land in talk.politics.misc.
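
If you’d rather check these examples without the GUI, a small snippet like this reuses the pipelines built earlier to make the same predictions (the exact labels depend on how your run trained):

samples = [
    "Recent studies show promising results for new treatments in cancer therapy",
    "Political discussions about the upcoming elections are dominating the news cycle",
]
pipeline = model_pipelines["Logistic Regression"]
for text in samples:
    prediction = pipeline.predict([text])[0]
    print(text, "->", newsgroups.target_names[prediction])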

Full Code

import os
import requests
from tqdm import tqdm
import tarfile
import tkinter as tk
from tkinter import messagebox, ttk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk


nltk.download("stopwords")


# Download path and dataset variables
DATA_URL = "https://ndownloader.figshare.com/files/5975967"
DATA_DIR = "20_newsgroups"
ARCHIVE_FILE = "20_newsgroups.tar.gz"
DATA_DIR_TRAIN = os.path.join(DATA_DIR, "20news-bydate-train")


# Download dataset with progress bar
def download_dataset(url, output_path):
   response = requests.get(url, stream=True)
   total_size = int(response.headers.get('content-length', 0))
   with open(output_path, "wb") as file, tqdm(
       desc="Downloading 20 Newsgroups dataset",
       total=total_size,
       unit="B",
       unit_scale=True,
       unit_divisor=1024,
   ) as bar:
       for data in response.iter_content(chunk_size=1024):
           file.write(data)
           bar.update(len(data))


# Extract the dataset if not done already
def extract_dataset(archive_path, extract_to):
   if not os.path.isdir(extract_to):
       print("Extracting dataset...")
       with tarfile.open(archive_path, "r:gz") as tar:
           tar.extractall(path=extract_to)


# Check and prepare dataset
if not os.path.isdir(DATA_DIR):
   if not os.path.isfile(ARCHIVE_FILE):
       download_dataset(DATA_URL, ARCHIVE_FILE)
   extract_dataset(ARCHIVE_FILE, DATA_DIR)


# Verify dataset directory content
if os.path.isdir(DATA_DIR):
   print(f"Contents of '{DATA_DIR}':", os.listdir(DATA_DIR))


# Load dataset with selective categories
try:
   newsgroups = load_files(DATA_DIR_TRAIN, categories=["talk.politics.misc", "rec.sport.baseball", "sci.med", "comp.graphics"])
   if len(newsgroups.data) == 0:
       raise ValueError("Dataset is empty. Check if the files were extracted correctly.")
except Exception as e:
   print(f"Error loading dataset: {e}")
   exit()


# Decode documents and apply preprocessing
def preprocess_documents(documents):
    ps = PorterStemmer()
    
    # Define custom stop words for the categories
    custom_stop_words = stopwords.words("english") + [
        "politics", "political", "gun", "guns", "sports", "baseball", 
        "graphics", "graphic", "medicine", "medical", "health", "sci", 
        "science", "discussion", "topic", "newsgroup", "forum"
    ]
    
    processed_docs = []
    for doc in documents:
        try:
            decoded = doc.decode("utf-8", errors="ignore")
            stemmed = " ".join([ps.stem(word) for word in decoded.split() if word.lower() not in custom_stop_words])
            processed_docs.append(stemmed)
        except UnicodeDecodeError:
            continue
    return processed_docs




# Preprocess and split the dataset
X_data = preprocess_documents(newsgroups.data)
X_train, X_test, y_train, y_test = train_test_split(X_data, newsgroups.target, test_size=0.2, random_state=42)


# TF-IDF vectorizer (terms in more than 70% of documents are ignored)
vectorizer = TfidfVectorizer(max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


# Parameter tuning for Logistic Regression
logreg = LogisticRegression(max_iter=1000)
param_grid = {'C': [0.1, 1, 10, 100]}
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train_tfidf, y_train)


# Models with tuned parameters
models = {
   "Naive Bayes": MultinomialNB(),
   "Logistic Regression": logreg_cv.best_estimator_,
   "SVM": SVC(probability=True)
}


# Training models and storing accuracies
model_pipelines = {}
accuracies = {}
for model_name, model in models.items():
   pipeline = make_pipeline(vectorizer, model)
   pipeline.fit(X_train, y_train)
   model_pipelines[model_name] = pipeline
   y_pred = pipeline.predict(X_test)
   accuracies[model_name] = accuracy_score(y_test, y_pred)


# Function to classify text based on selected model
def classify_text():
   user_text = input_text.get("1.0", "end-1c").strip()
   if not user_text:
       messagebox.showwarning("Input Required", "Please enter some text to classify.")
       return


   selected_model = model_choice.get()
   pipeline = model_pipelines[selected_model]
   prediction = pipeline.predict([user_text])[0]
   probabilities = pipeline.predict_proba([user_text])[0]


   # Display results
   result_text.set(f"Predicted Category: {newsgroups.target_names[prediction]}")
   prob_text = "\n".join([f"{newsgroups.target_names[i]}: {prob:.2%}" for i, prob in enumerate(probabilities)])
   prob_label.config(text=f"Prediction Probabilities:\n{prob_text}")


# Function to clear input and output fields
def clear_text():
   input_text.delete("1.0", tk.END)
   result_text.set("")
   prob_label.config(text="")


# Tkinter UI setup
app = tk.Tk()
app.title("Text Classifier - The Pycodes")
app.geometry("600x600")


# Input Text Label
tk.Label(app, text="Enter text to classify:", font=("Arial", 12)).pack(pady=10)


# Text Box for Input
input_text = tk.Text(app, height=8, width=60, font=("Arial", 10))
input_text.pack(pady=10)


# Classifier Choice Dropdown
model_choice = ttk.Combobox(app, values=list(models.keys()), font=("Arial", 10))
model_choice.set("Naive Bayes")
model_choice.pack(pady=10)


# Show test accuracy of each model
accuracy_text = tk.StringVar()
accuracy_text.set("Test Accuracies:\n" + "\n".join([f"{model}: {acc:.2%}" for model, acc in accuracies.items()]))
accuracy_label = tk.Label(app, textvariable=accuracy_text, font=("Arial", 10), fg="blue")
accuracy_label.pack(pady=5)


# Classify Button
classify_button = tk.Button(app, text="Classify Text", command=classify_text, font=("Arial", 12), bg="lightgreen")
classify_button.pack(pady=5)


# Result Label
result_text = tk.StringVar()
result_label = tk.Label(app, textvariable=result_text, font=("Arial", 14), fg="green")
result_label.pack(pady=10)


# Prediction Probabilities Label
prob_label = tk.Label(app, text="", font=("Arial", 10), fg="purple")
prob_label.pack(pady=10)


# Clear Button
clear_button = tk.Button(app, text="Clear Text", command=clear_text, font=("Arial", 12), bg="lightcoral")
clear_button.pack(pady=5)


# Run the Tkinter main loop
app.mainloop()

Happy Coding!
