How to Detect Fake News in Python

In our digital world, it feels like we’re swimming in a sea of information, and not all of it is true. The rise of fake news is troubling; it can shape public opinion and influence major events. That’s why it’s more important than ever to have the right tools to spot the real from the fake.

Today, you’ll learn how to use Python against misinformation. We’re going to build a machine learning-powered news article classifier that can tell real news from fake: the tkinter library provides the graphical user interface (GUI), Pandas processes the dataset, and a Scikit-learn algorithm trains the classifier.

Let’s get started!

Necessary Libraries

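Make sure to install these libraries via the terminal or your command prompt for the code to function properly:

$ pip install pandas
$ pip install scikit-learn

tkinter is part of Python’s standard library and normally ships with the official installers, so it does not need to be installed with pip. On some Linux distributions it comes as a separate system package (for example, python3-tk on Debian and Ubuntu).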

Before we start, you need a dataset with examples of fake and real news in CSV format. You can download it from this link:

https://github.com/lutzhamel/fake-news/blob/master/data/fake_or_real_news.csv
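
Before wiring the dataset into a GUI, it can help to confirm that the file has the columns the rest of the tutorial relies on. The following is a minimal sketch, not part of the original program: it reads a small in-memory sample instead of the downloaded file, and the helper name summarize_dataset as well as the "id" and "title" columns in the sample are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV.
# Only the "text" and "label" column names come from the tutorial.
sample_csv = io.StringIO(
    "id,title,text,label\n"
    "1,Headline A,Some article body,REAL\n"
    "2,Headline B,Another article body,FAKE\n"
)


def summarize_dataset(source):
    """Read only the needed columns and count the articles per label."""
    data = pd.read_csv(source, usecols=["text", "label"])
    return data["label"].value_counts().to_dict()


label_counts = summarize_dataset(sample_csv)
print(label_counts)
```

With the real file, passing its path to summarize_dataset() instead of the sample should show roughly how many REAL and FAKE articles are available for training.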

Imports

We intend to create a graphical user interface, so we’ll start by importing the tkinter library. From tkinter we import the necessary components:

  • filedialog: Allows users to select files.
  • Text: A widget used for multi-line text input.
  • Button: A widget for creating clickable buttons.
  • Label: A widget for displaying text.
  • DISABLED and NORMAL: Constants that represent the state of GUI elements.

To ensure that the main window remains responsive when handling multiple tasks, we also import:

  • threading: Helps manage operations in separate threads, preventing the GUI from freezing.

For data handling and machine learning:

  • pandas: A library used for manipulating and analyzing data.
  • scikit-learn (sklearn), from which we import:
    • train_test_split: Splits the data into training and testing sets.
    • TfidfVectorizer: Converts text data into numerical feature vectors, essential for machine learning.
    • SVC (Support Vector Classifier): A support vector machine implementation for classification tasks.

import tkinter as tk
from tkinter import filedialog, Text, Button, Label, DISABLED, NORMAL
import threading
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

Building Functions for Fake News Detection

Now, let’s define our functions:

load_data_thread Function

The first function starts a new thread to handle the data-loading process using the load_and_preprocess_data() function.

def load_data_thread():
   thread = threading.Thread(target=load_and_preprocess_data)
   thread.start()
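
Tkinter widgets are generally expected to be updated only from the thread that runs the main loop, so a common alternative is to let the worker thread hand its result back through a queue. The sketch below illustrates that pattern without any GUI code; run_in_background and drain are hypothetical helper names, and the summing task is only a stand-in for loading and training.

```python
import queue
import threading


def run_in_background(task, results):
    """Run task() in a worker thread and put its outcome on the results queue."""
    def worker():
        try:
            results.put(("ok", task()))
        except Exception as exc:  # report failures instead of losing them
            results.put(("error", str(exc)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(results):
    """Collect every message currently waiting on the queue without blocking."""
    messages = []
    while True:
        try:
            messages.append(results.get_nowait())
        except queue.Empty:
            return messages


# Stand-in for a slow job such as reading a CSV and training a model.
results = queue.Queue()
worker_thread = run_in_background(lambda: sum(range(1000)), results)
worker_thread.join()  # a GUI would poll instead, e.g. with root.after(100, ...)
messages = drain(results)
print(messages)
```

In a Tkinter program, drain() would typically be called from a function scheduled with root.after(), so that labels and buttons are only touched from the main thread.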

load_and_preprocess_data Function

This one begins by opening a file dialog, allowing the user to select a dataset file. The path of the selected file is then stored in the file_path variable. If a file is selected, the status_label is updated to inform the user that the data loading process has begun. The function attempts to read the selected CSV file using pandas.read_csv(), specifically loading only the “text” and “label” columns while skipping any bad lines. This selective loading can significantly reduce memory usage.

If the “label” column exists in the dataset, the function converts the labels to numeric values in a new “fake” column, assigning 0 for ‘REAL’ and 1 for any other value (‘FAKE’ in this dataset), and then removes the original “label” column. The modified data is then passed to the train_model() function to train the model. Upon successful training, the status_label is updated again to indicate that the data is loaded and the model is trained, and the previously disabled predict button is enabled so that predictions can begin. If the “label” column is not found or if an error occurs during file loading, the status_label displays the corresponding message.

def load_and_preprocess_data():
   file_path = filedialog.askopenfilename()
   if file_path:
       status_label.config(text="Loading data, please wait...")
       try:
           # Loading only the necessary columns can significantly reduce memory usage
           data = pd.read_csv(file_path, usecols=['text', 'label'], on_bad_lines='skip')
           if 'label' in data.columns:
               data['fake'] = data['label'].apply(lambda x: 0 if x == 'REAL' else 1)
               data.drop('label', axis=1, inplace=True)
               train_model(data)
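               # Keep the accuracy reported by train_model() visible in the status text
               status_label.config(text=f"{status_label.cget('text')}. Data loaded and model trained. You can now predict.")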
               predict_button.config(state=NORMAL)
           else:
               status_label.config(text="Necessary column 'label' not found in the dataset.")
       except Exception as e:
           status_label.config(text=f"Failed to load data: {str(e)}")
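
The label conversion above treats every value other than 'REAL' as fake, which also covers typos and empty cells. If that matters for your data, a stricter variant could look like the sketch below. It is an assumption-laden illustration rather than the tutorial's method: encode_labels is a hypothetical helper, and the sample rows are made up.

```python
import pandas as pd


def encode_labels(data):
    """Return a frame with 'text' and a numeric 'fake' column, dropping unusable rows."""
    mapping = {"REAL": 0, "FAKE": 1}
    # Rows without text or label cannot be used for training.
    cleaned = data.dropna(subset=["text", "label"])
    # Normalise case and whitespace before mapping; unknown labels become NaN.
    fake = cleaned["label"].str.strip().str.upper().map(mapping)
    keep = fake.notna()
    return cleaned.loc[keep, ["text"]].assign(fake=fake[keep].astype(int))


# Made-up rows: one clean, one with odd spacing and case, one without text,
# and one with a label that is neither REAL nor FAKE.
sample = pd.DataFrame({
    "text": ["a real story", "a made-up story", None, "unlabelled story"],
    "label": ["REAL", " fake ", "FAKE", "UNKNOWN"],
})
encoded = encode_labels(sample)
print(encoded)
```

Dropping rows with missing text also avoids passing empty values to the vectorizer later on.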

train_model Function

The function train_model(data) starts by using train_test_split() from sklearn to divide the dataset of news articles into two parts: a training set (80% of the articles) and a testing set (the remaining 20%). The training set is used to teach the computer to recognize patterns, while the testing set evaluates the effectiveness of this training. Next, the function declares two global variables:

  • vectorizer: a TfidfVectorizer() that converts text into numerical TF-IDF features, ignoring English stop words and terms that appear in more than 70% of the articles.
  • model: a Support Vector Machine (SVM) classifier with a linear kernel, which learns to separate real articles from fake ones.

Lastly, the function assesses the performance of the trained model using model.score(), which calculates the accuracy of the model’s predictions on the testing set.

def train_model(data):
   x_train, x_test, y_train, y_test = train_test_split(data['text'], data['fake'], test_size=0.2)
   global vectorizer, model
   vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
   x_train_vectorized = vectorizer.fit_transform(x_train)
   model = SVC(kernel="linear")
   model.fit(x_train_vectorized, y_train)
   accuracy = model.score(vectorizer.transform(x_test), y_test)
   status_label.config(text=f"Model Accuracy: {accuracy:.2f}")
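
To see the same split, vectorize, and fit sequence in isolation, here is a small sketch on a made-up corpus. The sentences and labels are invented for illustration, and the random_state and stratify arguments are additions that the tutorial's code does not use; they make the split repeatable and keep both classes present in each part. max_df is left out because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Tiny made-up corpus, for illustration only (0 = REAL, 1 = FAKE).
texts = [
    "council approves budget for road repairs",
    "central bank keeps interest rates unchanged",
    "university publishes annual enrollment figures",
    "city opens new public library branch",
    "aliens secretly replace the entire parliament",
    "miracle pill cures every disease overnight",
    "celebrity reveals moon is made of cheese",
    "secret potion makes people live forever",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# random_state makes the split repeatable; stratify keeps both classes in each part.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer(stop_words="english")
x_train_vec = vectorizer.fit_transform(x_train)  # learn the vocabulary on training text only
x_test_vec = vectorizer.transform(x_test)        # reuse that vocabulary for the test text

model = SVC(kernel="linear")
model.fit(x_train_vec, y_train)
accuracy = model.score(x_test_vec, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

On a corpus this small the accuracy figure is not meaningful; the point is only that the vectorizer is fitted on the training text and then reused unchanged for the test text.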

predict_article Function

The last one starts by checking whether both the vectorizer and the model have been initialized, that is, whether the model has been trained. If either is still None, it updates the prediction_label to prompt the user to load data and train the model, and returns. Otherwise, it retrieves the input text using article_input.get(). This text is then transformed into a numerical format using vectorizer.transform(). Then model.predict(), which uses the trained SVM model, predicts whether the input article is REAL or FAKE.

Finally, the prediction, a numerical value, is converted to a human-readable format, where “1” indicates the article is FAKE, and otherwise, it is REAL.

def predict_article():
   if vectorizer is None or model is None:
       prediction_label.config(text="Model not ready. Please load data and train the model first.")
       return
   article_text = article_input.get("1.0", "end-1c")
   article_vectorized = vectorizer.transform([article_text])
   prediction = model.predict(article_vectorized)
   label = "FAKE" if prediction[0] == 1 else "REAL"
   prediction_label.config(text=f"Predicted Label: {label}")
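
A bare REAL or FAKE label hides how close the article was to the decision boundary. As a possible extension, the sketch below returns the label together with the signed score from SVC.decision_function() and skips empty input. The classify helper and the four training sentences are assumptions for illustration; the score is not a probability, only an indication of which side of the boundary the text falls on and how far.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training examples, for illustration only (0 = REAL, 1 = FAKE).
train_texts = [
    "council approves budget for road repairs",
    "central bank keeps interest rates unchanged",
    "miracle pill cures every disease overnight",
    "secret potion makes people live forever",
]
train_labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
model = SVC(kernel="linear")
model.fit(vectorizer.fit_transform(train_texts), train_labels)


def classify(article_text):
    """Return (label, score); label is None when the input is empty."""
    text = article_text.strip()
    if not text:
        return None, 0.0
    features = vectorizer.transform([text])
    # For a two-class SVC, a positive score points to the second class (1 = FAKE).
    score = float(model.decision_function(features)[0])
    label = "FAKE" if model.predict(features)[0] == 1 else "REAL"
    return label, score


label, score = classify("miracle pill cures every disease overnight")
print(label, round(score, 2))
```

In the GUI, the empty-input case could be used to show a hint in prediction_label instead of classifying a blank text box.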

Initializing the Main Window

This part creates the main window and sets its title.

root = tk.Tk()
root.title("News Article Classifier - The Pycodes")

Initializing the GUI Elements

For this step, we start by creating the status_label that prompts the user to load data. Next, we create the “Load Data” button that triggers the load_data_thread() function. We then set up the article_input widget, where the user can enter the article they want to classify as REAL or FAKE.

After that, we create the “Predict Article” button, which triggers the predict_article() function and starts out in a DISABLED state until the model has been trained. Finally, we create the prediction_label widget that displays the prediction result for the entered article.

# Status label
status_label = Label(root, text="Load data to start", fg="blue")
status_label.pack()


# Load data button
load_button = Button(root, text="Load Data", command=load_data_thread)
load_button.pack()


# User input for predictions
article_input = Text(root, height=5, width=50)
article_input.pack()


# Prediction button
predict_button = Button(root, text="Predict Article", command=predict_article, state=DISABLED)
predict_button.pack()


# Prediction result display
prediction_label = Label(root, text="")
prediction_label.pack()

Global Variables Initialization

Here we initialize the vectorizer and model variables to None so they can be updated later, once the model is trained.

# Global variables for model and vectorizer
model = None
vectorizer = None

Main Event Loop

Lastly, this part starts the main event loop and ensures that the main window is running and responsive to the user.

root.mainloop()

Full Code

import tkinter as tk
from tkinter import filedialog, Text, Button, Label, DISABLED, NORMAL
import threading
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def load_data_thread():
   thread = threading.Thread(target=load_and_preprocess_data)
   thread.start()


def load_and_preprocess_data():
   file_path = filedialog.askopenfilename()
   if file_path:
       status_label.config(text="Loading data, please wait...")
       try:
           # Loading only the necessary columns can significantly reduce memory usage
           data = pd.read_csv(file_path, usecols=['text', 'label'], on_bad_lines='skip')
           if 'label' in data.columns:
               data['fake'] = data['label'].apply(lambda x: 0 if x == 'REAL' else 1)
               data.drop('label', axis=1, inplace=True)
               train_model(data)
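               # Keep the accuracy reported by train_model() visible in the status text
               status_label.config(text=f"{status_label.cget('text')}. Data loaded and model trained. You can now predict.")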
               predict_button.config(state=NORMAL)
           else:
               status_label.config(text="Necessary column 'label' not found in the dataset.")
       except Exception as e:
           status_label.config(text=f"Failed to load data: {str(e)}")


def train_model(data):
   x_train, x_test, y_train, y_test = train_test_split(data['text'], data['fake'], test_size=0.2)
   global vectorizer, model
   vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
   x_train_vectorized = vectorizer.fit_transform(x_train)
   model = SVC(kernel="linear")
   model.fit(x_train_vectorized, y_train)
   accuracy = model.score(vectorizer.transform(x_test), y_test)
   status_label.config(text=f"Model Accuracy: {accuracy:.2f}")


def predict_article():
   if vectorizer is None or model is None:
       prediction_label.config(text="Model not ready. Please load data and train the model first.")
       return
   article_text = article_input.get("1.0", "end-1c")
   article_vectorized = vectorizer.transform([article_text])
   prediction = model.predict(article_vectorized)
   label = "FAKE" if prediction[0] == 1 else "REAL"
   prediction_label.config(text=f"Predicted Label: {label}")


root = tk.Tk()
root.title("News Article Classifier - The Pycodes")


# Status label
status_label = Label(root, text="Load data to start", fg="blue")
status_label.pack()


# Load data button
load_button = Button(root, text="Load Data", command=load_data_thread)
load_button.pack()


# User input for predictions
article_input = Text(root, height=5, width=50)
article_input.pack()


# Prediction button
predict_button = Button(root, text="Predict Article", command=predict_article, state=DISABLED)
predict_button.pack()


# Prediction result display
prediction_label = Label(root, text="")
prediction_label.pack()


# Global variables for model and vectorizer
model = None
vectorizer = None


root.mainloop()

Happy Coding!
