In our digital world, it feels like we’re swimming in a sea of information, and not all of it is true. The rise of fake news is troubling; it can shape public opinion and influence major events. That’s why it’s more important than ever to have the right tools to spot the real from the fake.
Today, you’ll learn how to use Python to fight misinformation by building a fake news detector: a machine learning-powered classifier that tells real news articles from fake ones. We’ll use the tkinter library to develop the graphical user interface (GUI), Pandas to process the dataset, and Scikit-learn to train the classifier.
Let’s get started!
Table of Contents
- Necessary Libraries
- Imports
- Building Functions for Fake News Detection
- Initializing the Main Window
- Initializing the GUI Elements
- Global Variables Initialization
- Main Event Loop
- Example
- Full Code
Necessary Libraries
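Make sure to install pandas and scikit-learn via the terminal or your command prompt for the code to function properly. tkinter ships with the standard Python installers for Windows and macOS, so it does not need a pip install; on Debian or Ubuntu it is provided by the python3-tk system package.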
$ pip install pandas
$ pip install scikit-learn
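If you want to confirm the setup before going further, a small check like the one below can help. It is only a sketch: the REQUIRED mapping and the check_environment() helper are names made up for this example, and find_spec() only tells you that a module can be located, not that it works.

```python
from importlib import metadata, util

# Import name -> PyPI distribution name (None marks a standard-library module).
REQUIRED = {"tkinter": None, "pandas": "pandas", "sklearn": "scikit-learn"}


def check_environment():
    """Return {import_name: status} without importing the packages themselves."""
    report = {}
    for module_name, dist_name in REQUIRED.items():
        if util.find_spec(module_name) is None:
            report[module_name] = "missing"
        elif dist_name is None:
            report[module_name] = "available (standard library)"
        else:
            try:
                report[module_name] = f"available ({metadata.version(dist_name)})"
            except metadata.PackageNotFoundError:
                report[module_name] = "available (version unknown)"
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```

Note that the on_bad_lines argument used later in this tutorial needs pandas 1.3 or newer, so the printed pandas version is worth a glance.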
Before we start, you need a dataset with examples of fake and real news in CSV format. You can download it from this link:
https://github.com/lutzhamel/fake-news/blob/master/data/fake_or_real_news.csv
Imports
We intend to create a graphical user interface, so we’ll start by importing the tkinter library. From tkinter we import the necessary components:
- filedialog: Allows users to select files.
- Text: A widget used for multi-line text input.
- Button: A widget for creating clickable buttons.
- Label: A widget for displaying text.
- DISABLED and NORMAL: Constants that represent the state of GUI elements.
To ensure that the main window remains responsive when handling multiple tasks, we also import:
- threading: Helps manage operations in separate threads, preventing the GUI from freezing.
For data handling and machine learning:
- pandas: A library used for manipulating and analyzing data.
Lastly, from sklearn, we import:
- train_test_split: Splits the data into training and testing sets.
- TfidfVectorizer: Converts text data into numerical feature vectors, essential for machine learning.
- SVC (Support Vector Classifier): A powerful tool for classification tasks.
import tkinter as tk
from tkinter import filedialog, Text, Button, Label, DISABLED, NORMAL
import threading
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
Building Functions for Fake News Detection
Now, let’s define our functions:
load_data_thread Function
The first function starts a new thread to handle the data-loading process using the load_and_preprocess_data() function.
def load_data_thread():
    thread = threading.Thread(target=load_and_preprocess_data)
    thread.start()
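One thing to keep in mind: Tkinter widgets are generally meant to be touched only from the thread that runs the main loop, so a common pattern is to let the worker thread post its result to a queue and have the GUI pick it up. The sketch below shows just that hand-off without any GUI code. slow_load() and start_worker() are hypothetical stand-ins invented for this example, not part of the tutorial’s program.

```python
import queue
import threading


def slow_load(path):
    """Hypothetical stand-in for the slow CSV read and model training."""
    return f"loaded {path}"


def start_worker(path, results):
    """Run slow_load() off the main thread and post its outcome to a queue."""
    def work():
        try:
            results.put(("ok", slow_load(path)))
        except Exception as exc:  # report failures instead of dying silently
            results.put(("error", str(exc)))

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread


results = queue.Queue()
worker = start_worker("news.csv", results)
worker.join()  # only for this demo; a GUI would poll the queue instead of blocking
status, payload = results.get_nowait()
print(status, payload)
```

In a Tkinter program, the main thread could check the queue periodically with root.after() and update the labels there, which keeps all widget calls on one thread.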
load_and_preprocess_data Function
This one begins by opening a file dialog, allowing the user to select a dataset file. The path of the selected file is stored in the file_path variable. If a file is selected, the status_label is updated to inform the user that the data loading process has begun. The function then reads the selected CSV file using pandas.read_csv(), loading only the “text” and “label” columns and skipping any malformed lines. This selective loading can significantly reduce memory usage.
If the “label” column exists in the dataset, the function adds a numeric “fake” column, assigning 0 for ‘REAL’ and 1 for any other label (that is, ‘FAKE’), and then removes the original “label” column. The modified data is then used to train the model through the train_model() function. Upon successful training, the status_label is updated again to indicate that the data is loaded and the model is trained, and the previously disabled predict button is enabled so predictions can begin. If the “label” column is not found or if an error occurs during file loading, the status_label provides appropriate feedback.
def load_and_preprocess_data():
    file_path = filedialog.askopenfilename()
    if file_path:
        status_label.config(text="Loading data, please wait...")
        try:
            # Loading only the necessary columns can significantly reduce memory usage
            data = pd.read_csv(file_path, usecols=['text', 'label'], on_bad_lines='skip')
            if 'label' in data.columns:
                # Encode the labels: 0 for 'REAL', 1 for anything else ('FAKE')
                data['fake'] = data['label'].apply(lambda x: 0 if x == 'REAL' else 1)
                data.drop('label', axis=1, inplace=True)
                train_model(data)
                # Keep the accuracy that train_model() wrote to status_label instead of overwriting it
                accuracy_text = status_label.cget("text")
                status_label.config(text=f"{accuracy_text}. Data loaded and model trained. You can now predict.")
                predict_button.config(state=NORMAL)
            else:
                status_label.config(text="Necessary column 'label' not found in the dataset.")
        except Exception as e:
            status_label.config(text=f"Failed to load data: {str(e)}")
train_model Function
The function train_model(data) starts by using train_test_split() from sklearn to divide the dataset of news articles into two parts: a training set and a testing set, with test_size=0.2 reserving 20% of the articles for testing. The training set is used to teach the computer to recognize patterns, while the testing set evaluates the effectiveness of this training. Next, the function declares two global variables:
- vectorizer: uses TfidfVectorizer() to convert text into a numerical format that the computer can process. It ignores common English stop words and, with max_df=0.7, any term that appears in more than 70% of the training articles.
- model: employs a Support Vector Machine (SVM) algorithm with a linear kernel, which learns to differentiate between real and fake articles.
Lastly, the function assesses the performance of the trained model using model.score(), which calculates the accuracy of the model’s predictions on the testing set, and shows the result in the status_label.
def train_model(data):
    x_train, x_test, y_train, y_test = train_test_split(data['text'], data['fake'], test_size=0.2)
    global vectorizer, model
    vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
    x_train_vectorized = vectorizer.fit_transform(x_train)
    model = SVC(kernel="linear")
    model.fit(x_train_vectorized, y_train)
    accuracy = model.score(vectorizer.transform(x_test), y_test)
    status_label.config(text=f"Model Accuracy: {accuracy:.2f}")
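To get a feel for what the vectorizer settings do, the sketch below runs TfidfVectorizer with the same stop_words and max_df values on three made-up sentences. The sentences are invented for this example; the point is that “economy” appears in every document, so max_df=0.7 drops it, and “the” is removed as a stop word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents; "economy" occurs in all of them.
docs = [
    "the economy grows as markets rally",
    "the economy shrinks and markets tumble",
    "aliens secretly run the economy",
]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each kept term to its column index in the matrix.
vocabulary = sorted(vectorizer.vocabulary_)
print(matrix.shape)
print(vocabulary)
```

Each row of the matrix is the numerical representation of one document, and each column corresponds to one term in the printed vocabulary.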
predict_article Function
The last one starts by verifying whether both the vectorizer and model are initialized, essentially determining if the model has been trained. If either of them is still None, it updates the prediction_label to prompt the user to train the model and stops there. Once the model is ready, it retrieves the input text using article_input.get(). This text is then transformed into a numerical format using vectorizer.transform(). Subsequently, the model.predict() function, which uses the trained SVM model, predicts whether the input article is REAL or FAKE.
Finally, the prediction, a numerical value, is converted to a human-readable format: 1 means the article is FAKE, and any other value means it is REAL.
def predict_article():
    if vectorizer is None or model is None:
        prediction_label.config(text="Model not ready. Please load data and train the model first.")
        return
    article_text = article_input.get("1.0", "end-1c")
    article_vectorized = vectorizer.transform([article_text])
    prediction = model.predict(article_vectorized)
    label = "FAKE" if prediction[0] == 1 else "REAL"
    prediction_label.config(text=f"Predicted Label: {label}")
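The same transform-then-predict flow can be tried without the GUI. The sketch below trains on four made-up headlines, which is far too little data for a real classifier and is only meant to show the mechanics. The classify() helper and the training sentences are assumptions for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up toy training set: 0 = REAL, 1 = FAKE (same encoding as the tutorial).
train_texts = [
    "city council approves budget plan",
    "parliament debates budget reform",
    "aliens control hidden moon base",
    "miracle pill defeats aliens overnight",
]
train_labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
model = SVC(kernel="linear")
model.fit(vectorizer.fit_transform(train_texts), train_labels)


def classify(text):
    """Vectorize one article with the fitted vectorizer and map 0/1 to a label."""
    prediction = model.predict(vectorizer.transform([text]))
    return "FAKE" if prediction[0] == 1 else "REAL"


print(classify("aliens control the moon"))
print(classify("city council approves budget"))
```

Note that the input is wrapped in a list before transform(), because the vectorizer expects a collection of documents rather than a single string.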
Initializing the Main Window
This part creates the main window and sets its title.
root = tk.Tk()
root.title("News Article Classifier - The Pycodes")
Initializing the GUI Elements
For this step, we start by creating the status_label that prompts the user to load data. Next, we create the “Load Data” button that triggers the load_data_thread() function. We then set up the article_input widget, where the user can input the article they want to classify as REAL or FAKE.
After that, we create the “Predict Article” button, which calls the predict_article() function and starts in a DISABLED state until the model has been trained. Finally, we create the prediction_label widget that displays the prediction results for the inputted article.
# Status label
status_label = Label(root, text="Load data to start", fg="blue")
status_label.pack()
# Load data button
load_button = Button(root, text="Load Data", command=load_data_thread)
load_button.pack()
# User input for predictions
article_input = Text(root, height=5, width=50)
article_input.pack()
# Prediction button
predict_button = Button(root, text="Predict Article", command=predict_article, state=DISABLED)
predict_button.pack()
# Prediction result display
prediction_label = Label(root, text="")
prediction_label.pack()
Global Variables Initialization
Here we initialize the vectorizer and the model variables and set them to None so they can be updated later once the model is trained.
# Global variables for model and vectorizer
model = None
vectorizer = None
Main Event Loop
Lastly, this part starts the main event loop and ensures that the main window is running and responsive to the user.
root.mainloop()
Example
Full Code
import tkinter as tk
from tkinter import filedialog, Text, Button, Label, DISABLED, NORMAL
import threading
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def load_data_thread():
    thread = threading.Thread(target=load_and_preprocess_data)
    thread.start()


def load_and_preprocess_data():
    file_path = filedialog.askopenfilename()
    if file_path:
        status_label.config(text="Loading data, please wait...")
        try:
            # Loading only the necessary columns can significantly reduce memory usage
            data = pd.read_csv(file_path, usecols=['text', 'label'], on_bad_lines='skip')
            if 'label' in data.columns:
                # Encode the labels: 0 for 'REAL', 1 for anything else ('FAKE')
                data['fake'] = data['label'].apply(lambda x: 0 if x == 'REAL' else 1)
                data.drop('label', axis=1, inplace=True)
                train_model(data)
                # Keep the accuracy that train_model() wrote to status_label instead of overwriting it
                accuracy_text = status_label.cget("text")
                status_label.config(text=f"{accuracy_text}. Data loaded and model trained. You can now predict.")
                predict_button.config(state=NORMAL)
            else:
                status_label.config(text="Necessary column 'label' not found in the dataset.")
        except Exception as e:
            status_label.config(text=f"Failed to load data: {str(e)}")


def train_model(data):
    x_train, x_test, y_train, y_test = train_test_split(data['text'], data['fake'], test_size=0.2)
    global vectorizer, model
    vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
    x_train_vectorized = vectorizer.fit_transform(x_train)
    model = SVC(kernel="linear")
    model.fit(x_train_vectorized, y_train)
    accuracy = model.score(vectorizer.transform(x_test), y_test)
    status_label.config(text=f"Model Accuracy: {accuracy:.2f}")


def predict_article():
    if vectorizer is None or model is None:
        prediction_label.config(text="Model not ready. Please load data and train the model first.")
        return
    article_text = article_input.get("1.0", "end-1c")
    article_vectorized = vectorizer.transform([article_text])
    prediction = model.predict(article_vectorized)
    label = "FAKE" if prediction[0] == 1 else "REAL"
    prediction_label.config(text=f"Predicted Label: {label}")


root = tk.Tk()
root.title("News Article Classifier - The Pycodes")

# Status label
status_label = Label(root, text="Load data to start", fg="blue")
status_label.pack()

# Load data button
load_button = Button(root, text="Load Data", command=load_data_thread)
load_button.pack()

# User input for predictions
article_input = Text(root, height=5, width=50)
article_input.pack()

# Prediction button
predict_button = Button(root, text="Predict Article", command=predict_article, state=DISABLED)
predict_button.pack()

# Prediction result display
prediction_label = Label(root, text="")
prediction_label.pack()

# Global variables for model and vectorizer
model = None
vectorizer = None

root.mainloop()
Happy Coding!