Natural Language Processing (NLP) has transformed how we engage with technology by enabling machines to comprehend and process human language. This field of artificial intelligence opens up endless possibilities, from simple chatbots to complex sentiment analysis and language translation systems. As we continue to generate vast amounts of textual data every day, the need for efficient and accurate NLP tools becomes ever more critical. Libraries like NLTK and SpaCy have emerged as powerful tools in the Python ecosystem, making it easier for developers to implement NLP solutions in their applications.
Today, you’ll learn how to implement natural language processing using NLTK and SpaCy. We’ll guide you through building a comprehensive text analysis application that leverages the strengths of these two libraries. You’ll dive into text preprocessing, part-of-speech tagging, named entity recognition, and sentiment analysis. By the end of this tutorial, you’ll have a fully functional and interactive GUI application that showcases the impressive capabilities of NLP.
Let’s get started!
Table of Contents
- Necessary Libraries
- Imports
- Loading the SpaCy Model and Getting VADER Ready
- Text Processing with NLTK
- Text Processing with SpaCy
- Sentiment Analysis with VADER
- Displaying Results
- Main GUI Layout
- Example
- Full Code
Necessary Libraries
Make sure to install these libraries via the terminal or command prompt for the code to function properly:
$ pip install nltk
$ pip install spacy
$ python -m spacy download en_core_web_sm
Note that tkinter ships with most Python installations, so the PyPI tk package is usually unnecessary; on some Linux distributions, you may need to install it through the system package manager instead (e.g., $ sudo apt install python3-tk).
Imports
If we want to dive into the world of Natural Language Processing (NLP) and really get a handle on it, we need some powerful tools at our disposal. That’s exactly what we’re about to gather. So, without further ado, let’s start by importing the essential modules and libraries:
- nltk: Short for Natural Language Toolkit, this is our go-to collection of libraries for all things natural language processing.
- spacy: A powerhouse library that makes tackling NLP tasks a piece of cake!
- word_tokenize and sent_tokenize from nltk: These handy functions help us split text into words and sentences.
- stopwords from nltk: Provides a list of common words we usually want to filter out to focus on the important stuff.
- WordNetLemmatizer from nltk: This tool is our friend for reducing words to their base or root form, making analysis more accurate.
- SentimentIntensityAnalyzer from nltk: This brings in VADER, a powerful tool for sentiment analysis to understand the emotions behind the text.
- tkinter: Used to create a graphical user interface. We also import scrolledtext and ttk from it to create scrollable text widgets and themed widgets.
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import tkinter as tk
from tkinter import scrolledtext, ttk
Additionally, we’ll need to download some NLTK data files, since importing the library alone doesn’t fetch these resources.
# Download the necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
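Note: on newer NLTK releases (3.8.2 and later), the Punkt tokenizer data was split into a separate resource; if word_tokenize raises a LookupError mentioning punkt_tab, download it as well:
nltk.download('punkt_tab')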
Loading the SpaCy Model and Getting VADER Ready
Now it’s time to gather our tools so we can jump in and break down the details of any text. We start by loading the SpaCy model, which handles tokenization, part-of-speech tagging, and named entity recognition. Next, we initialize VADER, which performs sentiment analysis using a lexicon and rule-based approach.
# Load SpaCy model
nlp = spacy.load('en_core_web_sm')
# Initialize NLTK's VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()
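As a side note, spacy.load() raises an OSError if the model isn’t installed. Here’s a minimal defensive sketch (optional; the sys import and error message are additions of ours, not part of the main script):
import sys

try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # Model missing: point the user to the download command instead of crashing
    sys.exit("Model 'en_core_web_sm' not found. Run: python -m spacy download en_core_web_sm")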
Text Processing with NLTK
With all our tools in place, we can finally start analyzing the text inputs. This is where the nltk_preprocess() function comes in. It tokenizes the text, filters out stopwords, and reduces words to their base forms. But how does it do that? No worries, we’re about to find out:
- First off, it grabs the text string you provide. Then, it uses sent_tokenize() to split the text into sentences and word_tokenize() to split the text into words.
- Next, it creates a set of English stopwords and filters them out of the tokenized words. After that, it initializes the WordNetLemmatizer to reduce the filtered words to their base or root forms.
- Finally, it gathers all these results and returns them, ready to be displayed.
# Function for text preprocessing using NLTK
def nltk_preprocess(text):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # Filter out stopwords using filter() and a lambda function
    filtered_words = list(filter(lambda word: word.lower() not in stop_words, words))
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]
    return sentences, filtered_words, lemmatized
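As a quick sanity check, here’s what the function returns on a sample sentence (a hypothetical example; exact tokens depend on your NLTK data version):
sentences, filtered, lemmas = nltk_preprocess("The cats are sitting on the mats. They look happy.")
print(sentences)  # ['The cats are sitting on the mats.', 'They look happy.']
print(filtered)   # stopwords like 'The', 'are', 'on' are gone; punctuation is kept
print(lemmas)     # 'cats' -> 'cat', 'mats' -> 'mat' (default noun lemmatization)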
Text Processing with SpaCy
While the previous function breaks the text into words and sentences, the spacy_process() function reveals the grammatical role each word plays and identifies key entities. So without further ado, let’s see how it does that:
- First, it takes the input text and transforms it into a SpaCy document. Then, it builds token_pos_pairs, pairing each token with its part-of-speech tag, and extracts the heroes and key players of the story (the named entities) from doc.ents.
- Finally, it wraps up all these insights and returns them, ready to be displayed.
# Function for text processing using SpaCy
def spacy_process(text):
    doc = nlp(text)
    token_pos_pairs = [(token.text, token.pos_) for token in doc]
    named_entities = [(entity.text, entity.label_) for entity in doc.ents]
    return token_pos_pairs, named_entities
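For instance (a hypothetical snippet; exact tags and labels depend on the model version):
tokens, entities = spacy_process("Apple is looking at buying a U.K. startup.")
print(tokens)    # e.g. [('Apple', 'PROPN'), ('is', 'AUX'), ('looking', 'VERB'), ...]
print(entities)  # e.g. [('Apple', 'ORG'), ('U.K.', 'GPE')]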
Sentiment Analysis with VADER
You can think of the analyze_sentiment() function as the oracle of emotions, telling you whether the text is positive, negative, neutral, or somewhere in between. Thanks to sia.polarity_scores(), it returns a detailed emotional profile of the text: a dictionary of negative, neutral, and positive scores, plus a compound score summarizing the overall sentiment.
# Function for sentiment analysis using NLTK
def analyze_sentiment(text):
    return sia.polarity_scores(text)
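For example (the exact scores depend on the VADER lexicon, so treat these as illustrative):
print(analyze_sentiment("I absolutely love this tutorial!"))
# Returns a dict with 'neg', 'neu', 'pos', and 'compound' keys;
# 'compound' ranges from -1 (most negative) to +1 (most positive)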
Displaying Results
Now that we have dissected the text and gathered all the results, it’s time to share our findings in the GUI. This is where the show_results() function shines. It kicks off by retrieving the user’s input using text_input.get(). Next, it preprocesses the text with nltk_preprocess() and processes it further with spacy_process().
Then, analyze_sentiment() dives in to perform its sentiment analysis. After that, the function fetches the selected tag from the combobox via pos_combobox.get() and filters the tokens accordingly. It compiles all this information into a neat string of results. Finally, it clears any existing text from the text_output widget and displays our freshly compiled results.
# Function to display results in the GUI
def show_results():
    input_text_content = text_input.get("1.0", tk.END)
    sentences, filtered_words, lemmatized_words = nltk_preprocess(input_text_content)
    token_pos, entities = spacy_process(input_text_content)
    sentiment = analyze_sentiment(input_text_content)
    selected_pos = pos_combobox.get()
    if selected_pos != "All":
        token_pos = [pair for pair in token_pos if pair[1] == selected_pos]
    results = f"Sentences (NLTK):\n{sentences}\n\n"
    results += f"Filtered Words (NLTK):\n{filtered_words}\n\n"
    results += f"Lemmatized Words (NLTK):\n{lemmatized_words}\n\n"
    results += f"Tokens and POS Tags (SpaCy):\n{token_pos}\n\n"
    results += f"Named Entities (SpaCy):\n{entities}\n\n"
    results += f"Sentiment Analysis (NLTK VADER):\n{sentiment}\n"
    text_output.delete("1.0", tk.END)
    text_output.insert(tk.INSERT, results)
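To see the POS filter in isolation, here’s a tiny standalone example with made-up data:
token_pos = [('The', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')]
selected_pos = "VERB"
print([pair for pair in token_pos if pair[1] == selected_pos])  # [('runs', 'VERB')]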
Main GUI Layout
Welcome to where the magic happens! This is the command center where users can interact with the application. To make this possible, we need to create a graphical user interface (GUI). We start by creating a main window using tk, setting its title, and defining its geometry. Then, we create two frames:
- input_frame: As its name suggests, this frame contains the input widget where the user can enter text.
- output_frame: Designed to contain the widget that displays the results.
Now, let’s fill these two frames:
Filling the input_frame:
- First, we add a label to indicate where the user should input the text.
- Next, we add a scrollable text_input widget where the user can enter the text.
- Then, we add a combobox containing tags to filter the text, such as “All”, “NOUN”, “VERB”, etc.
- Finally, we add the “Process Text” button that calls the show_results() function.
Filling the output_frame:
- We add a label to indicate where the results will be displayed.
- We add a scrollable text_output widget where the results will be shown.
We position all these elements using the grid layout. With our command center ready, the only thing left to do is ignite it with mainloop(), which keeps the main window running and responsive to the user.
# Create the main window
window = tk.Tk()
window.title("Advanced NLP with NLTK and SpaCy - The Pycodes")
# Configure window layout
window.geometry("800x700")
window.grid_columnconfigure(0, weight=1)
window.grid_rowconfigure(0, weight=1)
# Create frames for better layout management
input_frame = ttk.Frame(window, padding="10 10 10 10")
input_frame.grid(row=0, column=0, sticky=(tk.W, tk.E, tk.N, tk.S))
output_frame = ttk.Frame(window, padding="10 10 10 10")
output_frame.grid(row=1, column=0, sticky=(tk.W, tk.E, tk.N, tk.S))
# Create and place widgets in the input frame
label_input = ttk.Label(input_frame, text="Enter Text:")
label_input.grid(row=0, column=0, sticky=tk.W)
text_input = scrolledtext.ScrolledText(input_frame, wrap=tk.WORD, width=70, height=10)
text_input.grid(row=1, column=0, padx=10, pady=10)
pos_label = ttk.Label(input_frame, text="Filter by POS Tag:")
pos_label.grid(row=2, column=0, sticky=tk.W)
pos_combobox = ttk.Combobox(input_frame, values=["All", "NOUN", "VERB", "ADJ", "ADV"], state="readonly")
pos_combobox.set("All")
pos_combobox.grid(row=3, column=0, sticky=tk.W)
process_button = ttk.Button(input_frame, text="Process Text", command=show_results)
process_button.grid(row=4, column=0, pady=10)
# Create and place widgets in the output frame
label_output = ttk.Label(output_frame, text="Output:")
label_output.grid(row=0, column=0, sticky=tk.W)
text_output = scrolledtext.ScrolledText(output_frame, wrap=tk.WORD, width=70, height=20)
text_output.grid(row=1, column=0, padx=10, pady=10)
# Run the GUI event loop
window.mainloop()
Example
I ran this script on a Windows system, as shown in the images below. The first run has the filter set to the default “All”.
Now, this one has the filter set to “VERB”.
This code also works on Linux systems.
Full Code
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import tkinter as tk
from tkinter import scrolledtext, ttk
# Download the necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
# Load SpaCy model
nlp = spacy.load('en_core_web_sm')
# Initialize NLTK's VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()
# Function for text preprocessing using NLTK
def nltk_preprocess(text):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # Filter out stopwords using filter() and a lambda function
    filtered_words = list(filter(lambda word: word.lower() not in stop_words, words))
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]
    return sentences, filtered_words, lemmatized
# Function for text processing using SpaCy
def spacy_process(text):
    doc = nlp(text)
    token_pos_pairs = [(token.text, token.pos_) for token in doc]
    named_entities = [(entity.text, entity.label_) for entity in doc.ents]
    return token_pos_pairs, named_entities
# Function for sentiment analysis using NLTK
def analyze_sentiment(text):
    return sia.polarity_scores(text)
# Function to display results in the GUI
def show_results():
    input_text_content = text_input.get("1.0", tk.END)
    sentences, filtered_words, lemmatized_words = nltk_preprocess(input_text_content)
    token_pos, entities = spacy_process(input_text_content)
    sentiment = analyze_sentiment(input_text_content)
    selected_pos = pos_combobox.get()
    if selected_pos != "All":
        token_pos = [pair for pair in token_pos if pair[1] == selected_pos]
    results = f"Sentences (NLTK):\n{sentences}\n\n"
    results += f"Filtered Words (NLTK):\n{filtered_words}\n\n"
    results += f"Lemmatized Words (NLTK):\n{lemmatized_words}\n\n"
    results += f"Tokens and POS Tags (SpaCy):\n{token_pos}\n\n"
    results += f"Named Entities (SpaCy):\n{entities}\n\n"
    results += f"Sentiment Analysis (NLTK VADER):\n{sentiment}\n"
    text_output.delete("1.0", tk.END)
    text_output.insert(tk.INSERT, results)
# Create the main window
window = tk.Tk()
window.title("Advanced NLP with NLTK and SpaCy - The Pycodes")
# Configure window layout
window.geometry("800x700")
window.grid_columnconfigure(0, weight=1)
window.grid_rowconfigure(0, weight=1)
# Create frames for better layout management
input_frame = ttk.Frame(window, padding="10 10 10 10")
input_frame.grid(row=0, column=0, sticky=(tk.W, tk.E, tk.N, tk.S))
output_frame = ttk.Frame(window, padding="10 10 10 10")
output_frame.grid(row=1, column=0, sticky=(tk.W, tk.E, tk.N, tk.S))
# Create and place widgets in the input frame
label_input = ttk.Label(input_frame, text="Enter Text:")
label_input.grid(row=0, column=0, sticky=tk.W)
text_input = scrolledtext.ScrolledText(input_frame, wrap=tk.WORD, width=70, height=10)
text_input.grid(row=1, column=0, padx=10, pady=10)
pos_label = ttk.Label(input_frame, text="Filter by POS Tag:")
pos_label.grid(row=2, column=0, sticky=tk.W)
pos_combobox = ttk.Combobox(input_frame, values=["All", "NOUN", "VERB", "ADJ", "ADV"], state="readonly")
pos_combobox.set("All")
pos_combobox.grid(row=3, column=0, sticky=tk.W)
process_button = ttk.Button(input_frame, text="Process Text", command=show_results)
process_button.grid(row=4, column=0, pady=10)
# Create and place widgets in the output frame
label_output = ttk.Label(output_frame, text="Output:")
label_output.grid(row=0, column=0, sticky=tk.W)
text_output = scrolledtext.ScrolledText(output_frame, wrap=tk.WORD, width=70, height=20)
text_output.grid(row=1, column=0, padx=10, pady=10)
# Run the GUI event loop
window.mainloop()
Happy Coding!