Imagine you’re sailing through a huge ocean of websites, where each site is like an island full of hidden treasures. The question is, how do you find all these hidden gems?
In today’s tutorial, you’ll learn exactly how to extract links or URLs from any webpage using Python. We’ll guide you through creating a Python application that acts as your digital compass. With the addition of Tkinter, this compass becomes a user-friendly GUI that extracts and categorizes the links from any URL the user provides.
Let’s get started!
Table of Contents
- Necessary Libraries
- Imports
- Extract URLs Functions
- Creating the Main Window
- GUI Setup
- Creating the Output_text Widget
- Main Loop
- Example
- Full Code
Necessary Libraries
For the code to function properly, you’ll need the tkinter, requests, and beautifulsoup4 libraries. Note that tkinter ships with most Python installations, so you usually don’t need to install it separately; install the other two via the terminal or your command prompt by running these commands:
$ pip install requests
$ pip install beautifulsoup4
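If you’d like to confirm everything is in place before writing any code, this optional snippet (a quick sanity check, not part of the final script) simply imports the packages and prints their versions:
# Optional sanity check: confirm the libraries are importable
import tkinter
import requests
import bs4
print("Tkinter:", tkinter.TkVersion)
print("Requests:", requests.__version__)
print("BeautifulSoup:", bs4.__version__)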
Imports
We start by importing tkinter because we are going to use a graphical user interface (GUI). From tkinter, we import scrolledtext so that we have a scrollable widget to display the output. Then, we import requests to make HTTP requests, specifically the GET request. Next, we import BeautifulSoup from bs4 to parse HTML documents, and finally, we import urlparse from urllib.parse to parse URLs.
import tkinter as tk
from tkinter import scrolledtext
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
Extract URLs Functions
After importing the necessary libraries and modules, it’s time to define our functions:
classify_link Function
This function takes base_url and link as inputs and classifies the link as internal or external. The process is straightforward: it parses the link into its components and checks whether it has a scheme and a netloc (i.e., whether it is an absolute URL). If it does, it compares the link’s netloc to that of base_url: if they match, the link is internal; otherwise, it is external. A link without a scheme and netloc is relative, so it is treated as internal.
def classify_link(base_url, link):
    parsed_link = urlparse(link)
    # Absolute URL: compare its domain against the base URL's domain
    if parsed_link.scheme and parsed_link.netloc:
        if parsed_link.netloc == urlparse(base_url).netloc:
            return "Internal"
        else:
            return "External"
    # Relative URL (no scheme/netloc): it belongs to the same site
    else:
        return "Internal"
extract_links Function
The extract_links function retrieves the URL entered by the user and makes a GET request to obtain the HTML content. Using BeautifulSoup, it parses this HTML and collects the href attribute of every ‘a’ tag found with find_all('a'), skipping any tags that have no href. This extracts all potential links from the given webpage.
It then initializes two empty lists, external_links and internal_links, along with a variable total_links set to 0, so the count starts fresh each time the function is triggered. For each discovered link, total_links is incremented by 1, and the classify_link() function is called to classify the link as either internal or external.
These links are placed into their respective lists and counted, with the results displayed in the output_text widget for the user.
def extract_links():
    url = url_entry.get()
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        base_url = response.url
        # Collect href attributes, skipping <a> tags that have none
        links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
        external_links = []
        internal_links = []
        total_links = 0
        output_text.delete(1.0, tk.END)  # Clear previous output
        for link in links:
            total_links += 1
            classification = classify_link(base_url, link)
            if classification == "External":
                external_links.append(link)
            else:
                internal_links.append(link)
        output_text.insert(tk.END, "Internal Links:\n")
        for link in internal_links:
            output_text.insert(tk.END, f"{link} [Internal]\n")
        output_text.insert(tk.END, "\nExternal Links:\n")
        for link in external_links:
            output_text.insert(tk.END, f"{link} [External]\n")
        output_text.insert(tk.END, f"\nTotal URLs: {total_links}\n")
        output_text.insert(tk.END, f"External Links: {len(external_links)}\n")
        output_text.insert(tk.END, f"Internal Links: {len(internal_links)}\n")
    except Exception as e:
        output_text.delete(1.0, tk.END)  # Clear previous output
        output_text.insert(tk.END, f"Error: {e}")
Creating the Main Window
Following that, we create the graphical interface that interacts with the user (the main window) and set its title as well as its geometry.
# Create main window
window = tk.Tk()
window.title("Website Links Extractor - The Pycodes")
window.geometry("700x500")
GUI Setup
For this step, we begin by creating a label that prompts the user to “Enter URL:”. Directly below this label, we place an entry field where the URL can be typed. Following that, we add a button named “Extract Links”; clicking it triggers the extract_links() function.
# URL entry
url_label = tk.Label(window, text="Enter URL:")
url_label.pack()
url_entry = tk.Entry(window, width=80)
url_entry.pack()
# Button to extract links
extract_button = tk.Button(window, text="Extract Links", command=extract_links)
extract_button.pack()
Creating the Output_text Widget
In this part, we create a scrollable widget with a defined width and height to display the extracted links.
# Output area
output_label = tk.Label(window, text="Extracted Links:")
output_label.pack()
output_text = scrolledtext.ScrolledText(window, width=100, height=30)
output_text.pack(padx=10)
Main Loop
Lastly, this section ensures that the main window remains open and responsive to user interactions until the user chooses to exit, typically by closing the window.
# Run the main loop
window.mainloop()
Example
In our example, we will demonstrate how today’s code works by applying it to the following URL:
“https://thepycodes.com/how-to-build-a-weather-app-with-flask-in-python/”
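To reproduce this example quickly, you could pre-fill the entry field programmatically instead of typing the URL by hand; this optional line would go just before window.mainloop():
# Optional: pre-fill the URL field for testing
url_entry.insert(0, "https://thepycodes.com/how-to-build-a-weather-app-with-flask-in-python/")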
Full Code
import tkinter as tk
from tkinter import scrolledtext
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
def classify_link(base_url, link):
    parsed_link = urlparse(link)
    # Absolute URL: compare its domain against the base URL's domain
    if parsed_link.scheme and parsed_link.netloc:
        if parsed_link.netloc == urlparse(base_url).netloc:
            return "Internal"
        else:
            return "External"
    # Relative URL (no scheme/netloc): it belongs to the same site
    else:
        return "Internal"
def extract_links():
    url = url_entry.get()
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        base_url = response.url
        # Collect href attributes, skipping <a> tags that have none
        links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
        external_links = []
        internal_links = []
        total_links = 0
        output_text.delete(1.0, tk.END)  # Clear previous output
        for link in links:
            total_links += 1
            classification = classify_link(base_url, link)
            if classification == "External":
                external_links.append(link)
            else:
                internal_links.append(link)
        output_text.insert(tk.END, "Internal Links:\n")
        for link in internal_links:
            output_text.insert(tk.END, f"{link} [Internal]\n")
        output_text.insert(tk.END, "\nExternal Links:\n")
        for link in external_links:
            output_text.insert(tk.END, f"{link} [External]\n")
        output_text.insert(tk.END, f"\nTotal URLs: {total_links}\n")
        output_text.insert(tk.END, f"External Links: {len(external_links)}\n")
        output_text.insert(tk.END, f"Internal Links: {len(internal_links)}\n")
    except Exception as e:
        output_text.delete(1.0, tk.END)  # Clear previous output
        output_text.insert(tk.END, f"Error: {e}")
# Create main window
window = tk.Tk()
window.title("Website Links Extractor - The Pycodes")
window.geometry("700x500")
# URL entry
url_label = tk.Label(window, text="Enter URL:")
url_label.pack()
url_entry = tk.Entry(window, width=80)
url_entry.pack()
# Button to extract links
extract_button = tk.Button(window, text="Extract Links", command=extract_links)
extract_button.pack()
# Output area
output_label = tk.Label(window, text="Extracted Links:")
output_label.pack()
output_text = scrolledtext.ScrolledText(window, width=100, height=30)
output_text.pack(padx=10)
# Run the main loop
window.mainloop()
Happy Coding!