How to Extract all the URLs from the Webpage in Python

Imagine you’re sailing through a huge ocean of websites, where each site is like an island full of hidden treasures. The question is, how do you find all these hidden gems?

In today’s tutorial, you’ll learn exactly how to extract links or URLs from any webpage using Python. We’ll guide you through creating a Python application that acts as your digital compass. With the addition of Tkinter, this compass becomes a user-friendly GUI that extracts and categorizes the links from any website URL the user provides.

Let’s get started!

Necessary Libraries

For the code to function properly, make sure to install the requests and beautifulsoup4 libraries via the terminal or your command prompt by running these commands:

$ pip install requests
$ pip install beautifulsoup4

As for tkinter, it ships with most Python installers, so it usually needs no separate install (the tk package on PyPI is unrelated). On some Linux distributions you may need to add it through the system package manager, for example with sudo apt-get install python3-tk.
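
To quickly confirm that tkinter is available, you can run Python’s built-in test from the command line, which opens a small demo window:

$ python -m tkinter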

Imports

We start by importing tkinter because we are building a graphical user interface (GUI). From tkinter, we also import scrolledtext so that we have a scrollable widget to display the output. Then, we import requests to make HTTP requests, specifically GET requests.

Next, we import BeautifulSoup from bs4 to parse HTML documents, and finally, we import urlparse from urllib.parse to parse URLs.

import tkinter as tk
from tkinter import scrolledtext
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

Extract URLs Functions

After importing the necessary libraries and modules, it’s time to define our functions:

This function takes base_url and link as inputs and classifies the link as internal or external. If the link is absolute (it has both a scheme and a netloc), its domain (netloc) is compared to that of base_url: a match means the link is internal; otherwise, it is external. Relative links, which have neither a scheme nor a netloc, are treated as internal.

def classify_link(base_url, link):
    parsed_link = urlparse(link)
    if parsed_link.scheme and parsed_link.netloc:
        # Absolute link: compare its domain to the base URL's domain
        if parsed_link.netloc == urlparse(base_url).netloc:
            return "Internal"
        else:
            return "External"
    else:
        # Relative link (no scheme or netloc): treat as internal
        return "Internal"

The extract_links function retrieves the URL entered by the user and makes an HTTP GET request to obtain the HTML content. Using BeautifulSoup, it parses this HTML and collects the href attribute of every ‘a’ tag found with find_all('a'), skipping anchors that have no href. This extracts all potential links from the given webpage.

It then initializes two empty lists, external_links and internal_links, along with a counter total_links set to 0, so the count starts fresh each time the function runs. As the function walks through the links, total_links is incremented by 1 for each one, and classify_link() is called to label it as either internal or external.

These links are placed into their respective lists and counted, with the results displayed in the output_text widget for the user.

def extract_links():
    url = url_entry.get()
    try:
        # Fetch the page; time out rather than hang on a slow server
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Surface HTTP errors (404, 500, ...)
        soup = BeautifulSoup(response.text, 'html.parser')
        base_url = response.url
        # Collect href values, skipping <a> tags that have no href
        links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
        external_links = []
        internal_links = []
        total_links = 0
        output_text.delete(1.0, tk.END)  # Clear previous output
        for link in links:
            total_links += 1
            classification = classify_link(base_url, link)
            if classification == "External":
                external_links.append(link)
            else:
                internal_links.append(link)
        output_text.insert(tk.END, "Internal Links:\n")
        for link in internal_links:
            output_text.insert(tk.END, f"{link} [Internal]\n")
        output_text.insert(tk.END, "\nExternal Links:\n")
        for link in external_links:
            output_text.insert(tk.END, f"{link} [External]\n")
        output_text.insert(tk.END, f"\nTotal URLs: {total_links}\n")
        output_text.insert(tk.END, f"External Links: {len(external_links)}\n")
        output_text.insert(tk.END, f"Internal Links: {len(internal_links)}\n")
    except Exception as e:
        output_text.delete(1.0, tk.END)  # Clear previous output
        output_text.insert(tk.END, f"Error: {e}")

Creating the Main Window

Following that, we create the main window of the graphical interface that the user will interact with, and set its title and size (geometry).

# Create main window
window = tk.Tk()
window.title("Website Links Extractor - The Pycodes")
window.geometry("700x500")

GUI Setup

For this step, we begin by creating a label that prompts the user to “Enter URL:”. Directly below this label, we place an entry field where the URL can be typed. Finally, we add a button named “Extract Links”; clicking it triggers the extract_links() function.

# URL entry
url_label = tk.Label(window, text="Enter URL:")
url_label.pack()
url_entry = tk.Entry(window, width=80)
url_entry.pack()

# Button to extract links
extract_button = tk.Button(window, text="Extract Links", command=extract_links)
extract_button.pack()
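
As an optional convenience that isn’t part of the original layout, you could also let the user press Enter inside the entry field to run the extraction, using Tkinter’s event binding:

# Optional: pressing Enter in the URL field also runs the extraction
url_entry.bind("<Return>", lambda event: extract_links())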

Creating the output_text Widget

In this part, we create a ScrolledText widget with a defined width and height to display the extracted links.

# Output area
output_label = tk.Label(window, text="Extracted Links:")
output_label.pack()
output_text = scrolledtext.ScrolledText(window, width=100, height=30)
output_text.pack(padx=10)

Main Loop

Lastly, this section ensures that the main window remains open and responsive to user interactions until the user chooses to exit, typically by closing the window.

# Run the main loop
window.mainloop()

Example

In our example, we will demonstrate how today’s code works by applying it to the following URL:

“https://thepycodes.com/how-to-build-a-weather-app-with-flask-in-python/”
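
Once you click “Extract Links”, the output area fills in this general format (the actual links and counts depend on the live page, so treat this as a placeholder):

Internal Links:
https://thepycodes.com/... [Internal]
...

External Links:
https://... [External]
...

Total URLs: <count>
External Links: <count>
Internal Links: <count>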

Full Code

import tkinter as tk
from tkinter import scrolledtext
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse


def classify_link(base_url, link):
    parsed_link = urlparse(link)
    if parsed_link.scheme and parsed_link.netloc:
        # Absolute link: compare its domain to the base URL's domain
        if parsed_link.netloc == urlparse(base_url).netloc:
            return "Internal"
        else:
            return "External"
    else:
        # Relative link (no scheme or netloc): treat as internal
        return "Internal"


def extract_links():
    url = url_entry.get()
    try:
        # Fetch the page; time out rather than hang on a slow server
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Surface HTTP errors (404, 500, ...)
        soup = BeautifulSoup(response.text, 'html.parser')
        base_url = response.url
        # Collect href values, skipping <a> tags that have no href
        links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
        external_links = []
        internal_links = []
        total_links = 0
        output_text.delete(1.0, tk.END)  # Clear previous output
        for link in links:
            total_links += 1
            classification = classify_link(base_url, link)
            if classification == "External":
                external_links.append(link)
            else:
                internal_links.append(link)
        output_text.insert(tk.END, "Internal Links:\n")
        for link in internal_links:
            output_text.insert(tk.END, f"{link} [Internal]\n")
        output_text.insert(tk.END, "\nExternal Links:\n")
        for link in external_links:
            output_text.insert(tk.END, f"{link} [External]\n")
        output_text.insert(tk.END, f"\nTotal URLs: {total_links}\n")
        output_text.insert(tk.END, f"External Links: {len(external_links)}\n")
        output_text.insert(tk.END, f"Internal Links: {len(internal_links)}\n")
    except Exception as e:
        output_text.delete(1.0, tk.END)  # Clear previous output
        output_text.insert(tk.END, f"Error: {e}")


# Create main window
window = tk.Tk()
window.title("Website Links Extractor - The Pycodes")
window.geometry("700x500")


# URL entry
url_label = tk.Label(window, text="Enter URL:")
url_label.pack()
url_entry = tk.Entry(window, width=80)
url_entry.pack()


# Button to extract links
extract_button = tk.Button(window, text="Extract Links", command=extract_links)
extract_button.pack()


# Output area
output_label = tk.Label(window, text="Extracted Links:")
output_label.pack()
output_text = scrolledtext.ScrolledText(window, width=100, height=30)
output_text.pack(padx=10)


# Run the main loop
window.mainloop()

Happy Coding!
