How to Extract Script and CSS Files from Web Pages in Python

Ever wondered what makes your favorite websites so dynamic and visually appealing? Behind the scenes, JavaScript and CSS are hard at work. These scripts and stylesheets control everything from content updates to design. By learning to access and analyze these files, you can uncover the secrets of web development and gain valuable insights into creating engaging web experiences.

In today’s article, we’ll explore web scraping with Python. We’ll create an easy-to-use tkinter interface to extract JavaScript and CSS files from any website. By the end, you’ll know how to use libraries like requests and BeautifulSoup to fetch and parse web pages. This tutorial will help you develop practical web scraping skills and provide an effective method for extracting and understanding web content.

Let’s get started!

Necessary Libraries

Before we get into the code, let’s get everything set up. Install the following libraries from your terminal or command prompt so the script runs properly:

$ pip install requests
$ pip install beautifulsoup4
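Note that tkinter ships with Python’s standard library, so there is nothing to install with pip for it. You can quickly confirm it is available on your system with:

$ python -m tkinter

This opens a small test window if tkinter is working.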

Imports

When you start building something, the first step is to gather all the tools you need. Programming is similar; you need to import the necessary libraries and modules. Here’s what we import:

  • requests: Helps us fetch web pages.
  • BeautifulSoup from bs4: Parses HTML documents.
  • urljoin from urllib.parse: Constructs an absolute URL from base and relative URLs.
  • logging: Keeps track of our steps and any issues, like a diary for our building process.
  • tkinter as tk: Allows us to create a graphical user interface (GUI).
  • ttk and messagebox from tkinter: Provide themed widgets and message boxes for the GUI.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import logging
import tkinter as tk
from tkinter import ttk, messagebox

Setting Up Logging

You can think of this as setting up a diary or journal that will record every important event, the time it happened, and its degree of importance.

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
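With this format, a successful fetch shows up in the console along these lines (the timestamp and URL are illustrative):

2025-01-15 10:30:00,123 - INFO - Successfully fetched content from https://example.com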

Functions for Extracting Script and CSS Files

fetch_url_content Function

Exactly as its name suggests, the objective of this function is to fetch the content from the input URL. Here’s how it works:

  • Imagine you are at a shopping mall (the web page URL) and want to pick up some ingredients (the URL content). The first thing you do is grab a cart or shopping bag, which this function does via requests.Session(). Next, you greet the cashier (the web server) before asking for anything, which is what session.headers.update() does by adding a User-Agent header so the server knows who you are. Having introduced yourself, you ask for the ingredients, just as this function does through session.get(url), which sends a GET request to fetch the web page.
  • After that, it calls response.raise_for_status() to check whether the request succeeded. On success, the page’s HTML content has been fetched and the event is logged; on failure, the exception is caught, the error is logged, and the function returns None.
def fetch_url_content(url):
   """Fetches the content of the given URL."""
   try:
       session = requests.Session()
       session.headers.update({
           "User-Agent": "CustomUserAgent/1.0 (UniqueUser; Python Script)"
       })
       response = session.get(url, timeout=10)  # time out instead of hanging on a slow server
       response.raise_for_status()
       logging.info(f"Successfully fetched content from {url}")
       return response.content
   except requests.RequestException as e:
       logging.error(f"Error fetching content from {url}: {e}")
       return None
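If you want to try this function on its own before wiring up the GUI, a quick standalone check could look like this (https://example.com is just a placeholder URL):

html = fetch_url_content("https://example.com")
if html:
    print(f"Fetched {len(html)} bytes")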

extract_file_urls Function

This one finds the URLs of the JavaScript and CSS files by using soup.find_all() to search for all relevant tags in the fetched and parsed HTML. It then checks whether each tag has the attribute we are looking for (script tags with src for JavaScript, link tags with href for CSS). After that, it uses urljoin() with the globally stored target_url as the base to convert any relative URLs into absolute ones, so we get each file’s full address. Finally, it collects them in a list using file_urls.append().
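For instance, urljoin() resolves a relative path against the page’s URL (the values below are illustrative):

from urllib.parse import urljoin

print(urljoin("https://example.com/blog/post", "/static/site.css"))
# https://example.com/static/site.css
print(urljoin("https://example.com/blog/post", "app.js"))
# https://example.com/blog/app.js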

def extract_file_urls(soup, tag, attribute):
   """Extracts URLs of files from the given BeautifulSoup object based on the specified tag and attribute."""
   file_urls = []
   for element in soup.find_all(tag):
       if element.attrs.get(attribute):
           file_url = urljoin(target_url, element.attrs[attribute])  # target_url is set globally in on_extract()
           file_urls.append(file_url)
   return file_urls
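One thing to keep in mind: link tags are used for more than stylesheets (favicons, preconnect hints, and canonical URLs all use href), so the CSS list may include a few non-CSS entries. If you want stylesheets only, a small variant that checks for rel="stylesheet" could look like this (a sketch; extract_stylesheet_urls is a hypothetical helper, not part of the script above):

def extract_stylesheet_urls(soup, base_url):
    """Collects only <link rel="stylesheet"> URLs."""
    urls = []
    for link in soup.find_all("link", href=True):
        # BeautifulSoup returns rel as a list of tokens, e.g. ["stylesheet"]
        rel = [token.lower() for token in link.get("rel") or []]
        if "stylesheet" in rel:
            urls.append(urljoin(base_url, link["href"]))
    return urls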

on_extract Function

This is the heart of the script, the function that manages the other functions. How, you may ask? Well, here’s how it works:

  • First, it gets the input URL and strips any whitespace using url_entry.get().strip(). If there’s no input URL, it prompts the user to enter one through a messagebox. Then, it calls the fetch_url_content() function to fetch the web page content (HTML). Next, it uses BeautifulSoup to parse the HTML content.
  • After parsing, it calls the extract_file_urls() function to extract JavaScript and CSS file URLs from the parsed HTML, so they can be displayed in the text widget.
def on_extract():
   """Handles the extract button click event."""
   global target_url
   target_url = url_entry.get().strip()


   if not target_url:
       messagebox.showerror("Input Error", "Please enter a URL.")
       return


   html_content = fetch_url_content(target_url)
   if html_content is None:
       messagebox.showerror("Fetch Error", "Failed to retrieve HTML content.")
       return


   soup = BeautifulSoup(html_content, "html.parser")
   js_files = extract_file_urls(soup, "script", "src")
   css_files = extract_file_urls(soup, "link", "href")


   # Display results in the text widget
   result_text.config(state=tk.NORMAL)
   result_text.delete("1.0", tk.END)  # Clear previous results
   result_text.insert(tk.END, f"JavaScript files ({len(js_files)}):\n")
   result_text.insert(tk.END, "\n".join(js_files) + "\n\n")
   result_text.insert(tk.END, f"CSS files ({len(css_files)}):\n")
   result_text.insert(tk.END, "\n".join(css_files))
   result_text.config(state=tk.DISABLED)

Creating the Main Window

This is the part where we create the interface, set its title, and define its geometry.

# Create the main Tkinter window
root = tk.Tk()
root.title("Web Resource Extractor - The Pycodes")
root.geometry("600x400")
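As a side note, geometry() takes a "widthxheight" string, and you can optionally append a screen position (x and y pixel offsets from the top-left corner):

root.geometry("600x400+200+100")  # 600x400 window, 200 px from the left, 100 px from the top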

Creating the GUI Elements

Now, let’s make our program user-friendly by designing an intuitive interface. We’ll start by adding a label that prompts the user to input the URL. Next, we’ll provide an entry widget where they can type in the URL. To kick off the extraction process, we’ll add an “Extract” button that triggers our on_extract() function.

To effectively display our results, we’ll create a frame that houses a text widget and a scrollbar. The text widget will showcase the extracted JavaScript and CSS file URLs, while the scrollbar ensures smooth navigation through the results. This setup will make the whole process seamless and easy to use!

# URL input
url_label = ttk.Label(root, text="Enter URL:")
url_label.pack(pady=10)


url_entry = ttk.Entry(root, width=50)
url_entry.pack(pady=5)


# Extract button
extract_button = ttk.Button(root, text="Extract", command=on_extract)
extract_button.pack(pady=20)


# Frame for Text widget and Scrollbar
frame = ttk.Frame(root)
frame.pack(pady=10, fill=tk.BOTH, expand=True)


# Text widget to display results
result_text = tk.Text(frame, wrap=tk.WORD, state=tk.DISABLED, width=70, height=15)
result_text.pack(side=tk.LEFT, fill=tk.BOTH, expand=True)


# Scrollbar
scrollbar = ttk.Scrollbar(frame, orient=tk.VERTICAL, command=result_text.yview)
scrollbar.pack(side=tk.RIGHT, fill=tk.Y)
result_text.config(yscrollcommand=scrollbar.set)

Running the Application

This section starts the main event loop, ensuring that the main window keeps running and remains responsive to the user until they choose to exit.

# Run the Tkinter event loop
root.mainloop()

Example

We ran this script on both Windows and Linux. In each case, entering a URL and clicking “Extract” lists the JavaScript and CSS file URLs found on the page.

Full Code

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import logging
import tkinter as tk
from tkinter import ttk, messagebox


# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')




def fetch_url_content(url):
   """Fetches the content of the given URL."""
   try:
       session = requests.Session()
       session.headers.update({
           "User-Agent": "CustomUserAgent/1.0 (UniqueUser; Python Script)"
       })
       response = session.get(url, timeout=10)  # time out instead of hanging on a slow server
       response.raise_for_status()
       logging.info(f"Successfully fetched content from {url}")
       return response.content
   except requests.RequestException as e:
       logging.error(f"Error fetching content from {url}: {e}")
       return None




def extract_file_urls(soup, tag, attribute):
   """Extracts URLs of files from the given BeautifulSoup object based on the specified tag and attribute."""
   file_urls = []
   for element in soup.find_all(tag):
       if element.attrs.get(attribute):
           file_url = urljoin(target_url, element.attrs[attribute])  # target_url is set globally in on_extract()
           file_urls.append(file_url)
   return file_urls




def on_extract():
   """Handles the extract button click event."""
   global target_url
   target_url = url_entry.get().strip()


   if not target_url:
       messagebox.showerror("Input Error", "Please enter a URL.")
       return


   html_content = fetch_url_content(target_url)
   if html_content is None:
       messagebox.showerror("Fetch Error", "Failed to retrieve HTML content.")
       return


   soup = BeautifulSoup(html_content, "html.parser")
   js_files = extract_file_urls(soup, "script", "src")
   css_files = extract_file_urls(soup, "link", "href")


   # Display results in the text widget
   result_text.config(state=tk.NORMAL)
   result_text.delete("1.0", tk.END)  # Clear previous results
   result_text.insert(tk.END, f"JavaScript files ({len(js_files)}):\n")
   result_text.insert(tk.END, "\n".join(js_files) + "\n\n")
   result_text.insert(tk.END, f"CSS files ({len(css_files)}):\n")
   result_text.insert(tk.END, "\n".join(css_files))
   result_text.config(state=tk.DISABLED)




# Create the main Tkinter window
root = tk.Tk()
root.title("Web Resource Extractor - The Pycodes")
root.geometry("600x400")


# URL input
url_label = ttk.Label(root, text="Enter URL:")
url_label.pack(pady=10)


url_entry = ttk.Entry(root, width=50)
url_entry.pack(pady=5)


# Extract button
extract_button = ttk.Button(root, text="Extract", command=on_extract)
extract_button.pack(pady=20)


# Frame for Text widget and Scrollbar
frame = ttk.Frame(root)
frame.pack(pady=10, fill=tk.BOTH, expand=True)


# Text widget to display results
result_text = tk.Text(frame, wrap=tk.WORD, state=tk.DISABLED, width=70, height=15)
result_text.pack(side=tk.LEFT, fill=tk.BOTH, expand=True)


# Scrollbar
scrollbar = ttk.Scrollbar(frame, orient=tk.VERTICAL, command=result_text.yview)
scrollbar.pack(side=tk.RIGHT, fill=tk.Y)
result_text.config(yscrollcommand=scrollbar.set)


# Run the Tkinter event loop
root.mainloop()

Happy Coding!
