YouTube is a treasure trove of videos, covering everything from step-by-step tutorials to the latest music hits and personal stories. For anyone interested in understanding what makes a video successful, diving into data like video titles, views, and likes can reveal a lot about viewer trends and preferences.
Today, you’ll learn how to extract YouTube data using Python. By the end of this tutorial, you’ll be equipped with the skills to create a graphical user interface (GUI) using tkinter, and leverage the power of Python’s requests and BeautifulSoup libraries to fetch and display information from YouTube videos.
Let’s get started!
Table of Contents
- Necessary Libraries
- Imports
- Extract YouTube Data Functions
- Class Application
- Initializing the Tkinter Application
- Example
- Full Code
Necessary Libraries
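tkinter ships with the standard Python installers, so it normally needs no separate installation (on some Linux distributions it is provided as a system package such as python3-tk). Install the other two libraries via the terminal or command prompt to ensure the code functions properly:
pip install requests
pip install beautifulsoup4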
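Before running the tool, it can help to confirm that the imports will resolve. The snippet below is a minimal sketch and not part of the original program: the helper name missing_modules is hypothetical, and it only checks that each module can be located, not that it works correctly.

```python
from importlib.util import find_spec

# Import names, not pip package names: beautifulsoup4 is imported as "bs4".
REQUIRED = ["tkinter", "requests", "bs4"]


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, install it before moving on to the next section.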
Imports
Since we want our program to be user-friendly, we start by creating a graphical user interface (GUI) using the tkinter library.
First, we import scrolledtext and messagebox from tkinter to set up a scrolled text widget and to display messages. Then, we bring in the requests library to send HTTP requests.
Next, to scrape information from YouTube video pages, we use the BeautifulSoup class. Additionally, we include the re module to search the page text with regular expressions. Finally, we add the json module to handle JSON data.
import tkinter as tk
from tkinter import scrolledtext, messagebox
import requests
from bs4 import BeautifulSoup
import re
import json
Extract YouTube Data Functions
After importing the necessary libraries and modules, it’s time to define our functions:
get_video_info Function
This function sends an HTTP GET request to the URL provided by the user using session.get() and stores the result in the response variable. If the request fails (i.e., response.status_code is not 200), it returns a dictionary containing an error message. If the request succeeds (i.e., response.status_code is 200), the function proceeds to the next steps.
First, it creates a soup object that uses BeautifulSoup to parse the HTML content of the response. The nested get_content() helper then reads the content attribute of meta tags, looked up by itemprop by default, to collect the video’s title, view count, description, and publication date. The tags come from the og:video:tag meta tags, and the thumbnail comes from the image_src link tag. If JSON-LD data is available, the function parses it to obtain the channel name. Finally, it uses re.search() to find the video duration in seconds, if present, and formats it with format_duration(). Once all these operations are complete, the function returns the data. If an exception is raised during any of these steps, the function returns its message under the "error" key, which the GUI later shows in an error dialog.
def get_video_info(session, url):
    try:
        # A timeout keeps the GUI from waiting forever on a stalled connection
        response = session.get(url, timeout=10)
        if response.status_code != 200:
            return {"error": "Failed to retrieve the video page. Please verify the URL and your network connection."}

        soup = BeautifulSoup(response.text, "html.parser")

        def get_content(meta_property, property_type='itemprop', default="Not found"):
            # Find a <meta> tag by the given attribute and return its content value
            content = soup.find("meta", **{property_type: meta_property})
            return content.get("content", default) if content else default

        data = {
            "title": get_content("name"),
            "views": get_content("interactionCount"),
            "description": get_content("description"),
            "date_published": get_content("datePublished"),
            "tags": ', '.join(tag['content'] for tag in soup.find_all("meta", property="og:video:tag")) or "No tags"
        }

        thumbnail = soup.find("link", rel="image_src")
        data["thumbnail"] = thumbnail["href"] if thumbnail else "No thumbnail"

        json_ld = soup.find("script", type="application/ld+json")
        if json_ld and json_ld.string:
            json_data = json.loads(json_ld.string)
            # JSON-LD may also be a list; only a dict is handled here
            if isinstance(json_data, dict):
                # Check for video author or publisher
                channel_info = json_data.get("author", {}) or json_data.get("publisher", {})
                data["channel_name"] = channel_info.get("name", "Channel name not found")
                # Additional check for breadcrumb list
                if json_data.get("@type") == "BreadcrumbList":
                    items = json_data.get("itemListElement", [])
                    if items and "item" in items[0]:
                        data["channel_name"] = items[0]["item"].get("name", "Channel name not found")

        # Extract duration from the player data embedded in the page source
        duration_match = re.search(r'"lengthSeconds":"(\d+)"', response.text)
        if duration_match:
            duration_seconds = int(duration_match.group(1))
            data["duration"] = format_duration(duration_seconds)
        else:
            data["duration"] = "Duration not found"

        return data
    except Exception as e:
        # Any network or parsing problem is reported back to the caller
        return {"error": str(e)}
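To see the duration and channel-name steps in isolation, here is a minimal sketch that runs on a made-up HTML string instead of a live page. The SAMPLE_HTML content and the two helper names are hypothetical, and a regular expression stands in for BeautifulSoup when locating the JSON-LD script so that the example needs only the standard library.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for a video page's HTML.
SAMPLE_HTML = (
    '<script type="application/ld+json">'
    '{"@type": "VideoObject", "author": {"name": "Example Channel"}}'
    '</script>'
    '<script>var player = {"videoDetails": {"lengthSeconds":"754"}};</script>'
)


def extract_length_seconds(html):
    """Return the first lengthSeconds value as an int, or None if absent."""
    match = re.search(r'"lengthSeconds":"(\d+)"', html)
    return int(match.group(1)) if match else None


def extract_channel_name(html):
    """Return the author/publisher name from the first JSON-LD block, or None."""
    match = re.search(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
    )
    if not match:
        return None
    payload = json.loads(match.group(1))
    if not isinstance(payload, dict):
        return None
    info = payload.get("author") or payload.get("publisher") or {}
    return info.get("name") if isinstance(info, dict) else None


print(extract_length_seconds(SAMPLE_HTML))  # 754
print(extract_channel_name(SAMPLE_HTML))    # Example Channel
```

Real pages are far larger and their markup can change without notice, so both helpers return None rather than raising when the expected data is missing.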
format_duration Function
This function converts the video duration from seconds into an HH:MM:SS string. It uses the divmod() function, first dividing the total duration by 3600 to obtain the hours and the leftover seconds. That remainder is then divided by 60 to separate the minutes from the seconds. Finally, the function returns the zero-padded result.
def format_duration(seconds):
    # Split the total into hours, then split the remainder into minutes and seconds
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
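A few sample calls make the arithmetic concrete. This is a small sketch that repeats the function so it runs on its own; the input values are arbitrary examples.

```python
def format_duration(seconds):
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# divmod returns (quotient, remainder) in one step
print(divmod(3725, 3600))       # (1, 125)
print(divmod(125, 60))          # (2, 5)

print(format_duration(3725))    # 01:02:05
print(format_duration(59))      # 00:00:59
print(format_duration(360000))  # 100:00:00
```

The :02d format pads to at least two digits, so durations of 100 hours or more simply get a wider hours field.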
Class Application
For this step, we create a class named Application that inherits from tk.Frame in the tkinter library. In the __init__ method, we initialize the frame widget within the main window using super().__init__(), set the master attribute, and create a requests.Session object to handle HTTP requests. We then call self.pack(fill=tk.BOTH, expand=True) so the frame fills the window both horizontally and vertically, and invoke self.create_widgets() to add the UI components.
In the create_widgets method, we set up a label asking the user to “Enter YouTube URL”, an entry widget where the URL is typed, a “Get Video Info” button that triggers the show_video_info() method, and a scrolledtext.ScrolledText widget for the results. The show_video_info() method checks whether the entered URL includes “youtube.com”, fetches the video data if it does, and shows an error dialog if the URL is invalid or the fetch fails. If the data retrieval is successful, it formats the video information using format_video_info() and displays it in the scrolled text widget for better readability.
class Application(tk.Frame):
    def __init__(self, master=None):
        super().__init__(master)
        self.master = master
        # One session is reused for every request made by the app
        self.session = requests.Session()
        self.pack(fill=tk.BOTH, expand=True)
        self.create_widgets()

    def create_widgets(self):
        self.label_url = tk.Label(self, text="Enter YouTube URL:")
        self.label_url.pack(padx=10, pady=5)

        self.url_entry = tk.Entry(self, width=50)
        self.url_entry.pack(padx=10, pady=5)

        self.get_info_button = tk.Button(self, text="Get Video Info", command=self.show_video_info)
        self.get_info_button.pack(pady=10)

        self.info_text = scrolledtext.ScrolledText(self, height=15)
        self.info_text.pack(padx=10, pady=10, fill=tk.BOTH, expand=True)

    def show_video_info(self):
        url = self.url_entry.get()
        if "youtube.com" in url:
            data = get_video_info(self.session, url)
            if "error" in data:
                messagebox.showerror("Error", data["error"])
            else:
                display_text = self.format_video_info(data)
                # Clear the previous result before showing the new one
                self.info_text.delete('1.0', tk.END)
                self.info_text.insert(tk.INSERT, display_text)
        else:
            # showerror takes a title first, then the message
            messagebox.showerror("Error", "The YouTube video URL you entered is invalid. Please enter a valid one.")

    def format_video_info(self, data):
        formatted_text = (
            f"Title: {data.get('title', 'N/A')}\n"
            f"Views: {data.get('views', 'N/A')}\n"
            f"Description: {data.get('description', 'N/A')}\n"
            f"Date Published: {data.get('date_published', 'N/A')}\n"
            f"Tags: {data.get('tags', 'N/A')}\n"
            f"Thumbnail: {data.get('thumbnail', 'N/A')}\n"
            f"Channel Name: {data.get('channel_name', 'N/A')}\n"
            f"Duration: {data.get('duration', 'N/A')}\n"
        )
        return formatted_text
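The substring check in show_video_info() accepts any text that happens to contain “youtube.com”. As one possible refinement, the sketch below validates the host with urllib.parse from the standard library. The function name is_youtube_url and the ALLOWED_HOSTS set are assumptions made for illustration, not part of the original program.

```python
from urllib.parse import urlparse

# Assumption: these are the hosts we want to accept; adjust as needed.
ALLOWED_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def is_youtube_url(url):
    """Return True if the URL is http(s) and its host is an allowed YouTube host."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and host in ALLOWED_HOSTS


print(is_youtube_url("https://www.youtube.com/watch?v=abc123"))  # True
print(is_youtube_url("https://example.com/?next=youtube.com"))   # False
print(is_youtube_url("youtube.com/watch?v=abc123"))              # False (no scheme)
```

With a helper like this, the condition in show_video_info() could call is_youtube_url(url) instead of testing for the substring, at the cost of rejecting URLs typed without a scheme.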
Initializing the Tkinter Application
Now, we initialize the main window, setting its title and geometry. Then, we create an instance of the Application class, linking it to the main window. This ensures that the GUI elements are controlled and updated according to the logic defined in the Application class.
Lastly, calling app.mainloop() starts Tkinter’s event loop, which keeps the main window running and responsive until the user closes it.
root = tk.Tk()
root.title("YouTube Video Data Extractor - The Pycodes")
root.geometry("600x400")
app = Application(master=root)
app.mainloop()
Example
Full Code
import tkinter as tk
from tkinter import scrolledtext, messagebox
import requests
from bs4 import BeautifulSoup
import re
import json
def get_video_info(session, url):
    try:
        # A timeout keeps the GUI from waiting forever on a stalled connection
        response = session.get(url, timeout=10)
        if response.status_code != 200:
            return {"error": "Failed to retrieve the video page. Please verify the URL and your network connection."}

        soup = BeautifulSoup(response.text, "html.parser")

        def get_content(meta_property, property_type='itemprop', default="Not found"):
            # Find a <meta> tag by the given attribute and return its content value
            content = soup.find("meta", **{property_type: meta_property})
            return content.get("content", default) if content else default

        data = {
            "title": get_content("name"),
            "views": get_content("interactionCount"),
            "description": get_content("description"),
            "date_published": get_content("datePublished"),
            "tags": ', '.join(tag['content'] for tag in soup.find_all("meta", property="og:video:tag")) or "No tags"
        }

        thumbnail = soup.find("link", rel="image_src")
        data["thumbnail"] = thumbnail["href"] if thumbnail else "No thumbnail"

        json_ld = soup.find("script", type="application/ld+json")
        if json_ld and json_ld.string:
            json_data = json.loads(json_ld.string)
            # JSON-LD may also be a list; only a dict is handled here
            if isinstance(json_data, dict):
                # Check for video author or publisher
                channel_info = json_data.get("author", {}) or json_data.get("publisher", {})
                data["channel_name"] = channel_info.get("name", "Channel name not found")
                # Additional check for breadcrumb list
                if json_data.get("@type") == "BreadcrumbList":
                    items = json_data.get("itemListElement", [])
                    if items and "item" in items[0]:
                        data["channel_name"] = items[0]["item"].get("name", "Channel name not found")

        # Extract duration from the player data embedded in the page source
        duration_match = re.search(r'"lengthSeconds":"(\d+)"', response.text)
        if duration_match:
            duration_seconds = int(duration_match.group(1))
            data["duration"] = format_duration(duration_seconds)
        else:
            data["duration"] = "Duration not found"

        return data
    except Exception as e:
        # Any network or parsing problem is reported back to the caller
        return {"error": str(e)}

def format_duration(seconds):
    # Split the total into hours, then split the remainder into minutes and seconds
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

class Application(tk.Frame):
    def __init__(self, master=None):
        super().__init__(master)
        self.master = master
        # One session is reused for every request made by the app
        self.session = requests.Session()
        self.pack(fill=tk.BOTH, expand=True)
        self.create_widgets()

    def create_widgets(self):
        self.label_url = tk.Label(self, text="Enter YouTube URL:")
        self.label_url.pack(padx=10, pady=5)

        self.url_entry = tk.Entry(self, width=50)
        self.url_entry.pack(padx=10, pady=5)

        self.get_info_button = tk.Button(self, text="Get Video Info", command=self.show_video_info)
        self.get_info_button.pack(pady=10)

        self.info_text = scrolledtext.ScrolledText(self, height=15)
        self.info_text.pack(padx=10, pady=10, fill=tk.BOTH, expand=True)

    def show_video_info(self):
        url = self.url_entry.get()
        if "youtube.com" in url:
            data = get_video_info(self.session, url)
            if "error" in data:
                messagebox.showerror("Error", data["error"])
            else:
                display_text = self.format_video_info(data)
                # Clear the previous result before showing the new one
                self.info_text.delete('1.0', tk.END)
                self.info_text.insert(tk.INSERT, display_text)
        else:
            # showerror takes a title first, then the message
            messagebox.showerror("Error", "The YouTube video URL you entered is invalid. Please enter a valid one.")

    def format_video_info(self, data):
        formatted_text = (
            f"Title: {data.get('title', 'N/A')}\n"
            f"Views: {data.get('views', 'N/A')}\n"
            f"Description: {data.get('description', 'N/A')}\n"
            f"Date Published: {data.get('date_published', 'N/A')}\n"
            f"Tags: {data.get('tags', 'N/A')}\n"
            f"Thumbnail: {data.get('thumbnail', 'N/A')}\n"
            f"Channel Name: {data.get('channel_name', 'N/A')}\n"
            f"Duration: {data.get('duration', 'N/A')}\n"
        )
        return formatted_text

root = tk.Tk()
root.title("YouTube Video Data Extractor - The Pycodes")
root.geometry("600x400")
app = Application(master=root)
app.mainloop()
Happy Coding!