How to Extract YouTube Data in Python

YouTube is a treasure trove of videos, covering everything from step-by-step tutorials to the latest music hits and personal stories. For anyone interested in understanding what makes a video successful, diving into data like video titles, views, and likes can reveal a lot about viewer trends and preferences.

Today, you’ll learn how to extract YouTube data using Python. By the end of this tutorial, you’ll be equipped with the skills to create a graphical user interface (GUI) using tkinter, and leverage the power of Python’s requests and BeautifulSoup libraries to fetch and display information from YouTube videos.

Let’s get started!

Necessary Libraries

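tkinter is part of Python’s standard library, so it is not installed with pip (on some Linux distributions it comes as a separate system package, such as python3-tk on Debian and Ubuntu). Install the other two libraries via the terminal or command prompt:

pip install requests
pip install beautifulsoup4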

Imports

Since we want our program to be user-friendly, we start by creating a graphical user interface (GUI) using the tkinter library.

First, we import scrolledtext and messagebox from tkinter to set up a scrolled text widget and to display messages. Then, we bring in the requests library to send HTTP requests.

Next, to scrape information from YouTube video pages, we use the BeautifulSoup class. We also include the re module to search the raw page text with regular expressions. Finally, we add the json module to parse the JSON-LD data embedded in the page.

import tkinter as tk
from tkinter import scrolledtext, messagebox
import requests
from bs4 import BeautifulSoup
import re
import json

Extract YouTube Data Functions

After importing the necessary libraries and modules, it’s time to define our functions:

get_video_info Function

This function sends an HTTP GET request to the URL provided by the user with session.get() and stores the result in the response variable. If the request fails (i.e., response.status_code is not 200), the function returns a dictionary with an “error” key describing the failure, which the GUI later shows in an error dialog. If the request succeeds (i.e., response.status_code is 200), the function proceeds to the next steps.

First, it creates a soup object that uses BeautifulSoup to parse the HTML content of the response. The nested get_content() helper then reads the content attribute of meta tags matched by attribute (itemprop by default), which yields the video’s title, view count, description, and publication date. Tags come from the og:video:tag meta tags, and the thumbnail comes from the link tag with rel="image_src". If JSON-LD data is available, the function parses it to obtain the channel name. It also uses re.search() to find the video duration in seconds in the raw page text, if present, and passes it to format_duration(). Once all these operations are complete, the function returns the data. If an exception is raised during any of these steps, the function returns its message under the “error” key instead.

def get_video_info(session, url):
   try:
       # A timeout keeps the GUI from hanging forever on a stalled connection
       response = session.get(url, timeout=10)
       if response.status_code != 200:
           return {"error": "Failed to retrieve the video page. Please verify the URL and your network connection."}


       soup = BeautifulSoup(response.text, "html.parser")


       def get_content(meta_property, property_type='itemprop', default="Not found"):
           content = soup.find("meta", **{property_type: meta_property})
           return content["content"] if content else default


       data = {
           "title": get_content("name"),
           "views": get_content("interactionCount"),
           "description": get_content("description"),
           "date_published": get_content("datePublished"),
           "tags": ', '.join(tag['content'] for tag in soup.find_all("meta", property="og:video:tag")) or "No tags"
       }


       thumbnail = soup.find("link", rel="image_src")
       data["thumbnail"] = thumbnail["href"] if thumbnail else "No thumbnail"


       json_ld = soup.find("script", type="application/ld+json")
       if json_ld:
           json_data = json.loads(json_ld.string)
           # Check for video author or publisher
           channel_info = json_data.get("author", {}) or json_data.get("publisher", {})
           data["channel_name"] = channel_info.get("name", "Channel name not found")


           # Additional check for breadcrumb list
           if "@type" in json_data and json_data["@type"] == "BreadcrumbList":
               items = json_data.get("itemListElement", [])
               if items and "item" in items[0]:
                   data["channel_name"] = items[0]["item"].get("name", "Channel name not found")


       # Extract duration
       duration_match = re.search(r'\"lengthSeconds\":\"(\d+)\"', response.text)
       if duration_match:
           duration_seconds = int(duration_match.group(1))
           data["duration"] = format_duration(duration_seconds)
       else:
           data["duration"] = "Duration not found"


       return data
   except Exception as e:
       return {"error": str(e)}
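
To see the two text-based lookups in isolation, here is a minimal sketch that runs against a hand-written HTML fragment instead of a live page. The SAMPLE_HTML string and the helper names extract_length_seconds and extract_channel_name are assumptions made for illustration; real YouTube markup may differ and can change without notice.

```python
import json
import re

# Hypothetical fragment standing in for response.text; not real YouTube markup.
SAMPLE_HTML = (
    '<script type="application/ld+json">'
    '{"@type": "VideoObject", "author": {"name": "Example Channel"}}'
    '</script>'
    '<script>var data = {"lengthSeconds":"3725"};</script>'
)


def extract_length_seconds(html):
    """Return the duration in seconds as an int, or None if it is absent."""
    match = re.search(r'"lengthSeconds":"(\d+)"', html)
    return int(match.group(1)) if match else None


def extract_channel_name(json_ld_text, default="Channel name not found"):
    """Read the author (or publisher) name from a JSON-LD string."""
    try:
        payload = json.loads(json_ld_text)
    except json.JSONDecodeError:
        return default
    if not isinstance(payload, dict):
        return default
    info = payload.get("author") or payload.get("publisher") or {}
    if isinstance(info, dict):
        return info.get("name", default)
    return default


# Pull the JSON-LD body out of the sample with a regex (good enough for this sketch only).
json_ld_match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', SAMPLE_HTML, re.DOTALL
)
json_ld_text = json_ld_match.group(1) if json_ld_match else ""

print(extract_length_seconds(SAMPLE_HTML))
print(extract_channel_name(json_ld_text))
```

Returning None or a default string, instead of raising, mirrors how get_video_info falls back to placeholder text when a field is missing.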

format_duration Function

This function converts the video duration from seconds into an HH:MM:SS string. It uses divmod() to divide the total duration by 3600, which gives the hours and the remaining seconds. A second divmod() call divides that remainder by 60 to separate the minutes from the seconds. Finally, the function returns the result with each part zero-padded to two digits.

def format_duration(seconds):
   hours, remainder = divmod(seconds, 3600)
   minutes, seconds = divmod(remainder, 60)
   return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
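
As a quick worked example of the arithmetic, the sketch below repeats the function so that it runs on its own and traces a duration of 3725 seconds (a value chosen purely for illustration) through both divmod() calls.

```python
def format_duration(seconds):
    # Same logic as the function above, repeated so this snippet is self-contained.
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# 3725 seconds = 1 hour (3600) + 2 minutes (120) + 5 seconds
print(divmod(3725, 3600))     # (1, 125)
print(divmod(125, 60))        # (2, 5)
print(format_duration(3725))  # 01:02:05
print(format_duration(59))    # 00:00:59
```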

Class Application

For this step, we create a class named Application that inherits from tk.Frame. In the __init__ method, we initialize the frame within the main window using super().__init__(master), store the master attribute, and create a requests.Session object so that HTTP requests share one session. We then call self.pack(fill=tk.BOTH, expand=True) so that the frame fills the window both horizontally and vertically and grows when the window is resized, and invoke self.create_widgets() to add the UI components.

In the create_widgets method, we set up a label reading “Enter YouTube URL:”, an entry widget where the user types the URL, a “Get Video Info” button that triggers the show_video_info() method, and a scrolledtext.ScrolledText widget for the output. The show_video_info() method checks whether the entered URL contains “youtube.com”. If it does not, an error dialog is shown; otherwise the method fetches the video data and shows an error dialog if the result contains an “error” key. If the retrieval succeeds, it formats the video information using format_video_info() and displays it in the ScrolledText widget, replacing any previous output.

class Application(tk.Frame):
   def __init__(self, master=None):
       super().__init__(master)
       self.master = master
       self.session = requests.Session()
       self.pack(fill=tk.BOTH, expand=True)
       self.create_widgets()


   def create_widgets(self):
       self.label_url = tk.Label(self, text="Enter YouTube URL:")
       self.label_url.pack(padx=10, pady=5)


       self.url_entry = tk.Entry(self, width=50)
       self.url_entry.pack(padx=10, pady=5)


       self.get_info_button = tk.Button(self, text="Get Video Info", command=self.show_video_info)
       self.get_info_button.pack(pady=10)


       self.info_text = scrolledtext.ScrolledText(self, height=15)
       self.info_text.pack(padx=10, pady=10, fill=tk.BOTH, expand=True)


   def show_video_info(self):
       url = self.url_entry.get()
       if "youtube.com" in url:
           data = get_video_info(self.session, url)
           if "error" in data:
               messagebox.showerror("Error", data["error"])
           else:
               display_text = self.format_video_info(data)
               self.info_text.delete('1.0', tk.END)
               self.info_text.insert(tk.INSERT, display_text)
       else:
           # showerror takes a window title first, then the message text
           messagebox.showerror("Error", "The YouTube video URL you entered is invalid. Please enter a valid one.")


   def format_video_info(self, data):
       formatted_text = (
           f"Title: {data.get('title', 'N/A')}\n"
           f"Views: {data.get('views', 'N/A')}\n"
           f"Description: {data.get('description', 'N/A')}\n"
           f"Date Published: {data.get('date_published', 'N/A')}\n"
           f"Tags: {data.get('tags', 'N/A')}\n"
           f"Thumbnail: {data.get('thumbnail', 'N/A')}\n"
           f"Channel Name: {data.get('channel_name', 'N/A')}\n"
           f"Duration: {data.get('duration', 'N/A')}\n"
       )
       return formatted_text
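
The "youtube.com" in url check accepts any string that merely contains that text and rejects youtu.be short links. If a stricter check is wanted, one possible approach is to parse the URL and compare its hostname against a list of known hosts. The sketch below is an optional alternative, not part of the original program; the is_youtube_url name and the YOUTUBE_HOSTS set are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube in this sketch; the set is an assumption, extend as needed.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(url):
    """Return True if the URL uses http(s) and its host is a known YouTube host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and parsed.hostname in YOUTUBE_HOSTS


print(is_youtube_url("https://www.youtube.com/watch?v=VIDEO_ID"))
print(is_youtube_url("https://example.com/?next=youtube.com"))
```

Note that this version requires a scheme, so an input such as youtube.com/watch?v=VIDEO_ID without https:// would be rejected; requests needs the scheme to fetch the page anyway.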

Initializing the Tkinter Application

Now, we initialize the main window, setting its title and geometry. Then, we create an instance of the Application class, linking it to the main window. This ensures that the GUI elements are controlled and updated according to the logic defined in the Application class.

Lastly, app.mainloop() starts the Tkinter event loop, which keeps the main window running and responsive to the user until they choose to exit.

root = tk.Tk()
root.title("YouTube Video Data Extractor - The Pycodes")
root.geometry("600x400")
app = Application(master=root)
app.mainloop()

Full Code

import tkinter as tk
from tkinter import scrolledtext, messagebox
import requests
from bs4 import BeautifulSoup
import re
import json




def get_video_info(session, url):
   try:
       # A timeout keeps the GUI from hanging forever on a stalled connection
       response = session.get(url, timeout=10)
       if response.status_code != 200:
           return {"error": "Failed to retrieve the video page. Please verify the URL and your network connection."}


       soup = BeautifulSoup(response.text, "html.parser")


       def get_content(meta_property, property_type='itemprop', default="Not found"):
           content = soup.find("meta", **{property_type: meta_property})
           return content["content"] if content else default


       data = {
           "title": get_content("name"),
           "views": get_content("interactionCount"),
           "description": get_content("description"),
           "date_published": get_content("datePublished"),
           "tags": ', '.join(tag['content'] for tag in soup.find_all("meta", property="og:video:tag")) or "No tags"
       }


       thumbnail = soup.find("link", rel="image_src")
       data["thumbnail"] = thumbnail["href"] if thumbnail else "No thumbnail"


       json_ld = soup.find("script", type="application/ld+json")
       if json_ld:
           json_data = json.loads(json_ld.string)
           # Check for video author or publisher
           channel_info = json_data.get("author", {}) or json_data.get("publisher", {})
           data["channel_name"] = channel_info.get("name", "Channel name not found")


           # Additional check for breadcrumb list
           if "@type" in json_data and json_data["@type"] == "BreadcrumbList":
               items = json_data.get("itemListElement", [])
               if items and "item" in items[0]:
                   data["channel_name"] = items[0]["item"].get("name", "Channel name not found")


       # Extract duration
       duration_match = re.search(r'\"lengthSeconds\":\"(\d+)\"', response.text)
       if duration_match:
           duration_seconds = int(duration_match.group(1))
           data["duration"] = format_duration(duration_seconds)
       else:
           data["duration"] = "Duration not found"


       return data
   except Exception as e:
       return {"error": str(e)}




def format_duration(seconds):
   hours, remainder = divmod(seconds, 3600)
   minutes, seconds = divmod(remainder, 60)
   return f"{hours:02d}:{minutes:02d}:{seconds:02d}"




class Application(tk.Frame):
   def __init__(self, master=None):
       super().__init__(master)
       self.master = master
       self.session = requests.Session()
       self.pack(fill=tk.BOTH, expand=True)
       self.create_widgets()


   def create_widgets(self):
       self.label_url = tk.Label(self, text="Enter YouTube URL:")
       self.label_url.pack(padx=10, pady=5)


       self.url_entry = tk.Entry(self, width=50)
       self.url_entry.pack(padx=10, pady=5)


       self.get_info_button = tk.Button(self, text="Get Video Info", command=self.show_video_info)
       self.get_info_button.pack(pady=10)


       self.info_text = scrolledtext.ScrolledText(self, height=15)
       self.info_text.pack(padx=10, pady=10, fill=tk.BOTH, expand=True)


   def show_video_info(self):
       url = self.url_entry.get()
       if "youtube.com" in url:
           data = get_video_info(self.session, url)
           if "error" in data:
               messagebox.showerror("Error", data["error"])
           else:
               display_text = self.format_video_info(data)
               self.info_text.delete('1.0', tk.END)
               self.info_text.insert(tk.INSERT, display_text)
       else:
           # showerror takes a window title first, then the message text
           messagebox.showerror("Error", "The YouTube video URL you entered is invalid. Please enter a valid one.")


   def format_video_info(self, data):
       formatted_text = (
           f"Title: {data.get('title', 'N/A')}\n"
           f"Views: {data.get('views', 'N/A')}\n"
           f"Description: {data.get('description', 'N/A')}\n"
           f"Date Published: {data.get('date_published', 'N/A')}\n"
           f"Tags: {data.get('tags', 'N/A')}\n"
           f"Thumbnail: {data.get('thumbnail', 'N/A')}\n"
           f"Channel Name: {data.get('channel_name', 'N/A')}\n"
           f"Duration: {data.get('duration', 'N/A')}\n"
       )
       return formatted_text




root = tk.Tk()
root.title("YouTube Video Data Extractor - The Pycodes")
root.geometry("600x400")
app = Application(master=root)
app.mainloop()

Happy Coding!
