Scraper Scripts

All scripts that are needed to obtain and manipulate the data. Located in dataset/scraper-scripts.

Note: The number before each script's name corresponds to the order in which the scripts are normally used. Some scripts share a number because they perform different tasks that occupy the same position in the data processing pipeline. For instance, one script works with slide presentations (PDFs) and another with videos, but both occupy the same position (2. Slides Downloader and 2. Video Downloader).

1. Website Scraper

Takes a video page link, video download link, and video published date and then adds that information to dataset/videos-dataset.csv.

  • Command:

    python <date> <page_link> <video_download_link> <description (optional)>
    • <date> is the date the lecture was published

    • <page_link> is the link to the webpage where the video can be found

    • <video_download_link> is the direct link to the video

    • <description (optional)> is an optional description that gets saved with the rest of the information (currently not used internally)

  • Example:

    python 1-1-2010 <page_link> <video_download_link>
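The CSV append this script performs can be sketched as below. The column names are assumptions, since the actual layout of videos-dataset.csv is not shown here:

```python
import csv
from pathlib import Path

# Hypothetical column layout; the real videos-dataset.csv may differ.
FIELDS = ["date", "page_link", "video_download_link", "description"]

def add_video(csv_path, date, page_link, download_link, description=""):
    """Append one video's metadata to the dataset CSV, writing a header
    row first if the file does not exist yet."""
    path = Path(csv_path)
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "date": date,
            "page_link": page_link,
            "video_download_link": download_link,
            "description": description,
        })
```

Appending (rather than overwriting) lets the script be run once per video without losing earlier entries.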

1. YouTube Scraper

Takes a video id or channel id from YouTube, extracts important information using the YouTube Data API, and then adds that information to dataset/videos-dataset.csv.

  • Output of python --help:

    usage: [-h] [-n N] [-t] [--transcript-use-yt-api] [-l N]
                        [-f PATH] [-o SEARCH_ORDER] [-p PARAMS]
                        {video,channel,transcript} STR
    YouTube Scraper
    positional arguments:
    {video,channel,transcript}
                            Get metadata for a video or a certain number of videos
                            from a channel. Transcript mode downloads the
                            transcript for a video_id.
    STR                   Channel or video id depending on mode
    optional arguments:
    -h, --help            show this help message and exit
    -n N, --num_pages N   Number of pages of videos to scrape if mode is
                            `channel`. 50 videos per page.
    -t, --transcript      Download transcript for each video scraped.
    --transcript-use-yt-api
                            Use the YouTube API instead of youtube-dl to download
                            transcripts. `--transcript` must be specified for this
                            option to take effect.
    -l N, --min_length_check N
                            Minimum video length in minutes to be scraped. Only
                            works when `mode` is "channel"
    -f PATH, --file PATH  File to add scraped results to.
    -o SEARCH_ORDER, --search_order SEARCH_ORDER
                            The order to list videos from a channel when `mode` is
                            'channel'. Acceptable values are in the YouTube API
    -p PARAMS, --params PARAMS
                            A string dictionary of parameters to pass to the call
                            to the YouTube API. If mode=video then the
                            `videos.list` api is used. If mode=channel then the
                            `search.list` api is used.
  • Examples
    • Add a single lecture video to the dataset:
      python video 63hAHbkzJG4
    • Get the transcript for a video file:
      python transcript 63hAHbkzJG4
    • Add a video to the dataset/videos-dataset.csv and get the transcript:
      python video 63hAHbkzJG4 --transcript
    • Scrape the 50 latest videos from a channel:
      python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 1
    • Scrape the 50 most viewed videos from a channel:
      python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 1 --search_order viewCount
    • Scrape the 50 latest videos from a channel that were published before 2020:
      python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 1 --params '{"publishedBefore": "2020-01-01T00:00:00Z"}'
    • Scrape the 100 latest videos from a channel longer than 20 minutes:
      python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 2 --min_length_check 20
    • Mass Download 1 (to be used with 2. Mass Data Collector):
      python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 2 --min_length_check 20 -f ../mass-download-list.csv
    • Mass Download 2 (specify certain dates and times):
      python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 2 --min_length_check 20 -f ../mass-download-list.csv --params '{"publishedBefore": "2015-01-01T00:00:00Z", "publishedAfter": "2014-01-01T00:00:00Z"}'
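For reference, assembling the `search.list` request that a scraper like this would issue can be sketched as below. The defaults and the merging of the `--params` JSON string are assumptions about this script's internals; with google-api-python-client the resulting dict would be passed along the lines of `youtube.search().list(**params).execute()`:

```python
import json

def build_search_params(channel_id, order="date", extra_params=None):
    """Assemble the parameter dict for a YouTube Data API `search.list`
    call (50 results per page, videos only). Defaults are assumed."""
    params = {
        "part": "id,snippet",
        "channelId": channel_id,
        "maxResults": 50,       # one "page" of videos
        "type": "video",
        "order": order,
    }
    # `--params` arrives as a JSON string, e.g. '{"publishedBefore": ...}'
    if extra_params:
        params.update(json.loads(extra_params))
    return params
```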

2. Mass Data Collector

This script provides a method to collect massive amounts of new data for the slide classifier. New lecture videos are selected based on what the model struggles with (where its certainty is lowest), so the collected videos train the model the fastest while exposing it to the most unique situations. However, this method will miss videos on which the model is very confident but actually incorrect. Those videos are the most beneficial, but they must be found manually.

The Mass Data Collector does the following for each video in dataset/mass-download-list.csv:
  1. Downloads the video to dataset/mass-download-temp/[video_id]

  2. Extracts frames

  3. Classifies the frames to obtain certainties and the percent incorrect (where certainty is below a threshold)

  4. Adds video_id, average_certainty, num_incorrect, percent_incorrect, and certainties to dataset/mass-download-results.csv

  5. Deletes video folder (dataset/mass-download-temp/[video_id])
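Steps 3 and 4 boil down to a small reduction over the per-frame certainties. A minimal sketch (the 0.70 cutoff is an assumed value; the threshold the real script uses may differ):

```python
def summarize_certainties(certainties, threshold=0.70):
    """Reduce a list of per-frame classifier certainties to the summary
    statistics recorded in mass-download-results.csv. A frame counts as
    "incorrect" when its certainty falls below the threshold."""
    num_incorrect = sum(1 for c in certainties if c < threshold)
    return {
        "average_certainty": sum(certainties) / len(certainties),
        "num_incorrect": num_incorrect,
        "percent_incorrect": 100 * num_incorrect / len(certainties),
    }
```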

The --top_k (or -k) argument can be specified to make the script add the top k most uncertain videos to dataset/videos-dataset.csv. This must be run after the dataset/mass-download-results.csv file has been populated.
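Selecting the top k most uncertain videos then amounts to a sort over mass-download-results.csv, roughly as follows (column names assumed):

```python
import csv

def top_k_uncertain(results_csv, k):
    """Return the ids of the `k` videos the model was least certain about,
    reading from a results file with `video_id` and `average_certainty`
    columns (column names are assumptions)."""
    with open(results_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    # Lowest average certainty first = most uncertain first.
    rows.sort(key=lambda r: float(r["average_certainty"]))
    return [r["video_id"] for r in rows[:k]]
```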


This script will use a lot of bandwidth/data. For instance, the below commands will download 100 videos from YouTube. If each video is 100MB (which is likely on the low end) then this will download at least 10GB of data.


  1. Low Disk Space Usage, High Bandwidth, Duplicate Calculations, Large Dataset Filesize

    Recommended if you want to build the dataset at full 1080p resolution so that it can be used with a plethora of model architectures. This was how the official dataset was compiled.

    The below commands do the following:

    1. Scrape the MIT OpenCourseWare YouTube channel for the latest 100 videos that are longer than 20 minutes and save the data to ../mass-download-list.csv

    2. Run the Mass Data Collector to download each video at 480p and determine how certain the model is with its predictions on that video.

    3. Take the top 20 most uncertain videos and add them to the dataset/videos-dataset.csv.

    4. Download the newly added 20 videos at 480p

    5. Extract frames from the new videos

    6. Sort the frames from top 20 most uncertain videos

    7. Now it is time for you to check the model’s predictions, fix them, and then train a better model on the new data.

    python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 2 --min_length_check 20 -f ../mass-download-list.csv
    python --resolution 480
    python -k 20
    python csv
    python auto
  2. High Disk Space Usage, Higher Bandwidth, No Duplicate Calculations, Large Dataset Filesize

    Recommended if you want to build the dataset at full 1080p resolution but do not want to “waste” compute resources on duplicate calculations.

    Specifying the --no_remove argument will make the script keep the processed videos instead of removing them. This means the videos can be copied to the dataset/videos folder, manually inspected and fixed, and then 5. Compile Data can be used to copy them to the dataset/classifier-data folder.

    It is recommended not to set --resolution when using this method because some of the downloaded videos will eventually be added to the dataset. The dataset is compiled at maximum resolution so that models accepting different input resolutions can all be trained from it.

  3. Lower Disk Space Usage, Low Bandwidth, Duplicate Calculations, Small Dataset Filesize

    Recommended if you want to build the dataset for a specific model architecture and if you want the dataset to take up a relatively small amount of disk space.

    If you want to train a resnet34, for example, which expects 224x224 input images, then you can set the resolution to 240p when downloading videos since the frames will be scaled before being used for training anyway. However, if you ever want to train a model that expects larger input images, you will have to download and reprocess the entire dataset.

    The modified commands look like this:

    python channel UCEBb1b_L6zDS3xTUrIALZOw --num_pages 2 --min_length_check 20 -f ../mass-download-list.csv
    python --resolution 240
    python -k 20
    python csv --resolution 240
    python auto

    Notice that the resolution was changed to 240 for the second command and the resolution option was added to the fourth command.

    This option can be modified as described in the second method by adding the --no_remove argument. This will increase disk usage but prevent duplicate calculations and decrease overall bandwidth, since videos will not have to be redownloaded.
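The resolution reasoning above can be checked with a quick calculation: scaling a 240p frame (426x240) so its shorter side matches the 224-pixel input a resnet34 expects only ever downscales, so nothing is lost relative to a higher-resolution download. A sketch:

```python
def scale_to_short_side(width, height, target=224):
    """Dimensions after scaling so the shorter side equals `target`
    (the input size a resnet34 expects), preserving aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)
```

Both a 426x240 frame and a 1920x1080 frame end up at the same 224-pixel short side, which is why 240p downloads suffice for this specific architecture.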

Mass Dataset Collector Script Help

Output of python --help:

usage: [-h] [-k K] [-nr] [-r RESOLUTION] [-p]

Mass Data Collector

optional arguments:
-h, --help            show this help message and exit
-k K, --top_k K       Add the top `k` most uncertain videos to the videos-
                        dataset.
-nr, --no_remove      Don't remove the videos after they have been
                        processed. This makes it faster to manually look
                        through the most uncertain videos since they don't
                        have to be redownloaded, but it will use more disk
                        space.
-r RESOLUTION, --resolution RESOLUTION
                        The resolution of the videos to download. Default is
                        maximum resolution.
-p, --pause           Pause after each video has been processed but before
                        it is deleted.

2. Slides Downloader

Takes a link to a pdf slideshow and downloads it to dataset/slides/pdfs or downloads every entry in dataset/slides-dataset.csv (csv option).

  • Command: python <csv/your_url>

  • Examples:
    • If csv: python csv

    • If your_url: python <your_url>

  • Required Software: wget
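Under the hood this amounts to one wget call per PDF. A hedged sketch of how such a command might be assembled (the output naming scheme, taken from the URL's last path component, is an assumption):

```python
from pathlib import Path
from urllib.parse import urlparse

def wget_command(url, dest_dir="dataset/slides/pdfs"):
    """Build the wget invocation used to fetch one slideshow PDF.
    The filename is assumed to come from the URL's last path segment."""
    filename = Path(urlparse(url).path).name or "slides.pdf"
    return ["wget", "-O", str(Path(dest_dir) / filename), url]
```

The resulting list can be run with `subprocess.run(...)` once wget is installed.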

2. Video Downloader

Uses youtube-dl (for YouTube videos) and wget (for website videos) to download either a YouTube video by id or every video in dataset/videos-dataset.csv that has not yet been downloaded.

This script can also download the transcript for each video in dataset/videos-dataset.csv from YouTube using youtube-dl with the --transcript argument.

  • Command: python <csv/youtube --video_id your_youtube_video_id>

  • Examples:
    • If csv: python csv

    • If your_youtube_video_id: python youtube --video_id 1Qws70XGSq4

    • Download all transcripts: python csv --transcript (will not download videos or change dataset/videos-dataset.csv)

  • Required Software: youtube-dl (YT-DL Website/YT-DL Github), wget

Video Downloader Script Help

Output of python --help:

usage: [-h] [--video_id VIDEO_ID] [--transcript] [-r RESOLUTION]
                            [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                            {csv,youtube}

Video Downloader

positional arguments:
{csv,youtube}         `csv`: Download all videos that have not been marked
                        as downloaded from the `videos-dataset.csv`.
                        `youtube`: download the specified video from YouTube
                        with id `--video_id`.

optional arguments:
-h, --help            show this help message and exit
--video_id VIDEO_ID   The YouTube video id to download if `method` is
                        `youtube`.
--transcript          Download the transcript INSTEAD of the video for each
                        entry in `videos-dataset.csv`. This ignores the
                        `downloaded` column in the CSV and will not download
                        any videos.
-r RESOLUTION, --resolution RESOLUTION
                        The resolution of the videos to download. Default is
                        maximum resolution.
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: 'Info').
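A sketch of the youtube-dl invocation such a script might build. The output template and the height-capped format selector are real youtube-dl syntax, but the exact flags and output layout this script uses are assumptions:

```python
def youtube_dl_command(video_id, resolution=None):
    """Build a youtube-dl invocation for one video. With no resolution the
    best available format is chosen; otherwise formats are capped at the
    given height using youtube-dl's format-selector syntax."""
    if resolution is None:
        fmt = "best"
    else:
        fmt = (f"bestvideo[height<={resolution}]+bestaudio"
               f"/best[height<={resolution}]")
    return [
        "youtube-dl",
        "-f", fmt,
        # Assumed output layout: dataset/videos/<id>/<id>.<ext>
        "-o", f"dataset/videos/{video_id}/{video_id}.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]
```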

3. Frame Extractor

Extracts frames either from a single video file (selected by id, which must be in the videos folder) or, in auto mode, from every video in the dataset that has been downloaded and has not yet had its frames extracted. extract_every_x_seconds can be set to auto to use the get_extract_every_x_seconds() function to automatically determine a good number of frames to extract. auto mode uses this feature and allows for exact reconstruction of the dataset. Extracted frames are saved into dataset/videos/[video_id]/frames.

  • Command: python <video_id/auto> <extract_every_x_seconds/auto> <quality>

  • Examples:
    • If video_id: python VT2o4KCEbes 20 5 or to automatically extract a good number of frames: python 63hAHbkzJG4 auto 5

    • If auto: python auto

  • Required Software: ffmpeg (FFmpeg Website/FFmpeg Github)
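The extraction itself maps to a single ffmpeg filter: `fps=1/N` keeps one frame every N seconds. A sketch of the invocation (the `img_%05d.jpg` output naming pattern is an assumption):

```python
def ffmpeg_frame_command(video_path, out_dir, extract_every_x_seconds,
                         quality=5):
    """Build an ffmpeg invocation that saves one frame every
    `extract_every_x_seconds` seconds as JPEGs. Lower `-q:v` values
    mean higher JPEG quality."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{extract_every_x_seconds}",
        "-q:v", str(quality),
        f"{out_dir}/img_%05d.jpg",
    ]
```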

3. pdf2image

Takes every page in all pdf files in dataset/slides/pdfs, converts them to png images, and saves them in dataset/slides/images/pdf_file_name.

  • Command: python

  • Required Software: poppler-utils (pdftoppm) (Man Page/Website)
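Each conversion is one pdftoppm call: `pdftoppm -png input.pdf prefix` writes prefix-1.png, prefix-2.png, and so on. A sketch, with the output directory layout assumed from the description above:

```python
from pathlib import Path

def pdftoppm_command(pdf_path, out_dir="dataset/slides/images"):
    """Build a pdftoppm invocation that converts every page of a PDF to
    PNGs under dataset/slides/images/<pdf_file_name>/ (layout assumed)."""
    stem = Path(pdf_path).stem
    prefix = str(Path(out_dir) / stem / stem)
    return ["pdftoppm", "-png", pdf_path, prefix]
```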

4. Auto Sort

Goes through every extracted frame for all videos in the dataset that don't have sorted frames (based on the presence of the sorted_frames directory) and classifies them using models/slide_classifier. You need a trained PyTorch model to use this. It creates a list of frames that need to be checked for correctness by humans in dataset/to-be-sorted.csv. This script imports certain files from models/slide_classifier, so the directory structure must not have been changed from installation.

  • Command: python

4. Sort From File

Creates a CSV of the category assigned to each frame of each video in the dataset or organizes extracted frames from a previously created CSV. The purpose of this script is to exactly reconstruct the dataset without downloading the already sorted images.

There are three options:
  1. make: create a file mapping each frame to its category by reading data from the dataset/videos directory.

  2. make_compiled: performs the same task as make but reads from the dataset/classifier-data directory. This is useful if the dataset has been compiled and the dataset/videos folder has been cleared.

  3. sort: sort each file in dataset/sort_file_map.csv, moving the respective frame from video_id/frames to video_id/frames_sorted/category.


This script appends to dataset/sort_file_map.csv. It will not overwrite data.

  • Command: python <make/make_compiled/sort>
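In sort mode, each CSV row maps to a single file move. A sketch, assuming sort_file_map.csv has video_id, frame, and category columns:

```python
from pathlib import Path

def sort_destination(row, videos_dir="dataset/videos"):
    """Given one sort_file_map.csv row (assumed columns: video_id, frame,
    category), return the source and destination paths for the move
    performed in `sort` mode."""
    base = Path(videos_dir) / row["video_id"]
    src = base / "frames" / row["frame"]
    dst = base / "frames_sorted" / row["category"] / row["frame"]
    return src, dst
```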

5. Compile Data

Merges the sorted frames from all the videos and slides in the dataset to dataset/classifier-data.


This script will not erase any data already stored in the dataset/classifier-data dataset folder.

  • Command: python <all/videos/slides>

  • Examples:
    • If videos: python videos, processes only sorted frames from videos

    • If slides: python slides, processes images from slides

    • If all: python all, processes from both videos and slides
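The video half of the merge can be sketched as copying every sorted frame into one folder per category. Prefixing filenames with the video id to avoid collisions is an assumption about this script's naming scheme:

```python
import shutil
from pathlib import Path

def compile_video_frames(videos_dir="dataset/videos",
                         out_dir="dataset/classifier-data"):
    """Copy every sorted frame into one folder per category, prefixing
    each filename with its video id (prefix scheme assumed)."""
    for frame in Path(videos_dir).glob("*/frames_sorted/*/*"):
        category = frame.parent.name        # .../frames_sorted/<category>/
        video_id = frame.parents[2].name    # .../<video_id>/frames_sorted/
        dest = Path(out_dir) / category
        dest.mkdir(parents=True, exist_ok=True)
        # copy2 preserves timestamps; existing files are not erased,
        # matching the note above about not clearing classifier-data.
        shutil.copy2(frame, dest / f"{video_id}_{frame.name}")
```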