Dataset General Information

Directory Structure

classifier-data: Created by 5. Compile Data. Contains all extracted slides and extracted sorted frames from the videos directory. This is the folder that should be given to the model for training.
scraper-scripts: Contains all of the scripts needed to obtain and manipulate the data. See Scraper Scripts for more information.
slides:
- images: The location where slide images extracted from slideshows in pdfs subdirectory are saved (used by 3. pdf2image).
- pdfs: The location where downloaded slideshow PDFs are saved (used by 2. Slides Downloader).
videos: Contains the following directory structure for each downloaded video:
- video_id: The parent folder containing all the files related to the specific video.
  - frames: All frames extracted from video_id by 3. Frame Extractor.
  - frames_sorted: Frames from video_id that are grouped into correct classes. 4. Auto Sort can help with this, but you must verify correctness. More at 4. Auto Sort.
slides-dataset.csv: A list of all the slide presentations used in the dataset. NOT automatically updated by 2. Slides Downloader. You must manually update this file if you want the dataset to be reproducible.
sort_file_map.csv: A list of filenames and categories. Used exclusively by 4. Sort From File, either to create a mapping from each frame to the category it belongs to, or to sort the frames listed in sort_file_map.csv by moving each one from video_id/frames to video_id/frames_sorted/category.
to-be-sorted.csv: A list of videos and specific frames that have been sorted by 4. Auto Sort but need to be checked by a human for correctness. When running 4. Auto Sort, any frames for which the AI model’s confidence level is below a threshold are added to this list as most likely incorrect.
videos-dataset.csv: A list of all videos used in the dataset. Automatically updated by 1. YouTube Scraper and 1. Website Scraper. The provider column is used by 2. Video Downloader to determine how to download the video.
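
As a quick sanity check on videos-dataset.csv, the rows can be counted per provider. This is a minimal sketch: it assumes the file has a header row that includes a provider column (the only column named above). The video_id column and the sample values are hypothetical and used only for illustration.

```python
import csv
import io
from collections import Counter


def count_by_provider(csv_file):
    """Count dataset rows for each value of the `provider` column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["provider"] for row in reader)


# Hypothetical sample data; the real file's other columns may differ.
sample = io.StringIO(
    "video_id,provider\n"
    "abc123,youtube\n"
    "def456,youtube\n"
    "ghi789,website\n"
)
counts = count_by_provider(sample)
print(dict(counts))
```

With the real file, the same function could be called as count_by_provider(open("videos-dataset.csv", newline="")).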

Walkthrough (Step-by-Step Instructions to Create Dataset)

Install Prerequisite Software: youtube-dl, wget, ffmpeg, poppler-utils (see Quick-Install (Copy & Paste))
Download Content:
1. Download all videos: python 2-video_downloader.py csv
2. Download all slides: python 2-slides_downloader.py csv
Data Pre-processing:
1. Convert slide PDFs to PNGs: python 3-pdf2image.py
2. Extract frames from all videos: python 3-frame_extractor.py auto
3. Sort the frames: python 4-sort_from_file.py sort
Compile and merge the data: python 5-compile_data.py
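
The download, pre-processing, and compile commands above can be chained in a small driver script. The sketch below is not part of the project. It assumes it is run from the directory that contains the numbered scripts, and the names PIPELINE, build_commands, and run_pipeline are made up for this example. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Commands copied from the walkthrough, in the order they should run.
PIPELINE = [
    ["2-video_downloader.py", "csv"],
    ["2-slides_downloader.py", "csv"],
    ["3-pdf2image.py"],
    ["3-frame_extractor.py", "auto"],
    ["4-sort_from_file.py", "sort"],
    ["5-compile_data.py"],
]


def build_commands(python=sys.executable):
    """Prefix every step with the Python interpreter to use."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Stopping at the first failure matters here because each step consumes the output of the previous one. Frames sorted by 4. Auto Sort still need to be verified by hand before compiling.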

Transcripts WER

Script location: dataset/transcripts_wer.py

This script will calculate the Word Error Rate (WER), Match Error Rate (MER), and Word Information Lost (WIL) for all videos in dataset/videos-dataset.csv that are YouTube videos with manual transcripts added (see the YouTube transcription method for more info about transcripts on YouTube).

There are two modes:

transcribe: Runs speech-to-text with DeepSpeech.
Process: For each transcript in dataset/transcripts:
1. Download the audio for the video
2. Convert the audio to WAV
3. Run DeepSpeech speech-to-text
calc (listed as calc_wer in the script help output): Calculates the statistics between the YouTube (human, ground-truth) and DeepSpeech (AI, ML) transcripts.
Process: For each processed transcript (those with --suffix) in dataset/transcripts:
1. Convert the YouTube captions file to a string
2. Apply pre-processing to the transcripts (to lower case, remove multiple spaces, strip, sentences to list of words, remove empty strings)
3. Compute the statistics using the jiwer package
4. Log the stats
5. When all files are complete, log the average stats
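
To make the calc steps concrete, here is a rough pure-Python sketch of the pre-processing and the three statistics. The script itself uses the jiwer package. The code below is a stand-in written for illustration, with made-up function names, and it uses the standard definitions WER = (S + D + I) / N, MER = (S + D + I) / (H + S + D + I), and WIL = 1 - (H / N_ref) * (H / N_hyp), where H, S, D, and I are hits, substitutions, deletions, and insertions. Results may differ slightly from jiwer when several minimum-edit alignments exist.

```python
import re


def preprocess(text):
    """Lower-case, collapse repeated whitespace, strip, split, drop empty strings."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return [word for word in text.split(" ") if word]


def align_counts(ref, hyp):
    """Return (hits, subs, dels, ins) for a minimum-edit word alignment."""
    # Each cell holds (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)
            else:
                diag = (c + 1, h, s + 1, d, n)
            c, h, s, d, n = prev[j]
            delete = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = curr[j - 1]
            insert = (c + 1, h, s, d, n + 1)
            # Lowest cost wins; ties are broken in favour of more hits.
            curr.append(min(diag, delete, insert, key=lambda t: (t[0], -t[1])))
        prev = curr
    return prev[-1][1:]


def error_rates(reference, hypothesis):
    """Compute WER, MER, and WIL between two transcript strings."""
    ref, hyp = preprocess(reference), preprocess(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    hits, subs, dels, ins = align_counts(ref, hyp)
    errors = subs + dels + ins
    wip = (hits / len(ref)) * (hits / len(hyp)) if hyp else 0.0
    return {
        "wer": errors / len(ref),
        "mer": errors / (hits + errors),
        "wil": 1.0 - wip,
    }


def average(stats_list):
    """Average each statistic over all processed transcripts."""
    return {k: sum(s[k] for s in stats_list) / len(stats_list) for k in stats_list[0]}


stats = error_rates("The quick  brown fox", "the quick brown dog jumps")
print(stats)
```

In the example, the hypothesis has one substitution and one insertion against a four-word reference, so the WER is 2 / 4.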

Note

This script does not automatically download the transcripts for the YouTube videos. It just transcribes the YouTube videos in dataset/videos-dataset.csv with DeepSpeech and computes statistics with ground-truth transcripts. This means your ground-truth transcripts can come from a source other than YouTube and this script will still work. To download the transcripts for the videos in dataset/videos-dataset.csv use 2. Video Downloader.

Directions

Step 0: Make sure you have some videos in dataset/videos-dataset.csv. The 1. YouTube Scraper script can be used to add videos to the dataset.

Run python 2-video_downloader.py csv --transcript to download transcripts (in “.vtt” format) for all the YouTube videos in dataset/videos-dataset.csv to the dataset/transcripts folder.
Run python transcripts_wer.py transcribe to transcribe all the videos with ground-truth transcripts using DeepSpeech.
Run python transcripts_wer.py calc to calculate the statistics (including WER) between the DeepSpeech and YouTube transcripts.
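
The transcripts downloaded in the first step are “.vtt” caption files, which the calc mode converts to plain strings before comparison. The sketch below shows one simple way such a conversion could work. It is an assumption about the general approach, not the script's actual code, and vtt_to_text is a name invented for this example. It keeps only the text lines that follow a cue timing line and strips inline tags.

```python
import html
import re

# Inline WebVTT markup such as <i>, <c.colorE5E5E5>, or <00:00:01.000>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt):
    """Join the cue text of a WebVTT document into a single string."""
    parts = []
    in_cue = False
    for line in vtt.splitlines():
        line = line.strip()
        if not line:
            in_cue = False  # a blank line ends the current cue
        elif "-->" in line:
            in_cue = True  # timing line: the cue text starts on the next line
        elif in_cue:
            parts.append(html.unescape(TAG.sub("", line)))
    return " ".join(parts)


sample = (
    "WEBVTT\n"
    "Kind: captions\n"
    "Language: en\n"
    "\n"
    "00:00:00.000 --> 00:00:02.000\n"
    "Hello and welcome\n"
    "\n"
    "00:00:02.000 --> 00:00:04.500\n"
    "to the <i>first</i> lecture\n"
)
print(vtt_to_text(sample))
```

Header lines, NOTE blocks, and cue identifiers are skipped because they never follow a timing line. Captions that repeat text across overlapping cues would need extra de-duplication that this sketch does not attempt.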

Transcripts WER Script Help

usage: transcripts_wer.py [-h] [--transcripts_dir TRANSCRIPTS_DIR]
                        [--deepspeech_dir DEEPSPEECH_DIR] [--suffix SUFFIX]
                        [--no_chunk]
                        [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                        {transcribe,calc_wer}

Word Error Rate (WER) for Transcripts with DeepSpeech

positional arguments:
  {transcribe,calc_wer}
                        `transcribe` each video and create a transcript using
                        ML models or use `calc_wer` to compute the WER for the
                        created transcripts

optional arguments:
  -h, --help            show this help message and exit
  --transcripts_dir TRANSCRIPTS_DIR
                        path to the directory containing transcripts
                        downloaded with 2-video_downloader.py
  --deepspeech_dir DEEPSPEECH_DIR
                        path to the directory containing the DeepSpeech models
  --suffix SUFFIX       string added after the video id and before the
                        extension in the transcript output from the ML model
  --no_chunk            Disable audio chunking by voice activity.
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: 'Info').