Dataset General Information¶
Directory Structure¶
classifier-data: Created by 5. Compile Data. Contains all extracted slides and extracted sorted frames from the videos directory. This is the folder that should be given to the model for training.
scraper-scripts: Contains all of the scripts needed to obtain and manipulate the data. See Scraper Scripts for more information.
- slides:
images: The location where slide images extracted from slideshows in pdfs subdirectory are saved (used by 3. pdf2image).
pdfs: The location where downloaded slideshow PDFs are saved (used by 2. Slides Downloader).
- videos: Contains the following directory structure for each downloaded video:
- video_id: The parent folder containing all the files related to the specific video.
frames: All frames extracted from video_id by 3. Frame Extractor.
frames_sorted: Frames from video_id that are grouped into correct classes. 4. Auto Sort can help with this but you must verify correctness. More at 4. Auto Sort.
slides-dataset.csv: A list of all the slide presentations used in the dataset. NOT automatically updated by 2. Slides Downloader. You must manually update this file if you want the dataset to be reproducible.
sort_file_map.csv: A list of filenames and categories. Used exclusively by 4. Sort From File to either make a file mapping of the category to which each frame belongs, or to sort each file in sort_file_map.csv, moving the respective frame from video_id/frames to video_id/frames_sorted/category.
to-be-sorted.csv: A list of videos and specific frames that have been sorted by 4. Auto Sort but need to be checked by a human for correctness. When running 4. Auto Sort, any frames where the AI model's confidence level is below a threshold are added to this list as most likely incorrect.
videos-dataset.csv: A list of all videos used in the dataset. Automatically updated by 1. YouTube Scraper and 1. Website Scraper. The provider column is used by 2. Video Downloader to determine how to download the video.
Walkthrough (Step-by-Step Instructions to Create Dataset)¶
Install Prerequisite Software: youtube-dl, wget, ffmpeg, poppler-utils (see Quick-Install (Copy & Paste))
- Download Content:
Download all videos:
python 2-video_downloader.py csv
Download all slides:
python 2-slides_downloader.py csv
- Data Pre-processing:
Convert slide PDFs to PNGs:
python 3-pdf2image.py
Extract frames from all videos:
python 3-frame_extractor.py auto
Sort the frames:
python 4-sort_from_file.py sort
Compile and merge the data:
python 5-compile_data.py
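The walkthrough above runs in a fixed order: content must be downloaded before pre-processing, and frames must be extracted before they can be sorted. A minimal sketch that encodes this ordering and stops at the first failing step; the runner callback is a stand-in for however you invoke the scripts (e.g. subprocess from the scraper-scripts directory):

```python
# The walkthrough steps, in dependency order. The commands are taken from
# the steps above; how each one is executed is left to the caller.
PIPELINE = [
    "python 2-slides_downloader.py csv",
    "python 2-video_downloader.py csv",
    "python 3-pdf2image.py",
    "python 3-frame_extractor.py auto",
    "python 4-sort_from_file.py sort",
    "python 5-compile_data.py",
]

def run_pipeline(runner):
    """Run each step in order; return the first failing command, or None."""
    for cmd in PIPELINE:
        if runner(cmd) != 0:
            return cmd
    return None
```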
Transcripts WER¶
Script location: dataset/transcripts_wer.py
This script calculates the Word Error Rate (WER), Match Error Rate (MER), and Word Information Lost (WIL) for every video in dataset/videos-dataset.csv that is a YouTube video with a manual transcript added (see the YouTube transcription method for more info about transcripts on YouTube).
There are two modes:
transcribe: Runs speech-to-text with DeepSpeech. Process: for each transcript in dataset/transcripts:
Download the audio for the video
Convert the audio to WAV
Run DeepSpeech speech-to-text
calc: Calculates the statistics between the YouTube (human, ground-truth) and DeepSpeech (AI/ML) transcripts. Process: for each processed transcript (those with --suffix) in dataset/transcripts:
Convert the YouTube captions file to a string
Apply pre-processing to the transcripts (convert to lower case, remove multiple spaces, strip, split sentences into lists of words, remove empty strings)
Compute the statistics using the jiwer package
Log the stats
When all files are complete, log the average stats
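The pre-processing and WER steps can be illustrated in plain Python. The actual script delegates the statistics to the jiwer package; this standalone sketch reimplements only the pre-processing listed above plus WER as word-level edit distance divided by the number of reference words:

```python
import re

def preprocess(text):
    """Mirror the pre-processing above: lower-case, collapse multiple spaces,
    strip, split into words, drop empty strings."""
    return [w for w in re.sub(r"\s+", " ", text.lower()).strip().split(" ") if w]

def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = preprocess(reference), preprocess(hypothesis)
    # dp[j] = edit distance between ref[:i] and hyp[:j], rolled over rows.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag = dp[0]
        dp[0] = i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev_diag + cost)  # substitution or match
            prev_diag = cur
    return dp[-1] / max(len(ref), 1)
```

jiwer additionally reports MER and WIL from the same alignment; this sketch only covers WER.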
Note
This script does not automatically download the transcripts for the YouTube videos. It just transcribes the YouTube videos in dataset/videos-dataset.csv
with DeepSpeech and computes statistics with ground-truth transcripts. This means your ground-truth transcripts can come from a source other than YouTube and this script will still work. To download the transcripts for the videos in dataset/videos-dataset.csv
use 2. Video Downloader.
Directions¶
Step 0: Make sure you have some videos in dataset/videos-dataset.csv. The 1. YouTube Scraper script can be used to add videos to the dataset.
Run python 2-video_downloader.py csv --transcript to download transcripts (in ".vtt" format) for all the YouTube videos in dataset/videos-dataset.csv to the dataset/transcripts folder.
Run python transcripts_wer.py transcribe to transcribe all the videos with ground-truth transcripts using DeepSpeech.
Run python transcripts_wer.py calc to calculate the statistics (including WER) between the DeepSpeech and YouTube transcripts.
Transcripts WER Script Help¶
usage: transcripts_wer.py [-h] [--transcripts_dir TRANSCRIPTS_DIR]
[--deepspeech_dir DEEPSPEECH_DIR] [--suffix SUFFIX]
[--no_chunk]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
{transcribe,calc_wer}
Word Error Rate (WER) for Transcripts with DeepSpeech
positional arguments:
{transcribe,calc_wer}
`transcribe` each video and create a transcript using
ML models or use `calc_wer` to compute the WER for the
created transcripts
optional arguments:
-h, --help show this help message and exit
--transcripts_dir TRANSCRIPTS_DIR
path to the directory containing transcripts
downloaded with 2-video_downloader.py
--deepspeech_dir DEEPSPEECH_DIR
path to the directory containing the DeepSpeech models
--suffix SUFFIX string added after the video id and before the
extension in the transcript output from the ML model
--no_chunk Disable audio chunking by voice activity.
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level (default: 'Info').