E2E General Information¶
The end-to-end approach: one command to take a video file and return summarized notes.
Run python main.py <path to video> to get a notes file. See Summarizing a lecture video for a brief introduction.
Overall Explanation¶
First, frames are extracted once every second. Each frame is classified using the slide classifier (see Overview). Next, frames that were classified as slide are processed by a black border removal algorithm, a simple routine that crops to the largest rectangle in an image if the image's edge pixel values are all close to zero. Thus, screen-captured slide frames that have black bars on the sides (from a presentation created with a 4:3 aspect ratio but recorded at 16:9) can be interpreted correctly.
Frames that were classified as presenter_slide are perspective cropped using feature matching and contour/Hough lines algorithms. This process removes duplicate slides and crops images from the presenter_slide class to contain only the slide. However, to clean up any duplicates that may remain and to find the best representation of each slide, the slide frames are clustered using our custom segment clustering algorithm.
At this point, the process has identified the set of unique slides presented in the lecture video. The next step of the pipeline is to process these slides by performing an SSA, which is an algorithm that extracts formatted text from each slide. After that, figures are extracted from the slides and attached to the SSA (see Slide Structure Analysis (SSA)) for each slide. A figure is an image, chart, table, or diagram. After the system has extracted the visual content, it begins processing the audio. The audio is transcribed automatically using the Vosk small 36MB model.
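For reference, transcribing a WAV file with Vosk follows the library's standard recognizer loop. A minimal sketch (the model path and audio file name are assumptions; the pipeline downloads and manages the model itself):

import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("models/vosk-model-small-en-us-0.15")  # assumed location of the small model
wf = wave.open("lecture_audio.wav", "rb")            # 16 kHz mono PCM audio

rec = KaldiRecognizer(model, wf.getframerate())
pieces = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        pieces.append(json.loads(rec.Result())["text"])
pieces.append(json.loads(rec.FinalResult())["text"])
transcript = " ".join(pieces)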
After the audio is transcribed, the system has a textual representation of both the visual and auditory data, which need to be combined and summarized to create the final output. If the user wants structured notes, the SSA is used for formatting; otherwise, there are many different ways of combining and summarizing the audio and slide transcripts, which are discussed in Combination and Summarization.
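For instance, the extractive approaches offered by the pipeline (lsa, luhn, lex_rank, text_rank, edmundson; see --summarization_ext below) correspond to algorithm names also found in the sumy library. A minimal illustration with LexRank, not the pipeline's exact code:

from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer

combined_transcript = "..."  # placeholder for the combined audio transcript and slide text

parser = PlaintextParser.from_string(combined_transcript, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, sentences_count=5):
    print(sentence)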
Script Descriptions¶
These descriptions are short and concise. For more information about some of the larger, more complicated files, visit their respective pages using the links on the left.
border_removal: The black border removal algorithm is a simple routine that, if the edge pixel values of the image are all \(<\gamma\), finds the largest rectangle in the image and crops to it.
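A minimal sketch of the idea using OpenCV and NumPy (the value of \(\gamma\) below is an assumption, and this is not the pipeline's exact implementation):

import cv2
import numpy as np

def remove_black_borders(img, gamma=25):
    # Trim boundary rows/columns whose pixels are all below gamma,
    # i.e. crop to the largest rectangle containing non-black content.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    bright = gray >= gamma                     # True where "not black"
    cols = np.flatnonzero(bright.any(axis=0))  # columns containing content
    rows = np.flatnonzero(bright.any(axis=1))  # rows containing content
    if cols.size == 0 or rows.size == 0:
        return img                             # frame is entirely black
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]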
cluster: Provides the lecture2notes.end_to_end.cluster.ClusterFilesystem class, which clusters images from a directory and saves them to disk in folders corresponding to each centroid.
corner_crop_transform: Provides functions to detect the bounding box of a slide in a frame and automatically crop to that bounding box. The lecture2notes.end_to_end.corner_crop_transform.all_in_folder() method is used by the main script. See Perspective Cropping & Corner Detection for more information. This is one of the two components used to remove duplicate slides and crop presenter_slide images to only contain the slide. You can learn more about the overall perspective cropping process at Perspective Cropping.
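The warping step at the heart of perspective cropping boils down to OpenCV's perspective transform. A sketch, assuming the four slide corners have already been detected (corner detection itself is covered at Perspective Cropping & Corner Detection):

import cv2
import numpy as np

def perspective_crop(frame, corners):
    # corners: four (x, y) points ordered top-left, top-right,
    # bottom-right, bottom-left.
    tl, tr, br, bl = [np.asarray(c, dtype="float32") for c in corners]
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype="float32")
    matrix = cv2.getPerspectiveTransform(np.array([tl, tr, br, bl]), dst)
    return cv2.warpPerspective(frame, matrix, (width, height))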
figure_detection: The figure extraction algorithm identifies and saves images, charts, tables, and diagrams from slide frames so that they can be shown in the final summary. See Figure Detection Algorithm for more information.
frames_extractor: Provides lecture2notes.end_to_end.frames_extractor.extract_frames(), which extracts frames from input_video_path at quality level quality (best quality is 2) every extract_every_x_seconds seconds and saves them to output_path.
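Under the hood, frame extraction is an ffmpeg job. Roughly (the exact command and output naming used by the pipeline are assumptions):

import os
import subprocess

def extract_frames(input_video_path, quality=2, extract_every_x_seconds=1,
                   output_path="frames"):
    os.makedirs(output_path, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", input_video_path,
        "-vf", f"fps=1/{extract_every_x_seconds}",  # one frame every N seconds
        "-q:v", str(quality),                       # 2 is the best JPEG quality
        os.path.join(output_path, "img_%05d.jpg"),
    ], check=True)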
helpers: A small file of helper functions to reduce duplicate code. See Helpers.
imghash: Provides functions to detect near-duplicate images using image hashing methods from the imagehash library. lecture2notes.end_to_end.imghash.sort_by_duplicates() will create lists of similar images and lecture2notes.end_to_end.imghash.remove_duplicates() will remove those duplicates and keep the last file (when sorted alphanumerically descending).
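Perceptual hashes of near-duplicate frames differ by only a few bits, which is what makes this approach work. A sketch using the imagehash library (the file names and distance threshold are assumptions):

import imagehash
from PIL import Image

hash_a = imagehash.phash(Image.open("frame_001.jpg"))
hash_b = imagehash.phash(Image.open("frame_002.jpg"))

# Subtracting two hashes gives the Hamming distance between them.
if hash_a - hash_b <= 4:
    print("near duplicates")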
main: The master file that brings all of the components in this directory together by calling the functions they provide. Implements a skip_to variable that can be set to skip to a certain step of the process. This is useful if a previous step completed but the overall process failed. The --help output is located below.
ocr: OCR processing uses the pytesseract (GitHub) package. “Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and ‘read’ the text embedded in images. Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine.” This page has good information to improve results from tesseract. See ocr.all_in_folder() and lecture2notes.end_to_end.ocr.write_to_file().
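Basic pytesseract usage on a single slide image looks like this (the file name is a placeholder):

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("slide_001.jpg"))
print(text)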
segment_cluster: SegmentCluster iterates through frames in order and splits based on large visual differences. These differences are measured by the cosine difference between the feature vectors (2nd to last layer, right before the softmax) outputted by the slide classifier. This class behaves similarly to lecture2notes.end_to_end.cluster.ClusterFilesystem in that it also provides extract_and_add_features() and transfer_to_filesystem().
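The core splitting logic can be sketched in a few lines (the threshold value is an assumption; the real class also handles feature extraction and writing results to the filesystem):

from scipy.spatial.distance import cosine

def segment_boundaries(features, threshold=0.1):
    # features: ordered list of slide-classifier feature vectors,
    # one per frame. A new segment starts wherever consecutive
    # frames differ by more than `threshold` cosine distance.
    boundaries = [0]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) > threshold:
            boundaries.append(i)
    return boundaries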
sift_matcher: One of the components used to remove duplicate slides and crop presenter_slide images to only contain the slide. You can learn more about the sift_matcher at SIFT Matcher & Perspective Cropping and the overall perspective cropping process at Perspective Cropping.
slide_classifier: Provides lecture2notes.end_to_end.slide_classifier.classify_frames(), which automatically sorts images (the extracted frames) using the slide-classifier model. The inference script in models/slide_classifier is used.
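For illustration, single-image inference with a generic PyTorch image classifier looks roughly like this (the preprocessing and checkpoint handling in the actual inference script may differ):

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def classify(model, image_path, class_names):
    # Run one frame through the model and return the top class name.
    batch = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)[0]
    return class_names[int(probs.argmax())]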
spell_check: Contains the SpellChecker class, which can spell check a string with check() or a list of strings with check_all(). With both functions, the best correction is returned.
summarization_approaches: Many summarization models and algorithms for use with end_to_end/main.py. lecture2notes.end_to_end.summarization_approaches.cluster() is probably the most interesting method in this file.
transcript_downloader: Provides the lecture2notes.end_to_end.transcript_downloader.TranscriptDownloader class, which downloads transcripts from YouTube using the YouTube API or youtube-dl. youtube-dl is the recommended method since it does not require an API key and is significantly more reliable than the YouTube API.
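Downloading only the subtitles with youtube-dl's Python API looks like this (the video URL is a placeholder):

import youtube_dl

opts = {
    "skip_download": True,        # subtitles only, no media
    "writesubtitles": True,       # manually created subtitles
    "writeautomaticsub": True,    # fall back to auto-generated captions
    "subtitleslangs": ["en"],
}
with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])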
youtube_api: Function to use the YouTube API with a key or client_secret.json. See youtube_api.init_youtube().
Main Script Help¶
Output of python -m lecture2notes.end_to_end.main --help:
usage: main.py [-h] [-s N] [-d PATH] [-id] [--custom_id CUSTOM_ID] [-rm]
[--extract_frames_quality EXTRACT_FRAMES_QUALITY]
[--extract_every_x_seconds EXTRACT_EVERY_X_SECONDS]
[--slide_classifier_model_path SLIDE_CLASSIFIER_MODEL_PATH]
[--east_path EAST_PATH] [-c {silence,speech,none}] [-rd]
[-cm {normal,segment}]
[-ca {only_asr,concat,full_sents,keyword_based}]
[-sm {none,full_sents} [{none,full_sents} ...]]
[-sx {none,cluster,lsa,luhn,lex_rank,text_rank,edmundson,random}]
[-sa {none,presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}]
[-ss {structured_joined,none}]
[--structured_joined_summarization_method {none,abstractive,extractive}]
[--structured_joined_abs_summarizer {presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}]
[--structured_joined_ext_summarizer {lsa,luhn,lex_rank,text_rank,edmundson,random}]
[-tm {sphinx,google,youtube,deepspeech,vosk,wav2vec}]
[--transcribe_segment_sentences]
[--custom_transcript_check CUSTOM_TRANSCRIPT_CHECK]
[-sc {ocr,transcript} [{ocr,transcript} ...]] [--video_id ID]
[--transcribe_model_dir DIR] [--abs_hf_api]
[--abs_hf_api_overall] [--tensorboard PATH]
[--bart_checkpoint PATH] [--bart_state_dict_key PATH]
[--bart_fairseq] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
DIR
End-to-End Conversion of Lecture Videos to Notes using ML
positional arguments:
DIR path to video
optional arguments:
-h, --help show this help message and exit
-s N, --skip_to N set to > 0 to skip specific processing steps
-d PATH, --process_dir PATH
path to the processing directory (where extracted
frames and other files are saved), set to "automatic"
to use the video's folder (default: ./)
-id, --auto_id automatically create a subdirectory in `process_dir`
with a unique id for the video and change
`process_dir` to this new directory
--custom_id CUSTOM_ID
same as `--auto_id` but will create a subdirectory
using this value instead of a random id
-rm, --remove remove `process_dir` once conversion is complete
--extract_frames_quality EXTRACT_FRAMES_QUALITY
ffmpeg quality of extracted frames
--extract_every_x_seconds EXTRACT_EVERY_X_SECONDS
how many seconds between extracted frames
--slide_classifier_model_path SLIDE_CLASSIFIER_MODEL_PATH
path to the slide classification model checkpoint
--east_path EAST_PATH
path to the EAST text detector model
-c {silence,speech,none}, --chunk {silence,speech,none}
split the audio into small chunks on `silence` using
PyDub or voice activity `speech` using py-webrtcvad.
set to 'none' to disable. Recommend 'speech' for
DeepSpeech and 'none' for Vosk. (default: 'none').
-rd, --remove_duplicates
remove duplicate slides before perspective cropping
and before clustering (helpful when `--cluster_method`
is `segment`)
-cm {normal,segment}, --cluster_method {normal,segment}
which clustering method to use. `normal` uses a
clustering algorithm from scikit-learn and `segment`
uses the special method that iterates through frames
in order and splits based on large visual differences
-ca {only_asr,concat,full_sents,keyword_based}, --combination_algo {only_asr,concat,full_sents,keyword_based}
which combination algorithm to use. more information
in documentation.
-sm {none,full_sents} [{none,full_sents} ...], --summarization_mods {none,full_sents} [{none,full_sents} ...]
modifications to perform during summarization process.
each modification is run between the combination and
extractive stages. more information in documentation.
-sx {none,cluster,lsa,luhn,lex_rank,text_rank,edmundson,random}, --summarization_ext {none,cluster,lsa,luhn,lex_rank,text_rank,edmundson,random}
which extractive summarization approach to use. more
information in documentation.
-sa {none,presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}, --summarization_abs {none,presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}
which abstractive summarization approach/model to use.
more information in documentation.
-ss {structured_joined,none}, --summarization_structured {structured_joined,none}
An additional summarization algorithm that creates a
structured summary with figures, slide content (with
bolded area), and summarized transcript content from
the SSA (Slide Structure Analysis) and transcript JSON
data.
--structured_joined_summarization_method {none,abstractive,extractive}
The summarization method to use during
`structured_joined` summarization.
--structured_joined_abs_summarizer {presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}
The abstractive summarizer to use during
`structured_joined` summarization (to create summaries
of each slide) if
`structured_joined_summarization_method` is
'abstractive'.
--structured_joined_ext_summarizer {lsa,luhn,lex_rank,text_rank,edmundson,random}
The extractive summarizer to use during
`structured_joined` summarization (to create summaries
of each slide) if
`--structured_joined_summarization_method` is
'extractive'.
-tm {sphinx,google,youtube,deepspeech,vosk,wav2vec}, --transcription_method {sphinx,google,youtube,deepspeech,vosk,wav2vec}
specify the program that should be used for
transcription. CMU Sphinx: use pocketsphinx Google
Speech Recognition: probably will require chunking
(online, free, max 1 minute chunks) YouTube: download
a video transcript from YouTube based on `--video_id`
DeepSpeech: Use the deepspeech library (fast with good
GPU) Vosk: Use the vosk library (extremely small low-
resource model with great accuracy, this is the
default) Wav2Vec: State-of-the-art speech-to-text
model through the `huggingface/transformers` library.
--transcribe_segment_sentences
Disable DeepSegment automatic sentence boundary
detection. Specifying this option will output
transcripts without punctuation.
--custom_transcript_check CUSTOM_TRANSCRIPT_CHECK
Check if a transcript file (followed by an extension of
vtt, srt, or sbv) with the specified name is in the
processing folder and use it instead of running
speech-to-text.
-sc {ocr,transcript} [{ocr,transcript} ...], --spell_check {ocr,transcript} [{ocr,transcript} ...]
option to perform spell checking on the ocr results of
the slides or the voice transcript or both
--video_id ID id of youtube video to get subtitles from. set
`--transcription_method` to `youtube` for this
argument to take effect.
--transcribe_model_dir DIR
path containing the model files for Vosk/DeepSpeech if
`--transcription_method` is set to one of those
models. See the documentation for details.
--abs_hf_api use the huggingface inference API for abstractive
summarization tasks
--abs_hf_api_overall use the huggingface inference API for final overall
abstractive summarization task
--tensorboard PATH Path to tensorboard logdir. Tensorboard not used if
not set. Tensorboard only used to visualize cluster
primarily for debugging.
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level (default: 'Info').