E2E General Information

The end-to-end approach: one command takes a video file and returns summarized notes.

Run python main.py <path to video> to get a notes file. See Summarizing a lecture video for a brief introduction.

Overall Explanation

First, frames are extracted once every second. Each frame is classified using the slide classifier (see Overview). Next, frames that were classified as slide are processed by a black border removal algorithm, a simple procedure that crops an image to its largest content rectangle if the edge pixel values of the image are all close to zero. Thus, screen-captured slide frames that have black bars on the sides (for example, from a presentation created at a 4:3 aspect ratio but recorded at 16:9) can be interpreted correctly.
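
The exact logic lives in border_removal (described below); the following is only a minimal sketch of the idea using OpenCV, with an illustrative threshold standing in for the \(\gamma\) parameter and a hypothetical function name.

import cv2
import numpy as np

def remove_black_borders(image, threshold=15):
    # Illustrative sketch only: crop to the bounding box of non-black content
    # when every edge pixel of the frame is nearly black (below `threshold`).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    edge_pixels = np.concatenate([gray[0, :], gray[-1, :], gray[:, 0], gray[:, -1]])
    if edge_pixels.max() >= threshold:
        return image  # the border is not uniformly black, so leave the frame alone

    # Bounding box of every pixel brighter than the threshold.
    mask = (gray > threshold).astype(np.uint8)
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
    return image[y:y + h, x:x + w]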

Frames that were classified as presenter_slide are perspective cropped using feature matching and contour/Hough line algorithms. This process removes duplicate slides and crops images from the presenter_slide class to only contain the slide. However, to clean up any duplicates that may remain and to find the best representation of each slide, the slide frames are clustered using our custom segment clustering algorithm.
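
The real pipeline combines SIFT feature matching (sift_matcher) with contour/Hough-line corner detection (corner_crop_transform), both described below. Purely as a rough illustration of the contour half, a frame can be cropped by finding the largest four-point contour and warping it to a rectangle; the Canny thresholds and function name are assumptions of this sketch, and corner reordering is omitted for brevity.

import cv2
import numpy as np

def rough_perspective_crop(frame):
    # Sketch of contour-based cropping, not the actual corner_crop_transform code.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Take the largest contour that simplifies to four corners (the slide).
    quad = None
    for contour in sorted(contours, key=cv2.contourArea, reverse=True):
        approx = cv2.approxPolyDP(contour, 0.02 * cv2.arcLength(contour, True), True)
        if len(approx) == 4:
            quad = approx.reshape(4, 2).astype(np.float32)
            break
    if quad is None:
        return frame  # no slide-like quadrilateral found

    # Warp the quadrilateral to an upright rectangle (corners assumed already ordered).
    width = int(max(np.linalg.norm(quad[0] - quad[1]), np.linalg.norm(quad[2] - quad[3])))
    height = int(max(np.linalg.norm(quad[1] - quad[2]), np.linalg.norm(quad[3] - quad[0])))
    target = np.array([[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    return cv2.warpPerspective(frame, cv2.getPerspectiveTransform(quad, target), (width, height))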

At this point, the process has identified the set of unique slides presented in the lecture video. The next step of the pipeline is to process these slides by performing a Slide Structure Analysis (SSA), an algorithm that extracts formatted text from each slide. After that, figures are extracted from the slides and attached to each slide's SSA (see Slide Structure Analysis (SSA)). A figure is an image, chart, table, or diagram. After the system has extracted the visual content, it begins processing the audio. The audio is transcribed automatically using the small (36 MB) Vosk model.
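
As a reference point for the transcription step, a bare-bones Vosk invocation looks roughly like the following; the function name, the 4000-frame chunk size, and the assumption of a 16 kHz mono WAV input are illustrative choices of this sketch, not the exact lecture2notes code.

import json
import wave
from vosk import KaldiRecognizer, Model

def transcribe_wav(wav_path, model_dir):
    # Minimal Vosk usage sketch: feed PCM chunks to the recognizer and
    # collect the partial results plus the final result.
    audio = wave.open(wav_path, "rb")
    recognizer = KaldiRecognizer(Model(model_dir), audio.getframerate())

    pieces = []
    while True:
        data = audio.readframes(4000)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            pieces.append(json.loads(recognizer.Result())["text"])
    pieces.append(json.loads(recognizer.FinalResult())["text"])
    return " ".join(pieces)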

After the audio is transcribed, the system has textual representations of both the visual and auditory data, which need to be combined and summarized to create the final output. If the user desires notes, the SSA is used for formatting; otherwise, there are tens of different ways of combining and summarizing the audio and slide transcripts, which are discussed in Combination and Summarization.
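
For a concrete sense of the extractive stage, the algorithm names used later in this page (lsa, luhn, lex_rank, text_rank, edmundson) match summarizers available in the sumy library; whether lecture2notes calls sumy directly is not stated here, so treat this as a minimal sketch with an illustrative function name and sentence count.

from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summary(text, sentence_count=10):
    # Illustrative extractive summarization with sumy's LSA summarizer
    # (the English tokenizer requires NLTK's punkt data to be installed).
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    sentences = LsaSummarizer()(parser.document, sentence_count)
    return " ".join(str(sentence) for sentence in sentences)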

Script Descriptions

These descriptions are short and concise. For more information about some of the larger, more complicated files, visit their respective pages on the left.

  • border_removal: The black border removal algorithm is a simple procedure that finds the largest rectangle in the image if the edge pixel values of the image are all \(<\gamma\) and then crops to that rectangle.

  • cluster: Provides lecture2notes.end_to_end.cluster.ClusterFilesystem class, which clusters images from a directory and saves them to disk in folders corresponding to each centroid.

  • corner_crop_transform: Provides functions to detect the bounding box of a slide in a frame and automatically crop to that bounding box. The lecture2notes.end_to_end.corner_crop_transform.all_in_folder() method is used by the main script. See Perspective Cropping & Corner Detection for more information. This is one of the two components used to remove duplicate slides and crop presenter_slide images to only contain the slide. You can learn more about the overall perspective cropping process at Perspective Cropping.

  • figure_detection: The figure extraction algorithm identifies and saves images, charts, tables, and diagrams from slide frames so that they can be shown in the final summary. See Figure Detection Algorithm for more information.

  • frames_extractor: Provides lecture2notes.end_to_end.frames_extractor.extract_frames(), which extracts frames from input_video_path at quality level quality (best quality is 2) every extract_every_x_seconds seconds and saves them to output_path. A minimal ffmpeg invocation illustrating these parameters is sketched after this list.

  • helpers: A small file of helper functions to reduce duplicate code. See Helpers.

  • imghash: Provides functions to detect near-duplicate images using image hashing methods from the imagehash library. lecture2notes.end_to_end.imghash.sort_by_duplicates() will create lists of similar images and lecture2notes.end_to_end.imghash.remove_duplicates() will remove those duplicates and keep the last file (when sorted alphanumerically descending). A short perceptual-hashing sketch appears after this list.

  • main: The master file that brings all of the components in this directory together by calling the functions they provide. Implements a skip_to variable that can be set to skip to a certain step of the process. This is useful if a previous step completed but the overall process failed. The --help output is shown below.

  • ocr: OCR processing uses the pytesseract (GitHub) package. “Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and ‘read’ the text embedded in images. Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine.” This page has good information on improving results from Tesseract. See ocr.all_in_folder() and lecture2notes.end_to_end.ocr.write_to_file().

  • segment_cluster: SegmentCluster iterates through frames in order and splits based on large visual differences. These differences are measured by the cosine difference between the feature vectors (from the second-to-last layer, right before the softmax) output by the slide classifier. This class behaves similarly to lecture2notes.end_to_end.cluster.ClusterFilesystem in that it also provides extract_and_add_features() and transfer_to_filesystem().

  • sift_matcher: One of the components used to remove duplicate slides and crop presenter_slide images to only contain the slide. You can learn more about the sift_matcher at SIFT Matcher & Perspective Cropping and the overall perspective cropping process at Perspective Cropping.

  • slide_classifier: Provides lecture2notes.end_to_end.slide_classifier.classify_frames() which automatically sorts images (the extracted frames) using the slide-classifier model. The inference script in models/slide_classifier is used.

  • spell_check: Contains the SpellChecker class, which can spell check a string with check() or a list of strings with check_all(). With both functions, the best correction is returned.

  • summarization_approaches: Many summarization models and algorithms for use with end_to_end/main.py. The lecture2notes.end_to_end.summarization_approaches.cluster() is probably the most interesting method from this file.

  • transcript_downloader: Provides the lecture2notes.end_to_end.transcript_downloader.TranscriptDownloader class, which downloads transcripts from YouTube using the YouTube API or youtube-dl. youtube-dl is the recommended method since it does not require an API key and is significantly more reliable than the YouTube API.

  • youtube_api: Function to use the YouTube API with an API key or client_secret.json. See youtube_api.init_youtube().
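
To make the frames_extractor parameters above concrete, the equivalent ffmpeg call could be constructed roughly as follows; the output filename pattern is illustrative, and the real extract_frames() may build its command differently.

import subprocess

def extract_frames(input_video_path, output_path, quality=2, extract_every_x_seconds=1):
    # Illustrative ffmpeg invocation: one frame every N seconds at JPEG quality `quality`
    # (2 is the best quality ffmpeg accepts for -q:v).
    subprocess.run(
        [
            "ffmpeg", "-i", input_video_path,
            "-vf", f"fps=1/{extract_every_x_seconds}",
            "-q:v", str(quality),
            f"{output_path}/img_%05d.jpg",
        ],
        check=True,
    )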

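The imghash entry above relies on perceptual hashing. As a rough sketch, two frames whose hashes differ by only a few bits are treated as near duplicates; the specific hash function, threshold, and function name here are assumptions, not necessarily what imghash uses.

import imagehash
from PIL import Image

def find_near_duplicates(image_paths, max_distance=5):
    # Compute a perceptual hash per frame; ImageHash objects subtract to a
    # Hamming distance, so small distances indicate near-duplicate images.
    hashes = {path: imagehash.phash(Image.open(path)) for path in image_paths}
    pairs = []
    ordered = sorted(image_paths)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if hashes[first] - hashes[second] <= max_distance:
                pairs.append((first, second))
    return pairs
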
Main Script Help

Output of python -m lecture2notes.end_to_end.main --help:

usage: main.py [-h] [-s N] [-d PATH] [-id] [--custom_id CUSTOM_ID] [-rm]
[--extract_frames_quality EXTRACT_FRAMES_QUALITY]
[--extract_every_x_seconds EXTRACT_EVERY_X_SECONDS]
[--slide_classifier_model_path SLIDE_CLASSIFIER_MODEL_PATH]
[--east_path EAST_PATH] [-c {silence,speech,none}] [-rd]
[-cm {normal,segment}]
[-ca {only_asr,concat,full_sents,keyword_based}]
[-sm {none,full_sents} [{none,full_sents} ...]]
[-sx {none,cluster,lsa,luhn,lex_rank,text_rank,edmundson,random}]
[-sa {none,presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}]
[-ss {structured_joined,none}]
[--structured_joined_summarization_method {none,abstractive,extractive}]
[--structured_joined_abs_summarizer {presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}]
[--structured_joined_ext_summarizer {lsa,luhn,lex_rank,text_rank,edmundson,random}]
[-tm {sphinx,google,youtube,deepspeech,vosk,wav2vec}]
[--transcribe_segment_sentences]
[--custom_transcript_check CUSTOM_TRANSCRIPT_CHECK]
[-sc {ocr,transcript} [{ocr,transcript} ...]] [--video_id ID]
[--transcribe_model_dir DIR] [--abs_hf_api]
[--abs_hf_api_overall] [--tensorboard PATH]
[--bart_checkpoint PATH] [--bart_state_dict_key PATH]
[--bart_fairseq] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
DIR

End-to-End Conversion of Lecture Videos to Notes using ML

positional arguments:
DIR                   path to video

optional arguments:
-h, --help            show this help message and exit
-s N, --skip_to N     set to > 0 to skip specific processing steps
-d PATH, --process_dir PATH
        path to the processing directory (where extracted
        frames and other files are saved), set to "automatic"
        to use the video's folder (default: ./)
-id, --auto_id        automatically create a subdirectory in `process_dir`
        with a unique id for the video and change
        `process_dir` to this new directory
--custom_id CUSTOM_ID
        same as `--auto_id` but will create a subdirectory
        using this value instead of a random id
-rm, --remove         remove `process_dir` once conversion is complete
--extract_frames_quality EXTRACT_FRAMES_QUALITY
        ffmpeg quality of extracted frames
--extract_every_x_seconds EXTRACT_EVERY_X_SECONDS
        how many seconds between extracted frames
--slide_classifier_model_path SLIDE_CLASSIFIER_MODEL_PATH
        path to the slide classification model checkpoint
--east_path EAST_PATH
        path to the EAST text detector model
-c {silence,speech,none}, --chunk {silence,speech,none}
        split the audio into small chunks on `silence` using
        PyDub or voice activity `speech` using py-webrtcvad.
        set to 'none' to disable. Recommend 'speech' for
        DeepSpeech and 'none' for Vosk. (default: 'none').
-rd, --remove_duplicates
        remove duplicate slides before perspective cropping
        and before clustering (helpful when `--cluster_method`
        is `segment`)
-cm {normal,segment}, --cluster_method {normal,segment}
        which clustering method to use. `normal` uses a
        clustering algorithm from scikit-learn and `segment`
        uses the special method that iterates through frames
        in order and splits based on large visual differences
-ca {only_asr,concat,full_sents,keyword_based}, --combination_algo {only_asr,concat,full_sents,keyword_based}
        which combination algorithm to use. more information
        in documentation.
-sm {none,full_sents} [{none,full_sents} ...], --summarization_mods {none,full_sents} [{none,full_sents} ...]
        modifications to perform during summarization process.
        each modification is run between the combination and
        extractive stages. more information in documentation.
-sx {none,cluster,lsa,luhn,lex_rank,text_rank,edmundson,random}, --summarization_ext {none,cluster,lsa,luhn,lex_rank,text_rank,edmundson,random}
        which extractive summarization approach to use. more
        information in documentation.
-sa {none,presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}, --summarization_abs {none,presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}
        which abstractive summarization approach/model to use.
        more information in documentation.
-ss {structured_joined,none}, --summarization_structured {structured_joined,none}
        An additional summarization algorithm that creates a
        structured summary with figures, slide content (with
        bolded area), and summarized transcript content from
        the SSA (Slide Structure Analysis) and transcript JSON
        data.
--structured_joined_summarization_method {none,abstractive,extractive}
        The summarization method to use during
        `structured_joined` summarization.
--structured_joined_abs_summarizer {presumm,sshleifer/distilbart-cnn-12-6,patrickvonplaten/bert2bert_cnn_daily_mail,facebook/bart-large-cnn,allenai/led-large-16384-arxiv,patrickvonplaten/led-large-16384-pubmed,google/pegasus-billsum,google/pegasus-cnn_dailymail,google/pegasus-pubmed,google/pegasus-arxiv,google/pegasus-wikihow,google/pegasus-big_patent}
        The abstractive summarizer to use during
        `structured_joined` summarization (to create summaries
        of each slide) if
        `structured_joined_summarization_method` is
        'abstractive'.
--structured_joined_ext_summarizer {lsa,luhn,lex_rank,text_rank,edmundson,random}
        The extractive summarizer to use during
        `structured_joined` summarization (to create summaries
        of each slide) if
        `--structured_joined_summarization_method` is
        'extractive'.
-tm {sphinx,google,youtube,deepspeech,vosk,wav2vec}, --transcription_method {sphinx,google,youtube,deepspeech,vosk,wav2vec}
        specify the program that should be used for
        transcription. CMU Sphinx: use pocketsphinx Google
        Speech Recognition: probably will require chunking
        (online, free, max 1 minute chunks) YouTube: download
        a video transcript from YouTube based on `--video_id`
        DeepSpeech: Use the deepspeech library (fast with good
        GPU) Vosk: Use the vosk library (extremely small low-
        resource model with great accuracy, this is the
        default) Wav2Vec: State-of-the-art speech-to-text
        model through the `huggingface/transformers` library.
--transcribe_segment_sentences
        Disable DeepSegment automatic sentence boundary
        detection. Specifying this option will output
        transcripts without punctuation.
--custom_transcript_check CUSTOM_TRANSCRIPT_CHECK
        Check if a transcript file (followed by an extension of
        vtt, srt, or sbv) with the specified name is in the
        processing folder and use it instead of running
        speech-to-text.
-sc {ocr,transcript} [{ocr,transcript} ...], --spell_check {ocr,transcript} [{ocr,transcript} ...]
        option to perform spell checking on the ocr results of
        the slides or the voice transcript or both
--video_id ID         id of youtube video to get subtitles from. set
        `--transcription_method` to `youtube` for this
        argument to take effect.
--transcribe_model_dir DIR
        path containing the model files for Vosk/DeepSpeech if
        `--transcription_method` is set to one of those
        models. See the documentation for details.
--abs_hf_api          use the huggingface inference API for abstractive
        summarization tasks
--abs_hf_api_overall  use the huggingface inference API for final overall
        abstractive summarization task
--tensorboard PATH    Path to tensorboard logdir. Tensorboard not used if
        not set. Tensorboard only used to visualize cluster
        primarily for debugging.
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
        Set the logging level (default: 'Info').