Main Summarizer Class

class lecture2notes.end_to_end.summarizer_class.LectureSummarizer(params, **kwargs)[source]



lecture2notes.end_to_end.transcribe.transcribe_main.caption_file_to_string(transcript_path, remove_speakers=False)[source]

Converts a .srt, .vtt, or .sbv file saved at transcript_path to a Python string. Optionally removes speaker entries by removing everything before “: ” in each subtitle cell.
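
The speaker-removal rule can be illustrated with a minimal sketch. The helper name strip_speaker is hypothetical and is not part of lecture2notes; it only assumes that a speaker label is whatever precedes the first “: ” in a subtitle cell.

```python
def strip_speaker(cell: str) -> str:
    """Drop everything up to and including the first ': ' in a subtitle cell."""
    _, sep, rest = cell.partition(": ")
    # If no ': ' is present, partition returns an empty separator; keep the cell.
    return rest if sep else cell


cells = ["PROFESSOR: Welcome to the lecture.", "Today we cover sorting."]
transcript = " ".join(strip_speaker(c) for c in cells)
print(transcript)
```

Cells without a speaker label pass through unchanged, so the same helper can be applied to every cell.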

lecture2notes.end_to_end.transcribe.transcribe_main.check_transcript(generated_transcript, ground_truth_transcript)[source]

Compares generated_transcript to ground_truth_transcript to check for accuracy using the spaCy similarity measurement. Requires the “en_vectors_web_lg” model to use “real” word vectors.

lecture2notes.end_to_end.transcribe.transcribe_main.chunk_by_silence(audio_path, output_path, silence_thresh_offset=5, min_silence_len=2000)[source]

Split an audio file into chunks on areas of silence

  • audio_path (str) – path to a wave file

  • output_path (str) – path to a folder where wave file chunks will be saved

  • silence_thresh_offset (int, optional) – a value subtracted from the mean dB volume of the file. Default is 5.

  • min_silence_len (int, optional) – the length in milliseconds in which there must be no sound in order to be marked as a splitting point. Default is 2000.
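
How the two optional parameters interact can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name find_silences and the one-loudness-value-per-millisecond input are assumptions made for the example, and the real function may measure loudness differently.

```python
from statistics import mean


def find_silences(levels_db, silence_thresh_offset=5, min_silence_len=2000):
    """Return (start_ms, end_ms) spans that stay below the silence threshold.

    levels_db holds one loudness value (in dB) per millisecond of audio.
    """
    # The threshold is the mean volume minus the offset.
    threshold = mean(levels_db) - silence_thresh_offset
    spans, start = [], None
    for ms, level in enumerate(levels_db):
        if level < threshold:
            if start is None:
                start = ms  # a quiet run begins here
        else:
            if start is not None and ms - start >= min_silence_len:
                spans.append((start, ms))
            start = None
    # Close a quiet run that lasts until the end of the audio.
    if start is not None and len(levels_db) - start >= min_silence_len:
        spans.append((start, len(levels_db)))
    return spans


# 3 ms of speech, 4 ms of near-silence, 3 ms of speech (tiny numbers for clarity).
levels = [-20.0] * 3 + [-60.0] * 4 + [-20.0] * 3
print(find_silences(levels, silence_thresh_offset=5, min_silence_len=3))
# → [(3, 7)]
```

A larger silence_thresh_offset lowers the threshold, so fewer regions count as silent; a larger min_silence_len discards short pauses.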

lecture2notes.end_to_end.transcribe.transcribe_main.chunk_by_speech(audio_path, output_path=None, aggressiveness=1, desired_sample_rate=None)[source]

Uses the python interface to the WebRTC Voice Activity Detector (VAD) API to create chunks of audio that contain voice. The VAD that Google developed for the WebRTC project is reportedly one of the best available, being fast, modern and free.

  • audio_path (str) – path to the audio file to process

  • output_path (str, optional) – path to save the chunk files. If not specified, no wave files are written to disk and the raw PCM data is returned. Defaults to None.

  • aggressiveness (int, optional) – determines how aggressively non-speech is filtered out. Must be an integer between 0 and 3. Defaults to 1.

  • desired_sample_rate (int, optional) – the sample rate of the returned segments. Defaults to None, which keeps the sample rate of the input audio file.


(segments, sample_rate, audio_length). See vad_segment_generator().

Return type



Convert a deepspeech json transcript from a letter-by-letter format to word-by-word.


transcript_json (dict or str) – The json format transcript as a dictionary or a json string, which will be loaded using json.loads().


The word-by-word transcript json.

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.convert_samplerate(audio_path, desired_sample_rate)[source]

Use SoX to resample wave files to 16 bits, 1 channel, and desired_sample_rate sample rate.

  • audio_path (str) – path to wave file to process

  • desired_sample_rate (int) – sample rate in hertz to convert the wave file to


(desired_sample_rate, output) where desired_sample_rate is the new

sample rate and output is the newly resampled pcm data

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.extract_audio(video_path, output_path)[source]

Extracts audio from video at video_path and saves it to output_path

lecture2notes.end_to_end.transcribe.transcribe_main.get_youtube_transcript(video_id, output_path, use_youtube_dl=True)[source]

Downloads the transcript for video_id and saves it to output_path

lecture2notes.end_to_end.transcribe.transcribe_main.load_deepspeech_model(model_dir, beam_width=500, lm_alpha=None, lm_beta=None)[source]

Load the deepspeech model from model_dir

  • model_dir (str) – path to folder containing the “.pbmm” and optionally “.scorer” files

  • beam_width (int, optional) – beam width for decoding. Default is 500.

  • lm_alpha (float, optional) – alpha parameter of language model. Default is None.

  • lm_beta (float, optional) – beta parameter of language model. Default is None.


the loaded deepspeech model

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.load_model(method, *args, **kwargs)[source]
lecture2notes.end_to_end.transcribe.transcribe_main.load_wav2vec_model(model='facebook/wav2vec2-base-960h', tokenizer='facebook/wav2vec2-base-960h', **kwargs)[source]

Helper function to convert metadata tokens from deepspeech to a dictionary.


Helper function to convert metadata tokens from deepspeech to a string.

lecture2notes.end_to_end.transcribe.transcribe_main.process_chunks(chunk_dir, method='sphinx', model_dir=None)[source]

Performs transcription on every noise activity chunk (audio file) created by chunk_by_silence() in a directory.

lecture2notes.end_to_end.transcribe.transcribe_main.process_segments(segments, model, audio_length='unknown', method='deepspeech', do_segment_sentences=True)[source]

Transcribe a list of byte strings containing pcm data

  • segments (list) – list of byte strings containing pcm data (generated by chunk_by_speech())

  • model (deepspeech model) – a deepspeech model object or a path to a folder containing the model files (see load_deepspeech_model()).

  • audio_length (str, optional) – the length of the audio file if known (used for logging statements) Default is “unknown”.

  • method (str, optional) – The model to use to perform speech-to-text. Supports ‘deepspeech’ and ‘vosk’. Defaults to “deepspeech”.

  • do_segment_sentences (bool, optional) – Find sentence boundaries using segment_sentences(). Defaults to True.


(full_transcript, full_transcript_json) The combined transcript of all the items in segments as a string and as dictionary/json.

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.read_wave(path, desired_sample_rate=None, force=False)[source]

Reads a “.wav” file and converts to desired_sample_rate with one channel.

  • path (str) – path to wave file to load

  • desired_sample_rate (int, optional) – resample the loaded pcm data from the wave file to this sample rate. Default is None, no resampling.

  • force (bool, optional) – Force the audio to be converted even if it is detected to meet the necessary criteria.


(PCM audio data, sample rate, duration)

Return type



Resolve directory path for deepspeech models and fetch each of them.


dir_name (str) – Path to the directory containing pre-trained models


a tuple containing each of the model files (pb, scorer)

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.segment_sentences(text, text_json=None, do_capitalization=True)[source]

Detect sentence boundaries without punctuation or capitalization.

  • text (str) – The string to segment by sentence.

  • text_json (str or dict, optional) – If the detected sentence boundaries should be applied to the JSON format of a transcript. Defaults to None.

  • do_capitalization (bool, optional) – If the first letter of each detected sentence should be capitalized. Defaults to True.


The punctuated (and optionally capitalized) string

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio(audio_path, method='sphinx', **kwargs)[source]

Transcribe audio using DeepSpeech, Vosk, or a method offered by transcribe_audio_generic().

  • audio_path (str) – Path to the audio file to transcribe.

  • method (str, optional) – The method to use for transcription. Defaults to “sphinx”.

  • **kwargs – Passed to the transcription function.


(transcript_text, transcript_json)

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_deepspeech(audio_path_or_data, model, raw_audio_data=False, json_num_transcripts=None, **kwargs)[source]

Transcribe an audio file or pcm data with the deepspeech model

  • audio_path_or_data (str or byte string) – a path to a wave file or a byte string containing pcm data from a wave file. set raw_audio_data to True if pcm data is used.

  • model (deepspeech model or str) – a deepspeech model object or a path to a folder containing the model files (see load_deepspeech_model())

  • raw_audio_data (bool, optional) – must be True if audio_path_or_data is raw pcm data. Defaults to False.

  • json_num_transcripts (str, optional) – Specify this value to generate multiple transcripts in json format.


(transcript_text, transcript_json) the transcribed audio file in string format and the transcript in json

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_generic(audio_path, method='sphinx', **kwargs)[source]

Transcribe an audio file using CMU Sphinx or Google through the speech_recognition library

  • audio_path (str) – audio file path

  • method (str, optional) – which service to use for transcription (“google” or “sphinx”). Default is “sphinx”.


the transcript of the audio file

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_vosk(audio_path_or_chunks, model='../vosk_models', chunks=False, desired_sample_rate=16000, chunk_size=2000, **kwargs)[source]

Transcribe audio using a vosk model.

  • audio_path_or_chunks (str or generator) – Path to an audio file or a generator of chunks created by chunk_by_speech()

  • model (str or vosk.Model, optional) – Path to the directory containing the vosk models or loaded vosk.Model. Defaults to “../vosk_models”.

  • chunks (bool, optional) – If the audio_path_or_chunks is chunks. Defaults to False.

  • desired_sample_rate (int, optional) – The sample rate that the model requires to convert audio to. Defaults to 16000.

  • chunk_size (int, optional) – The number of wave frames per loop. Amount of audio data transcribed at a time. Defaults to 2000.


(text_transcript, results_json) The transcript as a string and as JSON.

Return type


lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_wav2vec(audio_path_or_chunks, model=None, chunks=False, desired_sample_rate=16000)[source]
lecture2notes.end_to_end.transcribe.transcribe_main.write_to_file(transcript, transcript_save_file, transcript_json=None, transcript_json_save_path=None)[source]

Write transcript to transcript_save_file and transcript_json to transcript_json_save_path.

lecture2notes.end_to_end.transcribe.transcribe_main.write_wave(path, audio, sample_rate)[source]

Writes a .wav file.

Takes path, PCM audio data, and sample rate.
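
A minimal sketch of such a writer using only the standard library wave module is shown below. The name write_wave_sketch is hypothetical, and the 16-bit mono format is an assumption chosen to match the format described for convert_samplerate(); the real function may differ.

```python
import contextlib
import os
import tempfile
import wave


def write_wave_sketch(path, audio, sample_rate):
    """Write 16-bit mono PCM bytes to a .wav file (assumed format)."""
    with contextlib.closing(wave.open(path, "wb")) as wf:
        wf.setnchannels(1)  # mono
        wf.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wf.setframerate(sample_rate)
        wf.writeframes(audio)


# One second of silence at 16 kHz: 16000 samples * 2 bytes each.
pcm = b"\x00\x00" * 16000
out_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
write_wave_sketch(out_path, pcm, 16000)

# Read the file back to confirm its duration in seconds.
with contextlib.closing(wave.open(out_path, "rb")) as wf:
    duration = wf.getnframes() / wf.getframerate()
print(duration)  # → 1.0
```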


class lecture2notes.end_to_end.transcribe.webrtcvad_utils.Frame(bytes, timestamp, duration)[source]

Represents a “frame” of audio data.

lecture2notes.end_to_end.transcribe.webrtcvad_utils.frame_generator(frame_duration_ms, audio, sample_rate)[source]

Generates audio frames from PCM audio data.

Takes the desired frame duration in milliseconds, the PCM data, and the sample rate.

Yields Frames of the requested duration.
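
The slicing can be sketched as follows, assuming 16-bit (2 bytes per sample) mono PCM data. FrameSketch and frame_generator_sketch are hypothetical stand-ins for the documented Frame class and function, written only to show the arithmetic.

```python
from collections import namedtuple

# Hypothetical stand-in for the Frame class documented above.
FrameSketch = namedtuple("FrameSketch", ["bytes", "timestamp", "duration"])


def frame_generator_sketch(frame_duration_ms, audio, sample_rate):
    """Yield fixed-length frames from 16-bit mono PCM bytes."""
    # samples per frame * 2 bytes per sample
    bytes_per_frame = sample_rate * frame_duration_ms // 1000 * 2
    duration = frame_duration_ms / 1000.0
    offset, timestamp = 0, 0.0
    # Stop before a trailing partial frame.
    while offset + bytes_per_frame <= len(audio):
        yield FrameSketch(audio[offset:offset + bytes_per_frame], timestamp, duration)
        timestamp += duration
        offset += bytes_per_frame


audio = b"\x00\x00" * 1600  # 0.1 s of 16-bit mono audio at 16 kHz
frames = list(frame_generator_sketch(30, audio, 16000))
print(len(frames), len(frames[0].bytes))  # → 3 960
```

The trailing 10 ms of audio does not fill a whole 30 ms frame, so this sketch drops it.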

lecture2notes.end_to_end.transcribe.webrtcvad_utils.vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames)[source]

Filters out non-voiced audio frames.

Given a webrtcvad.Vad and a source of audio frames, yields only the voiced audio.

Uses a padded, sliding window algorithm over the audio frames. When more than 90% of the frames in the window are voiced (as reported by the VAD), the collector triggers and begins yielding audio frames. Then the collector waits until 90% of the frames in the window are unvoiced to detrigger.

The window is padded at the front and back to provide a small amount of silence or the beginnings/endings of speech around the voiced frames.

  • sample_rate – The audio sample rate, in Hz.

  • frame_duration_ms – The frame duration in milliseconds.

  • padding_duration_ms – The amount to pad the window, in milliseconds.

  • vad – An instance of webrtcvad.Vad.

  • frames – a source of audio frames (sequence or generator).


A generator that yields PCM audio data.

Return type


lecture2notes.end_to_end.transcribe.webrtcvad_utils.vad_segment_generator(wavFile, aggressiveness, desired_sample_rate=None)[source]

Generate VAD segments. Filters out non-voiced audio frames.


wavFile (str) – Path to input wav file to run VAD on.


segments: a bytearray of multiple smaller audio frames (the longer audio split into multiple smaller ones)

sample_rate: Sample rate of the input audio file

audio_length: Duration of the input audio file

Return type



class lecture2notes.end_to_end.transcribe.mic_vad_streaming.Audio(callback=None, device=None, input_rate=16000, file=None)[source]

Streams raw audio from microphone. Data is received in a separate thread, and stored in a buffer, to be read from.

property frame_duration_ms

Return a block of audio data, blocking if necessary.


Return a block of audio data resampled to 16000 Hz, blocking if necessary.

resample(data, input_rate)[source]

Microphone may not support our native processing sampling rate, so resample from input_rate to RATE_PROCESS here for webrtcvad and deepspeech

  • data (binary) – Input audio stream

  • input_rate (int) – Input audio rate to resample from

write_wav(filename, data)[source]
class lecture2notes.end_to_end.transcribe.mic_vad_streaming.VADAudio(aggressiveness=3, device=None, input_rate=None, file=None)[source]

Filter & segment audio with voice activity detection.


Generator that yields all audio frames from microphone.

vad_collector(padding_ms=300, ratio=0.75, frames=None)[source]

Generator that yields series of consecutive audio frames comprising each utterance, separated by yielding a single None. Determines voice activity by ratio of frames in padding_ms. Uses a buffer to include padding_ms prior to being triggered.

Example: (frame, ..., frame, None, frame, ..., frame, None, ...)
        |---utterance---|        |---utterance---|
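
The ratio-based triggering can be sketched without a microphone or a real detector. In this illustration the function name vad_collector_sketch, the num_padding_frames parameter, and the pre-labeled (frame, is_speech) input are all assumptions; in real use the speech decision would come from a voice activity detector.

```python
from collections import deque


def vad_collector_sketch(labeled_frames, num_padding_frames=4, ratio=0.75):
    """Yield voiced frames, then None at the end of each utterance.

    labeled_frames is an iterable of (frame, is_speech) pairs.
    """
    ring = deque(maxlen=num_padding_frames)
    triggered = False
    for frame, is_speech in labeled_frames:
        if not triggered:
            ring.append((frame, is_speech))
            num_voiced = sum(1 for _, speech in ring if speech)
            if num_voiced > ratio * ring.maxlen:
                triggered = True
                # Emit the buffered padding so the utterance start is kept.
                for buffered, _ in ring:
                    yield buffered
                ring.clear()
        else:
            yield frame
            ring.append((frame, is_speech))
            num_unvoiced = sum(1 for _, speech in ring if not speech)
            if num_unvoiced > ratio * ring.maxlen:
                triggered = False
                yield None  # marks the end of the utterance
                ring.clear()


speech = [("s%d" % i, True) for i in range(4)]
silence = [("q%d" % i, False) for i in range(4)]
result = list(vad_collector_sketch(speech + silence))
print(result)
# → ['s0', 's1', 's2', 's3', 'q0', 'q1', 'q2', 'q3', None]
```

A consumer can accumulate frames until it sees None, then hand the collected utterance to the speech-to-text model.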


class lecture2notes.end_to_end.cluster.ClusterFilesystem(slides_dir, algorithm_name='kmeans', num_centroids=20, preference=None, damping=0.5, max_iter=200, model_path='model_best.ckpt')[source]

Clusters images from a directory and saves them to disk in folders corresponding to each centroid.


Extracts features from the images in slides_dir and saves feature vectors with super().add()

transfer_to_filesystem(copy=True, create_best_samples_folder=True)[source]

Uses move_list from super() to take all images in directory slides_dir and save each cluster to a subfolder in cluster_dir (directory in parent of slides_dir)

Corner Crop Transform

lecture2notes.end_to_end.corner_crop_transform.all_in_folder(path, remove_original=False, **kwargs)[source]

Perform perspective cropping on every file in folder and return new paths. **kwargs is passed to crop().

lecture2notes.end_to_end.corner_crop_transform.cluster_points(points, nclusters)[source]

Perform KMeans clustering (using cv2.kmeans) on points, creating nclusters clusters. Returns the centroids of the clusters.

lecture2notes.end_to_end.corner_crop_transform.contour_offset(cnt, offset)[source]

Offset contour because of 5px border

lecture2notes.end_to_end.corner_crop_transform.crop(img_path, output_path=None, mode='automatic', debug_output_imgs=False, save_debug_imgs=False, create_debug_gif=False, debug_gif_optimize=True, debug_path='debug_imgs')[source]

Main method to perspective crop an image to the slide.

  • img_path (str) – path to the image to load

  • output_path (str, optional) – path to save the image. Defaults to [filename]_cropped.[ext].

  • mode (str, optional) –

    There are three modes available. Defaults to “automatic”.

    • contours: uses find_page_contours() to extract contours from an edge map of the image. is ineffective if there are any gaps or obstructions in the outline around the slide.

    • hough_lines: uses hough_lines_corners() to get corners by looking for horizontal and vertical lines, finding the intersection points, and clustering the intersection points.

    • automatic: tries to use contours and falls back to hough_lines if contours reports a failure.

  • debug_output_imgs (bool or dict, optional) – if dictionary, modifies the dictionary by adding (image file name, image data) pairs. if boolean and True, creates a dictionary in the same way as if a dictionary was passed. Defaults to False.

  • save_debug_imgs (bool, optional) – uses write_debug_imgs() to save the debug_output_imgs to disk. Requires debug_output_imgs to not be False. Defaults to False.

  • create_debug_gif (bool, optional) – create a gif of the debug images. Requires debug_output_imgs to not be False. Defaults to False.

  • debug_gif_optimize (bool, optional) – optimize the gif produced by enabling the create_debug_gif option using pygifsicle. Defaults to True.

  • debug_path (str, optional) – location to save the debug images and debug gif. Defaults to “debug_imgs”.


path to cropped image and failed (True if no slide bounding box found, false otherwise)

Return type


lecture2notes.end_to_end.corner_crop_transform.edges_det(img, min_val, max_val, debug_output_imgs=None)[source]

Preprocessing (gray, thresh, filter, border) & Canny edge detection

  • img (image) – the image loaded using cv2.imread.

  • min_val (int) – minimum value for cv2.Canny.

  • max_val (int) – maximum value for cv2.Canny.

  • debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.


(dilated, total_border), dilated edges and total border width added

Return type


lecture2notes.end_to_end.corner_crop_transform.find_intersection(line1, line2)[source]

Find the intersection between line1 and line2.
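
One common way to compute such an intersection is the determinant form of the line-line intersection formula, sketched below. The function name is hypothetical, and the (x1, y1, x2, y2) line format is assumed from the format documented for segment_lines(); the library's own implementation may differ.

```python
def find_intersection_sketch(line1, line2):
    """Return the (x, y) intersection of two infinite lines, or None if parallel."""
    x1, y1, x2, y2 = line1
    x3, y3, x4, y4 = line2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None  # parallel or coincident lines have no single meeting point
    det1 = x1 * y2 - y1 * x2
    det2 = x3 * y4 - y3 * x4
    px = (det1 * (x3 - x4) - (x1 - x2) * det2) / denom
    py = (det1 * (y3 - y4) - (y1 - y2) * det2) / denom
    return px, py


horizontal = (0, 5, 10, 5)  # the line y = 5
vertical = (3, 0, 3, 10)    # the line x = 3
print(find_intersection_sketch(horizontal, vertical))  # → (3.0, 5.0)
```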

lecture2notes.end_to_end.corner_crop_transform.find_page_contours(edges, img, border_size=11, min_area_mult=0.3, debug_output_imgs=None)[source]

Find corner points of page contour

  • edges (image) – edges extracted from img by edges_det().

  • img (image) – the image loaded by cv2.imread.

  • border_size (int, optional) – the size of the borders added by edges_det(). Defaults to 11.

  • min_area_mult (float, optional) – the minimum fraction of the image area that a contour’s area must exceed to be considered as the slide. Defaults to 0.3.


contour is the set of coordinates of the corners sorted by four_corners_sort() or returns None when no contour meets the criteria.

Return type

[contour or NoneType]


Sort corners: top-left, bot-left, bot-right, top-right

lecture2notes.end_to_end.corner_crop_transform.horizontal_vertical_edges_det(img, thresh_blurred, debug_output_imgs=None)[source]

Detects horizontal and vertical edges and merges them together.

  • img (image) – the image as provided by cv2.imread

  • thresh_blurred (image) – the image processed by thresholding. see edges_det().

  • debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.


result image with a black background and white edges

Return type


lecture2notes.end_to_end.corner_crop_transform.hough_lines_corners(img, edges_img, min_line_length, border_size=11, debug_output_imgs=None)[source]

Uses cv2.HoughLinesP to find horizontal and vertical lines, finds the intersection points, and finally clusters those points using KMeans.

  • img (image) – the image as loaded by cv2.imread.

  • edges_img (image) – edges extracted from img by edges_det().

  • min_line_length (int) – the shortest line length to consider as a valid line

  • border_size (int, optional) – the size of the borders added by edges_det(). Defaults to 11.

  • debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.


The corner coordinates as sorted by four_corners_sort().

Return type


lecture2notes.end_to_end.corner_crop_transform.persp_transform(img, s_points)[source]

Transform perspective of img from start points to target points.

lecture2notes.end_to_end.corner_crop_transform.remove_contours(edges, contour_removal_threshold)[source]

Remove contours from an edge map by deleting contours shorter than contour_removal_threshold.

lecture2notes.end_to_end.corner_crop_transform.resize(img, height=800, allways=False)[source]

Resize image to given height.

lecture2notes.end_to_end.corner_crop_transform.segment_lines(lines, delta)[source]

Groups lines from cv2.HoughLinesP into vertical and horizontal bins.

  • lines (list) – the data returned from cv2.HoughLinesP

  • delta (int) – how far away the x and y coordinates can differ before they’re marked as different lines


(h_lines, v_lines) the horizontal and vertical lines, respectively. each line in each list is formatted as (x1, y1, x2, y2).

Return type


lecture2notes.end_to_end.corner_crop_transform.straight_lines_in_contour(contour, delta=100)[source]

Returns True if contour contains lines that are horizontal or vertical. delta allows the lines to tilt by a certain number of pixels. For instance, if a line is vertical, its y values can change by delta pixels before it is considered not vertical.

lecture2notes.end_to_end.corner_crop_transform.write_debug_imgs(debug_output_imgs, base_path='debug_imgs')[source]

Saves images from debug_output_imgs to disk in base_path.

  • debug_output_imgs (dict) – dictionary in format {image file name: image data}

  • base_path (str, optional) – the directory to store the debug images. Defaults to “debug_imgs”.

Text Detection

lecture2notes.end_to_end.text_detection.get_text_bounding_boxes(image, net, min_confidence=0.5, resized_width=320, resized_height=320)[source]

Determine the locations of text in an image.

  • image (np.array) – The image to be processed.

  • net (cv2.dnn_Net) – The EAST model loaded with load_east().

  • min_confidence (float, optional) – Minimum probability required to inspect a region. Defaults to 0.5.

  • resized_width (int, optional) – Resized image width (should be multiple of 32). Defaults to 320.

  • resized_height (int, optional) – Resized image height (should be multiple of 32). Defaults to 320.


The coordinates of bounding boxes containing text.

Return type



Load the pre-trained EAST model.


east_path (str, optional) – Path to the EAST model file. Defaults to “frozen_east_text_detection.pb”.

Figure Detection

lecture2notes.end_to_end.figure_detection.add_figures_to_ssa(ssa, figures_path)[source]
lecture2notes.end_to_end.figure_detection.all_in_folder(path, remove_original=False, east='frozen_east_text_detection.pb', do_text_check=True, **kwargs)[source]

Perform figure detection on every file in folder and return new paths. **kwargs is passed to detect_figures().

lecture2notes.end_to_end.figure_detection.area_of_overlapping_rectangles(a, b)[source]

Find the overlapping area of two rectangles a and b. Inspired by https://stackoverflow.com/a/27162334.
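
The computation can be sketched as below. The Rectangle field layout (xmin, ymin, xmax, ymax) and returning 0 for non-overlapping rectangles are assumptions made for this example; the library may represent rectangles differently.

```python
from collections import namedtuple

# Assumed rectangle layout for this sketch.
Rectangle = namedtuple("Rectangle", ["xmin", "ymin", "xmax", "ymax"])


def overlap_area_sketch(a, b):
    """Return the area shared by rectangles a and b (0 if they do not overlap)."""
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if dx > 0 and dy > 0:
        return dx * dy
    return 0


figure = Rectangle(0, 0, 4, 4)
text_box = Rectangle(2, 2, 6, 6)
print(overlap_area_sketch(figure, text_box))  # → 4
```

Dividing this area by the figure's own area gives the kind of overlap fraction that a threshold such as text_area_overlap_threshold can be compared against.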

lecture2notes.end_to_end.figure_detection.detect_color_image(image, thumb_size=40, MSE_cutoff=22, adjust_color_bias=True)[source]

Detect if an image contains color, is black and white, or is grayscale. Based on this StackOverflow answer.

  • image (np.array) – Input image

  • thumb_size (int, optional) – Resize image to this size to speed up calculation. Defaults to 40.

  • MSE_cutoff (int, optional) – A larger value requires more color for an image to be labeled as “color”. Defaults to 22.

  • adjust_color_bias (bool, optional) – Mean color bias adjustment, which improves the prediction. Defaults to True.


Either “grayscale”, “color”, “b&w” (black and white), or “unknown”.

Return type


lecture2notes.end_to_end.figure_detection.detect_figures(image_path, output_path=None, east='frozen_east_text_detection.pb', text_area_overlap_threshold=0.32, figure_max_area_percentage=0.6, text_max_area_percentage=0.3, large_box_detection=True, do_color_check=True, do_text_check=True, entropy_check=2.5, do_remove_subfigures=True, do_rlsa=False)[source]

Detect figures located in a slide.

  • image_path (str) – Path to the image to process.

  • output_path (str, optional) – Path to save the figures. Defaults to [filename]_figure_[index].[ext].

  • east (str or cv2.dnn_Net, optional) – Path to the EAST model file or the pre-trained EAST model loaded with load_east(). do_text_check must be true for this option to take effect. Defaults to “frozen_east_text_detection.pb”.

  • text_area_overlap_threshold (float, optional) – The percentage of the figure that can contain text. If the area of the text in the figure is greater than this value, the figure is discarded. do_text_check must be true for this option to take effect. Defaults to 0.32.

  • figure_max_area_percentage (float, optional) – The maximum percentage of the area of the original image that a figure can take up. If the figure uses more area than original_image_area*figure_max_area_percentage then the figure will be discarded. Defaults to 0.6.

  • text_max_area_percentage (float, optional) – The maximum percentage of the area of the original image that a block of text (as identified by the EAST model) can take up. If the text block uses more area than original_image_area*text_max_area_percentage then that text block will be ignored. do_text_check must be true for this option to take effect. Defaults to 0.30.

  • large_box_detection (bool, optional) – Detect edges and classify large rectangles as figures. This will ignore do_color_check and do_text_check. This is useful for finding tables for example. Defaults to True.

  • do_color_check (bool, optional) – Check that potential figures contain color. This helps to remove large quantities of black and white text from the potential figure list. Defaults to True.

  • do_text_check (bool, optional) – Check that only text_area_overlap_threshold of potential figures contains text. This is useful to remove blocks of text that are mistakenly classified as figures. Checking for text increases processing time so be careful if processing a large number of files. Defaults to True.

  • entropy_check (float, optional) – Check that the entropy of all potential figures is above this value. Figures with a shannon_entropy lower than this value will be removed. Set to False to disable this check. The shannon_entropy implementation is from skimage.measure.entropy. IMPORTANT: This check applies to both the regular tests and large_box_detection, which most checks do not apply to. Defaults to 2.5.

  • do_remove_subfigures (bool, optional) – Check that there are no overlapping figures. If an overlapping figure is detected, the smaller figure will be deleted. This is useful to have enabled when using large_box_detection since large_box_detection will commonly mistakenly detect subfigures. Defaults to True.

  • do_rlsa (bool, optional) – Use RLSA (Run Length Smoothing Algorithm) instead of dilation. Does not apply to large_box_detection. Defaults to False.


(figures, output_paths) A list of figures extracted from the input slide image and a list of paths to those figures on disk.

Return type


Frames Extractor

lecture2notes.end_to_end.frames_extractor.extract_frames(input_video_path, quality, output_path, extract_every_x_seconds)[source]

Extracts frames from input_video_path at quality level quality (best quality is 2) every extract_every_x_seconds seconds and saves them to output_path


lecture2notes.end_to_end.helpers.copy_all(list_path_files, output_dir, move=False)[source]

Copy (or move) files to output_dir. If list_path_files is a list, every path in it is transferred; if it is a path to a directory, every file in that directory is transferred.
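
The list-or-directory behavior can be sketched with the standard library. The function name copy_all_sketch is hypothetical, and treating a str argument as a directory path is an assumption about how the two input forms are distinguished.

```python
import os
import shutil
import tempfile


def copy_all_sketch(list_path_files, output_dir, move=False):
    """Copy (or move) a list of files, or every file in a directory."""
    if isinstance(list_path_files, str):
        # A directory path: expand it to the files it contains.
        directory = list_path_files
        list_path_files = [
            os.path.join(directory, name)
            for name in sorted(os.listdir(directory))
            if os.path.isfile(os.path.join(directory, name))
        ]
    os.makedirs(output_dir, exist_ok=True)
    transfer = shutil.move if move else shutil.copy
    for file_path in list_path_files:
        transfer(file_path, output_dir)


src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), "out")
for name in ("a.txt", "b.txt"):
    with open(os.path.join(src, name), "w") as handle:
        handle.write(name)

copy_all_sketch(src, dst)  # directory form; pass a list of paths for the list form
print(sorted(os.listdir(dst)))  # → ['a.txt', 'b.txt']
```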

lecture2notes.end_to_end.helpers.frame_number_filename_mapping(path, filenames_only=True)[source]
lecture2notes.end_to_end.helpers.gen_unique_id(input_data, k)[source]

Returns the first k characters of the sha1 of input_data
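
A minimal sketch of this behavior with hashlib follows. The name gen_unique_id_sketch is hypothetical, and encoding the input as UTF-8 and using the hexadecimal digest are assumptions; the library may hash its input differently.

```python
import hashlib


def gen_unique_id_sketch(input_data: str, k: int) -> str:
    """Return the first k hex characters of the SHA-1 digest of input_data."""
    digest = hashlib.sha1(input_data.encode("utf-8")).hexdigest()
    return digest[:k]


short_id = gen_unique_id_sketch("lecture.mp4", 12)
print(short_id)
```

The same input always yields the same identifier, which makes it usable as a stable folder or file name for a given video.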


Makes directory path if it does not exist

Image Hash


Returns a hash function from the imagehash library.

Hash Methods:
  • ahash: Average hash

  • phash: Perceptual hash

  • dhash: Difference hash

  • whash-haar: Haar wavelet hash

  • whash-db4: Daubechies wavelet hash
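
To give a feel for what these hashes compute, here is a pure-Python sketch of the idea behind an average hash. It is not the imagehash implementation: the function names are hypothetical, and it works on a tiny grid of grayscale values, whereas the library first shrinks the image to a small grayscale thumbnail.

```python
def average_hash_sketch(pixels):
    """Return a bit string: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    avg = sum(flat) / len(flat)
    return "".join("1" if value > avg else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


slide = [[10, 200], [220, 30]]
slide_brighter = [[20, 210], [230, 40]]  # same layout, slightly brighter
other = [[200, 10], [30, 220]]           # different layout

print(average_hash_sketch(slide))  # → 0110
print(hamming_distance(average_hash_sketch(slide), average_hash_sketch(slide_brighter)))  # → 0
print(hamming_distance(average_hash_sketch(slide), average_hash_sketch(other)))  # → 4
```

Images with the same hash are candidates for being duplicates, which is the grouping idea behind sort_by_duplicates().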

lecture2notes.end_to_end.imghash.remove_duplicates(img_dir, images)[source]

Remove duplicate frames/slides from disk.

  • img_dir (str) – path to directory containing image files

  • images (dict) – dictionary in format {image hash: image filenames} provided by sort_by_duplicates().

lecture2notes.end_to_end.imghash.sort_by_duplicates(img_dir, hash_func='phash')[source]

Find duplicate images in a directory.

  • img_dir (str) – path to folder containing images to scan for duplicates

  • hash_func (str, optional) – the hash function to use as given by get_hash_func(). Defaults to “phash”.


dictionary in format {image hash: image filenames}

Return type




Perform OCR using pytesseract on every file in folder and return results

lecture2notes.end_to_end.ocr.write_to_file(results, save_file)[source]

Write everything stored in results to file at path save_file. Used to write results from all_in_folder() to save_file.

Segment Cluster

class lecture2notes.end_to_end.segment_cluster.SegmentCluster(slides_dir, model_path='model_best.ckpt')[source]

Iterates through frames in order and splits based on large visual differences (measured by the cosine difference between the feature vectors from the slide classifier)
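
The splitting idea can be sketched on toy feature vectors. The function names and the threshold value of 0.5 are assumptions for this example; the class obtains its feature vectors from the slide classifier and may choose split points differently.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def segment_by_difference(features, threshold=0.5):
    """Group consecutive frame indices; start a new group on a large jump."""
    if not features:
        return []
    segments = [[0]]
    for i in range(1, len(features)):
        if cosine_distance(features[i - 1], features[i]) > threshold:
            segments.append([i])  # large visual change: new segment
        else:
            segments[-1].append(i)
    return segments


# Two frames of one slide followed by two frames of a different slide.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_by_difference(features))  # → [[0, 1], [2, 3]]
```

Because frames are compared only with their immediate predecessor, the temporal order of the frames is preserved in the resulting segments.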


Extracts features from the images in slides_dir and saves feature vectors

transfer_to_filesystem(copy=True, create_best_samples_folder=True)[source]

Takes all images in directory slides_dir and saves each cluster to a subfolder in cluster_dir (directory in parent of slides_dir)

SIFT Matcher

lecture2notes.end_to_end.sift_matcher.does_camera_move(old_frame, frame, gamma=10, border_ratios=(10, 19), bottom=False)[source]

Detects camera movement between two frames by tracking features in the borders of the image. Only the borders are used because the center of the image probably contains a slide. Thus, tracking features of the slide is not robust since those features will disappear when the slide changes.

  • old_frame (np.array) – First frame/image as loaded with cv2.imread()

  • frame (np.array) – Second frame/image as loaded with cv2.imread()

  • gamma (int, optional) – The threshold pixel movement value. If the camera moves more than this value, then there is assumed to be camera movement between the two frames. Defaults to 10.

  • border_ratios (tuple, optional) – The ratios of the height and width respectively of the first frame to be searched for features. Only the borders are searched for features. These values specify how much of the image should be counted as a border. Defaults to (10, 19).

  • bottom (bool, optional) – Whether to find features in the bottom border. This is not recommended because ‘presenter_slide’ images may have people’s heads at the bottom, which will move and do not represent camera motion. Defaults to False.


(total_movement > gamma, total_movement) If there is camera movement between the two frames and the total movement between the frames.

Return type



Runs does_camera_move() on all the files in a folder and calculates statistics about camera movement within those files.


folder_path (str) – Directory containing the files to be processed.


(movement_detection_percentage, average_move_value, max_move_value) A float representing the percentage of frames where movement was detected from the previous frame. The average of the total_movement values returned from does_camera_move(). The maximum of the total_movement values returned from does_camera_move().

Return type


lecture2notes.end_to_end.sift_matcher.is_content_added(first, second, first_area_modifier=0.7, second_area_modifier=0.4, gamma=0.09, dilation_amount=22)[source]

Detect if second contains more content than first and how much more content it adds. This algorithm dilates both images and finds contours. It then computes the total area of those contours. If the area of the second image’s contours exceeds the area of the first image’s contours by more than the gamma fraction, it is assumed that more content was added.

  • first (np.array) – Image loaded using cv2.imread() belonging to the ‘slide’ class

  • second (np.array) – Image loaded using cv2.imread() belonging to the ‘presenter_slide’ class

  • first_area_modifier (float, optional) – The maximum percent area of the first image that a contour can take up before it is excluded. Defaults to 0.70.

  • second_area_modifier (float, optional) – The maximum percent area of the second image that a contour can take up before it is excluded. Images belonging to the ‘presenter_slide’ class are more likely to have mistaken large contours. Defaults to 0.40.

  • gamma (float, optional) – The percentage increase in content area necessary for second to be classified as having more content than first. Defaults to 0.09.

  • dilation_amount (int, optional) – How much the canny edge maps of both images, first and second, should be dilated. This helps to combine multiple components of one object into a single contour. Defaults to 22.


(content_is_added, amount_of_added_content) A boolean indicating whether second contains more content than first, and a float describing the difference in content from first to second. amount_of_added_content can be negative.

Return type
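
The decision rule can be illustrated without OpenCV once the contour areas are known. The sketch below is an assumption-laden illustration, not the library's code: compare_content_area is a hypothetical helper, the contour areas are assumed to be precomputed, and both the relative comparison against (1 + gamma) and the plain area difference used for amount_of_added_content are guesses based on the parameter descriptions.

```python
def compare_content_area(first_contour_areas, second_contour_areas,
                         first_image_area, second_image_area,
                         first_area_modifier=0.70, second_area_modifier=0.40,
                         gamma=0.09):
    """Hypothetical helper mirroring the documented parameters."""
    # Exclude contours that cover more than the allowed share of their image.
    first_total = sum(
        area for area in first_contour_areas
        if area <= first_area_modifier * first_image_area
    )
    second_total = sum(
        area for area in second_contour_areas
        if area <= second_area_modifier * second_image_area
    )
    # Assumption: the reported amount is the raw difference in contour area.
    amount_of_added_content = second_total - first_total
    # Assumption: second must exceed first by more than the fraction gamma.
    content_is_added = second_total > (1 + gamma) * first_total
    return content_is_added, amount_of_added_content


result = compare_content_area([100, 200, 900], [150, 250, 500], 1000, 1000)
print(result)
```

In the example, the 900 and 500 contours are dropped for exceeding 70% and 40% of their images, leaving totals of 300 and 400.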


lecture2notes.end_to_end.sift_matcher.match_features(slide_path, presenter_slide_path, min_match_count=33, min_area_percent=0.37, do_motion_detection=True)[source]

Match features between images in slide_path and presenter_slide_path. The images in slide_path are the queries to the matching algorithm and the images in presenter_slide_path are the train/searched images.

  • slide_path (str) – Path to the images classified as “slide” or any directory containing query images.

  • presenter_slide_path (str) – Path to the images classified as “presenter_slide” or any directory containing train images.

  • min_match_count (int, optional) – The minimum number of matches returned by sift_flann_match() required for the image pair to be considered as containing the same slide. Defaults to 33.

  • min_area_percent (float, optional) – Percentage of the area of the train image (images belonging to the ‘presenter_slide’ category) that a matched slide must take up to be counted as a legitimate duplicate slide. This removes incorrect matches that can result in crops to small portions of the train image. Defaults to 0.37.

  • do_motion_detection (bool, optional) – Whether motion detection using does_camera_move_all_in_folder() should be performed. If set to False then it is assumed that there is movement since assuming no movement leaves room for a lot of false positives. If no camera motion is detected and this option is enabled then all slides that are unique to the “presenter_slide” category (they have no matches in the “slide” category) will automatically be cropped to contain just the slide. They will be saved to the originating folder but with the string defined by the variable OUTPUT_PATH_MODIFIER in their filename. Even if does_camera_move_all_in_folder() detects no movement it is still possible that movement is detected while running this function since a check is performed to make sure all slide bounding boxes found contain 80% overlapping area with all previously found bounding boxes. Defaults to True.


(non_unique_presenter_slides, transformed_image_paths) non_unique_presenter_slides: The images in the “presenter_slide” category that are not unique and should be deleted. transformed_image_paths: The paths to the cropped images if do_motion_detection was enabled and no motion was detected.

Return type


lecture2notes.end_to_end.sift_matcher.ransac_transform(sift_matches, kp1, kp2, img1, img2, draw_matches=False)[source]

Use data from sift_flann_match() to find the coordinates of img1 in img2. sift_matches, kp1, kp2, img1, and img2 are all the outputs of sift_flann_match(). If draw_matches is enabled then the feature matches will be drawn and shown on the screen.


The corner coordinates of the quadrilateral representing img1 within img2.

Return type


lecture2notes.end_to_end.sift_matcher.sift_flann_match(query_image, train_image, algorithm='orb', num_features=1000)[source]

Locate query_image within train_image using algorithm for feature detection/description and FLANN (Fast Library for Approximate Nearest Neighbors) for matching. You can read more about matching in the OpenCV “Feature Matching” documentation or about homography on the OpenCV Python Tutorial “Feature Matching + Homography to find Objects”

  • query_image (np.array) – Image to find. Loaded using cv2.imread().

  • train_image (np.array) – Image to search. Loaded using cv2.imread().

  • algorithm (str, optional) – The feature detection/description algorithm. Can be one of ORB (ORB Class Reference), SIFT (SIFT Class Reference), or FAST (FAST Class Reference). Defaults to “orb”.

  • num_features (int, optional) – The maximum number of features to retain when using ORB and SIFT. Does not take effect when using the FAST detection algorithm. Setting to 0 for SIFT is a good starting point. The default for ORB is 500, but it was increased to 1000 to improve accuracy. Defaults to 1000.


(good, kp1, kp2, img1, img2) The good matches as per Lowe’s ratio test, the key points from image 1, the key points from image 2, modified image 1, and modified image 2.

Return type
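
Lowe’s ratio test, mentioned in the return description, keeps a match only when its best candidate is clearly closer than the second-best one. The following is a minimal sketch on plain distance pairs rather than OpenCV match objects; ratio_test is a hypothetical helper and the 0.7 ratio is an assumed, commonly used value, not one confirmed by this documentation.

```python
def ratio_test(knn_distances, ratio=0.7):
    """Return the indices of matches that pass Lowe's ratio test.

    knn_distances: list of (best_distance, second_best_distance) pairs,
    assumed to come from a 2-nearest-neighbour matcher.
    ratio: assumed threshold; the value used by the library is not stated here.
    """
    good = []
    for index, (best, second_best) in enumerate(knn_distances):
        # Keep the match only if the best candidate is distinctly closer.
        if best < ratio * second_best:
            good.append(index)
    return good


print(ratio_test([(10.0, 40.0), (30.0, 35.0), (5.0, 8.0)]))
```

The second pair is rejected because its two candidates are almost equally close, which suggests an ambiguous feature.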


Slide Classifier

lecture2notes.end_to_end.slide_classifier.classify_frames(frames_dir, do_move=True, incorrect_threshold=0.6, model_path='model_best.ckpt')[source]

Classifies images in a directory using the slide classifier model.

  • frames_dir (str) – path to directory containing images to classify

  • do_move (bool, optional) – move the images to their sorted folders instead of copying them. Defaults to True.

  • incorrect_threshold (float, optional) – the certainty value that the model must be below for a prediction to be marked “probably incorrect”. Defaults to 0.60.


(frames_sorted_dir, certainties, percent_wrong)

Return type
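
How incorrect_threshold might translate into flagged frames can be sketched as follows. This is an illustration only: flag_probably_incorrect is a hypothetical helper, and interpreting percent_wrong as the share of frames whose certainty falls below the threshold (on a 0 to 100 scale) is an assumption, since the return values are not described above.

```python
def flag_probably_incorrect(certainties, incorrect_threshold=0.60):
    """Flag predictions whose certainty is below the threshold (hypothetical)."""
    # Indices of frames whose prediction would be marked "probably incorrect".
    flagged = [
        index for index, certainty in enumerate(certainties)
        if certainty < incorrect_threshold
    ]
    # Assumption: percent_wrong is the share of flagged frames, 0-100 scale.
    percent_wrong = 100 * len(flagged) / len(certainties) if certainties else 0.0
    return flagged, percent_wrong


print(flag_probably_incorrect([0.95, 0.55, 0.80, 0.30]))
```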


Slide Structure Analysis

lecture2notes.end_to_end.slide_structure_analysis.all_in_folder(path, do_rename=True, **kwargs)[source]

Perform structure analysis and OCR on every file in folder using analyze_structure().


(raw_texts, json_texts) A list of the raw text for each slide and a list of the json structure analysis data for each slide.

Return type


lecture2notes.end_to_end.slide_structure_analysis.analyze_structure(image, to_json=None, return_unstructured_text=True, gamma=0.1, beta=0.2, orient='index', extra_json=None)[source]

Perform slide structure analysis.

  • image (np.array) – Image to be processed as loaded with cv2.imread().

  • to_json (str or bool, optional) – Path to write json output or a boolean to return json data as a string. The default return value is a pd.DataFrame. Defaults to None.

  • return_unstructured_text (bool, optional) – If the raw recognized text should be returned in addition to the other return values.

  • gamma (float, optional) – The percentage greater than or less than the average stroke width that a text line must meet to be classified as bold/subtitle or small text respectively. Defaults to 0.1.

  • beta (float, optional) – The percentage greater than or less than the average height that a text line must meet to be classified as bold/subtitle or small text respectively. This is greater than gamma because height is on a larger scale than stroke width. Defaults to 0.2.

  • orient (str, optional) – The format of the output json data if to_json is set. The acceptable values can be found on the pandas.DataFrame.to_json documentation. Defaults to “index”.

  • extra_json (dict, optional) – Additional keys and values to add to the json output if to_json is enabled. Defaults to None.


The default is to return a pd.DataFrame. However, setting to_json to a string will instead write json data to to_json and return the path to the data. Setting to_json to True will return the json data as a string. Setting return_unstructured_text returns the previously described data and the raw recognized text as a tuple. Will return None if no text is detected.

Return type

pd.DataFrame or str or tuple or None
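
The role of gamma and beta as relative tolerances around an average can be sketched with one small function. This is a hedged illustration: classify_relative and the “normal” label are hypothetical, and how the library combines the stroke-width result with the height result is not described here, so the sketch treats a single measurement at a time.

```python
def classify_relative(value, average, tolerance):
    """Classify one measurement against the average (hypothetical helper).

    tolerance plays the role of gamma (stroke width) or beta (height).
    """
    if value > average * (1 + tolerance):
        return "bold/subtitle"
    if value < average * (1 - tolerance):
        return "small"
    # "normal" is a label invented for this sketch.
    return "normal"


# Stroke width compared with gamma=0.1, line height compared with beta=0.2.
print(classify_relative(3.4, 3.0, 0.1))
print(classify_relative(30, 40, 0.2))
print(classify_relative(3.1, 3.0, 0.1))
```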

lecture2notes.end_to_end.slide_structure_analysis.identify_title(tesseract_df, image, left_start_maximum=0.77, character_limit=3, enabled_checks=None)[source]

Determine the average stroke length in an image. Inspired by: https://stackoverflow.com/a/61914060.

Other Links:

lecture2notes.end_to_end.slide_structure_analysis.write_to_file(raw_texts, json_texts, raw_save_file, json_save_file)[source]

Write the raw text in raw_texts to raw_save_file and the json data in json_texts to json_save_file. Used to write results from all_in_folder() to disk.

  • raw_texts (list) – List of raw text outputs from analyze_structure().

  • json_texts (list) – List of json ssa outputs from analyze_structure().

  • raw_save_file (str) – The path to save the raw text. A “.txt” file.

  • json_save_file (str) – The path to save the json output. A “.json” file.

Spell Check

class lecture2notes.end_to_end.spell_check.SpellChecker(max_edit_distance_dictionary=2, max_edit_distance_lookup=2, prefix_length=7)[source]

A spell checker.


Checks an input string for spelling mistakes


input_term (str) – the sequence to check for spelling errors


the best corrected string

Return type



Spell check multiple sequences by calling check() for each item in input_terms.


input_terms (list) – a list of strings to be corrected with spell checking


a list of corrected strings

Return type


Summarization Approaches

lecture2notes.end_to_end.summarization_approaches.cluster(text, coverage_percentage=0.7, final_sort_by=None, cluster_summarizer='extractive', title_generation=False, num_topics=10, minibatch=False, hf_inference_api=False, feature_extraction='neural_sbert', **kwargs)[source]

Summarize text to coverage_percentage length of the original document by extracting features from the text, clustering based on those features, and finally summarizing each cluster. See the scikit-learn documentation on clustering text for more information since several sections of this function were borrowed from that example.


  • **kwargs is passed to the feature extraction function, which is either extract_features_bow() or extract_features_neural() depending on the feature_extraction argument.

  • text (str) – a string of text to summarize

  • coverage_percentage (float, optional) – The length of the summary as a percentage of the original document. Defaults to 0.70.

  • final_sort_by (str, optional) – If cluster_summarizer is extractive and title_generation is False then this argument is available. If specified, it will sort the final cluster summaries by the specified string. Options are ["order", "rating"]. Defaults to None.

  • cluster_summarizer (str, optional) – Which summarization method to use to summarize each individual cluster. “Extractive” uses the same approach as keyword_based_ext() but instead of using keywords from another document, the keywords are calculated in the TfidfVectorizer or HashingVectorizer. Each keyword is a feature in the document-term matrix, thus the number of words to use is specified by the n_features parameter. Options are ["extractive", "abstractive"]. Defaults to “extractive”.

  • title_generation (bool, optional) – Option to generate titles for each cluster. Can not be used if final_sort_by is set. Generates titles by summarizing the text using BART finetuned on XSum (a dataset of news articles and one sentence summaries aka headline generation) and forcing results to be from 1 to 10 words long. Defaults to False.

  • num_topics (int, optional) – The number of clusters to create. This should be set to the number of topics discussed in the lecture if generating good titles is desired. If separating into groups is not very important and a final summary is desired then this parameter is not incredibly important; it just should not be set very low (3) or very high (50) unless your document is very short or very long. Defaults to 10.

  • minibatch (bool, optional) – Two clustering algorithms are used: ordinary k-means and its more scalable cousin minibatch k-means. Setting this to True will use minibatch k-means with a batch size set to the number of clusters set in num_topics. Defaults to False.

  • hf_inference_api (bool, optional) – Use the huggingface inference API for abstractive summarization. Defaults to False.

  • feature_extraction (str, optional) –

    Specify how features should be extracted from the text.

    • neural_hf: uses a huggingface/transformers pipeline with the roberta model by default

    • neural_sbert: special bert and roberta models fine-tuned to extract sentence embeddings

    • spacy: uses a spacy model. All other options use the small spacy model to split the text into sentences since sentence detection does not improve with larger models. However, if spacy is specified for feature_extraction then the en_core_web_lg model will be used to extract high-quality embeddings.

    • bow: bow = “bag of words”. This method is extremely fast since it is based on word frequencies throughout the input text. The extract_features_bow() function contains more details on recommended parameters that you can pass to this function because of **kwargs.

    Options are ["neural_hf", "neural_sbert", "spacy", "bow"]. Default is “neural_sbert”.


Exception – If incorrect parameters are passed.


The summarized text as a normal string. Line breaks will be included if title_generation is true.

Return type


lecture2notes.end_to_end.summarization_approaches.compute_ranks(sigma, v_matrix)[source]
lecture2notes.end_to_end.summarization_approaches.create_sumy_summarizer(algorithm, language='english')[source]
lecture2notes.end_to_end.summarization_approaches.extract_features_bow(data, return_lsa_svd=False, use_hashing=False, use_idf=True, n_features=10000, lsa_num_components=False)[source]

Extract features using a bag of words statistical word-frequency approach.

  • data (list) – List of sentences to extract features from

  • return_lsa_svd (bool, optional) – Return the features and lsa_svd. See “Returns” section below. Defaults to False.

  • use_hashing (bool, optional) – Use a HashingVectorizer instead of a CountVectorizer. Defaults to False. A HashingVectorizer should only be used with large datasets. Large to the degree that you’ll probably never pass enough data through this function to warrant the usage of a HashingVectorizer. HashingVectorizers use very little memory and are thus scalable to large datasets because there is no need to store a vocabulary dictionary in memory. More information can be found in the HashingVectorizer scikit-learn documentation.

  • use_idf (bool, optional) – Option to use inverse document-frequency. Defaults to True. In the case of use_hashing a TfidfTransformer will be appended in a pipeline after the HashingVectorizer. If not use_hashing then the use_idf parameter of the TfidfVectorizer will be set to use_idf. This step is important because, as explained by the scikit-learn documentation: “In a large text corpus, some words will be very present (e.g. ‘the’, ‘a’, ‘is’ in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms. In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.”

  • n_features (int, optional) – Specifies the number of features/words to use in the vocabulary (which are the columns of the document-term matrix). In the case of the TfidfVectorizer the n_features acts as a maximum since the max_df and min_df parameters choose words to add to the vocabulary (to use as features) that occur within the bounds specified by these parameters. This value should probably be lowered if use_hashing is set to True. Defaults to 10000.

  • lsa_num_components (int, optional) – If set then preprocess the data using latent semantic analysis to reduce the dimensionality to lsa_num_components components. Defaults to False.


A list of the extracted features and, only if return_lsa_svd is set to True, the u, sigma, and v of the SVD calculation on the document-term matrix.

Return type

[list or tuple]
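
To make the tf–idf re-weighting concrete, here is a minimal pure-Python sketch that builds a small document-term matrix. It is not the function’s implementation, which relies on scikit-learn vectorizers; tfidf_features is a hypothetical helper, tokenization is a plain whitespace split, and the smoothed IDF form ln((1 + n) / (1 + df)) + 1 with L2 row normalization is chosen here to resemble scikit-learn’s documented defaults.

```python
import math
from collections import Counter


def tfidf_features(sentences):
    """Return (vocabulary, rows) for a tiny tf-idf matrix (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization is good enough for a demo.
    tokenized = [sentence.lower().split() for sentence in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many sentences each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {
        word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        for word in vocabulary
    }
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[word] * idf[word] for word in vocabulary]
        # L2-normalize each row so sentence length does not dominate.
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, rows


vocabulary, rows = tfidf_features(["the cat sat", "the dog sat", "the cat ran"])
print(vocabulary)
print(rows[1])
```

Each sentence is a row and each vocabulary word is a column. In the second row, the rare word “dog” receives a larger weight than “the”, which appears in every sentence.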

lecture2notes.end_to_end.summarization_approaches.extract_features_neural_hf(sentences, model='roberta-base', tokenizer='roberta-base', n_hidden=768, squeeze=True, **kwargs)[source]

Extract features using a transformer model from the huggingface/transformers library

lecture2notes.end_to_end.summarization_approaches.extract_features_neural_sbert(sentences, model='roberta-base-nli-mean-tokens')[source]

Extract features using Sentence-BERT (SBERT) or SRoBERTa from the sentence-transformers library

lecture2notes.end_to_end.summarization_approaches.full_sents(ocr_text, transcript_text, remove_newlines=True, cut_off=0.7)[source]
lecture2notes.end_to_end.summarization_approaches.generic_abstractive(to_summarize, summarizer=None, min_length=None, max_length=None, hf_inference_api=False, *args, **kwargs)[source]
lecture2notes.end_to_end.summarization_approaches.generic_abstractive_hf_api(to_summarize, summarizer='facebook/bart-large-cnn', *args, **kwargs)[source]
lecture2notes.end_to_end.summarization_approaches.generic_extractive_sumy(text, coverage_percentage=0.7, algorithm='text_rank', language='english')[source]
lecture2notes.end_to_end.summarization_approaches.get_best_sentences(sentences, count, rating, *args, **kwargs)[source]
lecture2notes.end_to_end.summarization_approaches.get_complete_sentences(text, return_string=False)[source]
lecture2notes.end_to_end.summarization_approaches.get_sentences(text, model='en_core_web_sm')[source]
lecture2notes.end_to_end.summarization_approaches.initialize_abstractive_model(sum_model, use_hf_pipeline=True, *args, **kwargs)[source]
lecture2notes.end_to_end.summarization_approaches.keyword_based_ext(ocr_text, transcript_text, coverage_percentage=0.7)[source]
lecture2notes.end_to_end.summarization_approaches.structured_joined_sum(ssa_path, transcript_json_path, frame_every_x=1, ending_char='.', first_slide_frame_num=0, to_json=False, summarization_method='abstractive', max_summarize_len=50, abs_summarizer='sshleifer/distilbart-cnn-12-6', ext_summarizer='text_rank', hf_inference_api=False, *args, **kwargs)[source]

Summarize slides by combining the Slide Structure Analysis (SSA) and transcript json to create a per slide summary of the transcript. The content from the beginning of one slide to the start of the next to the nearest ending_char is considered the transcript that belongs to that slide. The summarized transcript content is organized in a dictionary where the slide titles are keys. This dictionary can be returned as json or written to a json file.

  • ssa_path (str) – Path to the SSA JSON file.

  • transcript_json_path (str) – Path to the transcript JSON file.

  • frame_every_x (int, optional) – How often frames were extracted from the video that the SSA was conducted on. This is used to convert frame numbers to time (seconds). Defaults to 1.

  • ending_char (str, optional) – The character that the transcript belonging to each slide will be extended to. For instance, if the next slide appears in the middle of a sentence, the transcript content will continue to be added to the previous slide until the ending_char is reached. It is recommended to use periods or a special end of sentence token if present. These can be generated with lecture2notes.end_to_end.transcribe.transcribe_main.segment_sentences(). Setting this to " " extends only to the nearest complete word. Defaults to ".".

  • first_slide_frame_num (int, optional) – The frame number of the first slide. Used to create a ‘preface’ (aka an introduction) if the first slide is not immediately shown. Defaults to 0.

  • to_json (bool or str, optional) – If the output dictionary should be returned as a JSON string. This can also be set to a path as a string and the JSON data will be dumped to the file at that path. Defaults to False.

  • summarization_method (str, optional) – The method to use to summarize each slide’s transcript content. Options include “abstractive”, “extractive”, or “none”. Defaults to “abstractive”.

  • max_summarize_len (int, optional) – Text longer than this many tokens will be summarized. Defaults to 50.

  • abs_summarizer (str, optional) – The abstractive summarization model to use if summarization_method is “abstractive”. Defaults to “sshleifer/distilbart-cnn-12-6”.

  • hf_inference_api (bool, optional) – Use the huggingface inference API for abstractive summarization. Defaults to False.

  • *args and **kwargs are passed to the summarization function, which is either generic_abstractive() or generic_extractive_sumy() depending on summarization_method.


A dictionary containing the slide titles as keys and the summarized transcript content for each slide as values. A string will be returned when to_json is set. If to_json is True (boolean) the JSON data formatted as a string will be returned. If to_json is a path (string), then the JSON data will be dumped to the file specified and the path to the file will be returned.

Return type

dict or str
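
The way transcript content is assigned to slides and extended to the nearest ending_char can be sketched as follows. This is a simplified illustration under stated assumptions, not the library’s code: split_transcript_by_slide is a hypothetical helper, the transcript is assumed to be a list of (start_seconds, token) pairs, slide start times are assumed to be already converted to seconds, no ‘preface’ is created, and no summarization is applied.

```python
def split_transcript_by_slide(words, slide_starts, ending_char="."):
    """Group transcript tokens under slide titles (hypothetical helper).

    words: list of (start_seconds, token) pairs in chronological order.
    slide_starts: list of (start_seconds, title) pairs sorted by time.
    """
    segments = {title: [] for _, title in slide_starts}
    current = 0            # index of the slide currently receiving tokens
    sentence_open = False  # True while the last sentence lacks ending_char
    for start, token in words:
        # Advance only when the next slide has started AND the sentence
        # in progress has already reached ending_char.
        while (current + 1 < len(slide_starts)
               and start >= slide_starts[current + 1][0]
               and not sentence_open):
            current += 1
        segments[slide_starts[current][1]].append(token)
        sentence_open = not token.endswith(ending_char)
    return {title: " ".join(tokens) for title, tokens in segments.items()}


words = [(0, "Welcome"), (1, "everyone."), (2, "Today"), (3, "we"),
         (5, "cover"), (6, "sorting."), (7, "First"), (8, "quicksort.")]
slides = [(0, "Intro"), (4, "Agenda")]
print(split_transcript_by_slide(words, slides))
```

Although the “Agenda” slide starts at second 4, the tokens “cover” and “sorting.” stay with “Intro” because the sentence was still open when the slide changed. Slides with identical titles would collide in this sketch.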

Transcript Downloader

class lecture2notes.end_to_end.transcript_downloader.TranscriptDownloader(youtube=None, ytdl=True)[source]

Download transcripts from YouTube using the YouTube API or youtube-dl.

static check_suffix(output_path)[source]

Gets the file extension from output_path and verifies it is either “.srt”, “.vtt”, or it is not present in output_path. The default is “.vtt”.
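
A minimal sketch of this kind of suffix check, using pathlib. The name check_suffix_sketch is hypothetical, and both the returned value and the choice to raise ValueError for unsupported extensions are assumptions, since the documentation above does not state how the real method reports results or errors.

```python
from pathlib import Path


def check_suffix_sketch(output_path, default=".vtt", allowed=(".srt", ".vtt")):
    """Return the caption format implied by output_path (hypothetical helper)."""
    suffix = Path(output_path).suffix
    if suffix == "":
        # No extension present: fall back to the documented default.
        return default
    if suffix not in allowed:
        # Assumption: unsupported extensions are treated as an error.
        raise ValueError(f"Unsupported caption extension: {suffix}")
    return suffix


print(check_suffix_sketch("lecture/captions.srt"))
print(check_suffix_sketch("lecture/captions"))
```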

download(video_id, output_path)[source]

Convenience function to download a transcript with one call. If self.ytdl is False, calls get_caption_id() and passes the result to get_transcript_api(). If self.ytdl is True, calls get_transcript_ytdl().

get_caption_id(video_id, lang='en')[source]

Gets the caption id with language lang for a video on YouTube with id video_id.

get_transcript_api(caption_id, output_path)[source]

Downloads a caption track by id directly from the YouTube API.

  • caption_id (str) – the id of the caption track to download

  • output_path (str) – path to save the captions. file extensions are parsed by check_suffix()


the path where the transcript was saved (may not be the same as the output_path parameter)

Return type


get_transcript_ytdl(video_id, output_path)[source]

Gets the transcript for video_id using youtube-dl and saves it to output_path. The extension from output_path will be the --sub-format that is passed to the youtube-dl command.

YouTube API


Initialize the YouTube API. If oauth then use the oauth client_secret.json located in the current directory, otherwise use the YT_API_KEY environment variable.