E2E API
Main Summarizer Class
Transcribe
transcribe_main
- lecture2notes.end_to_end.transcribe.transcribe_main.caption_file_to_string(transcript_path, remove_speakers=False)[source]
Converts a .srt, .vtt, or .sbv file saved at transcript_path to a Python string. Optionally removes speaker entries by removing everything before “: ” in each subtitle cell.
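The speaker-removal rule described above can be sketched in a few lines. This is only an illustration, not the library's implementation: strip_speaker is a hypothetical helper, and it assumes the split happens at the first “: ” in each cell, which the documentation does not state.

```python
def strip_speaker(cell: str) -> str:
    """Drop everything up to and including the first ': ' in a subtitle cell."""
    _, sep, rest = cell.partition(": ")
    # If the cell has no ': ' separator, return it unchanged.
    return rest if sep else cell


# Hypothetical subtitle cells, already extracted from a caption file.
cells = ["PROFESSOR: Welcome back.", "Today we cover graphs."]
transcript = " ".join(strip_speaker(cell) for cell in cells)
print(transcript)
```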
- lecture2notes.end_to_end.transcribe.transcribe_main.check_transcript(generated_transcript, ground_truth_transcript)[source]
Compares generated_transcript to ground_truth_transcript to check for accuracy using spaCy similarity measurement. Requires the “en_vectors_web_lg” model to use “real” word vectors.
- lecture2notes.end_to_end.transcribe.transcribe_main.chunk_by_silence(audio_path, output_path, silence_thresh_offset=5, min_silence_len=2000)[source]
Split an audio file into chunks on areas of silence
- Parameters:
audio_path (str) – path to a wave file
output_path (str) – path to a folder where wave file chunks will be saved
silence_thresh_offset (int, optional) – a value subtracted from the mean dB volume of the file. Default is 5.
min_silence_len (int, optional) – the length in milliseconds in which there must be no sound in order to be marked as a splitting point. Default is 2000.
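To make the two parameters concrete, the sketch below applies them to a plain list of loudness readings. It is an assumption-laden illustration: find_silent_ranges and its one-reading-per-millisecond input are hypothetical, and the real function operates on a wave file rather than on a list.

```python
def find_silent_ranges(loudness_db, silence_thresh_offset=5, min_silence_len=2000):
    """Return (start_ms, end_ms) ranges quiet and long enough to split on.

    loudness_db is a hypothetical input: one loudness reading in dB per millisecond.
    """
    mean_db = sum(loudness_db) / len(loudness_db)
    # The silence threshold is the mean volume minus the offset.
    silence_thresh = mean_db - silence_thresh_offset
    ranges = []
    run_start = None
    # The trailing +inf reading is a sentinel that closes a final silent run.
    for ms, level in enumerate(list(loudness_db) + [float("inf")]):
        if level < silence_thresh:
            if run_start is None:
                run_start = ms
        else:
            if run_start is not None and ms - run_start >= min_silence_len:
                ranges.append((run_start, ms))
            run_start = None
    return ranges


# 3 s of speech-level audio, 2.5 s of near silence, then 3 s of speech again.
readings = [-20.0] * 3000 + [-60.0] * 2500 + [-20.0] * 3000
print(find_silent_ranges(readings))
```

Raising min_silence_len above the length of the quiet stretch leaves no splitting point, so the audio would stay in one chunk.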
- lecture2notes.end_to_end.transcribe.transcribe_main.chunk_by_speech(audio_path, output_path=None, aggressiveness=1, desired_sample_rate=None)[source]
Uses the python interface to the WebRTC Voice Activity Detector (VAD) API to create chunks of audio that contain voice. The VAD that Google developed for the WebRTC project is reportedly one of the best available, being fast, modern and free.
- Parameters:
audio_path (str) – path to the audio file to process
output_path (str, optional) – path to save the chunk files. If not specified, no wave files are written to disk and the raw pcm data is returned. Defaults to None.
aggressiveness (int, optional) – determines how aggressively non-speech is filtered out. Must be an integer between 0 and 3. Defaults to 1.
desired_sample_rate (int, optional) – the sample rate of the returned segments. If None, the sample rate of the input audio file is used. Defaults to None.
- Returns:
(segments, sample_rate, audio_length). See vad_segment_generator().
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.convert_deepspeech_json(transcript_json)[source]
Convert a deepspeech json transcript from a letter-by-letter format to word-by-word.
- Parameters:
transcript_json (dict or str) – The json format transcript as a dictionary or a json string, which will be loaded using json.loads().
- Returns:
The word-by-word transcript json.
- Return type:
dict
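A rough sketch of the letter-to-word conversion follows. The input layout (a "tokens" list of single characters, each with a "start_time") and the output layout (a "words" list) are assumptions made for illustration; the documentation does not specify the JSON schema, and letters_to_words is a hypothetical name.

```python
import json


def letters_to_words(transcript_json):
    """Merge per-letter tokens into per-word entries (assumed schema)."""
    if isinstance(transcript_json, str):
        transcript_json = json.loads(transcript_json)
    words, current, start = [], "", None
    for token in transcript_json["tokens"]:
        if token["text"] == " ":
            # A space closes the word that is being built, if any.
            if current:
                words.append({"word": current, "start_time": start})
            current, start = "", None
        else:
            if start is None:
                start = token["start_time"]  # the word starts at its first letter
            current += token["text"]
    if current:
        words.append({"word": current, "start_time": start})
    return {"words": words}


letters = {"tokens": [
    {"text": "h", "start_time": 0.0}, {"text": "i", "start_time": 0.1},
    {"text": " ", "start_time": 0.2},
    {"text": "y", "start_time": 0.3}, {"text": "o", "start_time": 0.4},
]}
print(letters_to_words(json.dumps(letters)))
```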
- lecture2notes.end_to_end.transcribe.transcribe_main.convert_samplerate(audio_path, desired_sample_rate)[source]
Use SoX to resample wave files to 16 bits, 1 channel, and a sample rate of desired_sample_rate.
- Parameters:
audio_path (str) – path to wave file to process
desired_sample_rate (int) – sample rate in hertz to convert the wave file to
- Returns:
(desired_sample_rate, output) where desired_sample_rate is the new sample rate and output is the newly resampled pcm data
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.extract_audio(video_path, output_path)[source]
Extracts audio from video at video_path and saves it to output_path.
- lecture2notes.end_to_end.transcribe.transcribe_main.get_youtube_transcript(video_id, output_path, use_youtube_dl=True)[source]
Downloads the transcript for video_id and saves it to output_path.
- lecture2notes.end_to_end.transcribe.transcribe_main.load_deepspeech_model(model_dir, beam_width=500, lm_alpha=None, lm_beta=None)[source]
Load the deepspeech model from model_dir.
- Parameters:
model_dir (str) – path to folder containing the “.pbmm” and optionally “.scorer” files
beam_width (int, optional) – beam width for decoding. Default is 500.
lm_alpha (float, optional) – alpha parameter of language model. Default is None.
lm_beta (float, optional) – beta parameter of language model. Default is None.
- Returns:
the loaded deepspeech model
- Return type:
deepspeech.Model
- lecture2notes.end_to_end.transcribe.transcribe_main.load_fasterwhisper_model(model_name_or_path='small.en')[source]
- lecture2notes.end_to_end.transcribe.transcribe_main.load_wav2vec_model(model='facebook/wav2vec2-base-960h', tokenizer='facebook/wav2vec2-base-960h', **kwargs)[source]
- lecture2notes.end_to_end.transcribe.transcribe_main.load_whispercpp_model(model_name_or_path='small.en')[source]
- lecture2notes.end_to_end.transcribe.transcribe_main.metadata_to_json(candidate_transcript)[source]
Helper function to convert metadata tokens from deepspeech to a dictionary.
- lecture2notes.end_to_end.transcribe.transcribe_main.metadata_to_string(metadata)[source]
Helper function to convert metadata tokens from deepspeech to a string.
- lecture2notes.end_to_end.transcribe.transcribe_main.process_chunks(chunk_dir, method='sphinx', model_dir=None)[source]
Performs transcription on every noise activity chunk (audio file) created by chunk_by_silence() in a directory.
- lecture2notes.end_to_end.transcribe.transcribe_main.process_segments(segments, model, audio_length='unknown', method='deepspeech', do_segment_sentences=True)[source]
Transcribe a list of byte strings containing pcm data
- Parameters:
segments (list) – list of byte strings containing pcm data (generated by chunk_by_speech())
model (deepspeech model) – a deepspeech model object or a path to a folder containing the model files (see load_deepspeech_model()).
audio_length (str, optional) – the length of the audio file if known (used for logging statements). Default is “unknown”.
method (str, optional) – The model to use to perform speech-to-text. Supports ‘deepspeech’ and ‘vosk’. Defaults to “deepspeech”.
do_segment_sentences (bool, optional) – Find sentence boundaries using segment_sentences(). Defaults to True.
- Returns:
(full_transcript, full_transcript_json) The combined transcript of all the items in segments as a string and as dictionary/json.
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.read_wave(path, desired_sample_rate=None, force=False)[source]
Reads a “.wav” file and converts to desired_sample_rate with one channel.
- Parameters:
path (str) – path to wave file to load
desired_sample_rate (int, optional) – resample the loaded pcm data from the wave file to this sample rate. Default is None, no resampling.
force (bool, optional) – Force the audio to be converted even if it is detected to meet the necessary criteria.
- Returns:
(PCM audio data, sample rate, duration)
- Return type:
tuple
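The shape of the returned tuple can be illustrated with Python's standard wave module. This sketch covers only the reading step and does no resampling; read_wave_sketch is a hypothetical stand-in, not the library function, and the demo file is half a second of generated silence.

```python
import os
import tempfile
import wave


def read_wave_sketch(path):
    """Return (pcm_data, sample_rate, duration_seconds) for a wave file."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        n_frames = wf.getnframes()
        pcm_data = wf.readframes(n_frames)
    duration = n_frames / sample_rate
    return pcm_data, sample_rate, duration


# Write half a second of 16-bit mono silence at 16 kHz so there is a file to read.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 8000)

pcm, rate, seconds = read_wave_sketch(demo_path)
print(len(pcm), rate, seconds)
```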
- lecture2notes.end_to_end.transcribe.transcribe_main.resolve_deepspeech_models(dir_name)[source]
Resolve directory path for deepspeech models and fetch each of them.
- Parameters:
dir_name (str) – Path to the directory containing pre-trained models
- Returns:
a tuple containing each of the model files (pb, scorer)
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.segment_sentences(text, text_json=None, do_capitalization=True)[source]
Detect sentence boundaries without punctuation or capitalization.
- Parameters:
text (str) – The string to segment by sentence.
text_json (str or dict, optional) – If the detected sentence boundaries should be applied to the JSON format of a transcript. Defaults to None.
do_capitalization (bool, optional) – If the first letter of each detected sentence should be capitalized. Defaults to True.
- Returns:
The punctuated (and optionally capitalized) string
- Return type:
str
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio(audio_path, method='sphinx', **kwargs)[source]
Transcribe audio using DeepSpeech, Vosk, or a method offered by
transcribe_audio_generic().
- Parameters:
audio_path (str) – Path to the audio file to transcribe.
method (str, optional) – The method to use for transcription. Defaults to “sphinx”.
**kwargs – Passed to the transcription function.
- Returns:
(transcript_text, transcript_json)
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_deepspeech(audio_path_or_data, model, raw_audio_data=False, json_num_transcripts=None, **kwargs)[source]
Transcribe an audio file or pcm data with the deepspeech model
- Parameters:
audio_path_or_data (str or byte string) – a path to a wave file or a byte string containing pcm data from a wave file. Set raw_audio_data to True if pcm data is used.
model (deepspeech model or str) – a deepspeech model object or a path to a folder containing the model files (see load_deepspeech_model())
raw_audio_data (bool, optional) – must be True if audio_path_or_data is raw pcm data. Defaults to False.
json_num_transcripts (str, optional) – Specify this value to generate multiple transcripts in json format.
- Returns:
(transcript_text, transcript_json) the transcribed audio file in string format and the transcript in json
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_fasterwhisper(audio_path, model=None)[source]
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_generic(audio_path, method='sphinx', **kwargs)[source]
Transcribe an audio file using CMU Sphinx or Google through the speech_recognition library
- Parameters:
audio_path (str) – audio file path
method (str, optional) – which service to use for transcription (“google” or “sphinx”). Default is “sphinx”.
- Returns:
the transcript of the audio file
- Return type:
str
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_vosk(audio_path_or_chunks, model='../vosk_models', chunks=False, desired_sample_rate=16000, chunk_size=2000, **kwargs)[source]
Transcribe audio using a vosk model.
- Parameters:
audio_path_or_chunks (str or generator) – Path to an audio file or a generator of chunks created by chunk_by_speech()
model (str or vosk.Model, optional) – Path to the directory containing the vosk models or a loaded vosk.Model. Defaults to “../vosk_models”.
chunks (bool, optional) – If the audio_path_or_chunks is chunks. Defaults to False.
desired_sample_rate (int, optional) – The sample rate that the model requires to convert audio to. Defaults to 16000.
chunk_size (int, optional) – The number of wave frames per loop. Amount of audio data transcribed at a time. Defaults to 2000.
- Returns:
(text_transcript, results_json) The transcript as a string and as JSON.
- Return type:
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_wav2vec(audio_path_or_chunks, model=None, chunks=False, desired_sample_rate=16000)[source]
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_whispercpp(audio_path, model=None)[source]
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_with_time(self, data, num_proc: int = 1, strict: bool = False)[source]
webrtcvad_utils
- class lecture2notes.end_to_end.transcribe.webrtcvad_utils.Frame(bytes, timestamp, duration)[source]
Represents a “frame” of audio data.
- lecture2notes.end_to_end.transcribe.webrtcvad_utils.frame_generator(frame_duration_ms, audio, sample_rate)[source]
Generates audio frames from PCM audio data.
Takes the desired frame duration in milliseconds, the PCM data, and the sample rate.
Yields Frames of the requested duration.
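A minimal sketch of such a generator is shown below, assuming 16-bit mono PCM (2 bytes per sample), which the documentation does not state explicitly. The Frame tuple mirrors the (bytes, timestamp, duration) fields listed above; frame_generator_sketch is a hypothetical name.

```python
from collections import namedtuple

Frame = namedtuple("Frame", ["bytes", "timestamp", "duration"])


def frame_generator_sketch(frame_duration_ms, audio, sample_rate):
    """Yield fixed-length Frames from raw PCM bytes (assumes 16-bit mono)."""
    # Samples per frame times 2 bytes per sample.
    bytes_per_frame = sample_rate * frame_duration_ms // 1000 * 2
    duration = frame_duration_ms / 1000.0
    offset, timestamp = 0, 0.0
    # A trailing partial frame is dropped.
    while offset + bytes_per_frame <= len(audio):
        yield Frame(audio[offset:offset + bytes_per_frame], timestamp, duration)
        timestamp += duration
        offset += bytes_per_frame


one_second = b"\x00" * 32000  # 1 s of 16-bit mono audio at 16 kHz
frames = list(frame_generator_sketch(20, one_second, 16000))
print(len(frames), len(frames[0].bytes))
```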
- lecture2notes.end_to_end.transcribe.webrtcvad_utils.vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames)[source]
Filters out non-voiced audio frames.
Given a webrtcvad.Vad and a source of audio frames, yields only the voiced audio.
Uses a padded, sliding window algorithm over the audio frames. When more than 90% of the frames in the window are voiced (as reported by the VAD), the collector triggers and begins yielding audio frames. Then the collector waits until 90% of the frames in the window are unvoiced to detrigger.
The window is padded at the front and back to provide a small amount of silence or the beginnings/endings of speech around the voiced frames.
- Parameters:
sample_rate – The audio sample rate, in Hz.
frame_duration_ms – The frame duration in milliseconds.
padding_duration_ms – The amount to pad the window, in milliseconds.
vad – An instance of webrtcvad.Vad.
frames – a source of audio frames (sequence or generator).
- Returns:
A generator that yields PCM audio data.
- Return type:
[generator]
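The trigger and detrigger logic described above can be sketched without any audio library. In this illustration is_speech is a stand-in for the VAD decision, window_size stands for the number of frames covered by the padding, and collect_voiced yields one list of frames per voiced stretch; all three are assumptions, not the library's interface.

```python
from collections import deque


def collect_voiced(frames, is_speech, window_size=10, ratio=0.9):
    """Yield lists of frames judged to be voiced, using a sliding window."""
    ring = deque(maxlen=window_size)
    triggered = False
    voiced = []
    for frame in frames:
        speech = is_speech(frame)
        if not triggered:
            ring.append((frame, speech))
            num_voiced = sum(1 for _, s in ring if s)
            if num_voiced > ratio * ring.maxlen:
                # Trigger: keep the buffered frames as leading padding.
                triggered = True
                voiced.extend(f for f, _ in ring)
                ring.clear()
        else:
            voiced.append(frame)
            ring.append((frame, speech))
            num_unvoiced = sum(1 for _, s in ring if not s)
            if num_unvoiced > ratio * ring.maxlen:
                # Detrigger: emit the stretch, including its trailing padding.
                triggered = False
                yield voiced
                voiced = []
                ring.clear()
    if voiced:
        yield voiced


# 1 marks a voiced frame, 0 an unvoiced one; bool() plays the role of the VAD.
stream = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
segments = list(collect_voiced(stream, bool, window_size=4, ratio=0.9))
print(segments)
```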
- lecture2notes.end_to_end.transcribe.webrtcvad_utils.vad_segment_generator(wavFile, aggressiveness, desired_sample_rate=None)[source]
Generate VAD segments. Filters out non-voiced audio frames.
- Parameters:
wavFile (str) – Path to input wav file to run VAD on.
- Returns:
segments: a bytearray of multiple smaller audio frames (the longer audio split into multiple smaller ones)
sample_rate: Sample rate of the input audio file
audio_length: Duration of the input audio file
- Return type:
[tuple]
mic_vad_streaming
- class lecture2notes.end_to_end.transcribe.mic_vad_streaming.Audio(callback=None, device=None, input_rate=16000, file=None)[source]
Streams raw audio from microphone. Data is received in a separate thread, and stored in a buffer, to be read from.
- BLOCKS_PER_SECOND = 50
- CHANNELS = 1
- RATE_PROCESS = 16000
- property frame_duration_ms
- class lecture2notes.end_to_end.transcribe.mic_vad_streaming.VADAudio(aggressiveness=3, device=None, input_rate=None, file=None)[source]
Filter & segment audio with voice activity detection.
- vad_collector(padding_ms=300, ratio=0.75, frames=None)[source]
Generator that yields series of consecutive audio frames comprising each utterance, separated by yielding a single None. Determines voice activity by ratio of frames in padding_ms. Uses a buffer to include padding_ms prior to being triggered.
Example: (frame, ..., frame, None, frame, ..., frame, None, ...)
          |---utterance---|        |---utterance---|
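A consumer of such a stream has to regroup frames at each None. The sketch below shows one way to do that; split_utterances is a hypothetical helper and the byte strings stand in for real audio frames.

```python
def split_utterances(stream):
    """Join consecutive frames into one bytes object per utterance."""
    utterance = []
    for frame in stream:
        if frame is None:
            # None marks the end of the current utterance.
            if utterance:
                yield b"".join(utterance)
                utterance = []
        else:
            utterance.append(frame)


# Stand-in for the generator output: two utterances separated by None.
fake_stream = [b"a", b"b", None, b"c", None]
utterances = list(split_utterances(fake_stream))
print(utterances)
```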
Cluster
- class lecture2notes.end_to_end.cluster.ClusterFilesystem(slides_dir, algorithm_name='kmeans', num_centroids=20, preference=None, damping=0.5, max_iter=200, model_path='model_best.ckpt')[source]
Clusters images from a directory and saves them to disk in folders corresponding to each centroid.
Corner Crop Transform
- lecture2notes.end_to_end.corner_crop_transform.all_in_folder(path, remove_original=False, **kwargs)[source]
Perform perspective cropping on every file in folder and return new paths.
**kwargs is passed to crop().
- lecture2notes.end_to_end.corner_crop_transform.cluster_points(points, nclusters)[source]
Perform KMeans clustering (using cv2.kmeans) on points, creating nclusters clusters. Returns the centroids of the clusters.
- lecture2notes.end_to_end.corner_crop_transform.contour_offset(cnt, offset)[source]
Offset contour because of 5px border
- lecture2notes.end_to_end.corner_crop_transform.crop(img_path, output_path=None, mode='automatic', debug_output_imgs=False, save_debug_imgs=False, create_debug_gif=False, debug_gif_optimize=True, debug_path='debug_imgs')[source]
Main method to perspective crop an image to the slide.
- Parameters:
img_path (str) – path to the image to load
output_path (str, optional) – path to save the image. Defaults to [filename]_cropped.[ext].
mode (str, optional) – There are three modes available. Defaults to “automatic”.
contours: uses find_page_contours() to extract contours from an edge map of the image. Is ineffective if there are any gaps or obstructions in the outline around the slide.
hough_lines: uses hough_lines_corners() to get corners by looking for horizontal and vertical lines, finding the intersection points, and clustering the intersection points.
automatic: tries to use contours and falls back to hough_lines if contours reports a failure.
debug_output_imgs (bool or dict, optional) – if a dictionary, modifies the dictionary by adding (image file name, image data) pairs. If a boolean and True, creates a dictionary in the same way as if a dictionary was passed. Defaults to False.
save_debug_imgs (bool, optional) – uses write_debug_imgs() to save the debug_output_imgs to disk. Requires debug_output_imgs to not be False. Defaults to False.
create_debug_gif (bool, optional) – create a gif of the debug images. Requires debug_output_imgs to not be False. Defaults to False.
debug_gif_optimize (bool, optional) – optimize the gif produced by enabling the create_debug_gif option using pygifsicle. Defaults to True.
debug_path (str, optional) – location to save the debug images and debug gif. Defaults to “debug_imgs”.
- Returns:
path to cropped image and failed (True if no slide bounding box found, false otherwise)
- Return type:
[tuple]
- lecture2notes.end_to_end.corner_crop_transform.edges_det(img, min_val, max_val, debug_output_imgs=None)[source]
Preprocessing (gray, thresh, filter, border) & Canny edge detection
- Parameters:
img (image) – the image loaded using cv2.imread.
min_val (int) – minimum value for cv2.Canny.
max_val (int) – maximum value for cv2.Canny.
debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.
- Returns:
(dilated, total_border), dilated edges and total border width added
- Return type:
[tuple]
- lecture2notes.end_to_end.corner_crop_transform.find_intersection(line1, line2)[source]
Find the intersection between line1 and line2.
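For reference, the intersection of two infinite lines can be computed with the standard determinant formula. The sketch assumes each line is given as (x1, y1, x2, y2), the format used for lines elsewhere on this page; whether find_intersection uses this formula or returns None for parallel lines is not documented, so intersection here is a hypothetical illustration.

```python
def intersection(line1, line2):
    """Intersect two infinite lines given as (x1, y1, x2, y2); None if parallel."""
    x1, y1, x2, y2 = line1
    x3, y3, x4, y4 = line2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None  # parallel or coincident lines have no single intersection
    det1 = x1 * y2 - y1 * x2
    det2 = x3 * y4 - y3 * x4
    px = (det1 * (x3 - x4) - (x1 - x2) * det2) / denom
    py = (det1 * (y3 - y4) - (y1 - y2) * det2) / denom
    return px, py


horizontal = (0, 5, 10, 5)
vertical = (3, 0, 3, 10)
print(intersection(horizontal, vertical))
```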
- lecture2notes.end_to_end.corner_crop_transform.find_page_contours(edges, img, border_size=11, min_area_mult=0.3, debug_output_imgs=None)[source]
Find corner points of page contour
- Parameters:
edges (image) – edges extracted from img by edges_det().
img (image) – the image loaded by cv2.imread.
border_size (int, optional) – the size of the borders added by edges_det(). Defaults to 11.
min_area_mult (float, optional) – the minimum percentage of the image area that a contour’s area must be greater than to be considered as the slide. Defaults to 0.3.
- Returns:
contour is the set of coordinates of the corners sorted by four_corners_sort() or returns None when no contour meets the criteria.
- Return type:
[contour or NoneType]
- lecture2notes.end_to_end.corner_crop_transform.four_corners_sort(pts)[source]
Sort corners: top-left, bot-left, bot-right, top-right
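One common way to obtain this ordering uses the sum and difference of each point's coordinates. The sketch below assumes (x, y) tuples in image coordinates, with y increasing downward; it is not necessarily how four_corners_sort() is implemented, and sort_corners is a hypothetical name.

```python
def sort_corners(pts):
    """Order four (x, y) points as top-left, bot-left, bot-right, top-right."""
    top_left = min(pts, key=lambda p: p[0] + p[1])   # smallest x + y
    bot_right = max(pts, key=lambda p: p[0] + p[1])  # largest x + y
    bot_left = max(pts, key=lambda p: p[1] - p[0])   # largest y - x
    top_right = min(pts, key=lambda p: p[1] - p[0])  # smallest y - x
    return [top_left, bot_left, bot_right, top_right]


corners = [(100, 0), (0, 0), (100, 50), (0, 50)]
print(sort_corners(corners))
```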
- lecture2notes.end_to_end.corner_crop_transform.horizontal_vertical_edges_det(img, thresh_blurred, debug_output_imgs=None)[source]
Detects horizontal and vertical edges and merges them together.
- Parameters:
img (image) – the image as provided by cv2.imread
thresh_blurred (image) – the image processed by thresholding. See edges_det().
debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.
- Returns:
result image with a black background and white edges
- Return type:
[image]
- lecture2notes.end_to_end.corner_crop_transform.hough_lines_corners(img, edges_img, min_line_length, border_size=11, debug_output_imgs=None)[source]
Uses cv2.HoughLinesP to find horizontal and vertical lines, finds the intersection points, and finally clusters those points using KMeans.
- Parameters:
img (image) – the image as loaded by cv2.imread.
edges_img (image) – edges extracted from img by edges_det().
min_line_length (int) – the shortest line length to consider as a valid line
border_size (int, optional) – the size of the borders added by edges_det(). Defaults to 11.
debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.
- Returns:
The corner coordinates as sorted by four_corners_sort().
- Return type:
[list]
- lecture2notes.end_to_end.corner_crop_transform.persp_transform(img, s_points)[source]
Transform perspective of img from start points to target points.
- lecture2notes.end_to_end.corner_crop_transform.remove_contours(edges, contour_removal_threshold)[source]
Remove contours from an edge map by deleting contours shorter than
contour_removal_threshold.
- lecture2notes.end_to_end.corner_crop_transform.resize(img, height=800, allways=False)[source]
Resize image to given height.
- lecture2notes.end_to_end.corner_crop_transform.segment_lines(lines, delta)[source]
Groups lines from cv2.HoughLinesP into vertical and horizontal bins.
- Parameters:
lines (list) – the data returned from cv2.HoughLinesP
delta (int) – how far away the x and y coordinates can differ before they’re marked as different lines
- Returns:
(h_lines, v_lines) the horizontal and vertical lines, respectively. each line in each list is formatted as (x1, y1, x2, y2).
- Return type:
[tuple]
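The binning idea can be sketched as follows. The sketch takes already-flattened (x1, y1, x2, y2) tuples and treats delta as the allowed coordinate drift along a line; both are assumptions, as is the choice to discard lines that are neither horizontal nor vertical. segment_lines_sketch is a hypothetical name.

```python
def segment_lines_sketch(lines, delta):
    """Split (x1, y1, x2, y2) lines into horizontal and vertical lists."""
    h_lines, v_lines = [], []
    for x1, y1, x2, y2 in lines:
        if abs(x2 - x1) < delta:
            v_lines.append((x1, y1, x2, y2))  # x barely changes: vertical
        elif abs(y2 - y1) < delta:
            h_lines.append((x1, y1, x2, y2))  # y barely changes: horizontal
        # Lines that tilt more than delta in both directions are skipped.
    return h_lines, v_lines


detected = [(0, 10, 200, 12), (50, 0, 52, 300), (0, 0, 100, 100)]
h_lines, v_lines = segment_lines_sketch(detected, delta=5)
print(h_lines, v_lines)
```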
- lecture2notes.end_to_end.corner_crop_transform.straight_lines_in_contour(contour, delta=100)[source]
Returns True if contour contains lines that are horizontal or vertical. delta allows the lines to tilt by a certain number of pixels. For instance, if a line is vertical, its y values can change by delta pixels before it is considered not vertical.
- lecture2notes.end_to_end.corner_crop_transform.write_debug_imgs(debug_output_imgs, base_path='debug_imgs')[source]
Saves images from debug_output_imgs to disk in base_path.
- Parameters:
debug_output_imgs (dict) – dictionary in format {image file name: image data}
base_path (str, optional) – the directory to store the debug images. Defaults to “debug_imgs”.
Text Detection
- lecture2notes.end_to_end.text_detection.get_text_bounding_boxes(image, net, min_confidence=0.5, resized_width=320, resized_height=320)[source]
Determine the locations of text in an image.
- Parameters:
image (np.array) – The image to be processed.
net (cv2.dnn_Net) – The EAST model loaded with load_east().
min_confidence (float, optional) – Minimum probability required to inspect a region. Defaults to 0.5.
resized_width (int, optional) – Resized image width (should be multiple of 32). Defaults to 320.
resized_height (int, optional) – Resized image height (should be multiple of 32). Defaults to 320.
- Returns:
The coordinates of bounding boxes containing text.
- Return type:
list
Figure Detection
- lecture2notes.end_to_end.figure_detection.all_in_folder(path, remove_original=False, east='frozen_east_text_detection.pb', do_text_check=True, **kwargs)[source]
Perform figure detection on every file in folder and return new paths.
**kwargs is passed to detect_figures().
- lecture2notes.end_to_end.figure_detection.area_of_overlapping_rectangles(a, b)[source]
Find the overlapping area of two rectangles a and b. Inspired by https://stackoverflow.com/a/27162334.
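A minimal sketch of the overlap computation, assuming each rectangle is an (xmin, ymin, xmax, ymax) tuple; the actual argument format expected by the function is not documented here, and overlap_area is a hypothetical name.

```python
def overlap_area(a, b):
    """Overlapping area of two (xmin, ymin, xmax, ymax) rectangles, or 0."""
    dx = min(a[2], b[2]) - max(a[0], b[0])  # width of the shared region
    dy = min(a[3], b[3]) - max(a[1], b[1])  # height of the shared region
    if dx > 0 and dy > 0:
        return dx * dy
    return 0


print(overlap_area((0, 0, 4, 4), (2, 2, 6, 6)))
```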
- lecture2notes.end_to_end.figure_detection.detect_color_image(image, thumb_size=40, MSE_cutoff=22, adjust_color_bias=True)[source]
Detect if an image contains color, is black and white, or is grayscale. Based on this StackOverflow answer.
- Parameters:
image (np.array) – Input image
thumb_size (int, optional) – Resize image to this size to speed up calculation. Defaults to 40.
MSE_cutoff (int, optional) – A larger value requires more color for an image to be labeled as “color”. Defaults to 22.
adjust_color_bias (bool, optional) – Mean color bias adjustment, which improves the prediction. Defaults to True.
- Returns:
Either “grayscale”, “color”, “b&w” (black and white), or “unknown”.
- Return type:
str
- lecture2notes.end_to_end.figure_detection.detect_figures(image_path, output_path=None, east='frozen_east_text_detection.pb', text_area_overlap_threshold=0.32, figure_max_area_percentage=0.6, text_max_area_percentage=0.3, large_box_detection=True, do_color_check=True, do_text_check=True, entropy_check=2.5, do_remove_subfigures=True, do_rlsa=False)[source]
Detect figures located in a slide.
- Parameters:
image_path (str) – Path to the image to process.
output_path (str, optional) – Path to save the figures. Defaults to [filename]_figure_[index].[ext].
east (str or cv2.dnn_Net, optional) – Path to the EAST model file or the pre-trained EAST model loaded with load_east(). do_text_check must be true for this option to take effect. Defaults to “frozen_east_text_detection.pb”.
text_area_overlap_threshold (float, optional) – The percentage of the figure that can contain text. If the area of the text in the figure is greater than this value, the figure is discarded. do_text_check must be true for this option to take effect. Defaults to 0.32.
figure_max_area_percentage (float, optional) – The maximum percentage of the area of the original image that a figure can take up. If the figure uses more area than original_image_area*figure_max_area_percentage then the figure will be discarded. Defaults to 0.6.
text_max_area_percentage (float, optional) – The maximum percentage of the area of the original image that a block of text (as identified by the EAST model) can take up. If the text block uses more area than original_image_area*text_max_area_percentage then that text block will be ignored. do_text_check must be true for this option to take effect. Defaults to 0.30.
large_box_detection (bool, optional) – Detect edges and classify large rectangles as figures. This will ignore do_color_check and do_text_check. This is useful for finding tables, for example. Defaults to True.
do_color_check (bool, optional) – Check that potential figures contain color. This helps to remove large quantities of black and white text from the potential figure list. Defaults to True.
do_text_check (bool, optional) – Check that only text_area_overlap_threshold of potential figures contains text. This is useful to remove blocks of text that are mistakenly classified as figures. Checking for text increases processing time, so be careful if processing a large number of files. Defaults to True.
entropy_check (float, optional) – Check that the entropy of all potential figures is above this value. Figures with a shannon_entropy lower than this value will be removed. Set to False to disable this check. The shannon_entropy implementation is from skimage.measure.entropy. IMPORTANT: This check applies to both the regular tests and large_box_detection, which most checks do not apply to. Defaults to 2.5.
do_remove_subfigures (bool, optional) – Check that there are no overlapping figures. If an overlapping figure is detected, the smaller figure will be deleted. This is useful to have enabled when using large_box_detection since large_box_detection will commonly mistakenly detect subfigures. Defaults to True.
do_rlsa (bool, optional) – Use RLSA (Run Length Smoothing Algorithm) instead of dilation. Does not apply to large_box_detection. Defaults to False.
- Returns:
(figures, output_paths) A list of figures extracted from the input slide image and a list of paths to those figures on disk.
- Return type:
tuple
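The entropy_check filter can be illustrated with a pure-Python Shannon entropy over pixel values (base 2). This is a sketch only: the library takes its implementation from scikit-image, and passes_entropy_check and the flat pixel lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Base-2 Shannon entropy of the value distribution in a flat sequence."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_entropy_check(values, entropy_check=2.5):
    """Keep a candidate figure unless its entropy is below the threshold."""
    if entropy_check is False:
        return True  # the check is disabled
    return shannon_entropy(values) >= entropy_check


flat_region = [128] * 16         # a uniform patch carries no information
varied_region = list(range(16))  # sixteen distinct values, 4 bits of entropy
print(passes_entropy_check(flat_region), passes_entropy_check(varied_region))
```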
Frames Extractor
Helpers
- lecture2notes.end_to_end.helpers.copy_all(list_path_files, output_dir, move=False)[source]
Copy (or move) to output_dir every path in list_path_files if it is a list, or every file in list_path_files if it is a path.
Image Hash
- lecture2notes.end_to_end.imghash.get_hash_func(hashmethod='phash')[source]
Returns a hash function from the imagehash library.
- Hash Methods:
ahash: Average hash
phash: Perceptual hash
dhash: Difference hash
whash-haar: Haar wavelet hash
whash-db4: Daubechies wavelet hash
- lecture2notes.end_to_end.imghash.remove_duplicates(img_dir, images)[source]
Remove duplicate frames/slides from disk.
- Parameters:
img_dir (str) – path to directory containing image files
images (dict) – dictionary in format {image hash: image filenames} provided by
sort_by_duplicates().
- lecture2notes.end_to_end.imghash.sort_by_duplicates(img_dir, hash_func='phash')[source]
Find duplicate images in a directory.
- Parameters:
img_dir (str) – path to folder containing images to scan for duplicates
hash_func (str, optional) – the hash function to use as given by get_hash_func(). Defaults to “phash”.
- Returns:
dictionary in format {image hash: image filenames}
- Return type:
[dict]
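The {image hash: image filenames} structure can be sketched with a toy average hash over small grayscale grids. The real function hashes image files through the imagehash library; average_hash, group_by_hash, and the 2x2 pixel grids here are hypothetical simplifications.

```python
from collections import defaultdict


def average_hash(pixels):
    """Toy average hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


def group_by_hash(images):
    """Map each hash to the list of filenames that produce it."""
    groups = defaultdict(list)
    for name, pixels in images.items():
        groups[average_hash(pixels)].append(name)
    return dict(groups)


# Hypothetical frames: b.jpg is a slightly noisy copy of a.jpg.
images = {
    "a.jpg": [[0, 255], [255, 0]],
    "b.jpg": [[10, 250], [240, 5]],
    "c.jpg": [[255, 0], [0, 255]],
}
groups = group_by_hash(images)
print(groups)
```

Any hash that maps to more than one filename marks a set of likely duplicates.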
OCR
Segment Cluster
- class lecture2notes.end_to_end.segment_cluster.SegmentCluster(slides_dir, model_path='model_best.ckpt')[source]
Iterates through frames in order and splits based on large visual differences (measured by the cosine difference between the feature vectors from the slide classifier)
SIFT Matcher
- lecture2notes.end_to_end.sift_matcher.does_camera_move(old_frame, frame, gamma=10, border_ratios=(10, 19), bottom=False)[source]
Detects camera movement between two frames by tracking features in the borders of the image. Only the borders are used because the center of the image probably contains a slide. Thus, tracking features of the slide is not robust since those features will disappear when the slide changes.
- Parameters:
old_frame (np.array) – First frame/image as loaded with cv2.imread()
frame (np.array) – Second frame/image as loaded with cv2.imread()
gamma (int, optional) – The threshold pixel movement value. If the camera moves more than this value, then there is assumed to be camera movement between the two frames. Defaults to 10.
border_ratios (tuple, optional) – The ratios of the height and width respectively of the first frame to be searched for features. Only the borders are searched for features. These values specify how much of the image should be counted as a border. Defaults to (10, 19).
bottom (bool, optional) – Whether to find features in the bottom border. This is not recommended because ‘presenter_slide’ images may have people’s heads at the bottom, which will move and do not represent camera motion. Defaults to False.
- Returns:
(total_movement > gamma, total_movement) If there is camera movement between the two frames and the total movement between the frames.
- Return type:
tuple
- lecture2notes.end_to_end.sift_matcher.does_camera_move_all_in_folder(folder_path)[source]
Runs
does_camera_move()on all the files in a folder and calculates statistics about camera movement within those files.
- Parameters:
folder_path (str) – Directory containing the files to be processed.
- Returns:
(movement_detection_percentage, average_move_value, max_move_value) A float representing the percentage of frames where movement was detected from the previous frame, the average of the total_movement values returned from does_camera_move(), and the maximum of the total_movement values returned from does_camera_move().
- Return type:
tuple
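The three statistics can be derived from a list of does_camera_move()-style results as sketched below. Whether the library reports the percentage as a fraction or on a 0 to 100 scale is not documented; this sketch assumes a fraction, and movement_stats is a hypothetical helper.

```python
def movement_stats(results):
    """Summarize (moved, total_movement) pairs from consecutive frame comparisons."""
    moved_count = sum(1 for moved, _ in results if moved)
    totals = [total for _, total in results]
    detection_fraction = moved_count / len(results)  # assumed to be a 0-1 fraction
    return detection_fraction, sum(totals) / len(totals), max(totals)


# Hypothetical per-pair results with gamma=10: (total_movement > gamma, total_movement).
pair_results = [(False, 2.0), (True, 12.0), (False, 4.0), (True, 22.0)]
print(movement_stats(pair_results))
```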
- lecture2notes.end_to_end.sift_matcher.is_content_added(first, second, first_area_modifier=0.7, second_area_modifier=0.4, gamma=0.09, dilation_amount=22)[source]
Detect if second contains more content than first and how much more content it adds. This algorithm dilates both images and finds contours. It then computes the total area of those contours. If the area of the second image’s contours exceeds the area of the first image’s contours by more than the gamma percentage, it is assumed that more content was added.
- Parameters:
first (np.array) – Image loaded using cv2.imread() belonging to the ‘slide’ class
second (np.array) – Image loaded using cv2.imread() belonging to the ‘presenter_slide’ class
first_area_modifier (float, optional) – The maximum percent area of the first image that a contour can take up before it is excluded. Defaults to 0.70.
second_area_modifier (float, optional) – The maximum percent area of the second image that a contour can take up before it is excluded. Images belonging to the ‘presenter_slide’ class are more likely to have mistaken large contours. Defaults to 0.40.
gamma (float, optional) – The percentage increase in content area necessary for second to be classified as having more content than first. Defaults to 0.09.
dilation_amount (int, optional) – How much the canny edge maps of both images first and second should be dilated. This helps to combine multiple components of one object into a single contour. Defaults to 22.
- Returns:
(content_is_added, amount_of_added_content) Boolean if second contains more content than first and float describing the difference in content from first to second. amount_of_added_content can be negative.
- Return type:
tuple
- lecture2notes.end_to_end.sift_matcher.match_features(slide_path, presenter_slide_path, min_match_count=33, min_area_percent=0.37, do_motion_detection=True)[source]
Match features between images in slide_path and presenter_slide_path. The images in slide_path are the queries to the matching algorithm and the images in presenter_slide_path are the train/searched images.
- Parameters:
slide_path (str) – Path to the images classified as “slide” or any directory containing query images.
presenter_slide_path (str) – Path to the images classified as “presenter_slide” or any directory containing train images.
min_match_count (int, optional) – The minimum number of matches returned by sift_flann_match() required for the image pair to be considered as containing the same slide. Defaults to 33.
min_area_percent (float, optional) – Percentage of the area of the train image (images belonging to the ‘presenter_slide’ category) that a matched slide must take up to be counted as a legitimate duplicate slide. This removes incorrect matches that can result in crops to small portions of the train image. Defaults to 0.37.
do_motion_detection (bool, optional) – Whether motion detection using does_camera_move_all_in_folder() should be performed. If set to False then it is assumed that there is movement since assuming no movement leaves room for a lot of false positives. If no camera motion is detected and this option is enabled then all slides that are unique to the “presenter_slide” category (they have no matches in the “slide” category) will automatically be cropped to contain just the slide. They will be saved to the originating folder but with the string defined by the variable OUTPUT_PATH_MODIFIER in their filename. Even if does_camera_move_all_in_folder() detects no movement it is still possible that movement is detected while running this function since a check is performed to make sure all slide bounding boxes found contain 80% overlapping area with all previously found bounding boxes. Defaults to True.
- Returns:
(non_unique_presenter_slides, transformed_image_paths)
non_unique_presenter_slides: The images in the “presenter_slide” category that are not unique and should be deleted.
transformed_image_paths: The paths to the cropped images if do_motion_detection was enabled and no motion was detected.
- Return type:
tuple
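To make the min_area_percent filter concrete, here is a minimal sketch, assuming the matched slide is described by four (x, y) corner points inside the train image. The helper names quad_area and covers_enough are hypothetical and are not part of lecture2notes; the area is computed with the shoelace formula.

```python
def quad_area(corners):
    """Area of a simple polygon given as (x, y) corners in order (shoelace formula)."""
    total = 0.0
    for i, (x1, y1) in enumerate(corners):
        x2, y2 = corners[(i + 1) % len(corners)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def covers_enough(corners, image_width, image_height, min_area_percent=0.37):
    """Hypothetical check: does the matched region fill enough of the train image?"""
    return quad_area(corners) / (image_width * image_height) >= min_area_percent


# An 80x50 region inside a 100x100 train image covers 40% of it.
matched = [(0, 0), (80, 0), (80, 50), (0, 50)]
print(covers_enough(matched, 100, 100))
```

A match that maps the slide onto a tiny corner of the train image would fall below the threshold and be discarded.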
- lecture2notes.end_to_end.sift_matcher.ransac_transform(sift_matches, kp1, kp2, img1, img2, draw_matches=False)[source]
Use data from sift_flann_match() to find the coordinates of img1 in img2. sift_matches, kp1, kp2, img1, and img2 are all the outputs of sift_flann_match(). If draw_matches is enabled then the feature matches will be drawn and shown on the screen.
- Returns:
The corner coordinates of the quadrilateral representing img1 within img2.
- Return type:
np.array
- lecture2notes.end_to_end.sift_matcher.sift_flann_match(query_image, train_image, algorithm='orb', num_features=1000)[source]
Locate query_image within train_image using algorithm for feature detection/description and FLANN (Fast Library for Approximate Nearest Neighbors) for matching. You can read more about matching in the OpenCV “Feature Matching” documentation or about homography in the OpenCV Python tutorial “Feature Matching + Homography to find Objects”.
- Parameters:
query_image (np.array) – Image to find. Loaded using cv2.imread().
train_image (np.array) – Image to search. Loaded using cv2.imread().
algorithm (str, optional) – The feature detection/description algorithm. Can be one of ORB (ORB Class Reference), SIFT (SIFT Class Reference), or FAST (FAST Class Reference). Defaults to “orb”.
num_features (int, optional) – The maximum number of features to retain when using ORB and SIFT. Does not take effect when using the FAST detection algorithm. Setting to 0 for SIFT is a good starting point. The default for ORB is 500, but it was increased to 1000 to improve accuracy. Defaults to 1000.
- Returns:
(good, kp1, kp2, img1, img2) The good matches as per Lowe’s ratio test, the key points from image 1, the key points from image 2, modified image 1, and modified image 2.
- Return type:
tuple
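The "good matches as per Lowe’s ratio test" can be illustrated without OpenCV. The sketch below assumes each candidate match is given as a pair of distances (best neighbour, second-best neighbour); the function name ratio_test and the 0.7 ratio are assumptions for illustration, not values taken from this library.

```python
def ratio_test(distance_pairs, ratio=0.7):
    """Return indices of matches that pass Lowe's ratio test.

    Each pair holds the distances to the best and second-best candidate.
    A match is kept only when the best candidate is clearly closer than
    the runner-up. The 0.7 ratio is an assumed, commonly used value.
    """
    return [
        index
        for index, (best, second_best) in enumerate(distance_pairs)
        if best < ratio * second_best
    ]


pairs = [(10.0, 50.0), (40.0, 45.0), (5.0, 100.0)]
print(ratio_test(pairs))
```

The second pair is rejected because its two nearest neighbours are almost equally distant, which usually signals an ambiguous feature.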
Slide Classifier
- lecture2notes.end_to_end.slide_classifier.classify_frames(frames_dir, do_move=True, incorrect_threshold=0.6, model_path='model_best.ckpt')[source]
Classifies images in a directory using the slide classifier model.
- Parameters:
frames_dir (str) – path to directory containing images to classify
do_move (bool, optional) – move the images to their sorted folders instead of copying them. Defaults to True.
incorrect_threshold (float, optional) – the certainty value that the model must be below for a prediction to be marked “probably incorrect”. Defaults to 0.60.
- Returns:
(frames_sorted_dir, certainties, percent_wrong)
- Return type:
[tuple]
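As a rough sketch of how incorrect_threshold might be applied, the snippet below flags predictions whose certainty falls below the threshold. The helper flag_uncertain is hypothetical, and computing the percentage as the share of flagged predictions is an assumption about percent_wrong, not a description of the actual implementation.

```python
def flag_uncertain(certainties, incorrect_threshold=0.6):
    """Hypothetical helper: find predictions marked "probably incorrect".

    Returns the indices of low-certainty predictions and the share of
    predictions flagged, as a percentage (assumed definition).
    """
    flagged = [i for i, c in enumerate(certainties) if c < incorrect_threshold]
    percent = 100.0 * len(flagged) / len(certainties) if certainties else 0.0
    return flagged, percent


print(flag_uncertain([0.95, 0.55, 0.80, 0.30]))
```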
Slide Structure Analysis
- lecture2notes.end_to_end.slide_structure_analysis.all_in_folder(path, do_rename=True, **kwargs)[source]
Perform structure analysis and OCR on every file in folder using analyze_structure().
- Parameters:
path (str) – Directory containing images to process.
do_rename (bool, optional) – Rename files to just their frame number. Defaults to True.
**kwargs (dict, optional) – Passed to lecture2notes.end_to_end.slide_structure_analysis.analyze_structure().
- Returns:
(raw_texts, json_texts) A list of the raw text for each slide and a list of the json structure analysis data for each slide.
- Return type:
tuple
- lecture2notes.end_to_end.slide_structure_analysis.analyze_structure(image, to_json=None, return_unstructured_text=True, gamma=0.1, beta=0.2, orient='index', extra_json=None)[source]
Perform slide structure analysis.
- Parameters:
image (np.array) – Image to be processed as loaded with cv2.imread().
to_json (str or bool, optional) – Path to write json output or a boolean to return json data as a string. The default return value is a pd.DataFrame. Defaults to None.
return_unstructured_text (bool, optional) – If the raw recognized text should be returned in addition to the other return values. Defaults to True.
gamma (float, optional) – The percentage greater than or less than the average stroke width that a text line must meet to be classified as bold/subtitle or small text respectively. Defaults to 0.1.
beta (float, optional) – The percentage greater than or less than the average height that a text line must meet to be classified as bold/subtitle or small text respectively. This is greater than gamma because height is on a larger scale than stroke width. Defaults to 0.2.
orient (str, optional) – The format of the output json data if to_json is set. The acceptable values can be found on the pandas.DataFrame.to_json documentation. Defaults to “index”.
extra_json (dict, optional) – Additional keys and values to add to the json output if to_json is enabled. Defaults to None.
- Returns:
The default is to return a pd.DataFrame. However, setting to_json to a string will instead write json data to to_json and return the path to the data. Setting to_json to True will return the json data as a string. Setting return_unstructured_text returns the previously described data and the raw recognized text as a tuple. Will return None if no text is detected.
- Return type:
pd.DataFrame or str or tuple or None
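The role of gamma and beta can be sketched as a tolerance band around an average. The snippet below is a simplified illustration for a single measurement (stroke width or height); the helper classify_line is hypothetical, and how the library combines the stroke-width and height checks is not shown here.

```python
def classify_line(value, average, tolerance):
    """Hypothetical helper: label one text line by deviation from the average.

    `tolerance` plays the role of gamma (stroke width) or beta (height).
    """
    if value > average * (1 + tolerance):
        return "bold/subtitle"
    if value < average * (1 - tolerance):
        return "small"
    return "normal"


# Stroke widths compared against an average of 10 with gamma=0.1.
for width in (12, 10.5, 8.5):
    print(width, classify_line(width, 10, 0.1))
```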
- lecture2notes.end_to_end.slide_structure_analysis.identify_title(tesseract_df, image, left_start_maximum=0.77, character_limit=3, enabled_checks=None)[source]
- lecture2notes.end_to_end.slide_structure_analysis.stroke_width(image)[source]
Determine the average stroke length in an image. Inspired by: https://stackoverflow.com/a/61914060.
- lecture2notes.end_to_end.slide_structure_analysis.write_to_file(raw_texts, json_texts, raw_save_file, json_save_file)[source]
Write the raw text in raw_texts to raw_save_file and the json data in json_texts to json_save_file. Used to write results from all_in_folder() to disk.
- Parameters:
raw_texts (list) – List of raw text outputs from analyze_structure().
json_texts (list) – List of json ssa outputs from analyze_structure().
raw_save_file (str) – The path to save the raw text. A “.txt” file.
json_save_file (str) – The path to save the json output. A “.json” file.
Spell Check
- class lecture2notes.end_to_end.spell_check.SpellChecker(max_edit_distance_dictionary=2, max_edit_distance_lookup=2, prefix_length=7)[source]
A spell checker.
Summarization Approaches
- lecture2notes.end_to_end.summarization_approaches.cluster(text, coverage_percentage=0.7, final_sort_by=None, cluster_summarizer='extractive', title_generation=False, num_topics=10, minibatch=False, hf_inference_api=False, feature_extraction='neural_sbert', **kwargs)[source]
Summarize text to coverage_percentage length of the original document by extracting features from the text, clustering based on those features, and finally summarizing each cluster. See the scikit-learn documentation on clustering text for more information since several sections of this function were borrowed from that example.
Notes
**kwargs is passed to the feature extraction function, which is either extract_features_bow() or extract_features_neural() depending on the feature_extraction argument.
- Parameters:
text (str) – a string of text to summarize
coverage_percentage (float, optional) – The length of the summary as a percentage of the original document. Defaults to 0.70.
final_sort_by (str, optional) – If cluster_summarizer is extractive and title_generation is False then this argument is available. If specified, it will sort the final cluster summaries by the specified string. Options are ["order", "rating"]. Defaults to None.
cluster_summarizer (str, optional) – Which summarization method to use to summarize each individual cluster. “Extractive” uses the same approach as keyword_based_ext() but instead of using keywords from another document, the keywords are calculated in the TfidfVectorizer or HashingVectorizer. Each keyword is a feature in the document-term matrix, thus the number of words to use is specified by the n_features parameter. Options are ["extractive", "abstractive"]. Defaults to “extractive”.
title_generation (bool, optional) – Option to generate titles for each cluster. Cannot be used if final_sort_by is set. Generates titles by summarizing the text using BART finetuned on XSum (a dataset of news articles and one-sentence summaries, aka headline generation) and forcing results to be from 1 to 10 words long. Defaults to False.
num_topics (int, optional) – The number of clusters to create. This should be set to the number of topics discussed in the lecture if generating good titles is desired. If separating into groups is not very important and a final summary is desired then this parameter matters less; it just should not be set very low (3) or very high (50) unless your document is very short or long. Defaults to 10.
minibatch (bool, optional) – Two clustering algorithms are used: ordinary k-means and its more scalable cousin minibatch k-means. Setting this to True will use minibatch k-means with a batch size set to the number of clusters set in num_topics. Defaults to False.
hf_inference_api (bool, optional) – Use the huggingface inference API for abstractive summarization. Defaults to False.
feature_extraction (str, optional) –
Specify how features should be extracted from the text.
neural_hf: uses a huggingface/transformers pipeline with the roberta model by default
neural_sbert: special bert and roberta models fine-tuned to extract sentence embeddings
spacy: uses a spacy model. All other options use the small spacy model to split the text into sentences since sentence detection does not improve with larger models. However, if spacy is specified for feature_extraction then the en_core_web_lg model will be used to extract high-quality embeddings
bow: bow = “bag of words”. This method is extremely fast since it is based on word frequencies throughout the input text. The extract_features_bow() function contains more details on recommended parameters that you can pass to this function because of **kwargs.
Options are ["neural_hf", "neural_sbert", "spacy", "bow"]. Default is “neural_sbert”.
- Raises:
Exception – If incorrect parameters are passed.
- Returns:
The summarized text as a normal string. Line breaks will be included if title_generation is true.
- Return type:
[str]
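The constraints between final_sort_by, cluster_summarizer, and title_generation can be summarized as a small validation routine. This is only a sketch of the documented rules: the helper check_cluster_options is hypothetical, and raising ValueError (a subclass of Exception) is an assumption about how incorrect parameters might be rejected.

```python
def check_cluster_options(final_sort_by=None, cluster_summarizer="extractive",
                          title_generation=False):
    """Hypothetical helper mirroring the documented option constraints."""
    if cluster_summarizer not in ("extractive", "abstractive"):
        raise ValueError("cluster_summarizer must be 'extractive' or 'abstractive'")
    if final_sort_by is None:
        return
    if final_sort_by not in ("order", "rating"):
        raise ValueError("final_sort_by must be 'order' or 'rating'")
    # Sorting is only available for extractive summaries without titles.
    if cluster_summarizer != "extractive" or title_generation:
        raise ValueError("final_sort_by requires extractive summaries "
                         "and title_generation=False")


check_cluster_options(final_sort_by="order")  # accepted
try:
    check_cluster_options(final_sort_by="rating", title_generation=True)
except ValueError as error:
    print("rejected:", error)
```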
- lecture2notes.end_to_end.summarization_approaches.create_sumy_summarizer(algorithm, language='english')[source]
- lecture2notes.end_to_end.summarization_approaches.extract_features_bow(data, return_lsa_svd=False, use_hashing=False, use_idf=True, n_features=10000, lsa_num_components=False)[source]
Extract features using a bag of words statistical word-frequency approach.
- Parameters:
data (list) – List of sentences to extract features from
return_lsa_svd (bool, optional) – Return the features and lsa_svd. See “Returns” section below. Defaults to False.
use_hashing (bool, optional) – Use a HashingVectorizer instead of a CountVectorizer. Defaults to False. A HashingVectorizer should only be used with large datasets. Large to the degree that you’ll probably never pass enough data through this function to warrant the usage of a HashingVectorizer. HashingVectorizers use very little memory and are thus scalable to large datasets because there is no need to store a vocabulary dictionary in memory. More information can be found in the HashingVectorizer scikit-learn documentation.
use_idf (bool, optional) – Option to use inverse document-frequency. Defaults to True. In the case of use_hashing a TfidfTransformer will be appended in a pipeline after the HashingVectorizer. If not use_hashing then the use_idf parameter of the TfidfVectorizer will be set to use_idf. This step is important because, as explained by the scikit-learn documentation: “In a large text corpus, some words will be very present (e.g. ‘the’, ‘a’, ‘is’ in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms. In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.”
n_features (int, optional) – Specifies the number of features/words to use in the vocabulary (which are the rows of the document-term matrix). In the case of the TfidfVectorizer the n_features acts as a maximum since the max_df and min_df parameters choose words to add to the vocabulary (to use as features) that occur within the bounds specified by these parameters. This value should probably be lowered if use_hashing is set to True. Defaults to 10000.
lsa_num_components (int, optional) – If set then preprocess the data using latent semantic analysis to reduce the dimensionality to lsa_num_components components. Defaults to False.
- Returns:
List of features extracted and, optionally, the u, sigma, and v of the SVD calculation on the document-term matrix. The SVD components are only returned if return_lsa_svd is set to True.
- Return type:
[list or tuple]
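To illustrate why inverse document-frequency matters, the sketch below computes smoothed idf weights in plain Python using the formula scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1. It is a standalone illustration with naive whitespace tokenization, not the code this function runs; the name smoothed_idf is hypothetical.

```python
import math
from collections import Counter


def smoothed_idf(sentences):
    """Compute idf(t) = ln((1 + n) / (1 + df(t))) + 1 for every term."""
    n = len(sentences)
    document_frequency = Counter()
    for sentence in sentences:
        # Count each term once per sentence (naive whitespace tokenization).
        document_frequency.update(set(sentence.lower().split()))
    return {
        term: math.log((1 + n) / (1 + df)) + 1
        for term, df in document_frequency.items()
    }


weights = smoothed_idf(["the cat sat", "the dog ran", "the bird flew"])
print(weights["the"], weights["cat"])
```

A word that appears in every sentence ("the") receives the minimum weight, while a word seen once ("cat") is weighted more heavily, which is the re-weighting the quoted scikit-learn passage describes.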
- lecture2notes.end_to_end.summarization_approaches.extract_features_neural_hf(sentences, model='roberta-base', tokenizer='roberta-base', n_hidden=768, squeeze=True, **kwargs)[source]
Extract features using a transformer model from the huggingface/transformers library
- lecture2notes.end_to_end.summarization_approaches.extract_features_neural_sbert(sentences, model='roberta-base-nli-mean-tokens')[source]
Extract features using Sentence-BERT (SBERT) or SRoBERTa from the sentence-transformers library
- lecture2notes.end_to_end.summarization_approaches.full_sents(ocr_text, transcript_text, remove_newlines=True, cut_off=0.7)[source]
- lecture2notes.end_to_end.summarization_approaches.generic_abstractive(to_summarize, summarizer=None, min_length=None, max_length=None, hf_inference_api=False, *args, **kwargs)[source]
- lecture2notes.end_to_end.summarization_approaches.generic_abstractive_hf_api(to_summarize, summarizer='facebook/bart-large-cnn', *args, **kwargs)[source]
- lecture2notes.end_to_end.summarization_approaches.generic_extractive_sumy(text, coverage_percentage=0.7, algorithm='text_rank', language='english')[source]
- lecture2notes.end_to_end.summarization_approaches.get_best_sentences(sentences, count, rating, *args, **kwargs)[source]
- lecture2notes.end_to_end.summarization_approaches.get_complete_sentences(text, return_string=False)[source]
- lecture2notes.end_to_end.summarization_approaches.get_sentences(text, model='en_core_web_sm')[source]
- lecture2notes.end_to_end.summarization_approaches.initialize_abstractive_model(sum_model, use_hf_pipeline=True, *args, **kwargs)[source]
- lecture2notes.end_to_end.summarization_approaches.keyword_based_ext(ocr_text, transcript_text, coverage_percentage=0.7)[source]
- lecture2notes.end_to_end.summarization_approaches.structured_joined_sum(ssa_path, transcript_json_path, frame_every_x=1, ending_chars=['.', '!', '?'], first_slide_frame_num=0, to_json=False, summarization_method='abstractive', max_summarize_len=50, abs_summarizer='sshleifer/distilbart-cnn-12-6', ext_summarizer='text_rank', hf_inference_api=False, *args, **kwargs)[source]
Summarize slides by combining the Slide Structure Analysis (SSA) and transcript json to create a per slide summary of the transcript. The content from the beginning of one slide to the start of the next, extended to the nearest character in ending_chars, is considered the transcript that belongs to that slide. The summarized transcript content is organized in a dictionary where the slide titles are keys. This dictionary can be returned as json or written to a json file.
- Parameters:
ssa_path (str) – Path to the SSA JSON file.
transcript_json_path (str) – Path to the transcript JSON file.
frame_every_x (int, optional) – How often frames were extracted from the video that the SSA was conducted on. This is used to convert frame numbers to time (seconds). Defaults to 1.
ending_chars (list, optional) – The characters that the transcript belonging to each slide will be extended to. For instance, if the next slide appears in the middle of a sentence, the transcript content will continue to be added to the previous slide until one of the ending_chars is reached. It is recommended to use periods or a special end of sentence token if present. These can be generated with lecture2notes.end_to_end.transcribe.transcribe_main.segment_sentences(). Defaults to ['.', '!', '?'].
first_slide_frame_num (int, optional) – The frame number of the first slide. Used to create a ‘preface’ (aka an introduction) if the first slide is not immediately shown. Defaults to 0.
to_json (bool or str, optional) – If the output dictionary should be returned as a JSON string. This can also be set to a path as a string and the JSON data will be dumped to the file at that path. Defaults to False.
summarization_method (str, optional) – The method to use to summarize each slide’s transcript content. Options include “abstractive”, “extractive”, or “none”. Defaults to “abstractive”.
max_summarize_len (int, optional) – Text longer than this many tokens will be summarized. Defaults to 50.
abs_summarizer (str, optional) – The abstractive summarization model to use if summarization_method is “abstractive”. Defaults to “sshleifer/distilbart-cnn-12-6”.
hf_inference_api (bool, optional) – Use the huggingface inference API for abstractive summarization. Defaults to False.
*args and **kwargs – Passed to the summarization function, which is either generic_abstractive() or generic_extractive_sumy() depending on summarization_method.
- Returns:
A dictionary containing the slide titles as keys and the summarized transcript content for each slide as values. A string will be returned when to_json is set. If to_json is True (boolean) the JSON data formatted as a string will be returned. If to_json is a path (string), then the JSON data will be dumped to the file specified and the path to the file will be returned.
- Return type:
dict or str
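The idea of extending a slide's transcript to the nearest ending character can be sketched on a plain string. The snippet below is a simplified illustration that works on character positions; the helper extend_to_ending_char is hypothetical, and the real function operates on transcript JSON with timing information rather than raw string indices.

```python
def extend_to_ending_char(text, cut_index, ending_chars=(".", "!", "?")):
    """Return the index just past the first ending character at or after cut_index."""
    for index in range(cut_index, len(text)):
        if text[index] in ending_chars:
            return index + 1
    return len(text)  # no ending character found: keep the rest of the text


transcript = "First slide talk. Second slide talk."
# Suppose the next slide appears at character 8, in the middle of a sentence.
end = extend_to_ending_char(transcript, 8)
print(repr(transcript[:end]))
print(repr(transcript[end:].strip()))
```

The first slide keeps its full sentence, and the second slide starts at the next sentence boundary rather than mid-sentence.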
Transcript Downloader
- class lecture2notes.end_to_end.transcript_downloader.TranscriptDownloader(youtube=None, ytdl=True)[source]
Download transcripts from YouTube using the YouTube API or youtube-dl.
- static check_suffix(output_path)[source]
Gets the file extension from output_path and verifies it is either “.srt”, “.vtt”, or it is not present in output_path. The default is “.vtt”.
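The suffix check described above can be sketched with pathlib. This is a minimal illustration under assumptions: the helper name normalize_suffix is hypothetical, and raising ValueError for an unsupported extension is a guess at the behaviour, since the documentation only says the suffix is verified.

```python
from pathlib import Path


def normalize_suffix(output_path, default=".vtt"):
    """Hypothetical helper: accept ".srt"/".vtt", add the default if missing."""
    path = Path(output_path)
    if path.suffix == "":
        return str(path.with_suffix(default))
    if path.suffix not in (".srt", ".vtt"):
        # Assumed behaviour: reject anything other than the two caption formats.
        raise ValueError(f"unsupported caption extension: {path.suffix}")
    return str(path)


print(normalize_suffix("talk"))
print(normalize_suffix("talk.srt"))
```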
- download(video_id, output_path)[source]
Convenience function to download transcript with one call. If self.ytdl is False, calls get_caption_id() and passes result to get_transcript(). If self.ytdl is True, calls get_transcript_ytdl().
- get_caption_id(video_id, lang='en')[source]
Gets the caption id with language lang for a video on YouTube with id video_id.
- get_transcript_api(caption_id, output_path)[source]
Downloads a caption track by id directly from the YouTube API.
- Parameters:
caption_id (str) – the id of the caption track to download
output_path (str) – path to save the captions. File extensions are parsed by check_suffix().
- Returns:
the path where the transcript was saved (may not be the same as the output_path parameter)
- Return type:
[str]