
Options

Fauzan F A edited this page Jan 2, 2024 · 23 revisions

The documentation below details how to customize the app. If you are just getting started, check the getting started guide first.

📌 Main Window


Window Menubar

File

You can choose the following actions:

  • Stay on top
  • Hide
  • Exit

View

You can choose the following actions:

  • Settings (Shortcut: F2)
    Open the settings menu
  • Log (Shortcut: Ctrl + F1)
    Open log window
  • Export Directory
    Open export directory
  • Log Directory
    Open log directory
  • Model Directory
    Open model directory

Show

You can choose the following actions:

  • Transcribed speech subtitle window (Shortcut: F3)
    Shows the transcription result of the recording session in a detached window, like a subtitle box.

  • Translated speech subtitle window (Shortcut: F4)
    Shows the translation result of the recording session in a detached window, like a subtitle box.

Preview:

Windows users can further customize it to remove the background by right-clicking the window and choosing the click-through/transparent option.


Help

You can choose the following actions:

  • About (Shortcut: F1)
  • Open Documentation / Wiki
  • Visit Repository
  • Check for Update

Transcribe

Select the model for transcription. You can choose between the following:

  • Tiny
  • Base
  • Small
  • Medium
  • Large

Each model has different requirements and produces different results. For more information, check the Whisper repository directly.

Translate

Select the method for translation.

  • Whisper (to English only, from the 99 available languages)
  • Google Translate (133 target languages, 94 of which are compatible with Whisper as the source language)
  • LibreTranslate v1.5.1 (45 target languages, 43 of which are compatible with Whisper as the source language)
  • MyMemoryTranslator (127 target languages, 93 of which are compatible with Whisper as the source language)

From

Set the language to translate from. The available languages differ depending on the method selected in the Translate option.

To

Set the language to translate to. The available languages differ depending on the method selected in the Translate option.

Swap

Swap the languages in the From and To options. This will also swap the textbox results.

Clear

Clear the textbox result.

HostAPI

Set the device Host API for recording.

Microphone

Set the mic device for recording. This option will be different depending on the selected Host API in the HostAPI option.

Speaker

Set the speaker device for recording. This option will be different depending on the selected Host API in the HostAPI option. (Only on Windows 8 and above)

Task

Set the task to do when recording. The available tasks are:

  • Transcribe
  • Translate

Input

Set the input for recording. The available inputs are:

  • Microphone
  • Speaker

Copy

Copy the textbox result.

Tool

Open the tool dropdown menu. The available tools are:

  • Export recorded results
  • Align results
  • Refine results
  • Translate results

Record

Start recording. The button will change to Stop when recording.

Import File

Import a file to transcribe; this will open its own modal window.

📌 General Options


Application related

Check for update on start

Whether to check for a new update on every app startup. (Default checked)

Ask confirmation before recording

Whether to ask for confirmation when the record button is pressed.

Supress hidden to tray notif

Whether to suppress the notification shown when the app is hidden to the tray. (Default unchecked)

Supress device warning

Whether to suppress any device-related warnings that might show up. (Default unchecked)

Show audio visualizer (record)

Whether to show the audio input visualizer when recording. (Default checked)

Show audio visualizer (setting)

Whether to show the audio input visualizer in the settings menu. (Default checked)

Theme

By default, the app is bundled with the Sun Valley custom theme. You can also add custom themes, with some limitations; instructions are located in the readme in the theme folder.

Logging

Log Directory

Set the log folder location by pressing the button on the right. Available actions:

  • Open folder
  • Change log folder
  • Set back to default
  • Empty log folder

Verbose logging for whisper

Whether to log the record session verbosely. (Default unchecked)

Keep log files

Whether to keep the log files or not. If not checked, the log files will be deleted every time the app runs. (Default unchecked)

Log level

Set log level. (Default DEBUG)

Debug recording

Whether to show debug logs for the record session. Turning this on might slow down the app. (Default unchecked)

Debug recorded audio

Whether to save audio recorded in the record session into the debug folder, located at speechtranslate/debug.

The audio is saved as .wav in the debug folder and, if this option is unchecked, will be deleted automatically on every run. Turning this on might slow down the app. (Default unchecked)

Debug translate

Whether to show debug logs for the translate session. (Default unchecked)

Model

Model directory

Set the model folder location by pressing the button on the right. Available actions:

  • Open folder
  • Change model folder
  • Set back to default
  • Empty model folder
  • Download model

Auto check model on start

Whether to automatically check that the model is available when the settings menu is first opened. (Default unchecked)

Model

You can download a model by pressing the download button. Each model has different requirements and produces different results. You can read more about it here.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small  | 244 M      | small.en           | small               | ~2 GB         | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |

📌 Device - Record Options


Device Parameters

Note

Speaker input only works on Windows 8 and above.
Alternatively, you can make a loopback to capture your system audio as a virtual input (like mic input) by using one of these guides/tools: (Voicemeeter on Windows) - (YT Tutorial) - (pavucontrol on Ubuntu with PulseAudio) - (BlackHole on macOS)

Sample Rate

Set sample rate for the input device. (Default 16000)

Channels

Set channels for the input device. (Default 1)

Chunk Size

Set chunk size for the input device. (Default 1024)

Auto Sample Rate

Whether to automatically set the sample rate based on the input device. (Default unchecked for microphone and checked for speaker)

Auto channels value

Whether to automatically set the channels based on the input device. (Default unchecked for microphone and checked for speaker)

Recording

Transcribe Rate (ms)

Set the rate for transcribing the audio in milliseconds. (Default 300)

Audio Processing

Conversion

Conversion method used to feed audio to the Whisper model. (Default Numpy Array)

Numpy array is the default and recommended method; it is faster and more efficient. If there are any errors related to the device or conversion in the record session, try using the temporary wav file method instead. The temporary wav file is a little slower and less efficient, but might be more accurate in some cases. When using a wav file, the I/O on the recorded wav file might slow down the performance of the app significantly, especially with long buffers. Both settings resample the audio to a 16 kHz sample rate; the difference is that the numpy array method uses scipy to resample the audio, while the temporary wav file method uses Whisper's default.
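To illustrate part of what the numpy-array path does (minus the resampling), the recorder's 16-bit PCM bytes are normalized to floats in [-1.0, 1.0] before being fed to the model. This is a minimal stdlib sketch of that normalization, not the app's actual code (which uses numpy/scipy):

```python
import array

def pcm16_to_float(raw: bytes) -> list:
    """Normalize 16-bit PCM samples to floats in [-1.0, 1.0],
    the value range Whisper expects (the app does this with numpy)."""
    samples = array.array("h", raw)  # signed 16-bit integers (native endianness)
    return [s / 32768.0 for s in samples]

# silence, maximum positive sample, maximum negative sample
print(pcm16_to_float(b"\x00\x00\xff\x7f\x00\x80"))
```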

Numpy Array

Use numpy array to feed to the model. This method is faster because of no need to write out the audio to wav file.

Temporary wav file

Use a temporary wav file to feed audio to the model. This might slow down the process because of the file I/O operations, but might help fix errors related to the device (which rarely happen). When both VAD and Demucs are enabled in a record session, this option is used automatically.

Min Buffer

Set the minimum buffer input (in seconds) for the input to be considered valid. This means the input must be at least x seconds long before being passed to Whisper to get the result. (Default 0.4)

Max Buffer

Set the maximum buffer size for the audio in seconds. (Default 10)

Max Sentences

Set the max number of sentences. One sentence equals one buffer, so if the max buffer is 10 seconds, the words within those 10 seconds form one sentence. (Default 5)

Set no limit to sentences

If enabled, removes the limit on the number of results kept in memory when recording.
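The interaction between these buffer and sentence limits can be sketched roughly like this (names and structure are illustrative, not the app's actual code):

```python
from collections import deque

SAMPLE_RATE = 16000   # Hz, the default device sample rate
MIN_BUFFER = 0.4      # seconds; shorter input is not yet considered valid
MAX_BUFFER = 10       # seconds; the buffer is cut once it reaches this
MAX_SENTENCES = 5     # finished buffers ("sentences") kept in memory

sentences = deque(maxlen=MAX_SENTENCES)  # oldest sentence drops off automatically
buffer = []

def feed(samples):
    """Accumulate incoming audio; close the sentence once the buffer is full."""
    buffer.extend(samples)
    if len(buffer) < MIN_BUFFER * SAMPLE_RATE:
        return  # not enough audio yet to pass to Whisper
    if len(buffer) >= MAX_BUFFER * SAMPLE_RATE:
        sentences.append(list(buffer))
        buffer.clear()
```

With "Set no limit to sentences" enabled, the deque would simply have no maxlen.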

Enable Threshold

Whether to enable the threshold. If enabled, the app will only transcribe the audio when it is above the threshold. (Default checked)

Auto Threshold

If set to auto, will use VAD (voice activity detection) for the threshold. The VAD used is WebRTC VAD through py-webrtcvad. (Default checked)

If set to auto, the user will need to select the VAD sensitivity to filter out noise; the higher the sensitivity, the more noise is filtered out. If not set to auto, the user will need to set the threshold manually.

Break buffer on silence

Whether to automatically break the buffer when silence is detected for more than 1 second.

Use silero

Whether to use Silero VAD alongside WebRTC VAD. (Note that this option might not be available for every device.)

Result

Text Separator

Set the separator for the text result. (Default \n)

📌 Whisper Options


Whisper Options

Use Faster Whisper

Whether to use Faster Whisper. (Default checked)

Decoding

Decoding Preset

Set the decoding preset. (Default Beam Search) You can choose between the following:

  • Greedy, which sets the temperature parameter to 0.0, with best of, beam size, and patience all set to none
  • Beam Search, which sets the temperature parameter with a fallback of 0.2, so the temperatures are 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; both best of and beam size are set to 3 and patience is set to 1
  • Custom, which lets you set your own decoding options

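For instance, the Beam Search preset's temperature fallback sequence described above can be generated like this (a sketch of the documented values, not the app's internals):

```python
def temperature_schedule(step=0.2, maximum=1.0):
    """Beam Search preset: temperature 0.0 with fallback increments of 0.2."""
    n = round(maximum / step) + 1
    return [round(i * step, 2) for i in range(n)]

print(temperature_schedule())  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

The Greedy preset corresponds to a schedule of just [0.0] with no fallback.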
Temperature

Temperature to use for sampling

Best of

Number of candidates when sampling with non-zero temperature

Beam Size

Number of beams in beam search, only applicable when temperature is zero

Threshold

Compression Ratio Threshold

If the gzip compression ratio is higher than this value, treat the decoding as failed. (Default is 2.4)

Log Probability Threshold

If the average log probability is lower than this value, treat the decoding as failed. (Default is -1.0)

No Speech Threshold

If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to logprob_threshold, consider the segment as silence. (Default is 0.72)

Initial Prompt

Optional text to provide as a prompt for the first window. (Default is empty)

Prefix

Optional text to prefix the current context. (Default is empty)

Supress Token

Comma-separated list of token ids to suppress during sampling. '-1' will suppress most special characters except common punctuation. (Default is empty)

Max Initial Timestamp

Maximum initial timestamp to use for the first window. (Default is 1.0)

Suppress Blank

If true will suppress blank output. (Default is checked)

Condition on previous text

If true, provide the previous output of the model as a prompt for the next window. Disabling this may make the text inconsistent across windows. (Default is checked)

FP16

If true, will use fp16 for inference. (Default is checked)

Raw Arguments

Command line arguments / parameters to be used. It has the same options as when using stable-ts from the CLI, but with some parameters removed because they are set in the app / GUI. All of the parameters are:

# [device]
* description: device to use for PyTorch inference (A Cuda compatible GPU and PyTorch with CUDA support are still required for GPU / CUDA)
* type: str, default cuda
* usage: --device cpu

# [cpu_preload]
* description: load model into CPU memory first then move model to specified device; this reduces GPU memory usage when loading model.
* type: bool, default True
* usage: --cpu_preload True

# [dynamic_quantization]
* description: whether to apply Dynamic Quantization to model to reduce memory usage (~half less) and increase inference speed at cost of slight decrease in accuracy; Only for CPU; NOTE: overhead might make inference slower for models smaller than 'large'
* type: bool, default False
* usage: --dynamic_quantization

# [prepend_punctuations]
* description: Punctuations to prepend to the next word
* type: str, default "'“¿([{-"
* usage: --prepend_punctuations "<punctuation>"

# [append_punctuations]
* description: Punctuations to append to the previous word
* type: str, default "\"'.。,,!!??::”)]}、"
* usage: --append_punctuations "<punctuation>"

# [gap_padding]
* description: padding to prepend to each segment for word timing alignment; used to reduce the probability of the model predicting timestamps earlier than the first utterance
* type: str, default " ..."
* usage: --gap_padding "padding"

# [word_timestamps]
* description: extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment; disabling this will prevent segments from splitting/merging properly.
* type: bool, default True
* usage: --word_timestamps True

# [regroup]
* description: whether to regroup all words into segments with more natural boundaries; specify a string for customizing the regrouping algorithm; ignored if [word_timestamps]=False.
* type: str, default "True"
* usage: --regroup "regroup_option"

# [ts_num]
* description: number of extra inferences to perform to find the mean timestamps
* type: int, default 0
* usage: --ts_num <number>

# [ts_noise]
* description: percentage of noise to add to audio_features to perform inferences for [ts_num]
* type: float, default 0.1
* usage: --ts_noise 0.1

# [suppress_silence]
* description: whether to suppress timestamps where audio is silent at segment-level and word-level if [suppress_word_ts]=True
* type: bool, default True
* usage: --suppress_silence True

# [suppress_word_ts]
* description: whether to suppress timestamps where audio is silent at word-level; ignored if [suppress_silence]=False
* type: bool, default True
* usage: --suppress_word_ts True

# [suppress_ts_tokens]
* description: whether to use silence mask to suppress silent timestamp tokens during inference; increases word accuracy in some cases, but tends to reduce 'verbatimness' of the transcript; ignored if [suppress_silence]=False
* type: bool, default False
* usage: --suppress_ts_tokens True

# [q_levels]
* description: quantization levels for generating timestamp suppression mask; acts as a threshold to marking sound as silent; fewer levels will increase the threshold of volume at which to mark a sound as silent
* type: int, default 20
* usage: --q_levels <number>

# [k_size]
* description: Kernel size for average pooling waveform to generate suppression mask; recommend 5 or 3; higher sizes will reduce detection of silence
* type: int, default 5
* usage: --k_size 5

# [time_scale]
* description: factor for scaling audio duration for inference; greater than 1.0 'slows down' the audio; less than 1.0 'speeds up' the audio; 1.0 is no scaling
* type: float
* usage: --time_scale <value>

# [vad]
* description: whether to use Silero VAD to generate timestamp suppression mask; Silero VAD requires PyTorch 1.12.0+; Official repo: https://github.com/snakers4/silero-vad
* type: bool, default False
* usage: --vad True

# [vad_threshold]
* description: threshold for detecting speech with Silero VAD. (Default: 0.35); low threshold reduces false positives for silence detection
* type: float, default 0.35
* usage: --vad_threshold 0.35

# [vad_onnx]
* description: whether to use ONNX for Silero VAD
* type: bool, default False
* usage: --vad_onnx True

# [min_word_dur]
* description: only allow suppressing timestamps that result in word durations greater than this value
* type: float, default 0.1
* usage: --min_word_dur 0.1

# [demucs]
* description: whether to reprocess the audio track with Demucs to isolate vocals/remove noise; Demucs official repo: https://github.com/facebookresearch/demucs
* type: bool, default False
* usage: --demucs True

# [demucs_output]
* description: path(s) to save the vocals isolated by Demucs as WAV file(s); ignored if [demucs]=False
* type: str
* usage: --demucs_output "<path>"

# [only_voice_freq]
* description: whether to only use sound between 200 - 5000 Hz, where the majority of human speech is.
* type: bool
* usage: --only_voice_freq True

# [strip]
* description: whether to remove spaces before and after text on each segment for output
* type: bool, default True
* usage: --strip True

# [tag]
* description: a pair of tags used to change the properties of a word at its predicted time; SRT Default: '<font color=\"#00ff00\">', '</font>'; VTT Default: '<u>', '</u>'; ASS Default: '{\\1c&HFF00&}', '{\\r}'
* type: str
* usage: --tag "<start_tag> <end_tag>"

# [reverse_text]
* description: whether to reverse the order of words for each segment of text output
* type: bool, default False
* usage: --reverse_text True

# [font]
* description: word font for ASS output(s)
* type: str, default 'Arial'
* usage: --font "<font_name>"

# [font_size]
* description: word font size for ASS output(s)
* type: int, default 48
* usage: --font_size 48

# [karaoke]
* description: whether to use progressive filling highlights for karaoke effect (only for ASS outputs)
* type: bool, default False
* usage: --karaoke True

# [threads]
* description: number of threads used by torch for CPU inference; supersedes MKL_NUM_THREADS/OMP_NUM_THREADS
* type: int
* usage: --threads <value>

# [mel_first]
* description: process the entire audio track into a log-Mel spectrogram first instead of in chunks
* type: bool
* usage: --mel_first

# [demucs_option]
* description: Extra option(s) to use for Demucs; Replace True/False with 1/0; E.g. --demucs_option "shifts=3" --demucs_option "overlap=0.5"
* type: str
* usage: --demucs_option "<option>"

# [refine_option]
* description: Extra option(s) to use for refining timestamps; Replace True/False with 1/0; E.g. --refine_option "steps=sese" --refine_option "rel_prob_decrease=0.05"
* type: str
* usage: --refine_option "<option>"

# [model_option]
* description: Extra option(s) to use for loading the model; Replace True/False with 1/0; E.g. --model_option "in_memory=1" --model_option "cpu_threads=4"
* type: str
* usage: --model_option "<option>"

# [transcribe_option]
* description: Extra option(s) to use for transcribing/alignment; Replace True/False with 1/0; E.g. --transcribe_option "ignore_compatibility=1"
* type: str
* usage: --transcribe_option "<option>"

# [save_option]
* description: Extra option(s) to use for text outputs; Replace True/False with 1/0; E.g. --save_option "highlight_color=ffffff"
* type: str
* usage: --save_option "<option>"

Filter Hallucination

Filter record / file import

Whether to enable postprocessing of the result by filtering it.

Filter path

The path to the filter file (.json) containing all the filters in the different languages supported by Whisper. A base filter file is provided by default; users can customize it if they want.

Ignore Punctuations

Punctuation to ignore when filtering. (Default "',.?!)

Strip

Whether to strip any surrounding spaces when filtering. (Default is checked)

Case sensitive

Whether the case of the string needs to match. (Default is unchecked)

Similarity rate

Similarity rate to use when not using "Exact match" for comparing the filter and the result in the segment. (Default is 0.75)

Exact match

Whether the string needs to be exactly the same as the result in the segment to be removed. (Default is unchecked for record and checked for file import)
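Put together, these options amount to a comparison roughly like the following sketch (using Python's difflib for the similarity ratio; the function name and exact normalization order are assumptions, not the app's actual code):

```python
from difflib import SequenceMatcher

def matches_filter(result, filt, similarity_rate=0.75, exact=False,
                   strip=True, case_sensitive=False, ignore="\"',.?!"):
    """Return True if a segment should be filtered out as a hallucination."""
    if strip:
        result, filt = result.strip(), filt.strip()
    if not case_sensitive:
        result, filt = result.lower(), filt.lower()
    table = str.maketrans("", "", ignore)  # drop ignored punctuation
    result, filt = result.translate(table), filt.translate(table)
    if exact:
        return result == filt
    return SequenceMatcher(None, filt, result).ratio() >= similarity_rate

print(matches_filter(" Thanks for watching! ", "thanks for watching"))  # True
```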

📌 File Export Options


Mode

Mode

Set the mode for export. You can choose between the following:

  • Segment level
  • Word level

segment_level=True + word_level=True

00:00:07.760 --> 00:00:09.900
But<00:00:07.860> when<00:00:08.040> you<00:00:08.280> arrived<00:00:08.580> at<00:00:08.800> that<00:00:09.000> distant<00:00:09.400> world,

segment_level=True + word_level=False

00:00:07.760 --> 00:00:09.900
But when you arrived at that distant world,

segment_level=False + word_level=True

00:00:07.760 --> 00:00:07.860
But

00:00:07.860 --> 00:00:08.040
when

00:00:08.040 --> 00:00:08.280
you

00:00:08.280 --> 00:00:08.580
arrived

...

Export to

You can choose between the following:

  • Text
  • Json
  • SRT
  • ASS
  • VTT
  • TSV
  • CSV

It is recommended to always keep the JSON output enabled in case you want to further modify the results with the Tool menu in the main window.

Visualize Supression

Whether to visualize which parts of the audio will likely be suppressed (i.e. marked as silent).

Export folder

Set the export folder location

Auto open

Whether to automatically open the export folder after a file import.

Result Modification

Remove repetition

Whether to remove words that repeat consecutively.

Example 1: "This is is is a test." -> "This is a test." If you set max words to 1, it will remove the last two "is".

Example 2: "This is is is a test this is a test." -> "This is a test." If you set max words to 4, it will remove the second "is" and third "is", then remove the last "this is a test", because it consists of 4 words and the max words is 4.
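A simplified sketch of this idea (ignoring the case and punctuation handling the real filter would need for Example 2 to work exactly as shown):

```python
def collapse_ngrams(words, n):
    """Drop any n-gram that immediately repeats the previous n words."""
    out = []
    i = 0
    while i < len(words):
        if len(out) >= n and out[-n:] == words[i:i + n]:
            i += n  # skip the consecutive repeat
        else:
            out.append(words[i])
            i += 1
    return out

def remove_repetition(text, max_words=1):
    words = text.split()
    for n in range(max_words, 0, -1):  # longest phrases first
        words = collapse_ngrams(words, n)
    return " ".join(words)

print(remove_repetition("This is is is a test."))  # This is a test.
```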

Limit Per Segment

Max Words

Set the maximum number of words allowed in each segment. (Default unset)

Max Chars

Set the maximum number of characters allowed in each segment. (Default unset)

Separate Method

Whether to use a line break or to split into separate segments at split points. (Default Split)

Even Split

Whether to evenly split a segment in length if it exceeds max_chars or max_words.

Naming Format

Slice file start

Amount to slice the filename from the start

Slice file end

Amount to slice the filename from the end

Export format

Set the filename export format. It is recommended to always have one of the task formats set, because without it files might get mixed up and could be overwritten. The following are the options for the export format:

Default value: %Y-%m-%d %f {file}/{task-lang}
To folderize the result you can use / in the format. Example: {file}/{task-lang-with}

Available parameters:

----- Parameters that can be used in any situation -----

{strftime format such as %Y %m %d %H %M %f ...}
To see the full list of strftime format, see https://strftime.org/

{file}
Will be replaced with the file name

{lang-source}
Will be replaced with the source language if available. 
Example: english

{lang-target}
Will be replaced with the target language if available. 
Example: french

{transcribe-with}
Will be replaced with the transcription model name if available. 
Example: tiny

{translate-with}
Will be replaced with the translation engine name if available. 
Example: google translate

----------- Parameters only related to task ------------

{task}
Will be replaced with the task name. 
Example: transcribed or translated

{task-lang}
Will be replaced with the task name alongside the language. 
Example: transcribed english or translated english to french

{task-with}
Will be replaced with the task name alongside the model or engine name. 
Example: transcribed with tiny or translated with google translate

{task-lang-with}
Will be replaced with the task name alongside the language and model or engine name. 
Example: transcribed english with tiny or translated english to french with google translate

{task-short}
Will be replaced with the shortened task name. 
Example: tc or tl

{task-short-lang}
Will be replaced with the shortened task name alongside the language. 
Example: tc english or tl english to french

{task-short-with}
Will be replaced with the shortened task name alongside the model or engine name. 
Example: tc tiny or tl google translate

{task-short-lang-with}
Will be replaced with the shortened task name alongside the language and model or engine name. 
Example: tc english with tiny or tl english to french with google translate
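For illustration, expanding the default format could look like this sketch (a hypothetical helper, not the app's code; only {file} and {task-lang} are substituted here):

```python
from datetime import datetime

def render_filename(fmt, file, task_lang):
    """Substitute the app's placeholders, then let strftime expand %Y, %m, %f, ..."""
    fmt = fmt.replace("{file}", file).replace("{task-lang}", task_lang)
    return datetime.now().strftime(fmt)

# Default format: "%Y-%m-%d %f {file}/{task-lang}"
print(render_filename("%Y-%m-%d %f {file}/{task-lang}",
                      "interview", "transcribed english"))
```

The `/` in the result folderizes the output, as described above.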

📌 Translate Options


Options

Proxies List

HTTPS

Set the proxies list for HTTPS. Each proxy is separated by a new line, tab, or space.

HTTP

Set the proxies list for HTTP. Each proxy is separated by a new line, tab, or space.

Libre Translate Setting

Host

Set the host for LibreTranslate. Examples:

  • If you are hosting it locally you can set it to http://127.0.0.1:5000.
  • If you are using the official instance you can set it to https://libretranslate.com

API Key

Set the API key for LibreTranslate.

Supress empty API key warning

Whether to suppress the warning if the API key is empty.

📌 Textbox Options


Each Window Textbox

Max Length

Set the max character shown in the textbox.

Max Per Line

Set the max character shown per line in the textbox.

Font

Set the font for the textbox.

Colorize text based on confidence value when available

Whether to colorize the text based on the confidence value when available. (Default checked)

Auto scroll

Whether to automatically scroll to the bottom when new text is added.

Other

Confidence Setting

Low Confidence Color

Set the color for low confidence value. (Default #ff0000)

High Confidence Color

Set the color for high confidence value. (Default #00ff00)

Colorize per

Set what the colorization is applied per. You can choose between the following:

  • Segment
  • Word

You can only choose one of the options.
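For example, colors between the two endpoints could be obtained by linearly interpolating on the confidence value (an illustrative sketch; the app's exact mapping may differ):

```python
def confidence_color(low="#ff0000", high="#00ff00", confidence=0.5):
    """Linearly interpolate between the low- and high-confidence colors."""
    lo = [int(low[i:i + 2], 16) for i in (1, 3, 5)]   # parse "#rrggbb"
    hi = [int(high[i:i + 2], 16) for i in (1, 3, 5)]
    mixed = [round(l + (h - l) * confidence) for l, h in zip(lo, hi)]
    return "#{:02x}{:02x}{:02x}".format(*mixed)

print(confidence_color(confidence=1.0))  # #00ff00  (fully confident)
print(confidence_color(confidence=0.0))  # #ff0000  (no confidence)
```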
