public final class WakewordTrigger extends Object implements SpeechProcessor
WakewordTrigger is a speech pipeline component that provides wakeword detection for activating downstream components. It uses a TensorFlow Lite binary classifier to detect keyword phrases. Once a wakeword phrase is detected, the pipeline is activated. The pipeline remains active until the user stops talking or the activation timeout is reached.
The incoming raw audio signal is first normalized and then converted to the magnitude Short-Time Fourier Transform (STFT) representation over a hopped sliding window. This linear spectrogram is then converted to a mel spectrogram via a "filter" TensorFlow model. The resulting mel frames are batched together into a sliding window.
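The hopped, windowed framing that precedes the STFT can be sketched as follows. This is a minimal illustration, not the component's actual implementation: the class name HannFraming and its methods are hypothetical, and the symmetric Hann formula is an assumption (the real stage may use a periodic variant). The window size and hop correspond loosely to the fft-window-size and fft-hop-length properties.

```java
// Hypothetical helper, for illustration only; not part of the documented API.
final class HannFraming {
    /** Returns a symmetric Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / (size - 1)). */
    static double[] hann(int size) {
        if (size < 2) {
            throw new IllegalArgumentException("window size must be at least 2");
        }
        double[] w = new double[size];
        for (int n = 0; n < size; n++) {
            w[n] = 0.5 - 0.5 * Math.cos(2 * Math.PI * n / (size - 1));
        }
        return w;
    }

    /** Splits samples into overlapping frames, advancing by hop, and applies the window. */
    static double[][] frames(double[] samples, int windowSize, int hop) {
        if (hop < 1) {
            throw new IllegalArgumentException("hop must be positive");
        }
        if (samples.length < windowSize) {
            return new double[0][];
        }
        int count = (samples.length - windowSize) / hop + 1;
        double[] w = hann(windowSize);
        double[][] out = new double[count][windowSize];
        for (int f = 0; f < count; f++) {
            for (int n = 0; n < windowSize; n++) {
                out[f][n] = samples[f * hop + n] * w[n];
            }
        }
        return out;
    }
}
```

Each windowed frame would then be passed through an FFT, with the magnitude of each bin forming one column of the linear spectrogram.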
The mel spectrogram represents the features to be passed to the autoregressive encoder (usually an RNN or CRNN), which is implemented in an "encode" TensorFlow model. This encoder outputs an encoded vector and a state vector. The encoded vectors are batched together into a sliding window for classification, and the state vector is fed back into the encoder on the next step to perform the running autoregressive transduction over the mel frames.
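The sliding windows used for the mel frames and the encoded vectors can be pictured as a fixed-length queue of fixed-width frames. The sketch below is an assumption about the general technique, not the component's internal data structure; SlidingWindow and its methods are hypothetical names, and the length and width parameters only loosely mirror properties such as mel-frame-length and mel-frame-width.

```java
import java.util.ArrayDeque;

// Hypothetical fixed-length window of frames; illustration only.
final class SlidingWindow {
    private final int length;  // maximum number of frames retained
    private final int width;   // number of values in each frame
    private final ArrayDeque<float[]> frames = new ArrayDeque<>();

    SlidingWindow(int length, int width) {
        this.length = length;
        this.width = width;
    }

    /** Appends a frame, evicting the oldest one once the window is full. */
    void push(float[] frame) {
        if (frame.length != width) {
            throw new IllegalArgumentException("unexpected frame width");
        }
        if (frames.size() == length) {
            frames.removeFirst();
        }
        frames.addLast(frame.clone());
    }

    boolean isFull() {
        return frames.size() == length;
    }

    /** Copies the retained frames, oldest first, into one flat model input. */
    float[] flatten() {
        float[] out = new float[frames.size() * width];
        int offset = 0;
        for (float[] frame : frames) {
            System.arraycopy(frame, 0, out, offset, width);
            offset += width;
        }
        return out;
    }
}
```

A production implementation would more likely use a preallocated ring buffer to avoid per-frame allocation; the queue is used here only for clarity.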
The "detect" TensorFlow model takes the encoded sliding window and outputs a single posterior value in the range [0, 1]. Values closer to 1 indicate a detected keyword phrase; values closer to 0 indicate non-keyword speech. This classifier is commonly implemented as an attention mechanism over the encoder window.
The detector's output is then compared against a configured threshold to determine whether to activate the pipeline. If the posterior is greater than the threshold, the pipeline is activated.
Activations have configurable minimum and maximum lengths. The minimum length prevents the activation from being aborted if the user pauses after saying the wakeword (which untriggers the VAD). The maximum length allows the activation to time out if the user doesn't say anything after saying the wakeword.
The wakeword detector can be used in a multi-turn dialogue system. In such an environment, the user is not expected to say the wakeword during each turn. Therefore, an application can manually activate the pipeline by calling setActive (after a system turn), and the wakeword detector will apply its minimum/maximum activation lengths to control the duration of the activation.
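The threshold comparison and the minimum/maximum activation lengths described above can be sketched as a small state machine. Everything here is hypothetical and for illustration only: the class ActivationSketch, its method names, and the choice to measure lengths in frames are assumptions, not the component's actual implementation.

```java
// Hypothetical activation logic; illustration only.
final class ActivationSketch {
    private final float threshold;
    private final int minFrames;  // activation cannot end before this many frames
    private final int maxFrames;  // activation always ends after this many frames
    private boolean active;
    private int activeFrames;

    ActivationSketch(float threshold, int minFrames, int maxFrames) {
        this.threshold = threshold;
        this.minFrames = minFrames;
        this.maxFrames = maxFrames;
    }

    /** Manual activation, e.g. after a system turn in a multi-turn dialogue. */
    void activate() {
        active = true;
        activeFrames = 0;
    }

    boolean isActive() {
        return active;
    }

    /**
     * Advances by one audio frame.
     *
     * @param posterior the detector output in [0, 1]
     * @param speech whether the VAD currently reports speech
     * @return whether the pipeline is active after this frame
     */
    boolean step(float posterior, boolean speech) {
        if (!active) {
            if (posterior > threshold) {
                activate();
            }
            return active;
        }
        activeFrames++;
        boolean speechEnded = !speech && activeFrames > minFrames;
        boolean timedOut = activeFrames > maxFrames;
        if (speechEnded || timedOut) {
            active = false;
        }
        return active;
    }
}
```

With a threshold of 0.5, a minimum of 2 frames, and a maximum of 5, a posterior of 0.9 activates the sketch; silence in the next two frames is ignored, and the third silent frame ends the activation. Calling activate() directly applies the same limits without a wakeword.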
This pipeline component supports the following configuration properties:
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_FFT_HOP_LENGTH: default fft-hop-length configuration value. |
static int | DEFAULT_FFT_WINDOW_SIZE: default fft-window-size configuration value. |
static String | DEFAULT_FFT_WINDOW_TYPE: default fft-window-type configuration value. |
static int | DEFAULT_MEL_FRAME_LENGTH: default mel-frame-length configuration value. |
static int | DEFAULT_MEL_FRAME_WIDTH: default mel-frame-width configuration value. |
static float | DEFAULT_PRE_EMPHASIS: default pre-emphasis configuration value. |
static float | DEFAULT_RMS_ALPHA: default rms-alpha configuration value. |
static float | DEFAULT_RMS_TARGET: default rms-target configuration value. |
static int | DEFAULT_WAKE_ENCODE_LENGTH: default wake-encode-length configuration value. |
static int | DEFAULT_WAKE_ENCODE_WIDTH: default wake-encode-width configuration value. |
static float | DEFAULT_WAKE_THRESHOLD: default wake-threshold configuration value. |
static String | FFT_WINDOW_TYPE_HANN: the hann fft-window-type. |
Constructor and Description |
---|
WakewordTrigger(SpeechConfig config): constructs a new trigger instance. |
WakewordTrigger(SpeechConfig config, TensorflowModel.Loader loader): constructs a new trigger instance, for testing. |
Modifier and Type | Method and Description |
---|---|
void | close(): releases resources associated with the wakeword detector. |
void | process(SpeechContext context, ByteBuffer buffer): processes a frame of audio. |
void | reset(): resets all state internal to the stage. |
public static final String FFT_WINDOW_TYPE_HANN
public static final String DEFAULT_FFT_WINDOW_TYPE
public static final float DEFAULT_RMS_TARGET
public static final float DEFAULT_RMS_ALPHA
public static final float DEFAULT_PRE_EMPHASIS
public static final int DEFAULT_FFT_WINDOW_SIZE
public static final int DEFAULT_FFT_HOP_LENGTH
public static final int DEFAULT_MEL_FRAME_LENGTH
public static final int DEFAULT_MEL_FRAME_WIDTH
public static final int DEFAULT_WAKE_ENCODE_LENGTH
public static final int DEFAULT_WAKE_ENCODE_WIDTH
public static final float DEFAULT_WAKE_THRESHOLD
public WakewordTrigger(SpeechConfig config)
Parameters: config - the pipeline configuration instance
public WakewordTrigger(SpeechConfig config, TensorflowModel.Loader loader)
Parameters: config - the pipeline configuration instance
Parameters: loader - tensorflow model loader
public void close() throws Exception
Specified by: close in interface AutoCloseable
Throws: Exception - on error
public void reset()
Description copied from interface: SpeechProcessor
Specified by: reset in interface SpeechProcessor
public void process(SpeechContext context, ByteBuffer buffer) throws Exception
Specified by: process in interface SpeechProcessor
Parameters: context - the current speech context
Parameters: buffer - the audio frame to detect
Throws: Exception - on error
Copyright © 2021. All rights reserved.