LLM Voice Assistant

Framework for running Ollama or GPT4All models with voice recognition and text-to-speech output. Designed to work completely offline and even on a Raspberry Pi.

STTTS (speech-to-text-to-speech) is a voice assistant framework for running large language models with voice recognition input and speech synthesis output. Several approaches come already implemented and only need to be plugged together using a single configuration file. The internal multi-processing interface is simple, bases merely on bare text or raw PCM data, and scales down even to a Raspberry Pi.

As the use-case is to run completely offline, STTTS only involves projects that can work with (automatically) downloaded local models:

STT speech-to-text recognizers: Vosk or OpenAI Whisper
LLM large-language-model processors: GPT4All or Ollama
TTS text-to-speech synthesizers: eSpeak or Coqui
Audio I/O: ALSA, PulseAudio, or PyAudio

Mode of Operation

In full pipeline mode, STTTS manages three independent Python processes, one for STT, LLM, and TTS each. This decouples dependent library imports and leverages individual process scheduling for performance. As communication merely consists of prompt and token text – i.e. no audio, there is little IPC overhead through the pipe-backed queues. In single-process “CLI” mode, each part can also be directly invoked with an interactive prompt session.

Audio data is internally mostly handled as 16bit little-endian integer mono PCM buffers, which all involved i/o libraries seem to agree on. Certain processing is eased by or needs a conversion to single-precision floating point numpy arrays, though. The sampling rate depends on the chosen recognizer or synthesizer, does not require resampling, and typically is 16000 or 22050 Hz.

--+-------------------|-------+---------------------------+---+-------------------------+--
Q |                   v       |                           | Q |                         | Q
+-+---[Start]--->AudioSource  |       [Token/Feedback]--->+-+-+----->State---[MSG]----->+-+
|                  |     ^    |            |   |            |                    v        |
|              [PCM]     |    |            |   |            |               OutputFilter  |
|                  v     |    |            |   |            |                 |      |    |
|       SpeechSegmenter  |    |      Processor |            |       [Utterance]      |    |
|              |         |    |            ^   |            |                 v      |    |
|          [PCM]    [Stop]    |            |   |            |   SentenceSegmenter    |    |
|              v         |    |            |   |            |        |           [PCM]    |
|          Recognizer    |    |     [Prompt]   |            |        [Sentence]      |    |
|              |         |    |            |   |            |        v               |    |
|            State-------+    |            |   |            |   Synthesizer          |    |
|              |              |            |   |            |        |               v    |
|     [Keyword/Utterance]-->+-+-+--------->State            |        [PCM]---->AudioSink  |
|STT                        | Q |                        LLM|TTS                   |      |
+---------------------------+---+---------------------------+----------------------v------+

Usage

The full voice-activated pipeline or individual CLI prompts can be run by providing a configuration file in YAML format, see below for examples and further details.

usage: sttts [-h] --config YAML [--log-level LEVEL] [--cli MODE]

Framework for running Ollama or GPT4All models with voice recognition and text-to-speech output.

options:
  --config YAML      configuration file
  --log-level LEVEL  logging level (DEBUG, INFO, WARNING, or ERROR) (default: INFO)
  --cli MODE         run certain cli instead of full pipeline (stt, llm, or tts) (default: None)

For initial startup, tweaking, debugging, or simply as quick LLM prompt interface, the single-process modes are recommended:

stt activates the configured audio source with speech recognition and continuously prints transcribed utterances.
llm runs the configured model from within an interactive prompt/reply session.
tts can play input text from a prompt by using the configured synthesizer and audio sink.

Note that at least audio settings and the local models to use must be provided. Convenient automatic model download is supported but opt-in. Part of the audio configuration is also voice activity detection, with thresholds that might need to be adjusted according to the ambient recording situation.

Interaction with the voice assistant bases on certain keywords, which can be adjusted at will for the language in use:

start: Initial hotword, start listening and further voice recognition.
reset: Cancel the current operation, discard the pending utterance, and wait for start again.
commit: Finish the current prompt utterance, stop listening, and run the model with synthesized speech playback. Start listening hotwords again afterwards.
stop: Hotword to initiate overall exit.

Example 1: Desktop

A decent Linux desktop with a dated graphics card (i5-12600K, 32GB DDR5, NVMe SSD, GTX 1050 Ti 4GB) should be able to run all the even advanced models.

make EXTRA=pulse,rosa,whisper,gpt4all,coqui clean install
./venv/bin/sttts --config config.yaml

Note that this might need corresponding native system libraries, too. The configuration file for this local “virtual environment” installation could look like:

---
source: pulse
sink: pulse

speech_segmenter: band
recognizer: whisper
processor: gpt4all
synthesizer: coqui

band_speech_segmenter:
  threshold: 1.0

whisper_recognizer:
  model_name: "small.en"
  download: true

gpt4all_processor:
  model_name: "Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf"
  download: true

coqui_synthesizer:
  model_name: "tts_models/en/ljspeech/tacotron2-DDC"
  download: true

For the first run that requires model downloads and possibly verifying audio settings, the single-process CLI modes are recommended. When running, after the signal tone, utterances are accepted by using the default hotwords:

“start” <pause> “Hello, there!” <pause> “commit” <wait> “stop” (or Ctrl^C)

In total, running these models can be done while occupying ~8GB RAM plus GPU memory. The logged LLM’s tokens-per-second rate should not be critical as long as it outpaces the speech synthesis, which only consumes very few tokens/second. Ideally, the TTS process should however be able to generate more than one second audio per second.

Example 2: Raspberry Pi

In contrast to the first example, everything can also be scaled down to run even on a Raspberry Pi – with certain compromises. Also, this experiment will use German as non-English language use-case. Ingredients:

Raspberry Pi 4B 4GB
Reasonable active or passive cooling solution
Cheap USB microphone dongle
3.5mm jack speakers or headphones
Ubuntu Noble 24.04 LTS (runs with some cleanup at ~100MB)

For supporting alsa and espeak on a fresh system, installing the corresponding libraries using apt-get might be required. A local Ollama binary and virtual environment for dependencies can be created by:

curl -L https://ollama.com/download/ollama-linux-arm64.tgz -o ollama-linux-arm64.tgz && tar xzf ollama-linux-arm64.tgz && ./bin/ollama serve --help
make EXTRA=alsa,vosk,ollama,espeak clean install

Again, using the individual CLI modes for model downloads and audio settings might be useful, otherwise start with:

./venv/bin/sttts --config config.yaml

A configuration file with StableLM2 for German, but “only” eSpeak could look like:

---
source: alsa
sink: alsa

alsa_source:
  device: "default:CARD=Device"
alsa_sink:
  device: "default:CARD=Headphones"

speech_segmenter: simple
recognizer: vosk
processor: ollama
synthesizer: espeak

simple_speech_segmenter:
  threshold: 0.02

vosk_recognizer:
  model_name: "vosk-model-small-de-0.15"
  download: true

ollama_processor:
  model_name: "stablelm2:1.6b"
  download: true
  device: cpu
  serve: true
  serve_exe: "./ollama"
  system_prompt: "Du bist ein hilfreicher Assistent."
  num_ctx: 512

espeak_synthesizer:
  model_name: "german"

keywords:
  start: "start"
  reset: "korrektur"
  commit: "los"
  stop: "ende"

This setup should run a bit too slow but relatively robust with ~2GB memory usage overall. Additionally enabling Coqui TTS with tts_models/de/css10/vits-neon seems pretty risky in this regard – while barely reaching 1sec/sec synthesizer speed.

Example 3: Raspberry Pi

As second example on a Raspberry Pi, when downgrading the model to tinyllama, there’s enough headroom to experiment with nicer text-to-speech synthesis in English. Then, during installation, the coqui extra should be added or used instead of espeak. (But see the Python version caveat below.)

---
source: alsa
sink: alsa

alsa_source:
  device: "default:CARD=Device"
alsa_sink:
  device: "default:CARD=Headphones"

speech_segmenter: simple
recognizer: vosk
processor: ollama
synthesizer: coqui 

simple_speech_segmenter:
  threshold: 0.02

vosk_recognizer:
  model_name: "vosk-model-small-en-us-0.15" 
  download: true

ollama_processor:
  model_name: "tinyllama"
  download: true
  device: cpu
  serve: true
  serve_exe: "./ollama"

coqui_synthesizer:
  model_name: "tts_models/en/ljspeech/glow-tts"
  download: true
  device: cpu

This setup should also run a bit too slow but relatively robust cpu-bound with ~2GB memory usage overall.

Excourse: Backporting Python in Ubuntu 24.04

At the time of writing, Coqui TSS requires Python below the most recent 3.12 version, which is shipped with Ubuntu 24.04. The official repositories do not provide other major versions, but the deadsnakes PPA allows to install additional Python environments, which can co-exist.

echo 'deb https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu noble main' | \
    sudo tee /etc/apt/sources.list.d/deadsnakes.list
curl -o - 'https://keyserver.ubuntu.com/pks/lookup?op=get&search=0xf23c5a6cf475977595c89f51ba6932366a755776' | \
    sudo tee /etc/apt/trusted.gpg.d/deadsnakes.asc
sudo apt-get update
sudo apt-get install python3.11-minimal python3.11-venv python3.11-dev

For such cases, the Makefile accepts a Python interpreter argument to be used in the virtual environment lateron.

make PYTHON=python3.11 EXTRA=alsa,vosk,ollama,coqui clean install

When running into SIGILL on a Raspberry Pi for torch 2.4.0, sticking its version to torch==2.3.1 in requirements.in might be needed.

Installation

In order to make the large amount of dependencies manageable, STTTS makes use of extras to opt-in certain implementations. Available tags are:

alsa, pulse, pyaudio,
whisper, vosk,
gpt4all, ollama,
espeak, coqui,
sphinx, sbd,
all of the above
dev for linting and docs

Either full or requirements-only installation inside a virtual environment is wrapped by the Makefile, which accepts PYTHON and EXTRA arguments. For example:

make EXTRA=pulse,vosk,ollama,coqui install # create and install in venv -> ./venv/bin/sttts
pip install .[pulse,vosk,ollama,coqui] # install system- or user-wide -> sttts or python3 -m sttts
make PYTHON=python3.10 EXTRA=dev,all deps # create and install dependencies in venv -> ./venv/bin/python3 -m sttts

Note that several packages also require native system libraries (as documented below), which need to be installed separately beforehand. Further make targets apart from deps and install are clean, check, and docs.

Configuration Reference

A single configuration file in YAML format contains all relevant settings. Technically, the following fields are required:

source: alsa, pulse, pyaudio, or wave
sink: alsa, pulse, pyaudio, or wave
recognizer: sphinx, vosk, or whisper
processor: noop, ollama, or gpt4all
synthesizer: espeak or coqui

Corresponding to these choices, there are respective own config sections, e.g., for choosing and downloading the model to use. Similarly, but optional, as defaults exist:

speech_segmenter: simple, median, band, or sphinx, default simple (which might need adjusting its threshold)
sentence_segmenter: split or sbd, default split
feedback: noop, speech, or beep, default beep

Other global configuration objects possibly of interest are:

keywords: hotwords for start, reset, commit, or stop
logging

A more structured, code-based read-the-docs style sphinx HTML documentation can be generated with:

make EXTRA=dev clean docs

Audio I/O

Audio sources provide the input for speech recognition as interface for interaction, typically a microphone. Must be configured by source (alsa, pulse, pyaudio, wave) and the corresponding per-class objects alsa_source, pulse_source, pyaudio_source, or wave_source.

Sinks receive the text-to-speech audio data as generated by the synthesizer, typically speaker or headphone audio devices. Must be configured by sink (alsa, pulse, pyaudio, wave), and the per-class alsa_sink, pulse_sink, pyaudio_sink, or wave_sink objects, respectively.

Alsa

ALSA listening source and playback sink, using PyAlsaAudio. This should be the most low-level sound system implementation available per default for example even on a Raspberry Pi.

For installation, the corresponding native library and headers are needed, such as from the libasound2-dev package. For configuration, the amixer, aplay, and arecord CLI commands from the alsa-utils package might be useful, too. Also note that the invoking user must typically be in the audio group.

Source: Alsa

str device: Capture PCM to use, as obtained by arecord -L, for example default:CARD=Device. Default default.
float buffer_length: Adjust read size in seconds, default 250ms.
int warmup: Skip the first reads, in case of microphone auto-gaining, default 4, thus 1 second.
int periods: ALSA periods.
kwargs: Extra options passed to alsaaudio.PCM.

Sink: Alsa

str device: Playback PCM to use, as obtained by aplay -L, for example default:CARD=Headphones. Default default.
float buffer_length: Output buffer length in seconds, default 5. Generous to unblock the synthesizer running in parallel.
int period_size: ALSA period size in frames.
kwargs: Extra options passed to alsaaudio.PCM.

Pulse

Listening source and playback sink for PulseAudio servers, using PaSimple, that in turn requires the native libpulse-simple.so.0 library. This sound server implementation should be in use per default on various Linux desktop distributions.

Source: Pulse

str device: Recording device to use, none for default.
float buffer_length: Adjust read size in seconds, the default 250ms.
int warmup: Skip the first reads, in case of microphone auto-gaining, default 4, thus 1 second.
kwargs: Extra options passed to PaSimple.

Sink: Pulse

str device: Playback device to use, none for default.
int buffer_length: Output buffer length in seconds, default 5. Generous to unblock the synthesizer running in parallel.
kwargs: Extra options passed to PaSimple.

PyAudio

Listening source and playback sink using PyAudio, which relies on the cross-platform PortAudio library.

Building requires the portaudio19-dev package or similar. At the time of writing, on Ubuntu 22.04, this conflicts with jackd and the pre-built python3-pyaudio binary package 0.2.11 is broken for Python 3.10. Also, problems might arise for non-default microphone sampling rates using ALSA.

This option provides the most high-level abstraction and compatibility if neither ALSA nor PulseAudio is supported.

Source: PyAudio

str device: Recording device to use such as USB PnP Sound Device: Audio (hw:1,0), none for default. If invalid, error out with a list of available devices.
int buffer_length: Read size in seconds, default 250ms.
kwargs: Extra options passed to pyaudio.Stream.

Sink: PyAudio

str device: Playback device to use, none for default, for example bcm2835 Headphones: – (hw:0,0). If invalid, error out with a list of available devices.
int buffer_length: Requested output buffer length in seconds, default 5. Note that the actually applied buffer size might be lower.
kwargs: Extra options passed to pyaudio.Stream.

Wave

Mostly for debugging purposes, audio can be directly read from or written to S16LE mono *.wav files, as configured by filename. When reading, the sample rate must match the internally chosen one.

Speech Segmenters

Not all voice recognition implementations provide support for a fully streamed operation, i.e., are able to continuously receive audio frames, detect silence or activity, and transcribe speech on-the-fly. Thus, this explicit and exchangeable pre-processing step monitors input audio and yields buffers that contain a whole utterance, as separated by short breaks of silence.

As this particular aspect of the pipeline largely depends on environmental conditions, choosing an implementation and its config might need some trial-and-error approach.

Configured by speech_segmenter (simple as default, median, band, sphinx) and the corresponding per-class objects simple_speech_segmenter, median_speech_segmenter, band_speech_segmenter, or sphinx_speech_segmenter.

The speech_buffer_limit (30.0) and speech_pause_limit (30.0) configuration values limit the allowed utterance and silence durations. This should prevent excessive buffering in case of mis-detected spurious speech activity detection or “start” keyword.

Segmenter: Simple

Determine silence/speech audio by a simple absolute RMS/volume threshold, which can require tweaking and good recording environments.

float frames: Length of the sliding look behind window in seconds (2.0).
float threshold: RMS threshold, smaller values will be considered silent. Default 0.2.

Segmenter: Median

Determine silence/speech audio by comparing the RMS with the median (percentile) energy. Idea: If the median is smaller than the average, there’s peaks, i.e., a flat noise/silence distribution.

This simple method should be self-adaptive wrt. background noise to automatically detect volume outliers. The calculation is applied to a sliding window of past audio frames, with a change from speech to silence leading to returning the buffered utterance as a whole.

float frames: Length of the sliding look behind window in seconds (2.0).
int percentile: Percentile that is compared with the RMS energy, for example 50 for median (default).
float threshold: Percentile by RMS factor, greater will be considered as silence. Default 0.5.

Segmenter: Band

Use the librosa STFT FFT implementation as simple band-pass filter. The average contribution of typical voice frequencies is compared against other frequencies in the spectrum. This gives a voice-vs-noise estimate, with a configurable threshold.

float frames: Length of the sliding look behind window in seconds (1.0). This also directly influences the possible FFT resolution.
float threshold: Average voice frequency compared to other frequencies, default 1.0.
int freq_start: Lower band-pass, where the human voice typically starts (256).
int freq_end: Upper band-pass, where the human voice typically ends (4096).

Segmenter: Sphinx

Use PocketSphinx Endpointer for VAD voice activity detection, similar to the basic Segmenter.

int mode: Aggressiveness of voice activity detection (0-3, loose-strict, default 0).
float window: Length in seconds of window for decision (0.3).
float ratio: Fraction of window that must be speech or non-speech to make a transition (0.9).
kwargs: Extra options passed to pocketsphinx.Endpointer.

STT Recognizers

Voice recognizers transcribe speech from audio buffers. Apart from generic prompts, utterances that only consist of a single keyword are detected. By the keywords object, alternate hotwords for start, reset, commit, or stop can be configured.

Must be configured by recognizer (sphinx, vosk, whisper) and the corresponding per-class sphinx_recognizer, vosk_recognizer, or whisper_recognizer objects.

Recognizer: Sphinx

Use the Python bindings of the PocketSphinx speech recognizer package. Direct pocketsphinx.Decoder access is possible without the oversimplified wrappers for audio or live speech, as the SphinxSegmenter implements a pocketsphinx.Endpointer beforehand.

Generic transcription capabilities seem to be rather poor for nowadays standards, making it more suitable for specific speech detection on a limited dictionary. More recent models than the en-us one that the package ships with might be available on the PocketSphinx project page or from SpeechRecognition.

str model_path: Common base path for the hmm, lm, and dct arguments. Default to use the en-us model that ships with pocketsphinx.
str hmm: Sub-path to the directory containing acoustic model files, such as acoustic-model.
str lm: Sub-path to the N-Gram language model, such as language-model.lm.bin.
str dct: Sub-path to the pronunciation dictionary, such as pronounciation-dictionary.dict.
kwargs: Extra Config parameters passed to Decoder.

Recognizer: Vosk

Use the Vosk speech recognition toolkit, with a wide range of models available. This implementation seems to provide good detection capabilities that also runs on low-end hardware.

str model_name: Model to use, for example vosk-model-small-en-us-0.15. Omitted to list available models if download is enabled.
bool download: Opt-in model search and automatic download. Otherwise, ensure the model exists beforehand.
int sample_rate: Accepted input sampling rate, default 16000. Might be changed if 16K is not supported by the input recording device.

Recognizer: Whisper

Audio transcription using the OpenAI Whisper speech recognition models, which can be problematic on low-end hardware.

str model_name: Name of the model to use, for example base.en. Omitted to list available ones.
str language: Indicate input language, default en. Especially important for multilingual models.
bool download: Opt-in automatic downloading of models to ~/.cache/whisper/. Otherwise, ensure the model exists beforehand.
str device: torch device to use, default cuda if available, otherwise cpu.
kwargs: Extra arguments passed to whisper.Whisper.transcribe().

LLM Processors

Processors are the core functionality, formed by LLMs, which receive transcribed prompts and yield tokens to be synthesized to speech output. Must be configured by processor (noop, ollama, gpt4all) and the corresponding ollama_processor or gpt4all_processor objects, respectively.

Processor: GPT4All

Run language models through the GPT4All Python client around the Nomic and/or llama.cpp backends.

str model_name: Model to use, for example Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf. Omitted to list remote models if download is enabled.
str model_path: Path for model files, default ~/.cache/gpt4all/.
bool download: Opt-in model search and automatic download.
str device: Explicit processing unit, such as cpu, automatic per default.
int max_tokens: The maximum number of tokens to generate (200).
int n_ctx: Maximum size of context window (2048).
str system_prompt: Override initial instruction for the model.
kwargs: Extra options passed to GPT4All.generate().

Processor: Ollama

Run language models on a remote or locally started Ollama server, using the client provided by the ollama package.

If no installation as system daemon is needed, the self-contained binary can simply be downloaded, for example from https://ollama.com/download/ollama-linux-amd64.tgz.

str model_name: Model to use, for example llama3. Omitted to list locally available ones. Remotely available models can be browsed in the official model library.
str host: API host, default 127.0.0.1:11434.
bool download: Opt-in automatic model pull, usually to ~/.ollama/models/.
bool serve: Run ollama serve in an own subprocess.
str serve_exe: Path to local binary when using internal serving instead of ollama.
dict serve_env: Extra environment variables when using internal serving, see ollama serve --help.
str device: Disable CUDA when using internal serving and set to cpu.
str system_prompt: Override system message from what is defined in the Modelfile.
kwargs: Extra ollama.Options passed to ollama.Client.generate().

Sentence Segmenters

Not all synthesizers support a streaming operation, i.e., are able to continuously receive text/token input while yielding internally buffered chunks of audio. In its simplest form, sentence segmenters thus combine and flush tokens until certain boundaries are found, for example full stop periods. By this means, playback can start as soon as the first sentence is available, while further tokens and synthesized output is still generated in the background.

Configured by sentence_segmenter (split as default, sbd) and the per-class config objects split_sentence_segmenter, or sbd_sentence_segmenter, respectively.

Segmenter: Split

Split streamed text into sentences by applying a simple expression that recognizes newlines or certain punctuation characters followed by space.

str delimiter_chars: Characters that end sentences if followed by a space, default .!！?？:：;.

Segmenter: Boundary

Use the pySBD module for sentence boundary disambiguation.

str language: Implementation to use, default en.

Output Filters

In a post-processing step, output filters can opt to add either further text or PCM to be played. For example, beep sounds can indicate readiness or end of output. Configured by feedback (noop, speech, beep as default).

TTS Synthesizers

As actual text-to-speech implementation, synthesizers receive tokens/sentences and yield audio buffer streams. Must be configured by synthesizer (espeak, coqui) and the per-class espeak_synthesizer or coqui_synthesizer objects, respectively.

Synthesizer: Espeak

Speech synthesizer using the eSpeak bindings from pyttsx3. Only the actual C library wrappers are directly used, bypassing the provided loop and ffmpeg-based PCM output.

As dependency, usually the espeak (or at least libespeak1) package need to be installed beforehand, this usually also makes a wide range of languages (voices) is available. Results are understandable but typically sound rather mechanical than natural by today’s standards. However, it is a viable alternative that runs even on low-end hardware.

str model_name: Voice name, for example default or english-us. Omitted to list available ones.
str model_path: Directory which contains the espeak-data directory, omitted for default location.
float buffer_length: Length in seconds of sound buffers that are passed to the callback (0.25).

Synthesizer: Coqui

Use the TTS text-to-speech library from Coqui. Originally forked from Mozilla, both seem to be discontinued, though. A wide range of natural sounding models is available, some examples can be found at Coqui-TTS Voice Samples.

Comes with lots of additional dependencies, such as espeak, ffmpeg, libav, or rustc.

str model_name: Model to use, in the format type/language/dataset/model with tts_models type. For example tts_models/en/ljspeech/tacotron2-DDC. Omitted to list available models.
bool download: Opt-in automatic downloading of models to ~/.local/share/tts/. Otherwise, ensure the model exists beforehand.
str device: Device to use, default cuda if available, otherwise cpu.
kwargs: Extra options passed to TTS.tts().

Logging

Some libraries show the bad habit of directly using print statements. STTTS tries to unify this to a certain extent by intercepting warnings and standard streams, forwarding to the logging subsystem instead.

The overall log level is set by the command line. For more complex scenarios, or simply for excluding too verbose loggers when using DEBUG, a logging config can be provided, which then gets applied by each process instead, such as:

logging:
  version: 1
  formatters:
    colored:
      (): colorlog.ColoredFormatter
      format: "%(asctime)s %(log_color)s%(levelname)-8s%(reset)s %(name)s: %(message)s"
      datefmt: '%Y-%m-%d %H:%M:%S'
  handlers:
    console:
      class: logging.StreamHandler
      formatter: colored
  loggers:
    root:
      handlers:
        - console
    asyncio:
      level: INFO
    numba:
      level: INFO
    torio:
      level: INFO