Extracting text from audio with Python

Extracting text from audio, commonly called speech-to-text (STT) or automatic speech recognition (ASR), enables applications such as:

Transcribing interviews, meetings, and podcasts
Building voice-controlled assistants
Generating subtitles for videos
Enabling accessibility features

In Python, a rich ecosystem of libraries and services allows developers to integrate STT into their projects with just a few lines of code. This article will guide you through:

Understanding the fundamentals of STT
Setting up your Python environment
Basic transcription with the SpeechRecognition library
Advanced transcription using OpenAI’s Whisper model
Offline, real‑time, and streaming transcription
Improving accuracy with preprocessing and language models
Handling large‑scale and multi‑language scenarios
Performance, privacy, and best practices

Let’s dive in!

1. Fundamentals of Speech‑to‑Text

In a traditional pipeline, STT involves two main stages:

Acoustic Modeling
Maps features extracted from the raw audio waveform to sequences of phonetic units.
Language Modeling
Turns those phonetic sequences into words and sentences, using the probabilities of word sequences to pick the most plausible transcript.

Modern deep-learning systems (e.g., OpenAI Whisper, Mozilla DeepSpeech, and the models behind Google's Speech-to-Text API) tend to combine these stages into a single end-to-end neural network.
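To make the two-stage idea concrete, here is a toy sketch of how a language model can break a tie between two transcripts that sound almost the same. All probabilities are invented for illustration, and `best_hypothesis` is a hypothetical helper, not part of any ASR library.

```python
import math

# Hypothetical acoustic scores: log-probabilities that the audio matches each
# candidate transcript. The two candidates sound nearly identical.
acoustic_log_probs = {
    "recognize speech": math.log(0.48),
    "wreck a nice beach": math.log(0.52),
}

# Hypothetical language-model scores: how plausible each word sequence is.
language_log_probs = {
    "recognize speech": math.log(0.05),
    "wreck a nice beach": math.log(0.0001),
}


def best_hypothesis(acoustic, language, lm_weight=1.0):
    """Return the candidate with the highest combined log-score."""
    return max(acoustic, key=lambda text: acoustic[text] + lm_weight * language[text])


print(best_hypothesis(acoustic_log_probs, language_log_probs))
```

With the language model switched off (`lm_weight=0.0`), the slightly higher acoustic score wins and the less plausible sentence is returned, which is the kind of error the language-modeling stage is meant to prevent.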

2. Environment Setup

Before writing code, ensure you have:

Python 3.8+ installed
A virtual environment (recommended)
Basic tools: pip, ffmpeg (for audio format conversion)

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate      # Linux/Mac
venv\Scripts\activate         # Windows

# Upgrade pip
pip install --upgrade pip

Install the primary libraries we’ll use:

pip install SpeechRecognition pydub
pip install git+https://github.com/openai/whisper.git  # Whisper; installs torch as a dependency

SpeechRecognition: A high-level wrapper supporting multiple back‑ends (Google, Sphinx, etc.).
pydub: Simplifies audio manipulation (format conversion, slicing).
whisper: OpenAI’s versatile ASR model, capable of multilingual and robust transcription.

Install FFmpeg on your system if you haven't already. pydub needs it to read and write formats other than WAV, and Whisper uses it to decode audio files:

macOS: brew install ffmpeg
Ubuntu: sudo apt-get install ffmpeg
Windows: Download static build from ffmpeg.org.
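Before running the examples, it can help to confirm that FFmpeg is actually on your PATH. The sketch below uses only the standard library; `find_missing_tools` is a hypothetical helper name introduced for this article.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(["ffmpeg"])
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```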

3. Basic Transcription with SpeechRecognition

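The SpeechRecognition library provides an easy interface. Note that sr.AudioFile reads WAV, AIFF, and FLAC files, so convert other formats (such as MP3) first: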

import speech_recognition as sr

def transcribe_with_google(audio_path: str) -> str:
    r = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = r.record(source)            # read the entire audio file
    try:
        text = r.recognize_google(audio)    # uses Google’s free web API
        return text
    except sr.UnknownValueError:
        return "[Unintelligible]"
    except sr.RequestError as e:
        return f"[API Error: {e}]"

if __name__ == "__main__":
    transcript = transcribe_with_google("interview.wav")
    print(transcript)

Pros & Cons

Pros
- Zero‑configuration with Google’s free API
- Quick prototyping
Cons
- Requires internet connection
- Suited to short clips only (roughly 60 seconds per request)
- Privacy concerns (audio sent to Google servers)
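One way to work around the audio-length limit listed above is to split a long WAV file into shorter chunks and transcribe each one separately. The following is a minimal sketch using only the standard-library wave module. The `split_wav` name, the output file naming, and the 50-second default are assumptions chosen for illustration.

```python
import os
import wave


def split_wav(input_path, output_dir, chunk_seconds=50):
    """Split a WAV file into consecutive chunks and return their paths."""
    os.makedirs(output_dir, exist_ok=True)
    chunk_paths = []
    with wave.open(input_path, "rb") as source:
        params = source.getparams()
        frames_per_chunk = int(source.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = source.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = os.path.join(output_dir, f"chunk_{index:04d}.wav")
            with wave.open(chunk_path, "wb") as chunk:
                chunk.setparams(params)   # same channels, sample width, and rate
                chunk.writeframes(frames)  # frame count in the header is fixed on close
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each returned path can then be passed to transcribe_with_google. Fixed-length cuts can split a word in half, so cutting at pauses in the audio usually gives cleaner results.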

4. Advanced Transcription Using OpenAI Whisper

Whisper is an open‑source, end‑to‑end model that runs locally:

import whisper

def transcribe_with_whisper(audio_path: str, model_size: str = "base") -> str:
    # Sizes: tiny, base, small, medium, large; weights are downloaded on first use
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return result["text"]

if __name__ == "__main__":
    text = transcribe_with_whisper("podcast.mp3", model_size="small")
    print(text)

Key Features of Whisper

Multilingual: Automatically detects language.
Robust: Works well on noisy, low‑quality audio.
Offline: No API calls; privacy‑friendly.

Note: Larger model sizes yield higher accuracy but run more slowly and need more memory (GPU VRAM or system RAM).
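Besides result["text"], the dictionary returned by model.transcribe also contains a "segments" list whose entries carry "start", "end", and "text" fields. That makes subtitle generation straightforward. The sketch below converts such segments to SRT. The helper names are hypothetical, and the sample data is hand-written in the same shape rather than taken from a real run.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Convert Whisper-style segments (start, end, text) to SRT text."""
    blocks = []
    for number, segment in enumerate(segments, start=1):
        start = format_srt_time(segment["start"])
        end = format_srt_time(segment["end"])
        blocks.append(f"{number}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Hand-written sample in the shape of result["segments"]; not real model output.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 3661.042, "text": " Let's get started."},
]
print(segments_to_srt(sample_segments))
```

In practice you would pass result["segments"] from transcribe_with_whisper's underlying call and write the returned string to a .srt file.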

5. Offline, Real‑Time, and Streaming Transcription

5.1. Offline with Vosk

Vosk is another offline toolkit:

pip install vosk sounddevice

import queue
import sounddevice as sd
from vosk import Model, KaldiRecognizer

def real_time_transcribe():
    q = queue.Queue()
    model = Model("model-en")  # download from https://alphacephei.com/vosk/models
    rec = KaldiRecognizer(model, 16000)

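    def callback(indata, frames, time, status):
        # Called by sounddevice on a separate thread for each audio block
        if status:
            print(status)
        q.put(bytes(indata))

    with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                           channels=1, callback=callback):
        print("Listening... Press Ctrl+C to stop.")
        try:
            while True:
                data = q.get()
                if rec.AcceptWaveform(data):
                    print(rec.Result())         # final result for a finished utterance (JSON)
                else:
                    print(rec.PartialResult())  # in-progress hypothesis (JSON)
        except KeyboardInterrupt:
            print("Stopped.")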

if __name__ == "__main__":
    real_time_transcribe()

Use cases: Voice assistants, transcription kiosks, hands‑free control.
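The Vosk recognizer returns JSON strings: Result() carries the finished utterance under a "text" key, and PartialResult() carries the in-progress guess under a "partial" key. A small parser keeps the calling code tidy. `extract_text` is a hypothetical helper, and the sample strings are hand-written in that shape.

```python
import json


def extract_text(vosk_json):
    """Return (text, is_final) from a Vosk Result() or PartialResult() string."""
    payload = json.loads(vosk_json)
    if "text" in payload:
        return payload["text"], True
    return payload.get("partial", ""), False


# Hand-written strings in the shape Vosk returns; not captured from a real run.
print(extract_text('{"text": "turn on the lights"}'))
print(extract_text('{"partial": "turn on"}'))
```

A voice assistant would typically act only on final results and use partial ones for live on-screen feedback.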

5.2. Streaming to Cloud Services

For enterprise‑grade needs:

Google Cloud Speech‑to‑Text (streaming RPC)
Azure Speech Services
IBM Watson Speech to Text

These services offer streaming APIs with low latency and speaker diarization but require API keys and incur costs.
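Streaming APIs generally expect audio to arrive as a sequence of small chunks rather than one large upload. The request types and client calls differ per provider and are not shown here. The sketch below covers only the provider-independent part: cutting raw PCM audio into fixed-duration chunks. The function name and the 100 ms default are assumptions.

```python
import io


def iter_audio_chunks(stream, sample_rate=16000, sample_width=2, chunk_ms=100):
    """Yield fixed-duration chunks of raw mono PCM bytes from a binary stream."""
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        yield chunk


# One second of silent 16 kHz, 16-bit mono audio as a stand-in for a microphone.
fake_audio = io.BytesIO(b"\x00\x00" * 16000)
chunks = list(iter_audio_chunks(fake_audio))
print(len(chunks), len(chunks[0]))
```

Each yielded chunk would then be wrapped in whatever request object the chosen service's SDK defines.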

6. Improving Accuracy with Preprocessing & Language Models

6.1. Audio Preprocessing

Noise reduction: Use SoX's noiseprof and noisered effects.
Normalization: Ensure consistent volume level.
Resampling: Match the model’s expected sample rate (e.g., 16 kHz).

from pydub import AudioSegment

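def preprocess(input_path: str, output_path: str) -> None:
    audio = AudioSegment.from_file(input_path)  # non-WAV formats are decoded via FFmpeg
    # Resample to 16 kHz, downmix to mono, then normalize the volume
    audio = audio.set_frame_rate(16000).set_channels(1).normalize()
    audio.export(output_path, format="wav")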
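To show what volume normalization amounts to, here is a standard-library sketch of peak normalization on 16-bit PCM samples: every sample is scaled by the same gain so that the loudest one lands just below full scale. This illustrates the general idea and is not pydub's actual implementation. The 0.1 dB headroom is an assumed default.

```python
import array


def peak_normalize(samples, headroom_db=0.1):
    """Scale 16-bit PCM samples so the loudest peak sits headroom_db below full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array.array("h", samples)  # silence: nothing to scale
    target_peak = 32767 * 10 ** (-headroom_db / 20)
    gain = target_peak / peak
    return array.array("h", (int(round(s * gain)) for s in samples))


quiet = array.array("h", [0, 1000, -2000, 500])
loud = peak_normalize(quiet)
print(list(loud))
```

Because a single gain is applied to all samples, the relative loudness within the recording is preserved; only the overall level changes.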

6.2. Custom Language Models

For domain‑specific vocabularies (medical, legal):

Fine‑tune models like Whisper or similar transformer-based ASR.
Use lexicons/grammars in engines like Kaldi.
Leverage context biasing in cloud services (e.g., phrase hints in Google Cloud).
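When fine-tuning or context biasing is not available, a lighter alternative is to correct near-miss words after transcription against a domain lexicon. This is post-processing, not a custom language model, and it is shown only as a sketch. The `correct_terms` helper, the sample lexicon, and the 0.85 similarity cutoff are assumptions.

```python
import difflib


def correct_terms(transcript, lexicon, cutoff=0.85):
    """Replace words that closely match a lexicon entry with that entry."""
    lowered = {term.lower(): term for term in lexicon}
    corrected = []
    for word in transcript.split():
        matches = difflib.get_close_matches(word.lower(), list(lowered), n=1, cutoff=cutoff)
        corrected.append(lowered[matches[0]] if matches else word)
    return " ".join(corrected)


medical_terms = ["hypertension", "metformin"]  # hypothetical domain lexicon
print(correct_terms("patient with hypertention takes metformine", medical_terms))
```

This simple version splits on whitespace, so punctuation attached to a corrected word is lost, and a cutoff that is too low will "correct" ordinary words into jargon.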

7. Handling Large‑Scale & Multi‑Language Scenarios

7.1. Batch Processing

When transcribing hundreds of hours:

Chunking: Split long files into manageable segments (e.g., 5 min).
Parallelization: Use Python’s multiprocessing or task queues (e.g., Celery).
Result aggregation: Stitch the segment transcripts back together and realign their timestamps to the original recording.
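The parallelization and aggregation steps can be sketched with the standard library. This example uses a thread pool, which suits I/O-bound cloud API calls. For CPU-bound local models, concurrent.futures.ProcessPoolExecutor offers the same map interface. `transcribe_chunk` is a stand-in for a real ASR call, and the chunk names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_chunk(chunk_path):
    """Stand-in for a real ASR call; returns a fake transcript for the path."""
    return f"[text of {chunk_path}]"


def transcribe_batch(chunk_paths, chunk_seconds, transcribe=transcribe_chunk, workers=4):
    """Transcribe chunks in parallel and return them in order with start offsets."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(transcribe, chunk_paths))  # map preserves input order
    return [
        {"start": index * chunk_seconds, "path": path, "text": text}
        for index, (path, text) in enumerate(zip(chunk_paths, texts))
    ]


results = transcribe_batch(["part0.wav", "part1.wav", "part2.wav"], chunk_seconds=300)
for item in results:
    print(item["start"], item["text"])
```

The "start" offset assumes equal-length chunks. Adding it to any timestamps the engine reports within a chunk places them on the original recording's timeline.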

7.2. Multi‑Language

Whisper auto-detects language. For other systems:

Pre‑specify language codes (e.g., en-US, es-ES)
Post‑process by detecting language on text segments
Combine with translation APIs for unified output
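Cloud services usually expect region-qualified codes such as en-US, while Whisper's transcribe method takes a bare language code through its language argument (for example, language="es"). When mixing engines, a small normalizer avoids mismatches. `primary_language` is a hypothetical helper that keeps only the primary subtag.

```python
def primary_language(tag):
    """Return the lowercase primary subtag of a language tag, e.g. 'en-US' -> 'en'."""
    return tag.replace("_", "-").split("-")[0].lower()


for tag in ["en-US", "es-ES", "pt_BR", "de"]:
    print(tag, "->", primary_language(tag))
```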

8. Performance, Privacy, & Best Practices

Aspect      Recommendation
Latency     Use smaller models or streaming APIs; enable GPU acceleration
Accuracy    Preprocess audio; select an appropriate model size; customize the language model
Privacy     Use on-device models (Whisper, Vosk); encrypt audio in transit
Cost        Balance model size against API fees; monitor usage
Monitoring  Log errors and word error rate (WER); fall back gracefully
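Word error rate is the word-level edit distance between a reference transcript and the ASR output (substitutions, deletions, and insertions), divided by the number of words in the reference. A minimal standard-library sketch follows. It lowercases and splits on whitespace, which is a simplifying assumption; production evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference, hypothesis):
    """Return WER: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the processed ref prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In this example there is one substitution ("sat" to "sit") and one deletion ("the"), so the result is 2 errors over 6 reference words. WER can exceed 1.0 when the hypothesis contains many extra words.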

Fallback Strategies: If the primary ASR engine fails, retry with an alternative engine or lower the confidence threshold you accept.
Error Handling: Catch exceptions like timeouts, API errors, and malformed audio.
Logging & Metrics: Track transcription durations, WER, and resource consumption.
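The fallback and error-handling advice can be combined into one small wrapper that tries engines in order and logs each failure. This is a sketch of the control flow only: the two engines below are hypothetical stand-ins, and a real implementation would catch the specific exceptions its engines raise.

```python
import logging

logger = logging.getLogger("transcription")


def transcribe_with_fallback(audio_path, engines):
    """Try each (name, function) engine in order; return (name, text) from the first success."""
    errors = []
    for name, engine in engines:
        try:
            return name, engine(audio_path)
        except Exception as exc:  # broad on purpose: engines raise different error types
            logger.warning("Engine %s failed on %s: %s", name, audio_path, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All engines failed: " + "; ".join(errors))


# Hypothetical engines used only to demonstrate the control flow.
def flaky_cloud_engine(path):
    raise TimeoutError("request timed out")


def local_engine(path):
    return f"transcript of {path}"


print(transcribe_with_fallback("call.wav", [("cloud", flaky_cloud_engine), ("local", local_engine)]))
```

Returning the engine name alongside the text makes it easy to track how often the fallback path is taken.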

Conclusion

Converting speech to text in Python is straightforward with today's tooling. Whether you need a quick prototype using Google's free API, an offline, privacy-friendly solution with Whisper or Vosk, or an enterprise-grade, real-time streaming pipeline, Python's ecosystem has you covered.

Key Takeaways:

Start small: Prototype with SpeechRecognition and Google.
Scale up: Adopt Whisper for robust, offline transcription.
Optimize: Preprocess audio, choose model sizes wisely, and parallelize batch jobs.
Secure: Use local models for sensitive data.
Monitor: Implement error handling and measure accuracy.

With the knowledge and code snippets in this article, you’re well‑equipped to build anything from a simple transcription tool to a sophisticated, multi‑language voice assistant. Happy coding!