Building Voice AI Applications

May 27, 2024

With GPT-4o’s native audio capabilities and steadily improving speech synthesis, voice AI applications are becoming practical: natural conversations, not robotic interactions, are now achievable. Building them well requires understanding both the underlying technology and the UX challenges unique to voice.

Here’s how to build voice AI applications.

Voice AI Architecture

Traditional Pipeline

traditional_voice_pipeline:
  speech_to_text:
    models: [Whisper, Google Speech, Azure Speech]
    latency: 200-500ms
    limitation: Loses tone, emotion, hesitation

  language_model:
    models: [GPT-4, Claude]
    latency: 500-2000ms
    limitation: Text-only reasoning

  text_to_speech:
    models: [ElevenLabs, Azure TTS, Google TTS]
    latency: 200-500ms
    limitation: Generated speech, not natural

  total_latency: 1-3 seconds (unnatural for conversation)

Native Multimodal

native_voice_ai:
  model: GPT-4o
  latency: ~320ms average end-to-end (as low as ~230ms)
  benefits:
    - Preserves tone and emotion
    - Natural conversation speed
    - Cross-modal understanding

Building Blocks

Speech-to-Text with Whisper

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_audio(audio_file_path: str) -> str:
    with open(audio_file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="text"
        )
    return transcript

# With word-level timestamps (the v1 SDK returns a TranscriptionVerbose
# object rather than a plain dict)
def transcribe_with_timestamps(audio_file_path: str):
    with open(audio_file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"]
        )
    return transcript
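
The verbose response exposes the timestamps as a words list. Hypothetical usage, with meeting.wav as a placeholder recording:

# Print each word with its start time
result = transcribe_with_timestamps("meeting.wav")
for word in result.words:
    print(f"{word.start:5.2f}s  {word.word}")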

Text-to-Speech

def generate_speech(text: str, voice: str = "alloy") -> bytes:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,  # alloy, echo, fable, onyx, nova, shimmer
        input=text
    )
    return response.content

# Streaming for lower latency: with_streaming_response yields audio
# chunks as they arrive instead of buffering the whole file
def stream_speech(text: str):
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
    ) as response:
        # Stream to audio output as chunks arrive
        for chunk in response.iter_bytes():
            yield chunk
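
To play the stream as it arrives, the chunks can be written straight to an output device. A minimal sketch, assuming the third-party pyaudio package and the PCM response format (raw 24 kHz, 16-bit mono):

import pyaudio

def play_streamed_speech(text: str):
    pa = pyaudio.PyAudio()
    out = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    try:
        with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=text,
            response_format="pcm",  # raw samples, no MP3 container
        ) as response:
            for chunk in response.iter_bytes(1024):
                out.write(chunk)
    finally:
        out.stop_stream()
        out.close()
        pa.terminate()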

Complete Pipeline

class VoiceAssistant:
    def __init__(self):
        # Async client so transcription, chat, and TTS calls don't block the event loop
        self.client = openai.AsyncOpenAI()

    async def process_voice(self, audio_input: bytes) -> bytes:
        # Transcribe
        transcript = await self._transcribe(audio_input)

        # Generate response
        response_text = await self._generate_response(transcript)

        # Synthesize speech
        audio_output = await self._synthesize(response_text)

        return audio_output

    async def _transcribe(self, audio: bytes) -> str:
        response = await self.client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio)
        )
        return response.text

    async def _generate_response(self, text: str) -> str:
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful voice assistant. Keep responses concise and conversational."},
                {"role": "user", "content": text}
            ]
        )
        return response.choices[0].message.content

    async def _synthesize(self, text: str) -> bytes:
        response = await self.client.audio.speech.create(
            model="tts-1",
            voice="nova",
            input=text
        )
        return response.content
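
Hypothetical usage, with question.wav as a placeholder recording (tts-1 returns MP3 by default):

import asyncio

async def main():
    assistant = VoiceAssistant()
    with open("question.wav", "rb") as f:
        reply_audio = await assistant.process_voice(f.read())
    with open("reply.mp3", "wb") as f:
        f.write(reply_audio)

asyncio.run(main())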

UX Considerations

Voice-Specific Design

voice_ux_principles:
  brevity:
    - Short responses (1-2 sentences ideal)
    - Get to the point quickly
    - Avoid lists in speech

  confirmation:
    - Acknowledge understanding
    - Repeat key information
    - Ask for clarification naturally

  pacing:
    - Natural rhythm
    - Pauses for comprehension
    - Don't rush

  error_handling:
    - "I didn't catch that"
    - Offer alternatives
    - Know when to escalate

  context:
    - Remember conversation history
    - Reference previous turns
    - Maintain topic coherence
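
These principles can be encoded directly in the system prompt. A sketch with illustrative wording (VOICE_SYSTEM_PROMPT is a hypothetical name, not a library constant):

# A system prompt encoding the voice UX principles above
VOICE_SYSTEM_PROMPT = """You are a voice assistant. Your replies are spoken aloud.
- Keep answers to one or two short sentences.
- Never read out lists or markdown; summarize instead.
- Confirm key details back to the user (names, dates, amounts).
- If you didn't catch something, say so and ask one clarifying question.
- Use the conversation history; don't re-ask for information you already have."""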

Handling Interruptions

class InterruptibleAssistant:
    def __init__(self):
        self.current_response = None  # asyncio.Task for the in-flight reply
        self.is_speaking = False

    async def handle_input(self, audio_chunk: bytes):
        # Detect speech during output (contains_speech is a
        # voice-activity-detection check; see the sketch below)
        if self.is_speaking and contains_speech(audio_chunk):
            # User interrupted
            await self.stop_current_response()
            self.is_speaking = False

        # Process new input
        transcript = await self.transcribe(audio_chunk)
        if transcript:
            await self.generate_and_speak(transcript)

    async def stop_current_response(self):
        if self.current_response:
            self.current_response.cancel()
            # Acknowledge interruption
            await self.speak("Sure, go ahead.")
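
The class above assumes transcribe, generate_and_speak, and speak helpers like those in VoiceAssistant. One way to implement contains_speech is with the third-party webrtcvad package (an assumption; any voice-activity detector works), which classifies short PCM frames:

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (permissive) to 3 (strict)

def contains_speech(audio_chunk: bytes, sample_rate: int = 16000) -> bool:
    # webrtcvad expects 16-bit mono PCM in 10/20/30 ms frames
    frame_bytes = int(sample_rate * 0.03) * 2  # 30 ms of 16-bit samples
    frames = [
        audio_chunk[i:i + frame_bytes]
        for i in range(0, len(audio_chunk) - frame_bytes + 1, frame_bytes)
    ]
    return any(vad.is_speech(frame, sample_rate) for frame in frames)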

Production Concerns

Latency Optimization

latency_optimization:
  streaming:
    - Stream audio as it's generated
    - Start speaking before the full response (see the sketch after this list)
    - Reduces perceived latency

  caching:
    - Cache common responses
    - Pre-generate frequent phrases
    - Reduces synthesis latency

  edge_deployment:
    - Process near user when possible
    - Reduce network round trips

  model_selection:
    - tts-1 faster than tts-1-hd
    - Trade quality for speed when appropriate
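
A minimal sketch of the streaming idea, assuming client is an openai.AsyncOpenAI instance like the one in VoiceAssistant; splitting on a sentence-boundary regex is a deliberate simplification:

import re

async def stream_response_as_speech(client, user_text: str):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )

    buffer = ""
    async for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush on sentence boundaries so TTS can start speaking early
        while match := re.search(r"[.!?]\s", buffer):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            speech = await client.audio.speech.create(
                model="tts-1", voice="nova", input=sentence
            )
            yield speech.content

    if buffer.strip():  # trailing partial sentence
        speech = await client.audio.speech.create(
            model="tts-1", voice="nova", input=buffer
        )
        yield speech.content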

Quality Monitoring

voice_quality_metrics:
  technical:
    - Transcription accuracy (WER)
    - Speech synthesis quality (MOS)
    - End-to-end latency

  user_experience:
    - Task completion rate
    - Conversation length
    - Repeat/clarification rate

  business:
    - User satisfaction (CSAT)
    - Containment rate (for support)
    - Cost per conversation
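
Transcription accuracy can be tracked offline by computing word error rate against a hand-labeled reference set. A sketch using the third-party jiwer package (an assumption, not part of the stack above):

from jiwer import wer  # pip install jiwer

# Illustrative sample; in practice compare whole labeled test sets
reference = "what is my account balance"
hypothesis = "what is my account balanced"

error_rate = wer(reference, hypothesis)  # 0.0 means a perfect transcript
print(f"WER: {error_rate:.2%}")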

Key Takeaways

Voice AI is ready for production: native multimodal models bring latency down to conversational speed, and the Whisper → LLM → TTS pipeline remains a solid fallback. Design for the ear rather than the eye, handle interruptions gracefully, and monitor quality from day one. Build thoughtfully.