The Science Behind Live Audio Capture: How We Achieve <100ms Latency

Niranjan K.

Lead Architect

June 15, 2026
6 min read

When we set out to build hcalls, our one non-negotiable requirement was low latency. If an AI copilot takes five seconds to output talking points after an interviewer asks a question, the conversation dies. It feels awkward, artificial, and unusable.

To feel like a natural extension of your own mind, the delay between the interviewer speaking and the first hint appearing on your screen needs to be sub-second. Specifically, we target under 100ms from audio capture to the start of transcription. Here is how we engineered a low-latency loopback capture system for video call applications.

The Challenge: Traditional Virtual Audio Drivers

Most screen recording or audio-routing apps rely on virtual audio cables or custom sound card drivers (like VB-Audio or Loopback on macOS). While functional, virtual drivers have significant drawbacks for job candidates:

  • Installation Friction: They require system restarts and administrator privileges.
  • Fragility: They frequently reset default audio outputs mid-call, leaving candidates unable to hear their interviewer.
  • Detection Risks: Having a virtual device called "AI Audio Cable" active in your system settings is a clear red flag for browser-based proctored screen-sharing environments.

The Solution: WASAPI Loopback Capture

On Windows, we skipped virtual drivers entirely. Instead, we use the native Windows Audio Session API (WASAPI) in loopback mode. This lets our desktop application capture the audio stream that the sound device is rendering, entirely in user mode.

When your speakers or headphones play the interviewer's voice, WASAPI lets us read a copy of that stream as it is rendered. It requires no drivers, installs nothing at the system level, and adds no delay to the playback path. To the operating system and video calling clients (like Zoom or Google Meet), hcalls is just a passive, standard desktop process reading a shared buffer.
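Loopback capture hands back audio in whatever format the device is rendering, which is frequently 32-bit float stereo rather than the 16-bit mono PCM that speech engines tend to expect. The sketch below shows one way that conversion could look. The 48 kHz stereo format, the 10 ms frame size, and the helper names are assumptions for illustration, not details of the hcalls implementation; in practice the format should be read from the device.

```python
import struct

SAMPLE_RATE = 48_000  # assumed render rate; query the device in practice
CHANNELS = 2          # assumed interleaved stereo
FRAME_MS = 10         # assumed frame duration


def samples_per_frame(rate: int = SAMPLE_RATE, frame_ms: int = FRAME_MS) -> int:
    """Number of samples per channel in one frame of the given duration."""
    return rate * frame_ms // 1000


def float32_to_int16_mono(buf: bytes, channels: int = CHANNELS) -> bytes:
    """Downmix interleaved little-endian float32 audio to 16-bit mono PCM."""
    count = len(buf) // 4
    floats = struct.unpack(f"<{count}f", buf[: count * 4])
    out = []
    # Walk whole sample groups only; a trailing partial group is dropped.
    for i in range(0, count - count % channels, channels):
        mono = sum(floats[i : i + channels]) / channels
        mono = max(-1.0, min(1.0, mono))  # clamp to avoid int16 overflow
        out.append(int(mono * 32767))
    return struct.pack(f"<{len(out)}h", *out)
```

At the assumed 48 kHz, a 10 ms frame holds 480 samples per channel, so each captured frame is 3,840 bytes before conversion and 960 bytes after.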

Sub-100ms Speech-to-Text Pipeline

Capturing the raw PCM audio buffer is only step one. Feeding that buffer to a transcription engine efficiently is where things get complex. We solved this with a three-pronged approach:

  1. Local VAD (Voice Activity Detection): We run a lightweight, local WebRTC-VAD engine. If there is silence, white noise, or keystroke sounds, we don't send any packets. This saves bandwidth and processing cycles.
  2. Deepgram Nova-3 Integration: Once active speech is detected, we stream the audio over a persistent, low-overhead WebSocket connection to Deepgram's Nova-3 model. Because we use raw buffer streams instead of block-based files, the transcription latency is regularly under 100ms.
  3. JSON-LD Prompt Framing: Rather than waiting for a full paragraph to complete, we send transient sentence fragments to our downstream LLM, so it can start predicting the question's intent before the interviewer even finishes their sentence.
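The gating idea in step 1 can be sketched as a small filter that sits between capture and the WebSocket sender. This is only an illustration: it uses a plain RMS energy threshold as a stand-in for WebRTC-VAD, and the function names, threshold, and hangover length are all hypothetical.

```python
import math
import struct
from typing import Callable, Iterable, Iterator


def rms_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Stand-in detector: RMS energy of a 16-bit mono PCM frame vs. a threshold."""
    n = len(frame) // 2
    if n == 0:
        return False
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return rms >= threshold


def gate_frames(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool] = rms_is_speech,
    hangover: int = 3,
) -> Iterator[bytes]:
    """Yield only frames worth streaming: speech plus a short trailing tail."""
    remaining = 0
    for frame in frames:
        if is_speech(frame):
            remaining = hangover  # keep the tail so word endings are not clipped
            yield frame
        elif remaining > 0:
            remaining -= 1
            yield frame
        # Otherwise the frame is dropped and nothing is sent upstream.
```

An energy threshold alone will not reject keystrokes or other loud non-speech sounds, which is why a real detector such as WebRTC-VAD would be passed in as `is_speech` instead. The hangover keeps a few frames after speech stops, trading a little bandwidth for cleaner word boundaries.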

"Latency isn't just a metric; it's the boundary between a helpful tool and a distracting interruption. In high-pressure interviews, 500ms feels like an eternity."

Visualizing the Audio Pipeline

The entire flow from the interviewer's vocal cords to your display looks like this:

[Interviewer Speech] 
       ↓ (Network Stream)
[Zoom / Meet Audio Output]
       ↓ (Render Buffer)
[WASAPI Loopback Capture] <--- Zero Latency
       ↓ (Local WebRTC-VAD Check)
[WebSocket Binary Stream] <--- Client-to-Edge (~20ms)
       ↓ (Deepgram Nova-3 Engine)
[Instant JSON Transcription] <--- (~80ms)
       ↓ (Streaming LLM Hook)
[Your Screen HUD Overlay] <--- First Token (~650ms total)
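The figures in the diagram can be turned into a rough latency budget. The sketch below assumes the ~20ms and ~80ms values are per-stage costs and that ~650ms is the cumulative time to first token; the diagram does not state this explicitly, so treat the breakdown as one possible reading. The stage names are hypothetical labels.

```python
# Values come from the diagram; reading them as per-stage costs is an assumption.
STAGES_MS = {
    "wasapi_loopback_capture": 0,
    "client_to_edge_websocket": 20,
    "transcription": 80,
}
FIRST_TOKEN_TOTAL_MS = 650  # cumulative figure from the last diagram row


def cumulative(stages: dict[str, int]) -> list[tuple[str, int]]:
    """Running total of latency after each stage, in pipeline order."""
    total = 0
    out = []
    for name, ms in stages.items():
        total += ms
        out.append((name, total))
    return out


def remaining_budget(stages: dict[str, int], total_ms: int) -> int:
    """Time left for everything after the listed stages (here, the LLM hook)."""
    return total_ms - sum(stages.values())
```

Under that reading, capture through transcription accounts for about 100ms, which leaves roughly 550ms of the 650ms total for the streaming LLM step and on-screen rendering.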

The Result

By bypassing virtual audio drivers and building a native, raw-buffer capture pipeline, hcalls keeps its system-to-screen response time (about 650ms to the first token) short enough to blend into a natural conversation. You get the confidence of structured talking points exactly when you need them, without any technical hiccups.
