Real-Time Speech-to-Text in the Browser with Deepgram

Trent Tompkins June 4, 2026

I want to show you the smallest honest version of a feature that feels like magic the first time you see it: a text box with a microphone button, where you click, start talking, and the words stream in live as you speak — flickering as the recognizer guesses, then snapping into place the moment you pause. No framework, no build step, no audio library. A <textarea>, a few dozen lines of JavaScript, and Deepgram doing the hard part on the other end of a WebSocket.

By the end you'll understand every piece of the pipeline and have a single self-contained HTML file you can open and use. I'll also be blunt about the one thing the toy version does that you must not ship.

What we're building

One screen: a big text area and a mic button. You press the mic and grant microphone access. As you talk, partial text appears at the end of the box and rewrites itself in real time; that's the recognizer thinking out loud. When you pause, the current phrase finalizes: it commits to the textarea as punctuated, capitalized text, and the partial buffer clears for the next phrase. Press the mic again to stop. That live-then-commit behavior is the whole trick, and it's almost entirely about how you handle two kinds of message coming back.

The pieces

The chain is short but every link matters:

getUserMedia — ask the browser for the microphone and get a MediaStream.
Web Audio API — feed that stream into an AudioContext and tap the raw samples with an AudioWorklet (or a ScriptProcessor on older browsers).
Downsample to 16 kHz mono Int16 PCM — the mic gives you 44.1 or 48 kHz float samples; Deepgram's streaming model wants linear16 at 16 kHz. We resample and convert float [-1,1] to signed 16-bit integers ourselves.
WebSocket — open a socket to Deepgram's streaming endpoint and send those PCM bytes as binary frames as fast as the mic produces them.
Transcripts back — Deepgram replies with JSON; we drop partials and finals into the textarea.

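Browsers don't reliably hand you a 16 kHz capture, so this demo always downsamples the audio itself. It's plain decimation by the rate ratio plus a float-to-int16 cast: a dozen lines, shown below. There is no anti-aliasing filter, which is generally tolerable for a speech demo but worth knowing about.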

The Deepgram streaming protocol

You open a WebSocket to api.deepgram.com/v1/listen with your audio format described in the query string. The exact URL we'll use:

wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true&smart_format=true&punctuate=true

What each parameter buys you:

model=nova-3 — Deepgram's current general model.
encoding=linear16 + sample_rate=16000 — tells Deepgram exactly what raw bytes you're about to send: 16-bit signed PCM at 16 kHz.
interim_results=true — emit those live partials, not just finished phrases. This is what makes the text move while you talk.
smart_format=true + punctuate=true — numbers, dates, and punctuation get formatted for you, so finals read like real sentences.
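
As a small sketch, the same URL can be assembled from an options object with URLSearchParams instead of hand-concatenated strings, which makes it harder to mistype a separator. The helper name buildListenUrl is hypothetical; the parameter names and values are the ones listed above.

```javascript
// Hypothetical helper: builds the streaming URL from the parameters above.
// Pass overrides to change a value, e.g. buildListenUrl({ punctuate: "false" }).
function buildListenUrl(overrides = {}) {
  const params = new URLSearchParams({
    model: "nova-3",
    encoding: "linear16",     // raw 16-bit signed PCM
    sample_rate: "16000",     // must match what the downsampler produces
    interim_results: "true",  // live partials, not just finished phrases
    smart_format: "true",
    punctuate: "true",
    ...overrides,
  });
  return "wss://api.deepgram.com/v1/listen?" + params.toString();
}

console.log(buildListenUrl());
```

With no overrides this produces the same URL as the hand-written one, with the parameters in the same order.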

Authentication is a single header: Authorization: Token YOUR_DEEPGRAM_API_KEY. (Browsers can't set custom headers on a WebSocket, which is the first hint that the toy version is taking a shortcut — more on that at the end. The standard fix is the token sub-protocol, shown in the code.)

Once connected, you just send raw binary PCM frames down the socket — no envelope, no base64, no JSON wrapper. The audio is the message. Deepgram sends back JSON text frames that look like this (trimmed):

{
  "type": "Results",
  "is_final": false,
  "channel": {
    "alternatives": [
      { "transcript": "real time speech to", "confidence": 0.98 }
    ]
  }
}

Two fields carry the whole interaction:

channel.alternatives[0].transcript — the text. (An empty string is normal between phrases; ignore it.)
is_final — false means this is a live partial that will be replaced by the next message; true means this segment is finalized and safe to commit permanently.

When you're done, send a JSON control message {"type":"CloseStream"} so Deepgram flushes any pending audio and returns the last final before the socket closes.
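
One way to make that shutdown explicit is to send CloseStream and then wait for the socket to close before tearing anything else down, with a timeout as a safety net. This is a sketch under assumptions: closeGracefully and the 2-second default are made up for illustration, and it relies only on the standard WebSocket send, close, readyState, and close event.

```javascript
// Hypothetical helper: ask Deepgram to flush, then wait for the socket to close.
// Resolves "closed" if the socket closed in time, "timeout" if we gave up,
// and "not-open" if there was nothing to shut down.
function closeGracefully(socket, timeoutMs = 2000) {
  const OPEN = 1; // same value as WebSocket.OPEN
  return new Promise((resolve) => {
    if (!socket || socket.readyState !== OPEN) return resolve("not-open");

    const timer = setTimeout(() => {
      socket.close();        // stop waiting and close from our side
      resolve("timeout");
    }, timeoutMs);

    socket.addEventListener("close", () => {
      clearTimeout(timer);
      resolve("closed");     // no effect if the timeout already resolved
    }, { once: true });

    // A text frame, not binary: this is a control message, not audio.
    socket.send(JSON.stringify({ type: "CloseStream" }));
  });
}
```

Because the socket's message handler stays attached while this promise is pending, a final transcript that arrives after CloseStream is still processed before the close event fires.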

The mental model: finals are append-only ink. Partials are a pencil preview you keep erasing and redrawing. Keep them in separate buffers and the live-update behavior falls out for free — you render committedFinals + currentPartial on every message.
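
To see the two buffers in isolation, here is a minimal sketch of that rule as a pure function, run over a short made-up sequence of messages. The names applyResult, display, and result are hypothetical, and the transcripts are invented for illustration; only the type, is_final, and channel.alternatives[0].transcript fields come from the message format above.

```javascript
// Hypothetical pure reducer over state = { committed, partial }.
function applyResult(state, msg) {
  if (msg.type !== "Results") return state;   // ignore non-transcript messages
  const text = msg.channel?.alternatives?.[0]?.transcript || "";
  if (!text) return state;                     // empty string between phrases
  if (msg.is_final) {
    // Ink: append to committed and clear the pencil preview.
    const sep = state.committed ? " " : "";
    return { committed: state.committed + sep + text, partial: "" };
  }
  // Pencil: replace the preview, leave committed untouched.
  return { committed: state.committed, partial: text };
}

// What the textarea would show: committed text plus the single live partial.
function display(state) {
  return [state.committed, state.partial].filter(Boolean).join(" ");
}

// Made-up sequence: two partials, the final that replaces them, a new partial.
const result = (transcript, is_final) => ({
  type: "Results", is_final, channel: { alternatives: [{ transcript }] },
});
let state = { committed: "", partial: "" };
for (const msg of [
  result("real time", false),
  result("real time speech to", false),
  result("Real-time speech to text.", true),
  result("and it", false),
]) {
  state = applyResult(state, msg);
  console.log(JSON.stringify(display(state)));
}
```

The second partial overwrites the first rather than appending to it, and the final replaces both with its punctuated form; only finals ever grow the committed string.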

Working code

Here's the whole thing in one file. Paste your key where marked, open the file in a browser, click the mic, and talk. The AudioWorklet path is used when available; otherwise it falls back to the older, deprecated ScriptProcessor node. The partial-vs-final handling is the heart of it: note how render() always draws committed text plus the single live partial.

<!DOCTYPE html>
<html lang="en">
<head>
  <base href="/">
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Deepgram Live Dictation</title>
  <style>
    body { font-family: system-ui, sans-serif; max-width: 720px; margin: 40px auto; padding: 0 16px; }
    textarea { width: 100%; height: 220px; font-size: 16px; padding: 12px; box-sizing: border-box; }
    #mic { font-size: 18px; padding: 10px 20px; margin-bottom: 12px; cursor: pointer; }
    #mic.on { background: #c0392b; color: #fff; }
    .partial { color: #888; }
  </style>
</head>
<body>
  <h1>Live Dictation</h1>
  <button id="mic">🎤 Start</button>
  <textarea id="out" placeholder="Click the mic and start talking..."></textarea>

<script>
// !! Toy demo only: never ship your real key to the browser. See the article. !!
const DEEPGRAM_KEY = "YOUR_DEEPGRAM_API_KEY";

const micBtn = document.getElementById("mic");
const out    = document.getElementById("out");

let committed = "";   // finalized text (append-only)
let partial   = "";   // current live guess (replaced each message)
let audioCtx, stream, node, socket, running = false;

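function render() {
  // Always: everything committed so far, plus the single live partial.
  // No joining space at the very start or right after a newline.
  const sep = committed && partial && !committed.endsWith("\n") ? " " : "";
  out.value = committed + sep + partial;
  out.scrollTop = out.scrollHeight;
}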

// Float32 [-1,1] @ inputRate  ->  Int16 PCM @ 16000 Hz
function downsampleToInt16(float32, inputRate) {
  const ratio = inputRate / 16000;
  const outLen = Math.floor(float32.length / ratio);
  const int16 = new Int16Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const s = Math.max(-1, Math.min(1, float32[Math.floor(i * ratio)]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return int16;
}

async function start() {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  audioCtx = new (window.AudioContext || window.webkitAudioContext)();
  const source = audioCtx.createMediaStreamSource(stream);
  const inputRate = audioCtx.sampleRate;

  // Build the Deepgram streaming URL.
  const url = "wss://api.deepgram.com/v1/listen"
    + "?model=nova-3&encoding=linear16&sample_rate=16000"
    + "&interim_results=true&smart_format=true&punctuate=true";

  // Browsers can't set headers on a WebSocket, so auth rides the sub-protocol.
  socket = new WebSocket(url, ["token", DEEPGRAM_KEY]);
  socket.binaryType = "arraybuffer";

  socket.onmessage = (evt) => {
    const msg = JSON.parse(evt.data);
    if (msg.type !== "Results") return;
    const text = msg.channel?.alternatives?.[0]?.transcript || "";
    if (!text) return;
    if (msg.is_final) {
      // ink it in (no joining space at the very start or right after a newline)
      committed += (committed && !committed.endsWith("\n") ? " " : "") + text;
      partial = "";                                  // clear the pencil
    } else {
      partial = text;                                // live preview
    }
    render();
  };

  // Tap raw audio and stream PCM once the socket is open.
  const pump = (float32) => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(downsampleToInt16(float32, inputRate).buffer);
    }
  };

  if (audioCtx.audioWorklet) {
    const blob = new Blob([`
      class Tap extends AudioWorkletProcessor {
        process(inputs) {
          const ch = inputs[0][0];
          if (ch) this.port.postMessage(ch.slice(0));
          return true;
        }
      }
      registerProcessor('tap', Tap);
    `], { type: "application/javascript" });
    await audioCtx.audioWorklet.addModule(URL.createObjectURL(blob));
    node = new AudioWorkletNode(audioCtx, "tap");
    node.port.onmessage = (e) => pump(e.data);
    source.connect(node).connect(audioCtx.destination);
  } else {
    // Fallback for browsers without AudioWorklet.
    node = audioCtx.createScriptProcessor(4096, 1, 1);
    node.onaudioprocess = (e) => pump(e.inputBuffer.getChannelData(0));
    source.connect(node);
    node.connect(audioCtx.destination);
  }
}

async function stop() {
  // Ask Deepgram to flush pending audio and return the last final.
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ type: "CloseStream" }));
  }
  if (node) node.disconnect();
  if (audioCtx) await audioCtx.close();
  if (stream) stream.getTracks().forEach(t => t.stop());
  if (committed) committed += "\n";   // newline between dictation runs
  partial = ""; render();
}

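micBtn.onclick = async () => {
  running = !running;
  micBtn.classList.toggle("on", running);
  micBtn.textContent = running ? "⏹ Stop" : "🎤 Start";
  try {
    if (running) await start(); else await stop();
  } catch (err) {
    // e.g. microphone permission denied: release the mic and reset the button.
    console.error(err);
    if (stream) stream.getTracks().forEach(t => t.stop());
    running = false;
    micBtn.classList.remove("on");
    micBtn.textContent = "🎤 Start";
  }
};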
</script>
</body>
</html>

That's it — a complete, runnable dictation box. The pieces map one-to-one onto the chain we drew: getUserMedia → AudioContext tap → downsampleToInt16 → socket.send(pcm) → onmessage sorting finals from partials.
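
One refinement worth considering once the demo works: an AudioWorklet typically delivers audio in 128-frame blocks, and downsampleToInt16 restarts its decimation at index 0 for every block. Assuming 128-frame blocks at 48 kHz, each block yields 42 output samples, so the stream carries about 15,750 samples per second while the URL promises 16,000. The sketch below carries the read position across blocks so the output rate stays exact. createDownsampler is a hypothetical name, and like the original it uses plain decimation with no anti-aliasing filter.

```javascript
// Hypothetical stateful variant of downsampleToInt16: it remembers where the
// next output sample falls, so decimation continues seamlessly across blocks.
function createDownsampler(inputRate, targetRate = 16000) {
  const ratio = inputRate / targetRate;
  let nextPos = 0; // position of the next output sample within the current block

  return function downsample(float32) {
    const out = [];
    while (nextPos < float32.length) {
      // Clamp to [-1, 1], then scale to the signed 16-bit range.
      const s = Math.max(-1, Math.min(1, float32[Math.floor(nextPos)]));
      out.push(s < 0 ? s * 0x8000 : s * 0x7FFF);
      nextPos += ratio;
    }
    nextPos -= float32.length; // carry the remainder into the next block
    return Int16Array.from(out);
  };
}

// One second of 48 kHz audio, fed in 128-frame blocks as a worklet would.
const downsample = createDownsampler(48000);
let produced = 0;
for (let i = 0; i < 375; i++) produced += downsample(new Float32Array(128)).length;
console.log(produced);
```

In the demo, this would mean creating the downsampler once in start() and calling it from pump in place of downsampleToInt16(float32, inputRate).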

The production caveat (read this)

The demo above puts DEEPGRAM_KEY straight into the page. Do not ship that. Anyone who opens dev tools can lift your key and run up your bill. It's perfect for a local experiment and disqualifying for production.

The fix is small and standard: run a tiny server-side WebSocket proxy. The browser opens a socket to your server (no key in sight) and streams the same PCM frames to it. Your server holds the Deepgram key, opens its own socket to Deepgram, and relays bytes in both directions — audio up, transcripts down. The key never leaves the backend, and you get a natural place to add auth, rate limits, and usage metering per user. In Node it's barely more than two ws connections wired together with .on('message') handlers piping each to the other.
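
To make the shape of that relay concrete, here is a sketch of just the wiring, written against the event-emitter style interface that Node's ws package exposes (on("message"), send, close, readyState). It is an illustration under assumptions, not a hardened proxy: relay is a hypothetical function, the queue exists only to hold audio that arrives before the upstream socket has opened, and auth, rate limiting, and error reporting are left out.

```javascript
// Hypothetical wiring for a relay. `client` is the browser's socket and
// `upstream` is the server's own socket to Deepgram. Both are assumed to look
// like `ws` sockets: .on(event, fn), .send(data, opts), .close(), .readyState.
function relay(client, upstream) {
  const OPEN = 1;
  const pending = []; // frames received before upstream finished connecting

  // Audio (and CloseStream) up: forward as-is, preserving binary vs text.
  client.on("message", (data, isBinary) => {
    if (upstream.readyState === OPEN) upstream.send(data, { binary: isBinary });
    else pending.push([data, isBinary]);
  });

  // Flush anything that was queued while upstream was still connecting.
  upstream.on("open", () => {
    for (const [data, isBinary] of pending) upstream.send(data, { binary: isBinary });
    pending.length = 0;
  });

  // Transcripts down: forward Deepgram's JSON frames to the browser unchanged.
  upstream.on("message", (data, isBinary) => {
    if (client.readyState === OPEN) client.send(data, { binary: isBinary });
  });

  // If either side goes away or errors, close the other one too.
  client.on("close", () => upstream.close());
  upstream.on("close", () => client.close());
  client.on("error", () => upstream.close());
  upstream.on("error", () => client.close());
}
```

With ws, relay would be called from the server's connection handler, passing the incoming socket and a new upstream socket that the server opens to the Deepgram URL with the Authorization: Token header, so the key appears only in server-side code.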

This is exactly how the live-chat widget on 247ch.at does its voice input: the browser talks only to our relay, the relay talks to the speech provider, and the API key stays server-side where it belongs. Build the toy version to learn the protocol — then put the key behind a proxy before anyone else can reach it.

Once that proxy is in place the client code barely changes: swap the Deepgram URL for your own wss://yourapp.com/stt, drop the key and the token sub-protocol, and everything else — the downsampling, the partial-vs-final rendering, the CloseStream on stop — stays identical. The hard part was never the audio. It was knowing which two fields to watch.