Real-Time Speech-to-Text in the Browser with Deepgram
I want to show you the smallest honest version of a feature that feels like magic the first time
you see it: a text box with a microphone button, where you click, start talking, and the words
stream in live as you speak — flickering as the recognizer guesses, then snapping into
place the moment you pause. No framework, no build step, no audio library. A
<textarea>, a few dozen lines of JavaScript, and Deepgram doing the hard part on
the other end of a WebSocket.
By the end you'll understand every piece of the pipeline and have a single self-contained HTML file you can open and use. I'll also be blunt about the one thing the toy version does that you must not ship.
What we're building
One screen: a big text area and a mic button. You press the mic and grant microphone access. As you talk, partial text appears and rewrites itself in real time; that's the recognizer thinking out loud. When you pause, the current phrase finalizes: it commits to the textarea as punctuated, capitalized text, and the partial buffer clears for the next phrase. Press the mic again to stop. That live-then-commit behavior is the whole trick, and it's almost entirely about how you handle two kinds of message coming back.
The pieces
The chain is short but every link matters:
- getUserMedia: ask the browser for the microphone and get a MediaStream.
- Web Audio API: feed that stream into an AudioContext and tap the raw samples with an AudioWorklet (or a ScriptProcessor on older browsers).
- Downsample to 16 kHz mono Int16 PCM: the mic gives you 44.1 or 48 kHz float samples; Deepgram's streaming model wants linear16 at 16 kHz. We resample and convert float [-1,1] to signed 16-bit integers ourselves.
- WebSocket: open a socket to Deepgram's streaming endpoint and send those PCM bytes as binary frames as fast as the mic produces them.
- Transcripts back: Deepgram replies with JSON; we drop partials and finals into the textarea.
Browsers typically capture at the hardware rate (44.1 or 48 kHz) rather than 16 kHz, so the audio has to be downsampled to match the sample_rate we declare to Deepgram. It's just nearest-sample decimation plus a float-to-int16 cast: a dozen lines, shown below.
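To make the arithmetic concrete, here is a minimal sketch of that conversion applied to one whole buffer, outside the browser. The helper names floatToInt16 and decimate are illustrative and not part of any library; the block runs as-is in Node.

```javascript
// Map one float sample in [-1, 1] to a signed 16-bit integer.
function floatToInt16(sample) {
  const s = Math.max(-1, Math.min(1, sample)); // clamp out-of-range input
  // Negative samples scale by 32768, positive by 32767, so both ends fit in Int16.
  return Math.trunc(s < 0 ? s * 0x8000 : s * 0x7fff);
}

// Nearest-sample decimation: keep one input sample per `ratio` inputs.
function decimate(samples, inputRate, targetRate = 16000) {
  const ratio = inputRate / targetRate; // 48000 / 16000 = 3
  const outLen = Math.floor(samples.length / ratio);
  const out = new Int16Array(outLen);
  for (let i = 0; i < outLen; i++) {
    out[i] = floatToInt16(samples[Math.floor(i * ratio)]);
  }
  return out;
}

// Six samples at 48 kHz become two at 16 kHz (indices 0 and 3 are kept).
const input = new Float32Array([0, 0.25, 0.5, 1, -0.5, -1]);
console.log(Array.from(decimate(input, 48000))); // [ 0, 32767 ]
```

Two caveats about this kind of decimation. There is no low-pass filter, so content above 8 kHz can alias into the result; speech usually survives that, but it is a shortcut. And when audio arrives in small chunks, the fractional read position has to carry over from one chunk to the next, otherwise a fraction of a sample is lost at every chunk boundary.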
The Deepgram streaming protocol
You open a WebSocket to api.deepgram.com/v1/listen with your audio format described
in the query string. The exact URL we'll use:
wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true&smart_format=true&punctuate=true
What each parameter buys you:
- model=nova-3: Deepgram's current general model.
- encoding=linear16 + sample_rate=16000: declares the raw bytes you're about to send as 16-bit signed PCM at 16 kHz.
- interim_results=true: emit those live partials, not just finished phrases. This is what makes the text move while you talk.
- smart_format=true + punctuate=true: numbers, dates, and punctuation get formatted for you, so finals read like real sentences.
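One way to keep those parameters readable is to build the query string from an object instead of concatenating it by hand. A minimal sketch, assuming a hypothetical helper named buildListenUrl (the name and its overrides argument are illustrative, not part of Deepgram's SDK); URLSearchParams is a standard API in browsers and Node:

```javascript
// Hypothetical helper: assemble the streaming URL from named parameters.
function buildListenUrl(overrides = {}) {
  const params = new URLSearchParams({
    model: "nova-3",
    encoding: "linear16",     // raw 16-bit signed PCM
    sample_rate: "16000",     // must match the audio actually sent
    interim_results: "true",  // live partials, not just finals
    smart_format: "true",
    punctuate: "true",
    ...overrides,             // caller-supplied values replace the defaults
  });
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}

console.log(buildListenUrl());
```

With no overrides this produces the same URL as the one written out above, and the defaults sit in one place if a parameter needs to change later.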
Authentication is a single header: Authorization: Token YOUR_DEEPGRAM_API_KEY.
(Browsers can't set custom headers on a WebSocket, which is the first hint that the toy
version is taking a shortcut — more on that at the end. The standard fix is the
token sub-protocol, shown in the code.)
Once connected, you just send raw binary PCM frames down the socket — no envelope, no base64, no JSON wrapper. The audio is the message. Deepgram sends back JSON text frames that look like this (trimmed):
{
  "type": "Results",
  "is_final": false,
  "channel": {
    "alternatives": [
      { "transcript": "real time speech to", "confidence": 0.98 }
    ]
  }
}
Two fields carry the whole interaction:
- channel.alternatives[0].transcript: the text. (An empty string is normal between phrases; ignore it.)
- is_final: false means this is a live partial that will be replaced by the next message; true means this segment is finalized and safe to commit permanently.
When you're done, send a JSON control message {"type":"CloseStream"} so Deepgram
flushes any pending audio and returns the last final before the socket closes.
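So the display rule is: append each final to the committed text, replace the partial with each interim result, and render committedFinals + currentPartial on every message.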
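That rule can be sketched as a pure function, separate from any DOM or socket code, which makes it easy to reason about and to test. The function names applyMessage and renderText are illustrative; the state fields committedFinals and currentPartial mirror the wording above.

```javascript
// Fold one Deepgram text frame into the transcript state.
// state: { committedFinals: string[], currentPartial: string }
function applyMessage(state, rawJson) {
  const msg = JSON.parse(rawJson);
  if (msg.type !== "Results") return state;          // ignore non-transcript frames
  const text = msg.channel?.alternatives?.[0]?.transcript ?? "";
  if (!text) return state;                           // empty between phrases
  if (msg.is_final) {
    // Final: append permanently and clear the live guess.
    return { committedFinals: [...state.committedFinals, text], currentPartial: "" };
  }
  // Interim: replace the live guess, leave committed text untouched.
  return { ...state, currentPartial: text };
}

// What the textarea should show: all finals plus the single live partial.
function renderText(state) {
  return [...state.committedFinals, state.currentPartial].filter(Boolean).join(" ");
}

// Small walk-through with hand-written frames.
const frame = (transcript, is_final) =>
  JSON.stringify({ type: "Results", is_final, channel: { alternatives: [{ transcript }] } });

let state = { committedFinals: [], currentPartial: "" };
state = applyMessage(state, frame("real time speech to", false));
state = applyMessage(state, frame("Real-time speech to text.", true));
state = applyMessage(state, frame("next phrase", false));
console.log(renderText(state)); // Real-time speech to text. next phrase
```

Keeping finals in an array rather than one growing string is a design choice for the sketch only; a single string works just as well.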
Working code
Here's the whole thing in one file. Open it in a browser, paste your key where marked, click the
mic, and talk. The AudioWorklet path is used when available; otherwise it falls back to
ScriptProcessor. The partial-vs-final handling is the heart of it — note how
render() always draws committed text plus the single live partial.
<!DOCTYPE html>
<html lang="en">
<head>
  <base href="/">
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Deepgram Live Dictation</title>
  <style>
    body { font-family: system-ui, sans-serif; max-width: 720px; margin: 40px auto; padding: 0 16px; }
    textarea { width: 100%; height: 220px; font-size: 16px; padding: 12px; box-sizing: border-box; }
    #mic { font-size: 18px; padding: 10px 20px; margin-bottom: 12px; cursor: pointer; }
    #mic.on { background: #c0392b; color: #fff; }
    .partial { color: #888; }
  </style>
</head>
<body>
  <h1>Live Dictation</h1>
  <button id="mic">🎤 Start</button>
  <textarea id="out" placeholder="Click the mic and start talking..."></textarea>
  <script>
    // !! Toy demo only: never ship your real key to the browser. See the article. !!
    const DEEPGRAM_KEY = "YOUR_DEEPGRAM_API_KEY";

    const micBtn = document.getElementById("mic");
    const out = document.getElementById("out");

    let committed = ""; // finalized text (append-only)
    let partial = "";   // current live guess (replaced each message)
    let audioCtx, stream, node, socket, running = false;
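
    function render() {
      // Always: everything committed so far, plus the single live partial.
      // No joining space when either side is empty or the committed text ends a line.
      const sep = committed && partial && !committed.endsWith("\n") ? " " : "";
      out.value = committed + sep + partial;
      out.scrollTop = out.scrollHeight;
    }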
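
    // Float32 [-1,1] @ inputRate -> Int16 PCM @ 16000 Hz (nearest-sample decimation).
    // readPos carries the fractional read position across chunks, so small chunks
    // (an AudioWorklet delivers 128 frames at a time) don't lose samples at each boundary.
    let readPos = 0;
    function downsampleToInt16(float32, inputRate) {
      const ratio = inputRate / 16000;
      const outLen = Math.max(0, Math.ceil((float32.length - readPos) / ratio));
      const int16 = new Int16Array(outLen);
      for (let i = 0; i < outLen; i++) {
        // Clamp the index to guard against floating-point rounding at the chunk edge.
        const idx = Math.min(float32.length - 1, Math.floor(readPos + i * ratio));
        const s = Math.max(-1, Math.min(1, float32[idx]));
        int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
      }
      readPos += outLen * ratio - float32.length; // offset of the next sample in the next chunk
      return int16;
    }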

    async function start() {
      stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      audioCtx = new (window.AudioContext || window.webkitAudioContext)();
      const source = audioCtx.createMediaStreamSource(stream);
      const inputRate = audioCtx.sampleRate;

      // Build the Deepgram streaming URL.
      const url = "wss://api.deepgram.com/v1/listen"
        + "?model=nova-3&encoding=linear16&sample_rate=16000"
        + "&interim_results=true&smart_format=true&punctuate=true";

      // Browsers can't set headers on a WebSocket, so auth rides the sub-protocol.
      socket = new WebSocket(url, ["token", DEEPGRAM_KEY]);
      socket.binaryType = "arraybuffer";

      socket.onmessage = (evt) => {
        const msg = JSON.parse(evt.data);
        if (msg.type !== "Results") return;
        const text = msg.channel?.alternatives?.[0]?.transcript || "";
        if (!text) return;
        if (msg.is_final) {
          // Ink it in. Skip the joining space at the start of the text or of a new line.
          const sep = committed && !committed.endsWith("\n") ? " " : "";
          committed += sep + text;
          partial = "";   // clear the pencil
        } else {
          partial = text; // live preview
        }
        render();
      };

      // Tap raw audio and stream PCM once the socket is open.
      const pump = (float32) => {
        if (socket.readyState === WebSocket.OPEN) {
          socket.send(downsampleToInt16(float32, inputRate).buffer);
        }
      };

      if (audioCtx.audioWorklet) {
        const blob = new Blob([`
          class Tap extends AudioWorkletProcessor {
            process(inputs) {
              const ch = inputs[0][0];
              if (ch) this.port.postMessage(ch.slice(0));
              return true;
            }
          }
          registerProcessor('tap', Tap);
        `], { type: "application/javascript" });
        await audioCtx.audioWorklet.addModule(URL.createObjectURL(blob));
        node = new AudioWorkletNode(audioCtx, "tap");
        node.port.onmessage = (e) => pump(e.data);
        source.connect(node).connect(audioCtx.destination);
      } else {
        // Fallback for browsers without AudioWorklet.
        node = audioCtx.createScriptProcessor(4096, 1, 1);
        node.onaudioprocess = (e) => pump(e.inputBuffer.getChannelData(0));
        source.connect(node);
        node.connect(audioCtx.destination);
      }
    }

    async function stop() {
      // Ask Deepgram to flush pending audio and return the last final.
      if (socket && socket.readyState === WebSocket.OPEN) {
        socket.send(JSON.stringify({ type: "CloseStream" }));
      }
      if (node) node.disconnect();
      if (audioCtx) await audioCtx.close();
      if (stream) stream.getTracks().forEach(t => t.stop());
      // One newline between dictation runs (never a run of blank lines).
      if (committed && !committed.endsWith("\n")) committed += "\n";
      partial = "";
      render();
    }
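
    micBtn.onclick = async () => {
      micBtn.disabled = true; // ignore clicks while starting or stopping
      try {
        if (running) await stop(); else await start();
        running = !running;
      } catch (err) {
        // Most likely the microphone permission was denied.
        console.error(err);
        running = false;
      } finally {
        micBtn.classList.toggle("on", running);
        micBtn.textContent = running ? "⏹ Stop" : "🎤 Start";
        micBtn.disabled = false;
      }
    };
  </script>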
</body>
</html>
That's it — a complete, runnable dictation box. The pieces map one-to-one onto the chain we
drew: getUserMedia → AudioContext tap → downsampleToInt16
→ socket.send(pcm) → onmessage sorting finals from partials.
The production caveat (read this)
The demo above puts DEEPGRAM_KEY straight into the page. Do not ship
that. Anyone who opens dev tools can lift your key and run up your bill. It's perfect for a
local experiment and disqualifying for production.
The fix is small and standard: run a tiny server-side WebSocket proxy. The
browser opens a socket to your server (no key in sight) and streams the same PCM frames to
it. Your server holds the Deepgram key, opens its own socket to Deepgram, and relays bytes in both
directions — audio up, transcripts down. The key never leaves the backend, and you get a natural
place to add auth, rate limits, and usage metering per user. In Node it's barely more than two
ws connections wired together with .on('message') handlers piping each to
the other.
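A minimal sketch of that relay, written against the shape of sockets from the ws package: on(event, handler), a message handler that receives (data, isBinary), send(data, { binary }), and close(). The function name relay is illustrative, and the commented wiring at the bottom (port, path, environment variable name, DEEPGRAM_URL) is an assumption about how you might host it, not a tested deployment.

```javascript
// Pipe messages between a browser-facing socket and an upstream Deepgram socket.
function relay(client, upstream) {
  const pending = [];        // audio that arrives before the upstream socket is open
  let upstreamOpen = false;

  upstream.on("open", () => {
    upstreamOpen = true;
    for (const [data, isBinary] of pending) upstream.send(data, { binary: isBinary });
    pending.length = 0;
  });

  // Audio (binary) and control messages such as CloseStream (text) go up.
  client.on("message", (data, isBinary) => {
    if (upstreamOpen) upstream.send(data, { binary: isBinary });
    else pending.push([data, isBinary]);
  });

  // Transcripts (JSON text frames) come down.
  upstream.on("message", (data, isBinary) => client.send(data, { binary: isBinary }));

  // When either side goes away, close the other.
  client.on("close", () => upstream.close());
  upstream.on("close", () => client.close());
  client.on("error", () => upstream.close());
  upstream.on("error", () => client.close());
}

// Possible wiring with the ws package (assumed, shown as comments so this block has no dependencies):
//
//   const { WebSocketServer, WebSocket } = require("ws");
//   const wss = new WebSocketServer({ port: 8080, path: "/stt" });
//   wss.on("connection", (client) => {
//     const upstream = new WebSocket(DEEPGRAM_URL, {
//       headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
//     });
//     relay(client, upstream);
//   });
```

This is also the natural place to check that the connecting user is allowed to use the service before opening the upstream socket.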
Once that proxy is in place the client code barely changes: swap the Deepgram URL for your own
wss://yourapp.com/stt, drop the key and the token sub-protocol, and
everything else — the downsampling, the partial-vs-final rendering, the CloseStream on
stop — stays identical. The hard part was never the audio. It was knowing which two fields to watch.
Code samples in this article are released under the MIT License — Copyright © 2026 Trent Tompkins. Deepgram is a trademark of its respective owner; this is an independent tutorial.