Building a Voice Agent on Cloudflare Workers, Samuel Cala

I built three versions of the same voice agent before I was happy with any of them.

The first used a managed voice platform. Fast to ship, hard to extend. Every interesting decision I wanted to make ended at someone else’s API surface.

The second self-orchestrated LiveKit Cloud, with a Python container running the in-call loop. Flexible, but five moving parts in three runtimes. Two SDKs, two deploy pipelines, two observability stories. Every bug needed a triage step before it could be a fix.

The third runs end-to-end on Cloudflare Workers and Twilio Media Streams. One deploy. One logger. One bill. The agent itself is a Durable Object, and the audio loop is a WebSocket handler. That is the version I want to talk about. Not because it is the best at every metric, because it is not, but because the simplifications are real and the gotchas are not in any tutorial I could find.

This post is the writeup I wish I had read on day one. The architecture, the loop, the failure modes, and an honest table of what you give up.

The architecture, in one diagram

Here is the whole thing, end to end, for an outbound call to an appointment reminder bot.

sequenceDiagram
    participant Scheduler as Scheduler Worker
    participant Harness as Harness Worker
    participant Twilio
    participant DO as DO(CallSid)
    participant AI as Workers AI

    Scheduler->>Harness: POST /trigger-outbound
    Harness->>Twilio: REST Calls.create
    Twilio-->>Harness: CallSid
    Harness->>DO: store call context (idFromName(CallSid))
    Twilio->>Harness: POST /twilio/voice (TwiML request)
    Harness-->>Twilio: <Stream url=".../stream/{CallSid}"/>
    Twilio->>DO: WebSocket /twilio/stream/{CallSid}
    loop turn loop
      Twilio-->>DO: mulaw audio frames
      DO->>AI: Flux STT (WS)
      AI-->>DO: transcript
      DO->>AI: Llama 3.3 70B
      AI-->>DO: reply text
      DO->>AI: Aura-2 TTS (sentence chunks)
      AI-->>DO: PCM16 audio
      DO-->>Twilio: mulaw frames
    end
    DO->>Scheduler: POST report (HMAC)

Two design choices in that diagram are doing most of the work, and the rest of the post is mostly about why.

First, the same Durable Object holds the call context AND the audio WebSocket. There is no handoff between a “context store” and a “media handler”, because they are the same object. We get there by addressing the DO with idFromName(CallSid).

Second, the STT, LLM, and TTS triad runs on Workers AI bindings. There is no external auth flow, no separate vendor SDK, no extra deploy. It is env.AI.run(...) and a WebSocket to a model.

“But Workers cannot do realtime audio”

This is the assumption I had to unlearn, and it is half right.

Workers cannot host a long-lived WebRTC media participant. They are not the right runtime for a participant that has to negotiate ICE, manage SRTP, and stay in a media plane for minutes. That is real, and that is why “LiveKit-native on Workers” is a research project, not a migration.

But Twilio Media Streams does not need any of that. Twilio runs the telephony plane. What it gives you, over a plain WebSocket, is mulaw audio frames at 8kHz, 20 milliseconds each, base64-encoded inside a JSON envelope. That is a normal WebSocket. A Worker plus a Durable Object handles a normal WebSocket fine.
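
To make that envelope concrete, here is a minimal sketch in TypeScript. The field names follow Twilio's Media Streams messages, but only the fields this agent reads are modelled, and the type name and both helpers (decodePayload, mediaBytes) are illustrative rather than part of the harness.

```typescript
// Only the fields used by the agent are modelled; Twilio sends more.
type TwilioMessage =
  | { event: "connected" }
  | { event: "start"; start: { streamSid: string; callSid: string } }
  | { event: "media"; media: { payload: string } } // payload: base64 mulaw, 8 kHz mono
  | { event: "stop" };

// Hypothetical helper: base64 payload -> raw mulaw bytes (one byte per sample).
function decodePayload(payload: string): Uint8Array {
  const binary = atob(payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Hypothetical helper: mulaw bytes of a media message, or null for any other event.
function mediaBytes(raw: string): Uint8Array | null {
  const msg = JSON.parse(raw) as TwilioMessage;
  return msg.event === "media" ? decodePayload(msg.media.payload) : null;
}
```

A 20 millisecond frame should decode to 160 bytes: 8,000 samples per second, one byte per sample.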

The trick is to not try to be a WebRTC participant. Let Twilio be the telephony plane. You speak WebSocket.

Once you accept that framing, everything else falls out. The DO is the place where call state lives. The Worker is the dispatcher. Workers AI is the model layer. There is no extra runtime.

Outbound dial: Twilio REST plus TwiML

Outbound calls in this stack are a two-step dance. You ask Twilio to dial someone, and you tell Twilio where to fetch instructions when the call answers.

Step one, place the call from a Worker:

// harness/src/twilio-rest.ts
export async function placeCall(
  env: Env,
  to: string,
  callbackHost: string,
): Promise<string> {
  const auth = btoa(`${env.TWILIO_ACCOUNT_SID}:${env.TWILIO_AUTH_TOKEN}`);
  const body = new URLSearchParams({
    To: to,
    From: env.TWILIO_PHONE_NUMBER,
    Url: `https://${callbackHost}/twilio/voice`,
    Method: "POST",
    StatusCallback: `https://${callbackHost}/twilio/status`,
    StatusCallbackEvent: "initiated ringing answered completed",
  });

  const res = await fetch(
    `https://api.twilio.com/2010-04-01/Accounts/${env.TWILIO_ACCOUNT_SID}/Calls.json`,
    {
      method: "POST",
      headers: {
        Authorization: `Basic ${auth}`,
        "content-type": "application/x-www-form-urlencoded",
      },
      body,
    },
  );

  if (!res.ok) throw new Error(`twilio create failed: ${res.status}`);
  const json = await res.json<{ sid: string }>();
  return json.sid;
}

Twilio responds with a CallSid. Hold onto it. It is the most important string in the whole system.

Step two, the TwiML handler. When the callee answers, Twilio fetches /twilio/voice from your Worker, expects XML back, and follows whatever instructions you give it. We give it a <Stream> instruction that points at our own WebSocket endpoint.

// harness/src/index.ts
app.post("/twilio/voice", async (c) => {
  const form = await c.req.formData();
  const callSid = form.get("CallSid") as string;
  const host = c.req.header("host");

  return c.text(
    `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${host}/twilio/stream/${callSid}" />
  </Connect>
</Response>`,
    200,
    { "content-type": "text/xml" },
  );
});

The trick that took me longer than it should have: the CallSid arrives in the form body of /twilio/voice, and you embed it directly in the WebSocket URL. That is what lets us route the WebSocket to the same DO that already has context. The Twilio docs mention you can interpolate values into <Stream url>, but they do not stop to explain why you would want to. This is why.
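
The round trip can be sketched as a pair of pure functions: one that writes the CallSid into the <Stream> URL, and one that reads it back from the WebSocket path. Both helper names are hypothetical, and the SID pattern ("CA" plus 32 hex characters) is an assumption about Twilio's format that is worth checking against your own traffic.

```typescript
// Assumption: Twilio Call SIDs look like "CA" followed by 32 hex characters.
const CALL_SID_PATTERN = /^CA[0-9a-fA-F]{32}$/;

// Hypothetical helper: build the <Connect><Stream> response for one call.
function streamTwiml(host: string, callSid: string): string {
  // Refuse anything that is not SID-shaped so raw form input never reaches the XML.
  if (!CALL_SID_PATTERN.test(callSid)) {
    throw new Error(`unexpected CallSid: ${callSid}`);
  }
  const url = `wss://${host}/twilio/stream/${callSid}`;
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<Response>`,
    `  <Connect>`,
    `    <Stream url="${url}" />`,
    `  </Connect>`,
    `</Response>`,
  ].join("\n");
}

// Hypothetical helper: recover the CallSid on the WebSocket side of the round trip.
function callSidFromPath(pathname: string): string | null {
  const match = /^\/twilio\/stream\/(CA[0-9a-fA-F]{32})$/.exec(pathname);
  return match?.[1] ?? null;
}
```

Validating the shape on both sides is a small cost, and it means a malformed form field fails at the TwiML step instead of creating a stray DO.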

The Durable Object is the agent

The WebSocket comes back to your Worker at /twilio/stream/:callSid. From there, you forward it to a DO addressed by that exact CallSid.

// harness/src/index.ts
app.get("/twilio/stream/:callSid", async (c) => {
  const id = c.env.TWILIO_AGENT.idFromName(c.req.param("callSid"));
  const stub = c.env.TWILIO_AGENT.get(id);
  return stub.fetch(c.req.raw); // forwards the upgrade
});

The DO that receives the upgrade is, by construction, the same DO that was handed the call context as soon as Twilio returned the CallSid, before the callee could answer:

// harness/src/index.ts
app.post("/internal/store-context", async (c) => {
  const ctx = await c.req.json<CallContext>();
  const id = c.env.TWILIO_AGENT.idFromName(ctx.callSid);
  await c.env.TWILIO_AGENT.get(id).fetch(
    new Request("https://do/store", {
      method: "POST",
      body: JSON.stringify(ctx),
    }),
  );
  return c.text("ok");
});

Why idFromName(CallSid) instead of newUniqueId()? Because the DO that gets the WebSocket from Twilio and the DO that stored the context need to be the same object, and the only string both sides know is the CallSid. Hashing it deterministically through idFromName gives both sides the same DO ID without having to share a lookup table.

I tried KV with a TTL first. Do not do this. KV is eventually consistent. The WebSocket upgrade can arrive before the context write has propagated, and you will find yourself debugging an empty context object with no obvious cause. Even when you tighten the window, you are fighting a race that does not need to exist. The DO route gives you strong consistency for free.
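
The DO side of that write is not shown in the listings. A minimal sketch, assuming the context lives under a single storage key: the key name, the two function names, and the narrow ContextStorage interface are illustrative, and CallContext lists only the fields this post reads.

```typescript
// The slice of the Durable Object storage API this sketch relies on.
interface ContextStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
}

// Only the fields the post actually reads; the real CallContext may carry more.
interface CallContext {
  callSid: string;
  sessionId: string;
  systemPrompt: string;
}

const CONTEXT_KEY = "context"; // assumed key name

// Would run in the DO's fetch handler when the request is the POST to /store.
async function storeContext(storage: ContextStorage, ctx: CallContext): Promise<void> {
  await storage.put(CONTEXT_KEY, ctx);
}

// Would run at the top of runLoop. Fails loudly instead of running with an empty context.
async function loadContext(storage: ContextStorage): Promise<CallContext> {
  const ctx = await storage.get<CallContext>(CONTEXT_KEY);
  if (!ctx) throw new Error("call context missing for this CallSid");
  return ctx;
}
```

Inside the DO, storage would be this.ctx.storage. Keeping the functions behind a narrow interface also makes them testable with an in-memory Map.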

The turn loop

Inside the DO, the loop is conceptually small. Twilio frames come in, you decode them to PCM16, you stream that into Flux STT over a WebSocket, you take each utterance and run it through Llama 3.3 70B, you take the reply and stream it through Aura-2 TTS sentence by sentence, you re-encode each chunk back to mulaw, and you push it to Twilio.

// harness/src/twilio-agent.ts
export class TwilioAgent extends DurableObject<Env> {
  async fetch(req: Request): Promise<Response> {
    const [client, server] = Object.values(new WebSocketPair());
    server.accept();
    this.runLoop(server).catch((e) => server.close(1011, String(e)));
    return new Response(null, { status: 101, webSocket: client });
  }

  async runLoop(twilioWs: WebSocket) {
    const ctx = await this.loadContext();
    const stt = await openFluxSTT(this.env);
    const transcript: TurnLog[] = [];
    let streamSid: string | null = null;

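    // Media can arrive before "start" (see gotcha 6): hold those frames
    // instead of dropping them, and flush once the streamSid is known.
    const pending: ReturnType<typeof mulawToPcm16>[] = [];

    twilioWs.addEventListener("message", (ev) => {
      const msg = JSON.parse(ev.data as string);
      if (msg.event === "start") {
        streamSid = msg.start.streamSid;
        for (const early of pending.splice(0)) stt.send(early);
        return;
      }
      if (msg.event === "media") {
        const pcm16 = mulawToPcm16(b64.decode(msg.media.payload));
        if (streamSid) stt.send(pcm16);
        else pending.push(pcm16);
      }
    });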

    stt.on("utterance", async (text) => {
      transcript.push({ role: "user", text });

      const reply = await runLlama(this.env, ctx.systemPrompt, transcript);
      transcript.push({ role: "assistant", text: reply });

      for await (const sentence of chunkSentences(reply)) {
        const pcm16 = await synthAura(this.env, sentence);
        const mulaw = pcm16ToMulaw(pcm16);
        for (const frame of frameChunks(mulaw, 160)) {
          twilioWs.send(
            JSON.stringify({
              event: "media",
              streamSid,
              media: { payload: b64.encode(frame) },
            }),
          );
        }
      }
    });
  }
}

Three subtle choices in that loop are worth pausing on.

Sentence chunking before TTS. Do not wait for the whole reply to finish generating. Split on sentence boundaries as the LLM streams, send each sentence to TTS as it completes, and start pushing audio while the model is still generating the next sentence. (The listing above is simplified: it awaits the full reply, then chunks it.) This is the single biggest perceptual-latency win in the whole stack. Done well, the user starts hearing the response within a couple of hundred milliseconds of finishing their turn.
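
One way to chunk a token stream is a small stateful splitter: feed it tokens as they arrive and it hands back any sentences they complete. This is a sketch, not the harness's chunkSentences; the class name is hypothetical, and the boundary rule (terminal punctuation followed by whitespace) is a deliberately naive assumption that will split after abbreviations like "Dr.".

```typescript
// Incremental sentence splitter: push LLM tokens in, get completed sentences out.
class SentenceChunker {
  private buffer = "";

  // Append one streamed token and return the sentences it completed, if any.
  push(token: string): string[] {
    this.buffer += token;
    const completed: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace. Requiring the
    // whitespace avoids splitting "3.5" when "3." and "5" arrive separately.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) completed.push(sentence);
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return completed;
  }

  // Call once the stream ends to emit whatever is left in the buffer.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Each sentence returned by push can go straight to TTS while later tokens are still arriving; flush handles a final sentence that never got trailing whitespace.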

Mulaw frames are 8kHz, 20 milliseconds, 160 bytes. That is 8,000 samples per second times 0.02 seconds, at one byte per sample. Twilio expects exactly that shape on the way back. Send bigger frames and you get audio glitches, send smaller and you waste bandwidth. The frameChunks(mulaw, 160) line is doing nothing clever; it is just slicing the buffer at the size Twilio wants.
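
A sketch of what a slicer like frameChunks could look like. The original helper is not shown, so the tail handling here is an assumption: a short final chunk is padded to a full frame with 0xFF, the mulaw byte that decodes to silence.

```typescript
const MULAW_SILENCE = 0xff; // the mulaw byte that decodes to a zero sample

// Slice a mulaw buffer into fixed-size frames (160 bytes = 20 ms at 8 kHz).
// Assumption: a short tail is padded with silence so every frame is full length.
function frameChunks(mulaw: Uint8Array, frameBytes = 160): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < mulaw.length; offset += frameBytes) {
    const frame = new Uint8Array(frameBytes).fill(MULAW_SILENCE);
    frame.set(mulaw.subarray(offset, offset + frameBytes));
    frames.push(frame);
  }
  return frames;
}
```

At this frame size a second of speech is 50 frames, so a typical sentence becomes a few hundred small WebSocket messages.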

The DO is ephemeral. It exists for the lifetime of the call, and that is fine. Do not put audit state in DO storage. Post a transcript at the end of the call, signed, to a stable backend. That is the only durable artifact.

Sending the call report back

When the call ends, Twilio sends a stop event, the WebSocket closes, and the DO has the full transcript in memory. We sign it and post it to the stable backend.

async sendReport(transcript: TurnLog[], ctx: CallContext) {
  const body = JSON.stringify({
    callSid: ctx.callSid,
    sessionId: ctx.sessionId,
    transcript,
    endedAt: new Date().toISOString(),
  });

  const sig = await hmacSha256(this.env.AGENT_SHARED_SECRET, body);

  for (const delay of [0, 1000, 4000, 16000]) {
    if (delay) await new Promise((r) => setTimeout(r, delay));
    try {
      const res = await fetch(`${this.env.API_URL}/v1/agent/call-report`, {
        method: "POST",
        headers: {
          "content-type": "application/json",
          "x-agent-signature": sig,
        },
        body,
      });
      if (res.ok) return;
    } catch {
      // fetch throws on network errors; treat that like a failed attempt and retry
    }
  }
  console.error(JSON.stringify({ event: "report_lost", callSid: ctx.callSid }));
}

The retry schedule (an immediate attempt, then retries after one, four, and sixteen seconds) is not magic. It is small enough that the DO is still alive when it fires, big enough that the receiving Worker has time to recover from a transient failure. After the last attempt, log it loudly and move on. There is no DLQ in this system, on purpose. If the report did not land, you find out from logs, you fix the cause, you replay from the audit trail you keep on the receiving side. Adding a queue would be infrastructure for a problem I have not actually had.
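
The hmacSha256 helper is not shown above. A minimal sketch using the Web Crypto API that Workers expose, assuming the signature travels as a lowercase hex string (the real encoding may differ); safeEqual is a hypothetical helper for the receiving side.

```typescript
// Sketch of hmacSha256: hex-encoded HMAC-SHA256 of `body`, keyed by `secret`.
async function hmacSha256(secret: string, body: string): Promise<string> {
  const enc = new TextEncoder();
  const key = await crypto.subtle.importKey(
    "raw",
    enc.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false, // the key does not need to be extractable
    ["sign"],
  );
  const mac = await crypto.subtle.sign("HMAC", key, enc.encode(body));
  return Array.from(new Uint8Array(mac), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Hypothetical receiver-side check: compare signatures without an early exit,
// so response timing does not reveal how many leading characters matched.
function safeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}
```

The receiver recomputes the HMAC over the raw request body and compares it with the x-agent-signature header. Signing the exact bytes that are sent matters: re-serializing the JSON on either side can change the signature.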

The gotchas no tutorial shows you

This is the section I would have wanted on day one. None of these are clever; all of them cost me hours.

1. Workers do not host long-lived WebRTC participants. If you find yourself trying to make Workers a SIP-level participant or a WebRTC peer, stop. That is the wrong path until Cloudflare ships a media-plane primitive for it. The path that works today is Twilio Media Streams: telephony stays at Twilio, your Worker speaks WebSocket. Pick a stack that matches the runtime, not the other way around.

2. Do not store call context in KV with a TTL. I keep saying it because the pull is real. KV looks like the right tool for “small piece of state I need for the next thirty seconds.” It is not. Eventual consistency plus a thirty-second window is a race waiting to happen. Use a Durable Object addressed by idFromName(CallSid), write the context via service binding as soon as Calls.create returns the CallSid, and never think about it again.

3. STT 429? Drop the frame. Audio is realtime. A retry queue in front of STT is poison: by the time the retry succeeds, the conversation has moved on, and your user is hearing transcripts of audio from three seconds ago. If STT throttles you, drop the frame and continue. The transcript will have a small hole. That is fine. The alternative is conversational chaos.

4. LLM 429 or 5xx? Speak a fallback line and end the turn. You cannot make a human wait for an exponential backoff on a phone call. Have a small canned recovery utterance in the DO (“sorry, give me a moment”), play it, log the failure, and let the user take the next turn. Hard rule: anything longer than about a second of dead air on a phone call is broken.

5. There is no turn-detector model on Workers AI. This caught me. The textbook answer to end-of-turn detection is “use a VAD plus a turn-detector model.” Workers AI does not have one, at least not at the time of writing. The pragmatic answer: lean on Flux’s eotThreshold (around 0.85), add a hard 1.5-second post-utterance silence timeout, and add a 30-second max-turn-duration cutoff in case the user goes on a monologue. Tune these per environment via env vars, not constants.

6. Twilio’s first audio frame can arrive before the start event. This one cost me an embarrassing amount of debugging. The protocol does not strictly guarantee the order I assumed. Buffer incoming frames until you have seen a start event and have a streamSid. Otherwise either your first 100 milliseconds of audio is silently dropped, or your first send back to Twilio crashes because you have no streamSid to attach.

7. Mulaw to PCM16 is a four-line function with a lookup table. Do not pull a library. The table is 256 entries. There is no algorithm to optimize, just a known mapping. Adding a dependency for something this small is a tax on every cold start.

8. Treat the report-back as the only durable artifact. The DO storage is fine for in-call state, but resist the urge to use it as your audit trail. The in-memory instance will be evicted once it goes idle. The stable backend, written to from the report-back, is the source of truth. Design as if the DO could die at any second after the call ends.

9. <Stream> URL interpolation is not optional. Twilio’s docs gloss over this. Yes, you can put the CallSid from the form body into the url attribute of the <Stream> element, and yes, it is the only sane way to route the resulting WebSocket to the right DO. Without this, you are back to maintaining a lookup table.
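
For gotcha 7, here is a sketch of the codec with no dependency. Rather than typing out the 256 entries, it builds the table once from the standard G.711 mulaw expansion. The function names match the ones used in the turn loop, but these bodies are an assumption about what those helpers do, and the encoder is included only because the loop needs the reverse direction.

```typescript
const BIAS = 0x84;  // 132, added before the logarithmic compression
const CLIP = 32635; // largest magnitude the encoder accepts

// Expand one mulaw byte to a signed 16-bit sample (G.711).
function decodeMulawByte(byte: number): number {
  const u = ~byte & 0xff; // mulaw bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return (u & 0x80) !== 0 ? -magnitude : magnitude;
}

// The 256-entry lookup table, filled once at module load.
const MULAW_TO_PCM16 = new Int16Array(256);
for (let b = 0; b < 256; b++) MULAW_TO_PCM16[b] = decodeMulawByte(b);

// Decoding is then one table lookup per byte.
function mulawToPcm16(mulaw: Uint8Array): Int16Array {
  const pcm = new Int16Array(mulaw.length);
  for (let i = 0; i < mulaw.length; i++) pcm[i] = MULAW_TO_PCM16[mulaw[i]];
  return pcm;
}

// Compress one signed 16-bit sample to a mulaw byte.
function encodeMulawSample(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0;
  const s = Math.min(Math.abs(sample), CLIP) + BIAS;
  // Find the segment: the position of the highest set bit between bit 14 and bit 7.
  let exponent = 7;
  for (let mask = 0x4000; exponent > 0 && (s & mask) === 0; mask >>= 1) exponent--;
  const mantissa = (s >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

function pcm16ToMulaw(pcm: Int16Array): Uint8Array {
  const out = new Uint8Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) out[i] = encodeMulawSample(pcm[i]);
  return out;
}
```

Note that Aura-2 output may not arrive at 8kHz; if it does not, resample before calling pcm16ToMulaw, since the encoder only changes the sample format and not the sample rate.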

Honest tradeoffs

I would mislead you if I said this stack wins on every axis. It does not. Here is what you give up by going Workers-native.

Stack piece	Workers AI choice	Heavier alternative	What you give up
STT	Flux	Deepgram Nova-3	A few WER points on noisy lines
LLM	Llama 3.3 70B	Claude Haiku 4.5, GPT-4o	Some reasoning and tool polish
TTS	Aura-2	Cartesia Sonic-2, ElevenLabs	A bit of voice naturalness
Turn detection	eotThreshold + silence	LiveKit MultilingualModel	False cuts on fast speakers
Telephony	Twilio Media Streams + NC	LiveKit + BVCTelephony NC	Audible noise on bad lines

The choice is not “which stack is best.” It is “what is your call quality budget.” If your domain is loud industrial calls or accented speech in a noisy cafe, this stack is the wrong choice and you should pay for the heavier components. If your domain is appointment reminders, support callbacks, simple confirmations on a clean line, the heavier stack is overkill and the bill will surprise you.

Run the cost spread for your expected volume before you decide. Workers AI bundled into your Workers bill versus per-minute Deepgram plus per-character Cartesia plus LiveKit minutes plus container compute is not a small difference at scale.

The reason I keep choosing this stack

The honest reason, beyond the architectural pieces, is that I like the operational shape.

One deploy command. One logger. One set of dashboards. One bill. When something breaks at 3am, I open one tail and one set of metrics, not five. When I want to add a feature, I do not negotiate with three different SDKs. When I want to roll back, I roll back one Worker.

The temptation in voice-agent land is to chain four vendors because every vendor’s blog says theirs is best at one thing. They are usually right at the micro level and wrong at the system level. The cost of integrating four best-in-class components is rarely worth the marginal gain over four good-enough ones, especially when “good enough” lives on the same platform as everything else you ship.

If you have gone the other way and stayed on the heavier stack, I would love to hear what made it worth it. The right answer depends on your use case more than on the stack, and the only honest version of this post is one that says so out loud.