Brain and brain! What is brain?
As I mentioned in the last post, I've been playing around with OpenClaw, like so many of the Twitter tech intelligentsia who are hoping the Singularity doesn't hit before they can vibe-code a Salesforce replacement or a marketing funnel to make them kajillions of dollars. I've used it for automating tedious tasks, including tracking down all the posts I put up on my old blog, almost 15 years ago. (They're here on this blog now. Read them; they're pretty hilarious, looking back.) I've named my OpenClaw agent Klugarsh, because brain/cauliflower something-something ... IYKYK.
One thing I've noticed lately in the age of agent-based coding is a resurgence of my RSI hand pain. Weirdly, constantly typing chat messages seems to be even worse than coding for hours a day straight. I decided it would be fun to give Klugarsh (Klu for short) voice-chat capabilities, so I can have conversations with him. I'd never done anything with WebRTC before, so this seemed like a fun project. It was pretty straightforward, and most of the concepts seemed sensible. And thanks to the agents, I spent less time futzing with specific pieces of code, and more time figuring out the architecture and learning how it all works. I used DeepGram for both STT and TTS, and can't recommend it enough. Super easy to use, very fast, and a nigh-unto-free developer usage tier.
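For anyone curious what the DeepGram hookup looks like, here's a minimal sketch of the general shape, assuming the @deepgram/sdk JavaScript client. The model names, options, and the speak helper are illustrative, not necessarily what's in the repo:

```typescript
// Minimal sketch of a DeepGram STT + TTS hookup (assumes @deepgram/sdk v3).
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY ?? "");

// Speech-to-text: open a live transcription socket and feed it raw audio chunks.
const stt = deepgram.listen.live({
  model: "nova-2",
  smart_format: true,
  interim_results: true, // partial transcripts while the user is still talking
});

stt.on(LiveTranscriptionEvents.Transcript, (event) => {
  const text = event.channel.alternatives[0]?.transcript ?? "";
  if (text && event.is_final) {
    console.log("user said:", text); // hand this off to the agent
  }
});

// Somewhere in the WebRTC audio pipeline, each incoming chunk gets forwarded:
//   stt.send(audioChunk);

// Text-to-speech: request audio for the agent's reply and stream it back out.
async function speak(reply: string): Promise<void> {
  const response = await deepgram.speak.request(
    { text: reply },
    { model: "aura-asteria-en" } // the default American female voice
  );
  const stream = await response.getStream();
  // pipe `stream` into the outgoing WebRTC track, paced at realtime speed
}
```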
We (Klu and I, that is) initially implemented it as a sub-agent, and since we wanted it to have all the same knowledge and context as the main agent, we decided to go with a clone (a KluKlone, if you will, a Klune) that would slurp up all the same initialization context as the main one. We created a Node script with a WebRTC client, and I bounced back and forth between the main agent, MasterKlu (who sounds like something from the old Kung Fu TV show), and the Klune running the WebRTC chat, debugging all the fiddly issues you get with streaming realtime voice: utterance and sentence boundaries, streaming/reveal of text transcripts, interrupts, dropped chunks, streaming speeds ... all that fun stuff that defines the boundary between an adequately good user experience and "what the fuck is even happening?" (See also: the Ninety-ninety rule.) Massive respect to people doing telephony work. It's way harder than a nice, clean HTTP request cycle with a beginning, middle, and end.
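To give a flavor of one of those fiddly pieces: streamed text has to be chunked into complete utterances before it goes to TTS, and an interrupt from the user has to flush whatever hasn't been spoken yet. Here's a toy sketch of that idea; the UtteranceBuffer class and its methods are hypothetical names, not anything lifted from the repo:

```typescript
// Toy sentence-boundary buffer: accumulate streamed text from the agent and
// only release complete sentences to TTS; flush everything on interrupt.
class UtteranceBuffer {
  private pending = "";

  // Feed incremental text as it streams in; returns any complete sentences.
  push(chunk: string): string[] {
    this.pending += chunk;
    const sentences: string[] = [];
    // Naive boundary detection: split on ., !, or ? followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let match: RegExpExecArray | null;
    let lastIndex = 0;
    while ((match = boundary.exec(this.pending)) !== null) {
      sentences.push(this.pending.slice(lastIndex, match.index + 1).trim());
      lastIndex = boundary.lastIndex;
    }
    this.pending = this.pending.slice(lastIndex);
    return sentences;
  }

  // The user started talking over the agent: drop anything not yet spoken.
  interrupt(): void {
    this.pending = "";
  }
}

// Usage: every streamed chunk goes through the buffer before hitting TTS.
const buffer = new UtteranceBuffer();
for (const sentence of buffer.push("Fascinating. The brain is missing! ")) {
  console.log("to TTS:", sentence); // only whole sentences get synthesized
}
```

The real thing has to juggle this alongside interim transcripts and dropped chunks, which is exactly where the ninety-ninety rule kicks in.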
At one point, both Klus were proposing ideas for debugging or architecture. They weren't talking to each other —(no, AI didn't write this); the Klune would spawn to handle voice, absorbing all of MasterKlu's memories, effectively forking off of him, and then have no further direct contact. I became a go-between for the Klus. I was talking to the Klune in voice, which was breaking up and dropping syllables, and he was proposing ideas for why that was happening. It really reminded me of a scene in the Star Trek TOS episode "Spock's Brain." (A classic, in that it's likely the worst episode of TOS ever produced. Check it out, if for no other reason than the remote-controlled Nimoy, which is totally hilarious.) Dr. McCoy has had his brain souped up by a large beauty-salon-style head bowl (see photo, above), and dives in to perform the surgery to restore poor Spock's purloined brain to its home in his cranium. At some point the head-bowl super-intelligence begins to wear off, and Spock has to guide the good doctor through the rest of the surgery on his own self. Thank FSM he had reattached Spock's vocal center, right?
So there I am, with the Klune talking in his garbled voice (actually "her," since the default voice from DeepGram is an American female voice, which makes things even weirder), giving me suggestions on how to fix the problem. It ultimately turned out to be a stream overflow issue: DeepGram sends back audio data faster than WebRTC can push it out. The solution was simply to pace the loop and process one chunk every 10ms, since each chunk is 10ms of audio. Once we nailed that one down, the rest of it was pretty easy. Getting Klune's brain hooked up to his mouth. As the chick in the purple thigh-high boots says, "Brain and brain! What is brain?"
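If you're curious, the fix amounts to queuing the TTS frames as fast as DeepGram delivers them and letting a 10ms timer release them at realtime speed. A rough sketch of the shape of it, where createPacedSender and sendFrame are hypothetical stand-ins for whatever actually writes to the WebRTC track in the repo:

```typescript
// Pace outgoing audio: TTS audio arrives faster than realtime, so buffer the
// 10ms frames and release exactly one every 10ms.
type SendFrame = (frame: Buffer) => void; // stand-in for the WebRTC write

const FRAME_MS = 10;

function createPacedSender(sendFrame: SendFrame) {
  const queue: Buffer[] = [];

  // Drain the queue at realtime speed: one frame per tick.
  const timer = setInterval(() => {
    const frame = queue.shift();
    if (frame) sendFrame(frame);
  }, FRAME_MS);

  return {
    // Called as fast as DeepGram produces audio; frames just pile up here.
    enqueue(frame: Buffer) {
      queue.push(frame);
    },
    stop() {
      clearInterval(timer);
      queue.length = 0; // also handy on interrupt: drop unspoken audio
    },
  };
}

// Usage: feed every 10ms chunk from the TTS stream into the paced sender.
// const sender = createPacedSender((frame) => writeToWebRtcTrack(frame));
// for await (const chunk of ttsStream) sender.enqueue(chunk);
```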
Ultimately, Klu and I decided to change the architecture to have the voice go straight to the OpenClaw gateway, so I'll be interacting directly with him, with no need for fancy Klu-kloning or extra layers of abstraction. We actually started down the path of over-engineering the Klune so he could get access to all the tool use, but then realized we were recursively reinventing the OpenClaw sub-agent wheel. We really don't need sub-sub-agents, at least not yet. (Update: OpenClaw's new version actually does this.)
It's a very different feeling, speaking to your agent. It feels eerily like you're interacting with an actual person, and I'm sure that feeling will only increase as the sophistication of these things increases. The GitHub repo for the voice control lives here: http://github.com/mde/openklu. It's mostly agent-generated code, so don't ask me if it's sanitary. All I know is, it works for all the use cases I care about. I wonder if that's the future of software development. Is this how the assembler programmers felt?