During the past few years, I have worked on the design and modeling of Spoken Dialog Systems (SDS). After LLMs came in, the ML-side work has gone down, along with some of the architectural complexity. But I still feel that most current designs are sub-optimal considering the true north of natural[1] conversations. This post is an attempt to cover an event-based system that allows a high degree of flexibility and is future-proof enough to support all the conversational features we would expect from machine conversationalists. It's mostly a high-level, everything-is-an-event design idea rather than details of implementation.
1. Turn-Taking
When we talk about the design or architecture of an SDS, we refer to the implementation of turn-taking dynamics between humans and the machine. In earlier systems we used to have half-duplex conversations based on a Finite State Machine (FSM) that looked like the following simple loop:
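Roughly: listen, transcribe, understand, respond, speak, and go back to listening. A minimal sketch of that loop, with hypothetical listen, transcribe, understand, respond, and speak functions standing in for the audio front end, ASR, SLU/DST, dialog policy, and TTS:

```python
def run_sds(listen, transcribe, understand, respond, speak):
    # Half-duplex: the system is either listening or speaking, never both.
    state = {}  # dialog state carried across turns
    while True:
        audio = listen()                 # block until the user finishes their turn
        text = transcribe(audio)         # ASR
        state = understand(text, state)  # SLU / dialog state tracking
        reply = respond(state)           # dialog policy / NLG (or an LLM nowadays)
        speak(reply)                     # TTS, played to completion before listening resumes
```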
To implement this, you would normally not follow the loop's flow directly (like in a stream-processing system) but build a central orchestrator that coordinates everything from audio in to audio out. As ML models improved, we started moving towards accepting streams and allowing interruptions at a few points within the orchestrator to support barge-ins and other backchannels. Systems like vocode do something like this.
2. Embracing Events
While the designs used till now are pretty practical and not overly complicated[2], there are many accidental complexities that pop up once you start adding more advanced features like supporting multiple speakers, short-circuit processing, deliberation, etc.
A more natural design is to organize processing units on a hierarchical event bus, allowing distributed, concurrent, and modular processing. This is a little like the Actor model[3]. There are two core concepts in this design: the event buses and the actors. The buses lay down the foundation of all categories of signals that we care about in a machine conversation. On top of this, you add actors to do the actual processing.
2.1. Buses
Here you have buses at various levels: events start as audio signals at the bottom, move up through text and semantics, and come back down to emit audio signals from the SDS.
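Concretely, the levels could be named something like the following; the names are illustrative, not prescriptive:

```python
from enum import Enum, auto

class BusLevel(Enum):
    AUDIO_IN = auto()   # raw microphone frames
    SEGMENTS = auto()   # VAD / diarized audio segments
    LEXICAL = auto()    # partial and final transcripts
    SEMANTIC = auto()   # intents, dialog acts, LLM-level messages
    AUDIO_OUT = auto()  # synthesized audio waiting to be played
```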
Actors subscribe to these buses to listen for and emit events. At times there will be concurrency bottlenecks, which effectively look like interruptible priority queues. For example, the audio coming out of the SDS has a single-concurrency constraint: only one thing can play at a time. If someone pushes a higher-priority signal, say a background computation has finished and the SDS has to speak the result, the device plays that before playing anything else. The decision to get back to the original, interrupted audio depends on the context and is taken up by another actor.
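One way to picture that single output bottleneck is an interruptible priority queue feeding the playback actor. The sketch below is illustrative, with made-up names:

```python
import heapq
import itertools

class OutputAudioQueue:
    """Single-concurrency output: only one audio event plays at a time, and a
    higher-priority push lands ahead of everything already waiting."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # FIFO tie-breaker within a priority

    def push(self, priority, audio_event):
        # Smaller number = higher priority, following heapq convention.
        heapq.heappush(self._heap, (priority, next(self._order), audio_event))

    def pop(self):
        # The playback actor pulls from here. Whether an interrupted clip gets
        # re-pushed later is decided by a separate, context-aware actor.
        return heapq.heappop(self._heap)[2] if self._heap else None
```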
2.2. Actors
Actors subscribe to buses, process information, and emit messages. There is no restriction on input and output levels. Although there will be a natural organization of actors up and down the hierarchy, there could be an actor that listens to both audio-segment events and lexical events.
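Under the hood, the two abstractions could be as small as the following sketch, assuming an asyncio-style runtime; the class and method names are mine, not a prescribed API:

```python
import asyncio

class Bus:
    """A named event bus; every published event is fanned out to all subscribers."""
    def __init__(self, name):
        self.name = name
        self._subscribers = []

    def subscribe(self):
        queue = asyncio.Queue()
        self._subscribers.append(queue)
        return queue

    async def publish(self, event):
        for queue in self._subscribers:
            await queue.put(event)


class Actor:
    """Subscribes to some buses, processes events, and emits on others."""
    def __init__(self, in_buses, out_buses):
        self.in_queues = [bus.subscribe() for bus in in_buses]
        self.out_buses = out_buses

    async def handle(self, event):
        raise NotImplementedError

    async def run(self):
        # Pump every subscribed bus concurrently; nothing stops an actor from
        # listening to, say, the audio-segment bus and the lexical bus at once.
        async def pump(queue):
            while True:
                await self.handle(await queue.get())
        await asyncio.gather(*(pump(q) for q in self.in_queues))
```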
The set of actors you use is all that defines your stack, and working with individual actors is really easy. Here are a few examples:
- For short-circuiting, add an actor that listens to the stream from the lexical bus and runs a small model to detect milestones of interest (partial SLU). Once the actor decides something should be done, it can send an interrupt (see the sketch after this list).
- Multiple speakers are just color-coded events handled concurrently by actors, until the point where you need aggregating actors to coordinate the merged output.
- Responding to user interruptions and using other backchannels is also easier here since events are given first-class citizenship. For example, an actor could listen to all audio channels for audience silence and then signal something that makes the SDS pause and say "umm … are you all with me? Should I clarify the last part a little more?".
- Running a background thinking process just involves invoking another actor by putting a message on the appropriate bus.
- Any kind of async paralinguistic analysis or self-check is again just an actor added to the stack.
- If you want to move to an ASR-free system, just remove your current ASR-bound actors, add an actor that uses Speech LLMs, and have it listen and emit on the right channels.
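As an example, here is what the short-circuiting actor from the first bullet could look like, building on the Bus/Actor sketch from section 2.2; the partial-SLU model, its detect method, and the event shape are all assumptions:

```python
class ShortCircuitActor(Actor):
    """Watches partial transcripts on the lexical bus; when the partial-SLU
    model flags a milestone, publishes an interrupt on the control bus."""

    def __init__(self, lexical_bus, control_bus, partial_slu_model):
        super().__init__(in_buses=[lexical_bus], out_buses=[control_bus])
        self.model = partial_slu_model

    async def handle(self, event):
        # `event` is assumed to be a dict like {"text": "...partial transcript..."}.
        milestone = self.model.detect(event["text"])
        if milestone is not None:
            await self.out_buses[0].publish(
                {"type": "interrupt", "reason": milestone, "source": "short-circuit"}
            )
```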
Other than these, you can also do things like mini-batch processing at the actor level, which helps in utilizing resources like GPUs better. This system is also more experiment-friendly, since you can try out new actors in shadow mode much more easily.
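As a rough sketch, an actor could drain its queue into small batches before calling a hypothetical predict_batch on its model:

```python
import asyncio

async def batched_handler(queue, model, max_batch=16, max_wait=0.05):
    """Collect events for up to `max_wait` seconds (or until `max_batch` events
    arrive), then run one batched inference call instead of many single ones."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]            # block for the first event
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = model.predict_batch(batch)   # hypothetical batched call
        for event, result in zip(batch, results):
            ...                                # emit each result on the right bus
```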
There will, of course, be some trade-offs here, but my hunch is that the gains more than offset any loss. Further insights can only come once there is a working implementation, which I have not yet been able to work on. In fact, writing this post was a way to make some progress rather than staying at zero. I have some more free time now, so I might be able to do something beyond this relatively abstract article.
Footnotes:
1. I believe naturalness has more to do with ergonomics than likeness to humans.
2. Thanks in part to the simplification brought in by LLMs handling most of SLU and DST.
3. Probably also a little inspired by Minsky's Society of Mind, but that's inspiration at best.