emacs programming 100daystooffload writing llm speech

There is a wide range of input mechanisms for computers, starting with keyboards (which are relatively mature) and extending to various types of neural interfaces (currently under research). Speech lies somewhere on this spectrum, with a lot of promise but still not much to show for it. Setting accessibility aspects aside, I think speech is mature enough to be used for drafting ideas and taking notes, though maybe not so much for structured writing like programming or final versions of most prose.

We have transcription tools that can work as dictation interfaces, but they inevitably make mistakes and mishandle speech disfluencies, and fixing these offline takes a lot of additional time. I wanted to explore augmenting transcription tools with LLMs to enable real-time edits. If done well, this should feel as if you are talking to a human writer.

I had this idea for some time but kept delaying it for many months. I finally stopped thinking about it altogether when I saw Aqua Voice launch. Recently I got some free time, so I attempted it, since I needed a dictation tool for Emacs, which is where I do all of my writing.

1. esi-dictate.el

I wrote a small Emacs package here that adds a minor mode in which spoken words enter your buffer at an independent voice cursor [1] and the entered text is corrected in real time based on your [voice] instructions. Here are a few screencasts:

Figure 1: [With audio] The ⤳ symbol shows the voice cursor. Underlined text is used as voice context for the LLM. I can move around with my text cursor to make fixes if needed. The LLM makes general fixes as well as following explicit instructions.

Figure 2: [No audio] You can set the mark to move the position of the voice cursor. This also resets the voice context.

Figure 3: [No audio] Extending the idea of the mark, setting a region changes the voice context. In this example I try translation, which is not something the LLM is instructed to do, hence the mess.
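
For the curious, one standard way to render such an independent cursor in Emacs is a zero-width overlay whose after-string displays a marker glyph. The sketch below shows that technique; the my-voice-* names are hypothetical and the actual package may implement this differently.

  ;; Sketch: rendering an independent "voice cursor" with an overlay.
  ;; All `my-voice-*' names are illustrative, not the package's API.
  (defvar my-voice-cursor-overlay nil
    "Overlay marking where dictated text will be inserted.")

  (defun my-voice-cursor-move (pos)
    "Show the voice cursor at POS in the current buffer."
    (unless (overlayp my-voice-cursor-overlay)
      (setq my-voice-cursor-overlay (make-overlay pos pos)))
    (move-overlay my-voice-cursor-overlay pos pos)
    (overlay-put my-voice-cursor-overlay 'after-string
                 (propertize "⤳" 'face 'shadow)))

  (defun my-voice-insert (text)
    "Insert dictated TEXT at the voice cursor, independent of point."
    (let ((pos (overlay-start my-voice-cursor-overlay)))
      (save-excursion
        (goto-char pos)
        (insert text))
      ;; Keep the voice cursor positioned after the newly inserted text.
      (my-voice-cursor-move (+ pos (length text)))))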

Earlier I experimented with an explicit command mode where you had to press a hotkey to indicate that you were going to make a correction from that moment onward. But this became very annoying to use, and the LLM (gpt-4o-mini) was generally good enough at detecting what is an instruction and what is not.
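
To give a sense of how a single call can cover both cases, here is one way the correction request could be shaped: the LLM gets the text written so far plus the latest utterance and decides whether to apply it as an instruction or append it as cleaned-up dictation. The prompt wording and the my-dictate-* name below are my assumptions, not necessarily what esi-dictate.el actually sends.

  ;; Sketch: one possible prompt shape for the correction call.
  ;; UTTERANCE may be plain dictation or a spoken editing instruction;
  ;; the model decides which and returns only the revised text.
  (defun my-dictate-llm-messages (context utterance)
    "Build chat messages asking the LLM to revise CONTEXT given UTTERANCE."
    `[((role . "system")
       (content . "You are a dictation assistant. You receive the text written so far and the latest utterance. If the utterance is an editing instruction, apply it to the text; otherwise clean it up (punctuation, disfluencies) and append it. Reply with the full revised text only."))
      ((role . "user")
       (content . ,(format "Text so far:\n%s\n\nLatest utterance:\n%s"
                           context utterance)))])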

2. Future Improvements

From a UX perspective [2], I think this flow is good enough for me, but I suspect that I am not at the sweet spot yet. A major gain at this point will come from latency improvements in both the ASR and the LLM. Some of that is achievable by making good use of asynchronous calls, caching, etc.
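
As an illustration of the asynchronous part, Emacs's built-in url.el can fire the completion request without blocking editing. Below is a minimal sketch of such a non-blocking call, assuming the message format from the earlier sketch and an OPENAI_API_KEY environment variable; it omits error handling and streaming.

  ;; Sketch: a non-blocking chat-completion call via url.el, so editing
  ;; is never blocked while waiting for the LLM. Illustrative only.
  (require 'url)
  (require 'json)

  (defun my-dictate-llm-call (messages callback)
    "POST MESSAGES to the chat completions API, then run CALLBACK with the reply."
    (let ((url-request-method "POST")
          (url-request-extra-headers
           `(("Content-Type" . "application/json")
             ("Authorization" . ,(concat "Bearer " (getenv "OPENAI_API_KEY")))))
          (url-request-data
           (encode-coding-string
            (json-encode `((model . "gpt-4o-mini") (messages . ,messages)))
            'utf-8)))
      (url-retrieve
       "https://api.openai.com/v1/chat/completions"
       (lambda (_status)
         ;; Skip the HTTP headers, then parse the JSON body.
         (goto-char (point-min))
         (re-search-forward "^\r?\n")
         (let* ((response (json-parse-buffer :object-type 'alist))
                (reply (alist-get 'content
                                  (alist-get 'message
                                             (aref (alist-get 'choices response) 0)))))
           (funcall callback reply)))
       nil t)))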

Furthermore, while the code for this package is open source, it's still not nice FOSS since it depends on Deepgram's and OpenAI's services, both of which I want to swap out for lightweight on-device or self-hosted alternatives. There are also a few minor bugs that pop up when the voice and text cursors' actions conflict. I will resolve these in due time.

Footnotes:

[1] Thanks to Nemo for suggesting multiple cursors.

[2] Thanks to Unnu for helping me refine this.