mu4e-snooze and adventures with AI co‑programming

An update on this using the newer o1-preview model is here.

From around February-March this year, I have been consciously trying to let go of my reservations about modern productivity tools. For example, I switched from Gmail web to Shortwave for managing work email, from Org Mode to Todoist for task management, from a paper planner to Sunsama, etc. A few of these worked well and a few didn't. Many that worked had a simple one-feature difference from the older method.

Specifically for email, I realized that the one-feature difference was the ability to snooze an email. I understand that snoozing is available in most tools now, but it was Shortwave that made me use this feature properly. Knowing that I can follow the Shortwave method anywhere I have a snooze feature, I decided to implement snoozing in my favorite email tool, mu4e.

Since I was on a break last week¹, I took more time to work on the project with GPT4's help, to gain more insight into AI co-programming as a technique and into the current capabilities of GPT4 itself. This post notes down a few observations from that experiment.

Overall, there is nothing new here that hasn't been discussed already in the community. But nevertheless it was fun to understand the current state and build insights on my own.

1. Email Snoozing

Before starting, let's understand what email snoozing means and how we can design a system to do this.

Snoozing refers to the action of moving an email out of the inbox with a commitment on the date and time when it will appear back in the inbox. This is helpful for moving things out of sight that you can't attend to right now anyway. It maps to the defer action in the inbox-zero vocabulary. This is not a feature that you get out of the box in email servers, or in clients for that matter.

1.1. Mu4e and the Setup

Email setups based on mu4e are made up of three components:

A Maildir syncing setup using, say, OfflineIMAP. This maps your remote email directories to your local filesystem in the Maildir format.
A CLI tool, mu, for indexing and searching over a Maildir.
An elisp package (mu4e) to wrap around mu so that you can do email inside Emacs.
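
To make the first component concrete, here is a minimal sketch of what a Maildir looks like on disk, using Python's standard `mailbox` module rather than OfflineIMAP or mu. The folder name `INBOX` and the helper names `make_demo_maildir` and `layout` are assumptions made for illustration only. The point is that every email is a single file inside `new`, `cur`, or `tmp`, which is why moving a message between folders is just a file operation.

```python
import mailbox
import os
from email.message import EmailMessage


def make_demo_maildir(root):
    """Create a Maildir named INBOX under root and deliver one message to it."""
    path = os.path.join(root, "INBOX")   # hypothetical folder name
    box = mailbox.Maildir(path)          # creates cur/, new/ and tmp/ if missing
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg.set_content("hello")
    key = box.add(msg)                   # freshly delivered mail lands in new/
    return path, key


def layout(path):
    """Return the file names found in each of the three Maildir subdirectories."""
    return {
        sub: sorted(os.listdir(os.path.join(path, sub)))
        for sub in ("cur", "new", "tmp")
    }
```

Calling `layout` on the path returned by `make_demo_maildir` shows one file in `new` and nothing in `cur` or `tmp`.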

Building a snoozing feature in this setup is an interesting challenge for an AI for the following reasons, all more or less related to familiarity and generalization:

This feature is not that common on public code repositories. I got 264k repository hits when searching for 'todo app' but 16 when searching for 'email snooze' on GitHub.
Emacs Lisp is not a common language for programming. This might pose challenges to the AI² on the language syntax front.
Maildir + Mu4e is a relatively niche setup as far as desktop email clients go. This means that the AI might not be fully aware of structural aspects like how data is organized here.

1.2. Design

Knowing a few things about how email works and about my own mail setup, I was able to come up with a first-pass design for this system like the following:

There will be a new label where all snoozed emails from the inbox will go. This is how a lot of other wrappers over Gmail used to implement snoozing.
I will add an mu4e action that asks me for a date and time and moves marked emails to the snoozed label. The date and time will be stored somewhere in a database. A little later I made this information go in X-headers³.
An Emacs timer will go through the emails in the snoozed label every few minutes and move the expired items to the inbox.

Generally we are talking about 3 abstractions:

A staging area for moving snoozed emails.
Tracking of date and time information for each snoozed email.
A timer setup that runs through the snoozed emails and moves them to the inbox if they have gone beyond the planned time.
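
The three abstractions above can be sketched in a few lines. This is not the mu4e-snooze code (which is Emacs Lisp and runs on an Emacs timer); it is a rough Python illustration using the standard `mailbox` module. The header name `X-Snooze-Until`, the ISO timestamp format, and the function names `snooze` and `wake_expired` are all assumptions made for this example, and the two Maildir objects stand in for the inbox and the snoozed label.

```python
import mailbox
import os
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage

SNOOZE_HEADER = "X-Snooze-Until"  # hypothetical header name, not from the post


def snooze(inbox, snoozed, key, until):
    """Move one message to the staging folder and record when it should return."""
    msg = inbox[key]
    del msg[SNOOZE_HEADER]                  # no-op if the header is absent
    msg[SNOOZE_HEADER] = until.isoformat()  # tracking lives in the message itself
    new_key = snoozed.add(msg)
    inbox.remove(key)
    return new_key


def wake_expired(inbox, snoozed, now):
    """One timer tick: move messages whose wake-up time has passed back to the inbox."""
    woken = []
    for key in list(snoozed.keys()):
        msg = snoozed[key]
        raw = msg[SNOOZE_HEADER]
        if raw is None:
            continue                        # not snoozed by this scheme, leave it alone
        if datetime.fromisoformat(raw) <= now:
            del msg[SNOOZE_HEADER]
            woken.append(inbox.add(msg))
            snoozed.remove(key)
    return woken
```

In a real setup `wake_expired` would be called every few minutes, and the mu index would need refreshing afterwards (for example with `mu index`) so that mu4e sees the moved files. Storing the time in a header avoids a separate database, at the cost noted in the footnote: the header change stays on the local machine.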

2. GPT4

Keeping in mind the rough design mentioned above, my initial expectation was that GPT4 should be good enough to handle the project end to end. It could come up with different concrete choices for the abstractions mentioned above, but generally it should get the job done without much intervention.

I don't know why I had such high expectations, even though I actively believed this problem might be challenging for GPT4.

But as I started working on this, I got disillusioned. Here are a few observations, in no specific order:

There were many confident mistakes. If you poke holes, you will get an apology and a slightly improved solution. But then you will have to know the system to really be able to trust GPT4 output.
The output code, and the notes along with it, were very hand-wavy: "Do this and that and done, simple." It was almost as if the end goal was not to actually run the code but to provide some explanation or pseudo-code to the seeker. Inevitably this was missing a whole lot of details that were critical for the code to run.
Unless it's totally autonomous (we are not there yet), you will have to dive into the solution specifications in addition to the problem specifications. And writing a solution specification is difficult. I had to read the written code multiple times to finally say, "Hey, I want this and not that because xyz". I guess this gets easier as you co-program for a while.
Probably a minor point, but there was a background anxiety around seeing code close enough to mine to have stylistic annoyances, but without the right kind of control to do anything about them. I am assuming this will improve drastically in the future. We probably just don't have the right language for conveying the 'how' right now. In the end the 'how' won't even matter, but for the time being it is still an important area, at least for me personally.
Finally, I had to write the program on my own anyway, and that was the biggest disappointment. I could see places where small portions of the problem were done well, but together they did not solve the problem completely.

Though in terms of the 3 challenging aspects of this problem mentioned earlier, I believe Emacs Lisp didn't pose any challenge. There were no stupid language-related mistakes, even though there were many semantic mistakes around how a few sub-problems could be solved.

You can see my interaction log in the Appendix.

3. Patterns

In spite of the annoying experience, there were many patterns that seemed promising. Almost all are some form of cognitive heuristics working on a layer above the core reasoning system. A few are:

Keep asking questions around the implementation, ask it to explain, ask it to 'make things better or correct'. These don't need any specific details but work well. Approaches like Chain of Thought fall in a similar category.
Spend time setting boundaries by asking it to write tests. LLMs can do a decent job of self-criticism. The same model with a different context can evaluate its own output well, unlike in the classical ML setting. In a way this is obvious and expected with in-context learning, as you should probably look at the same LLM with different contexts as different classical ML models.
Coupled with the previous point, a lot of issues seem fixable with access to a runtime system. I believe products geared towards proper programming assistance (GPT4 via ChatGPT is obviously not one) would utilize this a lot.

4. Future

Let's keep analysis (code reviews, static analysis, testing, etc.) and assistance (Copilot and other similar current-generation IDE features) out of the picture. In terms of pure automation, I believe there are problems where, even right now, you can get automation to the level of subsumption, where you don't need to worry about the underlying substrate, the code, at all. For example, an interface for plotting data in common ways is possible with an AI writing throwaway matplotlib code on every prompt⁴. But to get to a similar level for a lot of what we do in programming, there is still much to be done.

An interesting question I have is around the future of the cognitive heuristics mentioned earlier. Will they remain part of the architecture over LLMs or will they be subsumed inside? I believe it will be the latter, but I would like to understand the shelf life of such things in practical products for the next, say, 2-3 years.

It will also be interesting to deconstruct the abilities needed to make a 'todo app' vs a 'snooze feature', which can help define what to solve. A major part of this will be familiarity in training data, like with humans. But in the end I don't want an AI system that does better at creating shopping websites just because they are more common on GitHub. So this is not enough. I am pretty sure there is some study here that I need to find, which might be guiding the next upgrades in LLMs that can program.

This post is a relatively shallow evaluation. Folks working on programming LLMs can give you much better insights, and I am not one of them, yet. But I can make a few general statements: Working with GPT4 doesn't feel like a Product Manager working with an Engineer. Neither does it feel like an Engineer working with another. Nor like an Engineer working with a Junior Engineer. In reality it's more like:

an Engineer working with a Junior Engineer who is asked to write code and send results without reflection and without access to a runtime, but with access to Stack Overflow.

I am very confident that this would start to feel like the PM working with a Good Engineer pattern soon, but there are a few more steps to get there.

5. Appendix

5.1. GPT4 interactions

Here are my interactions with GPT4 during the process of development. If you can't see the embed, click here to go to the GitHub URL.

Footnotes:

¹ I've been sitting on this post for a long time. I was on a break sometime around the end of April 2023, not last week.

² I realize that I have started to use 'AI' a lot more now. I have not given deeper thought to whether that signals a downturn in my life.

³ This doesn't solve the problem completely, as tools like OfflineIMAP don't sync changes to X-headers. So you will be tied to the same machine for this to work.

⁴ You can fight with me on the kind of plots you can create here and I know I will lose, but the point stays that there are workflows that have been lifted to the level of a non-programmer's direct consumption.