ml emacs programming productivity 100daystooffload llm

Earlier, I tried developing an email snoozing feature for mu4e using GPT-4. I did this in the ChatGPT UI, without any programming-specific augmentation. You can see my observations from that time in this old blog post; in fact, I would recommend reading it first for context about the problem. Now, after the o1-preview launch, I wanted to run the same task, almost one year later, through this newer model and see what changed.

Overall, I can see a healthier amount of cohesiveness across snippets in the code output and explanations (along with whatever is exposed of the Chain of Thought), which tells me that the model's reasoning is much better than GPT-4's. I still wasn't able to get a fully working package, but the missing pieces involve knowledge that, in my opinion, lies outside the model. Here is the log of my conversations with the model.
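For context, the feature boils down to this: move a message out of the inbox, then bring it back at a chosen time. Below is a minimal sketch of that shape. It is my own illustration, not the model's output; `my/mu4e-move-message`, `my/mu4e-snooze-maildir`, and the "/inbox" path are hypothetical placeholders, since mu4e's actual message-moving calls are internal and vary across versions.

```elisp
;; -*- lexical-binding: t; -*-  (needed for the closure over `docid')
(require 'mu4e)

(defvar my/mu4e-snooze-maildir "/snoozed"
  "Maildir where snoozed messages wait.")

(defun my/mu4e-move-message (docid maildir)
  "Hypothetical placeholder: move message DOCID to MAILDIR.
A real version would wrap mu4e's internal, version-specific move call."
  (ignore docid maildir))

(defun my/mu4e-snooze-message (minutes)
  "Snooze the message at point for MINUTES, then return it to the inbox."
  (interactive "nSnooze for how many minutes? ")
  (let ((docid (mu4e-message-field (mu4e-message-at-point) :docid)))
    (my/mu4e-move-message docid my/mu4e-snooze-maildir)
    ;; Schedule the wake-up.  A real version would persist this across
    ;; Emacs restarts instead of relying on an in-memory timer.
    (run-at-time (* minutes 60) nil
                 (lambda () (my/mu4e-move-message docid "/inbox")))))
```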

1. Observations

  1. o1 still makes confident mistakes. I had to point out non-existent functions a few times, which it had hallucinated quite confidently. For niche codebases, I believe it is wrong to expect the system to work without assistance. Once tool calling is supported, I would expect o1's thinking step to realize its lack of knowledge and hunt through the mu4e documentation before emitting output.
  2. While the previous model was decent at the high-level design of the solution, its actual code was very hand-wavy. o1 gives much more concrete code that, a few hallucinated function names aside, actually seems to run.
  3. Connected to the previous point, GPT-4 needed a lot of guidance on the solution specification. o1, on the other hand, seems to think through the solution thoroughly and covers most edge cases. For example, it told me to refresh the mu4e display if the updated email data is not shown correctly (see the sketch after this list). This was something I had to explicitly tell GPT-4 about and ask it to handle.
  4. The code snippets, functions, etc. gelled well together and seemed written with a cohesive vision, rather than as copy-pastes across snippets and responses, as was the case with GPT-4.
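Here is a sketch of the refresh step from point 3, paraphrased from memory rather than quoted from o1's output. Both mu4e functions used here are real, though note that the index update runs asynchronously, so a robust version would trigger the redraw only after indexing finishes.

```elisp
(require 'mu4e)

(defun my/mu4e-refresh-view ()
  "Re-index mail and redraw the current mu4e headers buffer."
  (interactive)
  (mu4e-update-index)                       ; ask mu to re-index the maildir
  (when (derived-mode-p 'mu4e-headers-mode) ; only redraw in a headers buffer
    (mu4e-headers-rerun-search)))           ; re-run the last search
```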

2. Cognitive Heuristics

The main change in o1 is the subsumption of what I have been calling cognitive heuristics. Things like Chain of Thought (CoT) prompting, tricks like repeatedly asking the model to 'make this better', etc. come under this category. From my last post:

An interesting question I have is around the future of the cognitive heuristics mentioned earlier. Will they remain part of architectures built over LLMs, or will they be subsumed inside? I believe it will be the latter.

It seems this proved true (maybe earlier than I predicted) with the RL-guided CoT in o1. Once this technology is iterated upon and perfected, good access to runtime systems and documentation might be all that's needed to solve many programming problems unguided. Since the problem I picked, while rare in its training-data representation, is still not representative of the large, complex software we have in the wild, I would reserve my judgment until I try this on actual AI-programming products (and not just the ChatGPT UI). In any case, here is my updated answer to "How does it feel to work with <model-name>?" for o1:

an Engineer working with a Junior(?) Engineer who is asked to write code and send back results without reflection, all without access to a runtime or documentation.