Earlier I tried developing an email snoozing feature for `mu4e` using `gpt4`. I did this using the ChatGPT UI, without any programming-specific augmentation. You can see my observations from that time in this old blog post; in fact, I would recommend reading that to get context about the problem. Now, after the `o1-preview` launch, I wanted to run the same task again, almost one year later, with this newer model and see what changed.

In the main, I can see a healthier amount of intra-snippet cohesiveness in the code output and explanations (along with whatever is exposed from the Chain of Thought), which tells me that the model's reasoning is much better than `gpt4`'s. I still wasn't able to get a fully working package, but the missing pieces involve knowledge that's outside the model, in my opinion. Here is the log of my conversations with the model.
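To make the task concrete: "snoozing" here means hiding an email and having it resurface in the inbox at a chosen later time. Below is a minimal sketch of what such a command could look like; this is my own illustration, not model output. `my-mu4e-snooze` is a hypothetical name, the refiling and persistence parts are deliberately left out, and the sketch only relies on `mu4e`'s documented message accessors (`mu4e-message-at-point`, `mu4e-message-field`) plus Emacs' built-in `run-at-time` timer.

```emacs-lisp
;; -*- lexical-binding: t; -*-  ; needed so the lambda captures `id'
(require 'mu4e)

;; A minimal, in-memory sketch: schedule a reminder for the message at
;; point. A real snooze would also refile the message out of the inbox
;; and persist its wake-up time across Emacs restarts.
(defun my-mu4e-snooze (duration)
  "Snooze the message at point for DURATION (e.g. \"2 hours\")."
  (interactive "sSnooze for (e.g. \"2 hours\"): ")
  (let* ((msg (mu4e-message-at-point))
         (id (mu4e-message-field msg :message-id)))
    (run-at-time duration nil
                 (lambda () (message "Snoozed mail %s is due again" id)))))
```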
1. Observations
- `o1` still makes confident mistakes. I had to point out non-existent functions a few times which it was hallucinating pretty confidently. I believe that, for niche code bases, expecting the system to work without assistance is wrong. Once tool calling is supported, I would expect the thinking step of `o1` to realize the lack of knowledge and hunt for `mu4e` documentation before emitting output.
- While the previous model was decent at high-level design of the solution, its actual code was very hand-wavy. `o1` gives much more concrete code that seems to run, leaving a few hallucinated function names aside.
- Connected to the previous point, `gpt4` needed a lot of guidance for solution specification. `o1`, on the other hand, seems to think thoroughly about the solution and covers most edge cases. For example, it tells me to refresh the `mu4e` display if the updated email data is not shown correctly (see the sketch after this list). This was something I had to explicitly tell `gpt4` about and ask it to handle.
- The code snippets, functions, etc. gelled well together and seemed written with a cohesive vision instead of being copy-pastes across snippets and responses like with `gpt4`.
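For reference, the refresh step mentioned above boils down to something like the snippet below. This is my paraphrase of the idea rather than `o1`'s verbatim output; it assumes mu4e's standard `mu4e-headers-rerun-search` command, the `mu4e-index-updated-hook` hook, and the default `*mu4e-headers*` buffer name.

```emacs-lisp
(require 'mu4e)

;; Indexing runs asynchronously, so refresh the headers view once the
;; index update finishes; `mu4e-index-updated-hook' fires at that point.
(defun my-mu4e-refresh-display ()
  "Re-run the current headers search so external changes become visible."
  ;; Assumes the default headers buffer name used by mu4e.
  (when (get-buffer "*mu4e-headers*")
    (with-current-buffer "*mu4e-headers*"
      (mu4e-headers-rerun-search))))

(add-hook 'mu4e-index-updated-hook #'my-mu4e-refresh-display)
```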
2. Cognitive Heuristics
The main change in `o1` is the subsumption of what I have been calling cognitive heuristics. Things like Chain of Thought (CoT) prompting, tricks like constantly asking "make this better", etc. come under this category. From my last post:

> An interesting question I have is around the future of cognitive heuristics mentioned earlier. Will they remain part of architecture over LLMs or will they be subsumed inside? I believe it will be the latter.
Seems like this proved to be true (maybe earlier than I predicted) with RL-guided CoT in `o1`. Once this technology is iterated upon and perfected, good access to runtime systems and documentation might be all that's needed to solve many programming problems unguided. Since the problem that I picked up (while rare in representation) is not representative of the large, complex software that we have in the wild, I would still reserve my judgement until I try actual AI-programming products (and not just the ChatGPT UI). In any case, here is my updated response to "How does it feel to work with `<model-name>`?" for `o1`:

> an Engineer working with a Junior(?) Engineer who is asked to write code and send results without reflection, without access to a runtime or documentation.