One small leap for machine kind

There is a lot of hype around what language models can supposedly do… But are we collectively deluding ourselves about their capabilities? If these models are scoring near “human level” on benchmark tasks, why am I still stuck working as a software engineer instead of living my dream as a stay-at-home dog mom?

If you are not familiar with all the existing AI benchmarks, don’t worry, you’re not alone. There seem to be more cropping up every day with the sole purpose of comparing world-class models against each other on an equal playing field. You can think of it like the LLM Olympics, minus the roaring stadiums, E. coli-infested waters, and Snoop Dogg.

A well-known example is the Hugging Face LLM Leaderboard. While these benchmarks indicate relative model performance, they do not accurately reflect real-world performance. How well will models do when the answers are not multiple choice and require more reasoning than memorization?

In my experience, not great! ☹

The Cheeseburger Conundrum

I started exploring how well these state-of-the-art (SOTA) foundation models perform on coding challenges. The first problem I tried was a softball: an 8-point practice-round question from the 2023 Meta coding competition called “The Cheeseburger Corollary 1”. GPT-4o solved it on the first attempt, no issue.

This problem is straightforward: the goal is to determine whether a cheeseburger can be made given the ingredient parameters. No tricks… just cheeseburgers. 🍔
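If I remember the setup correctly, each single comes with one patty and two buns, each double with two patties and two buns, and a K-decker needs K patties and K+1 buns, so the whole thing reduces to a two-line counting check. Here is a rough sketch based on my recollection of the constraints, not the official reference solution:

```python
def can_build_k_decker(s: int, d: int, k: int) -> bool:
    # Singles contribute 1 patty and 2 buns; doubles contribute 2 patties and 2 buns.
    # A K-decker needs K patties and K + 1 buns (constraints recalled from memory).
    patties = s + 2 * d
    buns = 2 * (s + d)
    return patties >= k and buns >= k + 1
```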

The Dim Sum Disaster

Okay, time to move on to something slightly harder: the Dim Sum Delivery problem (a 12-point question).

Background on this problem: two classic characters (Alice and Bob) take turns pushing a cart along a grid, with constraints on how they can move. Alice can push the cart between 1 and A rows, Bob can push it between 1 and B columns, and Alice always goes first. Given a grid configuration (Rows x Columns) and the constraints A and B, are there conditions that guarantee Alice will win the game (i.e., be the one who pushes the cart to the customer)?

Unlike the previous problem, here you must understand the “trick”. Walk through a few examples like the one above (R=3, C=3, A=2, B=1) and you might notice a pattern: if both players play to win, Alice wins when Rows > Columns and Bob wins when Columns >= Rows. But what does GPT-4o do? It fails to see this, largely for a couple of reasons (there is a small brute-force check of the pattern after this list):

  1. Alice can move between 1 and A rows per turn, but the model assumes she should always move exactly A, even when that is not the optimal move for her. The same holds true for Bob.
  2. It consistently gravitates toward familiar solution templates, in this case modular arithmetic.
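If you want to convince yourself of the pattern, a small brute-force game search does the trick. The sketch below encodes my reading of the rules (the player who pushes the cart onto the customer’s cell wins, and a player with no legal push loses), so treat it as an illustration rather than the official judge:

```python
from functools import lru_cache

def alice_wins(R: int, C: int, A: int, B: int) -> bool:
    """True if Alice can force the delivery, under the rule reading described above."""

    @lru_cache(maxsize=None)
    def wins(r: int, c: int, alices_turn: bool) -> bool:
        # Enumerate the current player's legal pushes (the cart can never leave the grid).
        if alices_turn:
            moves = [(r + k, c) for k in range(1, A + 1) if r + k <= R]
        else:
            moves = [(r, c + k) for k in range(1, B + 1) if c + k <= C]
        if not moves:
            return False  # no legal push: the current player loses
        for nr, nc in moves:
            if (nr, nc) == (R, C):
                return True  # this push delivers the cart to the customer
            if not wins(nr, nc, not alices_turn):
                return True  # leaves the opponent in a losing position
        return False

    return wins(1, 1, True)

# Spot-check the claimed pattern on small grids: Alice wins exactly when R > C,
# regardless of the A and B limits.
for R in range(1, 7):
    for C in range(1, 7):
        for A in range(1, 4):
            for B in range(1, 4):
                assert alice_wins(R, C, A, B) == (R > C)
```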

Claude 3.5 with zero-shot prompting does not do much better.

How do we improve these results?

In the real world, iteration is key, so maybe an agent framework will help. I experimented with AutoGen to add roles and create a Mixture of Experts-style approach. Hopefully we can get more diverse solutions!
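For reference, here is a minimal sketch of the kind of role-based group chat you can wire up in AutoGen; the agent names, prompts, and model config below are illustrative placeholders rather than my exact setup:

```python
import autogen

# Placeholder LLM config; in practice this comes from an OAI_CONFIG_LIST file or env vars.
llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_API_KEY"}]}

# Two "experts" with different roles, plus a proxy agent that can actually run code.
game_theorist = autogen.AssistantAgent(
    name="GameTheorist",
    system_message="Analyze the problem as a combinatorial game before any code is written.",
    llm_config=llm_config,
)
coder = autogen.AssistantAgent(
    name="Coder",
    system_message="Turn the agreed-upon approach into a runnable Python solution.",
    llm_config=llm_config,
)
executor = autogen.UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "scratch", "use_docker": False},
)

group_chat = autogen.GroupChat(
    agents=[executor, game_theorist, coder], messages=[], max_round=12
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

# Kick off the conversation with the problem statement.
executor.initiate_chat(manager, message="<paste the Dim Sum Delivery problem statement here>")
```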

With this setup, the agents produced a wider variety of attempted solutions, including even/odd parity analysis, dynamic programming, Nim-sums, and Sprague-Grundy theory, but sadly they still did not arrive at the correct answer. How do we get these models to actually reason?

This is a new frontier: there are not many agent benchmarks, and the only one I know of in the coding arena is SWE-bench, which tests whether language models can resolve real GitHub issues drawn from a dataset of 2,294 software engineering problems. The results there are not much better; the current top score is ~22%.

This is a tough problem and a worthy adversary for both AI enthusiasts and the academic community. Solving tricky problems like these coding challenges may require a combination of techniques: fine-tuning, multi-agent systems, tool use, clever prompting, or maybe something else entirely. It may even call for a unique way of representing the data, or a different type of model altogether. 🤷

This year NeurIPS is introducing the HackerCup AI competition, an AI track within the Meta Hacker Cup, which has historically been an all-human competition (prepare to get crushed 🤖). It should be exciting to see the performance gap between humans and machines!

But in any case, I am invested in seeing what these agents can do. With only a week until the practice round, I am in full-on experimental mode, so stay tuned for updates. And if you’re interested in competing or just playing around with LLMs and agents, check out the starter-kit that I helped create here. Best of luck ❤️

References:

  1. Huggingface Leaderboard
  2. Cheeseburger Corollary
  3. Dim Sum Delivery
  4. AutoGen framework
  5. SWE Bench
  6. HackerCupAI