ARC-AGI-2 is a pattern-recognition benchmark and competition for machine learning systems.

The basic idea is to test whether current state-of-the-art models can understand and apply advanced patterns. The benchmark is unusual in that humans find it easy to score extremely high, while current models find it very difficult to perform at any acceptable level.

data #

The dataset consists of square grids divided into sub-squares (cells), with a visual cue that determines the colors, the new positions, or some other kind of visual transformation.

[Image: the base ARC-AGI example shown on the arc-agi website; the colors correspond to the number of holes.]

This information is served as JSON, so agents need to be able to parse JSON in order to ‘visualise’ the patterns, either by tokenizing and breaking down the raw data, or by parsing the JSON with a hard-coded method.

A typical JSON task object looks like this:

{
  "train": [
    {
      "input": [
        [7, 9],
        [4, 3]
      ],
      "output": [
        [7, 9, 7, 9, 7, 9],
        [4, 3, 4, 3, 4, 3],
        [9, 7, 9, 7, 9, 7],
        [3, 4, 3, 4, 3, 4],
        [7, 9, 7, 9, 7, 9],
        [4, 3, 4, 3, 4, 3]
      ]
    },
    {
      "input": [
        [8, 6],
        [6, 4]
      ],
      "output": [
        [8, 6, 8, 6, 8, 6],
        [6, 4, 6, 4, 6, 4],
        [6, 8, 6, 8, 6, 8],
        [4, 6, 4, 6, 4, 6],
        [8, 6, 8, 6, 8, 6],
        [6, 4, 6, 4, 6, 4]
      ]
    }
  ],
  "test": [
    {
      "input": [
        [3, 2],
        [7, 8]
      ],
      "output": [
        [3, 2, 3, 2, 3, 2],
        [7, 8, 7, 8, 7, 8],
        [2, 3, 2, 3, 2, 3],
        [8, 7, 8, 7, 8, 7],
        [3, 2, 3, 2, 3, 2],
        [7, 8, 7, 8, 7, 8]
      ]
    }
  ]
}
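
As a concrete sketch of the hard-coded parsing route mentioned above, here is a few lines of Python that load such a task and render each grid as rows of digits. The filename task.json is an assumption made for this example:

import json

def render(grid):
    # Turn a grid (list of lists of ints) into one string, one row per line.
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

with open("task.json") as f:  # assumed path for this sketch
    task = json.load(f)

for pair in task["train"]:
    print("input:")
    print(render(pair["input"]))
    print("output:")
    print(render(pair["output"]))
    print()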

In this example, the rule is to tile the 2×2 input three times in each direction, mirroring it left-to-right in every second band of rows.
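
A minimal Python sketch of that rule, based on my reading of the examples rather than any reference solution:

def solve(grid):
    # Tile the input 3x3; every second band of tiles is mirrored left-to-right.
    mirrored = [row[::-1] for row in grid]
    bands = [grid, mirrored, grid]
    return [row * 3 for band in bands for row in band]

Called on the test input [[3, 2], [7, 8]], this reproduces the expected 6×6 output shown above.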

current approaches #

Most approaches rely on LLMs; the best scores are achieved by state-of-the-art reasoning models such as OpenAI’s o3, at around 3%. These try to zero-shot the problem: they see the puzzle, may reason about it (depending on the model), and then immediately propose a solution for the entire grid.
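
In rough pseudocode, that zero-shot setup reduces to a single prompt-and-parse round trip. Here complete is a hypothetical placeholder for whatever model API is being called, not a real client:

import json

def zero_shot_solve(task, complete):
    # `complete` is a hypothetical prompt -> text function wrapping a model API.
    prompt = (
        "Each example maps an input grid to an output grid.\n"
        + "\n".join(
            f"input: {pair['input']}\noutput: {pair['output']}"
            for pair in task["train"]
        )
        + f"\n\nGive only the output grid as JSON for:\ninput: {task['test'][0]['input']}"
    )
    return json.loads(complete(prompt))  # one attempt, no feedback loop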

It is my belief that this approach is flawed. LLMs are Markov chains, frozen at inference time; they cannot adjust their trajectories through the embedding space, nor is there any actual decision making.

Even “reasoning” models, whose trajectories have been fine-tuned via RL and which can self-prompt to cover more of the embedding space as they autoregress, are still unable to reason.