Scientists create new AI model that surpasses ChatGPT in major AGI benchmark tests

Scientists appear to be moving quickly to build artificial intelligence models that mimic the human brain's approach to reasoning. According to reports, a new AI model can perform high-level reasoning, unlike widely used large language models (LLMs) such as ChatGPT, and its creators report improved performance on major benchmarks.

Researchers at Singapore-based AI firm Sapient have christened the new reasoning AI the hierarchical reasoning model (HRM), and it is said to be modelled on the hierarchical, multi-timescale processing the human brain uses: the way different brain regions integrate information over periods ranging from milliseconds to minutes.

According to the scientists, the new reasoning model outperforms current LLMs and runs more efficiently, which they attribute to it requiring far fewer parameters and training samples. The HRM, they assert, uses only 27 million parameters and was trained on just 1,000 samples. Parameters are the variables an AI model learns during training, such as weights and biases; most sophisticated LLMs have billions or even trillions of them. So how does it fare?

When the HRM was tested on the ARC-AGI benchmark, known as one of the toughest tests of how close models are to attaining artificial general intelligence, it showed remarkable results, according to the study.

The model scored 40.3 per cent on ARC-AGI-1, against 34.5 per cent for OpenAI's o3-mini-high, 21.2 per cent for Anthropic's Claude 3.7 and 15.8 per cent for DeepSeek R1. Likewise, HRM came out ahead on the harder ARC-AGI-2 test with a score of 5 per cent, leaving the other models far behind. While most current advanced LLMs rely on chain-of-thought (CoT) reasoning, researchers at Sapient argued that the technique has major limitations, namely 'brittle task decomposition, extensive data requirements, and high latency.' HRM, by contrast, performs sequential reasoning within a single forward pass rather than step by step. It consists of two modules: a high-level module that handles slow, abstract planning and a low-level module that carries out fast, precise computations.

This mirrors the way different parts of the human brain handle planning versus rapid response. HRM also uses a process called iterative refinement: it begins with an approximate solution and improves it over several short bursts of thinking. After every burst, it checks whether it should keep refining or whether the current result is good enough to serve as the final answer. According to the scientists, HRM solved Sudoku puzzles that typical LLMs are incapable of solving.
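The control flow described here, a fast low-level module running in short bursts while a slow high-level module decides after each burst whether to halt, can be illustrated with a toy sketch. This is not Sapient's actual architecture: the function names are invented, and a simple Newton iteration stands in for the low-level module purely to give the loop something concrete to refine.

```python
def low_level_step(x, target):
    # Stand-in for the fast, precise low-level module:
    # one Newton update toward solving x * x = target.
    return 0.5 * (x + target / x)

def high_level_halt(x, target, tol):
    # Stand-in for the slow, abstract high-level module:
    # after a burst, decide whether the result is good enough.
    return abs(x * x - target) < tol

def hierarchical_refine(target, x0=1.0, tol=1e-9,
                        max_bursts=10, steps_per_burst=3):
    """Iterative refinement: start from a rough guess and improve
    it in short bursts, checking a halting condition after each."""
    x = x0
    for _ in range(max_bursts):
        for _ in range(steps_per_burst):  # burst of fast computation
            x = low_level_step(x, target)
        if high_level_halt(x, target, tol):  # halt check after burst
            break
    return x
```

The point of the sketch is the nesting: the inner loop does many cheap updates per outer step, and the outer loop only intervenes between bursts, which loosely parallels the multi-timescale processing the researchers cite.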

The model also proved highly proficient at finding optimal routes through mazes, showing that it can solve structured, logical problems significantly better than LLMs. Striking as the findings are, it is worth noting that the paper, published on the arXiv preprint server, has yet to be peer-reviewed. The ARC-AGI benchmark team, however, attempted to reproduce the results after the model was open-sourced. The team did verify the numbers, but it also found that the hierarchical architecture contributed far less to performance than reported; a less-documented refinement process applied during training was probably responsible for the strong figures.