A new math benchmark just dropped and leading AI models can solve ‘less than 2%’ of its problems… oh dear

Image: A robot reading a book in a library. (Image credit: PhonlamaiPhoto)

Sometimes I forget there’s a whole other world out there where AI models aren’t just used for basic tasks such as simple research and quick content summaries. Out in the land of bigwigs, they’re instead being used to help with everything from financial analysis to scientific research. That’s why their mathematical capabilities are so important—plus it’s a general marker of reasoning capabilities.

Which is why mathematical benchmarks exist. Benchmarks such as FrontierMath, which its maker, Epoch AI, has just dropped and which is putting LLMs through their paces with “hundreds of original, expert-crafted mathematics problems designed to evaluate advanced reasoning capabilities in AI systems” (via Ars Technica).

While today’s AI models don’t tend to struggle with other mathematical benchmarks such as GSM-8k and MATH, according to Epoch AI, “they solve less than 2% of FrontierMath problems, revealing a substantial gap between current AI capabilities and the collective prowess of the mathematics community”.

To be clear, these are hard problems. As in, so hard that they “typically require hours or days for expert mathematicians to solve”, ranging “from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory”.

What’s so different about this benchmark is that solving these mathematical problems requires “extended chains of precise reasoning, with each step building exactly on what came before”.

AI models have traditionally not been great at extended reasoning in general, let alone for super-advanced math. This makes sense when you consider what AI models, at bottom, are doing. Using LLMs as an example, these are trained on tons of data to figure out what each next word would most likely be based on this data. Although of course there’s plenty of room for directing the model more towards different words, the process is essentially probabilistic.
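For the curious, here's a toy sketch of that next-word idea. To be clear, this is not how any real model is built: the three-word "vocabulary" and its probabilities are made up purely for illustration, and the function names are my own. It just shows the difference between always picking the likeliest word and drawing one at random in proportion to its probability.

```python
import random

# Hypothetical toy probabilities for the word after "two plus two equals".
# These numbers are invented for illustration; a real LLM computes a
# distribution like this over its whole vocabulary at every step.
NEXT_WORD_PROBS = {"four": 0.90, "five": 0.06, "fish": 0.04}


def most_likely_word(probs):
    """Greedy choice: always take the single most probable next word."""
    return max(probs, key=probs.get)


def sample_word(probs, rng):
    """Probabilistic choice: draw a word in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(most_likely_word(NEXT_WORD_PROBS))
    print([sample_word(NEXT_WORD_PROBS, rng) for _ in range(10)])
```

The greedy version never says "fish", while the sampled version occasionally might, which is roughly what "directing the model more towards different words" comes down to in this simplified picture.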

Of late, however, we’ve seen AI models apply their probabilistic “thinking” in more of a directed fashion towards intermediary steps of this “thinking”. In other words, we’ve seen a move towards AI models that attempt to reason through their thinking, rather than just jumping to a probabilistic conclusion.

There’s now a version of ChatGPT-4o, for instance, that uses reasoning (and you’d better make sure you don’t question it). It’s also telling that you can now potentially be rewarded for submitting a question that AI can’t answer for “humanity’s last exam”.

Of course, these individual steps of reasoning might themselves be arrived at probabilistically—and could we expect any more from a non-sentient algorithm?—but they do seem to be engaging in what we flesh-and-bloodies after the fact consider to be “reasoning”.

We’re clearly a way off from having these AI models achieve the reasoning capabilities of our best and brightest, though. We can see that now that we have a mathematical benchmark capable of really putting them to the test—2% isn’t great, is it? (And take that, robots.)

While AI models might not be able to crack these difficult problems just yet, the FrontierMath benchmark looks to serve as a good litmus test for future improvements, ensuring the models aren’t just spewing out mathematical nonsense that only experts could verify as such.

We must, in the end, remember that AI is not truth-aiming, however closely we humans aim its probabilistic reasoning at results that tend towards the truth. The philosopher in me must ask: Without it having an inner life aiming towards truth, can truth actually exist for the AI, even if it spews it out? Truth for us, yes, but for the AI? I suspect not, and that’s why benchmarks like these will be crucial moving forwards into this new industrial revolution, or whatever they’re calling it these days.
