▲Seven replies to the viral Apple reasoning paper and why they fall shortgarymarcus.substack.com

312 points by spwestwood 20 hours ago | 245 comments

eviks 10 minutes ago [-]

> We have every right to expect machines to do things we can’t.

Not really, this makes little sense in general, but also when in comes to this specific type is machine. In general: you can have a machine that is worse than human in everything that it does yet still be immensely valuable because it's very cheap.

In this specific case:

> AGI should be a step forward

Nope, read the definition. Matching human level intelligence, warts and all, will by definition reach AGI.

> in many cases LLMs are a step backwards

That's ok, use them in cases where it's a step forward, what's the big deal?

> note the bait and switch from “we’re going to build AGI that can revolutionize the world” to “give us some credit, our systems make errors and humans do, too”.

Ah, well, again, not really, the author just has unrealistic model of the minimum requirements for a revolution.

thomasahle 17 hours ago [-]

> 1. Humans have trouble with complex problems and memory demands. True! But incomplete. We have every right to expect machines to do things we can’t. [...] If we want to get to AGI, we will have to better.

I don't get this argument. The paper is about "whether RLLMs can think". If we grant "humans make these mistakes too", but also "we still require this ability in our definition of thinking", aren't we saying "thinking in humans is a illusion" too?

YeGoblynQueenne 19 minutes ago [-]

>> I don't get this argument.

The argument is that LLMs are computer systems and a computer system that's as bad as a human is less useful than a human.

briandw 7 hours ago [-]

Humans use tools to extend their abilities. LLM can do the same. In this paper they didn’t allow tool use. When others gave the tower of hanoi task to llms with tool use, like a python env, they were able to complete the task.

xienze 6 hours ago [-]

But the Tower of Hanoi can be solved without "tools" by humans, simply by understanding the problem, thinking about the solution, and writing it out. Having the LLM shell out to a Python example that it "wrote" (or rather, "pasted" since surely a Python solution to the Tower of Hanoi was part of its training set) is akin to a human Googling "program to solve Tower of Hanoi", copy-pasting and running the solution. Yes the LLM has "reasoned" that the solution to the problem is call out to a solution that it "knows" is out there, but that's not really "thinking" about how to solve a problem in the human sense.

What happens when some novel Tower of Hanoi-esque puzzle is presented and there's nothing available in its training set to reference as an executable solution? A human can reason about and present a solution, but an LLM? Ehh...

eviks 9 minutes ago [-]

> can be solved without "tools" by humans, simply by understanding the problem,

This already excludes a lot of humans

DiogenesKynikos 5 hours ago [-]

LLMs are perfectly capable of writing code to solve problems that are not in their training set. I ask LLMs to write code for niche problems that you won't find answers to just by Googling all the time. The LLMs usually get it right.

dfawcus 1 hours ago [-]

Maybe they can, but what the human is able to do is examine the tower of Hanoi problem, and the derive the general rule for solving (odd or even number of disks).

Based upon that comprehension, we then need little working memory (tokens) to solve the problem, it just becomes tedious to execute the algorithm.. But the algorithm was derived after considering the first 3 or 4 cases.

Whereas for the moment, LLMS are just pattern matching; whereas we do the pattern match, then derive the generalised rule.

saberience 55 minutes ago [-]

LLMs do that too though lol.

The Tower of Hanoi problem is terrible example for somehow suggesting humans are superior.

Firstly, there are plenty of humans who can’t solve this problem even for 3 disks, let alone 6 or 7. Secondly, LLMs can both give you general instructions to solve for any case and they can write out exhaustive move lists too.

Anyway, the fact that there are humans who cannot do Tower of Hanoi already rules it out as a good test of general intelligence anyway. We don’t say that a human doesn’t have “general intelligence” if they cannot solve Towers of Hanoi, so why then would it be a good test for LLM general intelligence?

xienze 5 hours ago [-]

> LLMs are perfectly capable of writing code to solve problems that are not in their training set.

Examples of these problems? You'll probably find that they're simply compositions of things already in the training set. For example, you might think that "here's a class containing an ID field and foobar field. Make a linked list class that stores inserted items in reverse foobar order with the ID field breaking ties" is something "not in" the training set, but it's really just a composition of the "make a linked list class" and "sort these things based on a field" problems.

saberience 49 minutes ago [-]

That’s exactly what humans do though lol.

We reason about things based on our training data. We have a hard time or impossible time reasoning about things we haven’t trained on.

Ie: a human with no experience of board games cannot reason about chess moves. A human with no math knowledge cannot reason about math problems.

How would expect an LLM to reason about something with no training data?

YeGoblynQueenne 34 minutes ago [-]

>> Ie: a human with no experience of board games cannot reason about chess moves. A human with no math knowledge cannot reason about math problems.

Then how did the first humans solve math and chess problems, if there were none around solved to give them examples of how to solve them in the first place?

Workaccount2 56 minutes ago [-]

The problem with this is that anything presented can be claimed to be in the training set, which is likely a zetebyte in size if not larger. However the counter-factual, the LLM failing a problem that is provably in it's training set (there are many), seems to carry no weight.

procgen 49 minutes ago [-]

> that they're simply compositions of things already in the training set

Yes, knowledge is compositional. This is just as true for humans as it is for machines.

DiogenesKynikos 1 hours ago [-]

What you're describing is successful generalization from the training dataset, also called "understanding" by laypeople.

FINDarkside 17 hours ago [-]

Agreed. But also his point about AGI is incorrect. AI that will perform on the level of average human in every task is AGI by definition.

pzo 9 hours ago [-]

Why AGI need to be even as good as average human. If you get someone with 80 IQ is still smart enough to reason and do plenty of menial tasks. Also not sure why AGI need to be as good in every task? Average human will excel others at few tasks and sux terribly in many others.

Someone 9 hours ago [-]

Because that’s how AGI is defined. https://en.wikipedia.org/wiki/Artificial_general_intelligenc...: “Artificial general intelligence (AGI)—sometimes called human‑level intelligence AI—is a type of artificial intelligence that would match or surpass human capabilities across virtually all cognitive tasks”

But yes, you’re right that software needs not be AGI to be useful. Artificial narrow intelligence or weak AI (https://en.wikipedia.org/wiki/Weak_artificial_intelligence) can be extremely useful, even something as narrow as a services that transcribes speech and can’t do anything else.

dvfjsdhgfv 4 hours ago [-]

The Hanoi Towers example demonstrates that SOTA RLMs struggle with tasks a pre-schooler solves.

The implication here is that they excel at things that occur very often and are bad at novelty. This is good for individuals (by using RLMs I can quickly learn about many other aspects of human body of knowledge in a way impossible/inefficient with traditional methods) but they are bad at innovation. Which, honestly, is not necessarily bad: we can offload lower-level tasks[0] to RLMs and pursue innovation as humans.

[0] Usual caveats apply: with time, the population of people actually good at these low-level tasks will diminish, just as we have very few Assembler programmers for Intel/AMD processors.

16 hours ago [-]

simonw 17 hours ago [-]

That very much depends on which AGI definition you are using. I imagine there are a dozen or so variants out there. See also "AI" and "agents" and (apparently) "vibe coding" and pretty much every other piece of jargon in this field.

FINDarkside 16 hours ago [-]

I think it's very widely accepted definition and there's really no competing definitions either as far as I know. While some people might think AGI means superintelligence, it's only because they've heard the term but never bothered to look up what it means.

simonw 14 hours ago [-]

OpenAI: https://openai.com/index/how-should-ai-systems-behave/#citat...

"By AGI, we mean highly autonomous systems that outperform humans at most economically valuable work."

AWS: https://aws.amazon.com/what-is/artificial-general-intelligen...

"Artificial general intelligence (AGI) is a field of theoretical AI research that attempts to create software with human-like intelligence and the ability to self-teach. The aim is for the software to be able to perform tasks that it is not necessarily trained or developed for."

DeepMind: https://arxiv.org/abs/2311.02462

"Artificial General Intelligence (AGI) is an important and sometimes controversial concept in computing research, used to describe an AI system that is at least as capable as a human at most tasks. [...] We argue that any definition of AGI should meet the following six criteria: We emphasize the importance of metacognition, and suggest that an AGI benchmark should include metacognitive tasks such as (1) the ability to learn new skills, (2) the ability to know when to ask for help, and (3) social metacognitive abilities such as those relating to theory of mind. The ability to learn new skills (Chollet, 2019) is essential to generality, since it is infeasible for a system to be optimized for all possible use cases a priori [...]"

The key difference appears to be around self-teaching and meta-cognition. The OpenAI one shortcuts that by focusing on "outperform humans at most economically valuable work", but others make that ability to self-improve key to their definitions.

Note that you said "AI that will perform on the level of average human in every task" - which disagrees very slightly with the OpenAI one (they went with "outperform humans at most economically valuable work"). If you read more of the DeepMind paper it mentions "this definition notably focuses on non-physical tasks", so their version of AGI does not incorporate full robotics.

bluefirebrand 13 hours ago [-]

Doesn't the "G" in AGI stand for "General" as in "Generally Good at everything"?

neom 12 hours ago [-]

I think the G is what really screws things up. I thought it was, as good as the general human, but upon googling it has a defined meaning among researchers. There appears to be confusion all over the place tho.

General-Purpose (Wide Scope): It can do many types of things.

Generally as Capable as a Human (Performance Level): It can do what we do.

Possessing General Intelligence (Cognitive Mechanism): It thinks and learns the way a general intelligence does.

So, for researchers, general intelligence is characterized by: applying knowledge from one domain to solve problems in another, adapting to novel situations without being explicitly programmed for them, and: having a broad base of understanding that can be applied across many different areas.

adastra22 9 hours ago [-]

Yes, but “good at” here has a very limited, technical meaning, which can be oversimplified as “better than random chance.”

If something can be better than random chance in any arbitrary problem domain it was not trained on, that is AGI.

math_dandy 13 hours ago [-]

I was hoping the accepted definition would not use humans as a baseline, rather that humans would be an (the) example of AGI.

thomasahle 8 hours ago [-]

The argument of (1) doesn't really have anything to do with humans or antromorphising. We're not even discussing AGI, we're just talking about the property of "thinking".

If somebody claims "computers can't do X, hence they can't think". A valid counter argument is "humans can't do X either, but they can think."

It's not important for the rebuttal that we used humans. Just that there exists entities that don't have property X, but are able to think. This shows X is not required for our definition of "thinking".

bastawhiz 13 hours ago [-]

The A in AGI is "artificial" which sort of precludes humans from being AGI (unless you have a very unconventional belief about the origin of humans).

Since there's not really a whole lot of unique examples of general intelligence out there, humans become a pretty straightforward way to compare.

xeonmc 12 hours ago [-]

> unless you have a very unconventional belief about the origin of humans

No so unconventional in many cultures.

bastawhiz 11 hours ago [-]

Certainly many cultures and religions believe in some flavor of intelligent design, but you could argue that if the natural world (for what we generally regard as "the natural world") is created by the same entity or entities that created humans, that doesn't make humans artificial. Ignoring the metaphysical (souls and such) I'm struggling to think of a culture that believes the origin of humans isn't shared by the world.

In this case, I was thinking of unusual beliefs like aliens creating humans or humans appearing abruptly from an external source such as through panspermia.

usef- 10 hours ago [-]

Yes. I wonder if he was thinking of ASI, not AGI

gylterud 8 hours ago [-]

ASI meaning Artificial Super Intelligence, I guess.

adastra22 9 hours ago [-]

Most people are. One of my pet peeves is that people falsely equate AGI with ASI, constantly. We have had full AGI for years now. It is a powerful tool, but not what people tend to think of as god-like “AGI.”

aaron695 10 hours ago [-]

[dead]

mathgradthrow 14 hours ago [-]

the average human is good at something, and sucks at almost everything. Human performance at chess and average performance at chess differ by 7 orders of magnitude.

datadrivenangel 13 hours ago [-]

Your standard model of human needs a little bit of fine tuning for most games.

jltsiren 12 hours ago [-]

AGI should perform on the level of an experienced professional in every task. The average human is useless for pretty much everything but capable of learning to perform almost any task, given enough motivation and effort.

Or perhaps AGI should be able to reach the level of an experienced professional in any task. Maybe a single system can't be good at everything, if there are inherent trade-offs in learning to perform different tasks well.

godelski 12 hours ago [-]

For comparison, the average person can't print Hello World in python. Your average programmer (probably) can.

It's surprisingly simple to be above average in most tasks. Which people often confuse with having expertise. It's probably pretty easy to get into the 80th percentile of most subjects. That won't make you the 80th percentile of people that do the thing, but most people don't. I'd wager 80th percentile is still amateur.

MoonGhost 9 hours ago [-]

> The average human is useless for pretty much everything but capable of learning to perform almost any task

But only the limited number of tasks per human.

> Or perhaps AGI should be able to reach the level of an experienced professional in any task.

Even if it performs just better than untrained human but on any task this will be superhuman level. As no human can do it.

jltsiren 9 hours ago [-]

The G in AGI stands for "general", not for "superhuman". An intelligence that can't learn to perform information processing and decision-making tasks people routinely do does not seem very general to me.

whatagreatboy 9 hours ago [-]

the real ability of intelligence is to correct mistakes in a gradual and consistent way.

autobodie 17 hours ago [-]

Agree. Both sides of the argument are unsatisfying. They seem like quantitative answers to a qualitative question.

serbuvlad 16 hours ago [-]

"Have we created machines that can do something qualitatevely similar to that part of us that can correlate known information and pattern recognition to produce new ideas and solutions to problems -- that part we call thinking?"

I think the answer to this question is certainly "Yes". I think the reason people deny this is because it was just laughably easy in retrospect.

In mid-2022 people were like. "Wow this GPT3 thing generates kind of coherent greentexts"

Since then really only we got: larger models, larger models, search, agents, larger models, chain-of-thought and larger models.

And from a novelty toy we got a set of tools that at the very least massively increase human productivity in a wide range of tasks and certainly pass any Turing test.

Attention really was all you needed.

But of course, if you ask a buddhist monk, he'll tell you we are attention machines, not computation machines.

He'll also tell you, should you listen, that we have a monkey in our mind that is constantly producing new thoughts. This monkey is not who we are, it's an organ. It's thoughts are not our thoughts. It's something we perceive. And that we shouldn't identify with.

Now we have thought-genrating-monkeys with jet engines and adrenaline shots.

This can be good. Thought-genrating-monkeys put us on the moon and wrote Hamlet and the Oddesy.

The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

viccis 16 hours ago [-]

>I think the answer to this question is certainly "Yes".

It is unequivocally "No". A good joint distribution estimator is always by definition a posteriori and completely incapable of synthetic a priori thought.

serbuvlad 13 hours ago [-]

The human mind is an estimator too.

The fact that the human mind can think in concepts, images AND words, and then compresses that into words for transmission, wheras LLMs think directly in words, is no object.

If you watch someone reach a ledge, your mind will generate, based on past experience, a probabilistic image of that person falling. Then it will tie that to the concept of problem (self-attention) and start generating solutions, such as warning them or pulling them back etc.

LLMs can do all this too, but only in words.

corimaith 3 hours ago [-]

Do you think language is sufficient to model reality (not just physical, but abstract) here?

I think not, we can get close, but there exists problems and situations beyond that, especially in mathematics and philosophy. And I don't a visual medium or combination of is sufficient either, there's a more fundamental, underlying abstract structure that we use to model reality.

viccis 10 hours ago [-]

>LLMs think

Quick aside here: They do not think. They estimate generative probability distributions over the token space. If there's one thing I do agree with Dijkstra on, it's that it's important not to anthropomorphize mathematical or computing concepts.

As far as the rest of your comment, I generally agree. It sort of fits a Kantian view of epistemology, in which we have sensibility giving way to semiotics (we'll say words and images for simplicity) and we have concepts that we understand by a process of reasoning about a manifold of things we have sensed.

That's not probabilistic though. If we see someone reach a ledge and take a step over it, then we are making a synthetic a priori assumption that they will fall. It's synthetic because there's nothing about a ledge that means the person must fall. It's possible that there's another ledge right under we can't see. Or that they're in zero gravity (in a scifi movie maybe). Etc. It's a priori because we're making this statement not based on what already happened but rather what we know will happen.

We accomplish this by forming concepts such as "ledge", "step", "person", "gravity", etc., as we experience them until they exist in our mind as purely rational concepts we can use to reason about new experiences. We might end up being wrong, we might be right, we might be right despite having made the wrong claims (maybe we knew he'd fall because of gravity, however there was no gravity but he ended up being pushed by someone and "falling" because of it, this is called a "Gettier problem"). But our correctness is not a matter of probability but rather one of how much of the situation we understand and how well we reason about it.

Either way, there is nothing to suggest that we are working from a probability model. If that were the case, you wind up in what's called philosophical skepticism [1], in which, if all we are are estimation machines based on our observances, how can we justify any statement? If every statement must have been trained by a corresponding observation, then how do we probabilistically model things like causality that we would turn to to justify claims?

Kant's not the only person to address this skepticism, but he's probably the most notable to do so, and so I would challenge you to justify whether the "thinking" done by LLMs has any analogue to the "thinking" done using the process described in my second paragraph.

[1] https://en.wikipedia.org/wiki/Philosophical_skepticism#David...

mofeien 7 hours ago [-]

> We accomplish this by forming concepts such as "ledge", "step", "person", "gravity", etc., as we experience them until they exist in our mind as purely rational concepts we can use to reason about new experiences.

So we receive inputs from the environment and cluster them into observations about concepts, and form a collection of truth statements about them. Some of them may be wrong, or apply conditionally. These are probabilistic beliefs learned a posteriori from our experiences. Then we can do some a priori thinking about them with our eyes and ears closed with minimal further input from the environment. We may generate some new truth statements that we have not thought about before (e. g. "stepping over the ledge might not cause us to fall because gravity might stop at the ledge") and assign subjective probabilities to them.

This makes the a priori seem to always depend on previous a posterioris, and simply mark the cutoff from when you stop taking environmental input into account for your reasoning within a "thinking session". Actually, you might even change your mind mid-reasoning process based on the outcome of a thought experiment you perform which you use to update your internal facts collection. This would give the a priori reasing you're currently doing an even stronger a posteriori character. To me, these observations above basically dissolve the concept of a priori thinking.

And this makes it seem like we are very much working from probabilistic models, all the time. To answer how we can know anything: If a statement's subjective probability becomes high enough, we qualify it as a fact (and may be wrong about it sometimes). But this allows us to justify other statements (validly, in ~ 1-sometimes of cases). Hopefully our world model map converges towards a useful part of the territory!

serbuvlad 8 hours ago [-]

But I do not think humans think like that by default.

When I spill a drink, I don't think "gravity". That's too slow.

And I don't think humans are particularly good at that kind of rational thinking.

viccis 8 hours ago [-]

>When I spill a drink, I don't think "gravity". That's too slow.

I think you do, you just don't need to notice it. If you spilled it in the International Space Station, you'd probably respond differently even if you didn't have to stop and contemplate the physics of the situation.

nerdponx 16 hours ago [-]

That doesn't seem true to me at all. Let's say you fit y=c+bx+ax^2 on the domain -10,10 with 1000 data points uniformly distributed along x and with no more than 1% noise in observed y. Your model will be pretty damn good and absolutely will be able to generate "synthetic a priori" y outputs for any given x within the domain.

Now let's say you didn't know the true function and had to use a neural network instead. You would probably still get a great result in the sense of generating "new" outputs that are not observed in the training data, as long as they are within or reasonably close to the original domain.

LLMs are that. With enough data and enough parameters and the right inductive bias and the right RLHF procedure etc, they are getting increasingly good at estimating a conditional next token distribution given the context. If by "synthetic" you mean that an LLM can never generate a truly new idea that was not in it's training data, then that becomes the question of what the "domain" of the data really is.

I'm not convinced that LLMs are strictly limited to ideas that they have "learned" in their data. Before LLMs, I don't think people realized just how much pattern and structure there was in human thought, and how exposed it was through text. Given the advances of the last couple of years, I'm starting to come around to the idea that text contains enough instances of reasoning and thinking that these models might develop some kind of ability to do something like reasoning and thinking simply because they would have to in order to continue decreasing validation loss.

I want to be clear that I am not at all an AI maximalist, and the fact that these things are built largely on copyright infringement continues to disgust me, as do the growing economic and environmental externalities and other problems surrounding their use and abuse. But I don't think it does any good to pretend these things are dumber than they are, or to assume that the next AI winter is right around the corner.

viccis 10 hours ago [-]

>Your model will be pretty damn good and absolutely will be able to generate "synthetic a priori" y outputs for any given x within the domain.

You don't seem to understand what synthetic a priori means. The fact that you're asking a model to generate outputs based on inputs means it's by definition a posteriori.

>You would probably still get a great result in the sense of generating "new" outputs that are not observed in the training data, as long as they are within or reasonably close to the original domain.

That's not cognition and has no epistemological grounds. You're making the assumption that better prediction of semiotic structure (of language, images, etc.) results in better ability to produce knowledge. You can't model knowledge with language alone, the logical positivists found that out to their disappointment a century or so ago.

For example, I don't think you adequately proved this statement to be true:

>they would have to in order to continue decreasing validation loss

This works if and only if the structure of knowledge lies latently beneath the structure of semiotics. In other words, if you can start identifying the "shape" of the distribution of language, you can perturb it slightly to get a new question and expect to get a new correct answer.

autobodie 16 hours ago [-]

> The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

I cannot afford to consider whether you are right because I am a slave to capital, and therefore may as well be a slave to capital's LLMs. The same goes for you.

serbuvlad 12 hours ago [-]

I am not a slave to capital. I am a slave to the harsh nature of the world.

I get too hot in summer and too cold in winter. I die of hunger. I am harassed by critters of all sorts.

And when my bed breaks, to keep my fragile spine from straining at night, I _want_ some trees to be cut, some mattresses to be provisioned, some designers to be provisioned etc. And capital is what gets me that, from people I will never meet, who wouldn't blink once if I died tomorrow.

LinXitoW 7 hours ago [-]

Considering capitalism is a very new phenomenon in human history, how do you think people survived and thrived for the other 248000 years? It's as ludicrous to believe that capitalism is some kind of force of nature as it is to believe kings were chosen by god.

serbuvlad 7 hours ago [-]

That depends on how you define your terms. A pro-capital laissez-faire policy is new, sure.

But the first civilizations in the world around 3000BC had trade, money, banking, capital accumulation, divison of labour etc.

jes5199 14 hours ago [-]

I think the Apple paper is practically a hack job - the problem was set up in such a way that the reasoning models must do all of their reasoning before outputting any of their results. Imagine a human trying to solve something this way: you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower - and past a certain size/complexity, it would be impossible.

And this isn’t how LLMs are used in practice! Actual agents do a thinking/reasoning cycle after each tool-use call. And I guarantee even these 6-month-old models could do significantly better if a researcher followed best practices.

Brystephor 13 hours ago [-]

Forcing reasoning is analogous to requiring a student to show their work when solving a problem if im understanding the paper correctly.

> you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower

This part i dont understand. Why would coming up with an algorithm (e.g. a simple pattern) and reciting it be impossible? The paper doesnt mention the models coming up with the algorithm at all AFAIK. If the model was able to come up with the pattern required to solve the puzzles and then also execute (e.g. recite) the pattern, then that'd show understanding. However the models didn't. So if the model can answer the same question for small inputs, but not for big inputs, then doesnt that imply the model is not finding a pattern for solving the answer but is more likely pulling from memory? Like, if the model could tell you fibbonaci numbers when n=5 but not when n=10, that'd imply the numbers are memorized and the pattern for generation of numbers is not understood.

qarl 13 hours ago [-]

> The paper doesnt mention the models coming up with the algorithm at all AFAIK.

And that's because they specifically hamstrung their tests so that the LLMs were not "allowed" to generate algorithms.

If you simply type "Give me the solution for Towers of Hanoi for 12 disks" into chatGPT it will happily give you the answer. It will write program to solve it, and then run that program to produce the answer.

But according to the skeptical community - that is "cheating" because it's using tools. Nevermind that it is the most effective way to solve the problem.

https://chatgpt.com/share/6845f0f2-ea14-800d-9f30-115a3b644e...

zoul 10 hours ago [-]

This is not about finding the most effective solution, it’s about showing that they “understand” the problem. Could they write the algorithm if it were not in their training set?

boredhedgehog 9 hours ago [-]

If that's the point, shouldn't they ask the model to explain the principle for any number of discs? What's the benefit of a concrete application?

johnecheck 7 hours ago [-]

Because that would prove absolutely nothing. There are numerous examples of tower of Hanoi explanations in the training set.

elbear 5 hours ago [-]

How do you check that a human understood it and not simply memorised different approaches?

Too 10 hours ago [-]

How can one know that's not coming from the pre-trained data. The paper is trying to evaluate whether the LLM has general problem solving ability.

jsnell 9 hours ago [-]

The paper doesn't mention it because either the researchers did not care to check the outputs manually, or reporting what was in the outputs would have made it obvious what their motives were.

When this research has been reproduced, the "failures" on the Tower of Hanoi are the model printing out a bunch of steps, saying there is no point in doing it thousands of times more. And they they'd either output an the algorithm for printing the rest in words or code

xtracto 10 hours ago [-]

I think the paper got unwanted attention... for a scientific paper. It's like that old paper about a "gravity shielding" podkelnov rings experiment that got publicized by some UK news paper as "scientists find antigravity" and ended up destroying the Russian author's career.

By the way, it seems Appke researchers got inspired by this [1] older chinese paper to get their title. The Chinese author's made a very similar argument, without the experiments. I myself believe Apple experiments are just good curiosities, but don't drive as much of a point as they believe.

[1] https://arxiv.org/abs/2506.02878

wohoef 18 hours ago [-]

Good article giving some critique to Apple's paper and Gary Marcus specifically.

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...

godelski 12 hours ago [-]

  > this is a preprint that has not been peer reviewed.

This conversation is peer review...

You don't need a conference for something to be peer reviewed, you only need... peers...

In fact, this paper is getting more peer review than most works. Conferences are notoriously noisy as reviewers often don't care and are happy to point out criticisms. All works have valid criticisms... Finding criticisms is the easy part. The hard part is figuring out if these invalidate the claims or not.

hintymad 17 hours ago [-]

Honest question: does the opinion of Gary Marcus still count? His criticism seems more philosophical than scientific. It's hard for me see what he builds or reasons to get to his conclusions.

zer00eyz 17 hours ago [-]

> seems more philosophical than scientific

I think this is a fair assessment but reason, and intelligence dont really have an established control or control group. If you build a test and say "Its not intelligent because it can't..." and someone goes out and add's that feature in is it suddenly now intelligent?

If we make a physics break through tomorrow is there any LLM that is going to retain that knowledge permanently as part of its core or will they all need to be re-trained? Can we make a model that is as smart as a 5th grader without shoving the whole corpus of human knowledge into it, folding it over twice and then training it back out?

The current crop of tech doesn't get us to AGI. And the focus to make it "better" is for the most part a fools errand. The real winners in this race are going to be those who hold the keys to optimization: short retraining times, smaller models (with less upfront data), optimized for lower performance systems.

hintymad 13 hours ago [-]

> The current crop of tech doesn't get us to AGI

I actually agree with this. Time and again, I can see that LLMs do not really understand my questions, let alone being able to perform logical deductions beyond in-distribution answers. What I’m really wondering is whether Marcus’s way of criticizing LLMs is valid.

Workaccount2 15 hours ago [-]

What gets me, and the author talks about it in the post, is that people will readily attribute correct answers to "its in the training set" but nobody says anything about incorrect answers that are in the training set. LLMs get stuff in the training set wrong all the time, but nobody uses it as evidence that it probably can't lean too hard on it's memorization for complex questions it does get right.

It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

thrwaway55 11 hours ago [-]

Do you hypothese that they see more wrong examples then right? Why is there concern about model collapse if they are reasoning and can sort it out, why does the data even need to be scrubbed before training?

How many r's really are in Strawberry?

Jensson 14 hours ago [-]

> It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

Both of those can be true at the same time though. They memorize a lot of things, but its fuzzy and when they remember wrong they cannot fix it via reasoning.

Workaccount2 13 hours ago [-]

It's more than fuzzy, they are packing exabytes, perhaps zetabytes of training data into a few terabytes. Without any reasoning ability it must be divine intervention that they ever get anything right...

chongli 5 hours ago [-]

It is divine intervention if you believe human minds are the product of a divine creator. Most of the attribution of miraculous reasoning ability on the part of LLMs I would attribute to pareidolia on the part of their human evaluators. I don’t think we’re much closer at all to having an AI which can replace an average minimum wage full-time worker, who will work largely unsupervised but ask their manager for help when needed, without screwing anything up.

We have LLMs that can produce copious text but cannot stop themselves from attempting to solve a problem they have no idea how to solve and making a mess of things as a result. This puts an LLM on the level of an overly enthusiastic toddler at best.

DanAtC 16 hours ago [-]

[flagged]

diego898 16 hours ago [-]

Sorry - what do you mean by yud-cult? Searching google didn’t help me (as far as I can tell) - I view LW from an outside perspective as well, but don’t understand the reference

jazzypants 15 hours ago [-]

They're referring to the founder of that website, Eliezer Yudkowsky, who is controversial due to his 2023 Time article that called for a complete halt on the development of AI.

https://en.m.wikipedia.org/wiki/Eliezer_Yudkowsky

https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-no...

heavyset_go 15 hours ago [-]

Yudkowsky is controversial for much more than an article from 2023.

Yudkowsky lacks credentials and MIRI and its adjacents have proven to be incestuous organizations when it comes to the rationalist cottage industry, one that has a serious problem with sexual abuse and literal cults.

diego898 1 hours ago [-]

Do you have any references you can share for sexual abuse and/or cults associated with rationalists?

ninjin 14 hours ago [-]

It was not so much the call for a complete halt that caused controversy, but rather this part of his piece in Time (my emphasis):

"Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that's what it takes to reduce the risk of large AI training runs.

That's the kind of policy change that would cause my partner and I to hold each other, and say to each other that a miracle happened, and now there's a chance that maybe [our daughter] will live."

astrange 10 hours ago [-]

[flagged]

tomhow 9 hours ago [-]

Please don't do this here.

9 hours ago [-]

lexh 15 hours ago [-]

Use the debatably intelligent machines for this sort of question not Google.

It seems “Yud” here is a shorthand for Yudkowsky. Hinted by the capitalization.

f33d5173 16 hours ago [-]

Elezier yudkowsky, often referred to as yud, started lw.

labrador 19 hours ago [-]

The key insight is that LLMs can 'reason' when they've seen similar solutions in training data, but this breaks down on truly novel problems. This isn't reasoning exactly, but close enough to be useful in many circumstances. Repeating solutions on demand can be handy, just like repeating facts on demand is handy. Marcus gets this right technically but focuses too much on emotional arguments rather than clear explanation.

swat535 18 hours ago [-]

If that was the case, it would have been great already but these tools can’t even do that. They frequently make mistake repeating the same solutions available everywhere during their “reasoning” process and fabricates plausible hallucinations which you then have to inspect carefully to catch.

woopsn 17 hours ago [-]

That alone would be revolutionary - but still aspirational for now. The other day Gemini mixed up left and right on me in response to basic textbook problem.

Jabrov 19 hours ago [-]

I’m so tired of hearing this be repeated, like the whole “LLMs are _just_ parrots” thing.

It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there’s so many examples out there.

______________

Edit for responders, instead of replying to each:

We obviously have to define what we mean by "reasoning" and "solving novel problems". From my point of view, reasoning != general intelligence. I also consider reasoning to be a spectrum. Just because it cannot solve the hardest problem you can think of does not mean it cannot reason at all. Do note, I think LLMs are generally pretty bad at reasoning. But I disagree with the point that LLMs cannot reason at all or never solve any novel problems.

In terms of some backing points/examples:

1) Next token prediction can itself be argued to be a task that requires reasoning

2) You can construct a variety of language translation tasks, with completely made up languages, that LLMs can complete successfully. There's tons of research about in-context learning and zero-shot performance.

3) Tons of people have created all kinds of challenges/games/puzzles to prove that LLMs can't reason. One by one, they invariably get solved (eg. https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224..., https://ahmorse.medium.com/llms-and-reasoning-part-i-the-mon...) -- sometimes even when the cutoff date for the LLM is before the puzzle was published.

4) Lots of examples of research about out-of-context reasoning (eg. https://arxiv.org/abs/2406.14546)

In terms of specific rebuttals to the post:

1) Even though they start to fail at some complexity threshold, it's incredibly impressive that LLMs can solve any of these difficult puzzles at all! GPT3.5 couldn't do that. We're making incremental progress in terms of reasoning. Bigger, smarter models get better at zero-shot tasks, and I think that correlates with reasoning.

2) Regarding point 4 ("Bigger models might to do better"): I think this is very dismissive. The paper itself shows a huge variance in the performance of different models. For example, in figure 8, we see Claude 3.7 significantly outperforming DeepSeek and maintaining stable solutions for a much longer sequence length. Figure 5 also shows that better models and more tokens improve performance at "medium" difficulty problems. Just because it cannot solve the "hard" problems does not mean it cannot reason at all, nor does it necessarily mean it will never get there. Many people were saying we'd never be able to solve problems like the medium ones a few years ago, but now the goal posts have just shifted.

aucisson_masque 18 hours ago [-]

> It’s patently obvious that LLMs can reason and solve novel problems not in their training data.

Would you care to tell us more ?

« It’s patently obvious » is not really an argument, I could say just as well that everyone know LLM can’t resonate or think (in the way we living beings do).

socalgal2 10 hours ago [-]

I'm working on new API. I asked the LLM to read the spec and write tests for it. It does. I don't know if that's "reasoning". I know that no tests exist for this API. I know that the internet is not full of training data for this API because it's a new API. It's also not a CRUD API or some other API that's got a common pattern. And yet, with a very short prompt, Gemini Code Assist wrote valid tests for a new feature.

It certainly feels like more than fancy auto-complete. That is not to say I haven't run into issue but I'm still often shocked at how far it gets. And that's today. I have no idea what to expect in 6 months, 12, 2 years, 4, etc.

travisjungroth 13 hours ago [-]

Copied from a past comment of mine:

I just made up this scenario and these words, so I'm sure it wasn't in the training data.

Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.

Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier. I have an erork that needs to be plimfed. Choose one group and one method to do it.

> Use Plyzers and do a Quoning procedure on your erork.

If that doesn't count as reasoning or generalization, I don't know what does.

firesteelrain 13 hours ago [-]

It’s just a truth table. I had a hunch that it was a truth table and then I asked AI how it figured it out and it confirmed it built a truth table. Still impressive either way

* Goal: Pick (Group ∧ Method) such that Group can plimf ∧ Method is a type of plimfing

* Only one group (Plyzers) passes the "can plimf" test

* Only one method (Quoning) is definitely plimfing

Therefore, the only valid (Group ∧ Method) combo is: → (Plyzer ∧ Quoning)

Source: ChatGPT

travisjungroth 13 hours ago [-]

So? Is the standard now that reasoning using truth tables or reasoning that can be expressed as truth tables doesn’t count?

krackers 13 hours ago [-]

If anything you'd think that the neurosymbolic people would be pleased that the LLMs do in fact reason by learning circuits representing boolean logic and truth tables. In a way they were right, it's just that starting with logic and then feeding in knowledge grounded in that logic (like Cyc) seems less scalable than feeding in knowledge and letting the model infer the underlying logic.

firesteelrain 11 hours ago [-]

Right, that’s my point. LLMs are doing pattern abstraction and in this way can mimic logic. They are not trained explicitly to do just truth tables even thought truth tables are fundamental.

16 hours ago [-]

goalieca 17 hours ago [-]

So far they cannot even answer questions which are straight up fact checking and search engine like queries. Reasoning means they would be able to work through a problem and generate a proof they way a student might.

Workaccount2 15 hours ago [-]

So if they have bad memory, then they must be reasoning to get the correct answer for the problems they do solve?

Jensson 14 hours ago [-]

A clock that is right twice a day is still broken.

Workaccount2 13 hours ago [-]

I think it's more fair to say a clock that is wrong twice a day is still broken...

astrange 10 hours ago [-]

> It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data.

So can real parrots. Parrots are pretty smart creatures.

18 hours ago [-]

bfung 18 hours ago [-]

Any links or examples available? Curious to try it out

lossolo 1 hours ago [-]

Your entire edit essentially walks back your earlier strong claims.

None of your current points actually support your position.

1. No, it doesn't. That's a ridiculous claim. Are you seriously suggesting that statistics require reasoning?

2. If you map that language to tokens, it's obvious the model will follow that mapping.

etc.

Here are papers showing that these models can't reason:

https://arxiv.org/abs/2311.00871

https://arxiv.org/abs/2309.13638

https://arxiv.org/abs/2311.09247

https://arxiv.org/abs/2305.18654

https://arxiv.org/abs/2309.01809

You're mistaking pattern matching and the modeling of relationships in latent space for genuine reasoning.

I don't know what you're working on, but while I'm not curing cancer, I am solving problems that aren't in the training data and can't be found on Google. Just a few days ago, Gemini 2.5 Pro literally told me it didn’t know what to do and asked me for help. The other models hallucinated incorrect answers. I solved the problem in 15 minutes.

If you're working on yet another CRUD app, and you've never implemented transformers yourself or understood how they work internally, then I understand why LLMs might seem like magic to you.

labrador 18 hours ago [-]

I've done this excercise dozens of times because people keep saying it, but I can't find an example where this is true. I wish it was. I'd be solving world problems with novel solutions right now.

People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.

jjaksic 16 hours ago [-]

"Solving novel problems" does not mean "solving world problems that even humans are unable to solve", it simply means solving problems that are "novel" compared to what's in the training data.

Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.

jhanschoo 17 hours ago [-]

I think that "solving world problems with novel solutions" is a strawman for an ability to reason well. We cannot solve world problems with reasoning, because pure reasoning has no relation to reality. We lack data and models about the world to confirm and deny our hypotheses about the world. That is why the empirical sciences do experiments instead of sit in an armchair and mull all day.

andrewmcwatters 18 hours ago [-]

It's definitely not true in any meaningful sense. There are plenty of us practitioners in software engineering wishing it was true, because if it was, we'd all have genius interns working for us on Mac Studios at home.

It's not true. It's plainly not true. Go have any of these models, paid, or local try to build you novel solutions to hard, existing problems despite being, in some cases, trained on literally the entire compendium of open knowledge in not just one, but multiple adjacent fields. Not to mention the fact that being able to abstract general knowledge would mean it would be able to reason.

They. Cannot. Do it.

I have no idea what you people are talking about because you cannot be working on anything with real substance that hasn't been perfectly line fit to your abundantly worked on problems, but no, these models are obviously not reasoning.

I built a digital employee and gave it menial tasks that compare to current cloud solutions who also claim to be able to provide you paid cloud AI employees and these things are stupider than fresh college grads.

18 hours ago [-]

multjoy 18 hours ago [-]

Lol, no.

lossolo 18 hours ago [-]

They can't create anything novel and it's patently obvious if you understand how they're implemented. But I'm just some anonymous guy on HN, so maybe this time I will just cite the opinion of the DeepMind CEO, who said in a recent interview with The Verge (available on YouTube) that LLMs based on transformers can't create anything truly novel.

jjaksic 16 hours ago [-]

Since when is reasoning synonymous with invention? All humans with a functioning brain can reason, but only a tiny fraction have or will ever invent anything.

lossolo 2 hours ago [-]

Read what OP said "It’s patently obvious to me that LLMs can ... solve novel problems", this is what I was replying to. I see everyone is smarter here than researchers at DeepMind, without any proofs or credentials to back their claims.

labrador 18 hours ago [-]

"I don't think today's systems can invent, you know, do true invention, true creativity, hypothesize new scientific theories. They're extremely useful, they're impressive, but they have holes."

Demis Hassabis On The Future of Work in the Age of AI (@ 2:30 mark)

https://www.youtube.com/watch?v=CRraHg4Ks_g

lossolo 18 hours ago [-]

Yes, this one. Thanks

gjm11 16 hours ago [-]

He doesn't say "that LLMs based on transformers can't create anything truly novel". Maybe he thinks that, maybe not, but what he says is that "today's systems" can't do that. He doesn't make any general statement about what transformer-based LLMs can or can't do; he's saying: we've interacted with these specific systems we have right now and they aren't creating genuinely novel things. That's a very different claim, with very different implications.

Again, for all I know maybe he does believe that transformer-based LLMs as such can't be truly creative. Maybe it's true, whether he believes it or not. But that interview doesn't say it.

2 hours ago [-]

aucisson_masque 18 hours ago [-]

That’s the opposite of reasoning tho. Ai bros want to make people believe LLM are smart but they’re not capable of intelligence and reasoning.

Reasoning mean you can take on a problem you’ve never seen before and think of innovative ways to solve it.

LLM can only replicate what is in its data, it can in no way think or guess or estimate what will likely be the best solution, it can only output a solution based on a probability calculation made on how frequent it has seen this solution linked to this problem.

labrador 18 hours ago [-]

You're assuming we're saying LLMs can't reason. That's not what we're saying. They can execute reasoning-like processes when they've seen similar patterns, but this breaks down when true novel reasoning is required. Most people do the same thing. Some poeple can come up with novel solutions to new problems, but LLMs will choke. Here's an example:

Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."

I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:

    Roughly 3 million shipwrecks on ocean floors globally
    Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
    So ~3,000 ships with pianos sunk
    Average maybe 0.5 pianos per ship (not all passenger areas had them)
    Estimate: ~1,500 pianos

*Claude Sonnet 4, Google Gemini 2.5 and GPT 4o

kgeist 16 hours ago [-]

GPT4o isn't considered an "advanced" LLM at this point. It doesn't use reasoning.

I gave your prompt to o3 pro, and this is what I got without any hints:

  Historic shipwrecks (1850 → 1970)
  • ~20 000 deep water wrecks recorded since the age of steam and steel  
  • 10 % were passenger or mail ships likely to carry a cabin class or saloon piano   
  • 1 piano per such vessel 20 000 × 10 % × 1 ≈ 2 000

  Modern container losses (1970 → today)
  • ~1 500 shipping containers lost at sea each year  
  • 1 in 2 000 containers carries a piano or electric piano   
  • Each piano container holds ≈ 5 units   
  • 50 year window 1 500 × 50 / 2 000 × 5 ≈ 190

  Coastal disasters (hurricanes, tsunamis, floods)
  • Major coastal disasters each decade destroy ~50 000 houses  
  • 1 house in 50 owns a piano   
  • 25 % of those pianos are swept far enough offshore to sink and remain (50 000 / 50) × 25 % × 5 decades ≈ 1 250

  Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300

  Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.

yen223 15 hours ago [-]

The difference between o3 and o4-mini is so substantial I think this is the reason why people can't agree on how capable LLMs are nowadays.

theendisney 12 hours ago [-]

The correct answer is: I'm sorry, I don't have time for this.

FINDarkside 16 hours ago [-]

What does "choked on it" mean for you? Gemini 2.5 pro gives this, even estimating what amouns of those 3m ships that sank after pianos became common item. Not pasting the full reasoning here since it's rather long.

Combining our estimates:

From Shipwrecks: 12,500 From Dumping: 1,000 From Catastrophes: 500 Total Estimated Pianos at the Bottom of the Sea ≈ 14,000

Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.

Jabrov 17 hours ago [-]

That seems like a totally reasonable response ... ?

labrador 17 hours ago [-]

I think you missed the part where I had to give them hinits to solve it. All 3 initially couldn't or refused saying it was not a real problem on their first try.

ej88 17 hours ago [-]

Can you share the chats? I tried with o3 and it gave a pretty reasonable answer on the first try.

https://chatgpt.com/share/684e02de-03f0-800a-bfd6-cbf9341f71...

Jabrov 16 hours ago [-]

You must be on the wrong side of an A/B test or very unlucky.

Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.

gjm11 16 hours ago [-]

FWIW I just gave a similar question to Claude Sonnet 4 (I asked about something other than pianos, just in case they're doing some sort of constant fine-tuning on user interactions[1] and to make it less likely that the exact same question is somewhere in its training data[2]) and it gave a very reasonable-looking answer. I haven't tried to double-check any of its specific numbers, some of which don't match my immediate prejudices, but it did the right sort of thing and considered more ways for things to end up on the ocean floor than I instantly thought of. No hints needed or given.

[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.

[2] I picked something a bit more obscure than pianos.

dialup_sounds 16 hours ago [-]

How much of that is inability to reason vs. being trained to avoid making things up?

YeGoblynQueenne 21 minutes ago [-]

To summarise: we spent billions to make intelligent machines and when they're asked to solve toy problems all we get is excuses.

ummonk 18 hours ago [-]

Most of the objections and their counterarguments seem like either poor objections (e.g. ad hominem against the first listed author) or seem to be subsumed under point 5. It’s annoying that most of this post focuses so much effort on discussing most of the other objections when the important discussion is the one to be had in point 5:

I.e. to what extent are LLMs able to reliably make use of writing code or using logic systems, and to what extent does hallucinating / providing faulty answers in the absence of such tool access demonstrate an inability to truly reason (I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

thomasahle 18 hours ago [-]

> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

That's what the models did. They gave the first 100 steps, then explained how it was too much to output all of it, and gave the steps one would follow to complete it.

They were graded as "wrong answer" for this.

---

Source: https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...

> If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"

> At least for Sonnet it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem and the algorithm to solve it and then output its solution without even thinking about individual steps.

sponnath 13 hours ago [-]

Didn't they start failing well before they hit token limits? I'm not sure what the point the source you linked to is trying to make.

thomasahle 9 hours ago [-]

OP said:

> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

And that's what the models did.

This is a good answer from the model. Has nothing to do with token limits.

emp17344 16 hours ago [-]

Why should we trust a guy with the following twitter bio to accurately replicate a scientific finding?

>lead them to paradise

>intelligence is inherently about scaling

>be kind to us AGI

Who even is this guy? He seems like just another r/singularity-style tech bro.

andy12_ 2 hours ago [-]

Not to be that guy but... clearly Ad Hominem.

FINDarkside 17 hours ago [-]

I don't think most of the objections are poor at all apart from 3, it's this article that seems to make lots of strawmans. Especially the first objection is often heard because people claim "this paper proves LLMs don't reason". The author moves goalposts and is arguing against about whether LLMs lead to AGI, which is already a strawman for those arguments. And in addition, he even seems to misunderstand AGI, thinking it's some sort of super intelligence ("We have every right to expect machines to do things we can’t"). AI that can do everything at least as good as average human is AGI by definition.

It's especially weird argument considering that LLMs are already ahead of humans in Tower of Hanoi. I bet average person will not be able to "one-shot" you the moves to 8 disk tower of Hanoi without writing anything down or tracking the state with the actual disks. LLMs have far bigger obstacles to reaching AGI though.

5 is also a massive strawman with the "not see how well it could use preexisting code retrieved from the web" as well, given that these models will write code to solve these kind of problems even if you come up with some new problem that wouldn't exist in its training data.

Most of these are just valid the issues in the paper. They're not supposed to be some kind of arguments that try to make everything the paper said invalid. The paper didn't really even make any bold claims, it only concluded LLMs have limitations in its reasoning. It had a catchy title and many people didn't read past that.

chongli 15 hours ago [-]

It's especially weird argument considering that LLMs are already ahead of humans in Tower of Hanoi

No one cares about Towers of Hanoi. Nor do they care about any other logic puzzles like this. People want AIs that solve novel problems for their businesses. The kind of problems regular business employees solve every single day yet LLMs make a mess of.

The purpose of the Apple paper is not to reveal the fact that LLMs routinely fail to solve these problems. Everyone who uses them already knows this. The paper is an argument for why this happens (lack of reasoning skills).

No number of demonstrations of LLMs solving well-known logic puzzles (or other problems humans have already solved) will prove reasoning. It's not interesting at all to solve a problem that humans have already solved (with working software to solve every instance of the problem).

ummonk 16 hours ago [-]

I'm more saying that points 1 and 2 get subsumed under point 5 - to the extent that existing algorithms / logical systems for solving such problems are written by humans, an AGI wouldn't need to match the performance of those algorithms / logical systems - it would merely need to be able to create / use such algorithms and systems itself.

You make a good point though that the question of whether LLMs reason or not should not be conflated with the question of whether they're on the pathway to AGI or not.

FINDarkside 16 hours ago [-]

Right, I agree there. Also that's something LLMs can already do. If you give the problem to ChatGPT o3 model, it will actually write python code, run it and give you the solution. But I think points 1 and 2 are still very valid things to talk about, because while Tower of Hanoi can be solved by writing code that doesn't apply to every problem that would require extensive reasoning.

Dzugaru 5 hours ago [-]

> just as humans shouldn’t serve as calculators

But they definitely could and were [0]. You just employ multiple, and cross check - with the ability of every single one to also double check and correct errors.

LLMs cannot double check, and multiples won't really help (I suspect ultimately for the same reason - exponential multiplication of errors [1])

[0] https://en.wikipedia.org/wiki/Computer_(occupation)

[1] https://www.tobyord.com/writing/half-life

neoden 15 hours ago [-]

> Puzzles a child can do

Certainly, I couldn't solve Hanoi's towers with 8 disks purely in my mind without being able to write down the state of every step or having a physical state in front of me. Are we comparing apples to apples?

thomasahle 17 hours ago [-]

> 5. A student might complain about a math exam requiring integration or differentiation by hand, even though math software can produce the correct answer instantly. The teacher’s goal in assigning the problem, though, isn’t finding the answer to that question (presumably the teacher already know the answer), but to assess the student’s conceptual understanding. Do LLM’s conceptually understand Hanoi? That’s what the Apple team was getting at. (Can LLMs download the right code? Sure. But downloading code without conceptual understanding is of less help in the case of new problems, dynamically changing environments, and so on.)

Why is he talking about "downloading" code? The LLMs can easily "write" out out the code themselves.

If the student wrote a software program for general differentiation during the exam, they obviously would have a great conceptual understanding.

autobodie 17 hours ago [-]

If the student could reference notes a fraction of the size of the LLM then I would not be convinced.

Workaccount2 14 hours ago [-]

LLMs are (suspected) a few TB in size.

Gemma 2 27B, one of the top ranked open source models, is ~60GB in size. LLama 405B is about 1TB.

Mind you that they train on likely exabytes of data. That alone should be a strong indication that there is a lot more than memory going on here.

sigotirandolas 3 hours ago [-]

I'm not convinced by this argument. You can fit a bunch of books covering up to MSc level maths on less than 100MB. After that point, more books will mostly be redundant information so it doesn't need much more space for maths beyond that.

Similarly TBs of Twitter/Reddit/HN add near zero new information per comment.

If anything you can fit an enormous amount of information in 1MB - we just don't need to do it because storage is cheap.

Workaccount2 1 hours ago [-]

People aren't claiming that they are holding textbooks in their model, that would just be even more evidence of reasoning (the LLM would have to reason what textbook to reference, and then extrapolate from the textbook(s) how to solve the problem at hand - pretty much what students in school do; study the textbook and reason from it to answer new test questions)

People are claiming that the models sit on a vast archive of every answer to every question. i.e. when you ask it 92384 x 333243 = ?, the model is just pulling from where it has seen that before. Anything else would necessitate some level of reasoning.

Also in my own experience, people are stunned when they learn that the models are not exabytes in size.

exe34 17 hours ago [-]

I suspect human memory consists of a lot more bits than LLMs encode.

autobodie 17 hours ago [-]

I rest my case — the question concerns a quality, not a quantity. These juvenile comparisons are mere excuses.

exe34 6 hours ago [-]

Oh we've shifted the goal post to quality now, very good! That does rest the case.

hrldcpr 19 hours ago [-]

In case anyone else missed the original paper (and discussion):

https://news.ycombinator.com/item?id=44203562

dang 18 hours ago [-]

Thanks! Macroexpanded:

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] - https://news.ycombinator.com/item?id=44203562 - June 2025 (269 comments)

Also this: A Knockout Blow for LLMs? - https://news.ycombinator.com/item?id=44215131 - June 2025 (48 comments)

Were there others?

bluefirebrand 19 hours ago [-]

I'm glad to read articles like this one, because I think it is important that we pour some water on the hype cycle

If we want to get serious about using these new AI tools then we need to come out of the clouds and get real about their capabilities

Are they impressive? Sure. Useful? Yes probably in a lot of cases

But we cannot continue the hype this way, it doesn't serve anyone except the people who are financially invested in these tools.

senko 19 hours ago [-]

Gary Marcus isn't about "getting real", it's making a name for himself as a contrarian to the popular AI narrative.

This article may seem reasonable, but here he's defending a paper that in his previous article he called "A knockout blow for LLMs".

Many of his articles seem reasonable (if a bit off) until you read a couple dozen a spot a trend.

steamrolled 18 hours ago [-]

> Gary Marcus isn't about "getting real", it's making a name for himself as a contrarian to the popular AI narrative.

That's an odd standard. Not wanting to be wrong is a universal human instinct. By that logic, every person who ever took any position on LLMs is automatically untrustworthy. After all, they made a name for themselves by being pro- or con-. Or maybe a centrist - that's a position too.

Either he makes good points or he doesn't. Unless he has a track record of distorting facts, his ideological leanings should be irrelevant.

senko 17 hours ago [-]

He makes many very good points:

For example he continusly calls out AGI hype for what it is, and also showcases dangers of naive use of LLMs (eg. lawyers copy-pasting hallucinated cases into their documents, etc). For this, he has plenty of material!

He also makes some very bad points and worse inferences: that LLMs as a technology are useless because they can't lead to AGI, that hallucation makes LLMs useless (but then he contradicts himself in another article conceding they "may have some use"), that because they can't follow an algorithm they're useless, etc, that scaling laws are over therefore LLMs won't advance (he's been making that for a couple of years), that AI bubble will collapse in a few months (also a few years of that), etc.

Read any of his article (I've read too many, sadly) and you'll never come to the conclusion that LLMs might be a useful technology, or be "a good thing" even in some limited way. This just doesn't fit with reality I can observe with my own eyes.

To me, this shows he's incredibly biased. That's okay if he wants to be a pundit - I couldn't blame Gruber for being biased about Apple! But Marcus presents himself as the authority on AI, a scientist, showing a real and unbiased view on the field. In fact, he's as full of hype as Sam Altman is, just in another direction.

Imagine he was talking about aviation, not AI. 787 dreamliner crashes? "I've been saying for 10 years that airplanes are unsafe, they can fall from the sky!" Boeing the company does stupid shit? "Blown door shows why airplane makers can't be trusted" Airline goes bankrupt? "Air travel winter is here"

I've spoken to too many intelligent people who read Marcus, take him at his words and have incredibly warped views on the actual potential and dangers of AI (and send me links to his latest piece with "so this sounds pretty damning, what's your take?"). He does real damage.

Compare him with Simon Willison, who also writes about AI a lot, and is vocal about its shortcomings and dangers. Reading Simon, I never get the feeling I'm being sold on a story (either positive or negative), but that I learned something.

Perhaps a Marcus is inevitable as a symptom of the Internet's immune system to the huge amount of AI hype and bullshit being thrown around. Perhaps Gary is just fed up with everything and comes out guns blazing, science be damned. I don't know.

But in my mind, he's as much BSer as the AGI singularity hypers.

ImageDeeply 16 hours ago [-]

> Compare him with Simon Willison, who also writes about AI a lot, and is vocal about its shortcomings and dangers. Reading Simon, I never get the feeling I'm being sold on a story (either positive or negative), but that I learned something.

Very true!

sinenomine 18 hours ago [-]

Marcus' points routinely fail to pass scrutiny, nobody in the field takes him seriously. If you seek real scientifically interesting LLM criticism, read François Chollet and his Arc AGI series of evals.

adamgordonbell 19 hours ago [-]

This!

For all his complaints about llms, his writing could be generated by an llm with a prompt saying: 'write an article responding to this news with an essay saying that you are once again right that this AI stuff is overblown and will never amount to anything.'

woopsn 16 hours ago [-]

Given that the links work, the quotes were actually said, numbers are correct, cited research actually exists etc we can immediately rule that out.

2muchcoffeeman 18 hours ago [-]

What’s the argument here that he’s not considering all the information regarding GenAI?

That there’s a trend to his opinion?

If I consider all the evidence regarding gravity, all my papers will be “gravity is real”.

In what ways is he only choosing what he wants to hear?

senko 17 hours ago [-]

Replied elsewhere in the thread: https://news.ycombinator.com/item?id=44279283

To your example about gravity, I argue that he goes from "gravity is real" to "therefore we can't fly", and "yeah maybe some people can but that's not really solving gravity and they need to go down eventually!"

2muchcoffeeman 6 hours ago [-]

If your argument about my gravity example holds. That’s not really a good argument. Between Newtons death and the first powered flight was almost 200 years. Being all negative about gravity would be reasonable since a bunch of stuff had to happen.

I’m not sure I buy your longer argument either.

I have a feeling the nay sayers are right on this. The next leap in AI isn’t something we’re going to recognise. (Obviously it’s possible - humans exist)

ramchip 15 hours ago [-]

I was very put off by his article "A knockout blow for LLMs?", especially all the fuss he was making about using his own name as a verb to mean debunking AI hype...

ninjin 14 hours ago [-]

Marcus comes with a very standard cognitive science criticism of statistical approaches to artificial intelligence, many parts of which dates back to the late 50s from when the field was born and moved to distance itself from behaviourism. The worst part to me is not that his criticism is entirely wrong, but rather that it is obvious and yet peddled as something that those of us that develop statistical approaches are completely ignorant of. To make matters worse, instead of developing alternative approaches (like plenty of my colleagues in cognitive science do!), he simply reiterates pretty much the same points over and over and has done so at least for the last twenty or so years. He and others paint themselves as sceptics and bulwarks against the current hype (which I can assure you, I hate at least as much as they do). But, to me, they are cynics, not sceptics.

I try to maintain a positive and open mind of other researchers, but Marcus lost me pretty much at "first contact" when a student in the group who leaned towards cognitive science had us read "Deep Learning: A Critical Appraisal" by Marcus (2018) [1] back around when it was published. Finally I could get into the mind of this guy so many people were talking about! 27 pages and yet I learned next to nothing new as the criticism was just the same one we have heard for decades: "Statistical learning has limits! It may not lead to 'truly" intelligent machines!". Not only that, the whole piece consistently conflates deep learning and statistical learning for no reason at all, reads as if it was rushed (and not proofed), emphasises the author's research strongly rather than giving a broad overview, etc. In short, it is bad, very bad as a scientific piece. At times, I read short excerpts of an article Marcus has written and yet sadly it is pretty much the same thing all over again.

[1]: https://arxiv.org/abs/1801.00631

There is a horrible market to "sell" hype when it comes to artificial intelligence, but there is also a horrible market to "sell" anti-hype. Sadly, both brings traffic, attention, talk invitations, etc. Two largely unscientific tribes, that I personally would rather do without, with their own profiting gurus.

newswasboring 18 hours ago [-]

What exactly is your objection here? That the guy has an opinion and is writing about it?

senko 17 hours ago [-]

Replied elsewere in the thread: https://news.ycombinator.com/item?id=44279283

bobxmax 18 hours ago [-]

[flagged]

g-b-r 17 hours ago [-]

I see the opposite, the wide majority of people commenting on Hacker News seem now very favorable to LLMs.

ramchip 15 hours ago [-]

I think a common opinion is that it's useful as a research and code generation tool, and that it has some really negative effects on the Internet and society in general. Since the discussion on HN is often focused on coding the first aspect is just a bit more visible.

bobxmax 17 hours ago [-]

Not at all.

fhd2 19 hours ago [-]

Even of the people invested in these tools, hype only benefits those attempting a pump and dump scheme, or those selling training, consulting or similar services around AI.

People who try to make genuine progress, while there's more money in it now, might just have to deal with another AI winter soon at this rate.

bluefirebrand 19 hours ago [-]

> hype only benefits those attempting a pump and dump scheme

I read some posts the other day saying Sam Altman sold off a ton of his OpenAI shares. Not sure if it's true and I can't find a good source, but if it is true then "pump and dump" does look close to the mark

aeronaut80 19 hours ago [-]

You probably can’t find a good source because sources say he has a negligible stake in OpenAI. https://www.cnbc.com/amp/2024/12/10/billionaire-sam-altman-d...

bluefirebrand 19 hours ago [-]

Interesting

When I did a cursory search, this information didn't turn up either

Thanks for correcting me. I suppose the stuff I saw the other day was just BS then

aeronaut80 17 hours ago [-]

To be fair I struggle to believe he’s doing it out of the goodness of his heart.

spookie 18 hours ago [-]

Think the same thing, we need more breakthroughs. Until then, it is still risky to rely on AI for most applications.

The sad thing is that most would take this comment the wrong way. Assuming it is just another doomer take. No, there is still a lot to do, and promissing the world too soon will only lead to disappointment.

Zigurd 17 hours ago [-]

This is the thing of it: "for most applications."

LLMs are not thinking. They way they fail, which is confidently and articulately, is one way they reveal there is no mind behind the bland but well-structured text.

But if I was tasked with finding 500 patents with weak claims or claims that have been litigated and knocked down, I would turn into LLMs to help automate that. One or two "nines" of reliability is fine, and LLMs would turn this previously impossible task into something plausible to take on.

mountainriver 19 hours ago [-]

I’ll take critiques from someone who knows what a test train split is.

The idea that a guy so removed from machine learning has something relevant to say about its capabilities really speaks to the state of AI fear

Spooky23 18 hours ago [-]

The idea that practitioners would try to discredit research to protect the golden goose from critique speaks to human nature.

mountainriver 17 hours ago [-]

No one is discrediting research from valid places, this is the victim alt-right style narrative that seems to follow Gary Marcus around. Somehow the mainstream is "suppressing" the real knowledge

devwastaken 19 hours ago [-]

experts are often blinded by their paychecks to see how nonsense their expertise is

mountainriver 17 hours ago [-]

Not knowing the most basic things about the subject you are critiquing is utter nonsense. Defending someone who does this is even worse

bluefirebrand 11 hours ago [-]

I think it's pretty fair to be critical of what LLMs are producing and how they fit into the tools without necessarily understanding how they work

If you bought a chainsaw that broke when you tried to cut down a tree, then you can criticize the chainsaw without knowing how the motor on it works, right?

soulofmischief 19 hours ago [-]

[citation needed]

Spooky23 18 hours ago [-]

Remember Web 3.0? Lol

Zigurd 17 hours ago [-]

It's unfortunate that a discussion about LLM weaknesses is giving crypto bro. But telling. There are a lot of bubble valuations out there.

bandrami 18 hours ago [-]

How actually useful are they though? We've had more than a year now of saying these things 10X knowledge workers and creatives, so.... where is the output? Is there a new office suite I can try? 10 times as many mobile apps? A huge new library of ebooks? Is this actually in practice producing things beyond Ghibli memes and RETVRN nostalgia slop?

2muchcoffeeman 18 hours ago [-]

I think it largely depends on what you’re writing. I’ve had it reply to corporate emails which is good since I need to sound professional not human.

If I’m coding it still needs a lot of baby sitting and sometimes I’m much faster than it.

Gigachad 18 hours ago [-]

And then the person on the end is using AI to summarise the email back to normal English. To what end?

js8 17 hours ago [-]

But look the GDP has increased!

bandrami 17 hours ago [-]

But that's what I don't get: it hasn't in that scenario because that doesn't lead to a greater circulation of money at any point. And that's the big thing I'm looking for: something AI has created that consumers are willing to pay for. Because if that doesn't end up happening no amount of sunk investment is going to save the ecosystem.

bandrami 18 hours ago [-]

So this would be an interesting output to measure but I have no idea how we would do that: has the volume of corporate email gone up? Or the time spent creating it gone down?

bigyabai 19 hours ago [-]

There's something innately funny about "HN's undying optimism" and "bad-news paper from Apple" reaching a head like this. An unstoppable object is careening towards an impervious wall, anything could happen.

DiogenesKynikos 19 hours ago [-]

I don't understand what people mean when they say that AI is being hyped.

AI is at the point where you can have a conversation with it about almost anything, and it will answer more intelligently than 90% of people. That's incredibly impressive, and normal people don't need to be sold on it. They're just naturally impressed by it.

woopsn 17 hours ago [-]

If the claims about AI were that it is a great or even incredible chat app, there would be no mismatch.

I think normal people understand curing all disease, replacing all value, generating 100x stock market returns, uploading our minds etc to be hype.

I said a few days ago, LLM is amazing product. Sad that these people ruin their credibility immediately upon success.

FranzFerdiNaN 18 hours ago [-]

I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic). It needs to be right all the time or at least tell you when it doesn’t know for sure, instead of just making up something. Comparing it to going out in the streets and asking random people random questions is not a good comparison.

amohn9 17 hours ago [-]

It might not fit your work, but there are tons of areas where “good enough” can still provide a lot of value. I’m sure you’d be thrilled with a tool that could correctly tell you if Apple’s stock was going up or down tomorrow 70% of the time.

chongli 4 hours ago [-]

I work in a mail room sending hard copy letters to customers. If I got my job right only 70% of the time then I’d be causing massive privacy breaches daily by sending the wrong personal information to the wrong customers.

Would you trust an AI that gets your banking transactions right only 70% of the time?

newswasboring 18 hours ago [-]

> I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic).

Where are you getting this from? 70%?

18 hours ago [-]

travisgriggs 18 hours ago [-]

I get even better results talking to myself.

georgemcbay 18 hours ago [-]

AI, in the form of LLMs, can be a useful tool.

It is still being vastly overhyped, though, by people attempting to sell the idea that we are actually close to an AGI "singularity".

Such overhype is usually easy to handwave away as like not my problem. Like, if investors get fooled into thinking this is anything like AGI, well, a fool and his money and all that. But investors aside this AI hype is likely to have some very bad real world consequences based on the same hype-men selling people on the idea that we need to generate 2-4 times more power than we currently do to power this godlike AI they are claiming is imminent.

And even right now there's massive real world impact in the form of say, how much grok is polluting Georgia.

hellohello2 18 hours ago [-]

Its quite simple, people upvote content that makes them feel good. Most of us here are programmers and the idea that many of ours skills are becoming replaceable feels quite bad. Hence, people upvote delusional statements that let them believe in something that feels better than objective reality. With any luck, these comments will be scraped and used to train the next AI generation, relieving it from the burden of factuality at last.

landl0rd 17 hours ago [-]

[flagged]

hellojimbo 9 hours ago [-]

The only real point is number 5.

> Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations

This is basically agents which is literally what everyone has been talking about for the past year lol.

> (Importantly, the point of the Apple paper goal was to see how LRM’s unaided explore a space of solutions via reasoning and backtracking, not see how well it could use preexisting code retrieved from the web.

This is a false dichotomy. The thing that apple tested was dumb and dl'ing code from the internet is also dumb. What would've been interesting is, given the problem, would a reasoning agent know how to solve the problem with access to a coding env.

> Do LLM’s conceptually understand Hanoi?

Yes and the paper didn't test for this. The paper basically tested the equivalent of, can a human do hanoi in their head.

I feel like what the author is advocating for is basically a neural net that can send instructions to an ALU/CPU, but I haven't seen anything promising that shows that its better than just giving an agent access to a terminal

starchild3001 16 hours ago [-]

We built planes—critics said they weren't birds. We built submarines—critics said they weren't fish. Progress moves forward regardless.

You have a choice: master these transformative tools and harness their potential, or risk being left behind by those who do.

Pro tip: Endless negativity from the same voices won't help you adapt to what's coming—learning will.

clbrmbr 2 hours ago [-]

Indeed. Anyone who has built things with Claude Code (Opus 4) and/or something more than one-shot with o3 should be feeling the AGI at this point. Certainly there’s still many limitations, but progress is undoubtedly moving forward.

doctor_blood 3 hours ago [-]

Is there a name for this authorial voice and cadence? I see midwits posting exactly like this on twitter and linkedin; it's insufferable.

sponnath 13 hours ago [-]

Toxic positivity is also not good.

skywhopper 19 hours ago [-]

The quote from the Salesforce paper is important: “agents displayed near-zero confidentiality awareness”.

Illniyar 14 hours ago [-]

I find it weird that people are taking the original paper to be some kind of indictment against llms. It's not like LLMs failing at doing Hanoi tower problem at higher levels is new, the paper took an existing method that was done before.

It was simply comparing the effectiveness of reasoning and non reasoning models on the same problem.

hiddencost 19 hours ago [-]

Why do we keep posting stuff from Gary? He's been wrong for decades but he keeps writing this stuff.

As far as I can tell he's the person that people reach for when they want to justify their beliefs. But surely being this wrong for this wrong should eventually lead to losing ones status as an expert.

jakewins 19 hours ago [-]

I thought this article seemed like well articulated criticism of the hype cycle - can you be more specific what you mean? Are the results in the Apple paper incorrect?

astrange 19 hours ago [-]

Gary Marcus always, always says AI doesn't actually work - it's his whole thing. If he's posted a correct argument it's a coincidence. I remember seeing him claim real long-time AI researchers like David Chapman (who's a critic himself) were wrong anytime they say anything positive.

(em-dash avoided to look less AI)

Of course, the main issue with the field is the critics /should/ be correct. Like, LLMs shouldn't work and nobody knows why they work. But they do anyway.

So you end up with critics complaining it's "just a parrot" and then patting themselves on the back, as if inventing a parrot isn't supposed to be impressive somehow.

foldr 19 hours ago [-]

I don’t read GM as saying that LLMs “don’t work” in a practical sense. He acknowledges that they have useful applications. Indeed, if they didn’t work at all, why would he be advocating for regulating their use? He just doesn’t think they’re close to AGI.

kadushka 19 hours ago [-]

The funny thing is, if you asked “what is AGI” 5 years ago, most people would describe something like o3.

foldr 19 hours ago [-]

Even Sam Altman thinks we’re not at AGI yet (although of course it’s coming “soon”).

kadushka 16 hours ago [-]

Markus has been consistently wrong over the many years predicting the (lack of) progress of the current deep learning methods. Altman has been correct so far.

foldr 8 hours ago [-]

Marcus has made some good predictions and some bad ones. That’s usually the way with people who make specific predictions — there are no prophets.

Not sure I’d agree that SA has been any more consistently right. You can easily find examples of overconfidence from him (though he rarely says anything specific enough to count as a prediction).

9 hours ago [-]

16 hours ago [-]

barrkel 19 hours ago [-]

You need to read everything that Gary writes with the particular axe to grind he has in mind: neurosymbolic AI. That's his specialism, and he essentially has a chip in his shoulder about the attention probabilistic approaches like LLMs are getting, and their relative success.

You can see this in this article too.

The real question you should be asking is if there is a practical limitation in LLMs and LRMs revealed by the Hanoi Towers problem or not, given that any SOTA model can write code to solve the problem and thereby solve it with tool use. Gary frames this as neurosymbolic, but I think it's a bit of a fudge.

krackers 18 hours ago [-]

Hasn't the symbolic vs statistical split in AI existed for a long time? With things like Cyc growing out of the former. I'm not too familiar with linguistics but maybe this extends there too, since I think Chomsky was heavy on formal grammars over probabilistic models [1].

Must be some sort of cognitive sunk cost fallacy, after dedicating your life to one sect, it must be emotionally hard to see the other "keep winning". Of course you'd root for them to fall.

[1] https://norvig.com/chomsky.html

charcircuit 17 hours ago [-]

>with tool use

A LLM with tool use can solve anything. It is interesting to try and measure its capabilities without tools.

barrkel 6 hours ago [-]

I don't think the first is true at all, unless you imagine some powerful oracle tools.

I think the second is interesting for comparing models, but not interesting for determining the limits of what models can automate in practice.

It's the prospect of automating labour which makes AI exciting and revolutionary, not their ability when arbitrarily restricted.

NoahZuniga 19 hours ago [-]

None of the arguments presented in this piece depend on his authority as an expert, so this is largely irrelevant.

mountainriver 19 hours ago [-]

It’s insane, he doesn’t know what a test train split is but he’s an AI expert? Is this where we are?

marvinborner 19 hours ago [-]

Is this supposed to be a joke reflecting point (3)?

19 hours ago [-]

brcmthrowaway 17 hours ago [-]

In classic ML, you never evaluste against data that was in the training set. In LLMs, everything is the training set. Doesn't this seem wrong?

16 hours ago [-]

avsteele 19 hours ago [-]

This doesn't rebut anything from the best critique of the Apple paper.

https://arxiv.org/abs/2506.09250

Jabbles 19 hours ago [-]

Those are points (2) and (5).

foldr 19 hours ago [-]

It does rebut point (1) of the abstract. Perhaps not convincingly, in your view, but it does directly addresses this kind of response.

avsteele 19 hours ago [-]

Papers make specific conclusions based on specific data. The paper I linked specifically rebuts the conclusions of the paper. Gary makes vague statements that could be interpreted as being related.

It is scientific malpractice to write a post supposedly rebutting responses to a paper and not directly address the most salient one.

foldr 19 hours ago [-]

This sort of omission would not be considered scientific malpractice even in a journal article, let alone a blog post. A rebuttal of a position that fails to address the strongest arguments for it is a bad rebuttal, but it’s not scientific malpractice to write a bad paper — let alone a bad blog post.

I don’t think I agree with you that GM isn’t addressing the points in the paper you link. But in any case, you’re not doing your argument any favors by throwing in wild accusations of malpractice.

avsteele 18 hours ago [-]

Malpractice slightly hyperbolic.

But anybody relying on Gary's posts in order to be be informed on this subject is being being mislead. This isn't an isolated incident either.

People need to be made be aware when you read him it is mere punditry, not substantive engagement with the literature.

spookie 19 hours ago [-]

A paper citing arxiv papers and x.com doesn't pass my smell test tbh

revskill 12 hours ago [-]

I'm shorting Apple.

akomtu 12 hours ago [-]

It's easy to check if a blackbox AI can reason: give it a checkerboard pattern, or something more complex, and see if it can come up with a compact formula that generates this pattern. You can't bullshit your way thru this problem, and it's easy to verify the answer, yet none of these so-called researchers attempt to do this.

baxtr 16 hours ago [-]

The last paragraph:

>Talk about convergence evidence. Taking the SalesForce report together with the Apple paper, it’s clear the current tech is not to be trusted.

mentalgear 19 hours ago [-]

AI hype-bros like to complain that real AI experts are too much concerned about debunking current AI then improving it - but the truth is that debunking bad AI IS improving AI. Science is a process of trial and error which only works by continuously questioning the current state.

dang 18 hours ago [-]

Can you please make your substantive points without name-calling or swipes? This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

sieabahlpark 16 hours ago [-]

[dead]

neepi 19 hours ago [-]

Indeed. I completely agree with this.

My objection to the whole thing is the AI hype bros, which is really the funding solicitation facade over everything rather the truth, only has one outcome and that is that it cannot be sustained. At that point all investor confidence disappears, the money is gone and everyone loses access to the tools that they suddenly built all their dependencies on because it's all proprietary service model based.

Which is why I am not poking it with a 10 foot long shitty stick any time in the near future. The failure mode scares me, not the technology which arguably does have some use in non-idiot hands.

wongarsu 18 hours ago [-]

A lot of the best internet services came around in the decade after the dot-com crash. There is a chance Anthropic or OpenAI may not survive when funding suddenly dries up, but existing open weight models won't be majorly impacted. There will always be someone willing to host DeepSeek for you if you're willing to pay.

And while it will be sad to see model improvements slow down when the bubble bursts there is a lot of untapped potential in the models we already have. Especially as they become cheaper and easier to run

neepi 17 hours ago [-]

Someone might host DeepSeek for you but you'll pay through the nose for it and it'll be frozen in time because the training cost doesn't have the revenue to keep the ball rolling.

I'm not sure the GPU market won't collapse with it either. Possibly taking out a chunk of TSMC in the process, which will then have knock on effects across the whole industry.

wongarsu 17 hours ago [-]

There are already inference providers like DeepInfra or inference.net whose entire business model is hosted inference of open-source models. They promise not to keep or use any of the data and their business model has no scaling effects, so I assume they are already charging a fair market rate where the price covers the costs and returns a profit.

The GPU market will probably take a hit. But the flip side of that is that the market will be flooded with second-hand enterprise-grade GPUs. And if Nvidia needs sales from consumer GPUs again we might see more attractive prices and configurations there too. In the short term a market shock might be great for hobby-scale inference, and maybe even training (at the 7B scale). In the long term it will hurt, but if all else fails we still have AMD who are somehow barely invested in this AI boom

xoac 18 hours ago [-]

Yeah this is history repeating. See for example less known “Dreyfuss affair” at MIT and the brilliantly titled books: “What Computers Can’t Do” and its sequel “What Computers Still Can’t Do”.

3abiton 18 hours ago [-]

To hammer one point though, you have to understand that researcher are desensitized to minor novel improvement that translate to great value products. While obviously studying and assessing the limitations of AI is crucial, to the general public its capabilities are just so amazing, they can't fathom why we should think about limitations. Optimizing what we have is bette than rethinking the whole process.

bobxmax 18 hours ago [-]

> AI hype-bros like to complain that real AI experts are too much concerned about debunking current AI then improving it

You're acting like this is a common ocurrence lol

bowsamic 19 hours ago [-]

This doesn’t address the primary issue: that they had no methodology for choosing puzzles that weren’t in the training set and indeed while they claimed to have chosen puzzles that aren’t they didn’t explain why they think that. The whole point of the paper was to test LLM reasoning in untrained cases but there’s no reason to expect such puzzles to not part of the training set, and if you don’t have any way of telling if it is not or then your paper is not going to work out

roywiggins 18 hours ago [-]

Isn't it worse for LLMs if an LLM that has been trained on the Towers of Hanoi still can't solve it reliably?

bowsamic 11 hours ago [-]

Yes

anonthrowawy 18 hours ago [-]

how could you prove that?

bowsamic 11 hours ago [-]

You couldn’t, so such a paper cannot be scientific

(Or it should not be based on that claim as a central point, which apples paper was)

16 hours ago [-]

Loading comments...

eviks 10 minutes ago [-]

> We have every right to expect machines to do things we can’t.

In this specific case:

> AGI should be a step forward

Nope, read the definition. Matching human level intelligence, warts and all, will by definition reach AGI.

> in many cases LLMs are a step backwards

That's ok, use them in cases where it's a step forward, what's the big deal?

> note the bait and switch from “we’re going to build AGI that can revolutionize the world” to “give us some credit, our systems make errors and humans do, too”.

Ah, well, again, not really, the author just has unrealistic model of the minimum requirements for a revolution.

thomasahle 17 hours ago [-]

YeGoblynQueenne 19 minutes ago [-]

>> I don't get this argument.

The argument is that LLMs are computer systems and a computer system that's as bad as a human is less useful than a human.

briandw 7 hours ago [-]

xienze 6 hours ago [-]

eviks 9 minutes ago [-]

> can be solved without "tools" by humans, simply by understanding the problem,

This already excludes a lot of humans

DiogenesKynikos 5 hours ago [-]

dfawcus 1 hours ago [-]

Maybe they can, but what the human is able to do is examine the tower of Hanoi problem, and the derive the general rule for solving (odd or even number of disks).

Whereas for the moment, LLMS are just pattern matching; whereas we do the pattern match, then derive the generalised rule.

saberience 55 minutes ago [-]

LLMs do that too though lol.

The Tower of Hanoi problem is terrible example for somehow suggesting humans are superior.

xienze 5 hours ago [-]

> LLMs are perfectly capable of writing code to solve problems that are not in their training set.

saberience 49 minutes ago [-]

That’s exactly what humans do though lol.

We reason about things based on our training data. We have a hard time or impossible time reasoning about things we haven’t trained on.

Ie: a human with no experience of board games cannot reason about chess moves. A human with no math knowledge cannot reason about math problems.

How would expect an LLM to reason about something with no training data?

YeGoblynQueenne 34 minutes ago [-]

>> Ie: a human with no experience of board games cannot reason about chess moves. A human with no math knowledge cannot reason about math problems.

Then how did the first humans solve math and chess problems, if there were none around solved to give them examples of how to solve them in the first place?

Workaccount2 56 minutes ago [-]

procgen 49 minutes ago [-]

> that they're simply compositions of things already in the training set

Yes, knowledge is compositional. This is just as true for humans as it is for machines.

DiogenesKynikos 1 hours ago [-]

What you're describing is successful generalization from the training dataset, also called "understanding" by laypeople.

FINDarkside 17 hours ago [-]

Agreed. But also his point about AGI is incorrect. AI that will perform on the level of average human in every task is AGI by definition.

pzo 9 hours ago [-]

Someone 9 hours ago [-]

dvfjsdhgfv 4 hours ago [-]

The Hanoi Towers example demonstrates that SOTA RLMs struggle with tasks a pre-schooler solves.

[0] Usual caveats apply: with time, the population of people actually good at these low-level tasks will diminish, just as we have very few Assembler programmers for Intel/AMD processors.

16 hours ago [-]

simonw 17 hours ago [-]

FINDarkside 16 hours ago [-]

simonw 14 hours ago [-]

OpenAI: https://openai.com/index/how-should-ai-systems-behave/#citat...

"By AGI, we mean highly autonomous systems that outperform humans at most economically valuable work."

AWS: https://aws.amazon.com/what-is/artificial-general-intelligen...

DeepMind: https://arxiv.org/abs/2311.02462

bluefirebrand 13 hours ago [-]

Doesn't the "G" in AGI stand for "General" as in "Generally Good at everything"?

neom 12 hours ago [-]

General-Purpose (Wide Scope): It can do many types of things.

Generally as Capable as a Human (Performance Level): It can do what we do.

Possessing General Intelligence (Cognitive Mechanism): It thinks and learns the way a general intelligence does.

adastra22 9 hours ago [-]

Yes, but “good at” here has a very limited, technical meaning, which can be oversimplified as “better than random chance.”

If something can be better than random chance in any arbitrary problem domain it was not trained on, that is AGI.

math_dandy 13 hours ago [-]

I was hoping the accepted definition would not use humans as a baseline, rather that humans would be an (the) example of AGI.

thomasahle 8 hours ago [-]

The argument of (1) doesn't really have anything to do with humans or antromorphising. We're not even discussing AGI, we're just talking about the property of "thinking".

If somebody claims "computers can't do X, hence they can't think". A valid counter argument is "humans can't do X either, but they can think."

bastawhiz 13 hours ago [-]

The A in AGI is "artificial" which sort of precludes humans from being AGI (unless you have a very unconventional belief about the origin of humans).

Since there's not really a whole lot of unique examples of general intelligence out there, humans become a pretty straightforward way to compare.

xeonmc 12 hours ago [-]

> unless you have a very unconventional belief about the origin of humans

No so unconventional in many cultures.

bastawhiz 11 hours ago [-]

In this case, I was thinking of unusual beliefs like aliens creating humans or humans appearing abruptly from an external source such as through panspermia.

usef- 10 hours ago [-]

Yes. I wonder if he was thinking of ASI, not AGI

gylterud 8 hours ago [-]

ASI meaning Artificial Super Intelligence, I guess.

adastra22 9 hours ago [-]

aaron695 10 hours ago [-]

[dead]

mathgradthrow 14 hours ago [-]

the average human is good at something, and sucks at almost everything. Human performance at chess and average performance at chess differ by 7 orders of magnitude.

datadrivenangel 13 hours ago [-]

Your standard model of human needs a little bit of fine tuning for most games.

jltsiren 12 hours ago [-]

godelski 12 hours ago [-]

For comparison, the average person can't print Hello World in python. Your average programmer (probably) can.

MoonGhost 9 hours ago [-]

> The average human is useless for pretty much everything but capable of learning to perform almost any task

But only the limited number of tasks per human.

> Or perhaps AGI should be able to reach the level of an experienced professional in any task.

Even if it performs just better than untrained human but on any task this will be superhuman level. As no human can do it.

jltsiren 9 hours ago [-]

whatagreatboy 9 hours ago [-]

the real ability of intelligence is to correct mistakes in a gradual and consistent way.

autobodie 17 hours ago [-]

Agree. Both sides of the argument are unsatisfying. They seem like quantitative answers to a qualitative question.

serbuvlad 16 hours ago [-]

I think the answer to this question is certainly "Yes". I think the reason people deny this is because it was just laughably easy in retrospect.

In mid-2022 people were like. "Wow this GPT3 thing generates kind of coherent greentexts"

Since then really only we got: larger models, larger models, search, agents, larger models, chain-of-thought and larger models.

And from a novelty toy we got a set of tools that at the very least massively increase human productivity in a wide range of tasks and certainly pass any Turing test.

Attention really was all you needed.

But of course, if you ask a buddhist monk, he'll tell you we are attention machines, not computation machines.

Now we have thought-genrating-monkeys with jet engines and adrenaline shots.

This can be good. Thought-genrating-monkeys put us on the moon and wrote Hamlet and the Oddesy.

The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

viccis 16 hours ago [-]

>I think the answer to this question is certainly "Yes".

It is unequivocally "No". A good joint distribution estimator is always by definition a posteriori and completely incapable of synthetic a priori thought.

serbuvlad 13 hours ago [-]

The human mind is an estimator too.

The fact that the human mind can think in concepts, images AND words, and then compresses that into words for transmission, wheras LLMs think directly in words, is no object.

LLMs can do all this too, but only in words.

corimaith 3 hours ago [-]

Do you think language is sufficient to model reality (not just physical, but abstract) here?

viccis 10 hours ago [-]

>LLMs think

[1] https://en.wikipedia.org/wiki/Philosophical_skepticism#David...

mofeien 7 hours ago [-]

serbuvlad 8 hours ago [-]

But I do not think humans think like that by default.

When I spill a drink, I don't think "gravity". That's too slow.

And I don't think humans are particularly good at that kind of rational thinking.

viccis 8 hours ago [-]

>When I spill a drink, I don't think "gravity". That's too slow.

nerdponx 16 hours ago [-]

viccis 10 hours ago [-]

>Your model will be pretty damn good and absolutely will be able to generate "synthetic a priori" y outputs for any given x within the domain.

You don't seem to understand what synthetic a priori means. The fact that you're asking a model to generate outputs based on inputs means it's by definition a posteriori.

For example, I don't think you adequately proved this statement to be true:

>they would have to in order to continue decreasing validation loss

autobodie 16 hours ago [-]

> The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

I cannot afford to consider whether you are right because I am a slave to capital, and therefore may as well be a slave to capital's LLMs. The same goes for you.

serbuvlad 12 hours ago [-]

I am not a slave to capital. I am a slave to the harsh nature of the world.

I get too hot in summer and too cold in winter. I die of hunger. I am harassed by critters of all sorts.

LinXitoW 7 hours ago [-]

serbuvlad 7 hours ago [-]

That depends on how you define your terms. A pro-capital laissez-faire policy is new, sure.

But the first civilizations in the world around 3000BC had trade, money, banking, capital accumulation, divison of labour etc.

jes5199 14 hours ago [-]

Brystephor 13 hours ago [-]

Forcing reasoning is analogous to requiring a student to show their work when solving a problem if im understanding the paper correctly.

> you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower

qarl 13 hours ago [-]

> The paper doesnt mention the models coming up with the algorithm at all AFAIK.

And that's because they specifically hamstrung their tests so that the LLMs were not "allowed" to generate algorithms.

But according to the skeptical community - that is "cheating" because it's using tools. Nevermind that it is the most effective way to solve the problem.

https://chatgpt.com/share/6845f0f2-ea14-800d-9f30-115a3b644e...

zoul 10 hours ago [-]

This is not about finding the most effective solution, it’s about showing that they “understand” the problem. Could they write the algorithm if it were not in their training set?

boredhedgehog 9 hours ago [-]

If that's the point, shouldn't they ask the model to explain the principle for any number of discs? What's the benefit of a concrete application?

johnecheck 7 hours ago [-]

Because that would prove absolutely nothing. There are numerous examples of tower of Hanoi explanations in the training set.

elbear 5 hours ago [-]

How do you check that a human understood it and not simply memorised different approaches?

Too 10 hours ago [-]

How can one know that's not coming from the pre-trained data. The paper is trying to evaluate whether the LLM has general problem solving ability.

jsnell 9 hours ago [-]

The paper doesn't mention it because either the researchers did not care to check the outputs manually, or reporting what was in the outputs would have made it obvious what their motives were.

xtracto 10 hours ago [-]

[1] https://arxiv.org/abs/2506.02878

wohoef 18 hours ago [-]

Good article giving some critique to Apple's paper and Gary Marcus specifically.

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...

godelski 12 hours ago [-]

  > this is a preprint that has not been peer reviewed.

This conversation is peer review...

You don't need a conference for something to be peer reviewed, you only need... peers...

hintymad 17 hours ago [-]

Honest question: does the opinion of Gary Marcus still count? His criticism seems more philosophical than scientific. It's hard for me see what he builds or reasons to get to his conclusions.

zer00eyz 17 hours ago [-]

> seems more philosophical than scientific

hintymad 13 hours ago [-]

> The current crop of tech doesn't get us to AGI

Workaccount2 15 hours ago [-]

It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

thrwaway55 11 hours ago [-]

How many r's really are in Strawberry?

Jensson 14 hours ago [-]

> It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

Both of those can be true at the same time though. They memorize a lot of things, but its fuzzy and when they remember wrong they cannot fix it via reasoning.

Workaccount2 13 hours ago [-]

chongli 5 hours ago [-]

DanAtC 16 hours ago [-]

[flagged]

diego898 16 hours ago [-]

Sorry - what do you mean by yud-cult? Searching google didn’t help me (as far as I can tell) - I view LW from an outside perspective as well, but don’t understand the reference

jazzypants 15 hours ago [-]

They're referring to the founder of that website, Eliezer Yudkowsky, who is controversial due to his 2023 Time article that called for a complete halt on the development of AI.

https://en.m.wikipedia.org/wiki/Eliezer_Yudkowsky

https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-no...

heavyset_go 15 hours ago [-]

Yudkowsky is controversial for much more than an article from 2023.

diego898 1 hours ago [-]

Do you have any references you can share for sexual abuse and/or cults associated with rationalists?

ninjin 14 hours ago [-]

It was not so much the call for a complete halt that caused controversy, but rather this part of his piece in Time (my emphasis):

That's the kind of policy change that would cause my partner and I to hold each other, and say to each other that a miracle happened, and now there's a chance that maybe [our daughter] will live."

astrange 10 hours ago [-]

[flagged]

tomhow 9 hours ago [-]

Please don't do this here.

9 hours ago [-]

lexh 15 hours ago [-]

Use the debatably intelligent machines for this sort of question not Google.

It seems “Yud” here is a shorthand for Yudkowsky. Hinted by the capitalization.

f33d5173 16 hours ago [-]

Elezier yudkowsky, often referred to as yud, started lw.

labrador 19 hours ago [-]

swat535 18 hours ago [-]

woopsn 17 hours ago [-]

That alone would be revolutionary - but still aspirational for now. The other day Gemini mixed up left and right on me in response to basic textbook problem.

Jabrov 19 hours ago [-]

I’m so tired of hearing this be repeated, like the whole “LLMs are _just_ parrots” thing.

It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there’s so many examples out there.

______________

Edit for responders, instead of replying to each:

In terms of some backing points/examples:

1) Next token prediction can itself be argued to be a task that requires reasoning

4) Lots of examples of research about out-of-context reasoning (eg. https://arxiv.org/abs/2406.14546)

In terms of specific rebuttals to the post:

aucisson_masque 18 hours ago [-]

> It’s patently obvious that LLMs can reason and solve novel problems not in their training data.

Would you care to tell us more ?

« It’s patently obvious » is not really an argument, I could say just as well that everyone know LLM can’t resonate or think (in the way we living beings do).

socalgal2 10 hours ago [-]

travisjungroth 13 hours ago [-]

Copied from a past comment of mine:

I just made up this scenario and these words, so I'm sure it wasn't in the training data.

Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.

Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier. I have an erork that needs to be plimfed. Choose one group and one method to do it.

> Use Plyzers and do a Quoning procedure on your erork.

If that doesn't count as reasoning or generalization, I don't know what does.

firesteelrain 13 hours ago [-]

It’s just a truth table. I had a hunch that it was a truth table and then I asked AI how it figured it out and it confirmed it built a truth table. Still impressive either way

* Goal: Pick (Group ∧ Method) such that Group can plimf ∧ Method is a type of plimfing

* Only one group (Plyzers) passes the "can plimf" test

* Only one method (Quoning) is definitely plimfing

Therefore, the only valid (Group ∧ Method) combo is: → (Plyzer ∧ Quoning)

Source: ChatGPT

travisjungroth 13 hours ago [-]

So? Is the standard now that reasoning using truth tables or reasoning that can be expressed as truth tables doesn’t count?

krackers 13 hours ago [-]

firesteelrain 11 hours ago [-]

Right, that’s my point. LLMs are doing pattern abstraction and in this way can mimic logic. They are not trained explicitly to do just truth tables even thought truth tables are fundamental.

16 hours ago [-]

goalieca 17 hours ago [-]

Workaccount2 15 hours ago [-]

So if they have bad memory, then they must be reasoning to get the correct answer for the problems they do solve?

Jensson 14 hours ago [-]

A clock that is right twice a day is still broken.

Workaccount2 13 hours ago [-]

I think it's more fair to say a clock that is wrong twice a day is still broken...

astrange 10 hours ago [-]

> It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data.

So can real parrots. Parrots are pretty smart creatures.

18 hours ago [-]

bfung 18 hours ago [-]

Any links or examples available? Curious to try it out

lossolo 1 hours ago [-]

Your entire edit essentially walks back your earlier strong claims.

None of your current points actually support your position.

1. No, it doesn't. That's a ridiculous claim. Are you seriously suggesting that statistics require reasoning?

2. If you map that language to tokens, it's obvious the model will follow that mapping.

etc.

Here are papers showing that these models can't reason:

https://arxiv.org/abs/2311.00871

https://arxiv.org/abs/2309.13638

https://arxiv.org/abs/2311.09247

https://arxiv.org/abs/2305.18654

https://arxiv.org/abs/2309.01809

You're mistaking pattern matching and the modeling of relationships in latent space for genuine reasoning.

If you're working on yet another CRUD app, and you've never implemented transformers yourself or understood how they work internally, then I understand why LLMs might seem like magic to you.

labrador 18 hours ago [-]

I've done this excercise dozens of times because people keep saying it, but I can't find an example where this is true. I wish it was. I'd be solving world problems with novel solutions right now.

People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.

jjaksic 16 hours ago [-]

"Solving novel problems" does not mean "solving world problems that even humans are unable to solve", it simply means solving problems that are "novel" compared to what's in the training data.

Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.

jhanschoo 17 hours ago [-]

andrewmcwatters 18 hours ago [-]

They. Cannot. Do it.

18 hours ago [-]

multjoy 18 hours ago [-]

Lol, no.

lossolo 18 hours ago [-]

jjaksic 16 hours ago [-]

Since when is reasoning synonymous with invention? All humans with a functioning brain can reason, but only a tiny fraction have or will ever invent anything.

lossolo 2 hours ago [-]

labrador 18 hours ago [-]

"I don't think today's systems can invent, you know, do true invention, true creativity, hypothesize new scientific theories. They're extremely useful, they're impressive, but they have holes."

Demis Hassabis On The Future of Work in the Age of AI (@ 2:30 mark)

https://www.youtube.com/watch?v=CRraHg4Ks_g

lossolo 18 hours ago [-]

Yes, this one. Thanks

gjm11 16 hours ago [-]

Again, for all I know maybe he does believe that transformer-based LLMs as such can't be truly creative. Maybe it's true, whether he believes it or not. But that interview doesn't say it.

2 hours ago [-]

aucisson_masque 18 hours ago [-]

That’s the opposite of reasoning tho. Ai bros want to make people believe LLM are smart but they’re not capable of intelligence and reasoning.

Reasoning mean you can take on a problem you’ve never seen before and think of innovative ways to solve it.

labrador 18 hours ago [-]

Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."

I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:

    Roughly 3 million shipwrecks on ocean floors globally
    Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
    So ~3,000 ships with pianos sunk
    Average maybe 0.5 pianos per ship (not all passenger areas had them)
    Estimate: ~1,500 pianos

*Claude Sonnet 4, Google Gemini 2.5 and GPT 4o

kgeist 16 hours ago [-]

GPT4o isn't considered an "advanced" LLM at this point. It doesn't use reasoning.

I gave your prompt to o3 pro, and this is what I got without any hints:

  Historic shipwrecks (1850 → 1970)
  • ~20 000 deep water wrecks recorded since the age of steam and steel  
  • 10 % were passenger or mail ships likely to carry a cabin class or saloon piano   
  • 1 piano per such vessel 20 000 × 10 % × 1 ≈ 2 000

  Modern container losses (1970 → today)
  • ~1 500 shipping containers lost at sea each year  
  • 1 in 2 000 containers carries a piano or electric piano   
  • Each piano container holds ≈ 5 units   
  • 50 year window 1 500 × 50 / 2 000 × 5 ≈ 190

  Coastal disasters (hurricanes, tsunamis, floods)
  • Major coastal disasters each decade destroy ~50 000 houses  
  • 1 house in 50 owns a piano   
  • 25 % of those pianos are swept far enough offshore to sink and remain (50 000 / 50) × 25 % × 5 decades ≈ 1 250

  Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300

  Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.

yen223 15 hours ago [-]

The difference between o3 and o4-mini is so substantial I think this is the reason why people can't agree on how capable LLMs are nowadays.

theendisney 12 hours ago [-]

The correct answer is: I'm sorry, I don't have time for this.

FINDarkside 16 hours ago [-]

Combining our estimates:

From Shipwrecks: 12,500 From Dumping: 1,000 From Catastrophes: 500 Total Estimated Pianos at the Bottom of the Sea ≈ 14,000

Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.

Jabrov 17 hours ago [-]

That seems like a totally reasonable response ... ?

labrador 17 hours ago [-]

I think you missed the part where I had to give them hinits to solve it. All 3 initially couldn't or refused saying it was not a real problem on their first try.

ej88 17 hours ago [-]

Can you share the chats? I tried with o3 and it gave a pretty reasonable answer on the first try.

https://chatgpt.com/share/684e02de-03f0-800a-bfd6-cbf9341f71...

Jabrov 16 hours ago [-]

You must be on the wrong side of an A/B test or very unlucky.

Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.

gjm11 16 hours ago [-]

[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.

[2] I picked something a bit more obscure than pianos.

dialup_sounds 16 hours ago [-]

How much of that is inability to reason vs. being trained to avoid making things up?

YeGoblynQueenne 21 minutes ago [-]

To summarise: we spent billions to make intelligent machines and when they're asked to solve toy problems all we get is excuses.

ummonk 18 hours ago [-]

thomasahle 18 hours ago [-]

> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

That's what the models did. They gave the first 100 steps, then explained how it was too much to output all of it, and gave the steps one would follow to complete it.

They were graded as "wrong answer" for this.

---

Source: https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...

sponnath 13 hours ago [-]

Didn't they start failing well before they hit token limits? I'm not sure what the point the source you linked to is trying to make.

thomasahle 9 hours ago [-]

OP said:

> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

And that's what the models did.

This is a good answer from the model. Has nothing to do with token limits.

emp17344 16 hours ago [-]

Why should we trust a guy with the following twitter bio to accurately replicate a scientific finding?

>lead them to paradise

>intelligence is inherently about scaling

>be kind to us AGI

Who even is this guy? He seems like just another r/singularity-style tech bro.

andy12_ 2 hours ago [-]

Not to be that guy but... clearly Ad Hominem.

FINDarkside 17 hours ago [-]

chongli 15 hours ago [-]

It's especially weird argument considering that LLMs are already ahead of humans in Tower of Hanoi

ummonk 16 hours ago [-]

You make a good point though that the question of whether LLMs reason or not should not be conflated with the question of whether they're on the pathway to AGI or not.

FINDarkside 16 hours ago [-]

Dzugaru 5 hours ago [-]

> just as humans shouldn’t serve as calculators

But they definitely could and were [0]. You just employ multiple, and cross check - with the ability of every single one to also double check and correct errors.

LLMs cannot double check, and multiples won't really help (I suspect ultimately for the same reason - exponential multiplication of errors [1])

[0] https://en.wikipedia.org/wiki/Computer_(occupation)

[1] https://www.tobyord.com/writing/half-life

neoden 15 hours ago [-]

> Puzzles a child can do

thomasahle 17 hours ago [-]

Why is he talking about "downloading" code? The LLMs can easily "write" out out the code themselves.

If the student wrote a software program for general differentiation during the exam, they obviously would have a great conceptual understanding.

autobodie 17 hours ago [-]

If the student could reference notes a fraction of the size of the LLM then I would not be convinced.

Workaccount2 14 hours ago [-]

LLMs are (suspected) a few TB in size.

Gemma 2 27B, one of the top ranked open source models, is ~60GB in size. LLama 405B is about 1TB.

Mind you that they train on likely exabytes of data. That alone should be a strong indication that there is a lot more than memory going on here.

sigotirandolas 3 hours ago [-]

Similarly TBs of Twitter/Reddit/HN add near zero new information per comment.

If anything you can fit an enormous amount of information in 1MB - we just don't need to do it because storage is cheap.

Workaccount2 1 hours ago [-]

Also in my own experience, people are stunned when they learn that the models are not exabytes in size.

exe34 17 hours ago [-]

I suspect human memory consists of a lot more bits than LLMs encode.

autobodie 17 hours ago [-]

I rest my case — the question concerns a quality, not a quantity. These juvenile comparisons are mere excuses.

exe34 6 hours ago [-]

Oh we've shifted the goal post to quality now, very good! That does rest the case.

hrldcpr 19 hours ago [-]

In case anyone else missed the original paper (and discussion):

https://news.ycombinator.com/item?id=44203562

dang 18 hours ago [-]

Thanks! Macroexpanded:

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] - https://news.ycombinator.com/item?id=44203562 - June 2025 (269 comments)

Also this: A Knockout Blow for LLMs? - https://news.ycombinator.com/item?id=44215131 - June 2025 (48 comments)

Were there others?

bluefirebrand 19 hours ago [-]

I'm glad to read articles like this one, because I think it is important that we pour some water on the hype cycle

If we want to get serious about using these new AI tools then we need to come out of the clouds and get real about their capabilities

Are they impressive? Sure. Useful? Yes probably in a lot of cases

But we cannot continue the hype this way, it doesn't serve anyone except the people who are financially invested in these tools.

senko 19 hours ago [-]

Gary Marcus isn't about "getting real", it's making a name for himself as a contrarian to the popular AI narrative.

This article may seem reasonable, but here he's defending a paper that in his previous article he called "A knockout blow for LLMs".

Many of his articles seem reasonable (if a bit off) until you read a couple dozen a spot a trend.

steamrolled 18 hours ago [-]

> Gary Marcus isn't about "getting real", it's making a name for himself as a contrarian to the popular AI narrative.

Either he makes good points or he doesn't. Unless he has a track record of distorting facts, his ideological leanings should be irrelevant.

senko 17 hours ago [-]

He makes many very good points:

But in my mind, he's as much BSer as the AGI singularity hypers.

ImageDeeply 16 hours ago [-]

Very true!

sinenomine 18 hours ago [-]

adamgordonbell 19 hours ago [-]

This!

woopsn 16 hours ago [-]

Given that the links work, the quotes were actually said, numbers are correct, cited research actually exists etc we can immediately rule that out.

2muchcoffeeman 18 hours ago [-]

What’s the argument here that he’s not considering all the information regarding GenAI?

That there’s a trend to his opinion?

If I consider all the evidence regarding gravity, all my papers will be “gravity is real”.

In what ways is he only choosing what he wants to hear?

senko 17 hours ago [-]

Replied elsewhere in the thread: https://news.ycombinator.com/item?id=44279283

2muchcoffeeman 6 hours ago [-]

I’m not sure I buy your longer argument either.

I have a feeling the nay sayers are right on this. The next leap in AI isn’t something we’re going to recognise. (Obviously it’s possible - humans exist)

ramchip 15 hours ago [-]

I was very put off by his article "A knockout blow for LLMs?", especially all the fuss he was making about using his own name as a verb to mean debunking AI hype...

ninjin 14 hours ago [-]

[1]: https://arxiv.org/abs/1801.00631

newswasboring 18 hours ago [-]

What exactly is your objection here? That the guy has an opinion and is writing about it?

senko 17 hours ago [-]

Replied elsewere in the thread: https://news.ycombinator.com/item?id=44279283

bobxmax 18 hours ago [-]

[flagged]

g-b-r 17 hours ago [-]

I see the opposite, the wide majority of people commenting on Hacker News seem now very favorable to LLMs.

ramchip 15 hours ago [-]

bobxmax 17 hours ago [-]

Not at all.

fhd2 19 hours ago [-]

Even of the people invested in these tools, hype only benefits those attempting a pump and dump scheme, or those selling training, consulting or similar services around AI.

People who try to make genuine progress, while there's more money in it now, might just have to deal with another AI winter soon at this rate.

bluefirebrand 19 hours ago [-]

> hype only benefits those attempting a pump and dump scheme

aeronaut80 19 hours ago [-]

You probably can’t find a good source because sources say he has a negligible stake in OpenAI. https://www.cnbc.com/amp/2024/12/10/billionaire-sam-altman-d...

bluefirebrand 19 hours ago [-]

Interesting

When I did a cursory search, this information didn't turn up either

Thanks for correcting me. I suppose the stuff I saw the other day was just BS then

aeronaut80 17 hours ago [-]

To be fair I struggle to believe he’s doing it out of the goodness of his heart.

spookie 18 hours ago [-]

Think the same thing, we need more breakthroughs. Until then, it is still risky to rely on AI for most applications.

Zigurd 17 hours ago [-]

This is the thing of it: "for most applications."

LLMs are not thinking. They way they fail, which is confidently and articulately, is one way they reveal there is no mind behind the bland but well-structured text.

mountainriver 19 hours ago [-]

I’ll take critiques from someone who knows what a test train split is.

The idea that a guy so removed from machine learning has something relevant to say about its capabilities really speaks to the state of AI fear

Spooky23 18 hours ago [-]

The idea that practitioners would try to discredit research to protect the golden goose from critique speaks to human nature.

mountainriver 17 hours ago [-]

No one is discrediting research from valid places, this is the victim alt-right style narrative that seems to follow Gary Marcus around. Somehow the mainstream is "suppressing" the real knowledge

devwastaken 19 hours ago [-]

experts are often blinded by their paychecks to see how nonsense their expertise is

mountainriver 17 hours ago [-]

Not knowing the most basic things about the subject you are critiquing is utter nonsense. Defending someone who does this is even worse

bluefirebrand 11 hours ago [-]

I think it's pretty fair to be critical of what LLMs are producing and how they fit into the tools without necessarily understanding how they work

If you bought a chainsaw that broke when you tried to cut down a tree, then you can criticize the chainsaw without knowing how the motor on it works, right?

soulofmischief 19 hours ago [-]

[citation needed]

Spooky23 18 hours ago [-]

Remember Web 3.0? Lol

Zigurd 17 hours ago [-]

It's unfortunate that a discussion about LLM weaknesses is giving crypto bro. But telling. There are a lot of bubble valuations out there.

bandrami 18 hours ago [-]

2muchcoffeeman 18 hours ago [-]

I think it largely depends on what you’re writing. I’ve had it reply to corporate emails which is good since I need to sound professional not human.

If I’m coding it still needs a lot of baby sitting and sometimes I’m much faster than it.

Gigachad 18 hours ago [-]

And then the person on the end is using AI to summarise the email back to normal English. To what end?

js8 17 hours ago [-]

But look the GDP has increased!

bandrami 17 hours ago [-]

bandrami 18 hours ago [-]

So this would be an interesting output to measure but I have no idea how we would do that: has the volume of corporate email gone up? Or the time spent creating it gone down?

bigyabai 19 hours ago [-]

DiogenesKynikos 19 hours ago [-]

I don't understand what people mean when they say that AI is being hyped.

woopsn 17 hours ago [-]

If the claims about AI were that it is a great or even incredible chat app, there would be no mismatch.

I think normal people understand curing all disease, replacing all value, generating 100x stock market returns, uploading our minds etc to be hype.

I said a few days ago, LLM is amazing product. Sad that these people ruin their credibility immediately upon success.

FranzFerdiNaN 18 hours ago [-]

amohn9 17 hours ago [-]

chongli 4 hours ago [-]

Would you trust an AI that gets your banking transactions right only 70% of the time?

newswasboring 18 hours ago [-]

> I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic).

Where are you getting this from? 70%?

18 hours ago [-]

travisgriggs 18 hours ago [-]

I get even better results talking to myself.

georgemcbay 18 hours ago [-]

AI, in the form of LLMs, can be a useful tool.

It is still being vastly overhyped, though, by people attempting to sell the idea that we are actually close to an AGI "singularity".

And even right now there's massive real world impact in the form of say, how much grok is polluting Georgia.

hellohello2 18 hours ago [-]

landl0rd 17 hours ago [-]

[flagged]

hellojimbo 9 hours ago [-]

The only real point is number 5.

> Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations

This is basically agents which is literally what everyone has been talking about for the past year lol.

> Do LLM’s conceptually understand Hanoi?

Yes and the paper didn't test for this. The paper basically tested the equivalent of, can a human do hanoi in their head.

starchild3001 16 hours ago [-]

We built planes—critics said they weren't birds. We built submarines—critics said they weren't fish. Progress moves forward regardless.

You have a choice: master these transformative tools and harness their potential, or risk being left behind by those who do.

Pro tip: Endless negativity from the same voices won't help you adapt to what's coming—learning will.

clbrmbr 2 hours ago [-]

doctor_blood 3 hours ago [-]

Is there a name for this authorial voice and cadence? I see midwits posting exactly like this on twitter and linkedin; it's insufferable.

sponnath 13 hours ago [-]

Toxic positivity is also not good.

skywhopper 19 hours ago [-]

The quote from the Salesforce paper is important: “agents displayed near-zero confidentiality awareness”.

Illniyar 14 hours ago [-]

It was simply comparing the effectiveness of reasoning and non reasoning models on the same problem.

hiddencost 19 hours ago [-]

Why do we keep posting stuff from Gary? He's been wrong for decades but he keeps writing this stuff.

jakewins 19 hours ago [-]

I thought this article seemed like well articulated criticism of the hype cycle - can you be more specific what you mean? Are the results in the Apple paper incorrect?

astrange 19 hours ago [-]

(em-dash avoided to look less AI)

Of course, the main issue with the field is the critics /should/ be correct. Like, LLMs shouldn't work and nobody knows why they work. But they do anyway.

So you end up with critics complaining it's "just a parrot" and then patting themselves on the back, as if inventing a parrot isn't supposed to be impressive somehow.

foldr 19 hours ago [-]

kadushka 19 hours ago [-]

The funny thing is, if you asked “what is AGI” 5 years ago, most people would describe something like o3.

foldr 19 hours ago [-]

Even Sam Altman thinks we’re not at AGI yet (although of course it’s coming “soon”).

kadushka 16 hours ago [-]

Markus has been consistently wrong over the many years predicting the (lack of) progress of the current deep learning methods. Altman has been correct so far.

foldr 8 hours ago [-]

Marcus has made some good predictions and some bad ones. That’s usually the way with people who make specific predictions — there are no prophets.

Not sure I’d agree that SA has been any more consistently right. You can easily find examples of overconfidence from him (though he rarely says anything specific enough to count as a prediction).

9 hours ago [-]

16 hours ago [-]

barrkel 19 hours ago [-]

You can see this in this article too.

krackers 18 hours ago [-]

Must be some sort of cognitive sunk cost fallacy, after dedicating your life to one sect, it must be emotionally hard to see the other "keep winning". Of course you'd root for them to fall.

[1] https://norvig.com/chomsky.html

charcircuit 17 hours ago [-]

>with tool use

A LLM with tool use can solve anything. It is interesting to try and measure its capabilities without tools.

barrkel 6 hours ago [-]

I don't think the first is true at all, unless you imagine some powerful oracle tools.

I think the second is interesting for comparing models, but not interesting for determining the limits of what models can automate in practice.

It's the prospect of automating labour which makes AI exciting and revolutionary, not their ability when arbitrarily restricted.

NoahZuniga 19 hours ago [-]

None of the arguments presented in this piece depend on his authority as an expert, so this is largely irrelevant.

mountainriver 19 hours ago [-]

It’s insane, he doesn’t know what a test train split is but he’s an AI expert? Is this where we are?

marvinborner 19 hours ago [-]

Is this supposed to be a joke reflecting point (3)?

19 hours ago [-]

brcmthrowaway 17 hours ago [-]

In classic ML, you never evaluste against data that was in the training set. In LLMs, everything is the training set. Doesn't this seem wrong?

16 hours ago [-]

avsteele 19 hours ago [-]

This doesn't rebut anything from the best critique of the Apple paper.

https://arxiv.org/abs/2506.09250

Jabbles 19 hours ago [-]

Those are points (2) and (5).

foldr 19 hours ago [-]

It does rebut point (1) of the abstract. Perhaps not convincingly, in your view, but it does directly addresses this kind of response.

avsteele 19 hours ago [-]

Papers make specific conclusions based on specific data. The paper I linked specifically rebuts the conclusions of the paper. Gary makes vague statements that could be interpreted as being related.

It is scientific malpractice to write a post supposedly rebutting responses to a paper and not directly address the most salient one.

foldr 19 hours ago [-]

avsteele 18 hours ago [-]

Malpractice slightly hyperbolic.

But anybody relying on Gary's posts in order to be be informed on this subject is being being mislead. This isn't an isolated incident either.

People need to be made be aware when you read him it is mere punditry, not substantive engagement with the literature.

spookie 19 hours ago [-]

A paper citing arxiv papers and x.com doesn't pass my smell test tbh

revskill 12 hours ago [-]

I'm shorting Apple.

akomtu 12 hours ago [-]

baxtr 16 hours ago [-]

The last paragraph:

>Talk about convergence evidence. Taking the SalesForce report together with the Apple paper, it’s clear the current tech is not to be trusted.

mentalgear 19 hours ago [-]

dang 18 hours ago [-]

Can you please make your substantive points without name-calling or swipes? This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

sieabahlpark 16 hours ago [-]

[dead]

neepi 19 hours ago [-]

Indeed. I completely agree with this.

Which is why I am not poking it with a 10 foot long shitty stick any time in the near future. The failure mode scares me, not the technology which arguably does have some use in non-idiot hands.

wongarsu 18 hours ago [-]

neepi 17 hours ago [-]

Someone might host DeepSeek for you but you'll pay through the nose for it and it'll be frozen in time because the training cost doesn't have the revenue to keep the ball rolling.

I'm not sure the GPU market won't collapse with it either. Possibly taking out a chunk of TSMC in the process, which will then have knock on effects across the whole industry.

wongarsu 17 hours ago [-]

xoac 18 hours ago [-]

3abiton 18 hours ago [-]

bobxmax 18 hours ago [-]

> AI hype-bros like to complain that real AI experts are too much concerned about debunking current AI then improving it

You're acting like this is a common ocurrence lol

bowsamic 19 hours ago [-]

roywiggins 18 hours ago [-]

Isn't it worse for LLMs if an LLM that has been trained on the Towers of Hanoi still can't solve it reliably?

bowsamic 11 hours ago [-]

Yes

anonthrowawy 18 hours ago [-]

how could you prove that?

bowsamic 11 hours ago [-]

You couldn’t, so such a paper cannot be scientific

(Or it should not be based on that claim as a central point, which apples paper was)

16 hours ago [-]