So far, all evidence that LLMs can perform few-shot reasoning on novel problems seems to boil down to "LLMs store patterns they can reapply to new inputs": it works for problems that follow a structure the model has seen before, but fails on genuinely new problems.
This is circumstantially confirmed by the fact that, when ARC problems are translated into token sequences, the largest LLMs out there (not just GPT-3, but *much* larger ones as well) score close to zero. The problems that do get solved are well-known ones, such as a simple left/right flip.
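To make the setup concrete, here is a minimal sketch of what "translating ARC problems to sequences" and a left/right flip look like. This is an illustration, not the encoding used in any particular experiment: the grid representation (lists of color integers) and the `|` row separator are assumptions.

```python
# Sketch only: grids assumed to be lists of lists of color ints (0-9);
# the "|" row separator is a hypothetical choice, not a standard encoding.

def flip_lr(grid):
    """Left/right flip: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]

def grid_to_sequence(grid, row_sep="|"):
    """Flatten a 2-D grid into a flat character sequence an LLM could consume."""
    return row_sep.join("".join(str(c) for c in row) for row in grid)

grid = [[1, 2, 3],
        [4, 5, 6]]
print(grid_to_sequence(grid))           # 123|456
print(grid_to_sequence(flip_lr(grid)))  # 321|654
```

A flip is exactly the kind of transformation that appears all over the Internet, which is why it is among the few ARC tasks such models get right.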
In general, it should be expected that LLMs can solve any problem that has practice data available, and that becomes a pure pattern recognition problem after practice. This includes all IQ test tasks that weren't designed for novelty.
The only thing that makes ARC special is that it is designed for novelty. Each task (with a number of exceptions, because ARC is far from perfect) is unique and not seen anywhere else on the Internet. This is especially true of the private test set, since the other sets are, in fact, online.
Despite this novelty element, kids as young as 5-6 can solve a large number of ARC tasks with no prior practice and no task-level explanation. Starting around 9-10 they can solve nearly all of ARC save for the most difficult tasks. This is "extreme generalization" in action.