Machine Learning Street Talk
March 22, 2025

Test-Time Adaptation: the key to reasoning with DL

This episode features Mohamed Osman ("Mo") from MindsAI, the top-scoring team on the ARC Challenge (58%), discussing their test-time adaptation approach and the future of reasoning with deep learning at Twofer AI Labs in Zurich, which has acquired MindsAI.

Test-Time Fine-tuning: A Paradigm Shift

  • "Test-time fine-tuning is a new paradigm to deep learning... It's something completely outside of the deep learning paradigm...But there is a way to look at it in which it exactly fits the deep learning paradigm."
  • "What’s the most efficient way to learn at test time?"
  • Test-time fine-tuning allows models to adapt to novel perceptual problems, treating each ARC puzzle as a unique learning opportunity (a rough sketch of the idea follows this list). This directly addresses the challenge of ARC's abstract nature and infinite potential solutions.
  • The winning team views ARC puzzles as perceptual reasoning problems, where finding the right level of abstraction is key to efficient solution searching. This contrasts with approaches focused solely on symbolic reasoning or programmatic solutions.
  • By prompting all examples and the test input simultaneously in the forward pass, both during pre-training and at test time, the model learns a "meta" ability to contextualize and generalize, making it easier to fine-tune its reasoning.
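As a rough illustration of the idea, here is a minimal sketch of per-task test-time fine-tuning, assuming a Hugging Face seq2seq checkpoint; the checkpoint name, hyperparameters, and the `serialize_grid` helper are illustrative placeholders, not the team's actual code:

```python
# Minimal sketch of per-task test-time fine-tuning (not MindsAI's actual code).
# Assumes: pip install torch transformers; "t5-small" is a stand-in checkpoint.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def serialize_grid(grid):
    """Render an ARC grid as plain text: one row per line, space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def test_time_finetune(task, steps=20, lr=1e-4):
    """Adapt the model on this task's demonstration pairs before predicting."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for pair in task["train"]:  # the few demonstration input/output pairs
            enc = tokenizer(serialize_grid(pair["input"]), return_tensors="pt")
            labels = tokenizer(serialize_grid(pair["output"]), return_tensors="pt").input_ids
            loss = model(**enc, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def predict(test_input):
    """Generate the output grid (as text) for a single test input grid."""
    model.eval()
    enc = tokenizer(serialize_grid(test_input), return_tensors="pt")
    out = model.generate(**enc, max_new_tokens=512)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

In practice each task would typically be tuned on a fresh copy of the shared pre-trained weights, so that adaptations do not bleed from one puzzle into the next.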

Solution-Based Prediction vs. Program Synthesis

  • "You can perceive the problem...and then you can either output in Python or in a direct output manner."
  • The team prioritizes solution-based prediction, directly generating outputs rather than intermediate Python functions (see the toy illustration after this list). They argue that neural networks, while not inherently compositional, can achieve approximate compositionality through test-time fine-tuning within a specific domain.
  • This contrasts with approaches like DreamCoder that focus on program synthesis, which the team believes suffers from a restrictive output space and neglects the critical perceptual aspect of ARC.
  • While acknowledging the benefits of code pre-training for improving reasoning, the team emphasizes the importance of flexibility and raw representation for tackling ARC’s novelty.
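To make the contrast concrete, here is a hypothetical toy example (a "flip each row" task, not an actual ARC puzzle) showing the two kinds of prediction target; the direct route emits the answer grid itself, while program synthesis emits code that must then be executed:

```python
# Hypothetical toy task: the transformation flips each row left-to-right.
input_grid = [[1, 0, 3],
              [0, 2, 2]]

# Solution-based (direct) prediction: the target is the serialized answer grid.
direct_target = "3 0 1\n2 2 0"

# Program synthesis: the target is an intermediate program that is then executed.
program_target = "def solve(grid):\n    return [row[::-1] for row in grid]"

# Executing the synthesized program reproduces the same answer, but only if the
# transformation is expressible in the chosen program space.
namespace = {}
exec(program_target, namespace)
assert namespace["solve"](input_grid) == [[3, 0, 1], [2, 2, 0]]
```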

Encoding and Representation: Keeping it Raw

  • "The whole point of ARC is they're going to trick you. Whatever specialization you put in the input, you can create a problem that's adversarial to that tokenization scheme."
  • The team uses a plain representation of ARC problems (numbers as text, as in the snippet after this list) to maintain maximum flexibility. They caution against specialized encodings that could be exploited by the adversarial nature of ARC.
  • This minimalist approach contrasts with using vision language models (VLMs), which the team believes are ill-suited for ARC due to their fixed representations and pre-encoded biases.
  • They advocate for a focus on contextualization and flexibility rather than pre-encoding, aligning with the core principle of ARC's design.
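As an illustration of what such a plain, numbers-as-text encoding might look like (a sketch, not necessarily the team's exact format):

```python
# A hypothetical plain encoding: digits as text, rows separated by newlines,
# with no colors, coordinates, or other specialized tokens baked in.
grid = [[0, 7, 7],
        [7, 7, 0],
        [0, 0, 7]]

encoded = "\n".join(" ".join(str(cell) for cell in row) for row in grid)
print(encoded)
# 0 7 7
# 7 7 0
# 0 0 7
```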

Key Takeaways:

  • Test-time adaptation is a powerful technique for tackling abstract reasoning tasks like ARC, enabling neural networks to adapt to novel perceptual challenges and achieve state-of-the-art performance.
  • Prioritizing raw representations and flexible contextualization over specialized encodings or program synthesis can be crucial for handling ARC’s adversarial and abstract nature.
  • The future of reasoning with deep learning lies in exploring creative test-time compute strategies, including more nuanced pre-training and diverse benchmarking, to further unlock the potential of neural networks for complex reasoning.

For further insights, watch the full podcast: Link

This episode explores how MindsAI achieved the top score on the ARC challenge, revealing their innovative use of test-time fine-tuning and a unique voting mechanism within a deep learning framework.

Introduction to MindsAI's ARC Victory

  • Mohamed Osman, along with Jack Cole and Michael Hodel of MindsAI (now part of Twofer AI Labs), achieved the highest score on the ARC (Abstraction and Reasoning Corpus) challenge.
  • The team has been working on ARC for two years, emphasizing the benchmark's growing importance in the AI research community.
  • Muhammad: "We've always thought that this benchmark was going to get more important and more important and, you know, this is the case now."

Key Innovations: Test-Time Fine-Tuning

  • Test-time fine-tuning is presented as a novel approach, diverging from traditional deep learning by modifying model parameters during testing.
  • ARC is framed as a perceptual problem, requiring the model to dynamically adjust its understanding based on limited examples.
  • This method mirrors how deep learning tackles new perceptual tasks, applying the training process to each unique ARC puzzle at test time.
  • Mohamed notes that many of the top-10 leaderboard entries used similar ideas.

Solution-Based Prediction vs. Program Synthesis

  • MindsAI's approach uses solution-based prediction, contrasting with methods that generate intermediate Python functions.
  • The discussion highlights the inherent lack of compositionality in neural networks, a challenge addressed through specific training techniques.
  • Test-time fine-tuning and deep bias encoding are used to achieve a form of compositionality, though acknowledged as not entirely elegant.

The Role of Code in Pre-training

  • Pre-training with code enhances the model's contextualization ability, crucial for handling the diverse and novel ARC problems.
  • Code pre-training forces the model to be more precise and contextual, unlike natural language, where shortcuts are possible.
  • This approach aligns with research showing code pre-training improves reasoning across various domains.

Meta-Model Training and Forward Pass Prompting

  • The team trains a "meta-model" by prompting all inputs and outputs in a single forward pass, enhancing the model's contextualization.
  • This meta-model is pre-trained on various ARC riddles, learning to generalize from context rather than memorizing specific transformations.
  • Tuning this meta-model at test time is more efficient, requiring smaller adjustments to achieve correct reasoning.
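A minimal sketch of packing an entire riddle into one context; the field names follow the public ARC JSON format, but the prompt layout itself is an assumption rather than the team's actual template:

```python
def serialize_grid(grid):
    """Plain numbers-as-text rendering of a grid."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_meta_prompt(task):
    """Pack every demonstration pair plus the test input into a single context,
    so the transformation must be read from context rather than memorized."""
    parts = []
    for i, pair in enumerate(task["train"]):
        parts.append(f"example {i} input:\n{serialize_grid(pair['input'])}")
        parts.append(f"example {i} output:\n{serialize_grid(pair['output'])}")
    parts.append(f"test input:\n{serialize_grid(task['test'][0]['input'])}")
    parts.append("test output:")
    return "\n\n".join(parts)
```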

Model Architecture and Pre-training Details

  • The model starts with a pre-trained T5 (Text-to-Text Transfer Transformer) encoder-decoder model, chosen for its contextualization capabilities (a loading sketch follows this list).
    • T5 models are designed for a variety of text-based tasks, using an encoder-decoder structure to process and generate text.
  • The pre-training recipe includes code and synthetic ARC tasks, focusing on developing a dynamic, steerable model.
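A minimal sketch of starting from a pre-trained T5 encoder-decoder via Hugging Face Transformers; "t5-base" is a stand-in checkpoint, and the actual pre-training recipe (code plus synthetic ARC tasks) is not reproduced here:

```python
# Stand-in for the base model described: a pre-trained T5 encoder-decoder.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# The encoder contextualizes the serialized riddle; the decoder emits the answer text.
enc = tokenizer("0 1 2\n2 1 0", return_tensors="pt")
out = model.generate(**enc, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```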

Augment, Inference, Reverse, Vote

  • MindsAI employs a voting mechanism, leveraging the idea that there are many ways to be wrong but only one correct solution in ARC.
  • Beam search and other sampling methods generate multiple solution candidates, with a majority vote determining the final answer.
    • Beam search is a search algorithm that explores a graph by expanding the most promising nodes in a limited set.
  • Augmentation involves applying transformations to input puzzles, generating predictions, reversing the transformations, and voting on consistent solutions (a simplified sketch follows below).
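A simplified sketch of the augment / infer / reverse / vote loop; in practice the demonstration pairs would be transformed together with the test input, and beam-search candidates can feed the same vote. `predict_fn` is a placeholder for any model call that maps an input grid to an output grid:

```python
# Simplified sketch of augment -> infer -> reverse -> vote (not the team's code).
from collections import Counter

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def rotate270(grid):
    """Inverse of rotate90 (i.e., rotate 90 degrees counter-clockwise)."""
    return rotate90(rotate90(rotate90(grid)))

# Each augmentation is paired with its inverse; identity keeps the raw puzzle.
AUGMENTATIONS = [
    (lambda g: g, lambda g: g),
    (rotate90, rotate270),
]

def augment_infer_reverse_vote(test_input, predict_fn):
    """Predict under several transformed views, map predictions back, and vote."""
    votes = Counter()
    for forward, inverse in AUGMENTATIONS:
        candidate = predict_fn(forward(test_input))   # infer on the transformed puzzle
        restored = inverse(candidate)                  # undo the transformation
        votes[tuple(map(tuple, restored))] += 1        # hashable key for counting
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]
```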

Encoding and Representation of ARC Problems

  • ARC problems are encoded plainly, using numbers as text without any special formatting.
  • This approach avoids imposing biases, emphasizing the need for the model to flexibly interpret raw, novel problem representations.
  • Vision-language models (VLMs) are deemed unsuitable for ARC due to their fixed representations, which hinder flexibility.

Scaling Laws and Model Performance

  • The team observed scaling laws on the hidden ARC test set, indicating that model performance improves with size.
  • Discussion touches on potential information leakage from repeated testing on the hidden set, deemed minimal by the team.

Reflections on François Chollet's Perspective

  • François Chollet, the creator of ARC, is skeptical of test-time compute strategies, favoring neurally-guided program space search.
  • MindsAI critiques the limitations of approaches like DreamCoder, emphasizing the need for flexible perception and a broader output space.
    • DreamCoder is a system that learns to solve problems by synthesizing programs, guided by a neural network.

Future Directions at Twofer AI Labs

  • Twofer AI Labs, having acquired MindsAI, plans to focus on ARC and explore broader AI challenges, including compositionality.
  • The team aims to investigate different test-time compute methods and develop new benchmarks related to ARC.

Challenges and Patterns in ARC Performance

  • Counting tasks are identified as particularly challenging for neural networks, attributed to representational issues in transformers.
  • The discussion highlights the need to address fundamental architectural limitations to improve performance on tasks like counting and copying.

Conclusion

MindsAI's success on the ARC challenge highlights the potential of test-time fine-tuning and innovative prompting strategies for tackling complex reasoning problems. Researchers and practitioners should note the emphasis on model flexibility, contextualization, and the ongoing need to address architectural limitations in neural networks for improved reasoning capabilities.
