Machine Learning Street Talk
March 18, 2025

The Gap Between Humans and Machines Is ___

This episode of MLST features Max, a researcher at Cohere, discussing the nuances of reasoning in large language models (LLMs), the limitations of human feedback, and the importance of dynamic benchmarking. Max, who built Cohere's post-training team, offers insights into the development and evolution of Cohere's LLMs, from early instruction-following models to the more sophisticated Command R and R+.

Reasoning and Robustness in LLMs

  • "I think reasoning and robustness are intertwined. If you can reason, I expect correct reasoning to imply robust reasoning."
  • "If humans are consistently successful at getting the model to fail on examples that are slightly different, it's fair to assume the model isn’t reasoning, it's just figured out how to do well on a specific benchmark."
  • LLMs demonstrate characteristics of both pattern matching and reasoning, excelling at tasks they've seen frequently but sometimes struggling with novel situations.
  • True reasoning implies consistency. While humans are fallible, the expectation for machines is higher, requiring consistent accuracy, especially in simpler tasks.
  • Adversarial settings, where humans intentionally probe for weaknesses, are crucial for testing true reasoning abilities and robustness.

Limitations of Human Feedback

  • "Human preference judgment is strongly informed by formatting and style."
  • "The more assertive generations were generally perceived by humans to be more correct. So we have this clear example of something counter to the behavior we want, but also doing so in a way that's barely recognized."
  • Human feedback, while valuable, isn't a gold standard for LLM evaluation. Factors like style, formatting, and assertiveness can bias human judgment, even among trained annotators.
  • Granular feedback, focusing on specific error types rather than overall preference, offers a more nuanced understanding of model strengths and weaknesses.
  • Personalization of model behavior, adapting to individual preferences without retraining, presents an exciting direction for future development.

Dynamic Benchmarking and Evaluation

  • "We are almost caught in a local benchmarking optimum… given where the technology is today, what's the right way to evaluate?"
  • "The moment that the benchmarks don’t keep up is the point at which we need to either rethink everything from scratch, or maybe we just reached the point that the models are better than expert humans across all domains."
  • Traditional static benchmarks become quickly saturated, failing to capture the evolving capabilities of LLMs. Dynamic benchmarks, which adapt to model advancements, are essential.
  • Benchmarks should evolve, like exams in human education, to reflect the increasing complexity and specialization of AI systems.
  • The focus should shift from general benchmarks to task-specific evaluation, aligning with real-world applications and user needs.

Key Takeaways:

  • Robustness, not just accuracy, is crucial for LLMs to gain trust and widespread adoption.
  • Human feedback should be used strategically and granularly to mitigate biases and optimize model behavior effectively.
  • Dynamic, evolving benchmarks tailored to specific tasks are necessary to truly evaluate the evolving capabilities of increasingly sophisticated AI systems.

For further insights, watch the full podcast: Link

This episode explores the evolving intersection of AI and reasoning, revealing how new research challenges conventional views on model capabilities and human feedback, with significant implications for model training, evaluation, and the future of AI applications.

Max's Background and Research Interests

  • Max, a researcher at Cohere, discusses his focus on post-training, adversarial data collection, and model evaluation. He emphasizes a continuous feedback loop where model performance is evaluated, improved, and then re-evaluated, highlighting his interest in enhancing model reasoning and robustness.
  • Max states, "I'm particularly interested in getting models to reason, to operate more robustly, and to generally be more useful."
  • He defines post-training as the refinement of models after their initial large-scale training, focusing on specific capabilities like instruction following.
  • Adversarial data collection is described as a method to improve models by identifying and addressing their weaknesses through targeted data.

Challenging Assumptions About Reasoning

  • Max and the host discuss a recent study led by Laura Ruis at Cohere, which investigated how models learn procedural knowledge during pre-training. The research challenged the initial assumption that models primarily retrieve facts, revealing a more complex process for reasoning tasks.
  • The study used influence functions, a technique for approximating the impact of specific training examples on model behavior, to analyze how models handle factual versus reasoning queries (a simplified sketch follows this list).
  • Findings showed that for reasoning queries, models relied on information distributed across many documents, suggesting they combine procedural knowledge in novel ways.
  • Max notes, "It definitely gave the impression that the model was just relying on procedural knowledge that it had picked up from these various different sources and was potentially combining them in interesting ways."
  • Control questions, lexically similar but lacking reasoning requirements, further supported the conclusion that models engage in some form of reasoning.
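To make the influence-function idea concrete, here is a minimal sketch using a first-order approximation (the Hessian replaced by the identity) rather than the EK-FAC curvature estimates used in the actual study; the model, loss function, and data are placeholders you would supply.

```python
import torch

def flat_grad(loss, params):
    """Flatten the gradient of a scalar loss with respect to the model parameters."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, query, train_examples):
    """Approximate how much each training example influences a query.

    Uses the first-order approximation score(z, q) ≈ ∇L(q) · ∇L(z), i.e. the
    Hessian is dropped. The study discussed here uses EK-FAC to approximate
    the curvature instead, which changes the weighting but not the basic idea.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_query = flat_grad(loss_fn(model, query), params)
    scores = []
    for example in train_examples:
        g_train = flat_grad(loss_fn(model, example), params)
        scores.append(torch.dot(g_query, g_train).item())
    return scores  # higher score ≈ this example pushed the model toward this answer
```

Ranking pre-training documents by scores of this kind is what lets the study ask whether a reasoning answer draws on a few specific documents or on procedural knowledge spread across many.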

Technical Challenges and Model Size Implications

  • The discussion covers the technical challenges of using influence functions, particularly their computational cost. The team built upon work from Anthropic, using EK-FAC (Eigenvalue-corrected Kronecker-Factored Approximate Curvature) to scale curvature estimation, enabling analysis of larger models.
  • EK-FAC, originally introduced around 2018, is a method for approximating the curvature of the loss landscape, which is crucial for estimating the impact of individual training examples.
  • The research found little correlation between the documents that were influential for the 7B and 35B parameter models, suggesting different learning mechanisms at different scales (a toy version of this comparison is sketched below).
  • This lack of correlation implies that scaling up models doesn't just amplify existing capabilities; it may fundamentally alter how they learn and process information.
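As a toy illustration of the cross-scale comparison, one can rank the same documents by influence score under each model and check how well the rankings agree; the scores below are invented purely for illustration.

```python
from scipy.stats import spearmanr

# Influence scores for the same pre-training documents, computed separately with
# a 7B and a 35B model. The values here are made up for illustration only.
influence_7b  = {"doc_001": 0.92, "doc_002": 0.10, "doc_003": 0.55, "doc_004": -0.20}
influence_35b = {"doc_001": 0.05, "doc_002": 0.71, "doc_003": -0.33, "doc_004": 0.48}

docs = sorted(influence_7b)
rho, p_value = spearmanr([influence_7b[d] for d in docs],
                         [influence_35b[d] for d in docs])
print(f"Spearman rank correlation across scales: {rho:.2f} (p = {p_value:.2f})")
# A correlation near zero suggests the two scales are drawing on different documents.
```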

Defining and Achieving Robust Reasoning

  • Max defines reasoning as intertwined with robustness, asserting that correct reasoning should imply consistent and reliable performance. He uses the example of a model calculating the slope of a line, arguing that even infrequent errors call the model's true reasoning ability into question (a minimal consistency check is sketched after this list).
  • Max emphasizes that machines, unlike humans, don't have limitations like fatigue, setting a higher standard for consistent reasoning.
  • He states, "If a model does that 999 times out of a thousand but fails one out of those thousand times…to me it really starts to…bring into the question is this model actually reasoning."
  • This perspective highlights the need for AI models to demonstrate not just occasional reasoning but consistent, reliable application of learned principles.
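A minimal sketch of the kind of consistency check this implies, using the slope example; ask_model is a hypothetical stand-in for whatever model API is being tested.

```python
import random

def ask_model(question: str) -> str:
    """Hypothetical stand-in for a call to the model being evaluated."""
    raise NotImplementedError

def slope_consistency(trials: int = 1000) -> float:
    """Ask many randomly parameterised slope questions and return the success rate.

    A model that genuinely reasons about slope should be consistently correct
    across all of these, not just on a handful of benchmark instances.
    """
    correct = asked = 0
    for _ in range(trials):
        x1, y1, x2, y2 = (random.randint(-20, 20) for _ in range(4))
        if x1 == x2:
            continue  # skip vertical lines, where the slope is undefined
        asked += 1
        expected = (y2 - y1) / (x2 - x1)
        reply = ask_model(
            f"What is the slope of the line through ({x1},{y1}) and ({x2},{y2})? "
            "Reply with just the number."
        )
        try:
            correct += abs(float(reply) - expected) < 1e-6
        except ValueError:
            pass  # an unparseable reply counts as a failure
    return correct / asked
```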

The Rise of AI-First Applications

  • The host highlights the transformative potential of AI, citing the rapid development of AI-first applications as an "iPhone moment." The discussion shifts to test-time training, heralded as a new scaling law, and its implications for centralized versus distributed AI development.
  • Test-time training, also known as transductive active fine-tuning, involves fine-tuning a pre-trained model for a specific situation, creating a more specialized and efficient model (see the sketch after this list).
  • The host notes that this approach represents a more distributed paradigm of AI, contrasting with the centralized model of companies like Cohere.
  • Max anticipates a future with more distributed training, driven by increasing expertise and open access to model weights, like Cohere's command R models.
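A rough sketch of the transductive idea under stated assumptions: retrieve the training examples closest to the test input and fine-tune a copy of the model on just those before answering. The embed, fine_tune, and generate calls are placeholders, not any particular library's API.

```python
import copy
import numpy as np

def embed(texts):
    """Placeholder: map a list of texts to unit-normalised vectors with your encoder."""
    raise NotImplementedError

def fine_tune(model, examples, steps=20):
    """Placeholder: run a few gradient steps on the retrieved examples, return the model."""
    raise NotImplementedError

def transductive_answer(base_model, corpus, test_input, k=16):
    """Specialise a copy of the model to a single test input before answering it."""
    corpus_vecs = np.stack(embed([ex["text"] for ex in corpus]))  # (N, d)
    query_vec = embed([test_input])[0]                            # (d,)
    sims = corpus_vecs @ query_vec                                # cosine similarity if normalised
    neighbours = [corpus[i] for i in np.argsort(-sims)[:k]]       # k most relevant examples
    specialised = fine_tune(copy.deepcopy(base_model), neighbours)
    return specialised.generate(test_input)                       # hypothetical generate() method
```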

AI Alignment and Human Feedback Challenges

  • The conversation turns to AI alignment, with Max emphasizing the complexity of defining what humans want and expect from AI systems. He highlights the need for basic safeguards while acknowledging the societal challenges of aligning AI with diverse and evolving human values.
  • Max questions, "What do we want to align AI systems to?" pointing out the ambiguity and variability of human preferences.
  • He notes that aligning AI is not just a technical problem but also a societal one, requiring consideration of diverse perspectives and contexts.
  • The discussion underscores the importance of ensuring AI remains safe for general use while continuing to advance its capabilities.

Human Feedback is Not a Gold Standard

  • Max discusses his paper, "Human Feedback is Not a Gold Standard," co-authored with Tom and Phil. The paper critiques the reliance on human preference as the primary metric for training and evaluating AI models, particularly in the context of reinforcement learning from human feedback (RLHF).
  • RLHF is a technique where models are fine-tuned based on human preferences for different outputs, typically using a binary comparison between two completions (a minimal reward-model loss is sketched after this list).
  • The research found that human preference judgments are strongly influenced by factors like formatting, style, and assertiveness, often overshadowing crucial aspects like factuality.
  • Max explains, "Human preference judgment is strongly informed by formatting and by style…attributes or criteria like factuality were quite lowly ranked in terms of their contribution to overall quality."
  • The study revealed that models prompted to be more assertive were perceived as more correct, even when they were less factual, highlighting a significant bias in human evaluation.
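For context, the binary comparison described above is typically turned into a training signal via a Bradley–Terry style reward-model loss; here is a minimal PyTorch sketch, assuming a reward model that scores batches of completions.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    """Bradley–Terry pairwise loss commonly used for RLHF reward modelling.

    The reward model scores each completion; the loss pushes the chosen
    completion's score above the rejected one's. Because the labels are human
    preference judgments, any stylistic or assertiveness bias in those
    judgments is baked directly into the reward signal.
    """
    r_chosen = reward_model(chosen_batch)      # shape: (batch,)
    r_rejected = reward_model(rejected_batch)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

This is why the paper's finding matters for training, not just evaluation: whatever systematically sways annotators also sways the optimized reward.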

Addressing Irrationality and Ambiguity in Human Feedback

  • The discussion explores the implications of human irrationality and cognitive limitations in evaluating AI. The host raises the challenge of scaling feedback to complex tasks, like comparing entire books, where ambiguity increases exponentially.
  • Max acknowledges the potential for creating an overwhelming number of attributes to capture human preferences, suggesting a focus on personalized model behavior.
  • He proposes the concept of a "portrait of preference," a data-driven representation of an individual's interaction style and preferences, enabling on-the-fly model customization without retraining (a toy version is sketched below).
  • Max envisions a future where a single, robust model can adapt to individual needs based on this personalized data, avoiding the impracticality of training separate models for everyone.
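One way to read the "portrait of preference" idea is as context the model conditions on at inference time rather than weights it is retrained on. The sketch below is purely illustrative; the attribute names and preamble format are invented, not Cohere's design.

```python
# Illustrative only: a stored preference profile turned into a per-request preamble.
user_portrait = {
    "tone": "concise and direct",
    "formatting": "bulleted lists over long paragraphs",
    "background": "experienced Python developer",
    "avoid": ["unnecessary caveats", "restating the question"],
}

def build_preamble(portrait: dict) -> str:
    """Turn a stored preference profile into a system preamble, with no retraining."""
    lines = ["Adapt your answers to this user:"]
    lines.append(f"- Preferred tone: {portrait['tone']}")
    lines.append(f"- Preferred formatting: {portrait['formatting']}")
    lines.append(f"- Background: {portrait['background']}")
    lines.append("- Avoid: " + ", ".join(portrait["avoid"]))
    return "\n".join(lines)

print(build_preamble(user_portrait))
```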

Prism: Understanding Diverse Human Preferences

  • Max introduces Prism, a project led by Hannah Kirk, which won a best paper award at NeurIPS. Prism investigates how demographics, culture, and other factors influence human preferences for model behavior at scale.
  • Prism collects and analyzes a large-scale dataset of conversations and preference ratings, revealing insights into the diversity of human preferences and their impact on model evaluation.
  • The research highlights how even the topics people discuss with models are influenced by their backgrounds, potentially leading to biases in model development.
  • Max cites an example where conversations about the Israel-Palestine conflict were predominantly from users in the Middle East, emphasizing the need for representative data.
  • Prism demonstrates that model rankings can change significantly depending on who is providing the feedback, underscoring the limitations of a single, universal ranking system.

Adversarial Examples and Robustness

  • The conversation shifts to adversarial examples, with Max discussing the paper "Adversarial Examples Are Not Bugs, They Are Features" (Ilyas et al.). He emphasizes the importance of human-informed adversarial examples for improving model robustness.
  • Max explains that traditional adversarial examples, generated by functions like injecting Gaussian noise, often lead to models learning to counteract the specific function rather than addressing real-world noise.
  • He advocates for human-informed adversarial examples, where humans interact with models to identify weaknesses and create more complex and representative training data.
  • Max references his work on Adversarial QA, a project from 2019, which demonstrated that adversarial examples collected against weaker models could significantly improve the robustness of stronger models.

DynaBench: Dynamic Benchmarking for Robustness

  • Max discusses DynaBench, a platform for dynamic, adversarial data collection and benchmarking. DynaBench aims to address the limitations of static benchmarks, which often lead to models optimizing for specific metrics rather than achieving genuine robustness.
  • DynaBench is a research platform that facilitates model-in-the-loop adversarial data collection, where models are continuously challenged and improved based on human interaction (the core loop is sketched after this list).
  • The platform emphasizes dynamic benchmarking, where benchmarks evolve over time to reflect the current capabilities of models and prevent optimization for specific, outdated metrics.
  • Max highlights the problem of Goodhart's Law, where a measure becomes the target and ceases to be a good measure, as models optimize for the benchmark rather than the underlying capability.
  • DynaBench has powered various research projects, including work on hate speech detection, sentiment analysis, and improving the quality of pre-training data.
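Reduced to its core, the model-in-the-loop collection that DynaBench facilitates looks roughly like the loop below; the annotator stream, prediction interface, and validation step are placeholders.

```python
def model_in_the_loop_round(model, annotators, validate, target_size=1000):
    """Collect human-written examples that fool the current model.

    Each annotator writes an example and immediately sees the model's
    prediction; only examples that are both model-fooling and validated by
    other humans enter the new benchmark round. The next round's model is
    trained on this data and the cycle repeats, so the benchmark keeps moving
    with the models rather than saturating.
    """
    collected = []
    while len(collected) < target_size:
        example = next(annotators)                    # placeholder: human supplies {"input", "label"}
        prediction = model.predict(example["input"])  # hypothetical predict() interface
        if prediction != example["label"] and validate(example):
            collected.append(example)                 # a verified, model-fooling example
    return collected
```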

Rethinking Benchmarking for LLMs

  • The discussion explores the need for new benchmarking approaches for LLMs, considering their evolving capabilities. Max suggests drawing inspiration from human education systems, which use a hierarchical taxonomy of skills and progressively complex examinations.
  • Max argues that current benchmarks often fail to capture the full breadth of LLM capabilities, leading to potentially misleading evaluations.
  • He proposes a system of examinations designed to test models, potentially inspired by human testing but adapted to the unique characteristics of AI.
  • Max emphasizes the importance of dynamic benchmarks that keep pace with technological advancements, ensuring that evaluations remain relevant and challenging.

DataPerf: Focusing on Data-Centric AI

  • Max introduces DataPerf, a set of challenges run through the DynaBench platform, emphasizing the importance of data quality in AI development. DataPerf aims to shift focus back to the data used to train models, recognizing its central role in achieving better, more capable AI systems.
  • DataPerf highlights the data-centric approach, emphasizing that the quality and relevance of training data are crucial for model performance.
  • The challenges aim to promote improvements in data collection, curation, and utilization, recognizing that data is often the bottleneck in AI development.
  • Max notes that DataPerf emerged during a period when there was more focus on algorithms than on data, underscoring the need to re-emphasize data's importance.

Max's Work at Cohere: Post-Training and Command Models

  • Max describes his work at Cohere, focusing on post-training and the development of command models. He joined Cohere before the launch of ChatGPT, a period when language models were less refined and instruction-following capabilities were limited.
  • Max built out the post-training team at Cohere, focusing on improving models' ability to follow human instructions.
  • He describes an early initiative to collect instruction-following data internally, where employees provided questions and ideal responses, exceeding initial expectations.
  • This effort led to the first generation of command models, demonstrating improved instruction-following capabilities.
  • Cohere then moved to training a new model every week for a year, continuously improving performance and engaging with the developer community.
  • The latest generation of models, Command R and R+, achieved high rankings on various benchmarks, demonstrating significant advancements in capabilities.

Quantization and Its Trade-offs

  • The discussion turns to quantization, a technique for reducing model precision to improve efficiency. Max acknowledges its usefulness as a short-term solution but highlights potential blind spots in evaluation.
  • Quantization involves reducing the numerical precision of model parameters, making them smaller and faster to process (see the sketch after this list).
  • Max notes that the objective of quantization is often to maintain performance on existing evaluations while gaining efficiency, potentially accepting a small drop in performance.
  • He warns that quantization might degrade performance in areas not captured by current evaluations, such as complex reasoning or long-context dependencies.
  • Max emphasizes the incompleteness of current evaluations, highlighting the risk of optimizing for measurable metrics while sacrificing unmeasured capabilities.
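For reference, a minimal sketch of what post-training weight quantization does, using simple symmetric absmax int8 rounding; production schemes are more sophisticated, but the nature of the precision loss is the same.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric absmax quantization: map float weights onto 255 integer levels."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32) * 0.02   # toy weight vector
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute rounding error: {error:.2e}")
# The per-weight error is tiny, but whether it matters downstream depends
# entirely on what your evaluations actually measure.
```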

Reasoning: Synthesis vs. Execution

  • The host and Max delve into the nature of reasoning, distinguishing between the synthesis of rules and their execution. Max suggests that while synthesis is a requirement for true reasoning, execution can be rigorously tested.
  • The host uses the example of multiplication or long division, where humans can execute rules even with occasional mistakes, but also have the ability to synthesize those rules.
  • Max highlights work on mechanistic interpretability and test-time computation, where models generate reasoning chains, as areas of ongoing exploration.
  • He emphasizes the adversarial setting as a way to probe for specific reasoning capabilities, revealing limitations in current benchmarks.
  • Max argues that if a model consistently fails on slightly modified examples, it suggests a lack of true reasoning, even if it performs well on the original benchmark.

The Standard for Reasoning in AI

  • Max argues for a high standard for reasoning in AI, emphasizing that machines should be consistently correct, unlike humans who are prone to errors. He draws a parallel to traditional software, where users expect consistent and reliable results.
  • Max states, "The expectation of machines is consistently their right all the time…it's a function of traditional software."
  • He uses the example of a calculator, where inconsistent results would lead users to discard it, highlighting the higher expectations for machine accuracy.
  • Max acknowledges that the standard might be relaxed for more complex tasks, but for simpler tasks like grade school math, models should demonstrate near-perfect consistency.

Pattern Matching vs. Reasoning in LLMs

  • The discussion returns to the interplay of pattern matching and reasoning in LLMs. Max describes how models sometimes retrieve facts directly from documents, while other times engaging in more distributed reasoning processes.
  • The host highlights the example of asking "What is the color of the thing where the clouds are?" as a combination of pattern matching and reasoning.
  • Max suggests that models, like humans, likely learn simpler tasks first and then assemble them into more complex functions.
  • He notes that it's inefficient to reason if you don't need to, citing the example of a model asked "What is 1+1?" which it has likely seen many times in pre-training.
  • Max emphasizes that as long as the solution is correct, the method (retrieval or reasoning) doesn't necessarily matter.

The Arc Challenge and Future Directions

  • Max discusses the Arc challenge as another step in pushing the limits of current models. He emphasizes the need to ground future challenges in real-world applications and societal needs.
  • The Arc challenge (Abstraction and Reasoning Corpus) is a benchmark designed to test abstract reasoning and generalization from very few examples.
  • Max anticipates that future challenges will focus on how AI systems can operate in society, interact with humans, and benefit humanity.
  • He suggests that the key questions will shift from purely technical capabilities to practical applications and societal impact.

Connectionism and the Future of AI

  • The host asks whether connectionism can achieve reliable, robust reasoning, or if hybrid architectures are needed. Max is largely agnostic, seeing no immediate architectural bottlenecks while acknowledging limitations in current approaches.
  • Max cites the example of glitch tokens, where certain tokens cause unusual behaviors in models due to mismatches between tokenizer training data and model training data.
  • He sees this as a limitation of the current way of doing things, not necessarily a fundamental flaw in connectionism.
  • Max believes that larger, deeper models will continue to achieve more impressive results, and the limitations of connectionism will become apparent only when progress plateaus.
  • He highlights the efficiency gains and parallelism of transformers as key drivers of recent progress.

The Importance of Context Window Size

  • The discussion covers the importance of context window size in LLMs. Max notes the challenges of maintaining high performance across large context windows and the trade-offs between context length and efficiency.
  • Max describes the recent trend of increasing context window sizes, with Cohere's models moving from 4K to 128K tokens.
  • He acknowledges that while most user queries don't require long contexts, they are useful for tasks like retrieval-augmented generation, tool use, and long conversations.
  • Max highlights the potential of long contexts for personalization, where the entire conversation with the model represents the user's preferences.
  • He notes that long contexts are particularly valuable for working with code, enabling models to process entire codebases.
  • Max questions the point of continuing to feed things into the context if it doesn't improve the answer, emphasizing the need for efficiency.

Reasoning Models and Test-Time Computation

  • Max expresses interest in reasoning models and the idea of trading off test-time computation with performance. He describes ongoing work at Cohere exploring how models interact with users and how to make reasoning processes more controllable and customizable.
  • Max highlights the potential of allowing models to generate more tokens during reasoning to improve performance.
  • He envisions a future where models have well-calibrated confidence scores, allowing users to specify the desired level of certainty for different tasks (a simple confidence-thresholded sampling loop is sketched after this list).
  • Max suggests that for critical tasks, users might instruct the model to think as much as needed to ensure correctness, while for less critical tasks, a lower confidence threshold might be acceptable.
  • He emphasizes the importance of user interaction and customization in shaping model behavior.
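One simple way to trade test-time computation for confidence, in the spirit of this discussion, is self-consistency style sampling: keep sampling reasoning chains and use the agreement of their final answers as a crude confidence proxy. The sample_answer call is a hypothetical stand-in for a model API.

```python
from collections import Counter

def sample_answer(prompt: str) -> str:
    """Hypothetical: sample one reasoning chain and return only its final answer."""
    raise NotImplementedError

def answer_with_confidence(prompt: str, target_confidence: float = 0.9, max_samples: int = 32):
    """Spend more compute on harder questions by sampling until answers agree.

    The agreement rate of the majority answer serves as a crude confidence
    proxy: critical tasks can demand a high target, cheap tasks a lower one.
    """
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer(prompt)] += 1
        answer, count = votes.most_common(1)[0]
        if n >= 3 and count / n >= target_confidence:
            return answer, count / n   # confident enough: stop spending compute
    return answer, count / n           # budget exhausted: return best guess and its agreement
```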

Reflective and Strategic Conclusion

This episode reveals the evolving landscape of AI, where reasoning and robustness are becoming central to model development. For AI investors and researchers, the key takeaway is the need to move beyond traditional benchmarks toward dynamic, human-informed evaluations that capture the nuances of real-world applications and diverse human preferences, anticipating a shift towards personalized, adaptable AI systems.
