This episode reveals why AI's true adoption barrier isn't performance but trust, exploring the critical shift towards robust behavioral testing for generative AI systems and its profound implications for building reliable AI infrastructure.
The Evolving Definition of AI and Machine Learning
- Machine Learning (ML) often refers to AI techniques that have become established and relatively easy to implement.
- AI tends to encompass the newer, cutting-edge developments that still feel like "magic." Once a technology matures, it often gets reclassified as ML.
- Scott references the Dartmouth Conference (the 1956 workshop considered the founding event of AI) and notes that capabilities like spell-check were once considered AI, a notion that seems outdated today.
- He notes that Generative AI represents a significant shift. Unlike earlier systems focused on classification or regression, generative models are more interactive and capable of producing a wider range of outputs, fundamentally changing their application potential.
From Performance Optimization to Trust: A Founder's Journey
- Scott shares his journey, starting with his first startup, SigOpt, which originated from his PhD research about 10-15 years ago.
- SigOpt focused on optimizing traditional ML systems: tuning XGBoost models (a popular gradient-boosting library), convolutional neural networks (widely used in image recognition), and reinforcement learning algorithms (where agents learn by trial and error to maximize rewards). The goal was peak performance.
- After selling SigOpt to Intel in 2020 and leading their AI and HPC division, Scott realized he was "solving the wrong problem." He states, "the thing that's holding back people getting value from these AI systems is not performance... It's about being able to confidently trust these systems."
- This realization stemmed from clients repeatedly asking what might have "broken" or what "bad behaviors" were introduced during optimization, fearing overfitting. This concern is resurfacing with LLMs (Large Language Models), where a focus on high-level performance metrics can mask undesirable internal behaviors.
- Strategic Implication for Crypto AI: For decentralized AI systems, where direct oversight is limited, establishing trust through verifiable behavior and transparent testing mechanisms will be paramount for adoption and investment.
The New Challenge: Understanding and Testing AI Behavior
- Scott explains that his new company, Distributional, aims to address this confidence gap through comprehensive testing, rather than just performance optimization.
- The problem is harder with generative AI because outputs are often free-form text or actions from agentic systems (AI systems that can perform tasks autonomously).
- His experience at Intel, managing a larger team and customer base, provided a higher-level perspective on enterprise frustrations with AI reliability and consistency. The core question became: "how do I sleep at night effectively?"
- This led him to build Distributional, focusing on tools that help domain experts build and deploy AI more confidently.
- Crypto AI Researcher Focus: Research into zkML (Zero-Knowledge Machine Learning) or other privacy-preserving verification methods could be crucial for testing AI behavior in trust-minimized environments, a key concern for crypto applications.
Persistence and Empathy in Solving AI's Core Problems
- The host notes Scott's rare persistence in tackling a similar core problem across multiple ventures. Scott views this as a strength:
- This long-term engagement allows for deep empathy with AI builders, recognizing recurring patterns and mistakes.
- The ethos of both SigOpt and Distributional is to empower domain experts, allowing them to focus on their core expertise without being bogged down by the underlying complexities of AI systems.
- He observes that many individuals who built the initial ML systems are now leading the charge in developing and productionizing Generative AI platforms.
Generative AI vs. Traditional ML: A Shift in Atomicity
- Scott distinguishes between traditional ML and generative AI based on the "atomicity" of their units:
- Traditional ML often focused on specific, atomic decisions: loan approvals, stock predictions.
- Generative AI is more collaborative and expansive, capable of conversations or complex, multi-step processes involving models calling other models or services.
- While such pipelines existed, the end-to-end, hands-off nature of modern generative systems, and how behaviors propagate, is fundamentally different.
- Actionable Insight for Investors: Projects building complex, multi-agent AI systems on crypto rails must demonstrate robust inter-agent communication protocols and methods to trace and mitigate cascading behavioral issues.
Product Owners' Concerns: Beyond Performance
- For product owners, concerns with new AI systems extend beyond performance, which was already a familiar factor in traditional ML.
- The primary shift is the increased importance of managing and preventing undesired behaviors. This could range from avoiding drastic, unannounced changes to explicitly preventing bias or specific types of responses.
- Scott emphasizes, "it's more than just performance too. It's this behavior of these systems and it's making sure that not only does it hit whatever KPI you want it to do but it also doesn't do the bad thing."
- The sheer size of the potential output space in generative AI makes this significantly more challenging.
The Three Core Difficulties in Quantifying AI Behavior
- Scott identifies three main challenges in understanding and quantifying the behavior of modern AI systems:
- 1. Non-determinism: The same input can produce different outputs. This isn't just about slight variations; AI systems can be chaotic, where minor input changes lead to vastly different results (a minimal sketch of measuring this follows this list).
- 2. Non-stationarity: The systems are constantly changing "underneath you." This could be due to LLM provider infrastructure updates, changes in upstream data sources like a vector database (a database optimized for storing and querying vector embeddings, often used in RAG systems), or modifications to retrieval prompts.
- 3. Increasing Complexity: Systems are no longer atomic units but complex chains (e.g., retrieval -> generation -> further processing -> autonomous decision). Issues like non-determinism and non-stationarity can propagate and amplify through these chains.
- Focusing only on the final output makes it hard to pinpoint where, when, and why behaviors are shifting upstream.
- Crypto AI Implication: For AI models governed by DAOs or operating on decentralized infrastructure, non-stationarity due to diverse node behaviors or evolving governance proposals needs careful monitoring and adaptive testing frameworks.
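To make the non-determinism point concrete, here is a minimal sketch, in Python, of sampling the same prompt repeatedly and summarizing how much the outputs vary. The `call_model` function is a hypothetical placeholder for whatever LLM API is in use, and the string-similarity metric is an illustrative assumption rather than Distributional's method.

```python
import itertools
from difflib import SequenceMatcher

def call_model(prompt: str) -> str:
    """Placeholder for any LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError

def output_variability(prompt: str, n_samples: int = 10) -> dict:
    """Sample the same prompt repeatedly and summarize output spread."""
    outputs = [call_model(prompt) for _ in range(n_samples)]
    # Pairwise string similarity as a crude proxy for output agreement.
    sims = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in itertools.combinations(outputs, 2)
    ]
    lengths = [len(o.split()) for o in outputs]
    return {
        "mean_pairwise_similarity": sum(sims) / len(sims),
        "min_pairwise_similarity": min(sims),
        "length_range": (min(lengths), max(lengths)),
    }

# A low mean similarity or a wide length range on a fixed prompt signals
# that downstream tests cannot rely on single-sample comparisons.
```

In practice one would likely use embedding-based similarity rather than raw string matching, but the point stands: non-determinism has to be measured as a distribution over repeated runs, not a single run.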
Trust as the Cornerstone of AI Acceptance
- The conversation underscores that trust in AI systems is becoming more critical than raw performance.
- Trust encompasses reliability, consistency, and ensuring that "latent behaviors are aligned with my desires."
- For enterprises, this means aligning AI applications with business values and specific goals, not just optimizing for narrow metrics like click-through rates.
- Scott emphasizes the enterprise's need to trust its AI systems to maintain customer trust: "it's important to trust, but also to verify. And that's where testing comes in."
- Actionable Insight for Researchers: Developing standardized benchmarks and methodologies for "trustworthiness" in AI, particularly for systems interacting with smart contracts or financial assets, is a vital research area.
Distributional: Testing AI Behavior in Production
- Scott elaborates on Distributional's approach: an enterprise platform for testing AI applications in production to ensure they behave as expected.
- "Behavior" includes not just the output, but how it's produced: toxicity, reading level, tone, length of the answer.
- For RAG (Retrieval Augmented Generation) systems – which combine LLMs with external knowledge retrieval – this includes what was retrieved, retrieval frequency, and document timestamps.
- In agentic systems, it involves analyzing reasoning steps: duration, number of steps.
- Scott clarifies, "I definitely don't want to say that performance doesn't matter because it definitely does... but it is fundamentally this very high-level bit and it can mask all of these underlying latent behaviors."
- Catching these latent behaviors early is crucial due to the non-stationary and chaotic nature of AI. Some detected behaviors might even become new performance metrics.
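As a rough illustration of what "behavior beyond the output" could look like in code, the sketch below reduces a single logged RAG or agent interaction to a small set of behavioral properties (answer length, a crude reading-level proxy, retrieved-document age, agent step counts). The record schema and the specific properties are assumptions made for illustration, not Distributional's actual representation.

```python
from datetime import datetime, timezone

def behavior_properties(record: dict) -> dict:
    """Reduce one logged RAG/agent interaction to behavioral features.

    `record` is assumed to look like:
    {
      "answer": str,
      "retrieved_docs": [{"timestamp": "2024-01-31T00:00:00+00:00"}, ...],
      "agent_steps": [{"duration_s": 1.2}, ...],
    }
    """
    answer = record["answer"]
    words = answer.split()
    sentences = max(answer.count(".") + answer.count("?") + answer.count("!"), 1)
    now = datetime.now(timezone.utc)
    doc_ages_days = [
        (now - datetime.fromisoformat(d["timestamp"])).days
        for d in record.get("retrieved_docs", [])
    ]
    return {
        "answer_words": len(words),
        "words_per_sentence": len(words) / sentences,  # crude reading-level proxy
        "docs_retrieved": len(doc_ages_days),
        "oldest_doc_age_days": max(doc_ages_days, default=0),
        "agent_steps": len(record.get("agent_steps", [])),
        "agent_seconds": sum(s["duration_s"] for s in record.get("agent_steps", [])),
    }
```

Each logged request becomes one row of such properties; shifts in their distributions, not any single value, are what the later sections describe testing.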
The "How" Matters: AI's Bedside Manner
- The discussion likens AI behavior to human interaction: it's not just what you say, but how you say it, and what you ultimately do.
- Scott cites classic examples of "reward hacking" in reinforcement learning, like an AI baseball pitcher running the ball to the catcher, or a chess AI rewriting the board state to win. These systems achieve the stated performance metric (ball in mitt, game won) but through undesirable behaviors (not pitching, cheating).
- Crypto AI Consideration: In on-chain AI agents, "cheating" could mean exploiting smart contract loopholes or manipulating oracle data. Robust behavioral guardrails and detection systems are essential.
The Shift to Centralized GenAI Platforms and Shadow AI
- A significant trend observed is enterprises moving from scattered "science project" AI initiatives to centralized GenAI platforms.
- This mirrors the earlier evolution of ML/AI platforms, moving beyond individual data scientists using tools like scikit-learn (a foundational open-source machine learning library in Python) to more structured, scalable solutions.
- Centralization helps manage resources, ensure cost allocation, and, critically, combat "Shadow AI" (the unauthorized use of AI models or mishandling of data). Scott notes this is worse with generative AI because "everybody's doing it," unlike earlier ML, where specialized knowledge was required; the ease of obtaining an API key makes misuse simpler.
- These centralized platforms become ideal points for integrating testing, providing a holistic view of application behavior.
- Actionable Insight for Investors: Companies providing tools for secure, compliant, and observable GenAI platform deployment within enterprises are addressing a critical market need. For decentralized counterparts, similar "gateway" or "router" concepts might emerge to manage access and monitor behavior across a network.
Building Enterprise GenAI Platforms: Practical Steps
- For technology executives grappling with proliferating AI projects, Scott outlines practical steps for building a centralized platform:
- 1. Provide Value-Add Services: To entice developers away from bespoke stacks, platforms should offer benefits like automated scaling, cost optimization, and centralized logging.
- 2. Centralized Router/Gateway: This often comes first, sometimes as a mandate ("this is the only way we're going to let you access these things"). It allows enterprises to control which models are used (e.g., approving OpenAI but not a less vetted model like "hosted DeepSeek") and manage versions. Scott mentions enterprises supporting up to 30 different models and their versions due to varying trade-offs (cost, context windows, rate limits).
- 3. Support for Fine-Tuned Models and SLMs: Platforms also need to accommodate fine-tuned models, SLMs (Small Language Models), and static-weight models.
- 4. Logging: Once a gateway is in place, logging API requests and traces is the next logical step, leveraging existing data stores if possible (a minimal gateway-plus-logging sketch follows this list).
- 5. Analytics and Testing: With logs, analytics, monitoring, and behavioral testing can be layered on top, providing value back to developers "off the shelf."
- Crypto AI Researcher Focus: Designing decentralized logging and auditing trails that are both transparent and privacy-preserving (perhaps using zk-proofs) is a key challenge for on-chain AI platforms.
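To ground steps 2 and 4, here is a minimal sketch of an internal gateway: it checks requests against an allow-list of approved models, forwards them through a placeholder `forward_to_provider` function, and appends each request/response pair to a log so that analytics and behavioral testing can be layered on later. The model names, log format, and forwarding function are illustrative assumptions, not any specific enterprise's setup.

```python
import json
import time
import uuid

# Only models the enterprise has vetted are allowed through the gateway.
APPROVED_MODELS = {"gpt-4o", "gpt-4o-mini", "internal-finetune-v3"}

def forward_to_provider(model: str, prompt: str) -> str:
    """Placeholder for the actual provider SDK or self-hosted model call."""
    raise NotImplementedError

def gateway_call(model: str, prompt: str, team: str, log_path: str = "gateway.log") -> str:
    if model not in APPROVED_MODELS:
        raise PermissionError(f"Model '{model}' is not approved for use.")
    start = time.time()
    response = forward_to_provider(model, prompt)
    entry = {
        "request_id": str(uuid.uuid4()),
        "team": team,                 # enables per-team cost allocation
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_s": round(time.time() - start, 3),
        "ts": time.time(),
    }
    with open(log_path, "a") as f:    # stand-in for a real log store
        f.write(json.dumps(entry) + "\n")
    return response
```

A real gateway would add authentication, rate limiting, and streaming, but even this skeleton shows why it is the natural choke point for both governance and observability.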
Misaligned Incentives and the Platform Owner's Role
- Scott highlights the challenge of misaligned incentives:
- AI labs like OpenAI aim for general-purpose models.
- Businesses need models solving specific problems effectively.
- Developers want rapid integration.
- The enterprise platform owner acts as a crucial intermediary, needing to provide access to cutting-edge tools while ensuring ease of use, compliance, and alignment with enterprise needs. This involves a technical challenge of creating a "universal puzzle piece adapter."
- Strategic Implication: Crypto AI platforms might face similar incentive misalignments between model creators, application developers, and token holders/governors. Clear governance and economic models are vital.
Holistic Behavioral Testing: Beyond Vibe Checks
- Scott argues for a shift from superficial "vibe checks" or limited test cases to holistic behavioral testing at scale.
- Many firms are stuck in the "AI confidence gap," hesitant to scale prototypes because behavior changes unpredictably with new users or data. Scott quotes a common concern: "when I turn on the fire hose, I have no idea what's going to happen."
- The mindset needs to shift from "does it work the way I want when I look at it?" to "does it work more holistically?"
- Actionable Insight for Crypto AI: For AI systems interacting with valuable crypto assets, "vibe checks" are insufficient. Investors should demand evidence of rigorous, scalable behavioral testing before committing capital.
Consequences of Inadequate Testing
- Scott shares examples of issues arising from insufficient testing:
- A firm kept adding data to a RAG system's corpus, believing "more data is better," and inadvertently degraded the retrieval mechanism: responses that had previously drawn on good, recent documents were replaced by answers based on irrelevant old data.
- Hallucination remains a problem, where systems invent evidence or misinterpret user intent, leading to poor user experiences.
- Users triggering guardrails unexpectedly due to subtle changes in intermediate system components transforming their input.
- Crypto AI Risk: In a crypto context, a poorly tested RAG system feeding an AI agent could lead to disastrous financial decisions based on outdated or irrelevant blockchain data.
Atomic Quantification, Holistic Testing
- The ideal testing approach involves quantifying behavior at an atomic level (e.g., retrieval step characteristics) but testing across that behavior holistically, like a regression test.
- This means not just looking at the end answer but understanding how input changes propagate through the entire system.
- Scott's team at Distributional has developed sophisticated mathematical approaches to this. Instead of a few strong estimators of performance (A is better than B), they use many weak estimators to determine whether A is different from B (see the sketch after this list).
- These "higher entropy" weak estimators can reveal subtle behavioral shifts that might precede performance degradation, enabling root cause analysis.
- Researcher Opportunity: Developing novel statistical methods or machine learning techniques to detect subtle behavioral drift in decentralized AI networks, where data is often noisy and distributed, is a valuable research direction.
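A minimal sketch of the "many weak estimators" idea, assuming the behavioral properties from the earlier sketch have been collected for a baseline window and a current window: run an inexpensive two-sample test per property and report which ones look different. The per-feature Kolmogorov-Smirnov test and the alpha threshold are illustrative choices, not Distributional's actual statistics.

```python
from scipy.stats import ks_2samp

def behavioral_diff(baseline: dict, current: dict, alpha: float = 0.01) -> dict:
    """Compare two windows of behavioral properties, feature by feature.

    `baseline` and `current` map property names (e.g. "answer_words",
    "oldest_doc_age_days") to lists of per-request values.
    Returns the properties whose distributions appear to have shifted.
    """
    shifted = {}
    for prop, base_values in baseline.items():
        cur_values = current.get(prop, [])
        if len(base_values) < 2 or len(cur_values) < 2:
            continue
        stat, p_value = ks_2samp(base_values, cur_values)
        if p_value < alpha:           # weak signal: "A is different from B"
            shifted[prop] = {"ks_stat": round(stat, 3), "p_value": p_value}
    return shifted

# Many such weak, per-property signals together form a fingerprint of a
# behavioral shift and point at where to begin root-cause analysis.
```

No single test here says whether the system got better or worse; collectively, though, they flag that something changed and in which part of the behavior, which is the precursor to applying preferences or digging into root causes.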
Characterizing Distributions for AI Systems
- The company name, Distributional, reflects their core idea: thinking about AI behavior in terms of distributions.
- It's not about a single bad input, but how the overall "distributional fingerprint of behavior" (outputs and the entire process to get there) shifts over time compared to a baseline.
- This contrasts with the common AI discussion of "in-distribution" vs. "out-of-distribution" data relative to training sets. Distributional aims to characterize the actual operating distributions of a deployed system.
- Crypto AI Implication: For AI models trained on or interacting with volatile crypto market data, continuously characterizing and monitoring shifts in their operational output distributions is crucial for risk management.
Benefits for Enterprises: Confidence and Risk Mitigation
- The ultimate benefit of robust AI testing is increased confidence, enabling enterprises to:
- Tackle harder, more valuable problems that often carry higher inherent risk (financial, reputational, regulatory).
- Move beyond low-hanging fruit like internal HR chatbots.
- Define reliability, starting with detecting change ("is today different than yesterday?") and then applying preferences to those changes.
- Manage the "tech debt" that accrues in AI systems, such as deciding whether to switch to a cheaper model or refactor complex system prompts, by understanding the behavioral trade-offs.
- Actionable Insight for Investors: Crypto AI projects that can demonstrate a mature approach to behavioral testing and risk mitigation for high-value use cases will be more attractive investments.
System Prompts as a Reflection of Organization: Conway's Law for AI
- A fascinating observation is that system prompts often reflect the organization that created them, a new form of Conway's Law (which states that organizations design systems that mirror their own communication structure).
- Scott notes how prompts can become a "combinatorial mess" as different teams (engineering, compliance, marketing) add their requirements, sometimes conflictingly.
- Modifying these complex prompts without understanding the behavioral implications is risky, especially once a system is generating value.
- A robust testing system allows enterprises to make these changes with more confidence.
- Crypto AI Consideration: In decentralized AI, "system prompts" might be encoded in smart contracts or governance proposals. Analyzing how these "prompts" evolve and their behavioral impact will be key.
Co-evolution of AI Labs and Enterprise Needs & The Rise of AI Ops
- Scott foresees a co-evolution:
- AI research labs will continue to push boundaries but will also adapt to enterprise user needs to generate revenue.
- Enterprises will adapt to new tools, with platform owners acting as crucial connectors.
- This dynamic will lead to specialization, with certain models excelling at specific enterprise tasks.
- Crucially, as GenAI platforms mature, the need for AI Ops will grow – dedicated teams responsible for maintaining AI systems, understanding failures, and implementing fixes, much like DevOps for traditional software. The question "Who gets paged in the middle of the night?" becomes critical.
- Strategic Implication: The emergence of AI Ops creates opportunities for new tools, services, and skill sets in the Crypto AI space, particularly for maintaining decentralized AI infrastructure and applications.
Global vs. Local Solutions for AI Reliability
- Scott concludes by discussing the balance between universal and bespoke solutions for AI reliability:
- Universal aspects include defining common behaviors, detecting large-scale changes, and providing high-level insights. Distributional's platform aims to provide this foundational layer.
- However, each team and application has unique desired/undesired behaviors. The workflow then involves adapting global change detection to build specific, bespoke behavioral test coverage.
- This allows organizations to leverage a common foundation while fine-tuning testing to their individual needs.
Conclusion
This episode underscores that AI's future hinges on verifiable trust, not just performance. For Crypto AI investors and researchers, the key takeaway is the urgent need for robust, adaptive behavioral testing frameworks, especially for decentralized systems where direct control is limited and risks can be amplified.