BG2 Pod
September 11, 2025

Inside OpenAI Enterprise: Forward Deployed Engineering, GPT-5, and More | BG2 Guest Interview

OpenAI’s Head of Engineering, Sherwin Wu, and Head of Product, Olivia Goodman, pull back the curtain on the company’s enterprise engine, revealing how its platform goes far beyond ChatGPT to transform industries from healthcare to national security.

The Enterprise Engine Room

  • “We actually view our platform and especially our API... as our way of getting the benefits of AGI to as many people as possible.”
  • “There is a large category of use cases that only go through B2B... those are the businesses who are actually making stuff happen in the real world.”

While ChatGPT captured the world’s attention, OpenAI’s original product was its B2B API. The platform is seen as the primary vehicle for distributing AI's benefits, enabling developers and enterprises to tackle complex, real-world problems in areas like medicine and public service that a consumer-facing chatbot cannot reach alone.

Building in the Trenches

  • “We literally had to bring the weights of the model physically into their supercomputer.”

OpenAI’s enterprise work involves deep, bespoke integrations, often carried out by "Forward Deployed Engineers"—a term borrowed from Palantir. Key deployments include:

  • T-Mobile: Automating both text and voice-based customer support with natural, low-latency interactions powered by OpenAI's real-time API.
  • Amgen: Accelerating drug development by using models to analyze R&D data and automate the massive administrative workload of authoring regulatory documents.
  • Los Alamos National Labs: A custom, on-premise deployment of a reasoning model onto a secure, air-gapped government supercomputer for national security research.

GPT-5: Beyond the Benchmarks

  • “For GPT-5, equally important and impactful was the craft: the style, the tone, the behavior of the model.”

The development of GPT-5 was a collaboration with enterprise customers. The focus shifted from pure benchmark performance to "craft"—making the model more reliable, better at following instructions, and less prone to hallucination. A key trade-off remains performance versus reasoning time; giving the model more time to "think" produces superior results, but this latency is a challenge for real-time applications.

Key Takeaways

  • The conversation reveals that successful enterprise AI is less about dropping a powerful model into a system and more about deep, hands-on partnership. The gap between a model's potential and its real-world impact is closed by building the "scaffolding"—the data connectors, evaluation frameworks, and system integrations—that allows AI to operate effectively.
  • Enterprise AI is a Services Business. The best models are not enough. Success requires deep integration via "Forward Deployed Engineers" who build the necessary data scaffolding and orchestration layers.
  • GPT-5 Was Co-developed with Customers. Its focus on "craft" (behavior, tone) over raw benchmarks was a direct result of an intensive feedback loop with enterprise partners, making it more practical for real-world use.
  • Bet on Applications, Not Tooling. Sherwin is short the entire category of AI tooling (frameworks, vector stores), arguing the underlying tech stack is evolving too rapidly for any single tool to hold its position. Long-term value will accrue to those building applications in high-impact sectors like healthcare.

For further insights and detailed discussions, watch the full podcast: Link

This episode goes behind the scenes of OpenAI's enterprise division, revealing how advanced models are deployed in high-stakes industries and what it takes to bridge the gap from powerful AI to real-world business autonomy.

Introduction: Beyond ChatGPT to Enterprise AI

  • Sherwin Wu, Head of Engineering, and Olivia Goodman, Head of Product for the OpenAI Platform, introduce the company's enterprise-focused division. While ChatGPT is the public face of OpenAI, the platform began with a B2B focus through its API, aiming to distribute the benefits of AGI (Artificial General Intelligence)—AI with human-like cognitive abilities—to businesses and developers.
  • Sherwin explains that the platform, which includes the developer API, government sector products, and direct enterprise offerings, is a core part of OpenAI's mission. By enabling developers and large companies, OpenAI can extend the reach of AI into specialized use cases far beyond what a single consumer application can achieve.
  • Olivia emphasizes the strategic importance of B2B, stating, "There is a large category of use cases that only go through B2B... those are the businesses who are actually making stuff happen in the real world." Enabling those businesses, she argues, is how OpenAI distributes the benefits of AGI. This perspective frames enterprise as a critical channel for achieving real-world impact in sectors like healthcare, education, and public services.

Case Study: Transforming Customer Support at T-Mobile

  • Olivia details OpenAI's collaboration with T-Mobile to automate and enhance its massive customer support operations, which include both text and voice interactions. The goal was to help customers self-serve and resolve issues more efficiently, moving beyond simple text-based bots to sophisticated voice support.
  • The project required more than just providing a model. OpenAI's Forward Deployed Engineers—a term borrowed from Palantir for engineers who embed deeply with customers—were instrumental. They helped orchestrate models, connect them to T-Mobile's internal systems like CRMs (Customer Relationship Management software), and build the necessary integrations, many of which lacked clean APIs.
  • A critical component was establishing robust evals, or evaluation frameworks, to define and measure success, especially for audio. Grading a five-minute voice call for quality and accuracy is a complex problem. The collaboration also directly improved OpenAI's models, with learnings from T-Mobile contributing to the general availability release of the real-time voice API.

Case Study: Accelerating Drug Development with Amgen

  • Olivia highlights OpenAI's work with Amgen, a leading healthcare company specializing in drugs for cancer and inflammatory diseases. The partnership aims to accelerate the entire drug development and approval process, a goal with the potential to impact millions of lives.
  • The needs of a healthcare giant like Amgen fall into two buckets: pure R&D, where scientists analyze massive datasets, and administrative work, which involves authoring and reviewing extensive documentation for regulatory submissions. OpenAI's models are used to augment and automate tasks in both areas.
  • This deployment underscores the strategic value of enterprise AI. Olivia notes that Amgen was a top customer for GPT-5, demonstrating how cutting-edge models are being applied to solve some of humanity's most pressing challenges, far from the public eye.

Case Study: On-Premise AI for National Security at Los Alamos

  • Sherwin discusses a unique and highly secure deployment with Los Alamos National Laboratory, the U.S. government research lab famous for the Manhattan Project. Given the sensitive nature of their national security and defense research, the lab could not use public APIs.
  • OpenAI executed a custom on-premise deployment, installing one of its reasoning models directly onto Los Alamos's "Venado" supercomputer. This involved physically bringing the model weights into an air-gapped environment—one completely disconnected from the public internet—and integrating it with their specific hardware and networking stack.
  • Although OpenAI has limited visibility into its exact uses, the model serves as a thought partner for scientists, aids in data analysis for experiments, and helps with experiment design. This case illustrates the necessity of bespoke solutions for high-security clients and the expanding role of AI in novel scientific research.

Why Enterprise AI Deployments Succeed (and Fail)

  • Reflecting on hundreds of enterprise engagements, Olivia identifies key patterns for success. The most effective deployments combine top-down executive buy-in with a bottom-up "tiger team" of technical experts and employees with deep institutional knowledge.
  • Success hinges on defining clear evals from the start. Without a shared, measurable goal, projects become a "moving target." The process is an iterative "hill climb" from a low initial success rate (e.g., 46%) to the target (e.g., 99%), requiring patience, expertise, and sometimes even model fine-tuning.
  • This insight is critical for investors: successful AI integration is less about the raw power of a model and more about the organizational structure, clear goal-setting, and iterative refinement process supporting the deployment.
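The iterative "hill climb" Olivia describes can be sketched as a simple eval loop: a fixed set of graded cases, a measured pass rate, and repeated refinement until the rate clears the target. Everything below (the case prompts, graders, and toy "system") is invented for illustration, not drawn from any real deployment:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    grader: Callable[[str], bool]  # returns True if the response passes

def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the pass rate of `system` over a fixed set of graded cases."""
    passed = sum(case.grader(system(case.prompt)) for case in cases)
    return passed / len(cases)

# Toy cases: graders here are substring checks; real evals would use
# rubric- or model-based grading (e.g., for five-minute voice calls).
cases = [
    EvalCase("What plan am I on?", lambda r: "Magenta" in r),
    EvalCase("Reset my voicemail PIN", lambda r: "PIN" in r),
]

# A baseline "system" that always gives the same canned answer.
baseline = run_eval(lambda p: "You are on the Magenta plan.", cases)
print(f"pass rate: {baseline:.0%}")  # one of two cases passes: 50%
```

The point of the sketch is that the pass rate is the shared, measurable goal: each prompt tweak, scaffolding fix, or fine-tune is judged only by whether this number moves toward the target.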

Physical vs. Digital Autonomy: The Scaffolding Problem

  • The conversation explores a paradox: why is physical autonomy (like self-driving cars) seemingly ahead of digital autonomy (like AI agents booking a flight)? Sherwin argues that self-driving cars benefit from existing scaffolding—the structured environment of roads, traffic laws, and standardized signals.
  • In contrast, digital AI agents are often "dropped in the middle of nowhere" within enterprise systems that lack standardized interfaces or organized data. Sherwin suggests, "My hunch is some of the enterprise deployments that don't actually work out likely don't have the scaffolding or infrastructure for these agents to interact with."
  • For investors and researchers, this highlights a major opportunity. The most successful AI deployments will likely involve building this digital scaffolding—connectors, data organization platforms, and standardized APIs—that allows AI agents to operate effectively.

Under the Hood: The Development of GPT-5

  • Olivia describes the creation of GPT-5 as a "work of love" that focused equally on intelligence and behavior. The development process involved an unprecedentedly close feedback loop with enterprise customers to identify and address practical blockers beyond benchmark performance.
  • The team prioritized improvements in speed, instruction-following, and the model's ability to refuse to answer when it doesn't know something. This customer-centric approach to "craft, style, and tone" is what makes the model feel qualitatively different and more usable in real-world applications.
  • Sherwin identifies a key technical trade-off: balancing the model's "thinking time" (reasoning tokens) against latency and performance. While giving the model more time to think can solve incredibly complex problems, product builders must manage user expectations around response speed.
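The trade-off Sherwin identifies can be sketched as a latency-budget dispatch: give each request the most reasoning it can afford. The effort labels and latency estimates below are invented for this sketch, not taken from any real API:

```python
# Hypothetical mapping from reasoning effort to expected latency (seconds).
EFFORT_LATENCY_S = {"minimal": 0.5, "low": 2.0, "medium": 8.0, "high": 30.0}

def pick_effort(latency_budget_s: float) -> str:
    """Choose the highest reasoning effort whose expected latency fits the budget."""
    affordable = [e for e, t in EFFORT_LATENCY_S.items() if t <= latency_budget_s]
    return affordable[-1] if affordable else "minimal"

print(pick_effort(1.0))   # a live voice turn leaves little headroom -> "minimal"
print(pick_effort(60.0))  # offline document review can let the model think -> "high"
```

A real product would also factor in task complexity, but the shape of the decision is the same: more thinking tokens buy better answers at the cost of response time.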

GPT-5 Performance and User Feedback

  • Four weeks post-launch, feedback on GPT-5 has been overwhelmingly positive, particularly for its advanced coding and reasoning capabilities. Its ability to solve problems other models cannot, combined with a dramatic reduction in hallucinations, has been a major win for developers.
  • However, its proficiency in instruction-following created an interesting challenge. The model is so literal that users had to update old prompts that were designed to "beg" previous models to be concise. Sherwin notes, "It turns out when you give that to GPT-5, it's like oh my gosh, this person really wants to be concise. And so the response would be like one sentence."
  • Constructive feedback is focused on refining code generation patterns and optimizing the trade-off between thinking time and latency for simpler tasks, ensuring the model's power is applied dynamically.

The Evolution of Multimodality: Beyond Text to Real-Time Voice

  • Olivia discusses the progress in multimodal models, particularly the new real-time voice API. A key challenge was that previous voice models felt less intelligent than their text-based counterparts. The current focus is on closing this intelligence gap.
  • The team is also teaching the model to handle specific, economically valuable conversations, such as understanding what an SSN (Social Security number) is and knowing to ask for clarification on a digit it didn't hear clearly rather than guessing. This requires training on real-world customer support and sales calls.
  • Sherwin explains the architectural shift from "stitched" models (speech-to-text, then text-to-speech) to a true, end-to-end speech-to-speech system. This new approach dramatically reduces latency and preserves crucial signals like tone, accent, and pauses, creating a more natural user experience.
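The latency argument for the end-to-end approach is simple addition: a stitched pipeline pays for each stage serially, and drops tone, accent, and pauses at the speech-to-text boundary. The stage latencies below are invented for illustration:

```python
# Hypothetical per-stage latencies (milliseconds), for illustration only.
stitched_ms = {"speech_to_text": 300, "text_model": 800, "text_to_speech": 250}
end_to_end_ms = {"speech_to_speech": 600}

# The stitched pipeline's latencies add up stage by stage; a single
# speech-to-speech model replaces all three hops with one.
print(sum(stitched_ms.values()))    # 1350
print(sum(end_to_end_ms.values()))  # 600
```

The numbers are made up, but the structural point holds regardless of the exact figures: every extra hop adds latency and discards paralinguistic signal the final stage can never recover.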

The Power of Model Customization: Reinforcement Fine-Tuning

  • Sherwin details OpenAI's advanced model customization offering: Reinforcement Fine-Tuning (RFT). Unlike traditional Supervised Fine-Tuning (SFT), which requires labeled prompt-completion pairs, RFT uses reinforcement learning to optimize a model for a specific, gradable task.
  • While more complex, RFT is an order of magnitude more powerful. It allows customers to leverage their proprietary data to create a best-in-class model for their specific domain. Sherwin states this shifts the goal from simple customization to "creating a best-in-class, maybe best in the world model for something that you care about for your business."
  • Startups like Rogo Capital (financial services) and Accordance AI (tax) have used RFT to achieve state-of-the-art results on industry-specific benchmarks, demonstrating its power for creating highly specialized, high-performance AI.
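The distinction Sherwin draws can be sketched in code: SFT consumes labeled prompt-completion pairs, while RFT only needs a grader that scores answers on a gradable task. The grader below (numeric tolerance scoring) is a hypothetical example of the kind of reward signal a domain like finance or tax might define, not OpenAI's actual API:

```python
# SFT-style training data: a labeled prompt-completion pair.
sft_example = {"prompt": "Classify this filing type:", "completion": "10-K"}

# RFT-style supervision: no labels, just a grader that scores an answer.
def rft_grade(model_answer: str, reference: float, tolerance: float = 0.01) -> float:
    """Score a numeric answer on [0, 1]: full credit within relative tolerance,
    partial credit decaying with relative error, zero for unparseable output."""
    try:
        value = float(model_answer.strip().rstrip("%"))
    except ValueError:
        return 0.0
    rel_err = abs(value - reference) / max(abs(reference), 1e-9)
    return 1.0 if rel_err <= tolerance else max(0.0, 1.0 - rel_err)

print(rft_grade("12.4", 12.4))  # exact answer: full credit
print(rft_grade("11.0", 12.4))  # close answer: partial credit
print(rft_grade("n/a", 12.4))   # unparseable answer: zero
```

Because the grader encodes what "good" means for the domain, reinforcement learning can optimize the model directly against proprietary criteria, which is what makes RFT suited to building a best-in-class model for a narrow, high-value task.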

Rapid Fire: Long/Short Bets and Underrated Tools

  • Long/Short Bets:
    • Sherwin is long on eSports, citing its massive growth potential, especially in Asia, and its increasing cultural relevance among younger generations. He is short on the broad category of AI tooling startups (e.g., eval platforms, vector stores), arguing the space is too competitive and evolves too quickly for any single tool to maintain long-term dominance.
    • Olivia is long on healthcare and life sciences, believing the industry is perfectly positioned to benefit from AI due to its data-rich, R&D-driven nature. She is short on education based on memorization, arguing that LLMs make rote knowledge obsolete and that critical thinking should be the priority.
  • Underrated AI Tools:
    • Both speakers praised Grain for its seamless calendar integration and high-quality meeting transcription and summarization.
    • Sherwin highlighted Copilot (specifically the CLI and its integration with GPT-5), which he says creates a "bionic feeling" for developers by understanding intent and one-shotting complex coding tasks.

Conclusion: The Frontier of Applied AI

This conversation reveals that the next frontier for AI is not just building more powerful general models, but engineering the specific applications, infrastructure, and feedback loops to solve real-world enterprise problems. Investors and researchers should focus on the emerging ecosystem of "scaffolding" and advanced customization techniques like RFT.
