Machine Learning Street Talk
July 12, 2025

The Weird ChatGPT Hack That Leaked Training Data

This podcast delves into the fragile state of AI security, exploring a bizarre ChatGPT exploit that revealed training data and forecasting the next decade of cyber threats. It’s a reality check on the gap between the breakneck pace of AI deployment and our ability to make these systems truly safe.

The "Poem" Hack and Unsolved Fragility

  • “We had a project... where we tried to recover some of the training data of ChatGPT... If you ask ChatGPT to repeat the word 'poem' forever, the model... at some point... starts spitting out random bits and pieces of text from the internet that it has memorized.”
  • “In security, 99% means you fail. Essentially 99% is somewhat indistinguishable from 0%... an attacker is going to find [the 1% of failures], and that's what they're going to send to your model.”
  • A researcher discovered that prompting a specific version of ChatGPT to repeat the word "poem" indefinitely caused it to break down and leak verbatim training data. The vulnerability was so strange that when the researchers reported it, OpenAI’s initial response was simply, "Why?"
  • The incident highlights a core issue: LLMs are not 100% reliable. For security, this is a critical failure. While a model may be coherent 99% of the time, adversaries will relentlessly probe for and exploit the remaining 1%, rendering average performance metrics meaningless.
  • OpenAI’s fix was a surface-level patch—blocking the specific prompt—rather than addressing the fundamental memorization flaw, leaving similar vulnerabilities likely undiscovered.

The Next Wave of AI Threats

  • “The main concern with how people deploy these models today is prompt injections... We're going to see essentially a new decade of injection attacks... The competitive pressure to do cool and fancy stuff is going to be there, and so people will do it.”
  • The next decade will be defined by prompt injection attacks, the modern successor to SQL injections. As companies race to integrate AI agents with real-world capabilities (like Anthropic’s computer-use demo), they are deploying powerful tools with known, unpatched vulnerabilities.
  • Training models on proprietary or sensitive data is a major risk. Since the memorization problem is unsolved, it's not a matter of if but when a model trained on private medical or legal data will be tricked into leaking it.

A New Era for Security Research

  • “Now when we find flaws in these systems... things are a lot more real. You actually have in your hands maybe some exploit that could impact tools that are used by millions of people.”
  • AI security has graduated from academic thought experiments to a high-stakes field with real-world consequences. Researchers are no longer speculating about hypothetical attacks; they are finding actual exploits in systems with hundreds of millions of users.
  • The discovery of vulnerabilities is no longer gatekept by academia. Enthusiasts and industry tinkerers are now on the front lines, stumbling upon bizarre model behaviors and new attack vectors, forcing the entire ML community to grapple with responsible disclosure practices borrowed from traditional cybersecurity.

Key Takeaways:

  • The race to deploy advanced AI is happening far faster than our ability to secure it. The systems are brittle, the risks are real, and the security mindset is lagging dangerously behind.
  • Memorization is an unsolved vulnerability. Any organization fine-tuning models on private, sensitive data is creating a ticking time bomb for a major data breach.
  • Prompt injection is the new default attack vector. The rush to deploy AI agents with broad system access is creating a massive, insecure attack surface that will define the next era of cybersecurity.
  • Watermarking is not a security solution. Techniques for marking AI-generated content are fragile and easily defeated by simple transformations like translation, making them unreliable for detecting malicious deepfakes or disinformation.

Link: https://www.youtube.com/watch?v=c_hmxRVDXBE

This episode reveals the profound brittleness of large language models, where a simple, repetitive prompt can unexpectedly leak sensitive training data.

The Illusion of AI Detection

  • The discussion begins by questioning the reliability of AI-generated content detectors. The speaker, an AI security researcher, points to a fascinating case where the frequent use of the word "delve" in AI outputs was traced back to the linguistic style of Nigerian crowd workers who contributed to the training data. This created a feedback loop where their natural writing style was flagged as AI-generated, highlighting the fundamental flaws in current detection models.
  • The speaker notes that while automated detectors are easily fooled, human intuition often remains surprisingly effective at spotting generated content.
  • This illustrates a core challenge for Crypto AI: verifying the authenticity of on-chain data, governance proposals, or user interactions when AI detection tools are inherently unreliable.

Why 99% Accuracy is a Security Failure

  • The conversation pivots to the critical difference between average performance and security. In machine learning, achieving 99% accuracy is often seen as a success. However, in a security context, that 1% failure rate represents a gaping hole for attackers to exploit.
  • The speaker emphasizes that attackers don't care about the 99% of cases where the model works correctly; they will specifically find and leverage the 1% of inputs that cause it to fail.
  • Strategic Implication: For investors evaluating Crypto AI protocols, it's crucial to look beyond average performance metrics. The resilience of a system against worst-case, adversarial scenarios is a far more important indicator of its long-term viability and security.

The 'Poem Forever' Hack: Leaking ChatGPT's Training Data

  • The episode's most striking example is a bizarre attack the speaker and his colleagues discovered. They found that asking a specific version of ChatGPT to repeat the word "poem" indefinitely would cause the model to break down and start outputting random, memorized fragments from its training data.
  • This attack was so unusual that when the researchers disclosed it to OpenAI, the company's initial response was simply, "Why?" The speaker highlights the absurdity of the situation: “We were like, how would we know? You're the experts here... We're actually asking you to tell us like what's going on with this model.”
  • OpenAI eventually patched the issue with a "band-aid" fix that prevents the model from repeating a word forever, but this doesn't address the underlying memorization vulnerability.
  • Actionable Insight: This demonstrates that even the most advanced models from leading labs can have unpredictable and easily triggered vulnerabilities. Investors must consider the risk of data leakage in any AI model, especially those fine-tuned on proprietary or sensitive data, which is common in specialized crypto applications.

The Democratization of Exploit Discovery

  • The speaker notes a significant shift in the security landscape: it's no longer just academic researchers finding these flaws. Enthusiasts and everyday users playing with models like ChatGPT are now stumbling upon major vulnerabilities and new attack vectors.
  • This decentralization of discovery makes the security environment more dynamic and unpredictable.
  • For researchers, this means the field is constantly evolving, with new, real-world problems emerging from public use rather than theoretical speculation.

Top AI Security Worries

  • 1. Data Memorization and Leakage from Fine-Tuning: The biggest risk is the rush to fine-tune models on proprietary and privacy-sensitive data (e.g., medical, legal, or financial records) without fully understanding or controlling data memorization. The speaker is convinced that significant data leaks are inevitable as companies race to leverage any data they can acquire.
  • 2. The Inevitable Rise of Prompt Injection Attacks: Prompt Injection: This is an attack where a malicious user crafts an input that tricks an AI model into ignoring its original instructions and executing the user's hidden commands instead. The speaker points to tools like Anthropic's "computer use" demo as a prime example of deploying powerful but insecure AI agents. Despite warnings in the documentation about prompt injection risks, the competitive pressure to release "cool and fancy stuff" means vulnerable products will continue to be launched. He predicts the next decade will be defined by prompt injection attacks, similar to how previous eras were dominated by SQL injections and buffer overflows.

AI Security Research: From Hypothetical to Hyper-Real

  • The arrival of mass-market tools like ChatGPT has fundamentally changed AI security research. Before, researchers had to speculate about potential threats. Now, they can directly attack systems used by millions, making their findings immediately relevant and their ethical responsibilities far greater.
  • The speaker explains that the machine learning community is now grappling with challenges long understood in traditional computer security, such as responsible vulnerability disclosure and patching.
  • This shift has made the field more concrete and impactful, but also more dangerous, as discovered exploits have real-world consequences.

The Limits of Scale

  • The speaker expresses skepticism that simply scaling up models with more data and compute power will solve their fundamental flaws. While scale has led to impressive capabilities like ChatGPT, it has not eliminated the long tail of errors or endowed models with genuine causal understanding.
  • He argues that achieving 100% reliability likely requires a different paradigm beyond just fitting increasingly complex correlations from data.
  • However, he humbly adds a caveat: “If you had asked me a couple of years ago if scale alone would give us something like chat GPT, I would have also said no.”

The False Promise of Watermarking

  • The conversation concludes with a critical look at watermarking, a technique for embedding a hidden signal into AI-generated content to identify its origin. The speaker is not optimistic about its effectiveness for two key reasons:
  • 1. Open-Source Models: Watermarks are useless for open-source models, as users have full control and can remove any embedded signals.
  • 2. Lack of Robustness: For closed-source models, watermarks are too brittle. A simple action, like translating a watermarked text from German to English and back, can easily destroy the signal. While useful for internal data filtering (e.g., preventing GPT-5 from training on GPT-4's output), they are not a reliable solution for adversarial scenarios like deepfake detection.

Conclusion

  • This episode reveals the profound brittleness of AI models, where simple prompts can leak training data. For investors and researchers, this highlights that security is not a feature but a fundamental, unsolved challenge. Understanding attack vectors like data leakage and prompt injection is now non-negotiable for any viable Crypto AI project.

Others You May Like