This episode dives into Dippy's unique product-first journey, its strategic pivot to decentralized media inference, and the future of multimodal AI, offering critical insights for Crypto AI investors and researchers.
Dippy's Product-First Journey and BitTensor Integration
- Akshat, co-founder of Dippy Subnet 11, details Dippy's unconventional path, starting as a consumer-facing AI friend app with a substantial user base (10,000-50,000 downloads) before integrating with BitTensor. The initial ambition was to train proprietary Large Language Models (LLMs)—AI models capable of understanding and generating human-like text—but the prohibitive costs in compute, expertise, and data collection made this unfeasible. BitTensor emerged as a "match made in heaven," offering a decentralized solution that aligned perfectly with Dippy's product needs without requiring hundreds of millions in funding.
- Strategic Implication: This highlights the potential for established consumer applications to leverage decentralized AI networks like BitTensor for cost-effective scaling and specialized compute, offering a viable alternative to traditional venture capital-intensive routes.
Pivot to Decentralized Media Inference
- Dippy strategically pivoted from core LLM training to LoRA (Low-Rank Adaptation) models—a technique for fine-tuning large models at minimal computational cost—and image inference. This shift was driven by the observation that text-based LLMs were hitting diminishing returns, while users increasingly sought richer, multimodal experiences like voice calls, image generation, and video interactions. Dippy uses Subnet 4 (Targon) for text inference because of its superior speed, cost-efficiency, and support, but identified a critical gap in decentralized platforms for media inference. Centralized providers like FAL AI offered thousands of models at unmatched speed and cost, prompting Dippy to develop its own solution.
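To make the cost argument concrete, here is a minimal sketch of the LoRA idea: instead of updating a full weight matrix, you train two small low-rank factors alongside the frozen weights. All dimensions, names, and the scaling convention below are illustrative, not Dippy's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-init)

def lora_forward(x):
    # Base path plus low-rank update: W x + (alpha/r) * B A x.
    # With B initialized to zero, the adapted layer starts out
    # identical to the pretrained one.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # no drift before training

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full, lora = d_out * d_in, r * (d_in + d_out)
print(full, lora)  # 262144 vs 8192: ~32x fewer trainable parameters
```

The parameter savings are why LoRA-style fine-tuning is tractable for a product team without the compute budget of a frontier lab.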
- Actionable Insight: The market's demand for diverse AI modalities beyond text is growing. Investors should monitor decentralized networks that can efficiently support image, video, and voice inference, as these represent significant growth areas.
Character Creation and User-Generated Content
- Akshat clarifies that Dippy's miners do not create characters; instead, they power the underlying models. All of the platform's half-million characters are user-generated, empowering individuals to craft their AI companions. Users can define characters through simple prompts or elaborate backstories, generate accompanying images, and now, customize their character's voice. This user-centric approach fosters a vibrant ecosystem of unique AI personalities.
- Strategic Implication: Platforms that successfully integrate user-generated content (UGC) with AI capabilities can create powerful network effects and proprietary data moats, making them highly defensible against competitors.
Technical Deep Dive: Inference Optimization and Subnet 11's Role
- Dippy initially invested heavily in deterministic LoRA training to ensure consistent character appearances across generated images. However, rapid advancements, such as the emergence of Kontext-style models that enable one-shot, consistent character inference, quickly rendered those LoRA training efforts "outdated." Dippy now leverages Targon (Subnet 4) for language inference and is building Subnet 11 (Dippy Studio) as its dedicated platform for media inference. Voice inference will initially rely on ElevenLabs, given the lack of high-quality open-source alternatives, but will transition to SN11 as open-source models mature.
- Akshat emphasizes SN11's competitive edge, particularly against alternatives like Chutes, which he found "incredibly limiting" in model breadth and inference speed. Dippy has developed a proprietary "Dippy inference engine" that uses techniques like TensorRT—an NVIDIA library for high-performance deep learning inference—to optimize image models for speed. Low latency is paramount for natural, real-time interactions in voice and video calls. SN11 aims to provide interconnected workflows, allowing seamless transitions from text prompts to images and then to videos, rather than fragmented individual requests. Akshat states, "We see SN11 as like the one-stop shop for your generative media needs." The core optimization factors for SN11 are speed, cost, and the interconnectedness of models, with miners scored on reliability, latency, and request-handling capacity.
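The episode names three scoring factors but not a formula, so here is a hypothetical sketch of how a validator might combine reliability, latency, and capacity into a single miner score. The weights, targets, and multiplicative form are assumptions for illustration only.

```python
def score_miner(success_rate, p95_latency_ms, max_concurrent, *,
                latency_target_ms=500.0, capacity_cap=64):
    """Toy miner score in [0, 1]; all constants are hypothetical."""
    reliability = success_rate  # fraction of requests served correctly
    # Full credit at or below the latency target, decaying beyond it.
    latency = latency_target_ms / max(p95_latency_ms, latency_target_ms)
    # Capacity credit saturates at the cap.
    capacity = min(max_concurrent, capacity_cap) / capacity_cap
    # Multiplicative, so failing badly on any one dimension tanks the score.
    return reliability * latency * capacity

fast = score_miner(0.99, 400, 64)   # meets latency target, full capacity
slow = score_miner(0.99, 2000, 64)  # 4x slower than target
assert fast > slow
```

A multiplicative rule like this reflects the point about latency being paramount: a miner cannot compensate for slow responses with raw capacity.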
- Actionable Insight: The rapid obsolescence of AI techniques underscores the need for agile development and continuous adaptation. Investors should prioritize projects demonstrating strong R&D capabilities in inference optimization and a clear strategy for integrating diverse, high-performance models.
Future of Modalities: Multimodal vs. Specialized Models
- Akshat believes that, for now, specialized individual models will continue to outperform blanket multimodal models for specific use cases like character consistency or unique styling. He asserts that "Data is the real moat at this point," suggesting that models trained on highly specific and valuable datasets will maintain an advantage unless generic multimodal models achieve significant algorithmic breakthroughs. Dippy leverages its extensive conversational data for LLM fine-tuning, using signals like conversation length and user feedback to continuously improve its "reward model"—an AI component that guides model behavior towards desired outcomes. This data-driven approach will extend to other modalities, such as collecting user preferences on generated images.
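The reward-model pipeline described above can be sketched as turning engagement signals into a scalar preference label. The signal names, weights, and the log-scaled length term below are hypothetical; the source only says that conversation length and user feedback feed the reward model.

```python
import math

def reward_label(turns, thumbs_up, thumbs_down):
    """Toy scalar reward from engagement signals (illustrative weights)."""
    # Diminishing returns on conversation length, normalized so that
    # ~100 turns approaches a length signal of 1.0.
    length_signal = math.log1p(turns) / math.log1p(100)
    # Net explicit feedback in [-1, 1].
    total = max(thumbs_up + thumbs_down, 1)
    feedback_signal = (thumbs_up - thumbs_down) / total
    return 0.5 * length_signal + 0.5 * feedback_signal

engaged = reward_label(turns=80, thumbs_up=5, thumbs_down=0)
bounced = reward_label(turns=2, thumbs_up=0, thumbs_down=1)
assert engaged > bounced
```

Labels like these would then rank candidate model responses during fine-tuning; the same pattern extends naturally to image preferences, as the episode suggests.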
- Strategic Implication: While multimodal AI is a long-term vision, specialized AI models with proprietary data moats offer immediate competitive advantages. Researchers should focus on data acquisition strategies and fine-tuning techniques for niche applications.
Content Moderation and Age Gating
- Dippy maintains a strict 18+ policy for its mobile app. Recently, the website was updated to require age confirmation before any interaction, closing a loophole that had allowed younger users to view content for a limited time. The platform employs a multi-layer moderation system, prohibiting illegal or illicit content and enforcing a "tastefulness code." This proactive approach aims to prevent Dippy from being pigeonholed as an "NSFW-first platform," instead positioning it as the "next generation of YouTube" or mainstream entertainment. Moderated categories include nudity, hate speech, underage content, and anything deemed distasteful or inappropriate for a broad audience.
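A multi-layer system like the one described can be sketched as a pipeline where the strictest verdict wins: an age gate first, then content checks. The category names and the keyword layer below are placeholders; a production system would use trained classifiers rather than string matching.

```python
BLOCKED_CATEGORIES = {"underage", "hate speech", "illegal"}

def keyword_layer(text):
    """Placeholder first-pass filter; real systems use ML classifiers."""
    lowered = text.lower()
    return {cat for cat in BLOCKED_CATEGORIES if cat in lowered}

def moderate(text, user_confirmed_adult):
    # Layer 1: age gate before any interaction.
    if not user_confirmed_adult:
        return "blocked: age gate"
    # Layer 2+: content policy checks; any flag blocks the request.
    if keyword_layer(text):
        return "blocked: policy"
    return "allowed"

assert moderate("hello friend", True) == "allowed"
assert moderate("hello friend", False) == "blocked: age gate"
```

Ordering the age gate first mirrors the website change described above: confirmation is required before any content is shown at all.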
- Actionable Insight: Responsible AI development, particularly in content moderation and age gating, is crucial for mainstream adoption and regulatory compliance. Projects prioritizing ethical AI and robust safety protocols will likely gain greater trust and market penetration.
Monetization Strategy: Embedded Ads vs. Subscriptions
- Dippy plans to monetize primarily through "embedded hyperpersonalized ads" rather than a subscription-first model. Akshat illustrates this with an example: a character, having learned a user's habits, might casually recommend a nearby competitor coffee shop at an opportune moment, making the ad feel like a friend's suggestion. Future possibilities include physical mail with coupons or branded items on character outfits. While acknowledging potential impacts on retention, Akshat notes that only about 3% of consumers subscribe to services like ChatGPT Plus, making ads essential for serving the vast majority of users. He views this as an inevitable shift for mass consumer AI products, citing Google, Facebook, and Instagram as historical precedents.
- Strategic Implication: The low conversion rates for AI subscriptions suggest that ad-based models will dominate consumer AI. Investors should evaluate AI platforms' ability to integrate personalized, non-intrusive advertising that leverages deep user understanding.
Challenges and Future Vision for AI Modalities
- Adding new modalities introduces significant development complexity and requires a "strange bet" on future model improvements. Akshat recounts how Dippy's initial image generation feature was "absolutely hated" by users, negatively impacting engagement. However, by continuously iterating through six or seven different image models, the feature eventually became a "huge hit" and the app's second-largest revenue driver.
- Akshat expresses particular excitement for:
- Voice Calling: Dippy is pioneering custom voice creation via prompting, allowing users to personalize character voices—a "first" in the AI friend app space.
- World Models: Projects like DeepMind's Genie 3 and Odyssey ML are paving the way for video game-like experiences where users can create dynamic virtual worlds and interact with characters (e.g., driving a car in Brazil with a Dippy character).
- He provides a timeline for modalities reaching LLM-level quality:
- Image: "almost already there."
- Video: 1-1.5 years away, with expectations for consistent characters across scenes and real-time inference, potentially enabling AI representations of individuals to interact.
- World Models: Approximately 2 years away.
- Actionable Insight: Investing in AI requires a long-term vision and resilience against initial user feedback. The convergence of AI with gaming and virtual worlds, driven by advancements in video and world models, presents significant future market opportunities.
Advice for BitTensor Subnets
- Akshat offers two key pieces of advice for BitTensor subnets:
- Incentive Mechanism: Designing a robust incentive mechanism is paramount for achieving desired outputs from miners. He commends Rigid's work in this area.
- Problem-Driven Development: Subnets should be built to address a clear, unmet need, either within BitTensor or the broader market. Starting a subnet without a defined problem is a "bad spot to be in."
- Strategic Implication: For researchers and developers in the decentralized AI space, focusing on well-defined problems and crafting effective incentive structures are critical for building sustainable and impactful subnets.
Conclusion
Dippy's journey underscores the strategic imperative of product-market fit and agile adaptation in decentralized AI. The pivot to media inference and the focus on embedded ads highlight emerging monetization models and the critical role of data moats. Investors and researchers should track advancements in multimodal AI, particularly video and world models, and prioritize projects with robust inference optimization and clear problem-solving approaches.