🧬 The Bitter Lesson Comes to Biology, Plus Self-Play, Streaming RAG & Formal Verification

🔬 Scaling Laws Hit Biology — And They Work

One of the most compelling developments in applied AI comes from the biology domain, where researchers are asking a fundamental question: Do the neural scaling laws that govern language models also apply to proteins? The answer, according to recent work from the Biohub team, appears to be a resounding yes — with some critical caveats.

Proteins, the workhorses of cellular biology, are chains built from an alphabet of 20 amino acids that fold into complex 3D shapes, and that structure determines their function. The bet behind protein language models like ESMC (ESM Cambrian) is simple: train a massive masked language model on sequences produced by hundreds of millions of years of evolution, and it will learn the "grammar" of proteins, including emergent properties like structure and function, without ever being explicitly taught about 3D shapes.
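To make the masked-language-model objective concrete, here is a minimal sketch of the data side only: hide a few residues and keep the originals as prediction targets. The function name, the 15% mask rate, the `<mask>` token string, and the example sequence are all assumptions for illustration, not details from the ESMC work.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"                        # assumed mask token string


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues; return (masked tokens, targets)."""
    if any(residue not in AMINO_ACIDS for residue in seq):
        raise ValueError("sequence contains a non-standard residue")
    rng = random.Random(seed)  # seeded so the example is repeatable
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model must predict
        tokens[pos] = MASK          # what the model actually sees
    return tokens, targets


# Hypothetical 22-residue sequence, used only to exercise the function.
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
```

A model trained on this objective only ever sees `tokens` and is scored on recovering `targets`, which is why any structural knowledge it picks up has to come from sequence statistics alone.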

"You'll know a protein by the amino acids it keeps" — the biological analog to "you'll know a word by the company it keeps."

The research shows that these models do scale predictably. When measuring the ability to predict long-distance protein contacts (a proxy for structural understanding), performance follows a clean log-linear curve as compute increases — just like in language models. Critically, this emergent structural knowledge comes purely from sequence co-occurrence patterns during masked token prediction.
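A "clean log-linear curve" means the score grows by a roughly constant amount each time compute is multiplied by ten. The sketch below fits that relationship with ordinary least squares. The data points are made up purely for illustration and are not results from the work; the function name is hypothetical.

```python
import math


def fit_log_linear(compute, score):
    """Least-squares fit of score = a + b * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(score) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, score))
    b = sxy / sxx            # gain in score per 10x compute
    a = mean_y - b * mean_x  # intercept
    return a, b


# Made-up numbers for illustration only (compute in FLOPs, score in 0..1).
compute = [1e18, 1e19, 1e20, 1e21]
score = [0.30, 0.40, 0.50, 0.60]
a, b = fit_log_linear(compute, score)
```

With a fit like this, `b` is the number worth watching: a stable slope across model sizes is what "scales predictably" amounts to in practice.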

📊 The Data Wall in Biology — And Why It's Different

Here's where biology diverges from the language model trajectory. Earlier protein models like ESM2 hit a plateau around 50 million training samples, with diminishing returns from adding more parameters. The fix wasn't clever architecture — it was data scaling. By expanding to 2.8 billion sequences (pulling in metagenomic data from dirt, oceans, and human guts), the team broke through the plateau.

Unlike the human-generated text powering LLMs, evolution has been generating training data for 4 billion years. We've sampled less than 1% of known protein sequence diversity, and that's only what exists right now — let alone everything evolution has ever tried. The data wall conversation in biology looks fundamentally different from the one in language.

🧪 Beating AlphaFold at Its Own Game

Perhaps the most striking result: ESMFold 2, using only raw sequences, approaches AlphaFold 3's performance — and in some critical applications like antibody design, it actually wins.

AlphaFold's power comes from hand-crafted features called Multiple Sequence Alignments (MSAs) — essentially stacking hundreds of evolutionary cousins of a protein to extract co-variation patterns. It's brilliant domain engineering, but it's also the kind of "human knowledge baked in" approach that Richard Sutton's Bitter Lesson suggests should eventually lose to more general methods.
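For intuition about what "co-variation patterns" in an MSA look like, the sketch below scores pairs of alignment columns with mutual information, a classic and very simple co-variation measure. This is an illustration of the signal, not a description of what AlphaFold computes internally, and the four-sequence alignment is a hypothetical toy.

```python
import math
from collections import Counter


def column(msa, i):
    """All residues found at alignment position i."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    ci = Counter(column(msa, i))
    cj = Counter(column(msa, j))
    cij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in cij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((ci[a] / n) * (cj[b] / n)))
    return mi


# Toy alignment (hypothetical): columns 0 and 3 always change together,
# column 2 never changes at all.
msa = ["AKLE", "AKLE", "GKLD", "GRLD"]
coupled = mutual_information(msa, 0, 3)
conserved = mutual_information(msa, 0, 2)
```

Columns that mutate together score high, which is the hint that the two positions may be in contact in the folded structure. The catch is visible in the code: the estimate depends entirely on having enough aligned sequences to count.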

On general protein-protein docking tasks, ESMFold 2 lands within 3 percentage points of AlphaFold 3 on the DockQ pass rate metric, without any MSAs. On antibody applications, where sequence variation is sparse relative to structural diversity, the single-sequence approach actually outperforms. The headline isn't that MSAs are dead, but that handcrafted features only help where data is abundant, and often fail precisely where drug designers need them most.

Plus, it's just faster. Building an MSA means searching large sequence databases for every input, which takes significant compute time; ESMFold 2 skips that step and delivers comparable or better results with dramatically lower latency.

🧠 Mechanistic Interpretability in Protein Space

One of the most fascinating aspects of this work involves applying sparse autoencoders (SAEs) — tools developed by the mechanistic interpretability community for language models — to protein models. The question: do these models develop interpretable, monosemantic features?

The answer is yes. From pure fill-in-the-blank pretraining, the model's latent space decomposes into clean features corresponding to real biological concepts: individual amino acids, structural motifs, protein domains, functional sites, and whole protein roles. None of this was supervised — the model organized its representations purely through masked language modeling.
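As a rough picture of what a sparse autoencoder does, here is a forward pass in plain Python: encode an activation vector into non-negative features, decode it back, and penalize both reconstruction error and feature activity. The weights, dimensions, and L1 coefficient are hypothetical toy values; real SAEs are trained on model activations with far more features.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.01):
    """Return (features, reconstruction, loss) for one activation vector."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre)  # sparse, non-negative codes
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(features)  # L1 norm, since features are >= 0
    return features, recon, mse + l1_coeff * sparsity


# Toy setup: a 2-d activation expanded into 4 candidate features.
w_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]
b_dec = [0.0, 0.0]
features, recon, loss = sae_forward([2.0, -3.0], w_enc, b_enc, w_dec, b_dec)
```

Only two of the four features fire for this input. The interpretability work then asks what each such feature responds to across many proteins.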

One striking example: researchers identified a feature that activates for the "nucleophilic elbow," a catalytic motif that has evolved independently multiple times across unrelated proteins. The model learned to recognize this convergent motif across structurally diverse backgrounds, which suggests something closer to genuine biological intuition than simple memorization.

The team even built an atlas of 7 billion predicted protein structures, laid out by similarity in latent space. The result looks like a Google Maps of proteins, with clear evolutionary families and functional clusters emerging naturally. CRISPR-Cas enzymes cluster together; so do other functional families — all discovered as a byproduct of the pretraining objective.

⚔️ Self-Play for Language Models: Why It's Harder Than It Looks

Shifting to post-training strategies, recent work on self-play for LLMs reveals both promise and fundamental challenges. The idea is appealing: just as AlphaGo Zero surpassed human champions by playing against itself — unbiased by human suboptimal play — could LLMs recursively improve by generating their own training tasks?

The current RL post-training recipe is straightforward: collect thousands of tasks (coding, math, tool use), have the agent attempt them, collect rewards, and update the model by upweighting successful rollouts. As the Cursor Composer 2 technical report beautifully illustrated, this scales smoothly and reliably — as long as you keep feeding in new tasks. But here's the rub: those tasks must be hand-collected by humans.
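One simple way to "upweight successful rollouts" is to score each attempt relative to the average for the same task. The sketch below uses that mean-reward baseline as an assumption; the source does not say which update rule is used, and the function name is hypothetical.

```python
def rollout_weights(rewards):
    """Advantage of each rollout relative to the mean reward for its task."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four attempts at one task; 1.0 = passed the checker, 0.0 = failed.
mixed = rollout_weights([1.0, 0.0, 0.0, 1.0])

# A task the agent always solves produces no learning signal at all.
saturated = rollout_weights([1.0, 1.0, 1.0, 1.0])
```

The second case shows why the recipe needs a steady supply of new tasks: once every rollout succeeds (or every rollout fails), all weights are zero and there is nothing left to learn from.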

Self-play proposes automating this: have the model generate new RL tasks and attempt to solve them, training on both capabilities simultaneously. In symmetric self-play (like AlphaGo), an older version of the model plays the opponent role. In asymmetric self-play (more common for LLMs), a "conjecturer" generates entire tasks with verification criteria, and a "solver" attempts them.

🚧 The Problem: Vanilla Self-Play Doesn't Work

When tested on 3,000 formal math problems in Lean, vanilla self-play performed no better than standard RL — both plateaued around 60% solve rate. Why? The conjecturer was rewarded for producing problems the solver couldn't solve (to keep tasks at the frontier of capability), but the easiest way to do this turned out to be generating artificially complex, inelegant junk.

"It's the equivalent of giving someone a three-page high school calculus problem where they'll make some tiny mistake somewhere — technically hard, but completely useless for learning."

The solution, implemented in Self-Guided Self-Play (SGS), introduces two mechanisms:

Grounding: For each unsolved problem, the conjecturer generates a related synthetic problem, anchoring the distribution to known-good tasks.
Dual reward: The conjecturer is rewarded both for difficulty and for relevance/elegance, judged by a third model component called a "guide."
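The dual reward can be pictured as a difficulty term gated by the guide's judgment. The sketch below multiplies the two, which is an assumption made for illustration; the source does not specify how SGS combines them, and the function and parameter names are hypothetical.

```python
def conjecturer_reward(solve_rate, guide_score):
    """Reward problems that are hard for the solver and rated useful.

    solve_rate:  fraction of solver attempts that succeeded (0..1).
    guide_score: the guide's relevance/elegance rating (0..1).
    """
    difficulty = 1.0 - solve_rate
    return difficulty * guide_score


# An unsolvable but convoluted problem versus a hard, well-posed one.
junk = conjecturer_reward(solve_rate=0.0, guide_score=0.1)
useful = conjecturer_reward(solve_rate=0.25, guide_score=0.9)
```

Under a difficulty-only reward the junk problem would win. With the guide in the loop, the well-posed problem scores higher even though the solver sometimes cracks it.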

The result: a 7B parameter model trained with SGS matched the performance of a 670B model on the same benchmark — at 8x the compute cost, but with a dramatically smaller base model. Still, performance plateaued well below 100%, indicating this remains an open problem with significant headroom.

🎙️ Streaming RAG: The Latency Challenge for Voice Agents

As voice AI systems proliferate, a new challenge emerges: how do you reduce latency without sacrificing factual grounding?

Traditional RAG (Retrieval-Augmented Generation) waits for the user to finish speaking, retrieves relevant information, and then generates a response. In text interfaces, this is acceptable. In voice, a 10-second delay destroys the conversational experience. The solution proposed in recent work from Meta: run RAG while the user is still speaking.

Two approaches were explored:

Fixed-interval streaming RAG: Divide the audio into blocks and run RAG on each intermediate chunk. Use fast components of the RAG pipeline (like initial document retrieval) to decide whether intermediate results match the final query well enough to proceed with full retrieval.
Adaptive triggering: Fine-tune a model to decide, for each incoming chunk, whether it contains critical new information requiring a new query, or whether past retrievals are sufficient.
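The fixed-interval variant can be sketched as: retrieve on every partial transcript while the user is still talking, then run one cheap retrieval on the full query and reuse the early work if the two agree. Everything below is a toy under stated assumptions: the keyword-overlap retriever, the overlap threshold, and the tiny corpus are hypothetical stand-ins, not the components used in the Meta work.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by words shared with the query."""
    words = set(query.lower().split())

    def shared(doc_id):
        return len(words & set(corpus[doc_id].lower().split()))

    return sorted(corpus, key=lambda doc_id: -shared(doc_id))[:k]


def streaming_rag(chunks, corpus, k=2, min_overlap=0.5):
    """Return (documents, reused_early_result) for a chunked utterance."""
    partial, early = "", []
    for chunk in chunks[:-1]:
        partial = (partial + " " + chunk).strip()
        early = retrieve(partial, corpus, k)  # runs while the user speaks
    final_query = (partial + " " + chunks[-1]).strip()
    final = retrieve(final_query, corpus, k)  # cheap end-of-query check
    overlap = len(set(early) & set(final)) / k
    reused = overlap >= min_overlap
    return (early if reused else final), reused


corpus = {
    "paris": "paris is the capital of france",
    "berlin": "berlin is the capital of germany",
    "pasta": "how to cook pasta at home",
}
docs, reused = streaming_rag(
    ["what is the capital", "of france", "please"], corpus
)
```

When `reused` is true, any expensive downstream work already done on the early documents is still valid, which is where the latency saving comes from. When the user changes direction late in the utterance, the check fails and the pipeline falls back to the end-of-query result.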

Both methods aim to identify the "semantic inflection point" — the moment when enough of the user's intent is clear to begin retrieval. Results on converted text-to-audio benchmarks showed latency reductions of 0.5 seconds on synthetic datasets and 1.5 seconds on human speech, with no degradation in accuracy compared to end-of-query RAG.

The key insight: in voice, it's better to retrieve early based on partial information than to wait for perfect information. The tradeoff favors responsiveness over precision, as long as the retrieval quality remains adequate.

🔐 Formal Verification: The Future of Trustworthy AI

Lean, the interactive theorem prover, is rapidly becoming the go-to tool for verified intelligence — systems that not only generate solutions but prove them correct.

Why Lean? It's fast, it's expressive (based on dependent type theory), it's both a theorem prover and a functional programming language, and it's supported by Mathlib — the largest formalized math library in existence, covering everything from topology to algebraic geometry. Unlike informal math, where proofs can handwave or skip steps, Lean is unforgiving: you cannot fool the kernel.
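A two-line illustration of what "you cannot fool the kernel" means in practice. This is a generic Lean 4 sketch, not code from any of the projects mentioned; the theorem name is made up.

```lean
-- Accepted: the kernel checks that the proof term has the stated type.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Accepted: both sides compute to the same value.
example : 2 + 2 = 4 := rfl

-- Rejected if uncommented: no proof term exists for a false statement.
-- example : 2 + 2 = 5 := rfl
```

There is no way to mark the last statement as "obviously true" and move on. Either a proof term type-checks or the file does not compile.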

Recent breakthroughs illustrate the momentum:

AI systems have achieved gold medals at the International Mathematical Olympiad (IMO) using Lean for verification.
OpenAI claimed to solve an 80-year-old Erdős problem with o3, with Lean providing the formal proof.
DeepMind's recent work uses formal verification not just for math but across multiple scientific domains.

🖥️ From Math to Code: Program Verification

Lean's utility extends beyond mathematics. Program verification — proving that code satisfies a formal specification — is a trillion-dollar problem. Bugs are expensive, and as AI-generated code proliferates, the need for guarantees intensifies.

Recent work introduces tools like Bridge and Torch-Lean (a PyTorch-style tensor system in Lean) that enable:

Writing neural network code in Lean with full verification capabilities
Proving properties like "Flash Attention is equivalent to standard attention" at the spec level
Reasoning formally about non-determinism in LLM inference, down to floating-point arithmetic and GPU kernel behavior
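To see what "Flash Attention is equivalent to standard attention" is claiming, the sketch below computes single-query attention two ways: once by materializing all scores, and once in a single streaming pass with a running maximum and running sum, which is the online-softmax idea that Flash Attention builds on. It is a numerical spot check in plain Python, not a proof and not the tiled GPU algorithm; the 1/sqrt(d) scaling is omitted and all names are hypothetical.

```python
import math


def standard_attention(q, keys, values):
    """Softmax attention for one query: compute every score first."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def streaming_attention(q, keys, values):
    """Same quantity in one pass over the keys, with O(1) extra state."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        scale = math.exp(running_max - new_max)  # rescale earlier sums
        w = math.exp(s - new_max)
        denom = denom * scale + w
        acc = [x * scale + w * vi for x, vi in zip(acc, v)]
        running_max = new_max
    return [x / denom for x in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reference = standard_attention(q, keys, values)
streamed = streaming_attention(q, keys, values)
```

The two results agree up to floating-point rounding on this input. A formal proof would establish the equality for all inputs at the level of real-number specifications, and then separately account for rounding.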

"We should shift from vibe coding to verified coding: code that comes with proofs of correctness."

The CSLab project at Stanford is building a repository of verified computer science concepts, and contributions are open to the community.

⚡ Agentic Coding: From Chess to Real-Time Strategy

The final presentation took a radically different approach, arguing that programming with AI agents is less like chess and more like StarCraft.

In chess, you think linearly, predict the future, and design robust systems. In real-time strategy games, you:

Parallelize everything: Your economy, production, and units must all be active simultaneously.
Balance imperfect execution: No single aspect can be perfect; you optimize across many dimensions at once.
Maximize APM (actions per minute): High-level players maintain constant activity, even if individual actions are imperfect.

The coding workflow mirrors this:

Spawn many agents in parallel: Use git worktrees to maintain separate development environments. One orchestrator agent (typically Claude) manages multiple worker agents, each tackling different tickets simultaneously.
Minimize human keystrokes: The goal is to go from idea to work-in-progress as fast as possible. Agents attempt to take tasks all the way to PR stage, making assumptions and pushing forward even if they'll need correction later.
High visibility, low latency: Like clicking around an RTS minimap, the developer monitors multiple agents via status dashboards, using audio cues (actual Warcraft/StarCraft sound effects) to signal when attention is needed.
Satisfice, don't perfect: Do things "good enough" and iterate. Perfect is the enemy of shipped.
Max your APM: Track tool calls per minute as a proxy for productivity. Idle agents = wasted resources.
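The worktree setup in the first step can be sketched as one `git worktree add` command per ticket, each creating a new branch in its own directory. The branch naming scheme, directory layout, and ticket IDs below are hypothetical; the speaker's actual tooling is not described in that detail. The function only builds the commands and does not run them.

```python
def worktree_commands(tickets, base_branch="main", root="../worktrees"):
    """Build one `git worktree add` command per ticket (not executed here)."""
    commands = []
    for ticket in tickets:
        branch = f"agent/{ticket}"  # hypothetical branch naming scheme
        path = f"{root}/{ticket}"   # one checkout directory per agent
        commands.append(
            ["git", "worktree", "add", "-b", branch, path, base_branch]
        )
    return commands


# Hypothetical ticket IDs.
commands = worktree_commands(["TICKET-101", "TICKET-102"])
```

Each command list can be passed to `subprocess.run(cmd, check=True)` from inside the repository. Because every agent gets its own working directory and branch, parallel agents cannot overwrite each other's uncommitted files.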

The results speak for themselves: the team 3.5x'd PRs per engineer per month, with a 60% boost in the most recent month after fully adopting this workflow.

🧠 The Knowledge Base as Competitive Advantage

A critical component of this system is an automated, linked knowledge base — structured markdown files that encode:

Codebase patterns and conventions
Business logic and product knowledge
Known failure modes (e.g., "Claude is terrible at estimating task duration — never trust it")

Agents traverse this knowledge base to inform decisions, and after each task, learnings are fed back into the system. The knowledge base is much faster to traverse than raw code, making it a cheap source of truth for future agents.
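Traversing a linked knowledge base can be as simple as a breadth-first walk over markdown links from an index note. The sketch below assumes standard `[text](file.md)` links and an in-memory dict of note contents; the file names and note text are hypothetical, and the source does not describe the real link format.

```python
import re
from collections import deque

# Matches [label](target.md) and captures the target file name.
LINK = re.compile(r"\[[^\]]*\]\(([^)]+\.md)\)")


def reachable_notes(notes, start):
    """Breadth-first walk over markdown links; returns notes in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        name = queue.popleft()
        order.append(name)
        for target in LINK.findall(notes.get(name, "")):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


notes = {
    "index.md": "See [patterns](patterns.md) and [failures](failures.md).",
    "patterns.md": "Conventions live here. Back to [index](index.md).",
    "failures.md": "Never trust duration estimates. See [patterns](patterns.md).",
    "orphan.md": "Nothing links here.",
}
visited = reachable_notes(notes, "index.md")
```

A walk like this touches a handful of short files instead of the whole codebase. It also surfaces notes that nothing links to, which an agent would never find.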

"This presentation was made the same way I do tickets: I told Claude what to present, it looked at our knowledge base, generated the slides, I gave ~15 edits, and then I fed everything back into the knowledge base so future Claude instances would be better."

🎯 Key Principles for Agentic Workflows

Run everything in the cloud: Portability is critical when work needs to move between developers or scale to different compute.
Dangerously skip permissions mode: If you're giving feedback at a regular pace, you're going too slow. Automate approvals wherever possible (in sandboxed environments).
Macro by default, micro when it counts: You can't win an RTS by perfectly microing one unit if you didn't build an army. Spawn lots of tasks; tunnel vision only on critical blockers.
Audio and visual cues: Color-coded tabs, themed sound effects, and APM trackers help manage cognitive load across many parallel threads.
Use tokens aggressively: Never let your Claude tokens sit idle. Spend your resources; an economy that hoards them is an inefficient one.
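An APM tracker of the kind mentioned above can be a sliding-window counter over tool-call timestamps. This is a minimal sketch with hypothetical names; timestamps are passed in explicitly (in seconds) so the behavior is easy to check, and the 60-second window is an assumption.

```python
from collections import deque


class ApmTracker:
    """Counts tool calls inside a sliding time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.calls = deque()

    def record(self, timestamp):
        """Log one tool call at the given time (seconds)."""
        self.calls.append(timestamp)

    def per_minute(self, now):
        """Calls per minute over the most recent window."""
        while self.calls and self.calls[0] <= now - self.window:
            self.calls.popleft()  # drop calls that fell out of the window
        return len(self.calls) * 60.0 / self.window


tracker = ApmTracker()
for t in (0.0, 10.0, 50.0, 65.0):
    tracker.record(t)
current = tracker.per_minute(now=70.0)
```

A reading that drops to zero for one agent is the signal the dashboard and sound cues are meant to surface: that agent is idle or blocked and needs attention.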

🔮 Open Questions and Future Directions

Several themes emerged across the presentations:

Memory architectures: From Mem0 to recursive language models, how do we build systems that improve continuously with each new sample?
Intelligence per sample vs. intelligence per watt: As models scale, efficiency becomes paramount. Smaller models may outperform larger ones on power-adjusted metrics.
The F-minus-H problem: If we train on human-generated data (H), can test-time compute and self-improvement ever reach the full solution space (F)? Or will we always be limited to the "typical set" explored by humans?
Alternatives to backpropagation: The brain likely doesn't compute weight matrix transposes. What other learning procedures might unlock better sample efficiency?

📢 Call for Contributions

The AI research club is seeking presentations on:

Novel memory architectures and dynamic chunking
Alternatives to backprop (SPSA, etc.)
Robotics, speech, and multimodal breakthroughs
"Unhinged founder hacks" for applied AI systems

Ideas for club improvements include lightning rounds, shared benchmarks, collaborative open-source projects, and challenge competitions.

The common thread? Whether in biology, math, coding, or voice interfaces, the frontier is moving from scaling compute on fixed tasks to scaling learning itself — building systems that generate their own training signal, verify their own correctness, and improve continuously without human bottlenecks. The era of verified, agentic, self-improving AI is here.
