Watch a neural net discover molecules by arguing with itself.
An RL agent generates SMILES strings one token at a time. Every episode, two clones train side-by-side — one cautious, one curious. The clone whose own critic learned more this round wins, and its weights carry forward. No human picks the explore-vs-exploit knob. The mechanism does, episode by episode.
Training console
Pick a target. Hit run. The duel begins. Charts and 3D samples update live.
Top molecules so far
Ranked by best Tanimoto similarity to the target, deduplicated so the same SMILES never appears twice.
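One way to get that deduplication is to key the leaderboard by canonical SMILES, so two different strings for the same molecule collapse into one row. A minimal sketch with RDKit; the leaderboard dict and the helper name are illustrative, not the app's actual code:

```python
from rdkit import Chem

def add_to_leaderboard(leaderboard: dict, smiles: str, similarity: float) -> None:
    """Keep only the best Tanimoto score per canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # skip invalid strings
        return
    canon = Chem.MolToSmiles(mol)        # canonical form: duplicates map to the same key
    leaderboard[canon] = max(leaderboard.get(canon, 0.0), similarity)
```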
The epiplexity duel, in one minute
Two clones, two temperatures
At the start of every episode, the base policy is forked into clone A (low entropy bonus) and clone B (high entropy bonus). Both see the same environment seed, so any difference in behavior comes purely from the entropy bonus.
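A minimal sketch of the fork, assuming a PyTorch-style policy and Adam optimizer; the entropy coefficients, learning rate, and the fork_clones helper are illustrative assumptions, not the app's actual values:

```python
import copy
import torch

ENT_COEF_A = 0.001   # cautious clone A: small entropy bonus (illustrative)
ENT_COEF_B = 0.05    # curious clone B: large entropy bonus (illustrative)

def fork_clones(base_policy, base_optimizer_state):
    """Fork the base policy into two clones that differ only in the entropy bonus."""
    clones = []
    for ent_coef in (ENT_COEF_A, ENT_COEF_B):
        policy = copy.deepcopy(base_policy)                        # identical weights
        opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
        opt.load_state_dict(copy.deepcopy(base_optimizer_state))   # identical optimizer state
        clones.append({"policy": policy, "optimizer": opt, "ent_coef": ent_coef})
    return clones
```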
Both train an episode
Each clone runs a standard A2C update with an episodic-memory transformer + LSTM policy. During the update we measure epiplexity: the drop in value loss on the same batch, evaluated before and after the optimizer step. It's how much the critic actually learned from this episode's data.
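The measurement itself is cheap: evaluate the critic on the batch, take the optimizer step, evaluate again. A sketch assuming a PyTorch A2C setup, where policy.evaluate, the batch fields, and the 0.5 value-loss weight are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def a2c_update_with_epiplexity(policy, optimizer, batch, ent_coef):
    """One A2C step; returns the drop in same-batch value loss across the step."""
    values, log_probs, entropy = policy.evaluate(batch.obs, batch.actions)
    advantages = batch.returns - values.detach()

    value_loss_before = F.mse_loss(values, batch.returns)
    policy_loss = -(log_probs * advantages).mean() - ent_coef * entropy.mean()
    loss = policy_loss + 0.5 * value_loss_before

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Re-evaluate the critic on the *same* batch after the step.
    with torch.no_grad():
        values_after, _, _ = policy.evaluate(batch.obs, batch.actions)
        value_loss_after = F.mse_loss(values_after, batch.returns)

    # Epiplexity: how much the critic's fit to this batch improved.
    return (value_loss_before - value_loss_after).item()
```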
Winner-takes-all
Whichever clone scores the higher epiplexity becomes the new base policy; its weights and optimizer state carry forward. Early on, the curious clone B tends to win (you need exploration to find any signal at all). Later, the cautious clone A wins (once there is signal, exploit it).
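Put together, the outer loop is a per-episode tournament. The sketch below reuses the hypothetical fork_clones and a2c_update_with_epiplexity helpers from the earlier sketches; run_episode (a rollout collector) and the env interface are also assumptions:

```python
def duel_step(base_policy, base_opt_state, env, seed):
    """One episode of the duel: fork, train both clones, keep whichever learned more."""
    clones = fork_clones(base_policy, base_opt_state)
    results = []
    for clone in clones:
        env.reset(seed=seed)                                # same env seed for both clones
        batch = run_episode(env, clone["policy"])           # hypothetical rollout collector
        epi = a2c_update_with_epiplexity(
            clone["policy"], clone["optimizer"], batch, clone["ent_coef"]
        )
        results.append((epi, clone))

    _, winner = max(results, key=lambda r: r[0])            # winner-takes-all on epiplexity
    return winner["policy"], winner["optimizer"].state_dict()
```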
Why this matters
On dense rewards (e.g. QED + SA) the duel rarely beats the baseline. On sparse rewards, like Tanimoto similarity to aspirin, where ~99% of episodes return roughly zero, the duel auto-tunes its own exploration schedule. No human picks ε. The critic's own learning progress picks it.
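For concreteness, that sparse reward can be as simple as the Tanimoto similarity between Morgan fingerprints of the generated molecule and aspirin, with invalid SMILES scoring zero, which is why most episodes do. A sketch with RDKit; the fingerprint radius and bit count are illustrative choices:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

ASPIRIN = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
ASPIRIN_FP = AllChem.GetMorganFingerprintAsBitVect(ASPIRIN, 2, nBits=2048)

def sparse_reward(smiles: str) -> float:
    """Tanimoto similarity to aspirin; invalid SMILES get 0, so most episodes return ~0."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp, ASPIRIN_FP)
```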