meta-rl-chemistry
epiplexity-duel · in your browser
SPARSE-REWARD RL · IN-BROWSER TRAINING

Watch a neural net discover molecules by arguing with itself.

An RL agent generates SMILES strings one token at a time. Every episode, two clones train side by side: one cautious, one curious. The clone whose own critic learned more this round wins, and its weights carry forward. No human picks the explore-vs-exploit knob. The mechanism does, episode by episode.

~75k params · parallel envs · TF.js in browser · RDKit for Tanimoto
Acetic acid · CC(=O)O

Training console

Pick a target. Hit run. The duel begins. Charts and 3D samples update live.

Live metrics

Legend: A (low entropy) vs. B (high entropy). Counters track episode, rolling reward (20-ep), best Tanimoto, wins A / B, validity rate, and episodes / sec. Charts plot mean reward per episode, cumulative wins (A vs B), epiplexity per episode (A vs B), and rolling win rate (30-ep).

Top molecules so far

Ranked by Tanimoto similarity to the target, and deduplicated: the same SMILES never appears twice.

Start training to populate this grid.
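
One plausible way to keep that guarantee, sketched below (the demo's actual bookkeeping isn't shown on this page): a map keyed by SMILES that stores only the best score per molecule.

```ts
// Hypothetical dedup bookkeeping for the grid: one entry per SMILES
// string, keeping only the best Tanimoto score seen for that molecule.
const bestBySmiles = new Map<string, number>();

function recordMolecule(smiles: string, tanimotoScore: number): void {
  const prev = bestBySmiles.get(smiles);
  if (prev === undefined || tanimotoScore > prev) {
    bestBySmiles.set(smiles, tanimotoScore);
  }
}
```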

The epiplexity duel, in one minute

1. Two clones, two temperatures

At the start of every episode, the base policy is forked into clone A (low entropy bonus) and clone B (high entropy bonus). Both see the same env seed, so any difference in behavior comes purely from the entropy knob.
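
A minimal sketch of the fork in TypeScript on TF.js, the stack this demo runs in. `buildPolicy` is a hypothetical factory for the network, and the two entropy coefficients are illustrative, not the demo's real values:

```ts
import * as tf from "@tensorflow/tfjs";

// Illustrative entropy coefficients; the demo's real values aren't shown here.
const ENTROPY_A = 0.001; // clone A: cautious
const ENTROPY_B = 0.05;  // clone B: curious

interface Clone {
  net: tf.LayersModel;
  entropyCoef: number;
}

// Fork the base policy into two clones that start with identical weights
// and differ only in their entropy coefficient. `buildPolicy` is a
// hypothetical factory that rebuilds the demo's transformer + LSTM net.
function forkClones(
  base: tf.LayersModel,
  buildPolicy: () => tf.LayersModel,
): [Clone, Clone] {
  const weights = base.getWeights();
  const a = buildPolicy();
  const b = buildPolicy();
  a.setWeights(weights.map(w => w.clone()));
  b.setWeights(weights.map(w => w.clone()));
  return [
    { net: a, entropyCoef: ENTROPY_A },
    { net: b, entropyCoef: ENTROPY_B },
  ];
}
```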

2. Both train an episode

Each clone runs a standard A2C update with an episodic-memory transformer + LSTM policy. During the update we measure epiplexity: the drop in value loss on the same batch, before vs. after the optimizer step. It's how much the critic actually learned this episode.
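
A sketch of that probe, with hypothetical `valueLoss` and `a2cStep` standing in for the demo's critic loss and A2C update:

```ts
import * as tf from "@tensorflow/tfjs";

// Epiplexity for one clone: value loss on the same batch, measured before
// and after a single optimizer step. `valueLoss` and `a2cStep` are
// hypothetical stand-ins for the demo's critic loss and A2C update.
function measureEpiplexity<B>(
  net: tf.LayersModel,
  batch: B,
  valueLoss: (net: tf.LayersModel, batch: B) => tf.Scalar,
  a2cStep: (net: tf.LayersModel, batch: B) => void,
): number {
  const before = tf.tidy(() => valueLoss(net, batch).dataSync()[0]);
  a2cStep(net, batch); // actor + critic update, including this clone's entropy bonus
  const after = tf.tidy(() => valueLoss(net, batch).dataSync()[0]);
  return before - after; // positive = the critic got better on this very batch
}
```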

3. Winner-takes-all

Whichever clone scored higher epiplexity becomes the new base policy. Its weights and optimizer state propagate. Early on, the curious B clone tends to win (you need exploration to find any signal). Later, the cautious A clone wins (once there's signal, exploiting it teaches the critic more).
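
Promotion then reduces to a weight copy, sketched here with the optimizer-state transfer elided:

```ts
import * as tf from "@tensorflow/tfjs";

// Winner-takes-all promotion: the clone whose critic learned more becomes
// the new base. Copying the optimizer state across works the same way and
// is elided here. The >= tie-break is an arbitrary choice in this sketch.
function promoteWinner(
  base: tf.LayersModel,
  a: { net: tf.LayersModel; epiplexity: number },
  b: { net: tf.LayersModel; epiplexity: number },
): "A" | "B" {
  const winner = a.epiplexity >= b.epiplexity ? a : b;
  base.setWeights(winner.net.getWeights().map(w => w.clone()));
  return winner === a ? "A" : "B";
}
```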

4. Why this matters

On a dense reward (e.g. QED + SA) the duel rarely beats the baseline. On a sparse reward, like Tanimoto-to-aspirin, where 99% of episodes return ~0, the duel auto-tunes its own exploration schedule. No human picks ε; the critic's own learning progress picks it.
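
To make the sparsity concrete, here's a sketch under one plausible reading of the reward: a hypothetical `fingerprint` helper (e.g. RDKit Morgan bits, matching the stack listed above) returns null on invalid SMILES, which score zero; valid molecules score their Tanimoto similarity to the target:

```ts
// Plain Tanimoto over fingerprint bit sets: |A ∩ B| / |A ∪ B|.
function tanimoto(a: Set<number>, b: Set<number>): number {
  let shared = 0;
  for (const bit of a) if (b.has(bit)) shared++;
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Sparse episode reward: invalid SMILES (the overwhelming majority early
// in training) score 0; valid molecules score their Tanimoto similarity
// to the target. `fingerprint` is a hypothetical helper (e.g. RDKit
// Morgan bits) that returns null when the SMILES fails to parse.
function episodeReward(
  smiles: string,
  targetFp: Set<number>,
  fingerprint: (s: string) => Set<number> | null,
): number {
  const fp = fingerprint(smiles);
  return fp === null ? 0 : tanimoto(fp, targetFp);
}
```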