meta-rl-chemistry
epiplexity-duel · in your browser
SPARSE-REWARD RL · IN-BROWSER TRAINING

Watch a neural net discover molecules by arguing with itself.

An RL agent generates SMILES strings one token at a time. Every episode, two clones train side by side: one cautious, one curious. The clone whose own critic learned more this round wins, and its weights carry forward. No human picks the explore-vs-exploit knob. The mechanism does, episode by episode.

~75k params · parallel envs · TF.js in browser · RDKit for Tanimoto
Acetic acid · CC(=O)O

Training console

Pick a target. Hit run. The duel begins. Charts and 3D samples update live.

Live metrics

Legend: A (low entropy) vs. B (high entropy). Counters track episode, rolling reward (20-ep), best Tanimoto, wins A / B, validity rate, and episodes / sec. Charts plot mean reward per episode, cumulative wins (A vs B), epiplexity per episode (A vs B), and rolling win rate (30-ep).

Top molecules so far

Ranked by Tanimoto similarity to the target, and deduplicated: the same SMILES never appears twice.

Start training to populate this grid.
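
One plausible way to keep that guarantee, sketched below (the demo's actual bookkeeping isn't shown on this page): a map keyed by SMILES that stores only the best score per molecule.

```ts
// Hypothetical dedup bookkeeping for the grid: one entry per SMILES
// string, keeping only the best Tanimoto score seen for that molecule.
const bestBySmiles = new Map<string, number>();

function recordMolecule(smiles: string, tanimotoScore: number): void {
  const prev = bestBySmiles.get(smiles);
  if (prev === undefined || tanimotoScore > prev) {
    bestBySmiles.set(smiles, tanimotoScore);
  }
}
```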

The epiplexity duel, in one minute

1. Two clones, two temperatures

At the start of every episode, the base policy is forked into clone A (low entropy bonus) and clone B (high entropy bonus). Both see the same env seed, so any difference in behavior comes purely from the entropy knob.
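
A minimal sketch of the fork in TypeScript on TF.js, the stack this demo runs in. `buildPolicy` is a hypothetical factory for the network, and the two entropy coefficients are illustrative, not the demo's real values:

```ts
import * as tf from "@tensorflow/tfjs";

// Illustrative entropy coefficients; the demo's real values aren't shown here.
const ENTROPY_A = 0.001; // clone A: cautious
const ENTROPY_B = 0.05;  // clone B: curious

interface Clone {
  net: tf.LayersModel;
  entropyCoef: number;
}

// Fork the base policy into two clones that start with identical weights
// and differ only in their entropy coefficient. `buildPolicy` is a
// hypothetical factory that rebuilds the demo's transformer + LSTM net.
function forkClones(
  base: tf.LayersModel,
  buildPolicy: () => tf.LayersModel,
): [Clone, Clone] {
  const weights = base.getWeights();
  const a = buildPolicy();
  const b = buildPolicy();
  a.setWeights(weights.map(w => w.clone()));
  b.setWeights(weights.map(w => w.clone()));
  return [
    { net: a, entropyCoef: ENTROPY_A },
    { net: b, entropyCoef: ENTROPY_B },
  ];
}
```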

2. Both train an episode

Each clone runs a standard A2C update with an episodic-memory transformer + LSTM policy. During the update we measure epiplexity: the drop in value loss on the same batch, before vs. after the optimizer step. It's how much the critic actually learned this episode.
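
A sketch of that probe, with hypothetical `valueLoss` and `a2cStep` standing in for the demo's critic loss and A2C update:

```ts
import * as tf from "@tensorflow/tfjs";

// Epiplexity for one clone: value loss on the same batch, measured before
// and after a single optimizer step. `valueLoss` and `a2cStep` are
// hypothetical stand-ins for the demo's critic loss and A2C update.
function measureEpiplexity<B>(
  net: tf.LayersModel,
  batch: B,
  valueLoss: (net: tf.LayersModel, batch: B) => tf.Scalar,
  a2cStep: (net: tf.LayersModel, batch: B) => void,
): number {
  const before = tf.tidy(() => valueLoss(net, batch).dataSync()[0]);
  a2cStep(net, batch); // actor + critic update, including this clone's entropy bonus
  const after = tf.tidy(() => valueLoss(net, batch).dataSync()[0]);
  return before - after; // positive = the critic got better on this very batch
}
```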

3. Winner-takes-all

Whichever clone scored higher epiplexity becomes the new base policy. Its weights and optimizer state propagate. Early on, the curious B clone tends to win (you need exploration to find any signal). Later, the cautious A clone wins (once there's signal, exploiting it teaches the critic more).
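
Promotion then reduces to a weight copy, sketched here with the optimizer-state transfer elided:

```ts
import * as tf from "@tensorflow/tfjs";

// Winner-takes-all promotion: the clone whose critic learned more becomes
// the new base. Copying the optimizer state across works the same way and
// is elided here. The >= tie-break is an arbitrary choice in this sketch.
function promoteWinner(
  base: tf.LayersModel,
  a: { net: tf.LayersModel; epiplexity: number },
  b: { net: tf.LayersModel; epiplexity: number },
): "A" | "B" {
  const winner = a.epiplexity >= b.epiplexity ? a : b;
  base.setWeights(winner.net.getWeights().map(w => w.clone()));
  return winner === a ? "A" : "B";
}
```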

4. Why this matters

On a dense reward (e.g. QED + SA) the duel rarely beats the baseline. On a sparse reward, like Tanimoto-to-aspirin, where 99% of episodes return ~0, the duel auto-tunes its own exploration schedule. No human picks ε; the critic's own learning progress picks it.
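
To make the sparsity concrete, here's a sketch under one plausible reading of the reward: a hypothetical `fingerprint` helper (e.g. RDKit Morgan bits, matching the stack listed above) returns null on invalid SMILES, which score zero; valid molecules score their Tanimoto similarity to the target:

```ts
// Plain Tanimoto over fingerprint bit sets: |A ∩ B| / |A ∪ B|.
function tanimoto(a: Set<number>, b: Set<number>): number {
  let shared = 0;
  for (const bit of a) if (b.has(bit)) shared++;
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Sparse episode reward: invalid SMILES (the overwhelming majority early
// in training) score 0; valid molecules score their Tanimoto similarity
// to the target. `fingerprint` is a hypothetical helper (e.g. RDKit
// Morgan bits) that returns null when the SMILES fails to parse.
function episodeReward(
  smiles: string,
  targetFp: Set<number>,
  fingerprint: (s: string) => Set<number> | null,
): number {
  const fp = fingerprint(smiles);
  return fp === null ? 0 : tanimoto(fp, targetFp);
}
```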