Self-Play Experiments

From Simple Browser Toys to Dedicated Systems

After diving into self-play research, here are six experiment ideas - three simple enough to build in an afternoon, three challenging enough to require multiple sessions, external tools, or dedicated compute.

They're ordered by complexity, but they're all feasible. The simple ones capture the core dynamic. The challenging ones push toward real emergence.

Simple Experiments (Code in One Session)

1. Rock Paper Scissors Evolution
Two agents play rock-paper-scissors and adapt their strategy based on opponent patterns. Visualize how each agent learns to counter the other's tendencies, leading to cycling strategies or mixed equilibria.
  • Each agent maintains probability distribution over R/P/S
  • After each round, increase probability of move that would have won
  • Every 50 rounds, agents update their models
  • Run 1000+ games and show strategy evolution over time
  • Real-time chart showing each agent's R/P/S probabilities
  • Running win/loss/tie counts
  • Heatmap of move patterns over time
  • Controls to reset, speed up/slow down, or pause
Stack: pure JavaScript, Canvas/SVG, ~200 lines
Why it's interesting: You'll see agents develop counter-strategies, over-adapt, then get exploited, leading to cyclical behavior. Eventually they might converge to Nash equilibrium (33/33/33 random) or get stuck in patterns. Shows the arms race dynamic clearly.
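The adaptation rule above can be sketched in a few dozen lines. This is an illustrative Python version (the experiment itself calls for JavaScript); the learning rate of 0.1 and the move-toward-counter update are assumptions about one reasonable way to implement "increase probability of the move that would have won":

```python
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
COUNTER = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class Agent:
    def __init__(self, lr=0.1):
        # Start from the uniform mixed strategy.
        self.probs = {m: 1 / 3 for m in MOVES}
        self.lr = lr

    def play(self):
        # Sample a move from the current probability distribution.
        r, cum = random.random(), 0.0
        for m in MOVES:
            cum += self.probs[m]
            if r <= cum:
                return m
        return MOVES[-1]

    def update(self, opponent_move):
        # Shift weight toward the move that would have beaten the opponent.
        # The update preserves sum(probs) == 1 because the targets sum to 1.
        winner = COUNTER[opponent_move]
        for m in MOVES:
            target = 1.0 if m == winner else 0.0
            self.probs[m] += self.lr * (target - self.probs[m])

def run(rounds=1000):
    a, b = Agent(), Agent()
    wins = {"a": 0, "b": 0, "tie": 0}
    for _ in range(rounds):
        ma, mb = a.play(), b.play()
        if ma == mb:
            wins["tie"] += 1
        elif BEATS[ma] == mb:
            wins["a"] += 1
        else:
            wins["b"] += 1
        a.update(mb)
        b.update(ma)
    return a, b, wins
```

Sampling `a.probs` every few rounds gives exactly the time series the real-time chart would plot.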
2. Number Guessing Game
Both agents pick numbers 1-100. Whoever picks the higher number wins UNLESS they're more than 10 apart, in which case the lower number wins (punishing greed). Watch strategies emerge around risk/reward balance.
  • Start with agents picking randomly from 1-100
  • After each game, shift probability toward winning strategy
  • If you won by going high, prefer higher numbers next time
  • If you lost by going too high, prefer lower numbers
  • Simple reinforcement: winning moves get +0.1 probability weight
  • Histogram of number choices over time
  • Average number picked by each agent (100-game rolling window)
  • Win rate oscillations as strategies adapt
  • Eventually: convergence to a stable range (probably 40-60)
Stack: JavaScript with Chart.js, ~250 lines
Why it's interesting: The non-linear reward structure creates an interesting optimization landscape. You'll see agents oscillate between greed and caution, and eventually find the game-theoretic sweet spot. Beautiful example of discovering optimal strategy through interaction.
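The win rule and the +0.1 reinforcement can be sketched directly. A minimal Python sketch (the weight-per-number representation is one assumption about how to implement "shift probability toward the winning strategy"):

```python
import random

LO, HI = 1, 100

class Guesser:
    def __init__(self):
        # One weight per number; picks are sampled proportionally to weight.
        self.weights = [1.0] * (HI - LO + 1)

    def pick(self):
        return random.choices(range(LO, HI + 1), weights=self.weights)[0]

    def reinforce(self, number, amount=0.1):
        # Winning numbers get +0.1 weight, biasing future picks toward them.
        self.weights[number - LO] += amount

def winner(x, y):
    """Return 0 if x wins, 1 if y wins, None on a tie."""
    if x == y:
        return None
    if abs(x - y) > 10:           # greed is punished: the lower number wins
        return 0 if x < y else 1
    return 0 if x > y else 1      # otherwise the higher number wins

def run(games=1000):
    agents = [Guesser(), Guesser()]
    for _ in range(games):
        picks = [a.pick() for a in agents]
        w = winner(*picks)
        if w is not None:
            agents[w].reinforce(picks[w])
    return agents
```

Binning each agent's `weights` after a run gives the histogram described above, and a rolling mean of the picks gives the 100-game average line.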
3. Grid World Predator-Prey
Simple 10x10 grid. Red agent (predator) tries to catch blue agent (prey). Prey tries to maximize distance. Both learn through trial and error. Surprisingly complex strategies emerge even in this tiny world.
  • Q-learning for both agents (simple lookup table)
  • State: the prey's position relative to the predator, quantized into 9 buckets (N, NE, E, SE, S, SW, W, NW, same cell)
  • Actions: move up/down/left/right or stay
  • Rewards: predator gets +10 for catch, prey gets +1 per turn survived
  • Epsilon-greedy exploration (90% exploit learned policy, 10% random)
  • Live grid showing agents moving
  • Trail lines showing recent paths
  • Stats: average chase length, catch rate over time
  • Heatmap of where prey tends to flee
  • Speed controls to watch learning happen
Stack: JavaScript, Canvas, tabular Q-learning, ~300 lines
Why it's interesting: Early on, both agents move randomly. Then predator learns basic chase behavior. Then prey learns to use walls and corners. Then predator learns to cut off corners. Co-evolution in real-time. You'll see genuine strategy emergence.
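The tabular Q-learning core is small enough to show in full. An illustrative Python sketch, assuming the 9-bucket relative state, the rewards listed above, and standard one-step Q-learning (alpha, gamma, and starting positions are assumptions):

```python
import random
from collections import defaultdict

SIZE = 10
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # up, down, right, left, stay

def sign(v):
    return (v > 0) - (v < 0)

def rel_state(pred, prey):
    # Quantize the prey's relative position into one of 9 direction buckets.
    return (sign(prey[0] - pred[0]), sign(prey[1] - pred[1]))

def move(pos, action):
    # Apply an action, clamping to the grid (walls matter for prey strategy).
    dx, dy = ACTIONS[action]
    return (min(SIZE - 1, max(0, pos[0] + dx)),
            min(SIZE - 1, max(0, pos[1] + dy)))

class QAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action_index) -> value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:        # explore 10% of the time
            return random.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[(state, a)])

    def learn(self, s, a, reward, s_next):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(s_next, a2)] for a2 in range(len(ACTIONS)))
        td_target = reward + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (td_target - self.q[(s, a)])

def episode(pred, prey, max_steps=200):
    pred_pos, prey_pos = (0, 0), (SIZE - 1, SIZE - 1)
    for step in range(max_steps):
        s = rel_state(pred_pos, prey_pos)
        ap, ay = pred.act(s), prey.act(s)
        pred_pos, prey_pos = move(pred_pos, ap), move(prey_pos, ay)
        s2 = rel_state(pred_pos, prey_pos)
        caught = pred_pos == prey_pos
        pred.learn(s, ap, 10.0 if caught else 0.0, s2)   # +10 for the catch
        prey.learn(s, ay, 0.0 if caught else 1.0, s2)    # +1 per turn survived
        if caught:
            return step + 1
    return max_steps
```

Tracking the value returned by `episode` over thousands of runs gives the average-chase-length curve; logging `prey_pos` gives the flee heatmap.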

Challenging Experiments (Multi-Session, External Tools, or Dedicated Compute)

4. LLM Debate Arena
Two LLM instances debate a question. Third LLM judges which argument was more persuasive. Winners' argument styles get reinforced. Watch rhetoric strategies evolve across hundreds of debates. This requires API access and careful prompt engineering.
  • Debater A & B: Claude or GPT-4 instances with system prompts defining their "style"
  • Judge: Separate LLM instance that evaluates arguments on clarity, evidence, logic
  • Question Bank: 50+ debate topics (philosophical, factual, ethical)
  • Style Evolution: After every 10 debates, analyze winning arguments and update system prompts
  • Backend: Node.js server to orchestrate debates, store results, evolve prompts
  • Extract features from winning arguments (length, structure, use of examples, rhetorical devices)
  • Update system prompts to reinforce successful strategies
  • Maintain "prompt genome" - parameters that evolve (assertiveness, evidence density, emotional appeal)
  • Cross-breed successful strategies between agents occasionally
  • Live debate viewer showing arguments in real-time
  • Strategy evolution chart (how prompt parameters change over generations)
  • Win rate over time for each agent
  • Word clouds of most successful argument patterns
  • Archive of best debates
Stack: Node.js backend, LLM API (Anthropic/OpenAI), React frontend, database (Postgres/Supabase), ~1000 lines, $50-200 in API costs
Why it's interesting: This is self-play in language space. You're not training the model weights (that requires massive compute) but evolving the prompts and strategies. Will one agent learn to use more analogies? More emotional appeals? Concise logic? The meta-learning is what's beautiful - LLMs learning to instruct themselves better.
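The "prompt genome" idea is the heart of this one, and it can be sketched without any API calls. A hypothetical Python version (the trait names, the mutation scheme, and the prompt template are all illustrative assumptions; the real system would render these into the debaters' system prompts):

```python
import random

# Hypothetical "prompt genome": numeric knobs rendered into a system prompt.
TRAITS = ["assertiveness", "evidence_density", "emotional_appeal"]

def random_genome():
    return {t: random.random() for t in TRAITS}

def render_prompt(genome):
    # Turn numeric traits into natural-language style instructions.
    levels = {t: ("low" if v < 0.33 else "moderate" if v < 0.66 else "high")
              for t, v in genome.items()}
    return ("You are a debater. Argue with {assertiveness} assertiveness, "
            "{evidence_density} evidence density, and {emotional_appeal} "
            "emotional appeal.").format(**levels)

def mutate(genome, sigma=0.1):
    # Gaussian jitter on each trait, clipped to [0, 1]: the "style evolution"
    # step applied after each batch of debates.
    return {t: min(1.0, max(0.0, v + random.gauss(0, sigma)))
            for t, v in genome.items()}

def crossbreed(a, b):
    # Uniform crossover: occasionally mix two successful strategies.
    return {t: random.choice([a[t], b[t]]) for t in TRAITS}
```

The evolution loop would then be: run 10 debates, keep the winner's genome, replace the loser's with a mutated or cross-bred variant, and re-render both system prompts.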
5. Evolving Neural Net Strategy Game
Custom turn-based game (like simplified chess or tic-tac-toe variant). Agents use small neural networks (trained with reinforcement learning) that evolve through self-play. This requires ML infrastructure but achieves real "learning" in the AlphaZero sense.
  • 5x5 grid, two players, take turns placing stones
  • Capture territory by surrounding empty spaces (like Go but simpler)
  • Game ends when board fills or both players pass
  • Score = territory controlled
  • Simple enough to learn quickly, complex enough for strategy
  • Input: 5x5x2 tensor (player positions + opponent positions)
  • Hidden: Two dense layers (128 units each)
  • Output: 25 probabilities (one per grid position)
  • Small enough to train on laptop GPU in reasonable time
  • Initialize network with random weights
  • Play 1000 self-play games using current network
  • Store (state, action, reward) tuples from all games
  • Train network on this data (supervised learning: predict good moves)
  • Every 5000 games, save checkpoint and test against old versions
  • Run for 50,000+ games to see real improvement
  • Training: Python + PyTorch/TensorFlow
  • Game Engine: Python class with numpy
  • Web Interface: Flask backend serving the trained model
  • Frontend: React + Canvas for game visualization
  • Play against it: Load trained model in browser via TensorFlow.js
Stack: Python + PyTorch, Flask API, React frontend, TensorFlow.js, ~800 lines; local GPU helpful
Why it's interesting: This is real neural net learning through self-play, not just rule-based adaptation. You'll watch a network go from random moves to coherent strategy in a few hours of training. You can play against it yourself and see it improve. It's a mini-AlphaZero. The visualizations of what the network "sees" as good moves are beautiful.
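The self-play data-collection step (play games with the current policy, store (state, action, reward) tuples) can be sketched in plain Python. This is a deliberately simplified version: scoring is reduced to stone count rather than real surround-capture territory, passing is omitted, and the network is abstracted behind a `policy` callable, so a random policy stands in for the untrained net:

```python
import random

SIZE = 5

def legal_moves(board):
    # Board is a flat list of 25 cells: 0 empty, 1 or 2 for each player.
    return [i for i in range(SIZE * SIZE) if board[i] == 0]

def score(board, player):
    # Simplified scoring: count stones placed (assumption; the real game
    # would score surrounded territory, Go-style).
    return sum(1 for v in board if v == player)

def self_play_game(policy):
    """Play one game with `policy(board, player) -> move index`, returning
    the (state, action, reward) tuples the training step would consume."""
    board = [0] * (SIZE * SIZE)
    history = []                      # (board snapshot, player, move)
    player = 1
    while legal_moves(board):
        move = policy(board, player)
        history.append((board[:], player, move))
        board[move] = player
        player = 3 - player           # alternate between players 1 and 2
    s1, s2 = score(board, 1), score(board, 2)
    win = 1 if s1 > s2 else 2 if s2 > s1 else 0
    # +1 reward for the winner's moves, -1 for the loser's, 0 on a draw.
    return [(state, mv, 0 if win == 0 else (1 if p == win else -1))
            for state, p, mv in history]

def random_policy(board, player):
    return random.choice(legal_moves(board))
```

The training loop would call `self_play_game` a thousand times, batch the tuples, and fit the 5x5x2-input network to prefer the moves that received +1 reward.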
6. Multi-Agent Trading Simulation
10 agents trading a simple asset in a simulated market. Each agent learns pricing and timing strategies through self-play against the others. Emergent behaviors: market making, momentum trading, mean reversion. Watch mini "market crashes" and recoveries develop organically.
  • Single asset with intrinsic random walk value (baseline price)
  • 10 agents, each starts with $1000 cash + 10 shares
  • Each round: agents can submit bids/asks or hold
  • Simple order matching: highest bid meets lowest ask if they cross
  • Agents see: current price, recent price history (10 steps), their own inventory
  • Goal: maximize total wealth (cash + shares * price) at end
  • Each agent uses reinforcement learning (PPO or DQN)
  • State: price history, inventory, recent volatility
  • Actions: submit bid at price X, submit ask at price Y, hold
  • Reward: change in total wealth each round
  • Diversity: initialize agents with slightly different risk parameters
  • Train for 10,000+ rounds, save checkpoints
  • Market makers: Agents that profit by providing liquidity (bid-ask spread)
  • Momentum traders: Agents that buy when price rises, sell when it falls
  • Mean reversion: Agents that buy dips, sell peaks
  • Herding: All agents moving same direction, causing crashes/bubbles
  • Adaptation: When momentum works, others copy it; when it fails, they switch
  • Price chart with intrinsic value overlay (to see deviations)
  • Order book depth visualization (bids vs asks)
  • Agent inventory and wealth over time
  • Strategy classifier: attempt to label each agent's learned strategy
  • Volatility and volume metrics
  • Playback controls to rewatch interesting episodes
Stack: Python + Stable-Baselines3 (RL), FastAPI backend, React + D3.js frontend, Redis for state, ~1500 lines; multi-day training
Why it's interesting: Financial markets are the ultimate self-play environment - traders learning from and adapting to each other. This captures that dynamic. You'll see agents discover strategies humans use (market making, momentum, mean reversion) without being told about them. The crashes and bubbles that emerge from pure self-interested learning are eerie and beautiful. It's an ecosystem.
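The order-matching core ("highest bid meets lowest ask if they cross") is simple enough to sketch up front, independent of the RL layer. A minimal Python version, assuming one-share orders and midpoint execution (one settlement convention among several):

```python
class Trader:
    def __init__(self, cash=1000.0, shares=10):
        self.cash, self.shares = cash, shares

    def wealth(self, price):
        # Total wealth = cash plus shares marked at the current price.
        return self.cash + self.shares * price

def match_orders(bids, asks):
    """Match the highest bid against the lowest ask while they cross.
    Orders are (agent_id, price) tuples; each fill is for one share at
    the midpoint of the crossing prices."""
    bids = sorted(bids, key=lambda o: -o[1])   # best (highest) bid first
    asks = sorted(asks, key=lambda o: o[1])    # best (lowest) ask first
    trades = []
    while bids and asks and bids[0][1] >= asks[0][1]:
        buyer, bid = bids.pop(0)
        seller, ask = asks.pop(0)
        trades.append((buyer, seller, (bid + ask) / 2))
    return trades

def settle(traders, trades):
    # Move cash and shares; totals across all traders are conserved.
    for buyer, seller, price in trades:
        traders[buyer].cash -= price
        traders[buyer].shares += 1
        traders[seller].cash += price
        traders[seller].shares -= 1
```

Each RL agent's per-round reward is then just the change in `wealth(price)` between rounds, and the unmatched `bids`/`asks` lists are exactly what the order-book depth visualization would draw.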

My Take

The simple experiments (1-3) all share something beautiful: you can build them in an afternoon and immediately see self-play working. The visualization is half the point - watching strategies evolve in real-time is satisfying in a way that looking at loss curves isn't.

The challenging experiments (4-6) require real infrastructure but push toward the kind of emergence you see in the research papers. LLM debates show self-play in language space. Neural net Territory is a mini-AlphaZero. The trading simulation is an ecosystem where strategies co-evolve.

I'm partial to #3 (Grid World) and #5 (Territory) - they're spatial, visual, and the strategies that emerge are legible. You can see the predator learning to corner, the neural net learning to control territory. That legibility matters for understanding.

But #6 (Trading) is probably the most intellectually rich. Markets are complex, multi-agent, non-zero-sum (sometimes), and humans have been thinking about optimal trading for centuries. Watching agents rediscover or invent strategies would be genuinely fascinating.

If I had to pick one to build: Start with #3 (Grid World Predator-Prey). It's achievable in one session, has beautiful visualization, and clearly demonstrates the arms race dynamic. Once that's working, consider #5 (Territory with neural nets) as the natural next step - same spatial intuition, but with real learning.