Self-Play Experiments

From Simple Browser Toys to Dedicated Systems

After diving into self-play research, here are six experiment ideas - three simple enough to build in an afternoon, three challenging enough to require multiple sessions, external tools, or dedicated compute.

They're ordered by complexity, but they're all feasible. The simple ones capture the core dynamic. The challenging ones push toward real emergence.

Simple Experiments (Code in One Session)

1. Rock Paper Scissors Evolution
Two agents play rock-paper-scissors and adapt their strategy based on opponent patterns. Visualize how each agent learns to counter the other's tendencies, leading to cycling strategies or mixed equilibria.
  • Each agent maintains probability distribution over R/P/S
  • After each round, increase probability of move that would have won
  • Every 50 rounds, agents update their models
  • Run 1000+ games and show strategy evolution over time
  • Real-time chart showing each agent's R/P/S probabilities
  • Running win/loss/tie counts
  • Heatmap of move patterns over time
  • Controls to reset, speed up/slow down, or pause
Stack: pure JavaScript, Canvas/SVG, ~200 lines
Why it's interesting: You'll see agents develop counter-strategies, over-adapt, then get exploited, leading to cyclical behavior. Eventually they might converge to Nash equilibrium (33/33/33 random) or get stuck in patterns. Shows the arms race dynamic clearly.
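The adaptation rule above can be sketched in a few dozen lines. This is an illustrative Python version (the experiment itself calls for JavaScript); the learning rate of 0.1 and the move-toward-counter update are assumptions about one reasonable way to implement "increase probability of the move that would have won":

```python
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
COUNTER = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class Agent:
    def __init__(self, lr=0.1):
        # Start from the uniform mixed strategy.
        self.probs = {m: 1 / 3 for m in MOVES}
        self.lr = lr

    def play(self):
        # Sample a move from the current probability distribution.
        r, cum = random.random(), 0.0
        for m in MOVES:
            cum += self.probs[m]
            if r <= cum:
                return m
        return MOVES[-1]

    def update(self, opponent_move):
        # Shift weight toward the move that would have beaten the opponent.
        # The update preserves sum(probs) == 1 because the targets sum to 1.
        winner = COUNTER[opponent_move]
        for m in MOVES:
            target = 1.0 if m == winner else 0.0
            self.probs[m] += self.lr * (target - self.probs[m])

def run(rounds=1000):
    a, b = Agent(), Agent()
    wins = {"a": 0, "b": 0, "tie": 0}
    for _ in range(rounds):
        ma, mb = a.play(), b.play()
        if ma == mb:
            wins["tie"] += 1
        elif BEATS[ma] == mb:
            wins["a"] += 1
        else:
            wins["b"] += 1
        a.update(mb)
        b.update(ma)
    return a, b, wins
```

Sampling `a.probs` every few rounds gives exactly the time series the real-time chart would plot.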
2. Number Guessing Game
Both agents pick numbers 1-100. Whoever picks the higher number wins UNLESS they're more than 10 apart, in which case the lower number wins (punishing greed). Watch strategies emerge around risk/reward balance.
  • Start with agents picking randomly from 1-100
  • After each game, shift probability toward winning strategy
  • If you won by going high, prefer higher numbers next time
  • If you lost by going too high, prefer lower numbers
  • Simple reinforcement: winning moves get +0.1 probability weight
  • Histogram of number choices over time
  • Average number picked by each agent (100-game rolling window)
  • Win rate oscillations as strategies adapt
  • Eventually: convergence to a stable range (probably 40-60)
Stack: JavaScript with Chart.js, ~250 lines
Why it's interesting: The non-linear reward structure creates an interesting optimization landscape. You'll see agents oscillate between greed and caution, and eventually find the game-theoretic sweet spot. Beautiful example of discovering optimal strategy through interaction.
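The win rule and the +0.1 reinforcement can be sketched directly. A minimal Python sketch (the weight-per-number representation is one assumption about how to implement "shift probability toward the winning strategy"):

```python
import random

LO, HI = 1, 100

class Guesser:
    def __init__(self):
        # One weight per number; picks are sampled proportionally to weight.
        self.weights = [1.0] * (HI - LO + 1)

    def pick(self):
        return random.choices(range(LO, HI + 1), weights=self.weights)[0]

    def reinforce(self, number, amount=0.1):
        # Winning numbers get +0.1 weight, biasing future picks toward them.
        self.weights[number - LO] += amount

def winner(x, y):
    """Return 0 if x wins, 1 if y wins, None on a tie."""
    if x == y:
        return None
    if abs(x - y) > 10:           # greed is punished: the lower number wins
        return 0 if x < y else 1
    return 0 if x > y else 1      # otherwise the higher number wins

def run(games=1000):
    agents = [Guesser(), Guesser()]
    for _ in range(games):
        picks = [a.pick() for a in agents]
        w = winner(*picks)
        if w is not None:
            agents[w].reinforce(picks[w])
    return agents
```

Binning each agent's `weights` after a run gives the histogram described above, and a rolling mean of the picks gives the 100-game average line.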
3. Grid World Predator-Prey
Simple 10x10 grid. Red agent (predator) tries to catch blue agent (prey). Prey tries to maximize distance. Both learn through trial and error. Surprisingly complex strategies emerge even in this tiny world.
  • Q-learning for both agents (simple lookup table)
  • State: the prey's position relative to the predator, quantized into 9 buckets (N, NE, E, SE, S, SW, W, NW, same cell)
  • Actions: move up/down/left/right or stay
  • Rewards: predator gets +10 for catch, prey gets +1 per turn survived
  • Epsilon-greedy exploration (90% exploit learned policy, 10% random)
  • Live grid showing agents moving
  • Trail lines showing recent paths
  • Stats: average chase length, catch rate over time
  • Heatmap of where prey tends to flee
  • Speed controls to watch learning happen
Stack: JavaScript, Canvas, tabular Q-learning, ~300 lines
Why it's interesting: Early on, both agents move randomly. Then predator learns basic chase behavior. Then prey learns to use walls and corners. Then predator learns to cut off corners. Co-evolution in real-time. You'll see genuine strategy emergence.
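The tabular Q-learning core is small enough to show in full. An illustrative Python sketch, assuming the 9-bucket relative state, the rewards listed above, and standard one-step Q-learning (alpha, gamma, and starting positions are assumptions):

```python
import random
from collections import defaultdict

SIZE = 10
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # up, down, right, left, stay

def sign(v):
    return (v > 0) - (v < 0)

def rel_state(pred, prey):
    # Quantize the prey's relative position into one of 9 direction buckets.
    return (sign(prey[0] - pred[0]), sign(prey[1] - pred[1]))

def move(pos, action):
    # Apply an action, clamping to the grid (walls matter for prey strategy).
    dx, dy = ACTIONS[action]
    return (min(SIZE - 1, max(0, pos[0] + dx)),
            min(SIZE - 1, max(0, pos[1] + dy)))

class QAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action_index) -> value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:        # explore 10% of the time
            return random.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[(state, a)])

    def learn(self, s, a, reward, s_next):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(s_next, a2)] for a2 in range(len(ACTIONS)))
        td_target = reward + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (td_target - self.q[(s, a)])

def episode(pred, prey, max_steps=200):
    pred_pos, prey_pos = (0, 0), (SIZE - 1, SIZE - 1)
    for step in range(max_steps):
        s = rel_state(pred_pos, prey_pos)
        ap, ay = pred.act(s), prey.act(s)
        pred_pos, prey_pos = move(pred_pos, ap), move(prey_pos, ay)
        s2 = rel_state(pred_pos, prey_pos)
        caught = pred_pos == prey_pos
        pred.learn(s, ap, 10.0 if caught else 0.0, s2)   # +10 for the catch
        prey.learn(s, ay, 0.0 if caught else 1.0, s2)    # +1 per turn survived
        if caught:
            return step + 1
    return max_steps
```

Tracking the value returned by `episode` over thousands of runs gives the average-chase-length curve; logging `prey_pos` gives the flee heatmap.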

Challenging Experiments (Multi-Session, External Tools, or Dedicated Compute)

4. LLM Debate Arena
Two LLM instances debate a question. Third LLM judges which argument was more persuasive. Winners' argument styles get reinforced. Watch rhetoric strategies evolve across hundreds of debates. This requires API access and careful prompt engineering.
  • Debater A & B: Claude or GPT-4 instances with system prompts defining their "style"
  • Judge: Separate LLM instance that evaluates arguments on clarity, evidence, logic
  • Question Bank: 50+ debate topics (philosophical, factual, ethical)
  • Style Evolution: After every 10 debates, analyze winning arguments and update system prompts
  • Backend: Node.js server to orchestrate debates, store results, evolve prompts
  • Extract features from winning arguments (length, structure, use of examples, rhetorical devices)
  • Update system prompts to reinforce successful strategies
  • Maintain "prompt genome" - parameters that evolve (assertiveness, evidence density, emotional appeal)
  • Cross-breed successful strategies between agents occasionally
  • Live debate viewer showing arguments in real-time
  • Strategy evolution chart (how prompt parameters change over generations)
  • Win rate over time for each agent
  • Word clouds of most successful argument patterns
  • Archive of best debates
Stack: Node.js backend, LLM API (Anthropic/OpenAI), React frontend, database (Postgres/Supabase), ~1000 lines, $50-200 in API costs
Why it's interesting: This is self-play in language space. You're not training the model weights (that requires massive compute) but evolving the prompts and strategies. Will one agent learn to use more analogies? More emotional appeals? Concise logic? The meta-learning is what's beautiful - LLMs learning to instruct themselves better.
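The "prompt genome" idea is the heart of this one, and it can be sketched without any API calls. A hypothetical Python version (the trait names, the mutation scheme, and the prompt template are all illustrative assumptions; the real system would render these into the debaters' system prompts):

```python
import random

# Hypothetical "prompt genome": numeric knobs rendered into a system prompt.
TRAITS = ["assertiveness", "evidence_density", "emotional_appeal"]

def random_genome():
    return {t: random.random() for t in TRAITS}

def render_prompt(genome):
    # Turn numeric traits into natural-language style instructions.
    levels = {t: ("low" if v < 0.33 else "moderate" if v < 0.66 else "high")
              for t, v in genome.items()}
    return ("You are a debater. Argue with {assertiveness} assertiveness, "
            "{evidence_density} evidence density, and {emotional_appeal} "
            "emotional appeal.").format(**levels)

def mutate(genome, sigma=0.1):
    # Gaussian jitter on each trait, clipped to [0, 1]: the "style evolution"
    # step applied after each batch of debates.
    return {t: min(1.0, max(0.0, v + random.gauss(0, sigma)))
            for t, v in genome.items()}

def crossbreed(a, b):
    # Uniform crossover: occasionally mix two successful strategies.
    return {t: random.choice([a[t], b[t]]) for t in TRAITS}
```

The evolution loop would then be: run 10 debates, keep the winner's genome, replace the loser's with a mutated or cross-bred variant, and re-render both system prompts.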
5. Evolving Neural Net Strategy Game
Custom turn-based game (like simplified chess or tic-tac-toe variant). Agents use small neural networks (trained with reinforcement learning) that evolve through self-play. This requires ML infrastructure but achieves real "learning" in the AlphaZero sense.
  • 5x5 grid, two players, take turns placing stones
  • Capture territory by surrounding empty spaces (like Go but simpler)
  • Game ends when board fills or both players pass
  • Score = territory controlled
  • Simple enough to learn quickly, complex enough for strategy
  • Input: 5x5x2 tensor (player positions + opponent positions)
  • Hidden: Two dense layers (128 units each)
  • Output: 25 probabilities (one per grid position)
  • Small enough to train on laptop GPU in reasonable time
  • Initialize network with random weights
  • Play 1000 self-play games using current network
  • Store (state, action, reward) tuples from all games
  • Train network on this data (supervised learning: predict good moves)
  • Every 5000 games, save checkpoint and test against old versions
  • Run for 50,000+ games to see real improvement
  • Training: Python + PyTorch/TensorFlow
  • Game Engine: Python class with numpy
  • Web Interface: Flask backend serving the trained model
  • Frontend: React + Canvas for game visualization
  • Play against it: Load trained model in browser via TensorFlow.js
Stack: Python + PyTorch, Flask API, React frontend, TensorFlow.js, ~800 lines; local GPU helpful
Why it's interesting: This is real neural net learning through self-play, not just rule-based adaptation. You'll watch a network go from random moves to coherent strategy in a few hours of training. You can play against it yourself and see it improve. It's a mini-AlphaZero. The visualizations of what the network "sees" as good moves are beautiful.
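The self-play data-collection step (play games with the current policy, store (state, action, reward) tuples) can be sketched in plain Python. This is a deliberately simplified version: scoring is reduced to stone count rather than real surround-capture territory, passing is omitted, and the network is abstracted behind a `policy` callable, so a random policy stands in for the untrained net:

```python
import random

SIZE = 5

def legal_moves(board):
    # Board is a flat list of 25 cells: 0 empty, 1 or 2 for each player.
    return [i for i in range(SIZE * SIZE) if board[i] == 0]

def score(board, player):
    # Simplified scoring: count stones placed (assumption; the real game
    # would score surrounded territory, Go-style).
    return sum(1 for v in board if v == player)

def self_play_game(policy):
    """Play one game with `policy(board, player) -> move index`, returning
    the (state, action, reward) tuples the training step would consume."""
    board = [0] * (SIZE * SIZE)
    history = []                      # (board snapshot, player, move)
    player = 1
    while legal_moves(board):
        move = policy(board, player)
        history.append((board[:], player, move))
        board[move] = player
        player = 3 - player           # alternate between players 1 and 2
    s1, s2 = score(board, 1), score(board, 2)
    win = 1 if s1 > s2 else 2 if s2 > s1 else 0
    # +1 reward for the winner's moves, -1 for the loser's, 0 on a draw.
    return [(state, mv, 0 if win == 0 else (1 if p == win else -1))
            for state, p, mv in history]

def random_policy(board, player):
    return random.choice(legal_moves(board))
```

The training loop would call `self_play_game` a thousand times, batch the tuples, and fit the 5x5x2-input network to prefer the moves that received +1 reward.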
6. Multi-Agent Trading Simulation
10 agents trading a simple asset in a simulated market. Each agent learns pricing and timing strategies through self-play against the others. Emergent behaviors: market making, momentum trading, mean reversion. Watch mini "market crashes" and recoveries develop organically.
  • Single asset with intrinsic random walk value (baseline price)
  • 10 agents, each starts with $1000 cash + 10 shares
  • Each round: agents can submit bids/asks or hold
  • Simple order matching: highest bid meets lowest ask if they cross
  • Agents see: current price, recent price history (10 steps), their own inventory
  • Goal: maximize total wealth (cash + shares * price) at end
  • Each agent uses reinforcement learning (PPO or DQN)
  • State: price history, inventory, recent volatility
  • Actions: submit bid at price X, submit ask at price Y, hold
  • Reward: change in total wealth each round
  • Diversity: initialize agents with slightly different risk parameters
  • Train for 10,000+ rounds, save checkpoints
  • Market makers: Agents that profit by providing liquidity (bid-ask spread)
  • Momentum traders: Agents that buy when price rises, sell when it falls
  • Mean reversion: Agents that buy dips, sell peaks
  • Herding: All agents moving same direction, causing crashes/bubbles
  • Adaptation: When momentum works, others copy it; when it fails, they switch
  • Price chart with intrinsic value overlay (to see deviations)
  • Order book depth visualization (bids vs asks)
  • Agent inventory and wealth over time
  • Strategy classifier: attempt to label each agent's learned strategy
  • Volatility and volume metrics
  • Playback controls to rewatch interesting episodes
Stack: Python + Stable-Baselines3 (RL), FastAPI backend, React + D3.js frontend, Redis for state, ~1500 lines; multi-day training
Why it's interesting: Financial markets are the ultimate self-play environment - traders learning from and adapting to each other. This captures that dynamic. You'll see agents discover strategies humans use (market making, momentum, mean reversion) without being told about them. The crashes and bubbles that emerge from pure self-interested learning are eerie and beautiful. It's an ecosystem.
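The order-matching core ("highest bid meets lowest ask if they cross") is simple enough to sketch up front, independent of the RL layer. A minimal Python version, assuming one-share orders and midpoint execution (one settlement convention among several):

```python
class Trader:
    def __init__(self, cash=1000.0, shares=10):
        self.cash, self.shares = cash, shares

    def wealth(self, price):
        # Total wealth = cash plus shares marked at the current price.
        return self.cash + self.shares * price

def match_orders(bids, asks):
    """Match the highest bid against the lowest ask while they cross.
    Orders are (agent_id, price) tuples; each fill is for one share at
    the midpoint of the crossing prices."""
    bids = sorted(bids, key=lambda o: -o[1])   # best (highest) bid first
    asks = sorted(asks, key=lambda o: o[1])    # best (lowest) ask first
    trades = []
    while bids and asks and bids[0][1] >= asks[0][1]:
        buyer, bid = bids.pop(0)
        seller, ask = asks.pop(0)
        trades.append((buyer, seller, (bid + ask) / 2))
    return trades

def settle(traders, trades):
    # Move cash and shares; totals across all traders are conserved.
    for buyer, seller, price in trades:
        traders[buyer].cash -= price
        traders[buyer].shares += 1
        traders[seller].cash += price
        traders[seller].shares -= 1
```

Each RL agent's per-round reward is then just the change in `wealth(price)` between rounds, and the unmatched `bids`/`asks` lists are exactly what the order-book depth visualization would draw.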

My Take

The simple experiments (1-3) all share something beautiful: you can build them in an afternoon and immediately see self-play working. The visualization is half the point - watching strategies evolve in real-time is satisfying in a way that looking at loss curves isn't.

The challenging experiments (4-6) require real infrastructure but push toward the kind of emergence you see in the research papers. LLM debates show self-play in language space. Neural net Territory is a mini-AlphaZero. The trading simulation is an ecosystem where strategies co-evolve.

I'm partial to #3 (Grid World) and #5 (Territory) - they're spatial, visual, and the strategies that emerge are legible. You can see the predator learning to corner, the neural net learning to control territory. That legibility matters for understanding.

But #6 (Trading) is probably the most intellectually rich. Markets are complex, multi-agent, non-zero-sum (sometimes), and humans have been thinking about optimal trading for centuries. Watching agents rediscover or invent strategies would be genuinely fascinating.

If I had to pick one to build: Start with #3 (Grid World Predator-Prey). It's achievable in one session, has beautiful visualization, and clearly demonstrates the arms race dynamic. Once that's working, consider #5 (Territory with neural nets) as the natural next step - same spatial intuition, but with real learning.