Training MENACE

Automating the Training of our Matchbox Educable Noughts and Crosses Engine

🤖 Live project
📖 Portfolio project page
🗂️ Github repository
⬅️ Part 1 - Programming Menace

In my previous post, Programming MENACE, we went though the logic for a web version of the Matchbox Educable Noughts and Crosses Engine.

In this post we'll explore training strategies and explain machine learning terminology.

Machine Learning Terminology

An instance of our MENACE class is known as an agent. In this context, the agent is the "learner" or "decision maker" - the software that changes its actions based on the outcome of its previous actions to maximise some reward or minimise some penalty.

For MENACE, a win is a reward and a lose is a penalty.

This type of learning is called reinforcement learning - we're reinforcing good moves, or "actions", by increasing the probability of selecting actions that lead to rewards.

Actions are any "decisions" an agent makes. In our case, MENACE selecting a child state as its own move is an action.

An agent operates in an environment, in our case this is the Noughts and Crosses game itself, including the board state graph, the current state reference, and the array of references for all states in the current game.

A state is a specific configuration or condition within the environment, and in our case it refers to a specific layout of the Noughts and Crosses board.

Exploration vs. Exploitation

In reinforcement learning, agents must balance exploration (trying new actions to learn their potential outcomes) and exploitation (selecting known actions that yield the highest expected rewards). Overfitting occurs when a model becomes so specialized to specific patterns in its training data that it struggles to perform well on new, unseen scenarios.

We also need to think about how we'll structure the training. Will we play 300 random strategy games, followed by 300 optimum strategy games, or interleave the strategies more frequently? There's an approach called the "Epsilon-Greedy" approach, which has us randomly choose between exploration and exploitation training randomly each game.

Initially, I want to play MENACE myself, to see if we get the same results as Donald Michie and in a similar amount of games.

Then, we can automate some random strategy training to prevent overfitting.

Before going through each of these, let's look at the train method on our MENACE class.

train() {
    const statesThisGameSnapshot = [...this.statesThisGame];
    const outcome = this.currentState.outcome;
    const iAmX = this.iAmX;
 
    const trainTask = async (loadDataBuffer) => {
      let prevState = null;
      if (outcome == "draw") {
        return;
      }
      let aiWin = (outcome == "x win") == iAmX;
      statesThisGameSnapshot.forEach((stateObj) => {
        if (prevState) {
          Object.entries(prevState.childStateChances).forEach(
            ([key, value]) => {
              if (key === stateObj.key) {
                const aiTurn = iAmX ? prevState.xTurn : !prevState.xTurn;
                let adjustment = aiTurn === aiWin ? 3 : -1;
                prevState.childStateChances[key] = Math.max(
                  1,
                  value + adjustment
                );
                // update the loadData in the current graph
                this.currentState.loadData[key] =
                  prevState.childStateChances[key];
                // update the buffer that was passed by the queue class
                loadDataBuffer[key] = prevState.childStateChances[key];
              }
            }
          );
        }
        prevState = stateObj;
      });
    };
    updateQueue.enqueue(trainTask);
  }

As you can see, the train method creates an asynchronous function called trainTask, which it then passes to the enqueue method of an updateQueue instance. We're using a queue system for updates because there's a single decision graph for all players, and we want to avoid race conditions. I also want to avoid writing updates after every single game, instead writing batch updates after a set amount of games have been played, reducing writes.

In each trainTask, we loop the states of a given game and update the probability of each one being chosen by the previous state, increasing the probability of the winning side's states, and decreasing the probability of the losing side's states. We prevent probabilities from going below 1 so that MENACE never completely disregards a move.

For the updateQueue singleton class in its entirity, see the Github repository.

It's most important to take a look at our processNext method.

enqueue(task) {
  this.queue.push(task);
  this.processNext();
}
 
async processNext() {
  if (this.processing || this.queue.length < 10) return;
  this.processing = true;
 
  const tasksToProcess = this.queue.splice(0, 10);
  try {
    for (const task of tasksToProcess) {
      await task(this.loadDataBuffer);
    }
  } catch (error) {
  } finally {
    this.processing = false;
    this.scheduleFileWrite();
  }
}

Inside processNext(), we use a for of loop and await each asynchronous task to avoid multiple tasks reading and writing from the loadDataBuffer at the same time - each task is passed the same loadDataBuffer.

💡If we were to use a forEach here, each iteration's callback would await it's own task call, but the iterations would not wait for each other to finish. This means multiple task calls would be being awaited at the same time, reading and writing in race conditions.💡

It's also important to note that the queue is a singleton - the same instance is exported for all instances of MENACE.

module.exports = new UpdateQueue();

Exploitation Training (human)

I'm performing human games first. It's amazing to experience MENACE improving first-hand. There's not much to say on this other than I'm playing with optimum strategy like Donald Michie did.

Exploration Training (automated)

To train randomly, we can simply play MENACE against itself.

"An image of Agent Smith from the Matrix, facing himself"

This is very simple to do. Instead of passing a playSelf boolean to our makeMove method and ending up with a load of conditional blocks, we'll simply add another function called playSelf to keep things clean and separated. We don't have to worry about canonicalising states with transformations at all here because MENACE can only select canonical child states.

playSelf(games = 300) {
  while (games > 0) {
    let selfChoiceKey = weightedRandomChoice(
      this.currentState.childStateChances
    );
    this.moveToChildByKey(selfChoiceKey);
    this.statesThisGame.push(this.currentState);
    if (this.currentState.outcome != null) {
      this.train();
      this.reset();
      games--;
    }
  }
}

I created a file called train.js that runs this function a specified amount of times:

const { BoardState } = require("./stateTree.js");
const { Menace } = require("./menace.js");
const { loadData } = require("./utils/utils.js");
 
let data = loadData();
let root = new BoardState(undefined, undefined, data);
root.generateChildren();
let MENACE = new Menace(root, root.state);
MENACE.playSelf(1000);

node ./train.js

After running this, our loadData.json file appears in the directory root, and reassuringly, the amount of state keys in that file is close to (but hasn't exceded) the total canonical states we should have for both sides of the game.

Results

Originally, I played MENACE myself with optimum strategy for around 150 games, as in the original experiment. I'm glad I did this so that I got to experience MENACE improving in real time.

However, this meant that the automated, randomised training, which was supposed to encourge exploration, was not random enough. Our playSelf function was working with the likelihoods from the human training. As a result, MENACE still performed badly on suboptimal, random moves.

For that reason, for the final project, I performed the automated training for 300 games first, then proceeded to play MENACE myself. As a result, MENACE is performing better against random moves.

Live Project

🔗 You can find the live project here.