I wanted to do a short write-up on a personal project I developed recently: an AI that plays the game hangman.
You can find the Python script I wrote for it here: https://github.com/DazKins/HangmanAI. It's a bit rough around the edges, but hopefully this article will explain a bit better what it's actually doing.
If you're familiar with the game of hangman, feel free to skip this section.
Hangman is a word guessing game where a player has to discover a secret word by guessing individual letters. The player is initially only told the length of the word. For each guessed letter, every position of that letter in the word is revealed to the player. In the special case where the letter does not appear in the word at all, the player “loses a life” (often denoted by the incremental drawing of a hanged man, hence the name).
Here's how an example game might play out:
Current state: "-----"
Guess: "a"
Current state: "a----"
Guess: "e"
Current state: "a---e"
Guess: "r"
Current state: "a---e" (incorrect guess, life lost)
Guess: "p"
Current state: "app-e"
Guess: "l"
Current state: "apple"
Game won
A typical game gives the player a set number of lives; once they are all lost, the player loses.
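To make the rules concrete, here's a minimal sketch of that guess-and-reveal step in Python (my own code, not taken from the repo; the function name apply_guess is just illustrative):

```python
def apply_guess(secret: str, revealed: str, guess: str) -> tuple[str, bool]:
    """Return the new revealed pattern and whether the guess hit anything."""
    new_revealed = "".join(
        s if s == guess or r != "-" else "-"
        for s, r in zip(secret, revealed)
    )
    return new_revealed, guess in secret

# Replaying the example game above
state = "-" * len("apple")
for letter in "aerpl":
    state, hit = apply_guess("apple", state, letter)
    print(letter, state, "" if hit else "(life lost)")
```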
Any AI needs to start off with some given knowledge: some ground truth about the real world that it axiomatically accepts. For this AI, we give it not just a list of English words, but the relative frequency of each word as well, i.e. the probability that any given English word is that particular word. For example, in the dataset used, the word “the” has a relative frequency of almost 4%, whereas the word “christmas” has a relative frequency of about 0.01%.
Pairing the relative frequencies with the words is certainly a stylistic choice for this AI, but it leads to far more interesting results. Without them, the AI would think the word “this” is just as likely as the word “doup” (I'll let you google that one) in a 4 letter game.
For this project I used this dataset: https://www.kaggle.com/datasets/rtatman/english-word-frequency. It has some questionable entries (such as misspellings like “dissappointment” and non-English words like “djakovica”), but those entries have very low probability so shouldn't impact our model too badly. Note that the csv file in the GitHub repo is slightly different from the Kaggle one: some pre-processing was done to remove any words with non-alphabetic characters and to convert uppercase to lowercase.
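The repo's loading code may differ, but a rough sketch of that pre-processing could look like this (I'm assuming the Kaggle layout of a header row followed by word,count pairs; the file name is just illustrative):

```python
import csv

def load_word_frequencies(path: str) -> dict[str, float]:
    """Load word counts, keep purely alphabetic words, lowercase them,
    and normalise the counts into relative frequencies."""
    counts: dict[str, int] = {}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for word, count in reader:
            word = word.lower()
            if word.isalpha():
                counts[word] = counts.get(word, 0) + int(count)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

frequencies = load_word_frequencies("unigram_freq.csv")  # file name assumed
```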
At every step of the game, the algorithm maintains a list of words that could still be the correct answer. Every time a guess is made and feedback is received, this list can be whittled down until eventually only one word remains and the algorithm can simply guess through any remaining letters.
The first and easiest filtering happens at the start of the game, when we are given the word length: we can simply remove all words from our list that are not of that length.
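In code, that first filter is a one-liner (continuing from the hypothetical frequencies dictionary above):

```python
def filter_by_length(frequencies: dict[str, float], length: int) -> dict[str, float]:
    """Keep only candidate words of the given length."""
    return {word: p for word, p in frequencies.items() if len(word) == length}

candidates = filter_by_length(frequencies, 5)  # e.g. a 5 letter game
```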
After this, we need to decide which letter to guess first. In order to “know” the correct answer, the word list must be whittled down to length 1, so at each step the algorithm's objective should be to make the list as small as possible. This means making the guess that is expected to shrink the word list the most.
Information theory is a great mathematical tool for making informed probabilistic decisions like this. I'll try to talk through most of the ideas here but you might want to read up on the concepts of “Information” and “Entropy” a bit.
Information theory tells us that the “information” of an event is -log2(p), where p is the probability of the event (the choice of a base 2 log is arbitrary but makes the calculations a bit nicer). This means a less probable event contains more information than a more probable one. Here's an example: imagine a rat in a 20 m² cage with a small 1 m² pressure plate in the corner connected to a bulb that lights up when the rat steps on it. Assuming the rat moves around randomly, the probability of the bulb lighting up is 1/20 = 5%, and the probability of it not lighting up is 95%. The “information” of the bulb lighting up is therefore higher than that of it not lighting up. This makes sense because if the bulb is lit we can be much more sure of the rat's position: we have more information.
Log base 2 information is referred to as “bits”. So we would say the probability 0.05 event has information of -log2(0.05) ≈ 4.32 bits.
The next very important information theory concept is entropy. Entropy is simply the expected value (i.e. the probability-weighted sum) of the information over all possible outcomes of an event. It is mathematically defined as ∑ -p·log2(p), where the summation runs over the probability p of each possible outcome. The classic example is a fair coin flip: 50% heads, 50% tails. The entropy for this scenario would be -0.5*log2(0.5) + -0.5*log2(0.5) = 1 bit. Entropy simply tells us how much information we expect to get from an event.
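As a quick sanity check, here are both quantities in a few lines of Python, applied to the coin flip and the rat example above:

```python
from math import log2

def information(p: float) -> float:
    """Information (in bits) of an event with probability p."""
    return -log2(p)

def entropy(probabilities: list[float]) -> float:
    """Expected information over all possible outcomes of an event."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(information(0.05))      # ≈ 4.32 bits (the bulb lights up)
print(information(0.95))      # ≈ 0.07 bits (the bulb stays off)
print(entropy([0.5, 0.5]))    # 1.0 bit (fair coin flip)
print(entropy([0.05, 0.95]))  # ≈ 0.29 bits (the rat scenario)
```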
Let's relate the information theory concepts we just discussed to our hangman game.
At each step, the algorithm needs to choose the letter that will give it the most information. When dealing with probabilities, as seen in the last section, we can calculate this expected information with entropy.
So, for the entropy summation (∑ -p·log2(p)), what are the different outcomes? They are all the different sets of positions we could be told the letter occupies. For example, if we guessed the letter “a” in a 5 letter game we could get a----, -a--a, or just ----- if the letter is not present anywhere in the word. It is not hard to see that for an n-letter game there are 2^n different possible outcomes.
So now all that's left to do is find the probability of each outcome so we can calculate the total entropy. This is simple: we just sum the relative frequencies of all the words that produce the same outcome (we also need to divide by a correcting factor to bring the total probability to 1, since we've already filtered words out of our list).
In my implementation, I chose to iterate over every word in the current word list, calculate the outcome it would produce for the guessed letter, and accumulate the probabilities in a map keyed by the pattern (represented as a binary number, e.g. -a-a-- being 010100).
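Here's a sketch of that step (my own names, not necessarily those used in the repo): encode each outcome as a bitmask of positions, accumulate the probability mass per outcome, renormalise, and sum up the entropy.

```python
from math import log2

def letter_pattern(word: str, guess: str) -> int:
    """Bitmask of the positions where `guess` appears in `word`,
    e.g. the pattern -a-a-- is encoded as 0b010100."""
    pattern = 0
    for i, letter in enumerate(word):
        if letter == guess:
            pattern |= 1 << (len(word) - 1 - i)
    return pattern

def letter_entropy(candidates: dict[str, float], guess: str) -> float:
    """Expected information (in bits) gained by guessing `guess`, given
    the current candidate words and their relative frequencies."""
    outcome_probs: dict[int, float] = {}
    for word, p in candidates.items():
        pattern = letter_pattern(word, guess)
        outcome_probs[pattern] = outcome_probs.get(pattern, 0.0) + p
    # Correcting factor: the filtered list's probabilities no longer sum to 1.
    total = sum(outcome_probs.values())
    return sum(-(p / total) * log2(p / total) for p in outcome_probs.values())
```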
So now we can calculate the entropy for each of the 26 possible letters and simply choose the one with the highest value! Information theory can make complex decisions quite simple with its modelling of information!
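Picking the guess is then just an argmax over the alphabet, something like the following (it uses the letter_entropy sketch above; tracking already-guessed letters is my own addition and the repo may handle repeats differently):

```python
import string

def best_guess(candidates: dict[str, float], already_guessed: set[str]) -> str:
    """Choose the unguessed letter with the highest expected information."""
    return max(
        (letter for letter in string.ascii_lowercase if letter not in already_guessed),
        key=lambda letter: letter_entropy(candidates, letter),
    )
```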
We've covered most of the interesting stuff but there's still a final part to discuss.
How do we deal with the 2nd, 3rd, 4th etc. guesses when we already have some correct letters? This is actually quite easy. For example, if we're playing a 5 letter game and after our first guess of “p” we get -pp--, we can then ignore the “p”s and just recursively play a new 3 letter game of hangman with the remaining missing letters.
One small caveat here: our AI is set up to play with English words, so by removing letters like this the original English dataset no longer lines up with the reduced game. At the recursive step we therefore need to remove the guessed letter from every word in the word list as well. Continuing the previous example: after guessing “p”, the word “apple” is in our word list, so when we recurse to the 3 letter game we replace it with the word “ale”.
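A sketch of that reduction step, reusing the hypothetical letter_pattern helper from earlier (the observed pattern is the bitmask the game reveals for the guessed letter):

```python
def reduce_candidates(candidates: dict[str, float], guess: str, observed: int) -> dict[str, float]:
    """Keep only words consistent with the observed pattern for `guess`,
    then strip the guessed letter so the remaining positions form a
    smaller hangman game."""
    reduced: dict[str, float] = {}
    for word, p in candidates.items():
        if letter_pattern(word, guess) == observed:
            shorter = word.replace(guess, "")  # e.g. "apple" -> "ale" after "p"
            reduced[shorter] = reduced.get(shorter, 0.0) + p
    return reduced
```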
I'm not going to discuss these in detail here, but there are a few other interesting questions that I'll leave you to discover in the code:
This implementation was heavily inspired by 3blue1brown's Wordle AI which I highly recommend you check out: https://www.youtube.com/watch?v=v68zYyaEmEA