Magic 8 Ball: An app to maximize wins in competitive pool matches (Part 2)

How machine learning and SQL can help you outsmart your opponent in sequential team games

Published in

Towards Data Science

10 min readMay 31, 2021

Welcome back to this three-part series on data driven pool strategy!

In Part 1, we explored the quirks of competitive team pool, and determined that player lineup selections for pool games could be improved with data science. We developed a classification model that predicts the probability that one player will beat another player based on past results. Here, in this part, we will go one step further and use those predictions to actually inform strategy for player selection.

Part 1: Introduction and predictive modelling.

Part 2: Strategising with SQL.

Part 3: ‘Potting’ it to the test!

Match winning probability

In this previous article, we developed a logistic regression model which uses historical performance statistics to predict the probability that one player will beat another player.

While we can use our model to predict the winning probabilities for each possible player pairing, what we are really interested in is the overall match winning probability.

Recall that in a match there are 5! = 120 distinct possible lineup permutations; thus, each of these distinct permutations will have a different associated match winning probability.

In order to rank the lineups from best to worst, we need to calculate the individual probabilities for each pairing permutation.

Here, we assume that winning the overall match requires winning three out of the five rounds; so, for each lineup permutation, we need to calculate the probability of winning at least three rounds. The sum of all of the probabilities for at least three wins can be achieved with the following equation:

Assuming that the probabilities of winning each round are independent, we can calculate each of the terms on the right hand side of the equation above by evaluating the expression below:

Where p_i is the probability of winning round i. The probability of winning exactly m rounds is given by P(m wins), the coefficient of x^m, where x is a dummy variable.

e.g. To calculate P(4 wins) we expand the left hand side of the expression and use the coefficient of x^4:

Python implementation

Our logistic regression model takes the Race Margin, Skill Margin and Win % Margin values for a player pairing as inputs and returns a predicted probability that player A will win the round. We can store a match lineup in a five-row Python DataFrame, where each row corresponds to a single player pairing.

If we apply our model to a lineup DataFrame, we generate a vector of probabilities — one probability for each pairing. The following function below takes this probability vector and calculates P(win match) using the above equations.

Let’s say we have stored all of the lineup DataFrames in a list called lineups. We can now apply the predictive model to each lineup and calculate the match winning probabilities for every lineup permutation.

Organizing the match predictions

Team Captains alternate in the role of selecting a player first. At the start of each round the Team Captain wants to ensure they select the best player to play for their team. Because we aimed to deploy a user-friendly web application, it made sense for us to store the predictions for each permutation in a relational database. With this database, we only needed to generate the predicted probabilities once per match. At the start of each round, we can retrieve the recommended player as the result of reusable SQL queries. We used postgreSQL because of its seamless integration with app deployment platforms such as Heroku.

Figure 1 shows the schema for each table contained in the database:

all_perms contains the predicted winning probability for player_a against player_b, along with the numeric id (from 1–120) of the permutation containing that pairing.
perm_score contains the overall match winning probability for each permutation.
team_a contains the list of player ids, names, and skill_levels for the players on Team A.
team_b contains the list of player ids, names, and skill_levels for the players on Team B.
lineup is continually updated with selected player ids and names as the match unfolds. pos is an integer from 1–10 indicating the position in the lineup, with odd numbers corresponding to Team A selections and even numbers corresponding to Team B selections.

Player selection strategies — match example:

Now that we have a database of predictions for each possible lineup between two teams, we can easily find out which lineup has the highest match winning probability. However, since we can’t be sure which players the opposing captain will select for each round, getting there is far from guaranteed. Nevertheless, is there a selection strategy that consistently yields a close-to-optimal lineup?

Let’s put ourselves in the shoes of a Team Captain and play an example match.

We are up against Slick Jill, captain of the Blue Team, and just like us, she will be trying her hardest to create a lineup which maximizes her chances of winning. The model has been applied to predict winning probabilities for each of our players (the Red Team) against each player on the Blue Team, which are summarized in Table 1. These probabilities are taken from the all_perms table in the database.

Table 1: Predicted winning probabilities for players on Team A (red) against players on Team B (blue).

Round One: Responding to your opponents selection

Slick Jill wins the coin toss and chooses to select first in round one. She calls Paul over to the table to play. What is our response strategy?

It might be tempting to look at the matrix and immediately respond by selecting Curtis. He has the highest winning probability, so it’s a no brainer, right? A selection strategy using this logic would be a form of greedy algorithm. This could end up being a short-sighted strategy over many matches. For example, Tessa has almost an equally high chance of beating Paul as Curtis does, whereas Curtis has by far the highest winning probability against Juana. It is probably prudent to save him for that potential matchup. Scenarios could also arise where it is advantageous to ‘sacrifice’ one round if it means the team has a higher probability of winning more of the subsequent rounds.

With this in mind, how can we be confident that we are selecting the best player without overwhelming our brain with all the different possibilities?

We have already done a lot of tedious work by calculating the match winning probability for each possible lineup. These match probabilities are stored in the perm_score table in our database. Time to put them to use!

Now, Table 2 shows a new match matrix, where this time the values represent the mean match winning probability for all lineups containing that pairing.

Table 2: In contrast to table 1, this match matrix shows the mean match winning probability of the lineup permutations containing that player pairing.

The pairings corresponding to the optimal lineup are highlighted in grey. The match winning probability of this lineup is 0.77, compared to a mean match winning probability of 0.66 across all 120 permutations!

Now we can see that pairing Milton with Paul will, on average, lead to a higher probability of winning the match than if we were to choose any other player. This wasn’t at all obvious from the first table! No matter who Slick Jill chose, we would have been able to respond with a player which improved our match winning prospects.

The following code shows how we would write a function in Python to generate something like Table 2 using a query to the postgreSQL database. The function takes two arguments: The player ID corresponding to Slick Jill’s selection, player_b_id, and a connection to the postgreSQL database, con.

That’s a lot to digest. The two Python functions on lines 4 and 5 retrieve the current lineup, which is required to enable the database to be filtered to only include active permutations. The rest of the code consists of a SQL query to retrieve the optimal player choice. Let’s untangle it line by line, printing the step by step output in the order in which the code is executed.

Line 11: Select all the IDs of active permutations from the all_perms table. “clause” is a WHERE statement which filters out permutations which are no longer possible due to previous rounds being completed. See the full code in GitHub. Since all permutations are still possible, this returns a table, f, consisting of a single column containing numbers 1–120.

2. Line 12: Inner join to the all_perms table. Likewise, since we have not yet eliminated any possible outcomes, this just returns the entire all_perms table with the permutation column duplicated.

3. Line 13: Join result to the perm_score table to access match winning probabilities. Note that a.probability refers to the predicted win probability for a single round in a given permutation, whereas s.probability refers to the predicted match winning probability for the given permutation.

4. Line 14: Filter the table to only include possible rounds including the selected player on team B (Chris).

5. Lines 9,10,15: This is where the final table is assembled. We GROUP BY the players on team A to compute the average match winning probabilities (AVG(s.probability) as avg_prob) for all permutations containing these possible pairings. Note that although we only need to group by the a.id_a column, we need to include player_a and player_b in the GROUP BY statement if we also include these columns in the SELECT statement.

6. Line 14: Finally, the table is ordered by descending avg_prob.

Once the Judith v Chris pairing is confirmed, we can add these players to the lineup table in our database, and move on to the next round.

Round Two: Our turn to pick first

The situation is a bit more complicated when it is our turn to select a player first, since Slick Jill can spring the element of surprise with her choice in response. Nevertheless, we can formulate a strategy based around minimizing the possible damage that Slick Jill can do to our average match winning probability. In other words, we want to make a player selection which will lead to the highest possible average match winning probability regardless of who Slick Jill chooses in response. This is referred to as a maximin (minimax) strategy.

Let’s update the match matrix from Figure 3. Now that one round has been played, the only remaining possible lineups are the 24 permutations which include Milton v Paul as a pairing. In Table 3 the average match winning probability has been recalculated for each pairing over the 24 remaining permutations.

Table 3: Updated match matrix after round one has been completed. The minimum winning probabilities for our team are found by taking the minimum value across the rows.

For each row corresponding to a red team player, the cell highlighted in grey corresponds to the worst possible pairing for that player. If Slick Jill’s choice leads to one of these pairings, then our winning prospects will take a nosedive!

Therefore, Tessa is the safest selection for round 2, since she has the highest minimum match winning probability (0.66).

Now for the Python/SQL implementation:

Note that the subquery in lines 11 to 16 is very similar to the query we used in round one, the only difference being the absence of a WHERE clause, which is not required when we are considering all possible pairings, and the fact we are now grouping by player_b then id_a, instead of vice versa.

Lines 12–15: Join the tables in the same way as in the previous section, however this time the only possible permutations are those containing Milton v Paul.

2. Lines 11, 16: GROUP BY player_b, then id_a and player_a, aggregating the average match winning probability (AVG(s.probability) as avg_prob). This produces a tabular form of the matrix shown earlier in Table 3.

3. Line 9, 10, 17: GROUP the match matrix table by id_a and player_a, SELECTing the minimum average match winning probability (MIN(avg_prob) as min_prob) for each group.

4. Line 16: Notice that Paul’s name is present in the Figure 10 table, despite the fact he already played in round one! To remove impossible player pairings, it is necessary to include a HAVING statement which filters out player ids that have already been entered into the lineup table.

5. Line 11: Finally, the aggregated rows are ordered by decreasing min_prob.

Putting it all together:

We now have all the ingredients we need to build an application! Let’s run through our example match and see what lineup our selection strategies lead us to. If Slick Jill makes random selections, then we could end up with the lineup in Table 4.

Table 4: A possible final lineup where we used the maximin strategy while Slick Jill made random selections. The colours indicate which team picked first in each round. The probability column shows the winning probability for the red team player for each pairing.

We are a heavy favourite to win in three of the five rounds, with an overall match winning probability of 0.76. Of all the 120 possible permutations, this lineup actually has the 5th highest match winning probability, so we couldn’t have given ourselves a much better chance!

While it seems like we have the magic formula to outsmart Slick Jill in this match, it would be poor practice to draw a conclusion from a single data point. In the next part, statistical methods will be applied to quantify the effectiveness of the selection process.