Statistical football prediction is a method used in sports betting to predict the outcome of football matches by means of statistical tools. The goal of statistical match prediction is to outperform the predictions of bookmakers, who use such predictions to set odds on the outcome of football matches.

The most widely used statistical approach to prediction is ranking. Football ranking systems assign a rank to each team based on its past game results, so that the highest rank is assigned to the strongest team. The outcome of a match can be predicted by comparing the opponents' ranks. Several football ranking systems exist; widely known examples are the FIFA World Rankings and the World Football Elo Ratings.

There are three main drawbacks to football match predictions that are based on ranking systems:
# Ranks assigned to the teams do not differentiate between their attacking and defensive strengths.
# Ranks are accumulated averages, which do not account for skill changes in football teams.
# The main goal of a ranking system is not to predict the results of football games, but to sort the teams according to their average strength.

Another approach to football prediction is known as rating systems. While ranking refers only to team order, rating systems assign to each team a continuously scaled strength indicator. Moreover, a rating can be assigned not only to a team but also to its attacking and defensive strengths, its home field advantage or even the skills of each team player (according to Stern).
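Comparing two teams' ratings can be illustrated with the Elo expected-result curve. The sketch below is a minimal illustration and not a published implementation. The function name `expected_result` is hypothetical, and the 400-point scale and 100-point home bonus are assumptions taken from the usual description of the World Football Elo Ratings.

```python
def expected_result(rating_home, rating_away, home_bonus=100.0):
    """Expected result for the home side (win = 1, draw = 0.5, loss = 0).

    Assumption: Elo-style curve with a 400-point scale; `home_bonus`
    is added to the rating difference for the team playing at home.
    """
    dr = rating_home - rating_away + home_bonus
    return 1.0 / (10.0 ** (-dr / 400.0) + 1.0)


# Two equally rated teams on a neutral ground: neither side is favoured.
print(expected_result(1900, 1900, home_bonus=0.0))  # → 0.5

# A stronger home side gets an expected result above 0.5.
print(round(expected_result(2000, 1800), 3))
```

The curve only compares overall strength, which reflects the first drawback listed above: a single number per team cannot separate attack from defence.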

History

Publications about statistical models for football predictions started appearing in the 1990s, but the first model was proposed much earlier by Moroney, who published his first statistical analysis of soccer match results in 1956. According to his analysis, both the Poisson distribution and the negative binomial distribution provided an adequate fit to the results of football games. The series of ball passes between players during football matches was successfully analyzed using the negative binomial distribution by Reep and Benjamin in 1968. They improved this method in 1971, and in 1974 Hill indicated that soccer game results are to some degree predictable and not simply a matter of chance.

The first model predicting outcomes of football matches between teams with different skills was proposed by Michael Maher in 1982. According to his model, the goals that the opponents score during the game are drawn from the Poisson distribution. The model parameters are defined by the difference between attacking and defensive skills, adjusted by the home field advantage factor. The methods for modeling the home field advantage factor were summarized in an article by Courneya and Carron in 1992. Time-dependency of team strengths was analyzed by Knorr-Held in 1999. He used recursive Bayesian estimation to rate football teams: this method was more realistic than soccer prediction based on common average statistics.

Football Prediction Methods

All the prediction methods can be categorized according to tournament type, time-dependence and regression algorithm. Football prediction methods differ between round-robin tournaments and knockout competitions. The methods for knockout competitions are summarized in an article by Diego Kuonen. The methods related to round-robin tournaments are described in the sections below.

Time-Independent Least Squares Rating

This method assigns to each team in the tournament a continuously scaled rating value, so that the strongest team has the highest rating. The method is based on the assumption that the difference between the ratings of the rival teams is proportional to the outcome of each match. Assume that teams A, B, C and D are playing in a tournament and that the outcome of each match is recorded as a score difference. Though the ratings $r_A$, $r_B$, $r_C$ and $r_D$ of teams A, B, C and D respectively are unknown, it may be assumed that the outcome of match #1 is proportional to the difference between the ratings of teams A and B: $y_1=r_A-r_B+\varepsilon_1$. Here $y_1$ corresponds to the score difference and $\varepsilon_1$ is the observation noise. The same assumption can be made for all the matches in the tournament:

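$$\begin{aligned}
   y_1 &= r_A-r_B+\varepsilon_1  \\
   y_2 &= r_{h_2}-r_{a_2}+\varepsilon_2  \\
   &\;\;\vdots  \\
   y_N &= r_{h_N}-r_{a_N}+\varepsilon_N
\end{aligned}$$

where $h_k$ and $a_k$ denote the home and away team of match $k$, and $N$ is the number of matches.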

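By introducing a selection matrix X, the equations above can be rewritten in a compact form:

$$\mathbf{y}=\mathbf{Xr}+\mathbf{e}$$

where $\mathbf{y}$ is the vector of score differences, $\mathbf{r}$ the vector of ratings and $\mathbf{e}$ the noise vector. Entries of the selection matrix can be either 1, 0 or -1, with 1 corresponding to home teams and -1 to away teams. For example, the row for match #1, in which team A is at home to team B, is $(1,\ -1,\ 0,\ 0)$.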

If the matrix $\mathbf{X}^{T}\mathbf{X}$ has full rank, the algebraic solution of the system may be found via the least squares method:

$$\mathbf{r}=\left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}\mathbf{X}^{T}\mathbf{y}$$

If not, one can use the Moore–Penrose pseudoinverse to get:

$$\mathbf{r}=\mathbf{X}^{+}\mathbf{y}$$

The final rating parameters are $\mathbf{r}=[1.625,\ 0.75,\ -0.875,\ -1.5]^{T}$. In this case, the strongest team has the highest rating. The advantage of this rating method over standard ranking systems is that the numbers are continuously scaled, quantifying the difference between the teams' strengths.
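The least squares rating can be sketched in plain Python as follows. The five match results are illustrative assumptions, since the original match table is not reproduced here; they were chosen so that the solution agrees with the ratings 1.625, 0.75, -0.875 and -1.5. The function name `least_squares_ratings` is hypothetical. Every row of the selection matrix sums to zero, so $\mathbf{X}^{T}\mathbf{X}$ is singular for this design. The sketch therefore picks the solution whose ratings sum to zero, which is the one the pseudoinverse gives when all teams are connected through matches.

```python
def least_squares_ratings(teams, matches):
    """Solve the normal equations X^T X r = X^T y subject to sum(r) = 0.

    matches: list of (home, away, home_goals, away_goals) tuples.
    """
    n = len(teams)
    idx = {team: k for k, team in enumerate(teams)}
    # Build A = X^T X and b = X^T y match by match (+1 home, -1 away).
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for home, away, home_goals, away_goals in matches:
        h, a, y = idx[home], idx[away], home_goals - away_goals
        A[h][h] += 1.0
        A[a][a] += 1.0
        A[h][a] -= 1.0
        A[a][h] -= 1.0
        b[h] += y
        b[a] -= y
    # The rows of A sum to zero, so one equation is redundant:
    # replace the last one with the constraint r_1 + ... + r_n = 0.
    A[-1] = [1.0] * n
    b[-1] = 0.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(A[i][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    r = [0.0] * n
    for row in reversed(range(n)):
        known = sum(A[row][k] * r[k] for k in range(row + 1, n))
        r[row] = (b[row] - known) / A[row][row]
    return dict(zip(teams, r))


# Hypothetical results: (home, away, home goals, away goals).
matches = [("A", "B", 3, 1), ("C", "D", 2, 1), ("D", "B", 1, 4),
           ("A", "D", 3, 1), ("B", "C", 2, 0)]
ratings = least_squares_ratings(["A", "B", "C", "D"], matches)
print(ratings)
```

With a linear algebra library the same result could be obtained from a pseudoinverse routine; the explicit constraint is used here only to keep the sketch dependency-free.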

Time-Independent Poisson Regression

According to this model (Maher), if $X_{i,j}$ and $Y_{i,j}$ are the goals scored by the home and away sides in the match where team i plays against team j, then:

$$\begin{aligned}
 X_{i,j} &\sim \text{Poisson}(\lambda ) \\
 Y_{i,j} &\sim \text{Poisson}(\mu )
\end{aligned}$$

$X_{i,j}$ and $Y_{i,j}$ are independent random variables with means $\lambda$ and $\mu$. Thus, the joint probability of the home team scoring x goals and the away team scoring y goals is a product of the two independent probabilities:

$$P\left( X_{i,j}=x,Y_{i,j}=y \right)=\frac{\lambda^{x}e^{-\lambda}}{x!}\cdot\frac{\mu^{y}e^{-\mu}}{y!}$$
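The product of two independent Poisson probabilities can be turned into win, draw and loss probabilities by summing over a grid of scores. The sketch below is a minimal illustration. The rates `lam=1.6` and `mu=1.1` are made-up values, the function names are hypothetical, and the grid is truncated at 15 goals per side on the assumption that higher scores carry negligible probability.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals are Poisson with mean `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


def outcome_probabilities(lam, mu, max_goals=15):
    """Return (home win, draw, away win) under the independent Poisson model."""
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, lam) * poisson_pmf(y, mu)  # joint probability
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    return home, draw, away


home, draw, away = outcome_probabilities(lam=1.6, mu=1.1)
print(round(home, 3), round(draw, 3), round(away, 3))
```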

while the generalized log-linear model for $\lambda$ and $\mu$ according to Kuonen and Lee is defined as $\log \left( \lambda \right)=c^{\lambda}+a_{i}-d_{j}+h$ and $\log \left( \mu \right)=c^{\mu}+a_{j}-d_{i}$, where $a_{i},d_{i},h > 0$ refer to attacking and defensive strengths and to home field advantage respectively. $c^{\lambda}$ and $c^{\mu}$ are correction factors which represent the means of goals scored during the season by home and away teams. Assuming that C signifies the number of teams participating in a season and N stands for the number of matches played so far, the team strengths can be estimated by minimizing the negative log-likelihood function with respect to the parameters that define $\lambda$ and $\mu$:

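$$\begin{aligned}
 L(a_{i},d_{i},h;\ i=1,..C) &= -\log \prod\limits_{n=1}^{N} \frac{\lambda_{n}^{x_{n}}e^{-\lambda_{n}}}{x_{n}!}\cdot\frac{\mu_{n}^{y_{n}}e^{-\mu_{n}}}{y_{n}!} \\
 &= -\sum\limits_{n=1}^{N} \log \left( \frac{\lambda_{n}^{x_{n}}e^{-\lambda_{n}}}{x_{n}!}\cdot\frac{\mu_{n}^{y_{n}}e^{-\mu_{n}}}{y_{n}!} \right) \\
 &= \sum\limits_{n=1}^{N}\lambda_{n}+\sum\limits_{n=1}^{N}\mu_{n}-\left( \sum\limits_{n=1}^{N}x_{n}\log \lambda_{n} \right)-\left( \sum\limits_{n=1}^{N}y_{n}\log \mu_{n} \right)+\sum\limits_{n=1}^{N}\log \left( x_{n}! \right)+\sum\limits_{n=1}^{N}\log \left( y_{n}! \right)
\end{aligned}$$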

Given that $x_{n}$ and $y_{n}$ are known, the team attacking and defensive strengths $\left( a_{i},d_{i} \right)$ and home ground advantage $\left( h \right)$ that minimize the negative log-likelihood can be estimated by expectation maximization:

$$\underset{a_{i},d_{i},h}{\min}\,L(a_{i},d_{i},h,\ i=1,..C)$$
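The minimization can be sketched numerically. The text names expectation maximization; the sketch below swaps in plain gradient descent on the same negative log-likelihood because it is shorter to write down. It makes several assumptions that are not in the original. A single shared constant `c` (the log of the average goals per team per match) replaces the two correction factors, so the home effect is carried entirely by `h`. The positivity constraints on the strengths are not enforced. The match list is made-up data for a double round-robin of four teams, and all function and variable names are hypothetical.

```python
import math


def fit_poisson_ratings(teams, matches, lr=0.01, steps=3000):
    """Fit attack (a), defence (d) and home advantage (h) by gradient descent."""
    total_goals = sum(x + y for _, _, x, y in matches)
    c = math.log(total_goals / (2 * len(matches)))  # shared constant (assumption)
    a = {t: 0.0 for t in teams}
    d = {t: 0.0 for t in teams}
    h = 0.0

    def rates(i, j):
        lam = math.exp(c + a[i] - d[j] + h)  # home scoring rate
        mu = math.exp(c + a[j] - d[i])       # away scoring rate
        return lam, mu

    def negative_log_likelihood():
        value = 0.0
        for i, j, x, y in matches:
            lam, mu = rates(i, j)
            value += lam + mu - x * math.log(lam) - y * math.log(mu)
            value += math.lgamma(x + 1) + math.lgamma(y + 1)  # log(x!) + log(y!)
        return value

    start = negative_log_likelihood()
    for _ in range(steps):
        grad_a = {t: 0.0 for t in teams}
        grad_d = {t: 0.0 for t in teams}
        grad_h = 0.0
        for i, j, x, y in matches:
            lam, mu = rates(i, j)
            # d/d(log lam) of (lam - x log lam) is (lam - x); same for mu.
            grad_a[i] += lam - x
            grad_d[j] -= lam - x
            grad_h += lam - x
            grad_a[j] += mu - y
            grad_d[i] -= mu - y
        for t in teams:
            a[t] -= lr * grad_a[t]
            d[t] -= lr * grad_d[t]
        h -= lr * grad_h
    return a, d, h, start, negative_log_likelihood()


teams = ["A", "B", "C", "D"]
# Hypothetical results: (home, away, home goals, away goals).
matches = [("A", "B", 2, 1), ("A", "C", 3, 0), ("A", "D", 2, 0),
           ("B", "A", 1, 1), ("B", "C", 2, 1), ("B", "D", 3, 1),
           ("C", "A", 0, 2), ("C", "B", 1, 1), ("C", "D", 2, 1),
           ("D", "A", 1, 3), ("D", "B", 0, 1), ("D", "C", 2, 1)]
a, d, h, nll_start, nll_end = fit_poisson_ratings(teams, matches)
print(round(h, 2), round(nll_start, 2), round(nll_end, 2))
```

Adding the same constant to every attack and every defence value leaves all rates unchanged, so only differences between the fitted strengths are meaningful in this sketch.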

Improvements for this model were suggested by Mark Dixon and Stuart Coles. They introduced a correlation factor for the low scores 0-0, 1-0, 0-1 and 1-1, where the independent Poisson model does not hold. Dimitris Karlis and Ioannis Ntzoufras built a time-independent Skellam distribution model. Unlike the Poisson model, which fits the distribution of scores, the Skellam model fits the difference between home and away scores.
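A low-score correction of this kind can be sketched as a multiplicative factor on the independent Poisson probabilities. The factor below follows the form usually attributed to Dixon and Coles, with a single dependence parameter `rho`. That parametrization is an assumption here, since the text does not spell it out, and the numeric values and function names are illustrative only.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals are Poisson with mean `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


def low_score_factor(x, y, lam, mu, rho):
    """Adjustment for the scores 0-0, 0-1, 1-0 and 1-1; 1 for all other scores."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def adjusted_score_probability(x, y, lam, mu, rho):
    """Independent Poisson probability of the score x-y times the low-score factor."""
    independent = poisson_pmf(x, lam) * poisson_pmf(y, mu)
    return low_score_factor(x, y, lam, mu, rho) * independent


lam, mu, rho = 1.5, 1.1, -0.1  # illustrative values only
total = sum(adjusted_score_probability(x, y, lam, mu, rho)
            for x in range(16) for y in range(16))
print(round(total, 6))
```

The four adjustments cancel each other, so the corrected probabilities still sum to one. A negative `rho` shifts probability towards the draws 0-0 and 1-1.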

Time-Dependent Markov Chain Monte Carlo

On the one hand, statistical models require a large number of observations to estimate their parameters accurately, and when there are not enough observations available during a season (as is usually the case), working with average statistics makes sense. On the other hand, it is well known that team skills change during the season, making model parameters time-dependent. Mark Dixon and Coles tried to resolve this trade-off by assigning a larger weight to the latest match results. Rue and Salvesen introduced a novel time-dependent rating method using a Markov chain model. They suggested modifying the generalized linear model above for $\lambda$ and $\mu$:

$$\begin{aligned}
  & \log \left( \lambda  \right)=c^{\lambda}+a_{i}-d_{j}-\gamma \cdot \Delta _{i,j} \\
  & \log \left( \mu  \right)=c^{\mu}+a_{j}-d_{i}+\gamma \cdot \Delta _{i,j}
\end{aligned}$$

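given that $\Delta _{i,j}=\frac{a_{i}+d_{i}-a_{j}-d_{j}}{2}$ corresponds to the strength difference between teams i and j. The parameter $\gamma >0$ then represents the psychological effects caused by underestimation of the opposing team's strength. According to the model, the attacking strength $\left( a \right)$ of team A can be described by the standard equations of Brownian motion, $B_{a}\left( t \right)$, for time $t_{1}>t_{0}$:

$$a_{A}^{t_{1}}=a_{A}^{t_{0}}+\left( B_{a}\left( t_{1}/\tau  \right)-B_{a}\left( t_{0}/\tau  \right) \right)\cdot \frac{\sigma _{a}}{\sqrt{1-\gamma \left( 1-\gamma /2 \right)}}$$

where $\tau$ controls the loss of memory rate (the larger $\tau$, the more slowly the strengths change) and $\sigma _{a}^{2}$ is the prior attack variance. Apart from the constant scaling factor $1/\sqrt{1-\gamma \left( 1-\gamma /2 \right)}$, this corresponds to the assumption that:

$$\left. a_{A}^{t_{1}} \right| a_{A}^{t_{0}} \; \sim N\left( a_{A}^{t_{0}},\ \frac{t_{1}-t_{0}}{\tau }\sigma _{a}^{2} \right)$$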
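The Brownian-motion update of an attacking strength can be simulated directly, because a Brownian increment over a time span is normally distributed with variance equal to that span. The sketch below assumes the scaling factor $\sigma_a/\sqrt{1-\gamma(1-\gamma/2)}$ from the Rue and Salvesen formulation. The match dates, parameter values and the function name `evolve_attack` are hypothetical.

```python
import math
import random


def evolve_attack(a_prev, t_prev, t_next, tau, sigma_a, gamma, rng):
    """Draw the attack strength at t_next given its value a_prev at t_prev."""
    # B(t_next / tau) - B(t_prev / tau) is normal with mean 0 and
    # variance (t_next - t_prev) / tau.
    increment = rng.gauss(0.0, math.sqrt((t_next - t_prev) / tau))
    # Assumed constant scaling factor; the denominator is always positive.
    scale = sigma_a / math.sqrt(1.0 - gamma * (1.0 - gamma / 2.0))
    return a_prev + increment * scale


rng = random.Random(0)
match_days = [0, 7, 14, 21, 28]  # hypothetical match dates, in days
path = [0.3]                     # hypothetical initial attack strength
for t_prev, t_next in zip(match_days, match_days[1:]):
    path.append(evolve_attack(path[-1], t_prev, t_next,
                              tau=100.0, sigma_a=0.5, gamma=0.1, rng=rng))
print([round(value, 3) for value in path])
```

A larger `tau` shrinks the increments, so the simulated strength drifts more slowly between matches.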

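Assuming that three teams A, B and C are playing in the tournament and the matches are played in the following order: $t_{0}$: A-B; $t_{1}$: A-C; $t_{2}$: B-C, the joint probability density can be expressed as:

$$\begin{aligned}
  & P(a_{i},d_{i},\gamma ,\,\tau ;\ A,B,C)=P\left( \lambda _{A},t_{0} \right)\cdot P\left( \lambda _{B},t_{0} \right)\cdot P\left( \lambda _{C},t_{0} \right) \\
 & \times P\left( X_{A,B}=x,Y_{A,B}=y \mid \lambda _{A},\mu _{B},t_{0} \right)\cdot P\left( X_{A,C}=x,Y_{A,C}=y \mid \lambda _{A},\mu _{C},t_{1} \right) \\
 & \times P\left( \lambda _{A},t_{1} \mid \lambda _{A},t_{0} \right)\cdot P\left( \mu _{C},t_{1} \mid \mu _{C},t_{0} \right)
\end{aligned}$$

The first line contains the prior densities at the initial time, the second the likelihoods of the observed results, and the third the transition densities between match times; the factors for the third match follow the same pattern.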

Since analytical estimation of the parameters is difficult in this case, the Monte Carlo method is applied to estimate the parameters of the model.
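The idea of estimating parameters by sampling can be illustrated on a much smaller problem. The sketch below is not the sampler of Rue and Salvesen. It is a toy random-walk Metropolis algorithm for one team's log scoring rate, assuming a normal prior on that log rate and a made-up list of goal counts. All names and tuning values are hypothetical.

```python
import math
import random


def log_posterior(theta, goals, prior_sd=1.0):
    """Unnormalized log posterior of theta = log(rate) for Poisson goal counts."""
    rate = math.exp(theta)
    log_likelihood = sum(g * theta - rate for g in goals)  # log(g!) is constant
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return log_likelihood + log_prior


def metropolis(goals, n_samples=5000, step=0.3, seed=0):
    """Random-walk Metropolis sampler; returns samples of theta after burn-in."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, goals)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, goals)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples[n_samples // 5:]  # drop the first 20% as burn-in


# Hypothetical goals scored by one team in 20 matches (30 goals in total).
goals = [2, 1, 0, 3, 1, 2, 1, 0, 2, 4, 1, 1, 2, 0, 3, 1, 2, 1, 2, 1]
samples = metropolis(goals)
rate_mean = sum(math.exp(t) for t in samples) / len(samples)
print(round(rate_mean, 2))
```

In the full time-dependent model the same accept-or-reject step would be applied to all team strengths at all match times, which is why sampling is preferred over an analytical solution.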

Usage for other sports

Models used for association football can be used for other sports with the same counting of goals (points), e.g. ice hockey, water polo, field hockey, floorball, etc. Marek, Ťoupal and Šedivá (2014) build on the research of Maher (1982), Dixon and Coles (1997), and others who used models for association football. They introduced four models for ice hockey:
* Double Poisson distribution model (same as Maher (1982)),
* Bivariate Poisson distribution model that uses a generalisation of the bivariate Poisson distribution that allows negative correlation between random variables (this distribution was introduced in Famoye (2010)),
* Diagonal inflated versions of the previous two models (inspired by Dixon and Coles (1997)), where probabilities of ties 0:0, 1:1, 2:2, 3:3, 4:4, and 5:5 are modelled with additional parameters.

Older information (results) is discounted in the process of estimation in all four models. The models are demonstrated on the highest-level ice hockey league in the Czech Republic, the Czech Extraliga, between seasons 1999/2000 and 2011/2012. The results were successfully applied in fictitious betting against bookmakers.
