Abstract

Intransitive player dominance, where player A beats B, B beats C, but C beats A, is common in competitive tennis. Yet, there are few known attempts to incorporate it within forecasting methods. We address this problem with a graph neural network approach that explicitly models these intransitive relationships through temporal directed graphs, with players as nodes and their historical match outcomes as directed edges. We find the bookmaker Pinnacle Sports poorly handles matches with high intransitive complexity and posit that our graph-based approach is uniquely positioned to capture relational dynamics in these scenarios. When selectively betting on higher intransitivity matchups with our model (65.7% accuracy, 0.215 Brier Score), we achieve significant positive returns of 3.26% ROI with Kelly staking over 1903 bets, suggesting a market inefficiency in handling intransitive matchups that our approach successfully exploits.

=0 =1

Research highlight 1 - We develop a novel tennis match outcome forecasting model using temporal directed graphs encoding player matchup histories and a spectral graph convolutional network that learns from intransitive player relationships.

Research highlight 2 - Our model predicts match outcomes with 65.7% accuracy and a Brier score of 0.215 across both men’s and women’s tours, performing competitively against the Elo rating system.

Research highlight 3 - We devise a profitable betting strategy by using our predictive model and targeting matchups belonging to intransitive local neighbourhoods. It achieves 3.26% ROI from 1903 bets, potentially indicating a market inefficiency regarding intransitive matchups.

Research highlight 4 - Our model is more robust to highly intransitive matchups than the bookmaker Pinnacle Sports, with our findings suggesting the betting market does not effectively consider intransitive matchups.

Sports forecasting ,Betting ,Market efficiency ,Probability forecasting ,Robustness ,Tennis prediction

1 Introduction↩︎

Over the past two decades, the field of sports analytics has evolved rapidly, with researchers exploring a variety of challenges, from assessing competitive balance in team sports [1], to generating artificial datasets [2]. Considerable effort has also been devoted to developing highly accurate models for forecasting match outcomes [3].

Tennis is a sport well-suited to predictive modelling, with dense tournament schedules generating extensive historical data. The official ranking systems of the Association of Tennis Professionals (ATP) and Women’s Tennis Association (WTA) have been shown to exhibit some predictive power for match outcomes [4], [5], but there are notable limitations: for example, ranking points accumulate over a 52-week period, without decay, which can mask recent changes in player form; while match-specific factors, such as surface type, tournament progression difficulty, and margin of victory in individual matches, are overlooked.

Some well-known methods have been applied to tennis and modified to accommodate these factors, such as a Bradley-Terry model with surface-specific adjustments [6] or Elo rating systems that incorporate margin of victory [7], [8].

Bookmakers are considered the most accurate predictors of match outcomes [9], with sophisticated models that adjust odds based on betting patterns and proprietary methods. Yet, despite the multi-billion dollar betting industry, one limitation that persists is the poor consideration of intransitivity [10]. Intransitivity is analogous to rock-paper-scissors. In tennis, it occurs where player A tends to defeat B, B defeats C, yet C defeats A, violating the assumption of transitive dominance. A famous example is Nadal beats Federer, Davydenko beats Nadal, yet Federer beats Davydenko. Such patterns are known to be common in tennis, with two recent papers finding evidence in both the WTA and ATP tours [11], [12], and recent work has shown these patterns create systematic biases in rating systems like Elo [13].

Representing historical tennis matches as a graph (where each player-matchup can be represented as an edge joining two player-nodes) can preserve intransitive dynamics. Several existing graph-based approaches have reported strong performance, such as the eigenvector centrality approach by [14]. However, global ranking methods such as eigenvector centrality cannot capture intransitive matchups, where player-specific dynamics matter more than overall strength. To better handle these scenarios, a graph model must effectively learn from the local neighbourhoods of players.

We tackle this fundamental limitation of sports forecasting models with graph neural networks (GNNs), =1a predictive approach that has not previously been applied to pre-match tennis prediction.¹

We construct gender-specific surface graphs that summarise match history and preserve player relationships, and apply a spectral graph convolutional network for directed graphs named MagNet [15]. After establishing that our model performs competitively against established approaches such as the Weighted Elo approach by [8] and the odds-implied probabilities by the bookmaker Pinnacle Sports, we analyse our model’s performance in predicting highly intransitive matchups. We adapt an existing graph-based measure of intransitivity by [13], finding that women’s tennis exhibits 11.5% more intransitivity than men’s. Using the adapted measure, we find that our model is more robust than the bookmaker when forecasting intransitive matchups. We then develop a profitable betting strategy that targets matches expected to be mispriced.

In Section 2, we describe the related background of tennis forecasting, graph-based predictive methods, and intransitivity. In Section 3, we describe the necessary data for our research. Section 4 introduces our graph-based method, while in Section 5, we report the models’ out-of-sample accuracy. In Section 6, we test for model robustness and market inefficiency. Finally, in Section 7, we summarise our findings, acknowledge key limitations, and outline future research directions.

2 Background↩︎

Tennis presents an ideal domain for predictive modelling due to several of its unique characteristics. Its scoring structure is inherently hierarchical, advancing from points to games to sets, which lends itself to mathematical modelling. Furthermore, the individual nature of singles tennis eliminates the complexity of team dynamics found in other sports, where player transfers and injuries can confound prediction models. The worldwide tournament calendar also ensures that the same players face one another repeatedly on multiple surfaces and across various event tiers, thereby generating a dense record of head-to-head encounters that can be analysed for statistical inference.

Various predictive modelling approaches have been developed for tennis. Point-based models determine match win probabilities by first estimating on-serve and return strengths to obtain point win probabilities and then applying hierarchical equations that reflect the structure of tennis scoring. In contrast, pairwise comparison models use head-to-head match outcomes to dynamically update relative player strength scalars. Regression models using player statistics as predictors can also be used. [9] compared the performance of various published methods across these categories. The Elo rating system [16] is a popular pairwise comparison model and a specialised implementation of the Elo rating system by [17] achieved 70% accuracy in the survey, second only to the Bookmaker Consensus Model (BCM) by [18], which achieved 72%.

While the original Elo rating model by [16] was intended for chess, its application to tennis is well-established. [7] developed four distinct models that extended Elo by incorporating various margin of victory measures, including: set scores, games won, break points won, total points won, and serve percentage won. [8] proposed a similar method, named Weighted Elo, that adjusts rating updates based on the proportion of games won in each match. Both methods demonstrated improvements over the standard Elo system, confirming that incorporating greater match information leads to more accurate predictions.

One group of models the survey by [9] did not consider was graph theoretic methods, such as a PageRank approach by [19]. A research direction of particular relevance to this paper is the application of graph theory to tennis match forecasting. Rather than directly forecasting match outcomes, two early studies that applied graph theory to tennis offered alternative rating systems to the official ATP ratings. [20] applied PageRank, a node importance algorithm originally developed by [21], to identify the most successful men’s tennis players, using a dataset of matches played from January 1968 to October 2010. They observed that their ranking method outperformed ATP and ITF rankings by anticipating the best player of the subsequent year by these official rankings. However, they found active players were penalised compared to retired ones due to their incomplete career data.

[19] extended Radicchi’s PageRank approach by evaluating its performance as a predictor of match outcomes. The authors found that PageRank-based rankings slightly outperformed official rankings in classification accuracy, achieving 67.0% versus 66.4% for ATP matches when selecting the higher-ranked player as the winner. Though they noted limitations, including tournament seeding effects that disadvantage lower-ranked players, resulting in fewer matches against varied opponents and insufficient data for accurate ranking of weaker players. Additionally, their study did not consider different court surfaces.

[22] incorporated court surface considerations with a coloured directed network. With the use of 4-node subgraphs, the authors developed an Orbit-score ranking system that captured indirect dominance relationships between players from different time periods. The model identified surface specialists, all-round players, and players with an Achilles-heel on a specific surface. While their model enabled more detailed player differentiation, it was not extended to directly predict individual match outcomes.

More recently, [23] used three graph-based centrality measures (out-in-degree-difference, Hubs, and PageRank) to estimate a surface-specific score for players. The score was used as a feature in several machine learning models to predict individual matches, with an SVM+ model achieving the highest classification accuracy of 66% on a dataset of 21,083 matches between 2012 and 2020. The authors’ approach demonstrated the effectiveness of graph-based features, but prioritised machine learning sophistication over advancing the graph-theoretical methods needed to fully capture the complex directed relationships in tennis competition.

[14] developed a new model based on the eigenvector centrality, deriving global ratings that were subsequently used as covariates in a simple logit model for match prediction. The eigenvector centrality approach attained a Brier score of 0.194 on ATP matches between 2016 and 2020, outperforming several competing methods, including standard Elo (0.206) and the margin of victory Elo extension (0.204) by [7]. This impressive performance demonstrated the growing effectiveness of graph-theoretic methods in tennis prediction while suggesting potential for even more sophisticated approaches.

In other domains, recent advances in deep learning applied to graphs, particularly Graph Neural Networks (GNNs), have shown considerable promise. There have been successful applications of GNNs in other sports, such as American football and Counter-Strike: Global Offensive [24], association football [25], and basketball [26].

Unlike traditional GNN architectures designed for undirected or symmetric graphs, recent work on spectral methods for directed graphs enables learning from cyclic structures through complex-valued representations. [15] used the magnetic Laplacian (a complex Hermitian matrix) to encode directed cycle information in its eigenvalue spectrum, with a tunable parameter \(q\) that controls sensitivity to cyclical patterns. In their experiments, they found that values of \(q\) approaching 0.25 were optimal when directional cycles were critical to the prediction task.

Such spectral methods can address intransitive relationships between nodes, where \(A \to B \to C \to A\) forms a directed cycle that cannot be resolved into a strict hierarchical ordering. The capability to learn from directed cycles, rather than through symmetrisation, makes spectral methods for directed graphs specifically suited to tennis, where intransitive matchup patterns are prevalent [11], [12]. Representing tennis players as nodes and their matchup histories as directed edges, GNN architectures like MagNet [15] can convert intransitive matchup patterns into reliable forecasts.

An additional motivation is that betting markets have been shown to exhibit limited sensitivity to intransitive matchup dynamics. In association football, [10] examined persistent intransitive triads in the English Premier League and found that bookmakers’ odds remain cardinally transitive (interaction terms near zero) rather than reflecting matchup-specific evidence. Consequently, some simple triad betting rules yielded statistically significant profits in several cases (e.g., the Tottenham–Man City–Newcastle triad: £3.28 per season; £69 cumulative over 21 seasons), suggesting value in models that capture intransitive patterns.

3 Data↩︎

We collected historical data on professional tennis matches organised by both the Association of Tennis Professionals (ATP) and the Women’s Tennis Association (WTA) from tennis-data.co.uk, spanning 1 January 2014 to 8 June 2025.

The dataset contains information such as match date, competitors, games won by each competitor for each set played, tournament, surface, round of the tournament, and betting odds from several bookmakers; from which we select Pinnacle Sports due to the consistent availability of their odds across the dataset timespan. Additionally, Pinnacle Sports is recognised as a “sharp” bookmaker—one that operates with low margins and adjusts odds aggressively based on large betting volumes, which leads to accurate outcome forecasting [27], [28]. We also gathered the following player characteristics from tennisexplorer.com: height, weight, date of birth, and handedness. For missing characteristic values, the median of the population is imputed.

We filter for matches played in the highest tiers of professional tennis for both men and women: Grand Slams, the season-ending tour finals, and all 1000 and 500 series events. We treat men’s and women’s tennis as completely separate prediction tasks, conducting distinct walk-forward validation training approaches for each, with the exception that hyperparameter tuning is run jointly across both datasets to share an optimal configuration.

After removing incomplete matches and entries with missing or incorrect values, the final dataset comprises 16,663 matches played by 598 men and 16,447 matches played by 567 women. We use a validation set from 29 August 2019 to 20 November 2022 for our parameter optimisation, followed by an out-of-sample test set from 01 January 2023 to 08 June 2025 for model evaluation and the betting simulation.

4 Graph Model↩︎

In their overview of temporal graph networks, [29] describe two primary approaches to temporal graph representation: snapshot-based (using a sequence of static graphs capturing network states at different time points) and event-based (representing edges as a continuous stream of timestamped interactions). Here, we adopt a snapshot-based approach by interpreting each tournament round (e.g., Wimbledon 2010 Quarter Finals) as a discrete timestamped snapshot.

We now describe our graph construction, the use of MagNet for match prediction, and our walk-forward validation approach. Table 1 summarises the key mathematical notation used in this section.

Table 1: Mathematical Notation.
Symbol	Description
\(s, t\)	Court surface (clay, grass, hard)
\(i, n\)	Snapshot indices: arbitrary (\(i\)), target (\(n\))
\(k\)	Historical match index
\(\tau_n, \tau_k\)	Timestamp: snapshot \(n\) (earliest scheduled match); match \(k\)
\(u, v\)	Player indices
\(G^s_i\)	Surface-specific graph at snapshot \(i\)
\(V, E_i\)	Node set (players); edge set at snapshot \(i\)
\(W^s_i, w_{uv}\)	Edge weights for surface \(s\); weight from player \(u\) to \(v\)
\(\mathbf{X}^V_i\)	Node feature matrix at snapshot \(i\)
\(D^s_n(u,v)\)	Dominance score of \(u\) against \(v\) on surface \(s\) at snapshot \(n\)
\(g_k(u,v)\)	Proportion of games won by \(u\) versus \(v\) in match \(k\)
\(\lambda\)	Time decay rate
\(\alpha_{s,t_k}\)	Surface similarity (skill transfer from surface \(t_k\) to \(s\))
\(\beta_k\)	Tournament prestige parameter; value determined by tier of match \(k\)
\(\phi_k\)	Time decay coefficient for match \(k\)
\(\hat{p}_{uv}\)	Predicted set-win probability for players \(u\) against \(v\)
\(\hat{P}_3, \hat{P}_5\)	Predicted match-win probabilities (best-of-3, best-of-5)

4.1 Graph representation↩︎

In professional tennis, players compete on three distinct court surfaces: clay, grass, and hard. Each surface requires different playing styles and favours different skill sets, with researchers such as [30] finding that “surface on which a game is played on contributes significantly towards a player’s performance”. Players compete separately on the men’s tour (ATP) and women’s tour (WTA), with no cross-gender competition. We construct separate graph representations for each gender and each surface, accounting for this well-documented variation in player performance across surfaces, although we incorporate cross-surface information through our dominance score calculation, which aggregates match results from all surfaces with appropriate weighting.

A static directed graph can be formally defined as \(G=(V,E,\mathbf{X}^V, \mathbf{X}^E)\), where \(V\) represents a set of nodes (or vertices), \(E\) represents the set of edges between nodes, \(\mathbf{X}^V\) represents the associated node features, and \(\mathbf{X}^E\) represents the associated edge features. The edges are directed, meaning \((u,v)\) does not necessitate \((v,u)\) and \(E\) is asymmetric. Importantly, each edge \((u,v)\) can possess a weight \(w_{u,v} \in \mathbf{X}^E\) that quantifies the strength of the relationship.

For each tournament round snapshot, we construct three surface-specific graphs per gender, summarising all the preceding matches. By encoding surface in the graph, we aim to model surface-dependent performance dynamics more precisely. Hence, for each snapshot \(i\) and court surface \(s\) (clay, grass, or hard), we construct the graph: \[\label{eq:graph} G^s_{i} = (V, E_{i}, \mathbf{X}^{V}_{i}, W^s_i),\tag{1}\] where \(W^s_i\) denotes the edge-weight vector. Since this is the only edge feature we consider, we use the simplified notation \(W^s_i\) rather than the more general edge feature matrix \(\mathbf{X}^{E,s}_{i}\).

Table 2: Player Node Features and Handling.
Feature	Type	Data Handling
Height (cm)	Static	Imputation: median (gender-specific)
Weight (kg)	Static	Imputation: median (gender-specific)
Date of birth	Static	Imputation: median (gender-specific)
Handedness	Static	Encoding: one-hot; Imputation: mode (right-handed)
In-degree (per surface)	Dynamic	Normalisation: \(\ell_2\) per snapshot
Out-degree (per surface)	Dynamic	Normalisation: \(\ell_2\) per snapshot

We consider the set of all players of a given gender who competed during the period covered by the dataset as the fixed node set \(V\). Described in Table 2 are the node features we use. We use both static player attributes and dynamic graph-based metrics as node features \(\mathbf{X}^V_i \in \mathbb{R}^{|V|\times d}\). The static player-node feature set comprises the height, weight, date of birth, and handedness of each player. We assume player weights to be static due to data limitations. Additionally, where we could not obtain these values for some players, we imputed the gender-specific medians. For the dynamic node features, we use the surface-graphs node in-degrees and out-degrees, summarising the number of incoming and outgoing edges for each player-node, respectively. Each feature vector \(\mathbf{v}\) is \(\ell_2\)-normalised, scaling the vector by its Euclidean norm \(\|\mathbf{v}\|_2 = \sqrt{\sum_j v_j^2}\), resulting in unit-length vectors that preserve directional information while standardising magnitude. Maintaining the fixed player set \(V\) across surface-specific graphs allows the model to process both the surface-transferable static node features and the surface-specific dynamic features.

We use a common edge index set \(E_i \subseteq V \times V\) across all surfaces, summarising historical head-to-head matchups between players, with edge weights \(W^s_i\) that are surface-specific. Provided there is at least one match played between player \(u\) and player \(v\) prior to snapshot \(i\), we add a weighted directed edge \((u,v)\) to \(E_i\). Each weight \(w_{uv} \in W^s_i\) represents the head-to-head dominance of player \(u\) over player \(v\), where direction indicates the dominant player and magnitude quantifies the degree of dominance. The edge weight \(w_{uv}\) is calculated based on the historical proportion of games won by player \(u\) against player \(v\). This base value is then adjusted using factors for time decay, surface weighting, and tournament weighting. We use the proportion of games won in light of findings by [7], who found that “break points won” and “games won” were the most robust performance measures across several model types. Given the data available to us, we use the proportion of games won \(g_k(u,v) \in [0,1]\) by player \(u\) against \(v\) in historical match \(k\) as the primary variable in our dominance score calculation. This measure can be influenced by an unequal number of service games between players, but it is sufficiently informative for dominance, given data constraints.

An early use of exponential time decay to down-weight older sports match results originates from [31] for association football, where they down-weighted each historical match in the likelihood by an exponential factor. It was subsequently adapted to tennis by [6], who additionally incorporated surface weights to improve forecasting accuracy. We further extend this principle by introducing an explicit tournament quality factor, so that results from higher-prestige events have greater influence on the estimated dominance between players. Hence, to each historical result \(g_k(u,v)\) we apply the time decay coefficient: \[\phi_k(u,v) = \exp({-\lambda\, (\tau_n - \tau_k))},\] where \(\tau_n\) denotes the timestamp of the earliest scheduled match in the tournament round being predicted, \(\tau_k\) denotes the timestamp of historical match \(k\), and the decay rate \(\lambda > 0\) is a learnable parameter governing the rate at which older matches lose influence. This approach ensures that the influence of past interactions diminishes over time, with more recent interactions having a greater impact on the current dominance scores.

For each target surface \(s\), we calculate surface-specific dominance scores that incorporate matches from all surfaces with appropriate weighting. To refine the influence of each historical match, we scale each match by two further coefficients: a surface similarity parameter \(\alpha_{s,t_k}\) and a tournament prestige parameter \(\beta_{k}\). The surface similarity factor \(\alpha_{s,t_k}\) down-weights matches played on surface \(t_k\) when constructing the graph for target surface \(s\), reflecting the reduced transferability of performance across surfaces, with \(\alpha_{s,s} = 1\) for same-surface matches. The tournament prestige factor \(\beta_{k}\) assigns weights based on match \(k\)’s tournament tier relative to Grand Slams, acknowledging that results from higher-prestige events may be more indicative of true player dominance.

These parameter values are optimised using a Tree-structured Parzen Estimator (TPE), described in Section 4.4. The surface-specific dominance score is updated for each new tournament snapshot \(n\) between players \(u\) and \(v\) on surface \(s\) as follows: \[D_n^s(u,v) = \frac{\sum_{k} \alpha_{s,t_k} \beta_k \phi_k g_k(u,v)}{\sum_{k} \alpha_{s,t_k} \beta_k \phi_k},\label{eq:dominance95decay}\tag{2}\] where the sum ranges over all matches \(k\) between players \(u\) and \(v\) that occurred before snapshot \(n\), and \(g_k(u,v)\in[0,1]\) is the proportion of games won by player \(u\) against \(v\) in historical match \(k\), and \(t_k\) is the surface of historical match \(k\). \(\alpha_{s,t_k}\) captures surface similarity, \(\beta_k\) captures tournament prestige, and \(\phi_k\) captures match recency. This calculation is performed independently for each of the three surfaces, yielding three separate graph representations.

The dominance score calculation produces a value \(D_n^s(u,v)\) and its complement \(D_n^s(v,u) = 1 - D_n^s(u,v)\). From these, we create a single directed edge pointing from the player with the higher dominance score to the weaker player. Specifically, if \(D_n^s(u,v) > 0.5\), we assign a directed edge from player \(u\) to player \(v\) with weight \(D_n^s(u,v)\). Conversely, if \(D_n^s(u,v) < 0.5\), we assign a directed edge from player \(v\) to player \(u\) with weight \(1 - D_n^s(u,v)\). If \(D_n^s(u,v) = 0.5\) exactly, no edge is created. This ensures that each player pair contributes at most one directed edge to a graph, with the edge always pointing from the dominant player to the dominated player.

Figure 1: Overview of the graph construction process.

This directed edge construction with asymmetric weights naturally preserves the intransitive relationships that exist between players. Figure 1 provides an overview of our temporal graph construction process.

4.2 MagNet↩︎

To generate match outcome probability estimates, we use the spectral graph convolutional network (GCN) MagNet by [15].² We frame the problem as directional edge prediction, where for a scheduled match between players \(u\) and \(v\), MagNet estimates the edge direction (translating to set-win probability). We select MagNet specifically for its use of the magnetic Laplacian—a complex Hermitian matrix that enables learning from intransitive relationships rather than imposing transitive hierarchies. The magnetic Laplacian encodes undirected geometric structure (existence of head-to-head matches) in the magnitude of its entries and directional information (who dominated) in their phase.

MagNet uses a parameter \(q \in [0, 0.25]\) that controls the extent to which directional information is incorporated. When \(q=0\), the model disregards which player has historically dominated the head-to-head matchups, whereas at \(q=0.25\), this dominance information is maximally integrated. We use a fixed \(q\) of 0.25 to maximise the strength of directional information.

The model is trained to predict set outcomes by minimising cross-entropy loss with label smoothing, using the Adam optimiser. Key hyperparameters include a learning rate of 0.003, weight decay of \(10^{-4}\), and dropout regularisation at 0.3. The label smoothing parameter (0.19) is optimised (described in Section 4.4). Complete mathematical details are provided in 8.

In our dataset, there are both best-of-5 matches (played at men’s Grand Slams) and best-of-3 matches (all women’s matches and non-Grand Slam men’s matches). In a best-of-3 match, there is a greater chance for the lower-ranked player to win as there are fewer opportunities for the stronger player to assert their dominance. Therefore, when producing probability estimates for the out-of-sample test set, we calculate match outcome probabilities for a best-of-3 sets match \(\hat{P}_3\) and a best-of-5 sets match \(\hat{P}_5\) under the assumption of independent and identically distributed sets:

\[\hat{P}_{3}(u,v) = \hat{p}_{uv}^2 + 2\hat{p}_{uv}^2(1-\hat{p}_{uv}), \quad \hat{P}_{5}(u,v) = \hat{p}_{uv}^3 + 3\hat{p}_{uv}^3(1-\hat{p}_{uv}) + 6\hat{p}_{uv}^3(1-\hat{p}_{uv})^2\]

where \(\hat{p}_{uv}\) is the set-win probability for player \(u\) against player \(v\) output by MagNet.

4.3 Walk-forward validation↩︎

Figure 2: The walk-forward validation process. After each pass, predictions are generated and the model is retrained; the historical graph grows as matches are added. For illustration, the schematic is not to scale: in practice, the historical graph and sliding training window span several hundred snapshots.

With the temporal directed graph representation of tennis matches described in Section 4.1, we adopt a walk-forward validation approach, shown in Figure 2, to optimise parameters on a validation set and evaluate the model on an out-of-sample test set.

At each snapshot, three surface graphs are constructed from the earliest 85% of historical matches. The model is then trained to predict outcomes for the most recent 15% of matches, which serve as labelled training edges within these graphs.

The MagNet model parameters are optimised by training on this 15% edge set. The model is initially trained for 150 epochs. To adapt to evolving player dynamics, it is then retrained for an additional 30 epochs every 38 snapshots, which we approximate to a quarter of a year. Each training run completes in under 10 seconds on consumer-grade hardware (RTX 2070 Super, Ryzen 3600).

Following each training run, the model predicts outcomes for the next chronological snapshot. Once the true outcomes for a given snapshot are observed, its matches are integrated into the historical graph. The fixed-size training window then advances, redefining the 15% training set to include the most recent matches for the subsequent cycle.

4.4 Parameter optimisation↩︎

To identify the optimal parameters for graph construction, model architecture, and training, we conduct 300 trials using the walk-forward validation method on data from 01 January 2014 to 20 November 2022, making predictions from 29 August 2019 to 20 November 2022.

@lcc@ Parameter & Search Space & Optimal Value

\(K\), Chebyshev filter order & \(\{1,2,3\}\) & 2
Hidden units & \(\{32,64,128\}\) & 64
Number of layers & \(\{1,2,3\}\) & 2
Use activation function & \(\{\text{True},\text{False}\}\) & False
Label smoothing & \(0.00\)–\(0.2\) & 0.19

\(\lambda\) & \(0.0\)–\(0.5\) & 0.38

\(\alpha_{\text{h,g}}\) & \(0.0\)–\(0.5\) & 0.37
\(\alpha_{\text{h,c}}\) & \(0.0\)–\(0.5\) & 0.01
\(\alpha_{\text{c,g}}\) & \(0.0\)–\(0.5\) & 0.09
\(\alpha_{\text{c,h}}\) & \(0.0\)–\(0.5\) & 0.07
\(\alpha_{\text{g,c}}\) & \(0.0\)–\(0.5\) & 0.05
\(\alpha_{\text{g,h}}\) & \(0.0\)–\(0.5\) & 0.45

\(\beta^{1000}\) & \(0.8\)–\(1.0\) & 0.85
\(\beta^{\text{Finals}}\) & \(0.9\)–\(1.1\) & 0.94
\(\beta^{500}\) & \(0.2\)–\(0.8\) & 0.69

Notes. All parameters were optimised using uniform distributions over the specified search spaces. Range notation \(a\)–\(b\) indicates continuous uniform distributions over closed intervals, while set notation \(\{\dots\}\) indicates discrete uniform distributions. Subscripts for surface transferability (\(\alpha\)) represent the direction of transfer between surface types: hard (h), clay (c), and grass (g). For example, \(\alpha_{\text{h,c}}\) is the parameter for skill transfer from clay to hard. Superscripts for tournament prestige (\(\beta\)) denote the tournament tier: 1000 for Masters and WTA 1000 events, 500 for ATP 500 and WTA 500 tournaments, and Finals for the ATP/WTA Tour Finals.

Table [tab:paramtuning] summarises the search space used for each parameter during optimisation, along with the optimal values found by applying a Tree-structured Parzen Estimator (TPE) algorithm by [33].³ We adopt a multi-objective framework, treating the Brier scores for men’s and women’s tours as separate objectives. We identify Pareto-optimal solutions that have a favourable trade-off between men’s and women’s performance, selecting a unified parameter set that balances predictive accuracy across both genders.

Given the model’s complexity relative to traditional sports forecasting approaches, we select conservative search bounds for the graph construction parameters, described in Section 4.1. Tournament quality parameters are bounded to reflect the hierarchy of tennis events: Grand Slams normalised to unity, 1000 point events constrained to \([0.8, 1.0]\), 500 point events are constrained to \([0.2, 0.8]\), and Tour Finals events are constrained to the range \([0.9, 1.1]\) to account for their high competitive standard given the inclusion of only the season’s top eight ranked players.⁴

For the model architecture, we optimise several hyperparameters. The most important of these are the order of the Chebyshev polynomial approximation in the spectral graph convolution and the number of complex graph convolutional layers, collectively determining the size of the local neighbourhood from which player information is aggregated. We also optimise the dimensionality of the hidden channels, which controls the model’s capacity to learn complex features. Furthermore, we evaluate the inclusion of a complex-valued ReLU activation function, which introduces non-linearity and allows the model to capture more intricate patterns. Finally, to prevent the model from making overconfident predictions, we optimise the degree of label smoothing, a regularisation technique that can improve model calibration and generalisation.

Figure 3: Brier scores from 300 parameter optimisation trials, with Pareto-optimal trials highlighted. Colour indicates receptive field size (K \times L hops). — Figure 3: Brier scores from 300 parameter optimisation trials, with Pareto-optimal trials highlighted. Colour indicates receptive field size (\(K \times L\) hops).

From the Pareto-optimal solutions shown in Figure 3, we select the trial with \(K=2\) and \(L=2\) layers and therefore an effective receptive field spanning \(4\) hops. This neighbourhood size captures direct opponents, their immediate neighbours, and extends to more distant connections. This provides sufficient context for detecting intransitive cycles while avoiding overfitting to global network properties that characterise existing graph-theoretic measures like eigenvector centrality.

5 General Predictive Performance↩︎

We evaluate the proposed graph-based model against the Weighted Elo approach of [8], the Elo approach of [17], a Bradley-Terry model [35], and bookmaker odds from Pinnacle Sports. We initialise the Elo rating systems from 2014, such that ratings are sufficiently stabilised for the out-of-sample test period.

We implement a basic Bradley-Terry model and estimate player strength parameters using all matches in the preceding two years. These parameters are re-estimated for each match using a rolling window approach. Specifically, we use the Iterative Luce Spectral Ranking (ILSR) algorithm by [36] with L2 regularisation of strength \(\lambda = 0.01\).

We derive margin-adjusted implied probabilities from Pinnacle Sports odds using the bookmaker behaviour model of [37]. These odds are also used for our simulated betting strategy evaluated in Section 6.3.

The out-of-sample test period spans from 1 January 2023 to 8 June 2025, and we use two standard metrics to measure model accuracy. Firstly, classification accuracy measures the proportion of matches for which the model correctly predicted the winner. Let \(\hat{o}_m = \mathbb{I}(\hat{p}_m > 0.5)\) be the predicted outcome, where \(\mathbb{I}(\cdot)\) is the indicator function (evaluating to 1 if the condition is true and 0 otherwise). This is based on the predicted probability \(\hat{p}_m\) for the player corresponding to \(o_m=1\), where \(o_m \in \{0, 1\}\) is the actual outcome for match \(m\). Thus, classification accuracy is calculated as: \[\text{Accuracy} = \frac{1}{N} \sum_{m=1}^N \mathbb{I}(\hat{o}_m = o_m)\] where \(N\) is the total number of matches, and higher values indicate better performance. Secondly, the Brier Score [38] provides a measure of the mean squared error between the predicted probabilities and the actual outcomes, reflecting the accuracy and calibration of the probability estimates themselves. It is calculated as: \[\text{Brier Score} = \frac{1}{N} \sum_{m=1}^N (\hat{p}_m - o_m)^2\] where \(\hat{p}_m\) is the predicted probability for match \(m\), and \(o_m \in \{0, 1\}\) is the actual outcome. A lower Brier score indicates more accurate probability predictions.

Table 3: Model Performance vs. Baselines: Men’s and Women’s Results by Surface.
			Model		Elo		WElo		BT		PS
4-5 (lr)6-7 (lr)8-9 (lr)10-11 (lr)12-13 Gender	Surface	Count	Acc	Brier	Acc	Brier	Acc	Brier	Acc	Brier	Acc	Brier
Men	Clay	1375	0.651	0.218	0.636	0.223	0.644	0.219	0.633	0.217	0.703	0.193
Men	Grass	374	0.674	0.195	0.693	0.201	0.709	0.197	0.655	0.207	0.722	0.176
Men	Hard	2401	0.658	0.213	0.663	0.214	0.669	0.212	0.653	0.217	0.688	0.197
Women	Clay	1218	0.662	0.217	0.671	0.212	0.672	0.210	0.656	0.215	0.702	0.193
Women	Grass	356	0.666	0.208	0.691	0.203	0.697	0.200	0.697	0.211	0.691	0.193
Women	Hard	2651	0.652	0.217	0.650	0.216	0.657	0.213	0.641	0.220	0.674	0.201
Both	All	8375	0.657	0.215	0.658	0.215	0.664	0.212	0.648	0.217	0.690	0.196

Notes. Underlined indicates best performance among the evaluated models (excluding PS benchmark) for each surface/gender combination. Count refers to the number of matches in the out-of-sample test set. All metrics rounded to three decimal places. Elo represents the standard Elo rating system by [17]. WElo represents the Elo extension by [8]. BT is the Bradley-Terry model described in Section 2. PS represents Pinnacle Sports odds with [37] margin adjustment.

Overall prediction results and performance by surface are summarised in Table 3, along with performance metrics categorised by surface type. Performance of the model shows some variation across surfaces, with its highest classification accuracy being 67.4% for men’s matches on grass courts. Across all surfaces, the model maintains a competitive performance. The Weighted Elo approach by [8] achieves the best overall performance among competing methods with 66.4% accuracy and 0.212 Brier score, followed closely by the standard Elo approach with 65.8% accuracy and 0.215 Brier score, and our model with 65.7% accuracy and 0.215 Brier score. The Bradley-Terry model performs worst among the evaluated methods with 64.8% accuracy and 0.217 Brier score.

The bookmaker odds’ implied probability, denoted as “PS”, remains notably superior across all surfaces with 69.0% accuracy and 0.196 Brier score, owing to its incorporation of expert knowledge, real-time market adjustments based on betting activity, and state-of-the-art models.

5.1 Model calibration↩︎

Figure 4: Calibration curves for the men’s and women’s predictions, showing predicted probabilities versus observed frequencies. Perfect calibration would align with the diagonal \(y=x\) line..

To assess the reliability of our model’s predictions and motivate future applications of graph neural networks to tennis forecasting, we require a reasonable degree of calibration, defined as how well aligned a model’s predicted probabilities are with the observed frequencies of outcomes. [39] evaluate calibration for their association football forecasting model using calibration curves (also known as reliability plots), which provide a visual assessment of calibration. We adopt their methodology with the minor adaptation that tennis matches have binary outcomes rather than football’s three-way outcomes. Specifically, we plot the empirical frequency \(P(y = 1 \mid q)\) against model-predicted win probabilities \(q\). We follow [39] and use the recursive binning of [40], adding two additional bins for mid-range probabilities (i.e., \([0.375, 0.5)\) and \([0.5, 0.625)\)) to provide finer detail in the probability range common in competitive tennis matches. Perfect calibration corresponds to points lying on the diagonal \(y = x\), indicating precise alignment between predicted and observed frequencies. The resulting calibration curves are shown in Figure 4. Visually, the model demonstrates strong calibration, with predicted probabilities closely matching observed frequencies.

6 Intransitivity and Market Efficiency↩︎

A prior application of MagNet =0[41]used an unoptimised configuration, yet achieved significant betting profits. However, this earlier implementation had substantial methodological limitations—including poorly calibrated time decay parameters, lack of cross-surface information flow, and reduced tournament coverage (detailed comparisons in 9). While suggestive of market inefficiency, we did not investigate the underlying causes.

We investigate whether MagNet’s advantages stem specifically from its ability to handle intransitive relationships — where Player A consistently beats B, B beats C, but C beats A. Traditional rating systems cannot capture such structures because they rely on scalar player strength representations, assigning each athlete a single power rating (or perhaps separate serve and return ratings, as in the point-based model of [42]).

Figure 5: An intransitive triangle among Federer, Nadal, and Davydenko. The edge weights are evaluated at the end of the 2009 ATP season for the hard-court surface graph, with Equation 2 and optimised graph construction parameters detailed in Table [tab:paramtuning]. Each directed edge indicates a consistent head-to-head advantage.

This limitation is particularly relevant in tennis, where stylistic matchups can create intransitive relationships that our graph-based approach is designed to utilise. A well-known example of an intransitive relationship in tennis is shown in Figure 5 and occurred during Roger Federer’s peak in the mid-to-late 2000s, involving Federer, Rafael Nadal, and Nikolay Davydenko. Although Federer was the dominant world number one, Nadal consistently defeated him by exploiting his one-handed backhand with heavy, high-topspin forehands. In turn, Nadal was often troubled by Davydenko (when not playing on clay courts), whose flat, early baseline hitting neutralised Nadal’s spin and pace, resulting in Davydenko’s repeated success on hard courts. Completing the intransitive loop, Federer, despite his struggles against Nadal, dominated Davydenko, using variety and slice to disrupt Davydenko’s rhythm.

6.1 Quantifying intransitivity↩︎

To explore our model’s ability to handle intransitive player dominance, we adapt the intransitivity measure of [13], which is based on Hodge decomposition of an advantage matrix. We calculate this measure for the direct common opponent neighbourhood of each match and scale it by the amount of evidence available for that particular player matchup.

Namely, for each match between players \(u\) and \(v\), we construct a local neighbourhood subgraph to measure intransitivity. We first identify all common opponents \(c\)—that is, players who have competed against both \(u\) and \(v\) prior to snapshot \(i\):

\[\mathcal{C}_{uv} = \{c \in V : (u,c) \in E^{s}_i \text{ and } (v,c) \in E^{s}_i\}.\]

We extract the corresponding \((|\mathcal{C}_{uv}| + 2) \times (|\mathcal{C}_{uv}| + 2)\) subgraph \(G_{uv}\) containing all players \(\{u, v\} \cup \mathcal{C}_{uv}\) and construct the advantage matrix \(A_{uv}\) by applying a logit transformation:

\[A_{uv}[i,j] = \log\left(\frac{w_{ij}}{1 - w_{ij}}\right)\] where \(w_{ij}\) represents the dominance score of player \(i\) over player \(j\) within the neighbourhood. This transformation maps bounded dominance scores \([0,1]\) to unbounded advantages \((-\infty, +\infty)\), ensuring antisymmetry for proper application of Hodge decomposition.

The Hodge decomposition separates \(A_{uv}\) into a transitive component \(\text{grad} \circ \text{div}(A_{uv})\) and a cyclic component \(A_{uv} - \text{grad} \circ \text{div}(A_{uv})\). The original [13] intransitivity measure is:

\[I(A_{uv}) = \frac{1 + \|A_{uv} - \text{grad} \circ \text{div}(A_{uv})\|_F}{1 + \|\text{grad} \circ \text{div}(A_{uv})\|_F}. \label{eq:hamilton95original}\tag{3}\]

A limitation of applying Equation 3 to our problem is that it quantifies the intransitivity of an entire common opponent subgraph uniformly across all edges, whereas we require a matchup-specific measure. To address this, we scale the common opponent intransitivity measure by the accumulated weight for matchup \((u,v)\), prioritising relationships supported by stronger evidence. Specifically, we define the evidence-weighted intransitivity measure as: \[I^*(A_{uv}) = I(A_{uv}) \cdot \sqrt{\sum_{k} \alpha_{s,t_k} \beta_k \phi_k}, \label{eq:weighted95intransitivity}\tag{4}\] where the sum ranges over all matches \(k\) between players \(u\) and \(v\) that occurred before snapshot \(n\), directly matching the denominator of Equation 2 . The accumulated weight captures effective evidence from match recency, tournament prestige, and surface correlations, with square root scaling moderating the influence of highly-weighted matchups.

Figure 6: Distribution of evidence-weighted intransitivity I^*(A_{uv}) by surface and gender. Mean values: hard (men 3.10, women 3.52), clay (men 1.86, women 1.67), grass (men 2.02, women 2.28). Sample sizes: hard (men n=1,034, women n=1,002), clay (men n=368, women n=303), grass (men n=134, women n=124). — Figure 6: Distribution of evidence-weighted intransitivity \(I^*(A_{uv})\) by surface and gender. Mean values: hard (men 3.10, women 3.52), clay (men 1.86, women 1.67), grass (men 2.02, women 2.28). Sample sizes: hard (men n=1,034, women n=1,002), clay (men n=368, women n=303), grass (men n=134, women n=124).

Figure 6 reveals distinct surface-specific intransitivity patterns: hard and clay courts exhibit bimodal distributions indicative of heterogeneous matchup dynamics—both stable player hierarchies (low \(I^*(A_{uv})\)) and upset-prone matchups (high \(I^*(A_{uv})\)). The magnitude of the hard court’s intransitivity is increased by larger evidence weights due to a greater number of matches played on the surface. In contrast, grass courts show a concentration of moderate intransitivity despite low evidence, likely driven by the surface’s faster pace and lower friction.

Figure 7: Two strong intransitive cycles involving Maria Sakkari and Aryna Sabalenka before their match in the 2021 Abu Dhabi WTA Women’s Tennis Open (I^*(A_{uv})=8.54). — Figure 7: Two strong intransitive cycles involving Maria Sakkari and Aryna Sabalenka before their match in the 2021 Abu Dhabi WTA Women’s Tennis Open (\(I^*(A_{uv})=8.54\)).

Women’s tennis exhibits a higher mean intransitivity than men’s on hard (3.52 vs 3.10) and grass courts (2.28 vs 2.02), but notably lower on clay (1.67 vs 1.86). Overall, women’s tennis exhibits 11.5% more intransitivity than men’s (3.02 vs 2.71). Figure 7 shows two of the intransitive cycles that contributed to the most intransitive match from the validation set.

6.2 Model robustness↩︎

We evaluate model robustness to intransitivity by dividing predicted matches into bins based on \(I^*(A_{uv})\) values and examining performance patterns across these bins. Our binning strategy separates matches with \(I^*(A_{uv}) = 0\) (Bin 0, representing matchups with no observed prior meetings) from those with non-zero values, which we further divide into tertiles (Bins 1-3), representing low, medium, and high intransitivity.

For each bin, we calculate Brier score differences between our model and two benchmarks (PS and WElo) to quantify relative performance. We focus on how the performance gap between our model and these benchmarks changes from low-intransitivity (Bin 0) to high-intransitivity (Bin 3) contexts.

5pt

Table 4: Model Robustness to Intransitivity Analysis
Bin	\(I^*(A_{uv})\)	N	Brier Score			Model - Benchmark
4-6 (lr)7-8			Model	PS	WElo	PS	WElo
0	0	4871	0.217	0.193	0.213	+0.023	+0.004
1	0.1–2.0	1168	0.209	0.190	0.200	+0.019	+0.008
2	2.0–3.5	1168	0.212	0.202	0.214	+0.010	\(-0.002\)
3	3.5–9.6	1168	0.216	0.209	0.218	+0.007	\(-0.002\)

Notes. Overall validation Brier scores: Model = 0.215, PS = 0.196, WElo = 0.212. Bin 0 represents matches with \(I^*(A_{uv}) = 0\) (no previous head-to-head matches). Brier differences calculated as Model minus benchmark (positive values indicate Model underperformance).

Table 4 presents the absolute Brier scores and performance differences for each bin, showing that all models exhibit performance degradation as intransitivity increases, but with notable differences in their robustness. MagNet’s disadvantage relative to the bookmaker’s odds narrows by 68.3% from low-evidence matchups (Bin 0: +0.023 Brier difference) to high-intransitivity scenarios (Bin 3: +0.007). Against WElo, our model achieves superior predictive performance in moderate-to-high intransitivity contexts.

To assess reliability, we employ cluster-robust bootstrap inference (10,000 resamples), resampling players with replacement and including all matches involving any sampled player. This player-level approach accounts for the nested match structure and the wide variation in player match counts. The resulting 95% confidence intervals are narrow and consistent with the reported differences. Testing for a monotonic trend using Spearman rank correlation, we find that our model’s narrowing gap with the bookmaker’s odds is statistically significant (\(\rho = +0.049\), \(p < 0.001\)), confirming systematic improvement as match intransitivity increases. This pattern holds robustly across multiple binning choices (2-5 bins).

Whilst the bookmaker maintains superior absolute accuracy, our model demonstrates substantially greater robustness to intransitive matchups between players. This suggests that our graph-based architecture provides specific advantages in handling intransitive competitive structures, even if it cannot fully overcome the market’s informational advantages.

6.3 Betting simulation↩︎

Following the finding of our model’s robustness under highly intransitive conditions compared to bookmaker implied probabilities, we now conduct a betting simulation to assess whether this translates to exploitable market inefficiencies. Namely, we place bets according to our model’s probabilities that specifically target player-matchups at progressively higher thresholds of intransitivity.

We evaluate two staking strategies: unit staking, where a fixed stake of 1 unit is placed on whichever player our model assigns the higher probability (i.e., the model’s favored player), and Kelly staking, which uses the Kelly criterion [43] to adjust stake sizes proportional to perceived edge, placing bets only when our model’s probability exceeds the market-implied probability (positive expected value). The Kelly formula is given by: \[f^* = \frac{\hat{p}o - 1}{o-1}\] where \(f^*\) is the optimal fraction of the bankroll to wager, \(\hat{p}\) is the model’s estimated probability of winning, and \(o\) represents the decimal odds. To address issues with bankroll volatility common in standard Kelly implementations, we follow the approach of [39], by resetting the bankroll to 1 before each bet, ensuring that each bet is sized independently of previous bets and hence the final return on investment is unaffected by the sequence of bets.

We assess risk-adjusted returns using the return on investment (ROI) and the annualised Sharpe ratio. Let \(P_d\) be the total profit aggregated across all bets on day \(d\). We compute the sample mean \(\bar{P}\) and sample standard deviation \(\sigma_P\) of the daily profit \(\{P_d\}\). Reflecting the near-continuous nature of the sports betting markets, the annualised Sharpe ratio is hence given by: \[S = \frac{\bar{P}}{\sigma_P} \times \sqrt{365.25}.\]

We first find an optimal threshold \(I^*(A_{uv}) \geq \gamma\) that maximises the mean return (average of Kelly and unit staking profits) on the validation set, and compare it to betting without any \(I^*(A_{uv})\) constraint. The optimal threshold is \(\gamma = 2.55\), achieving a Kelly ROI of 1.97% over 1,687 bets and a unit ROI of 2.52% over 1,820 bets on the validation matches.

4pt

@ l c *2S[table-format=4.0] S[table-format=4.2] S[table-format=4.2] S[table-format=3.2,input-signs = -] S[table-format=2.2,input-signs = -] S[table-format=1.2,input-signs = -] @ & & &
(lr)3-8 (lr)9-14 Method & \(\gamma\) & n bets & Staked & Return & Profit & ROI (%) & Sharpe & n bets & Staked & Return & Profit & ROI (%) & Sharpe
Model & 2.55 & 1903 & 265.10 & 273.80 & 8.70 & 3.26 & 0.61 & 2063 & 2063.00 & 2086.50 & 23.50 & 1.14 & 0.48
Model & & 7705 & 1129.60 & 1067.50 & -62.10 & -5.49 & -1.93 & 8375 & 8375.00 & 8237.50 & -137.50 & -1.64 & -1.35
WElo & 2.55 & 1812 & 223.50 & 211.10 & -12.40 & -5.54 & -1.35 & 2063 & 2063.00 & 1993.20 & -69.80 & -3.38 & -1.48
WElo & & 7465 & 1058.20 & 1005.70 & -52.50 & -4.96 & -2.15 & 8375 & 8375.00 & 8218.80 & -156.20 & -1.87 & -1.51

Notes: \(\gamma\) represents the \(I^*(A_{uv})\) threshold. \(I^*(A_{uv}) \geq 2.55\): 2,063 matches (clay: 337, grass: 102, hard: 1,624; men: 963, women: 1,100).

In Table [tab:betting95performance], we report the betting performance from betting according to this model during the out-of-sample test period, with both Kelly and unit staking strategies. Absolute staked and returned amounts between unit and Kelly staking differ largely by design, so the key indicators for comparison in this simulation are the percentage ROI and the Sharpe ratio, which account for relative profitability and risk-adjusted profit, respectively.

When applying the \(I^*(A_{uv}) \geq 2.55\) intransitivity threshold filter, our model achieves positive returns with both Kelly staking (3.26% ROI, Sharpe 0.61) and unit staking (1.14% ROI, Sharpe 0.48), suggesting it successfully identifies market inefficiencies in highly intransitive scenarios, though the Sharpe ratios indicate moderate volatility in returns. In contrast, WElo performs poorly at the same threshold, producing -5.54% Kelly ROI and -3.38% unit ROI. Both models have poor performance without intransitivity filtering, with our model yielding -5.49% Kelly ROI and -1.64% unit ROI, whereas WElo yields -4.96% Kelly ROI and -1.87% unit ROI, highlighting the critical importance of targeting mispriced bets.

We applied a significance test, proposed by [44], to confirm our strategy’s systematic profitability. By simulating 10,000 trials of random bets (1,903 for Kelly, 2,063 for unit), we determined the probability (\(p_{bs}\)) that random wagering could match or exceed our observed ROI. Both strategies achieved significant results: Kelly staking yielded \(p_{bs} = 0.005\) for 3.26% ROI, and unit staking yielded \(p_{bs} = 0.022\) for 1.14% ROI, both significant at \(\alpha = 0.025\) after Bonferroni correction for two tests.

Figure 8: Kelly Staking ROI by intransitivity threshold. Values at the top of the chart indicate the number of placed bets.

Figure 8 illustrates the relationship between the intransitivity threshold filter and Kelly staking ROI for our MagNet model. The profitability of the model generally increases, with a large increase when bets with \(I^*(A_{uv})=0\) are excluded, highlighting the importance of player interaction history in the model’s predictive power. At extremely high intransitivity, the profitability turns negative. These matchups feature extensively observed players with substantial head-to-head records, where bookmakers typically demonstrate greater efficiency and are more difficult to exploit.

In summary, whilst our model exhibits poor performance when applied indiscriminately across all matches, it achieves consistent profitability by targeting matchups belonging to highly intransitive local neighbourhoods that are mispriced by the bookmaker. Although bookmaker odds maintain superior accuracy in high-intransitivity scenarios, the positive returns achieved through both Kelly and unit staking strategies suggest systematic market inefficiencies and confirm the known divergence between statistical forecasting and effective betting profitability, as discussed by [44].

7 Conclusion↩︎

In this paper, we addressed the phenomenon of intransitive player dominance between professional tennis players by constructing temporal directed graphs that preserve player relationships and applying a spectral graph convolutional network which explicitly learns from intransitive relationships.

Our optimised model predicts tennis match outcomes with a strong out-of-sample performance (65.7% accuracy, 0.215 Brier score) and is competitive with existing prediction methods such as Weighted Elo (66.5%, 0.212). Moreover, our betting simulation reveals significant market inefficiencies: while the model is unprofitable when applied indiscriminately (-5.49% ROI), it achieves statistically significant returns when targeting matchups with high intransitivity (3.26% Kelly ROI over 1,903 bets). Pinnacle Sports’ odds systematically underperform in these scenarios, with MagNet’s relative disadvantage narrowing by 68.3% from low to high intransitivity (Table 4). This evidence suggests that preserving and learning from intransitive player relationships provides advantages to tasks like forecasting tennis match outcomes.

The GNN application extends the growing research on graph-based methods in sports forecasting by showing how player interactions and match histories can be effectively encoded in graph structures, while the profitable betting strategy supports the perspective of [44] that effective betting strategies should identify specific inefficiencies.

Our model shows robustness to intransitive scenarios, but its overall predictive performance still trails bookmaker odds-implied probabilities significantly (0.215 vs. 0.196 Brier score) and lacks the benefit of a more explainable approach like WElo by [8]. The reliability of profitability analyses also remains contingent on data quality of historical betting odds, a known challenge previously explored by =0 [45]. .

Future research could explore several promising directions. Constructing a more accurate measure of player-matchup intransitivity can enable deeper investigations into the bookmaker’s poor robustness to intransitivity. Exploring whether similar market inefficiencies exist in other individual sports with strong stylistic components (e.g., mixed martial arts) could also validate the broader applicability of our approach. There is also scope for a more detailed graph representation, given that our approach loses information when summarising rich matchup histories into simple edge weights. A method that learns from multiple edge attributes would be useful, while the incorporation of more informative node features may also be beneficial. For profitability, research could investigate incorporating bookmaker error into graphs to better identify profitable trades. Overall, this study has established graph neural networks as a strong tool for predicting tennis matchups in the presence of highly intransitive relationships, with demonstrated advantages in exploiting market inefficiencies where bookmakers systematically misprice.

Acknowledgements↩︎

=1 The authors have no conflict of interest to declare.

8 Mathematical Details of MagNet↩︎

This appendix provides the complete mathematical formulation of the MagNet architecture used in Section 4.2. For clarity, we omit the court surface and snapshot notation, but the following applies independently to each temporal surface-graph. We also follow standard notation where bold uppercase letters (e.g., \(\boldsymbol{A}\)) denote matrices and bold lowercase letters (e.g., \(\boldsymbol{x}\)) denote vectors.

Given a static directed graph \(G = (V, E, \mathbf{X^V}, W)\), we denote its adjacency matrix as \(\boldsymbol{A} \in \mathbb{R}^{|V| \times |V|}\), where \(A(u,v) = w_{uv}\) if an edge \((u,v) \in E\) exists indicating player \(u\) has historically dominated player \(v\), and \(0\) otherwise. First, we compute the symmetrised adjacency matrix \(\boldsymbol{A}_{s}\) and the diagonal degree matrix \(\boldsymbol{D}_{s}\), defined as

\[A_{s}(u,v) = \frac{1}{2}(A(u,v) + A(v,u)), \quad D_{s}(u,u) = \sum_v A_{s}(u,v).\]

To preserve the directional information representing head-to-head dominance between players, the phase encoding matrix \(\boldsymbol{\Theta}^{(q)}\) is calculated as:

\[\Theta^{(q)}(u,v) = 2\pi q(A(u,v) - A(v,u))\] where \(q \in [0, 0.25]\) is a learnable parameter that controls the extent to which directional information is incorporated. When \(q=0\), the model disregards which player has historically dominated the head-to-head matchups, whereas at \(q=0.25\), this dominance information is maximally integrated.

For improved numerical stability during training, we use the normalised formulation of the magnetic Laplacian, which is defined as: \[\boldsymbol{L}_N^{(q)} = \boldsymbol{I} - \boldsymbol{D}_{s}^{-1/2}\boldsymbol{A}_{s}\boldsymbol{D}_{s}^{-1/2} \odot \exp(i\boldsymbol{\Theta}^{(q)})\] The normalised magnetic Laplacian is a complex Hermitian matrix, which guarantees real eigenvalues, thus resolving the challenges posed by asymmetric adjacency matrices. This complex-valued representation captures both the existence of head-to-head history between players in the magnitude of its entries and the directional dominance pattern in their phase. For player pairs with no prior matches, the corresponding entries remain zero.

To address the computationally expensive eigendecomposition of the Laplacian, MagNet implements the spectral filtering approach based on Chebyshev polynomial approximation, as introduced by [46]. As the eigenvalues of the normalised magnetic Laplacian lie in the range \([0, 2]\), the Laplacian is first rescaled to have eigenvalues in the range \([-1, 1]\) for the Chebyshev approximation:

\[\tilde{\boldsymbol{\mathcal{L}}} = \frac{2}{\lambda_{\max}}\boldsymbol{L}_N^{(q)} - \boldsymbol{I}\] where \(\lambda_{\max}\) is the largest eigenvalue of \(\boldsymbol{L}_N^{(q)}\). The parameter \(K\) in the Chebyshev approximation corresponds to the order of the polynomial, defining the filter’s receptive field as the \(K\)-hop neighbourhood of each node. The resulting filters aggregate information from nodes within \(K\) hops, considering both incoming and outgoing edges. The phase of the Laplacian’s entries, which encodes edge direction, allows the filter to process information from these two neighbourhoods distinctly. The filters act on node features \(\boldsymbol{x}\) as follows:

\[\label{eqn:filters} \boldsymbol{Y}\boldsymbol{x} = \sum_{k=0}^{K} \theta_k T_k(\tilde{\boldsymbol{\mathcal{L}}})\boldsymbol{x}\tag{5}\] where \(\theta_k\) are the learnable filter parameters and \(T_k\) are the Chebyshev polynomials of the first kind.

These filters operate across multiple layers of the network. When \(L\) layers are stacked, the effective receptive field expands to \(L \times K\) hops, allowing the model to aggregate information from increasingly distant players in the network. The multi-layer operation follows: \[\boldsymbol{x}_{j}^{(\ell)} = \sigma\left(\sum_{i=1}^{F_{\ell-1}} \boldsymbol{Y}_{ij}^{(\ell)}\boldsymbol{x}_{i}^{(\ell-1)}\right)\] where \(\boldsymbol{x}_{j}^{(\ell)}\) is the j-th feature channel at layer \(\ell\), \(\sigma\) is a complex ReLU activation function, \(F_{\ell-1}\) is the number of input channels to layer \(\ell\), and \(\boldsymbol{Y}_{ij}^{(\ell)}\) represents the convolution matrices implementing the spectral filters between the \(i\)-th input channel and the \(j\)-th output channel. After the final convolutional layer, the complex-valued node embeddings for all players are collected in the matrix \(\boldsymbol{X}^{(L)} \in \mathbb{C}^{|V| \times F_L}\). To obtain real-valued node representations \(\boldsymbol{h}_u, \boldsymbol{h}_v \in \mathbb{R}^{2F_L}\), the matrix is unwound into real-valued features by separating real and imaginary parts into a matrix \(\mathbb{R}^{|V| \times 2F_L}\). Edge features are extracted by concatenating the unwound node embeddings for each player pair \((u,v)\), resulting in edge representation \(\boldsymbol{e}_{uv} = [\boldsymbol{h}_u; \boldsymbol{h}_v] \in \mathbb{R}^{4F_L}\).

The final prediction layer applies a linear transformation followed by softmax activation to produce an order-dependent probability vector for a given player pair, representing the estimated probability associated with the directed edge from \(u\) to \(v\) (indicating player \(u\)’s dominance over \(v\)):

\[\hat{\mathbf{z}}_{uv} = \text{softmax}(\mathbf{W}[\mathbf{h}_u; \mathbf{h}_v] + \mathbf{b})\] where \(\mathbf{h}_u\) and \(\mathbf{h}_v\) are the node embeddings from the final layer, \([\mathbf{h}_u; \mathbf{h}_v]\) represents their concatenation, and \(\mathbf{W} \in \mathbb{R}^{2 \times 4F_L}\) and \(\mathbf{b} \in \mathbb{R}^2\) are learnable weight and bias terms.

To mitigate the impact of player ordering \((u, v)\) on probability estimates and to ensure a balanced prediction, we average the model outputs from both directed perspectives. The final set-win probability for player \(u\), denoted \(\hat{p}_{uv}\), is defined as: \[\hat{p}_{uv} = \frac{1}{2} \left( (\hat{\mathbf{z}}_{uv})_1 + (\hat{\mathbf{z}}_{vu})_2 \right) \label{eq:avg95prob}\tag{6}\] where \((\cdot)_1\) refers to the first component of the probability vector (i.e., the probability of the first player in the input pair winning) and \((\cdot)_2\) refers to the second. This averaging ensures that the model’s predictions are consistent regardless of player order and that the resulting probabilities sum to one, i.e., \(\hat{p}_{uv} + \hat{p}_{vu} = 1\).

9 Extended Initial Model↩︎

In earlier work =0 [41] , we proposed an unoptimised temporal graph model with a less sophisticated graph structure for the same task of tennis forecasting. To evaluate the robustness of this unoptimised approach, we increased the out-of-sample dataset to cover the period 31 October 2023 to 8 June 2025. By incorporating 843 additional matches, the total out-of-sample test set increased in size to 1,918 professional matches across both the original period (31 October 2023 to 8 September 2024) and the extended period (9 September 2024 to 8 June 2025). The dataset differs from the one we use here in that it only considers men’s matches and excludes any 500-point matches other than the Halle Open and the Queen’s Club Championships.

The initial implementation of temporal graph neural networks for tennis forecasting also differs in several key aspects from the model presented in our current work. Specifically, the graph edge weights are constructed with a time decay parameter of \(\lambda = 0.01\), and there is no cross-surface graph updating. This extension provides further justification for the proper hyperparameter tuning and improved graph structure in this work.

Table 5: Model Performance Comparison By Surface: Extended Dataset.
	Count		Accuracy		Brier
2-3 (lr)4-5 (lr)6-7 Dataset	MS	E	MS	E	MS	E
Clay	314	681	0.675	0.639	0.216	0.226
Grass	169	169	0.710	0.710	0.209	0.209
Hard	591	1068	0.638	0.626	0.221	0.227
All	1074	1918	0.660	0.638	0.218	0.225

In Table 5, the results from the original out-of-sample test set are shown are shown along with results when including the 843 additional matches. The unoptimised model achieves an overall classification accuracy of 63.8% and a Brier Score of 0.225, representing a decline from the original model’s performance of 66.0% accuracy and 0.218 Brier Score. This deterioration is consistent across most performance metrics and surfaces, with the exception of grass courts, since there were no additional matches on grass. The difference is most pronounced on clay courts, where accuracy drops from 67.5% to 63.9%, with hard court performance declining from 63.8% to 62.6%.

We attribute the performance deterioration to several methodological differences. First, the dramatically different time decay parameter (\(\lambda = 0.01\) vs \(\lambda = 0.38\)) significantly alters how historical match information influences current predictions, causing the graph to retain excessive influence from outdated match results that may no longer reflect current player form.

Second, the exclusion of 500-point tournaments reduces the graph density and available training information, particularly affecting predictions of players who compete primarily at these lower-tier events. Finally, the restriction of cross-surface information flow prevents the model from consistently learning across the year. This is most pronounced with grass tournaments, which necessitated the selective inclusion in the original work of the Halle Open and the Queen’s Club Championships as a partial remediation.

The performance of the initial model with the extended dataset highlights the importance of of our improved approach, with systematic hyperparameter optimisation and complete graph construction.

Table 6: Betting Performance Comparison: Extended Dataset.
	Count		Staked		Return		Profit		ROI (%)		Sharpe
2-3 (lr)4-5 (lr)6-7 (lr)8-9 (lr)10-11 (lr)12-13 Dataset	MS	E	MS	E	MS	E	MS	E	MS	E	MS	E
Clay	314	681	59.07	126.30	59.21	126.09	0.14	-0.21	0.24	-0.16	0.08	-0.05
Grass	169	169	44.15	44.15	49.27	49.27	5.12	5.12	11.60	11.60	3.54	3.54
Hard	591	1068	82.29	155.61	91.23	159.08	8.94	3.47	10.87	2.23	1.94	0.46
All	1074	1918	185.50	326.06	199.71	334.45	14.20	8.38	7.66	2.57	1.81	0.62

Notes. MS refers to results presented in =0 [41] , E refers to results from using the same methodology but with an extended dataset. While the original results used a dataset spanning 31 October 2023 to 8 September 2024, the extended dataset covers 31 October 2023 to 8 June 2025. Kelly strategy limits bets to model-identified favourites with positive expected value. ROI and Sharpe ratio calculated with same method as original paper. All monetary values are in units.

Table 6 summarises the extended betting performance using the modified favourite-only Kelly criterion strategy. Model performance declined from an ROI of 7.66% and a Sharpe ratio of 1.81 to an ROI of 2.57% and a Sharpe ratio of 0.62.

References↩︎

[1]

Nikolakaki, S.M., Dibie, O., Beirami, A., Peterson, N., Aghdaie, N., Zaman, K., 2020. Competitive balance in team sports games, in: IEEE Conference on Games (CoG), IEEE. pp. 526–533. .

[2]

Garnica-Caparrós, M., Memmert, D., Wunderlich, F., 2022. Artificial data in sports forecasting: a simulation framework for analysing predictive models in sports. Information Systems and e-Business Management20, 551–580. .

[3]

Wunderlich, F., Memmert, D., 2021. Forecasting the outcomes of sports events: A review. European Journal of Sport Science21, 944–957. .

[4]

Clarke, S., Dyte, D., 2000. Using official ratings to simulate major tennis tournaments. International Transactions in Operational Research7, 585–594. .

[5]

Klaassen, F.J., Magnus, J.R., 2003. Forecasting the winner of a tennis match. European Journal of Operational Research148, 257–267. .

[6]

McHale, I., Morton, A., 2011. A Bradley-Terry type model for forecasting tennis match results. International Journal of Forecasting27, 619–630. .

[7]

Kovalchik, S., 2020. Extension of the Elo rating system to margin of victory. International Journal of Forecasting36, 1329–1341. .

[8]

Angelini, G., Candila, V., De Angelis, L., 2022. Weighted Elo rating for tennis match predictions. European Journal of Operational Research297, 120–132. .

[9]

Kovalchik, S.A., 2016. Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports12, 127–138. .

[10]

van Ours, J. C., 2025. Non-transitive patterns in sports match outcomes: A profitable anomaly. Tinbergen Institute Discussion Paper.

[11]

Bozóki, S., Csató, L., and Temesi, J., 2016. An application of incomplete pairwise comparison matrices for ranking top tennis players. European Journal of Operational Research, 248(1), 211–218. Elsevier.

[12]

Temesi, J., 2019. An interactive approach to determine the elements of a pairwise comparison matrix. Central European Journal of Operations Research, 27(2), 533–549. Springer.

[13]

Hamilton, A. H., Roughan, M., Kalenkova, A., 2024. Elo ratings in the presence of intransitivity. arXiv preprint arXiv:2412.14427.

[14]

Arcagni, A., Candila, V., Grassi, R., 2023. A new model for predicting the winner in tennis based on the eigenvector centrality. Annals of Operations Research325, 615–632. .

[15]

Zhang, X., He, Y., Brugnone, N., Perlmutter, M., Hirn, M., 2021. MagNet: A neural network for directed graphs, in: International Conference on Neural Information Processing Systems, pp. 27003–27015. ://dl.acm.org/doi/10.5555/3540261.3542329.

[16]

Elo, A.E., Sloan, S., 1978. The rating of chessplayers: Past and present. ARCO Publishing, New York, USA.

[17]

Morris, B., Bialik, C., 2015. Serena williams and the difference between all-time great and greatest of all time. FiveThirtyEight. ://fivethirtyeight.com/features/serena-williams-and-the-difference-between-all-time-great-and-greatest-of-all-time/. accessed: 2025-04-03.

[18]

Leitner, C., Zeileis, A., Hornik, K., 2009. Is Federer stronger in a tournament without Nadal? An evaluation of odds and seedings for Wimbledon 2009. Austrian Journal of Statistics38, 277–286. .

[19]

Dingle, N., Knottenbelt, W., Spanias, D., 2013. On the (page) ranking of professional tennis players, in: Tribastone, M., Gilmore, S.(Eds.), Computer Performance Engineering, Springer, Berlin, Heidelberg. p. 237–247. .

[20]

Radicchi, F., 2011. Who is the best player ever? A complex network analysis of the history of professional tennis. PLOS ONE6, e17249. .

[21]

Brin, S., Page, L., 1998. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems30, 107–117.

[22]

Aparício, D., Ribeiro, P., Silva, F., 2016. A subgraph-based ranking system for professional tennis players, in: Cherifi, H., Gonçalves, B., Menezes, R., Sinatra, R.(Eds.), Complex Networks VII. Springer International Publishing, Cham. volume 644 of Studies in Computational Intelligence, p. 159–171. .

[23]

Bayram, F., Garbarino, D., Barla, A., 2021. Predicting tennis match outcomes with network analysis and machine learning, in: Bureš, T., Dondi, R., Gamper, J., Guerrini, G., Jurdziński, T., Pahl, C., Sikora, F., Wong, P.W.(Eds.), SOFSEM 2021: Theory and Practice of Computer Science, Springer International Publishing, Cham. p. 505–518. .

[24]

Xenopoulos, P., Silva, C., 2021. Graph neural networks to predict sports outcomes, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE. pp. 1757–1763. .

[25]

Mirzaei, A., 2022. Sports match outcome prediction with graph representation learning. Master’s thesis. School of Computing Science, Simon Fraser University, CA, USA. ://summit.sfu.ca/_flysystem/fedora/2022-08/input_data/22492/etd21919.pdf.

[26]

He, Y., Gan, Q., Wipf, D., Reinert, G.D., Yan, J., Cucuringu, M., 2022. GNNrank: Learning global rankings from pairwise comparisons via directed graph neural networks, in: International Conference on Machine Learning, PMLR. pp. 8581–8612. ://proceedings.mlr.press/v162/he22b/he22b.pdf.

[27]

Buchdahl, J., 2016. Squares & Sharps, Suckers & Sharks: The Science, Psychology & Philosophy of Gambling. High Stakes, London.

[28]

Hegarty, T., Whelan, K., 2025. Forecasting soccer matches with betting odds: A tale of two markets. International Journal of Forecasting41(2), 803–820. .

[29]

Longa, A., Lachi, V., Santin, G., Bianchini, M., Lepri, B., Lio, P., Scarselli, F., Passerini, A., 2023. Graph neural networks for temporal graphs: State of the art, open challenges, and opportunities. Transactions on Machine Learning Research://openreview.net/forum?id=pHCdMat0gI.

[30]

Fayomi, A., Majeed, R., Algarni, A., Akhtar, S., Jamal, F., Nasir, J.A., 2022. Forecasting tennis match results using the bradley-terry model. International Journal of Photoenergy2022, 1898132. .

[31]

Dixon, M.J., Coles, S.G., 1997. Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics)46, 265–280. .

[32]

He, Y., Zhang, X., Huang, J., Rozemberczki, B., Cucuringu, M., Reinert, G., 2024. PyTorch geometric signed directed: A software package on graph neural networks for signed and directed graphs, in: Learning on Graphs Conference (LoG), PMLR. ://proceedings.mlr.press/v231/he24a/he24a.pdf.

[33]

Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B., 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, pages 2546–2554.

[34]

Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M., 2019. Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631. .

[35]

Bradley, R.A., Terry, M.E., 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika39, 324–345. .

[36]

Maystre, L., Grossglauser, M., 2015. Fast and accurate inference of plackett–luce models. Advances in neural information processing systems28.

[37]

Shin, H.S., 1993. Measuring the incidence of insider trading in a market for state-contingent claims. The Economic Journal103, 1141–1153. .

[38]

Brier, G.W., 1950. Verification of forecasts expressed in terms of probability. Monthly weather review78, 1–3.

[39]

Boshnakov, G., Kharrat, T., McHale, I.G., 2017. A bivariate weibull count model for forecasting association football scores. International Journal of Forecasting33, 458–466. .

[40]

Tukey, J. W., 1961. Curves as parameters, and touch estimation. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 4, pages 681–695. University of California Press.

[41]

Clegg, L.W., Cartlidge, J.P., 2025b. Tennis match outcome prediction using temporal directed graph neural networks. In: MathSport International 2025 –CONFERENCE PROCEEDINGS –, 50–56. ://math.uni.lu/midas/events/mathsports2025/.

[42]

O’Malley, A. J., 2008. Probability formulas and statistical analysis in tennis. Journal of Quantitative Analysis in Sports4, 2.

[43]

Kelly, J.L., 1956. A new interpretation of information rate. The Bell System Technical Journal35, 917–926. .

[44]

Wunderlich, F., Memmert, D., 2020. Are betting returns a useful measure of accuracy in (sports) forecasting?International Journal of Forecasting36, 713–722. .

[45]

Clegg, L., Cartlidge, J., 2025. Not feeling the buzz: Correction study of mispricing and inefficiency in online sportsbooks. International Journal of Forecasting41(2), 798–802. .

[46]

Defferrard, M., Bresson, X., Vandergheynst, P., 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems29.

A search was conducted on 22/10/2025 using Google Scholar to identify papers applying graph neural networks to tennis prediction. The search combined six graph neural network related terms with eleven tennis prediction related terms, resulting in 66 distinct search queries. All searches used exact phrase matching with quotation marks (e.g., “graph convolutional network” “tennis forecasting”) to ensure precision. In total, these search queries returned 11 papers, none of which were relevant to graph neural networks for pre-match tennis outcome forecasting.↩︎
MagNet is accessible via the PyTorch Geometric Signed Directed library [32].↩︎
We use the Python library Optuna [34] to implement the TPE algorithm.↩︎
The top seven players in the year-end ATP Race qualify automatically. The eighth spot is reserved for a current-year Grand Slam champion who finishes between 9th and 20th in the race. If more than one player meets this condition, the higher-ranked of the two qualifies. If no player satisfies this criterion, the spot goes to the player ranked 8th in the race.↩︎

Intransitive Player Dominance and Market Inefficiency in Tennis Forecasting: A Graph Neural Network Approach