While some punters rely purely on their own knowledge, virtually every serious punter – racing or sport – uses data analysis to help make betting decisions.
Following Part 1 of this series, where he gave us some rules for data collection, Rod turns his attention to the analysis stage… what are the guidelines you need to know before you start?
You’ve collected your data and tidied it up… now you’re ready to begin your analysis.
Data analysis is about finding new information that will make you a profit, but it’s also about avoiding the traps that lead to false conclusions and cost you money. This article is my advice on how to avoid those traps and find genuinely profitable information in your data.
The most important metric in data analysis is profit-on-turnover (PoT). PoT – or expected value (EV) – tells you how profitable your bets are or, when it comes to data analysis, how profitable a certain set of historical conditions would have been had you bet on them yourself.
When you want to check how horses in fields of fewer than 10 runners perform when they were favourites at their last start, it’s PoT that will tell you whether or not you’ll make money.
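As a quick illustration, PoT is simple to compute from a set of historical results. Here’s a minimal sketch using made-up level-stakes bets (the odds and outcomes are hypothetical, not real data):

```python
# Hypothetical level-stakes results for a historical condition, e.g.
# last-start favourites in fields of fewer than 10 runners.
# Each bet is (decimal odds, won?).
bets = [(3.5, True), (6.0, False), (2.8, True), (9.0, False), (4.2, False)]

stake = 1.0  # level stakes
turnover = stake * len(bets)
returns = sum(odds * stake for odds, won in bets if won)

pot = (returns - turnover) / turnover
print(f"PoT: {pot:+.1%}")  # +26.0% on this tiny (made-up) sample
```

The same calculation applies whether the “bets” are real wagers or a historical condition you’re back-testing.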
Data is valuable because we know that historical results can help predict future events. The problem is that while history is a good predictor, randomness is always getting in the way.
Most of the advice in this article is practical advice on how to avoid randomness. Avoiding randomness is critical in data analysis because results caused by randomness lead to inaccurate conclusions and false information, which cost us money when we bet based upon them.
Two of the most common causes of randomness are small sample sizes and longshot winners. So once you’ve crunched the numbers and found a positive result, double-check that your sample size is at least in triple figures and that there are no 100-1 winners (i.e. outliers) pumping up the results.
We want a large sample size to increase the confidence in our conclusions, but we also need to make sure our data is relevant and we don’t go so far back in time that any conclusions we make are irrelevant in today’s market.
The “game” in all sports changes as new rules and tactics develop but, more importantly, the knowledge of the public on what makes a good bet improves over time, which changes what is profitable.
For example, you might find a nice edge on front runners in racing data from the 1990s. However, the advantage of front runners is common knowledge nowadays and baked into prices, so that edge that was present in the 90s has now either significantly reduced or is non-existent.
Personally, I tend to go no further back than five years in historical data. That provides enough season-on-season data for sports to get a decent sample size but is not so far back that any conclusions I make are irrelevant. Of course, five years ago could be irrelevant today. That’s a call you need to make based on your data, but as a rule, the more recent the data, the better.
When I analyse a set of statistics, I like to look for trends in the data. I do that because trends effectively use the data as a whole (i.e. create a bigger sample), which adds validity to my conclusions. For example, I might have price data that shows the following:
$1 – $4
$4.01 – $10
$10.01 – $20
$20.01 – $1000
In this example, I know that the $1 – $4 price range is where I lose the least, but I might only have a sample size of 100 in that range. However, the trend of progressively worse losses as the price increases backs up my conclusion that the shorter the price is, the less I lose. The sample size of all the data might be 400, which gives me more confidence in my conclusion.
Sometimes there’s a peak in the data and values either side of that peak are progressively worse on both sides. That trend is fine as well. In either case, trends give you more confidence in your conclusions.
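One way to check for this kind of trend is to bin bets by price band and compute PoT per band. This is a sketch on made-up figures (the bets below are invented to show a worsening-with-price trend, as in the example above):

```python
def pot(bets):
    """Level-stakes profit-on-turnover for (odds, won) pairs."""
    if not bets:
        return None
    turnover = float(len(bets))
    returns = sum(odds for odds, won in bets if won)
    return (returns - turnover) / turnover

# Hypothetical results, spanning the price bands from the example above.
bets = [
    (2.0, True), (1.8, True), (3.0, False), (2.5, False),                # $1 - $4
    (5.0, True), (8.0, False), (6.0, False), (9.0, False),
    (7.0, False), (4.5, False),                                          # $4.01 - $10
    (12.0, False), (15.0, False), (18.0, False), (11.0, False),          # $10.01 - $20
    (25.0, False), (40.0, False), (150.0, False),                        # $20.01 - $1000
]

bands = [(1.00, 4.00), (4.01, 10.00), (10.01, 20.00), (20.01, 1000.00)]
for lo, hi in bands:
    subset = [b for b in bets if lo <= b[0] <= hi]
    print(f"${lo:g} - ${hi:g}: n={len(subset)}, PoT={pot(subset):+.1%}")
```

On this toy sample the losses worsen as the price rises (-5.0%, -16.7%, then -100.0% in the outer bands), which is the kind of monotonic trend that lends weight to a conclusion drawn from any single band.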
As well as trends in the data itself, another important trend to look at is year-on-year (or season-on-season) data. Punting is an ever-changing game where edges are found and lost as the public gets smarter and the market becomes more efficient. You want to look at the trend in your theory over different years or seasons and see if your edge is steady, getting worse or perhaps not profitable at all anymore.
In sports, year-on-year data will create small sample sizes because there are not that many matches per season. That can make year-on-year data jump all over the place. That’s the nature of the beast with sports, but it’s still worth looking for trends year-on-year in sports, keeping in mind they will be harder to find.
Hold-out samples are data that you deliberately don’t analyse in your initial analysis. Once you get your data set it’s very tempting to analyse it as a whole, but it’s a good idea to create a hold-out sample.
Hold-out samples help confirm that the conclusions you make in your initial analysis are valid. The idea is that if your initial conclusions are correct, then the data in your hold-out sample will produce the same results.
The best hold-out sample to use is the current or last season’s data because it’s the most relevant data. If it shows a profit, then chances are better that future bets will show a profit as well.
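Creating a season-based hold-out is a one-liner once your records carry a season field. A minimal sketch on hypothetical records:

```python
# Hypothetical records, each tagged with a season.
data = [
    {"season": 2021, "odds": 3.0, "won": True},
    {"season": 2022, "odds": 5.0, "won": False},
    {"season": 2023, "odds": 2.5, "won": True},
    {"season": 2024, "odds": 8.0, "won": False},
    {"season": 2024, "odds": 4.0, "won": True},
]

# Hold out the most recent season; analyse only the rest.
holdout_season = max(r["season"] for r in data)
analysis = [r for r in data if r["season"] != holdout_season]
holdout = [r for r in data if r["season"] == holdout_season]

print(len(analysis), len(holdout))  # 3 2
```

The key discipline is not the code but the habit: don’t look at the hold-out rows until your conclusions from the analysis set are locked in.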
When you test hundreds of data points at the same time, the spread of results will usually follow a bell-shaped curve. That means that most results will hover around the average (slight loss due to the market percentage) but it also means that, simply due to randomness, there will be big winners and losers as well.
For that reason, I always like to know (or try to understand) the reason behind why the data is showing what it is. It gives me confidence that my conclusions are true and not due to randomness.
For example, if you asked 1000 people to flip a coin 100 times each, approximately 94% of them would flip between 41 and 59 heads (inclusive), but approximately 28 of them would flip 60 heads or more. Based purely on the data, you might call those 28 people excellent head flippers and bet on heads at $1.90 with them any time they flipped a coin.
Even though the data was perfectly good, we know it was pure chance that those people flipped 60 heads, because coin flips are random and we’ll end up losing the market percentage backing them in the long run. That’s why it’s important to know the reasons behind the data.
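These coin-flip probabilities aren’t guesses – they can be verified exactly with the binomial distribution (the middle band comes out near 94%, and about 28 in 1000 reach 60 or more heads):

```python
from math import comb

n = 100
total = 2 ** n  # number of equally likely head/tail sequences

def p_exact(k):
    """Probability of exactly k heads in 100 fair coin flips."""
    return comb(n, k) / total

p_middle = sum(p_exact(k) for k in range(41, 60))        # 41-59 inclusive
p_sixty_plus = sum(p_exact(k) for k in range(60, n + 1))

print(f"P(41-59 heads) = {p_middle:.1%}")                       # 94.3%
print(f"Expected 60+ flippers per 1000 = {1000 * p_sixty_plus:.0f}")  # 28
```

Note too that backing heads at $1.90 with a true 50% chance has an EV of 0.5 × 1.90 − 1 = −5%, which is exactly the market percentage you’d bleed away.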
Making the decision between what is true and what is random is tricky and one of the few subjective decisions you will need to make during your analysis.
It’s important to be able to replicate what you find in your data in future bets. For example, you might be interested in front runners and find that horses who lead at the 600-metre mark have 5% PoT. The problem is that you can’t know before the race which horse will lead at the 600-metre mark, so the data is not that valuable.
In this example (and ones like it), you need to think about how you can analyse your data so that it can be used before the start of the event. For example, you might classify a horse as a leader if it led at the 600-metre mark at its last start and there are no other horses in the race who did that. Analyse how those horses perform and, if you find an edge, you can use those criteria to make your selection before the race.
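A rule like that is straightforward to encode. This is a sketch of the pre-race leader rule described above; the field data and field names (`led_600m_last_start`, etc.) are hypothetical:

```python
def likely_leader(field):
    """Return the lone runner that led at the 600 m last start, else None."""
    leaders = [h for h in field if h["led_600m_last_start"]]
    return leaders[0]["name"] if len(leaders) == 1 else None

# Hypothetical race field.
field = [
    {"name": "Fast Eddie", "led_600m_last_start": True},
    {"name": "Slow Joe", "led_600m_last_start": False},
    {"name": "Mid Pack", "led_600m_last_start": False},
]
print(likely_leader(field))  # Fast Eddie
```

Unlike “led at the 600-metre mark today”, everything this rule needs is known before the jump, so an edge found in the back-test can actually be bet on.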
It’s important never to make assumptions about your data and to have direct evidence for the conclusions you are making. Even minor assumptions can have a large effect on your final conclusions, which can cost you money when those conclusions are wrong.
There are thousands of combinations of various conditions we can test within our data and there will always be some way we can shave losers from a set of results to make it more profitable.
“Backfitting” is when you perform an analysis that suits your particular data set, rather than an unbiased analysis that would produce similar results with any data set.
Backfitting is usually done when you drill deeper and deeper into your data and eliminate minor random (losing) factors that produce a better profit. It creates profitable conditions in your data, but those conditions are random and not a genuine means to making a profit in the future.
For example, you might work out from your data that horses at the second start of a preparation, racing at Caulfield, carrying 55 – 56 kg, with a trainer with a 13% strike rate, that placed at their last start and are between 10-1 and 20-1 today, show a positive PoT. Bingo!
No, that’s backfitting the data.
There may well be a genuine edge on horses running at their second start, having blown out the cobwebs at their first. But there’s no logical reason why trainers with a 13% strike rate would be more profitable than trainers with a 15% strike rate (for example), or why horses in the 10-1 to 20-1 range would be more profitable than others.
Again, it’s randomness that produces positive results when we backfit. We need to make sure that when we drill down into data (which is fine to an extent), the conditions we’re looking at are genuine reasons why selections might win or lose, and that we’re not backfitting.
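Backfitting is easy to demonstrate with simulated zero-edge bets: drill into enough arbitrary subsets and one of them will look profitable purely by chance. A sketch (all data below is simulated, with a true edge of exactly zero):

```python
import random

random.seed(42)

# Simulate 1000 fair bets at $2.00 with a true 50% win chance: zero edge.
tracks = "ABCDE"
bets = [{"track": random.choice(tracks), "won": random.random() < 0.5}
        for _ in range(1000)]

def pot(subset):
    """Level-stakes PoT for $2.00 bets."""
    turnover = len(subset)
    returns = sum(2.0 for b in subset if b["won"])
    return (returns - turnover) / turnover

overall = pot(bets)
best_track = max(tracks, key=lambda t: pot([b for b in bets if b["track"] == t]))
best = pot([b for b in bets if b["track"] == best_track])

# The "best" track beats the overall figure purely by chance: slicing the
# sample and keeping the winning slice manufactures a fake edge.
print(f"overall PoT {overall:+.1%}, track {best_track} PoT {best:+.1%}")
```

The track split here stands in for any drill-down (trainer strike rate, weight band, price range); by construction none of them can be a genuine edge, yet the best-looking slice will usually show a profit.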
Making a profit from data depends on good data, but it also depends on drawing solid conclusions in your analysis and not succumbing to the traps of randomness. Next time you analyse your data, keep these ideas in mind and check that you haven’t fallen for those traps; you’ll be more likely to find genuine profitable edges you can turn into money in the market. That’s what we’re all here for!
Check out Part 1 – Punting’s Data Age: Don’t Get Left Behind.