While plenty of punters rely on their own knowledge, basically any serious punter – racing or sports – uses data to help make betting decisions.
Whether you’re looking at a few simple stats or building a more complex data model, one of the biggest challenges is getting the data you need in a format that can easily show you what you need to know.
Rod – the man behind our hugely profitable High Low membership – takes us through some of the challenges he’s faced, and some tips for collecting and analysing data.
In an age where stats are everywhere you look, and computers can crunch numbers in seconds, data is a key part of modern-day punting.
Before you can analyse your data, you need to get it in the first place. The problem with data is that it’s often in a form we can’t analyse. We might be able to see the stats on racing.com or the AFL website but getting the data into Excel or another program where we can perform some serious analysis can be tricky.
Furthermore, the data we can see (or already have) may not contain all the information we want and we might want to add new stats from a different source to complete our data set.
In a perfect world, we have all the data points we want, and we have it in the one place in a form we can analyse. Data collection is the first port of call.
So, how do we go about it?
The first question to ask is, “Does someone else already have the data I’m interested in?” There are several businesses around that sell horse racing databases, as well as companies that collect data on sports. For a reasonable price, these businesses can provide you the data you need, keep it up-to-date and save you the hard work.
When the data isn’t available for sale, or you want it for free, you’ll need to get it yourself.
Back in the day, I’d spend hours manually entering stats into Excel, or copying and pasting data from websites.
These days, technology can do the hard work for you and “scrape” thousands of data points from the web in the time it takes you to make a coffee. Data that would take months to enter manually can be collected in hours, or even minutes.
The added benefit is that well-written code significantly reduces the errors that a weary, error-prone keyboard warrior would otherwise introduce.
Personally, I use the computer language Visual Basic for Applications (VBA) and write “macros” in Excel that scrape data from the web and enter it into a spreadsheet. That works for me, but there are several alternative (and probably better) languages out there.
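To give a flavour of what scraping looks like in one of those alternative languages, here’s a minimal sketch in Python using only its standard library. The HTML below is a made-up stand-in for a results page – a real scraper would download the page first (e.g. with urllib) – but the idea of turning a web table into structured rows is the same:

```python
# A minimal scraping sketch. SAMPLE_HTML is an invented stand-in for a
# results page; a real scraper would fetch the page over the network first.
from html.parser import HTMLParser

SAMPLE_HTML = """
<table>
  <tr><th>Horse</th><th>Barrier</th><th>Price</th></tr>
  <tr><td>Fast Lad</td><td>4</td><td>3.50</td></tr>
  <tr><td>Slow Coach</td><td>9</td><td>21.00</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects every table row as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)

# Turn the rows into dictionaries keyed by the header row -- the code
# equivalent of a tidy spreadsheet with named columns.
header, *records = scraper.rows
runners = [dict(zip(header, rec)) for rec in records]
print(runners)
```

Each dictionary here plays the role of a spreadsheet row, ready to be written out to Excel or a CSV file for analysis.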
It takes a while to learn a computer language but it’s well worth the effort. There are a ton of sites and forums out there that will help you through the process. The best place to start with Excel macros is to use the “macro recorder”. That’s a tool that lets you record what you want to do in Excel and converts it into VBA code. That helps you learn what code controls what functions.
Take your time hunting around for the best data source. That might be historical data that goes back more years than other sources. It could be data that contains more statistics. Or it could be data from a more reliable, official source. If you’re going to the effort of collecting the data yourself, you want it to be the best data you can find.
I’ve made the mistake in the past of being so keen to start the analysis that I’ve taken shortcuts in my data collection, or thought to myself, “that’ll do”, when I’ve come across a stumbling block. That’s a mistake because poor data leads to inaccurate conclusions in your analysis, and that costs you money.
So be a perfectionist. Be methodical and have an attitude of quality. It’s important to collect the best data that you can.
Sometimes you might only be interested in one or a few statistics from a data source, and it’s tempting to collect just the data you need. More than once I’ve done exactly that, only to find later that I wanted some other data from the same site, so I’ve had to go back and re-collect it. You may not know why you’d need the extra data now, and it takes a little more time to write the code, but my advice is to collect all the data from a site when you scrape it. Your future self will thank you for it.
You’ll often find mistakes or anomalies in the source data you’ve collected. It can be tempting to correct the mistake there and then. A better approach is to create a second column (or file) with the corrected or adjusted data and/or keep a “change log” with the changes you’ve made to the original data.
Sometimes you’ll make a decision about how to deal with an error or anomaly that you later wish you hadn’t. If you haven’t kept the original data, or recorded how you changed it, you’ve “lost” the original, or become confused about what you did and why, which leaves you with a messier, lower-quality data set. It’s good to have the data as originally collected to fall back on when you need it.
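The “second column plus change log” idea can be sketched in a few lines of Python. The rows and the correction below are invented, but the pattern – corrections live alongside the raw values, never over the top of them – is the point:

```python
# Sketch of corrections-beside-the-raw-data with a change log.
# The match results and the correction itself are invented examples.
rows = [
    {"match": "GEE v COL", "margin_raw": 12},
    {"match": "ESS v CAR", "margin_raw": 1028},  # obvious data-entry error
]

change_log = []

def correct(row, raw_field, new_value, reason):
    """Write the fix into a separate *_clean field and record what changed."""
    row[raw_field.replace("_raw", "_clean")] = new_value
    change_log.append({
        "match": row["match"],
        "field": raw_field,
        "old": row[raw_field],
        "new": new_value,
        "reason": reason,
    })

# By default the clean value simply mirrors the raw value.
for row in rows:
    row["margin_clean"] = row["margin_raw"]

correct(rows[1], "margin_raw", 28, "checked official result; extra digit typed")

print(rows[1]["margin_raw"], rows[1]["margin_clean"])  # prints "1028 28"
```

If you later decide a correction was wrong, the raw value and the logged reason are both still there to fall back on.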
Just because the data comes from a reputable source (e.g. official body), that doesn’t mean it contains no errors. It could be the best data available but the fact is a human was involved at some point (you included!) and given there are probably thousands (or even millions) of data points, there will always be errors somewhere in the data. So how do you find them?
It’s impractical (and impossible) to check every data point individually but there are ways to use the power of Excel to help discover and correct errors.
Filters are a very useful way to identify errors in your data. They’re the dropdown boxes you can add to the headers at the top of your columns. The handy thing about filters is that when you click on the dropdown box, it shows you each unique value found in that column, which makes errors and anomalies easy to spot.
For example, if you had a column with AFL winning margins, the data in that column would include most numbers from 0 to 190 (the biggest winning margin in VFL/AFL history, set by Fitzroy in 1979). If one of the entries in the dropdown list was 1028 or “Geelong”, then you know you have an error in your data that needs to be investigated.
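If you’re working outside Excel, the same dropdown-style check takes only a few lines of code. Here’s a Python sketch with an invented margins column: list the unique values, then flag anything that isn’t a plausible margin.

```python
# Sketch of the filter-dropdown check in code. Margins are invented, with
# two deliberate errors planted (1028 and "Geelong").
margins = [32, 5, 77, 190, 1028, "Geelong", 18, 5]

# The unique values -- the code equivalent of clicking the filter dropdown.
unique_values = sorted(set(margins), key=str)
print(unique_values)

# Flag anything that isn't a whole number in the plausible 0-190 range.
suspect = [v for v in set(margins)
           if not isinstance(v, int) or not 0 <= v <= 190]
print(suspect)  # the entries worth investigating
```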
Another useful check when it comes to price data – a critical data point in any punting data set – is the market percentage. That’s a little trickier, given it takes Excel formulas to calculate, but it’s worthwhile because it can identify important errors.
For example, an AFL match might be expected to have a market percentage around 105%, so if one match had a market percentage of 85% (which can again be spotted with filters), then you know you have an error.
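The market percentage is just the sum of the implied probabilities – 100 divided by each decimal price, added up across all outcomes. A Python sketch, with invented head-to-head prices and a simple sanity band:

```python
# Sketch of the market-percentage check. Prices are invented examples.
def market_percentage(prices):
    """Total implied probability of a market, as a percentage."""
    return sum(100.0 / p for p in prices)

good_match = [1.45, 2.75]  # head-to-head prices summing to roughly 105%
bad_match = [2.00, 2.85]   # only ~85% -- almost certainly a data error

for prices in (good_match, bad_match):
    pct = market_percentage(prices)
    flag = "OK" if 100 <= pct <= 120 else "CHECK"
    print(f"{pct:.1f}% {flag}")  # prints "105.3% OK" then "85.1% CHECK"
```

The 100–120% band here is an illustrative assumption; the sensible range depends on the sport and the bookmaker, but anything under 100% is a red flag, since that would imply an arbitrage opportunity.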
Each data set is unique but if you can think of ways to confirm your data is correct it is well worth the effort because your data will be cleaner and your analysis will be better for it.
When you’re writing code to collect data, it’s good to think about how you will update the data once new results happen. Sometimes it involves a simple change in your code to collect the most recent results but other times it won’t be as simple as that. It’s best to think about that before you write your code, rather than afterwards.
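One simple pattern is to find the most recent round (or date) already stored, then collect and append only what’s newer. A Python sketch – the stored rows are invented, and `fetch_results` is a stand-in for re-running your scraper:

```python
# Sketch of an incremental update. The data is invented and fetch_results
# is a hypothetical stand-in for re-running the scraper against the site.
stored = [
    {"round": 1, "match": "GEE v COL", "margin": 12},
    {"round": 2, "match": "ESS v CAR", "margin": 28},
]

def fetch_results(after_round):
    """Stand-in scraper: return only rounds newer than what we hold."""
    all_results = [
        {"round": 2, "match": "ESS v CAR", "margin": 28},
        {"round": 3, "match": "RIC v HAW", "margin": 41},
    ]
    return [r for r in all_results if r["round"] > after_round]

latest = max(r["round"] for r in stored)   # newest round already collected
stored.extend(fetch_results(latest))       # append only the new results

print([r["round"] for r in stored])  # -> [1, 2, 3]
```

Filtering on the newest stored round also makes the update safe to re-run: results you already hold are skipped rather than duplicated.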
Learning a computer language like VBA takes time, but it’s well worth the effort if a database for sale doesn’t contain the information you want, or you want to collect the data for free. Like the foundation of a building, good data collection is the foundation of good analysis, which will lead you to accurate conclusions rather than down a rabbit hole that costs you money.
So consider the data you use. Is it the newspaper form guide? Best Bets? The AFL or NRL website?
Is it giving you the edge you truly need to beat the bookies?