Wednesday 15 February 2023

The King; his Heir Apparent...and The Pharaoh waiting in the wings - Shearer, Kane and Salah, games and goals per season.

This started as either a drunken conversation, a disagreement or a follow up to a Match of the Day stat. How vague my memory of parts of it is suggests one option above the others, but it has also been some time since the conversation happened which might also explain it. 

Some time ago, Harry Kane missed a couple of matches due to an ankle injury, again. The "again" was the problem. It had become clear that if Kane had a weakness, it was his ankle not his game. Cue L saying that the thing that would stop Kane reaching Alan Shearer's goal-scoring records would be injuries, because once you start having to miss games due to recurring injuries to the same body part, the number of games missed because of it is only going to increase. 

L wanted to know whether the games per season Kane played up to this point matched Shearer's or not. 

I raised an objection, which is that Kane, playing for a decent Spurs team, probably has more chance of playing more games than Shearer had while at Newcastle, because while every team plays the same number of league games, there's cups and European games to consider as well. (Shearer at Newcastle, excellent example of 'the things we do for love'.) 

So, it was agreed to calculate percentage of possible games played for Shearer and Kane. Alongside their stats, I was asked to include Mo Salah because he was scoring at a ridiculous rate and might have beaten Kane to any given record. 
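The percentage itself is simple arithmetic. A minimal sketch in Python, assuming each season is reduced to two counts: games the player appeared in and games his club played in all competitions. The function name, the field names and the numbers are made up for illustration and are not TransferMarkt data.

```python
def percentage_played(appearances, possible_games):
    """Percentage of the club's games (league, cups, Europe) the player appeared in."""
    if possible_games <= 0:
        raise ValueError("possible_games must be positive")
    return 100.0 * appearances / possible_games


# Made-up seasons, purely illustrative.
seasons = [
    {"age": 25, "appearances": 40, "possible_games": 50},
    {"age": 26, "appearances": 45, "possible_games": 60},
]

# Map each age to the percentage of possible games played that season.
percentages = {
    s["age"]: percentage_played(s["appearances"], s["possible_games"])
    for s in seasons
}
print(percentages)
```

Dividing by the club's own total is what lets a Newcastle season with few cup games be compared to a Spurs season with a European run.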

I used TransferMarkt's data for all the players. 

When Kane was 27 and Salah was 28 the data looked like this - dotted lines are polynomial lines of best fit.
[Figure: Percentage-Predicted-27]

Obviously, for Shearer (blue dots and dotted line) we had stats for his whole career. The noticeable thing is that even at the end of his career, he was playing in a high percentage of Newcastle's games (85% in his last year), but this might have been because Newcastle never really had a replacement for Shearer ready at the time.

There is a reason he has a statue outside St James' Park.

[Figure: Photo-2021-09-05-12-39-06]

For the other two, the dotted lines are predictions and the lines look pretty different. 
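For anyone who wants to reproduce the dotted lines outside Excel, here is a rough sketch of a polynomial line of best fit in plain Python. It assumes a second-order (quadratic) polynomial, which is a guess at the trendline order, and the ages and percentages are invented. `fit_quadratic` is a hypothetical helper, not anything Excel exposes.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of a quadratic; returns a function that predicts y for any x.

    Needs at least three distinct x values. The x values are centred on their
    mean before fitting to keep the arithmetic well behaved.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) pairs")
    x0 = sum(xs) / len(xs)
    us = [x - x0 for x in xs]
    # Sums of powers of u and of y * u^k for the normal equations.
    s = [sum(u ** k for u in us) for k in range(5)]
    t = [sum(y * u ** k for u, y in zip(us, ys)) for k in range(3)]
    # Augmented matrix for unknowns (a, b, c) in y = a*u^2 + b*u + c.
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))

    def predict(x):
        u = x - x0
        return a * u * u + b * u + c

    return predict


# Invented percentages with one poor final season, then the same data
# plus one extra, better season.
ages = [21, 22, 23, 24, 25, 26, 27]
pct = [70.0, 85.0, 88.0, 90.0, 86.0, 84.0, 62.0]
curve_to_27 = fit_quadratic(ages, pct)
curve_to_28 = fit_quadratic(ages + [28], pct + [88.0])

# Compare what each fit extrapolates for age 30.
print(round(curve_to_27(30), 1), round(curve_to_28(30), 1))
```

Fitting once with and once without the final season shows how far a single point at the end of the range can move an extrapolated curve.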

Let’s look at it using only the data up to the age of 27, the maximum age all three had reached at that point.

[Figure: Percentage-up-to-27]

The curve for Shearer is heavily affected by his lack of games at the age of 27 (due to a long injury layoff).

You can see Shearer's curve is a very different shape to the other two. 

At the start of this year, when there was an extra year's data, the percentage of games played with extrapolation looked like this:

[Figure: Percentage-Predicted-28]

You can see the addition of that extra year's data changes the shape of Kane's curve a lot. His curve was being brought down by one low percentage season. I don't think the difference is an artifact, because if you look at the shape of the curves from actual data, not extrapolated (below), the shape hasn't changed with the extra data.

[Figure: Percentage-up-to-28]

Okay so we have the data, but the point of a striker is to score goals, so how does goals per game look for the three?

Looking at the projected stats at 27, they look like this:

[Figure: Goals-Predicted-27]

The two lower blue dots for Alan Shearer, at 27 and 30 years, reflect the years he had his worst injuries, which does suggest that injuries also reduce potency as you come back.

If we look at goals per game only up to 27, it looks like this:

[Figure: Goals-Per-Game-27]

The really interesting thing is that Salah's curve has a completely different shape to the other two, possibly reflecting his change from winger to striker, whereas the other two have always been strikers.

After the figure was updated to include the data once Kane was 28 and Salah 29, the goals per game curve (predicted) looks like this:

[Figure: Goals-Predicted-28]

The shapes of the three curves are quite different: Salah's is constantly increasing, Shearer's is a parabola, but a fairly shallow one, while Kane's is a much sharper parabola. I'm not sure if that's because of low goals per game last season skewing the whole curve, that frankly ridiculous season he had at 18 or a side effect of Spurs playing him slightly deeper now.

Looking only at data up to the age of 28 does suggest it's an effect of Kane's excellent year at 18, because in this view, his curve and Shearer's are very similar, while Salah's continues to show an increase, possibly due to him moving from wing to striker.

[Figure: Goals-Per-Game-28]

It makes sense to combine the two analyses and provide the goals per possible games, because yes, Shearer had fewer opportunities. On the other hand, it may make Kane and Salah's data look worse unfairly, given the modern tendency to squad rotation. 
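As a rough sketch of how the two measures relate (function names and numbers are made up for illustration): goals per possible game is goals per game multiplied by the fraction of possible games played, so a season with missed games pulls it down even if the scoring rate holds.

```python
def goals_per_game(goals, appearances):
    """Goals per game the player actually appeared in."""
    if appearances <= 0:
        raise ValueError("appearances must be positive")
    return goals / appearances


def goals_per_possible_game(goals, possible_games):
    """Goals per game the club played, whether or not the player featured."""
    if possible_games <= 0:
        raise ValueError("possible_games must be positive")
    return goals / possible_games


# One made-up season: 30 goals in 40 appearances out of 50 possible games.
goals, appearances, possible_games = 30, 40, 50
per_game = goals_per_game(goals, appearances)
per_possible = goals_per_possible_game(goals, possible_games)
print(per_game, per_possible)
```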

The extrapolated version at 27 looks like this:

[Figure: Goals-per-possible-games-27]

This is unexpected. I would have expected deleterious effects to hit Kane and Salah equally but Kane's curve really is warped by the poor year at 27.

I think it's mostly the extrapolation going haywire, because if you look just up to 27 without it (below), Kane's curve and Shearer's again match. Salah's remains different (possibly reflecting that Klopp doesn't really do squad rotation).

[Figure: Goals-per-possible-games-27-data]

I updated this at the end of last season. The extrapolated curve from Kane being 28 and Salah being 29 looks like this:

[Figure: Goals-per-possible-games-predicted-28]

While it could be Kane's production dropping precipitously, I think it's the extrapolation because the curves without extrapolation look like this:

[Figure: Goals-per-possible-games-28]

Where can this go? Well, there are 3 possible future things I'm thinking of looking at.

Going from most obvious to least obvious: 

1 - Yearly updates of this data, to a) find out how good the extrapolation was at predicting what happened, b) find out if Liverpool's 'orrible year this year has any effect on that stunningly straight curve shape of Salah's, and c) see if the drop for Kane in the prediction is just a blip.

2 - Include Wayne Rooney's data. He'd act as a nice control: a retired player whose position shifted from striker to something deeper.

3 - Add Haaland. This is another suggestion from L. I don't think it's because he wants to drive me round the twist but I fear it's going to do weird things to my graphs.

@mixed_knuts for @statsbomb once gave a talk where he discussed the effect that year Burnley really outperformed expectations had on Statsbomb's analyses. Burnley's data was so different to everyone else's that after every analysis they had to check whether any outlier was a bug or just Burnley being Burnley. 

I think Haaland would cause the same thing. His goalscoring for his age is ridiculous. On the other hand, he's young enough there's no saying he'd be able to keep it up. That's the one advantage to the above comparison being Kane and Salah, they were already in the middle of their careers when I started it, there was a solid amount of data. Even from that, the very basic extrapolation done by Excel has problems fitting the data. I dread to think what it'll do to Haaland's data.

Thursday 9 February 2023

Benford's Law - From February 2021 to the end of August 2021

I never actually drop projects, I just don't update them for a while. So let us return to the Benford's Law project, with information about the first digits in the top news article on the BBC website on 26 out of the 31 days of August 2021. 

In those 26 articles, there were 398 numbers with leading digits. That's ~ 15 per day, which is about the same as June, but more than July.

Most of those numbers came from the article on the 8th of August (https://www.bbc.co.uk/sport/olympics/58112331) which was about the performance of different sports at the Tokyo Olympics compared to their funding.

[Figure: August-only]

No number appeared exactly as often as expected. 5 was the closest, but even that was 1% away from expected. 1 and 2 are the most different to their expected values, and both are over-represented.

If you take (observed - expected) squared, divided by the expected, for each of the nine digits and add those values together, the calculated test statistic is 8.5, the highest since February itself.

The critical chi squared value for 9 items with only one line (8 degrees of freedom, at the 5% significance level) is ~ 15.507. The test statistic is smaller than the critical value, therefore the difference is not significant. This data does not disobey Benford's Law.
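For anyone wanting to repeat the calculation, here is a minimal sketch in Python. It assumes the statistic is calculated on raw counts of leading digits, which may not be exactly how the figures above were produced. The expected share of each digit d under Benford's Law is log10(1 + 1/d). The counts below are made up (chosen only to add up to 398), and the function names are hypothetical.

```python
import math

# Expected share of each leading digit under Benford's Law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi squared critical value for 8 degrees of freedom at the 5% level.
CRITICAL_VALUE = 15.507


def leading_digit(text):
    """First non-zero digit in a number written as text, or None if there is none."""
    for ch in text:
        if ch in "123456789":
            return int(ch)
    return None


def chi_squared(observed_counts):
    """Sum over digits 1-9 of (observed - expected)^2 / expected."""
    total = sum(observed_counts.values())
    stat = 0.0
    for digit, share in BENFORD.items():
        expected = total * share
        observed = observed_counts.get(digit, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts of leading digits, not the real August data.
counts = {1: 130, 2: 80, 3: 45, 4: 35, 5: 32, 6: 25, 7: 20, 8: 17, 9: 14}
stat = chi_squared(counts)
print(round(stat, 2), "significant" if stat > CRITICAL_VALUE else "not significant")
```

`leading_digit` skips zeros and punctuation, so "0.05" counts as a 5 and "1,200" as a 1.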

If we look at the rolling total from February to the end of August, there have been 2258 numbers with leading digits.

[Figure: February-to-August]

No number is exactly at its expected value, though 5 is the closest. 1 is the number furthest away from its expected value and remains over-represented.

If you take (observed - expected) squared, divided by the expected, for each of the nine digits and add those values together, the calculated test statistic is 3.00, not reducing the way it should do with the addition of more first digits that obey Benford's Law. However, as the critical chi squared value for 9 items with only one line is ~ 15.507, the test statistic is smaller than the critical value, therefore the difference is not significant. This data does not disobey Benford’s Law.

The test statistic continues to fluctuate rather than reduce which is interesting.

Thursday 2 February 2023

Film Review - Blinded by the Light

The year it came out, I named "Blinded By The Light" my favourite film of the year. I stand by that. 

I have no idea if it's a good film mind you, because it just blows past good, straight past all my critical faculties. 

It captures that teenage feeling of no-one understanding you except your band, in all its melodramatic glory. I mean it, that windswept scene, who hasn't felt precisely that? 

Maybe that's why I love the film - the way it reflects so many of my experiences. Not just "my favourite band are the only people who understand me", but the town in economic distress ('Luton is a Four-Letter Word' indeed), the friend you shared your music with, Leicester being the escape from your rundown town, so much of it. 

That's before we get to Roop looking so much like A who was my mate who shared his music with me. (No, seriously, that was uncanny, and means I get guilt for not keeping in better touch with A every time I think of the film.) 

The whole thing is filled with so much love, from Javed on down. Everyone is trying to get to tomorrow and help each other as best they can (except Eliza's parents and the National Front, and fuck the National Front).

The love is everywhere - find me a scene more filled with love than the one where Javed's Mum dyes Javed's Dad's hair. 

It would have been so easy to make Javed's Dad the boo-hiss disapproving Dad of legend, but he's not. He disapproves, yes, and he doesn't understand, but he's trying so hard and it's clear throughout that he loves his son. Even if he's terrible at showing it. 

The other thing I really like is that Javed is not over-idealised. As it's based on an autobiographical book, it must have been so tempting to make Javed super-sympathetic and always right, but he isn't. He gets to be mean, thoughtless and selfish at times. He's a teenager and feels like it. I also like that, unlike a lot of other Bildungsroman-type films, Javed grows through his own experiences and not the suffering of others.

In short, I loved it.