As we saw previously, gamers seem to like expressing their opinion, particularly when it goes against the perceived grain. But how reliable are these opinions, particularly when taken en masse, as they are on many websites? And if we have access to such a wealth of user scores, do we need games journalists at all?
In this post, we take a look at review scores taken off a popular review aggregator, and perform some simple mathematical analysis on them. We talk about the conclusions, what they might mean about user reviews in general, and why I might be wrong after all. More below.
Metacritic is a popular website that aggregates and organises reviews of films, books, games and more. They combine reviews to create a ‘Metascore’, a weighted average of all reviews submitted to them. The weighting involves giving greater credence to reviewers whom they deem to be more detailed or clear with their opinion. You probably know all of this already, of course, because Metacritic is fast becoming the way to assess a game’s success, rightly or wrongly. One problem with user reviews, however, is that they tend to be quite polarised.
I conducted this experiment to see just how bad this problem was, and whether it affected the worth of user reviews in general. I wrote a script in Python, using a library called BeautifulSoup that makes it easy to parse HTML. The script fetched the raw HTML for the pages of the top 500 PC games ranked by overall Metascore, extracted the scores given by games journalists and those given by Metacritic users, and computed a standard deviation for each set, for each game (more on standard deviation later). I then averaged those per-game standard deviations across all games, once for the critic scores and once for the user scores.
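For the curious, here is a rough sketch of the approach rather than the original script. The URL pattern and the CSS selector are assumptions for illustration only; Metacritic's real markup differs and changes over time.

```python
# Rough sketch only: the URL pattern and the ".review_grade" selector are
# assumptions for illustration, not Metacritic's actual markup.
import math
import urllib.request
from bs4 import BeautifulSoup

def fetch_scores(url, selector):
    """Download a review page and pull out the numeric scores it lists."""
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return [float(tag.get_text()) for tag in soup.select(selector)]

def std_dev(scores):
    """Plain population standard deviation of a list of scores."""
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

# Hypothetical usage for a single game's review pages:
critics = fetch_scores("https://www.metacritic.com/game/pc/some-game/critic-reviews",
                       ".review_grade")
users = fetch_scores("https://www.metacritic.com/game/pc/some-game/user-reviews",
                     ".review_grade")
users = [u * 10 for u in users]  # users score out of 10; rescaling to 100 matches the figures below

print(std_dev(critics), std_dev(users))
```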
The vital statistics are as follows:
- Average Standard Deviation (Critics) – 6.32
- Average Standard Deviation (Users) – 22.47
Some other tidbits for you:
- Highest Critic Standard Deviation – 14.77 (Jedi Knight II)
- Lowest Critic Standard Deviation – 2.62 (Street Fighter IV)
- Highest User Standard Deviation – 51.7 (Serious Sam: The First Encounter)
- Lowest User Standard Deviation – 1.74 (Grim Fandango)
The average standard deviation for the users is noticeably higher than for the critics. Standard deviation is a measure of how spread out the data is around the average. If you collect all the data lying within one ‘standard deviation’ either side of the average (so for the critics here, any review within 6.32 points of the average score), you capture roughly 68% of the total data, assuming the scores are roughly normally distributed. In other words, the standard deviation measures how close most of your data is to the average. More on Wikipedia.
What this means is that, to capture 68% of the critic scores, you need only look within a band of scores roughly 13 points wide (on average). Most of the critics score very close to the average. To capture the same proportion of user scores, on the other hand, you need to look within a band roughly 45 points wide. Much larger – almost half of the possible score range!
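To make that arithmetic concrete, here is a minimal sketch (with made-up scores on a 0–100 scale) of how the one-standard-deviation band is worked out:

```python
def one_sd_band(scores):
    """Return the band one standard deviation either side of the mean,
    plus the fraction of scores that falls inside it."""
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    inside = sum(1 for s in scores if mean - sd <= s <= mean + sd)
    return (mean - sd, mean + sd), inside / len(scores)

# Made-up critic-style scores, tightly clustered around the average:
(low, high), share = one_sd_band([84, 88, 90, 79, 86, 91, 83])
print(round(high - low, 1), round(share, 2))  # band width and share of scores inside it

# With the averages above, a critic band spans about 2 * 6.32 ≈ 13 points,
# and a user band about 2 * 22.47 ≈ 45 points.
```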
As we saw in the two review examples earlier, Metacritic user reviews are very opinionated. When you get a lot of data that tends towards two extremes, and therefore far away from the mean, you get a high standard deviation. Which is exactly what we’ve got here.
One possible reading of this data is that we have games journalists for a reason – they’re able to identify the strengths and weaknesses of a game, and they generally reach similar conclusions to other writers. You could see this as a sort of post-publication peer review: games journalists tend to feel the same way about a lot of games. Of course, another explanation is that games journalists can estimate what other games journalists are likely to think, and years of cynical 69% scores have made games reviewing a simple case of pigeonholing. I prefer the former explanation.
It also highlights, with numbers to back it up, the dangers of user reviews. Here’s a genuine Starcraft 2 user review:
Like most of the users here, I haven’t actually played this game; but that won’t stop me from commentating on it. I thought about giving it a perfect score, and I also thought about giving it a 0/10. But I felt bad about that. So instead, I’m giving it a 5/10 in order to balance out the 10/10 and 0/10 scores given by everyone else who hasn’t played the game.
While sites like Metacritic clearly give precedence to critic reviews, many other – indeed, one might argue most other – websites do not. If you’ve ever suspected that user reviews were dodgy, well – I think I found the numbers to prove it!
I sampled over 500 games’ worth of reviews, but many of them did not make it into the final data, for a number of reasons. Firstly, an anomaly in collecting the user data gave approximately 80 entries the same standard deviation, so I disregarded those (and their equivalents in the critic data). Many games had user reviews but no Metascore/critic reviews, so they were disregarded. Others had no user reviews, so I disregarded them from the comparison too. In the end, we have a direct comparison between games which garnered both user and critic reviews.
If you’re interested, the results are nearly identical anyway: averaging the standard deviations across the raw 500-strong dataset gives roughly 6 for the critics and 22 for the users. So all’s well that ends well.
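As a sketch of that filtering step (the record layout here is hypothetical, but the exclusion rules mirror the ones above):

```python
# Hypothetical records: the field names are made up, but the exclusion rules
# mirror the ones described above.
games = [
    {"title": "Game A", "metascore": 88,   "user_scores": [90, 20, 100], "anomalous": False},
    {"title": "Game B", "metascore": None, "user_scores": [70, 80],      "anomalous": False},  # no critic reviews
    {"title": "Game C", "metascore": 75,   "user_scores": [],            "anomalous": False},  # no user reviews
    {"title": "Game D", "metascore": 81,   "user_scores": [60, 60],      "anomalous": True},   # duplicate-SD anomaly
]

def usable(game):
    """Keep only games with both a Metascore and at least one user review,
    excluding records hit by the scraping anomaly."""
    return (game["metascore"] is not None
            and len(game["user_scores"]) > 0
            and not game["anomalous"])

clean = [g for g in games if usable(g)]
print([g["title"] for g in clean])  # -> ['Game A']
```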
Another consideration is the introduction of bias into the data. Typically, Metascores are calculated from around twenty reviews (five being the lower limit). Big games might get upwards of fifty reviews. User reviews come in far larger numbers, though – Half-Life 2 has over a thousand of them. This means that the datasets for the users are far larger than the datasets for the critics. I didn’t have time to perform hypothesis testing on the data to try and get around this, so you can once again take my results with a pinch of salt (though I’d argue that any bias inherent in the large user dataset merely underlines how useless such a large volume of data is).
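For what it’s worth, here is the kind of check one might run (I haven’t run it on this data): Levene’s test asks whether two samples of unequal size differ significantly in spread. The score lists below are invented, purely to show the shape of the test.

```python
# Sketch only: Levene's test compares the spread of two samples and copes
# with unequal sample sizes. The score lists below are invented, not the
# scraped Metacritic data.
from scipy.stats import levene

critic_scores = [84, 88, 90, 79, 86, 91, 83]                       # small, tightly clustered
user_scores = [100, 0, 90, 10, 100, 95, 0, 80, 100, 5, 70, 100]    # larger, polarised

stat, p_value = levene(critic_scores, user_scores)
print(f"Levene statistic: {stat:.2f}, p-value: {p_value:.4f}")
# A small p-value suggests the difference in spread is unlikely to be down
# to chance, even with the mismatched sample sizes.
```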
As before, Python + BeautifulSoup is the order of the day. Metacritic seems to have some way of detecting large numbers of requests from a certain IP address and slowing down the data connection to a crawl, making this a more painful process than the last. Worse still, I want to do more analysis on Metacritic next time, which will mean going through this all over again.
With thanks to Mike Prescott for feedback and maths.