Sep 3 / Mike

MetaMetaCritic

Introduction

As we saw previously, gamers seem to like expressing their opinion, particularly when it goes against the perceived grain. But how reliable are these opinions, particularly when taken en masse, as they are on many websites? And if we have access to such a wealth of user scores, do we need games journalists at all?

In this post, we take a look at review scores taken off a popular review aggregator, and perform some simple mathematical analysis on them. We talk about the conclusions, what they might mean about user reviews in general, and why I might be wrong after all. More below.

Method

Metacritic is a popular website that aggregates and organises reviews of films, books, games and more. They combine reviews to create a ‘Metascore’, a weighted average of all reviews submitted to them. The weighting involves giving greater credence to reviewers who they deem to be more detailed or clear with their opinion. You probably know all of this already, of course, because Metacritic is fast becoming the way to assess a game’s success, rightly or wrongly. One problem with user reviews, however, is that they tend to be quite polarised:

I conducted this experiment to see just how bad this problem was, and whether it affected the worth of user reviews in general. I wrote a script in Python, using a library called BeautifulSoup that helps programmers process HTML-like data very quickly. The script fetched the raw HTML for the pages of the top 500 PC games, ordered by overall Metascore. It read the HTML and extracted the scores given by games journalists, as well as those given by Metacritic users, and computed a standard deviation for each of the two sets, for every game (more on standard deviation later). I then averaged those standard deviations across all games, once for the critics and once for the users.
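For the curious, here is a rough sketch of the sort of thing the script does. It is not the exact script I ran: the CSS selectors and the empty URL list are placeholders rather than Metacritic's real markup, and I'm assuming the common requests and bs4 packages for fetching and parsing.

    import statistics

    import requests
    from bs4 import BeautifulSoup

    def score_spread(url):
        """Fetch one game's review page and return (critic sd, user sd), or None."""
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
        soup = BeautifulSoup(html, "html.parser")

        # NOTE: placeholder selectors -- Metacritic's real markup differs.
        critic_scores = [int(tag.text) for tag in soup.select(".critic_review .score")]
        # User scores are out of 10, so scale them to 0-100 to match the critics.
        user_scores = [float(tag.text) * 10 for tag in soup.select(".user_review .score")]

        if len(critic_scores) < 2 or len(user_scores) < 2:
            return None  # skip games without both kinds of review
        return statistics.pstdev(critic_scores), statistics.pstdev(user_scores)

    if __name__ == "__main__":
        game_urls = []  # would be filled with the pages of the top 500 PC games
        spreads = [s for s in (score_spread(u) for u in game_urls) if s is not None]
        if spreads:
            print("average critic s.d.:", statistics.mean(sd for sd, _ in spreads))
            print("average user s.d.:  ", statistics.mean(sd for _, sd in spreads))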

Results

The vital statistics are as follows:

  • Average Standard Deviation (Critics) – 6.32
  • Average Standard Deviation (Users) – 22.47

Some other tidbits for you:

  • Highest Critic Standard Deviation – 14.77 (Jedi Knight II)
  • Lowest Critic Standard Deviation – 2.62 (Street Fighter IV)
  • Highest User Standard Deviation – 51.7 (Serious Sam: The First Encounter)
  • Lowest User Standard Deviation – 1.74 (Grim Fandango)

The average standard deviation for the users is noticeably higher than for the critics. Standard deviation is a unit of measurement, like a metre or a gallon. If you collect all the data that lie within one 'standard deviation' either side of the average (so for the critics here, any score within 6.32 points of the average score), then you have over 68% of the total data. In other words, the standard deviation measures how close most of your data is to the average. More on Wikipedia, and a helpful graph below:

What this means is that, in order to get 68% of the critic scores, you need only look in a 13-point band of scores (on average). Most of the critics score very close to the average. On the other hand, to get the same proportion of user scores, you need to look within a 45-point band. Much larger – almost half of the possible score range!
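To make the band idea concrete, here's a toy example (made-up scores, not real Metacritic data) comparing a clustered set of scores with a polarised set of the same size:

    import statistics

    clustered = [78, 80, 82, 84, 85, 86, 88, 90]   # scores bunched near the average
    polarised = [0, 0, 10, 20, 80, 90, 100, 100]   # scores pushed to the extremes

    for name, scores in [("clustered", clustered), ("polarised", polarised)]:
        mean = statistics.mean(scores)
        sd = statistics.pstdev(scores)
        # The band discussed above is one standard deviation either side of the mean.
        print(f"{name}: mean = {mean:.1f}, s.d. = {sd:.1f}, band width = {2 * sd:.1f} points")

The polarised set ends up with a band more than ten times wider, which is exactly the pattern in the real numbers above.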

Conclusions

As we saw in the two review examples earlier, Metacritic user reviews are very opinionated. When you get a lot of data that tends towards two extremes, and therefore far away from the mean, you get a high standard deviation. Which is exactly what we’ve got here.

One possible reading of this data is that we have games journalists for a reason – they're able to identify the strengths and weaknesses of a game, and they generally reach similar conclusions to other writers. You could see this as a sort of post-publication peer review; we can see that games journalists tend to feel the same way about a lot of games. Of course, another explanation is that games journalists can estimate what other games journalists are likely to think, and years of cynical 69% scores have made games reviewing a simple case of pigeonholing. I prefer the former explanation.

It also highlights, with numbers to back it up, the dangers of user reviews. Here’s a genuine Starcraft 2 user review:

Like most of the users here, I haven’t actually played this game; but that won’t stop me from commentating on it. I thought about giving it a perfect score, and I also thought about giving it a 0/10. But I felt bad about that. So instead, I’m giving it a 5/10 in order to balance out the 10/10 and 0/10 scores given by everyone else who hasn’t played the game.

While sites like Metacritic clearly give precedence to critic reviews, many other – indeed, one might argue most other – websites do not. If you’ve ever suspected that user reviews were dodgy, well – I think I found the numbers to prove it!

Evaluation

I sampled the top 500 games, but many of them did not make the final data for a number of reasons. Firstly, an anomaly in collecting the user data gave approximately 80 games the same standard deviation, so I disregarded those (and their equivalents in the critic data). Many games had user reviews but no Metascore/critic reviews, so they were disregarded. Others had no user reviews, so I disregarded them for the comparison too. In the end, we have a direct comparison between games which garnered both user and critic reviews.

If you’re interested, the results are nearly identical anyway. Averaging the standard deviations across the raw 500-strong data set gives an average standard deviation of about 6 for the critics, and 22 for the users. So all’s well that ends well.

Another consideration is the introduction of bias into the data. Typically, Metascores are calculated from around twenty reviews (five being the lower limit). Big games might get upwards of fifty reviews. User reviews come in far larger numbers, though – Half-Life 2 has over a thousand of them. This means that the datasets for the users are far larger than the datasets for the critics. I didn’t have time to perform hypothesis testing on the data to try and get around this, so you can once again take my results with a pinch of salt (though I’d argue that any bias inherent in the large user data set merely underlines how useless such a large volume of data is).
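I didn't run a formal test, but for anyone who wants to, here's a minimal sketch of what one might look like, assuming the per-game standard deviations are sitting in two lists (the numbers below are placeholders, not my results) and using SciPy's Welch t-test, which doesn't require the two groups to have equal variances:

    from scipy.stats import ttest_ind

    critic_sds = [6.1, 5.8, 7.0, 6.5]       # placeholder values, one per game
    user_sds = [21.9, 23.4, 20.8, 24.1]     # placeholder values, one per game

    # Welch's t-test: is the gap between the two averages bigger than chance alone explains?
    t_stat, p_value = ttest_ind(user_sds, critic_sds, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")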

The Code

As before, Python + BeautifulSoup is the order of the day. Metacritic seems to have some way of detecting large numbers of requests from a certain IP address and slowing down the data connection to a crawl, making this a more painful process than the last. Worse still, I want to do more analysis on Metacritic next time, which will mean going through this all over again.

Oh well!
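If you do fancy repeating the exercise, the workaround is simply to fetch slowly and patiently. Something along these lines would do; the delay values are guesses on my part, and requests is just one convenient way to do the fetching:

    import time

    import requests

    def polite_get(url, delay=5.0, retries=3):
        """Fetch a page, pausing between attempts so the scrape stays slow and steady."""
        for attempt in range(retries):
            response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
            if response.status_code == 200:
                time.sleep(delay)  # pause before the caller asks for the next page
                return response.text
            time.sleep(delay * (attempt + 2))  # back off a bit harder, then try again
        raise RuntimeError("gave up on " + url)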

With thanks to Mike Prescott for feedback and maths.

11 Comments

  1. James / Sep 5 2010

    Standard deviation doesn’t quite work that way. You’re gonna get 68% within 1 s.d. if the values are normally distributed, but as you noted the user ratings tend to be bimodal. You are however guaranteed 75% within 2 s.d., regardless of the distribution of the reviews.

  2. Neil Brown / Sep 5 2010

    Nice idea for a post, but I have a couple of problems with your mathematics. Firstly, you talk about how much of the data falls within a standard deviation of the mean. However, this is only valid if the data is drawn from a normal distribution. Given how polarised you say the reviews are, it seems to me quite unlikely that the data is normally distributed: I would expect a distribution skewed towards the upper end with a suggestion of bimodality (due to all the angry 0 scores) — you almost say as much in your conclusions section. There are statistical tests for normality that can be performed, but I don’t see any mention that you have done so. It might be interesting to post some histograms for the review scores of some of the games in order to see what the distribution is like. Secondly, what are the units here? Judging by the images in the early part of the post, I had thought 0-10, but it’s impossible to get a standard deviation of 22 with that range. So I guess it’s 0-100, which makes sense if you’re using data from metacritic.

  3. Mike / Sep 5 2010

    Hey Neil, James,

    Thanks for your feedback! I’m learning as I go with this statistics stuff, so it’s good to get some extra input. I’ll add both your comments to the main post.

    However, I’d argue my use of the data is still valid. Metacritic doesn’t weight the average for user reviews, or treat it any differently than normally distributed data. Am I not just treating the data in the same way that Metacritic does?

    (With regards to the 0-10 range; you’re right. User reviews are scored 0-10, but because critic reviews are out of 100 I normalised the data to be within that range).

  4. Tom / Sep 5 2010

    Don’t you think that scoring on a 0-10 scale is always going to give a (relatively) bigger deviation than a 0-100 system, because there is less perceived difference between scores of, say, 9 and 10 than between scores of 90 and 100 percent?

  5. Elite / Sep 5 2010

    Yes, user reviews are much more polarized than professional reviews, but I think you’ve misattributed the source of this a little, so I’m submitting some points for consideration.

    1. User reviews will be perturbed towards the extremes.

    A professional critic will review many different games because they’re obligated to do so. This will include games they like and games they dislike and games they’re apathetic about.

    Most amateur critics will only chime in when they’ve got a strong opinion on something. People don’t get passionate about games they found to be average and unremarkable, so user reviews are more likely to come from people who either loved or hated the game.

    User reviews are also more likely to be rushed – the gaming industry is largely hype driven, so the earliest reviews are usually the most influential, and without advance copies the amateur critic will have to rely on first impressions if they want to be relevant in the game’s opening days (or review the game without even playing it). This is less of a factor when someone simply wishes to voice their opinion, rather than wanting people to listen to it.

    2. Professionals are more inclined to introduce balance.

    I suspect many professional critics try to rein in their opinions for more reasonable reviews. This isn’t a case of pigeonholing games according to how they think other critics will react; it’s a case of embracing a certain amount of objectivity. There are games that I love but I can see why someone would dislike them, and there are games I hate but I can see why someone would like them. There are dangers in trying to second-guess how the masses will respond to a game, but I’d say it usually stems from trying to be helpful rather than trying to mislead.

    Also, professional critics are likely to receive rabid emails if they give out crazy scores, so I think there’s a certain amount of appeasement, especially when it comes to beloved franchises.

    On the other hand, I suspect many amateur critics exaggerate or otherwise take things to the extremes. In a sea of reviews, a 0/10 score is more likely to get noticed than a 1/10 or 2/10 score; the same goes for 10/10 vs 8/10 or 9/10. When faced with a huge crowd of reviews, people will be driven to make themselves stand out so they can be noticed. Also, random users are unlikely to be severely questioned or harassed if they say something crazy.

    I wouldn’t say it’s fair to say that professionals possess superhuman abilities to assess strengths and weaknesses of a game, but they’re more inclined to present a balanced view.

    ———

    Overall I think there’s some potential in user reviews, but I’m dissatisfied with the current systems. Then again I’m really not a fan of most professional games reviews and the whole numerical reviewing system in the first place.

  6. Simon / Sep 5 2010

    While you’re at it, maybe you could investigate an often criticized part of Metacritic: the way they convert 5- or 10-star scores into percentages and how it affects the weight of the different sources.

    The issue is that most decent games get 80-95%, but in the 5-star system they might get 3.5-5 stars, which in percent becomes 70-100%. This increased variance should mean that publications which use the star system become more important than those which don’t.

    I suspect that in the end this might not really change the placements of games compared to each other, but whatever…

  7. Dan / Sep 6 2010

    Another thing to bear in mind: you have to take scale and range into account as well. For example, if I were doing an analysis of the standard deviations of salaries in Japan versus salaries in the US, the number in Japan would be far, far higher. Not because salaries in Japan are inconsistent, but because the units that are being measured in Japan (Yen) are different from the units that are being measured in the US (Dollars).

    I think that a fair bit has been said about the 7-10 ratings that many publications use, while users may be honestly measuring on a 0-10 scale. Because of this, you’d expect to see a much larger standard deviation in user ratings. This isn’t a new problem, and it’s one that Metacritic has to solve itself. They have to decide: when 1up.com gives a game a C, is that a 7/10? Or a 3/10?

    It’s called normalizing the data: converting data into a similar scale so that two data sets can be accurately compared against each other.

    And to be fair, while I certainly respect your attempts to make sense of it all, trying to do statistical analysis on review data is somewhat non-trivial. This is one of the reasons why metacritic itself is so often criticized.

  8. Neil Brown / Sep 6 2010

    Hi Mike,

    Thanks for such a positive response. Your conclusions based on the standard deviation are fine: reviewers do have a much narrower spread than users, and standard deviation is a good way to measure this. So that’s fine.

    The only real problem with the post is where you talk about 68% of the data lying within 1 s.d. of the mean. This is only true for normally distributed data, and neither you nor metacritic can assume that the data is normally distributed without a priori knowledge (e.g. heights across a population are known to be normally distributed) or a statistical test. The graph you give is of a normal distribution, but your data is probably not shaped like that, which invalidates the 68% claim. You could remove this section and the rest of the post would be correct and still come to the same valid conclusion :-)

    Oh, and technically, s.d. is not a unit of measurement; s.d. is a measurement that has the same units as the original data (in this case: review scores). Perhaps the best expression is that s.d. measures the spread of data.

    Nice to see quantitative analysis being used where appropriate :-)

  9. Mike / Sep 6 2010

    Thanks everyone for more feedback! I’ve learnt a lot from your input, and you’ve mentioned some things that would make excellent further investigations.

    Rather than replying to you all individually here, I’ll try and make a second post this week that puts all of these comments on display and talks about them a little bit. Thank you all for reading and replying though!

    Mike

  10. Gap Gen / Sep 22 2010

    It would be interesting to see graphs for a typical user review score spread – people have said that the distribution should be bimodal, but the graph you show to explain standard deviation is a Gaussian (or something like it).

    I’d be interested in more statistical analysis of this sort of thing. The bar charts grouping review scores into Positive/Mixed/Negative on Metacritic are pretty good; they show bimodal distributions quite nicely. If you could come up with a way of measuring how these look for hundreds of games, that would be interesting. For example, if you looked at how many games had different distributions (i.e. bimodal = lots of Negative and Positive, few Mixed, wild card = pretty even measures of all 3 responses, etc).

Trackbacks and Pingbacks

  1. The Sunday Papers | Rock, Paper, Shotgun