Peer Performance in the Review Process: Reviewer Elo

The peer review process is an imperfect construction that helps lend credibility to the publication of research. It is not the final arbiter of who is right, as cumulative research should encourage further discussion, but it is an important barrier that offers a check on research and also provides feedback as authors work to make their findings available to the scientific community and the public in general.

Peer review also exists in a variety of venues beyond scientific research. One such arena, online video games, may offer insight into how to encourage active and meaningful participation in the peer review process by providing feedback not just to authors and editors, but also to reviewers.

The game League of Legends, created and operated by Riot Games, is one of the most popular competitive video games across the globe. It boasts a userbase of tens of millions (stats from October, 2012) and will hold its third world championship in less than a month with over $8 million at stake. The event will have thousands watching live in California and be watched live online by millions. Of note, the North American and European teams are expected to fare quite poorly against teams from China and Korea.

One of the persistent problems this large community deals with is toxic player behavior: aggression, foul language, threats, and actions that ruin the experience of the game for other players. Such behavior, carried out by individuals who are typically anonymous except for a username that can be changed, is detrimental to the health of the game and deters other players from continuing to participate in the game or community. In fact, Riot just released a video yesterday to try to combat the problem by appealing to statistics and performance. As one of its multiple methods of combating this problem, Riot hired social scientists to see what inducements it can offer to curb poor behavior.

One strategy Riot has adopted is the Tribunal. In the Tribunal, other players read through the logs of an individual who has been reported repeatedly and vote to either pardon or punish that person for their behavior in the game. Once enough people have voted, the majority decision holds: the person is either pardoned, and nothing is done to them, or they receive punishment for their behavior. The punishment escalates as the player is found guilty before multiple tribunals, and ranges from a warning, to losing the ability to chat in game, to a temporary ban, to having their account permanently banned from the game.

Now, the part of this whole story that is related to academic peer review is that the users who participate in this process receive a score that changes depending on whether they vote with the majority or the minority. This score, an Elo rating, goes up when they are in the majority and down when they are in the minority. For the Tribunal, the system is a modification of the same system that calculates an individual’s chess rating based on their wins and losses. Generally, a higher score indicates better performance as a judge. This gives judges feedback about the consistency of their evaluations relative to the rest of the community, and there are even leaderboards that rank the top few hundred judges.
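To make the mechanism concrete, here is a minimal sketch of how such a rating might move, using the standard chess Elo update. Riot's actual modification is not described here, so everything specific in the sketch is an assumption made for illustration: a vote with the majority is treated as a win and a vote with the minority as a loss, the "opponent" is a fixed baseline rating of 1500, and the step size K is 32. The function names are hypothetical.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expectation: probability-like score of 'winning'."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_rating(
    rating: float,
    agreed_with_majority: bool,
    baseline: float = 1500.0,  # assumed fixed 'opponent' rating
    k: float = 32.0,           # assumed step size (a common chess value)
) -> float:
    """Move the rating up after a majority vote, down after a minority vote."""
    outcome = 1.0 if agreed_with_majority else 0.0
    return rating + k * (outcome - expected_score(rating, baseline))


if __name__ == "__main__":
    rating = 1500.0
    for vote_in_majority in (True, True, False):
        rating = update_rating(rating, vote_in_majority)
        print(round(rating, 1))
```

Under these assumptions a judge who starts at the baseline gains 16 points for a first majority vote and loses 16 for a first minority vote. Gains shrink as the rating climbs above the baseline, so a long streak of agreement is worth progressively less.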

This mechanism has potential for use within academic publishing. Essentially, when a reviewer makes a recommendation to an editor about a particular manuscript, they are casting a vote for rejection, revise and resubmit, or some level of acceptance. Naturally, these votes are not cast in isolation. Given that any academic is likely to review multiple articles over the course of a year, as well as a lifetime, it would be possible to rank reviewers along a similar metric. In academic publishing, I recommend two such scores, each tracking performance over a binary outcome: Reviewer Agrees with Peers and Reviewer Agrees with Editor. The first score indicates whether a given reviewer is generally in the majority or the minority when reviewing papers. The second score indicates whether a reviewer is providing useful information to the editor. Thus, it is plausible that a reviewer generally agrees with his or her peers and that this agreement informs the editor. It is also possible that a reviewer consistently disagrees with his or her peers but that editors still use that reviewer's views to inform their decisions.

To provide an Elo system for reviewers would require the cooperation of one or more journals within a discipline. It is feasible for a single journal to provide scores for just its own reviewers if there were no cooperation within a discipline, though the sample and the movement of those scores would be small as a result. If several journals within a discipline worked together, the resulting data would be much richer. Given that peer review is anonymized on two fronts, the data would likewise have to be anonymized in terms of which papers a reviewer reviewed, but this can be done by simply issuing a unique ID for each manuscript. Then, all the journal would have to provide is a simple yes or no as to whether the reviewer was in the majority and whether the reviewer was in line with the editor's decision. It is possible to make the data more informative by including a range of possible deviations from the majority, but it would make sense to keep the data collection as simple as possible. Ultimately, after the scores have been computed, they should be made public to the discipline.
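The reporting scheme above could be sketched roughly as follows. This is an illustration under stated assumptions, not a specification: the record layout (`ReviewRecord` and its field names), the 1500 baseline, and the K value of 32 are all hypothetical choices, and the update rule is the plain chess Elo formula applied against a fixed baseline. Each record carries only an anonymized manuscript ID, a reviewer ID, and the two yes/no answers a journal would supply.

```python
from collections import defaultdict
from dataclasses import dataclass

BASELINE = 1500.0  # assumed starting rating and fixed 'opponent' rating
K = 32.0           # assumed step size


@dataclass(frozen=True)
class ReviewRecord:
    manuscript_id: str        # anonymized unique ID issued by the journal
    reviewer_id: str
    in_majority: bool         # did the reviewer side with most reviewers?
    agreed_with_editor: bool  # did the reviewer match the editor's decision?


def _update(rating: float, success: bool) -> float:
    """One standard Elo step against the fixed baseline."""
    expected = 1.0 / (1.0 + 10 ** ((BASELINE - rating) / 400.0))
    return rating + K * ((1.0 if success else 0.0) - expected)


def score_reviewers(records):
    """Return {reviewer_id: (peer_score, editor_score)} from yes/no records."""
    peer = defaultdict(lambda: BASELINE)
    editor = defaultdict(lambda: BASELINE)
    for rec in records:
        peer[rec.reviewer_id] = _update(peer[rec.reviewer_id], rec.in_majority)
        editor[rec.reviewer_id] = _update(
            editor[rec.reviewer_id], rec.agreed_with_editor
        )
    return {rid: (peer[rid], editor[rid]) for rid in peer}


if __name__ == "__main__":
    records = [
        ReviewRecord("ms-001", "rev-A", True, False),
        ReviewRecord("ms-001", "rev-B", False, True),
    ]
    for reviewer, (peer_score, editor_score) in score_reviewers(records).items():
        print(reviewer, round(peer_score, 1), round(editor_score, 1))
```

Keeping the two scores in separate tables mirrors the distinction drawn earlier: a reviewer can sit in the minority among peers while still matching the editor's eventual decision, and the two ratings will then move in opposite directions.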


Going through this process and providing such information must offer some benefit to the discipline to be worth adopting. There are a few different benefits to assembling and providing such data. First, it would provide a benchmark for newer members of the community to gauge, in a systematic way, how their decisions relate to those of other people in the community. This is also true for members who are established within a discipline. As an Elo score goes up or down, the academic can gauge systematically how their decisions have related to others within the community. This allows individuals to move beyond anecdotal evidence that could very well be filtered through confirmation bias and, instead, provides a direct assessment of their reviewing capabilities.

Second, it can offer a level of prestige.  Such Elo measurements can and should be broken down into lifetime and yearly scores.  These numbers then give participating reviewers a service-based metric of their performance.  If you are the highest Elo reviewer within a discipline for a given year, or over a lifetime, this would send a strong signal about your consistency. If your Elo is consistently low, this may be because you have a higher standard than your peers or, perhaps, a lower standard if you are often the lone wolf that recommends publication among two other rejection recommendations.

The third notable benefit would be to editors. Editors have networks of reviewers that they request reviews from and have some idea of how those reviewers have performed in the past. This evaluation is generally based on memory, as well as on whether a particular article falls within a reviewer's domain of expertise, and it may be prone to error. In a ranking system as proposed above, a high Elo rating on either metric (community or editor Elo) will give editors more information about whether they should approach new individuals (or old ones) for additional reviews. A community-wide Elo system would also give editors access to individuals within the discipline that they may not have known were active reviewers.

There are other benefits to having a public Elo system, but I will stop at these three.


Naturally, such Elo computations are not free.  There are two major costs I see associated with such a plan.

First is the task of data gathering, a literal cost of time and money. It would likely put another burden on editors to provide this information to those who calculate the Elo rankings. Editors already have to tally up votes, read through decisions, and convey this information to the author(s) of manuscripts. I imagine the reporting would be a small additional task for each reviewed article, but it would still be another task, and editors would have to choose to take it on. However, it can be done on a case-by-case basis, or a journal can hand the data to the compiler in a single batch. Likewise, an organization or individual would have to take on the task of collecting this information and providing it publicly (likely online).

The second potential drawback is that such a system may encourage conformity where we do not want it to exist. If reviewers become motivated to achieve a high Elo, their goal in critiquing an article may not be to determine if it is fit to be published, but may instead be to determine what their peers think of such a manuscript. This would depress individual incentives to strongly attack or defend an article.

A more serious problem in this regard is that one decision may become a focal point for reviewers. If it is common for articles to be rejected (as is the norm for most peer-reviewed journals), then reviewers who care strongly about their Elo score will reject articles as their default position. In general, I am not convinced that such a system would override the other incentives reviewers have for reviewing scholarly work, but it is certainly possible.

Along this line, there can also be an incentive for nonconformity. If you really do not want to review articles, you can purposely tank your Elo to send a signal to editors. However, I imagine that people who would do this already send similar signals to editors in the present through how they review or comment on a manuscript.

Reviewing is a standard form of service within academia and is expected of scholars who publish in journals and books. Having a systematic way of showcasing consistent reviewers, which in effect offers a public incentive for continued, strong reviewing, may be an interesting experiment in providing systematized feedback to reviewers. At the very least, it is an interesting thought experiment.

Michael A. Allen

About Michael A. Allen

Michael is an Assistant Professor in Political Science at Boise State University with a focus in International Relations, Comparative Politics, and Methodology (quantitative and formal). His work includes issues related to military basing abroad, asymmetric relations, cooperation, and conflict. He received his Ph.D. from Binghamton University in 2011.
