NIPS Experiment Analysis

Sorry for the relative silence on the NIPS experiment. Corinna and I have both done some analysis of the data. Over the Christmas break I focussed on an analysis of the ‘raw numbers’ which people have been discussing. In particular I wanted to quantify the certainty that can be placed on these numbers. There are a couple of different ways of doing this, a bootstrap or a Bayesian analysis; I went for the latter. Corinna has also been doing a lot of work on how the scores correlate, and the ball is in my court to pick up on that. However, before doing that I wanted to make the initial Bayesian analysis of the data available. In doing so, we’re also releasing a little bit more information on the numbers.

The headline figure is that if we re-ran the conference we would expect anywhere between 38% and 64% of the same papers to have been presented again. Several commentators mentioned that this is the figure attendees are really interested in. Of course, when you think about it, you also realise it is a difficult figure to estimate: the power of the study is reduced because the figure is based only on papers which received at least one accept (rather than the full 168 papers used in the study).
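For readers who want the flavour of the calculation, here is a minimal sketch with made-up counts (the real numbers and full model are in the notebook): suppose, hypothetically, that 21 of the 37 papers accepted by the first committee were also accepted by the second. A flat Beta(1, 1) prior on the re-accept rate then gives a Beta posterior we can summarise with a credible interval.

```python
import numpy as np

# Hedged sketch of the kind of credible interval quoted above. The counts are
# hypothetical: suppose 21 of 37 papers accepted by the first committee were
# also accepted by the second. With a flat Beta(1, 1) prior on the re-accept
# rate, the posterior is Beta(1 + 21, 1 + 16).
rng = np.random.default_rng(0)
re_accepted, re_rejected = 21, 16

posterior_samples = rng.beta(1 + re_accepted, 1 + re_rejected, size=100_000)
lo, hi = np.percentile(posterior_samples, [2.5, 97.5])
print(f"95% credible interval: {lo:.2f} to {hi:.2f}")
```

The width of such an interval is what drives the 38%–64% range above: with only a few dozen papers in the relevant set, the posterior over the re-accept rate is necessarily broad.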

Anyway, details of the Bayesian analysis are available in a Jupyter notebook on GitHub.

Blogs on the NIPS Experiment

There are now quite a few blog posts on the NIPS experiment, so I just wanted to put together a place where I could link to them all. It’s a great set of posts from community mainstays, newcomers and those outside our research fields.

Just as a reminder, Corinna and I were extremely open about the entire review process, with a series of posts about how we engaged the reviewers and processed the data. All that background can be found through a separate post here.

At the time of writing there is also still quite a lot of Twitter traffic about the experiment.

List of Blog Posts

What an exciting series of posts and perspectives!
For those of you that couldn’t make the conference, here’s what it looked like.
And that’s just one of five or six poster rows!

The NIPS Experiment

Just back from NIPS, where it was really great to see the results of all the work everyone put in. I really enjoyed the program and thought the quality of the presented work was really strong. Both Corinna and I were particularly impressed by the effort put in by oral presenters to make their work accessible to such a large and diverse audience.

We also released some of the figures from the NIPS experiment, and there was a lot of discussion at the conference about what the result meant.

As we announced at the conference, the consistency figure was 25.9%. I just wanted to confirm that, in the spirit of openness that we’ve pursued across the entire conference process, Corinna and I will provide a full write-up of our analysis and conclusions in due course!

Some of the comment in the existing debate is missing some of the background information we’ve tried to generate, so I just wanted to write a post summarising that information to highlight its availability.

Scicast Question

With the help of Nicolo Fusi, Charles Twardy and the entire Scicast team, we launched a Scicast question a week before the results were revealed. The comment thread for that question already had a fair amount of interesting discussion before the conference. Just for informational purposes: before we began reviewing, Corinna forecast this figure would be 25% and I forecast 20%. The box plot summary of predictions from Scicast is below.


Comment at the Conference

There was also a fair amount of debate at the conference about what the results mean. A few attempts to answer this question (based only on the inconsistency score and the expected accept rate for the conference) are available in this little Facebook discussion and on this blog post.

Background Information on the Process

Just to emphasise, here are the previous posts on this year’s conference:

  1. NIPS Decision Time
  2. Reviewer Calibration for NIPS
  3. Reviewer Recruitment and Experience
  4. Paper Allocation for NIPS

Software on Github

And finally, there is a large amount of code available on a GitHub site that allows our processes to be recreated. A lot of it is tidied up, but the last sections on the analysis are not yet done: it was always my intention to finish those when the experimental results were fully released.

NIPS: Decision Time

Thursday 28th August

In the last two days I’ve spent nearly 20 hours in teleconferences; my last scheduled call will start in about half an hour. Given the available 25 minutes, it seemed to make sense to try to put down some thoughts about the decision process.

The discussion period has been constant, with a stream of incoming queries from Area Chairs: requests for advice on additional reviewers, or on how to resolve deadlocked or disputed reviews. Corinna has handled many of these.

Since the author rebuttal period, all the papers have been distributed across Google spreadsheets which are updated daily. They contain paper titles, reviewer names, quality scores, calibrated scores, a probability of accept (under our calibration model), and a list of bot-compiled potential issues, as well as columns for accept/reject and poster/spotlight. Area Chairs have been working in buddy pairs, ensuring that a second set of eyes can rest on each paper. For those papers around the borderline, or with contrasting reviews, the discussion period really can have an effect, as we see when calibrating the reviewer scores: over time the reviewer bias reduces and the scores become more consistent. For this reason we allowed this period to go on a week longer than originally planned, and we’ve been compressing our teleconferences into the last few days.

Most teleconferences consist of two buddy pairs coming together to discuss their papers. Perhaps ideally the pairs would have a similar subject background, but constraints of time zone and the fact that there isn’t a balanced number of subject areas mean that this isn’t necessarily the case.

Corinna and I have been following a similar format: listing the papers from highest scoring to lowest scoring, and starting at the top. For each paper, if it is a confident accept, we try to identify whether it might be a talk or a spotlight. This is where the opinion of a range of Area Chairs can be very useful. For uncontroversial accepts that aren’t nominated for orals we spend very little time. This proceeds until we start reaching borderline papers, those in the ‘grey area’: typically papers with an average score around 6. They fall broadly into two categories: those where the reviewers disagree (e.g. scores of 8,6,4), or those where the reviews are consistent but the reviewers, perhaps, feel underwhelmed (scores of 6,6,6). Area Chairs will often work hard to try to get one of the reviewers to ‘champion’ a paper: it’s a good sign if a reviewer has been prepared to argue the case for a paper in the discussion. However, the decisions in this region are still difficult. It is clear that we are rejecting some very solid papers, for reasons of space and because of the overall quality of submissions. It’s hard for everyone to be on the ‘distributing’ end of this system, but at the same time, we’ve all been on the receiving end of it too.

In this difficult ‘grey area’, we are looking for sparks in a paper that push it over the edge to acceptance. So what sort of thing catches an Area Chair’s eye? A new direction is always welcome, but often leads to higher variance in the reviewer scores: not all reviewers are necessarily comfortable with the unfamiliar. But if an Area Chair feels a paper is taking the machine learning field somewhere new, then even if the paper has some weaknesses (e.g. in evaluation, or in giving context and detailed derivations) we might be prepared to overlook them. We look at the borderline papers in some detail, scanning the reviews for words like ‘innovative’, ‘new directions’ or ‘strong experimental results’. If we see these then as program chairs we definitely become more attentive. We all remember papers presented at NIPS in the past that led to revolutions in the way machine learning is done. Both Corinna and I would love to have such papers at ‘our’ NIPS.

A paper in a more developed area will be expected to have done a more rounded job in terms of setting the context and performing the evaluation, and to hit a higher standard overall.

It is often helpful to have an extra pair of eyes (or even two pairs) run through the paper. Each teleconference call normally ends with a few follow-up actions for a different Area Chair to look through a paper or clarify a particular point. Sometimes we also call in domain experts, who may have already produced four formal reviews of other papers, just to get clarification on a particular point. This certainly doesn’t happen for all papers, but those with scores around 7,6,6 or 6,6,6 or 8,6,4 often get this treatment. Much depends on the discussion and content of the existing reviews, but there are still, often, final checks that need carrying out. From a program chair’s perspective, the most important thing is that the Area Chair is comfortable with the decision, and I think most of the job is acting as a sounding board for the Area Chair’s opinion, which I try to reflect back to them. In the same manner as rubber duck debugging, just vocalising the issues sometimes causes them to crystallise in the mind. Ensuring that Area Chairs are calibrated to each other is also important. The global probabilities of accept from the reviewer calibration model really help here. As we go through papers I keep half an eye on those, not to influence the decision on a particular paper so much as to ensure that at the end of the process we don’t have a surplus of accepts. At this stage all decisions are tentative, but we hope not to have to come back to too many of them.

Monday 1st September

Corinna finished her last video conference on Friday; Saturday, Sunday and Monday (Labor Day) were filled with making final decisions on accepts, then talks and finally spotlights. Accepts were hard: we were unable to take all the papers that were possible accepts, as we would have gone way over our quota of 400. We had to make decisions on duplicated papers where the outcomes were in conflict; more details of this to come at the conference. Remembering what a pain it was to do the schedule after the acceptances, and also following advice from Leon Bottou that the talk program should emerge to reflect the accepted posters, we finalised the talk and spotlight program whilst putting talks and spotlights directly into the schedule. We honed the talks down to 20 from about 40 candidates, and for spotlights we squeezed in 62 from over a hundred suggestions. We spent three hours in teleconference each day, as well as preparation time, across the Labor Day weekend putting together the first draft of the schedule. It was particularly impressive how quickly area chairs responded to any of our follow-up queries from our teleconference notes, particularly those in the US who were enjoying the traditional last weekend of summer.

Tuesday 2nd September

I had an all day meeting in Manchester for a network of researchers focussed on mental illness. It was really good to have a day discussing research, my first in a long time. I thought very little about NIPS until, on the train home, I had a little look at the conference shape. I actually ended up looking at a lot of the papers we rejected, many from close colleagues and friends. I found it a little depressing. I have no doubt there is a lot of excellent work there, and I know how disappointed my friends and colleagues will be to receive those rejections. We did an enormous amount to ensure that the process was right, and I have every confidence in the area chairs and reviewers. But at the end of the day, you know that you will be rejecting a lot of good work. It brought to mind a thought I had at the allocation stage. When we had the draft allocation for each area chair, I went through several of them sanity checking the quality of the allocation. Naturally, I checked those associated with area chairs who are closer to my own areas of expertise. I looked through the paper titles, and I couldn’t help but think what a good workshop each of those allocations would make. There would be some great ideas and some partially developed ideas. There would be some really great experiments and some weaker experiments. But there would be a lot of debate at such a workshop. None, or very few, of the papers would be uninteresting: there would certainly be errors in papers, but that’s one of the charms of a workshop; there’s still a lot more to be said about an idea when it’s presented at a workshop.

Friday 5th September

Returning from an excellent two day UCL-Duke workshop. There is a lot of curiosity about the NIPS experiment, but Corinna and I have agreed to keep the results embargoed until the conference.

Saturday 6th September

Area chairs had until Thursday to finalise their reviews in the light of the final decisions, and also to raise any concerns they had about those decisions. My own experience of area chairing is that you can have doubts about your reasoning when you are forced to put pen to paper and write the meta-review. We felt it was important not to rush the final process, to allow any of those doubts to emerge. In the end, the final program has 3 or 4 changes from the draft we first distributed on Monday night, so there may be some merit in this approach. We had a further 3 hour teleconference today to go through the meta-reviews, with a particular focus on those for papers around the decision boundary. Other issues, such as comments in the wrong place (the CMT interface can be fairly confusing: 3% of meta-reviews were actually placed in the box meant for notes to the program chairs), were also covered. Our big concern was whether the area chairs had written a review consistent with our final verdict. A handy learning task would have been to build a sentiment model to predict accept/reject from the meta-review.

Monday 8th September 

Our plan had been to release reviews this morning, but we were still waiting for a couple of meta-reviews to be tidied up and had an outstanding issue on one paper. I write this with CMT ‘loaded’ and ready to distribute decisions. However, when I preview the emails the variable fields are not filled in: if I hit ‘send’ I would send 5,000 emails that start “Dear $RecipientFirstName$”, which sounds somewhat impersonal … although perhaps more critical is that the authors would be informed of the fate of paper “$Title$”, which may lead to some confusion. CMT are in a different time zone, 8 hours behind. Fortunately, it is late here, so there is a good chance they will respond in time …

Tuesday 9th September

I was wide awake at 6:10 despite going to sleep at 2 am. I always remember, from when I was an Area Chair with John Platt, that he would be up late answering emails and then out of bed again 4 hours later doing it again. A few final checks and the all clear: everything is there. Pressed the button at 6:22 … emails are still going out and it is 10:47. 3854 of the 5615 emails have been sent … one reply so far, an out-of-office email from China. Time to make a coffee …

Final Statistics

1678 submissions
414 papers accepted
20 papers for oral
62 for spotlight
331 for poster
19 rejected without review

Epilogue to Decision Mail: So what was wrong with those variable names? I particularly like the fact that something different was wrong with each one. $RecipientFirstName$ and $RecipientEmail$ are not available in the “Notification Wizard”, whereas they are in the normal email sending system. Then I got the other variables wrong, $Title$ should have been $PaperTitle$ and $PaperId$ should have been $PaperID$, but since neither of the two I knew to be right were working I assumed there was something wrong with the whole variable substitution system … rather than it being that (at least) two of the variable types just happen to be missing from this wizard … CMT responded nice and quickly though … that’s one advantage of working late.

Epilogue on Acceptances: At the time of the conference there were only 411 papers presented because three were withdrawn. Withdrawals were usually due to some deeper problem the authors had found in their own work, perhaps triggered by comments from reviewers. So in the end there were 411 papers presented and 328 posters.

Author Concerns

So the decisions have been out for a few days now, and of course we have had some queries about our processes. Everyone has been pretty reasonable, and their frustration is understandable when three reviewers have argued for accept but the final decision is to reject. This is an issue with ‘space-constrained’ conferences. Whether a paper gets through in the end can depend on subjective judgements about the paper’s qualities. In particular, we’ve been looking for three components: novelty, clarity and utility. Papers with borderline scores (and borderline here might mean that the average score is in the weak accept range) are examined closely. The decision about whether such a paper is accepted necessarily comes down to judgement, because for a paper to get scores this high the reviewers won’t have identified a particular problem with it. What comes through is how novel the paper is, how useful the idea is, and how clearly it’s presented. Several authors seem to think that the last of these should be downplayed. As program chairs, we don’t necessarily agree. It’s true that it is a great shame when a great idea is buried in poor presentation, but it’s also true that the objective of a conference is communication, and therefore clarity of presentation definitely plays a role. However, it’s clear that all three criteria are a matter of academic judgement: that of the reviewers, the area chair and the quad groups in the teleconferences. All the evidence we’ve seen is that reviewers and area chairs did weigh these aspects carefully, but that doesn’t mean that all their decisions can be shown to be right, because they are often a matter of perspective. Naturally authors are upset when what feels like a perfectly good paper is rejected on more subjective grounds. Most of the queries are on papers where this is felt to be the case.

There has also been one query on process: whether we did enough to evaluate these criteria, for those papers in the borderline area, before author rebuttal. Authors are naturally upset when the area chair raises such issues in the final decision’s meta-review when those points weren’t there before. Personally I sympathise with both authors and area chairs in this case. We made some effort to encourage authors to identify such papers before rebuttal (we sent out attention reports that highlighted probable borderline papers), but our main efforts at the time were chasing missing, inappropriate or insufficient reviews. We compressed a lot into a fairly short time, and it was also a period when many are on holiday. We were very pleased with the performance of our area chairs, but I think it’s also unsurprising if an area chair didn’t have time to carefully think through these aspects before author rebuttal.

My own feeling is that the space constraint on NIPS is rather artificial, and a lot of these problems would be avoided if it wasn’t there. However, there is a counter argument that to be a top quality conference NIPS has to have a high reject rate: NIPS is used in tenure cases within the US, and these statistics are important there. I reject these ideas: I don’t think the role of a conference is to allow people to get promoted in a particular country, nor is that the role of a journal; both are involved in the communication and debate of scientific ideas. However, I do not view the program chair’s role as reforming the conference ‘in their own image’. You also have to consider what NIPS means to its different participants.

NIPS as Christmas

I came up with an analogy for this which has NIPS in the role of Christmas (you can substitute Thanksgiving, Chinese New Year, or your favourite traditional feast). In the UK Christmas is a traditional holiday about which people have particular expectations, some of them major (there should be turkey for Christmas dinner) and some of them minor (there should be an old Bond movie on TV). These expectations have changed over time. The Victorians used to eat goose, and the Christmas tree was introduced from Germany through Prince Albert’s influence in the Royal Household; they also didn’t have James Bond, I think they used Charles Dickens instead. However, you can’t just change Christmas ‘overnight’; it needs to be a smooth transition. You can make lots of arguments about how Christmas could be a better meal, or that presents make the occasion too commercial, but people have expectations, so the only way to make change is slowly, taking small steps in the right direction. For any established successful venture this approach makes a lot of sense. There are many more ways to fail than to be successful, and I think the rough argument is that if you are starting from a point of success you should be careful about how quickly you move, because you are likely to end up in failure. However, not moving at all also leads to failure. I think this year we’ve introduced some innovations and an analysis of the process that will hopefully lead to improvements. We certainly aren’t alone in these innovations; each NIPS before us has done the same thing (I’m a particular fan of Zoubin and Max’s publication of the reviews). Whether we did this well or not, like those borderline papers, is a matter for academic judgement. In the meantime I (personally) will continue to try to enjoy NIPS for what it is, whilst wondering about what it could be and how we might get there.
I also know that as a community we will continue to innovate, launching new conferences with new models for reviewing (like ICLR).

Reviewer Calibration for NIPS

One issue that can occur for a conference is differences in interpretation of the reviewing scale. For a number of years (dating back to at least NIPS 2002) mis-calibration between reviewers has been corrected for with a model. Area chairs see not just the actual scores of the paper, but also ‘corrected scores’. Both are used in the decision making process.

Reviewer calibration at NIPS dates back to a model first implemented in 2002 by John Platt when he was an area chair. It’s a regularized least squares model that Chris Burges and John wrote up in 2012. They’ve kindly made their write up available here.

Calibrated scores are used alongside original scores to help in judging the quality of papers.

We also knew that Zoubin and Max had modified the model last year, along with their program manager Hong Ge. Before going through the previous work, however, we first approached the question independently. The model we came up with turned out to be pretty much identical to that of Hong, Zoubin and Max, and the approach we are using to compute probability of accept was also identical. The model is a probabilistic reinterpretation of the Platt and Burges model: one that treats the bias parameters and quality parameters as latent variables that are normally distributed. Marginalizing out the latent variables leads to an ANOVA style description of the data.

The Model

Our assumption is that the score from the jth reviewer for the ith paper is given by

y_{i,j} = f_i + b_j + \epsilon_{i, j}

where f_i is the objective quality of paper i and b_j is an offset associated with reviewer j. \epsilon_{i,j} is a subjective quality estimate which reflects how a specific reviewer’s opinion differs from other reviewers (such differences in opinion may be due to differing expertise or perspective). The underlying ‘objective quality’ of the paper is assumed to be the same for all reviewers and the reviewer offset is assumed to be the same for all papers.

If we have n papers and m reviewers then this implies n + m + nm values need to be estimated. Of course, in practice, the matrix is sparse, and we have no way of estimating the subjective quality for paper-reviewer pairs where no assignment was made. However, we can firstly assume that the subjective quality is drawn from a normal density with variance \sigma^2

\epsilon_{i, j} \sim N(0, \sigma^2)

which reduces us to n + m + 1 parameters. The Platt-Burges model then estimated these parameters by regularized least squares. Instead, we follow Zoubin, Max and Hong’s approach of treating these values as latent variables. We assume that the objective quality, f_i, is also normally distributed with mean \mu and variance \alpha_f,

f_i \sim N(\mu, \alpha_f)

this now reduces us to m + 3 parameters. However, we only have approximately 4m observations (around 4 papers per reviewer), so the parameters may still not be that well determined (particularly for those reviewers who have only one review). We therefore also assume that the reviewer offset is a zero mean normally distributed latent variable,

b_j \sim N(0, \alpha_b),

leaving us only four parameters: \mu, \sigma^2, \alpha_f and \alpha_b. When we combine these assumptions together we see that our model assumes that any given review score is a combination of 3 normally distributed factors: the objective quality of the paper (variance \alpha_f), the subjective quality of the paper (variance \sigma^2) and the reviewer offset (variance \alpha_b). The a priori marginal variance of a reviewer-paper assignment’s score is the sum of these three components. Cross-correlations between reviewer-paper assignments occur if either the reviewer is the same (when the cross covariance is given by \alpha_b) or the paper is the same (when the cross covariance is given by \alpha_f). With a constant mean coming from the mean of the ‘objective quality’, this gives us a joint model for reviewer scores as follows:

\mathbf{y} \sim N(\mu \mathbf{1}, \mathbf{K})

where \mathbf{y} is a vector of stacked scores, \mathbf{1} is the vector of ones, and the elements of the covariance matrix are given by

k(i,j; k,l) = \delta_{i,k} \alpha_f + \delta_{j,l} \alpha_b + \delta_{i, k}\delta_{j,l} \sigma^2

where i and j are the index of the paper and reviewer in the rows of \mathbf{K} and k and l are the index of the paper and reviewer in the columns of \mathbf{K}.
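As a concrete sketch (not the conference code), the covariance matrix \mathbf{K} can be assembled directly from the delta functions above. The review assignments and variance components here are made up for illustration:

```python
import numpy as np

# Six made-up reviews of three papers by three reviewers; the variance
# components (alpha_f: objective quality, alpha_b: reviewer offset,
# sigma2: subjective quality) are illustrative values only.
alpha_f, alpha_b, sigma2 = 1.25, 0.25, 1.0
papers = np.array([0, 0, 1, 1, 2, 2])      # paper index of each review
reviewers = np.array([0, 1, 1, 2, 0, 2])   # reviewer index of each review

same_paper = (papers[:, None] == papers[None, :]).astype(float)
same_reviewer = (reviewers[:, None] == reviewers[None, :]).astype(float)

# k(i,j; k,l) = delta_{ik} alpha_f + delta_{jl} alpha_b + delta_{ik} delta_{jl} sigma^2
K = alpha_f * same_paper + alpha_b * same_reviewer + sigma2 * same_paper * same_reviewer
```

Two reviews of the same paper share covariance \alpha_f, two reviews by the same reviewer share \alpha_b, and the subjective term appears only on the diagonal, since each paper-reviewer pair occurs once.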

It can be convenient to reparameterize slightly into an overall scale \alpha_f, and normalized variance parameters,

k(i,j; k,l) = \alpha_f(\delta_{i,k} + \delta_{j,l} \frac{\alpha_b}{\alpha_f} + \delta_{i, k}\delta_{j,l} \frac{\sigma^2}{\alpha_f})

which we rewrite in terms of two ratios: the offset/objective quality ratio \hat{\alpha}_b = \alpha_b / \alpha_f and the subjective/objective quality ratio \hat{\sigma}^2 = \sigma^2 / \alpha_f.

k(i,j; k,l) = \alpha_f(\delta_{i,k} + \delta_{j,l} \hat{\alpha}_b + \delta_{i, k}\delta_{j,l} \hat{\sigma}^2)

The advantage of this parameterization is that it allows us to optimize \alpha_f directly through maximum likelihood (via a fixed point equation). This leaves us with two free parameters which we can explore on a grid.

We expect both \mu and \alpha_f to be very well determined due to the number of observations in the data. The negative log likelihood is

\frac{|\mathbf{y}|}{2}\log2\pi\alpha_f + \frac{1}{2}\log \left|\hat{\mathbf{K}}\right| + \frac{1}{2\alpha_f}\mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}

where |\mathbf{y}| is the length of \mathbf{y} (i.e. the number of reviews) and \hat{\mathbf{K}}=\alpha_f^{-1}\mathbf{K} is the scale normalised covariance. This negative log likelihood is easily minimized to recover

\alpha_f = \frac{1}{|\mathbf{y}|} \mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}
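A sketch of this closed-form update, with the ratio parameters \hat{\alpha}_b and \hat{\sigma}^2 held at made-up values and \mu crudely estimated by the sample mean (the assignments and scores are illustrative, not conference data):

```python
import numpy as np

# Made-up review assignments, scores and ratio parameters; this only
# illustrates the maximum-likelihood update for alpha_f given the ratios.
hat_alpha_b, hat_sigma2 = 0.2, 0.8
papers = np.array([0, 0, 1, 1, 2, 2])
reviewers = np.array([0, 1, 1, 2, 0, 2])
y = np.array([7.0, 6.0, 5.0, 8.0, 4.0, 6.0])

same_p = (papers[:, None] == papers[None, :]).astype(float)
same_r = (reviewers[:, None] == reviewers[None, :]).astype(float)
# Scale-normalised covariance K_hat = K / alpha_f
K_hat = same_p + hat_alpha_b * same_r + hat_sigma2 * same_p * same_r

y_centred = y - y.mean()  # crude stand-in for estimating mu
alpha_f = y_centred @ np.linalg.solve(K_hat, y_centred) / len(y)
```

In a grid search over the two ratios, this update would be recomputed at each grid point before evaluating the likelihood.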

A Bayesian analysis of the \alpha_f parameter is possible with gamma priors, but it would merely show that this parameter is extremely well determined (the degrees of freedom parameter of the associated Student-t marginal likelihood scales with the number of reviews, which will be around |\mathbf{y}| \approx 6,000 in our case).

We can set these parameters by maximum likelihood, and then remove the offset from the model by computing the conditional distribution over the paper scores with the bias removed, s_{i,j} = f_i + \epsilon_{i,j}. This conditional distribution is found as

\mathbf{s}|\mathbf{y}, \alpha_f,\alpha_b, \sigma^2 \sim N(\boldsymbol{\mu}_s, \boldsymbol{\Sigma}_s)


\boldsymbol{\mu}_s = \mathbf{K}_s\mathbf{K}^{-1}\mathbf{y}


\boldsymbol{\Sigma}_s = \mathbf{K}_s - \mathbf{K}_s\mathbf{K}^{-1}\mathbf{K}_s

and \mathbf{K}_s is the covariance associated with the quality terms only with elements given by,

k_s(i,j;k,l) = \delta_{i,k}(\alpha_f + \delta_{j,l}\sigma^2).

We now use \boldsymbol{\mu}_s (which is both the mode and the mean of the posterior over \mathbf{s}) as the calibrated quality score.
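Putting the pieces together, a self-contained sketch of computing the calibrated score \boldsymbol{\mu}_s. The assignments and variance components are again made up, and the mean \mu is handled by centring, a detail the equations above suppress:

```python
import numpy as np

# Made-up review data and variance components, for illustration only.
alpha_f, alpha_b, sigma2, mu = 1.0, 0.2, 0.8, 6.0
papers = np.array([0, 0, 1, 1, 2, 2])
reviewers = np.array([0, 1, 1, 2, 0, 2])
y = np.array([7.0, 6.0, 5.0, 8.0, 4.0, 6.0])

same_p = (papers[:, None] == papers[None, :]).astype(float)
same_r = (reviewers[:, None] == reviewers[None, :]).astype(float)

K = alpha_f * same_p + alpha_b * same_r + sigma2 * same_p * same_r
# k_s(i,j; k,l) = delta_{ik} (alpha_f + delta_{jl} sigma^2): quality terms only
K_s = same_p * (alpha_f + sigma2 * same_r)

# Posterior mean of the bias-free scores s = f + epsilon
mu_s = mu + K_s @ np.linalg.solve(K, y - mu)
```

Each entry of `mu_s` is the score with the estimated reviewer offset regressed away, which is what the area chairs saw alongside the raw scores.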

Analysis of Variance

The model above is a type of Gaussian process model with a specific covariance function (or kernel). The variances are highly interpretable, though, because the covariance function is made up of a sum of effects. Studying these variances is known in statistics as analysis of variance, and such models, known as ANOVA models, are commonly used for modelling batch effects. It is easy to extend this model to include batch effects such as whether or not the reviewer is a student or whether or not the reviewer has published at NIPS before. We will conduct these analyses in due course. Last year, Zoubin, Max and Hong explored whether reviewer confidence could be included in the model, but they found it did not help with performance on held-out data.

Scatter plot of Quality Score vs Calibrated Quality Score


Probability of Acceptance

To predict the probability of acceptance of any given paper, we sample from the multivariate normal posterior over \mathbf{s}. The samples are sorted according to the values of \mathbf{s}, and the top scoring papers are considered to be accepts. We draw 1000 samples and compute the probability of acceptance for each paper by counting how many times the paper received a positive outcome across the thousand samples.
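A minimal sketch of this Monte Carlo procedure. The posterior mean and covariance here are made-up stand-ins for \boldsymbol{\mu}_s and \boldsymbol{\Sigma}_s, and a quota of 2 accepts out of 5 papers replaces the real accept quota:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up posterior over five papers' calibrated scores.
mu_s = np.array([6.5, 6.1, 6.0, 5.9, 5.2])
Sigma_s = 0.1 + 0.4 * np.eye(5)  # mildly correlated posterior, illustrative
n_samples, quota = 1000, 2       # accept the top 2 papers in each sample

samples = rng.multivariate_normal(mu_s, Sigma_s, size=n_samples)
top = np.argsort(-samples, axis=1)[:, :quota]  # accepted papers per sample
p_accept = np.bincount(top.ravel(), minlength=len(mu_s)) / n_samples
```

Each sample accepts exactly `quota` papers, so the acceptance probabilities sum to the quota; papers with higher calibrated scores end up with correspondingly higher probabilities.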

NIPS Reviewer Recruitment and ‘Experience’

Triggered by a question from Christoph Lampert as a comment on a previous blog post on reviewer allocation, I thought I’d post about how we did reviewer recruitment, and what the profile of reviewer ‘experience’ is, as defined by their NIPS track record.

I wrote this blog post, but it ended up being quite detailed, so Corinna suggested I put the summary of reviewer recruitment first, which makes a lot of sense. If you are interested in the details of our reviewer recruitment, please read on to the section below ‘Experience of the Reviewing Body’.


As a summary, I’ve imagined two questions and given answers below:

  1. I’m an Area Chair for NIPS, how did I come to be invited?
    You were personally known to one of the Program Chairs as an expert in your domain who had good judgement about the type and quality of papers we are looking to publish at NIPS. You have a strong publication track record in your domain. You were known to be reliable and responsive. You may have a track record of workshop organization in your domain and/or experience in area chairing previously at NIPS or other conferences. Through these activities you have shown community leadership.
  2. I’m a reviewer for NIPS, how did I come to be invited?
    You could have been invited for one of several reasons:

    • you were a reviewer for NIPS in 2013
    • you were a reviewer for AISTATS in 2012
    • you were personally recommended by an Area Chair or a Program Chair
    • you have been on a Program Committee (i.e. you were an Area Chair) at a leading international conference in recent years (specifically NIPS since 2000, ICML since 2008, AISTATS since 2011).
    • you have published 2 or more papers at NIPS since 2007
    • you published at NIPS in either 2012 or 2013 and your publication track record was personally reviewed and approved by one of the Program Chairs.

Experience of The Reviewing Body

That was the background to Reviewer and Area Chair recruitment, and it is also covered below, in much more detail than perhaps anyone could wish for! Now, for those of you who have got this far, we can look at the result in terms of one way of measuring reviewer experience. Our aim was to increase the number of reviewers while maintaining or increasing the quality of the reviewing body. Of course quality is subjective, but we can look at things such as reviewer experience in terms of how many NIPS publications they have had. Note that we have purposefully selected many reviewers and area chairs who have never previously published at NIPS, so this is clearly not the only criterion for experience, but it is one that is easily available to us and, given Christoph’s question, the statistics may be of wider interest.

Reviewer NIPS Publication Record

Firstly we give the histograms for cumulative reviewer publications. We plot two histograms, publications since 2007 (to give an idea of long term trends) and publications since 2012 (to give an idea of recent trends).

Reviewer Publications 2007

Histogram of NIPS 2014 reviewers publication records since 2007.

Our most prolific reviewer has published 22 times at NIPS since 2007! That’s an average of over 3 per year (for comparison, I’ve published 7 times at NIPS since 2007).

Looking more recently we can get an idea of the number of NIPS publications reviewers have had since 2012.

Histogram of NIPS 2014 reviewers publication records since 2012.

Impressively, the most prolific reviewer has published 10 papers at NIPS over the last two years, and intriguingly it is not the same reviewer that has published 22 times since 2007. The mode of 0 publications is unsurprising, and comparing the histograms it looks like about 200 of our reviewing body haven’t published at NIPS in the last two years, but have published there since 2007.

Area Chair Publication Record

We have got similar plots for the Area Chairs. Here is the histogram since 2007.

Area Chair Publications 2007

Histogram of NIPS 2014 Area Chair’s publication records since 2007.

Note that we’ve selected 16 Area Chairs who haven’t published at NIPS before. People who aren’t regular to NIPS may be surprised at this, but I think it reflects the openness of the community to other ideas and new directions for research. NIPS has always been a crossroads between traditional fields, and that is one of its great charms. As a result, NIPS publication record is a poor proxy for ‘experience’ where many of our area chairs are concerned.

Looking at the more recent publication track record for Area Chairs we have the following histogram.

Area Chair Publications 2012

Histogram of NIPS 2014 Area Chair’s publication records since 2012.

Here we see that a considerable portion of our Area Chairs haven’t published at NIPS in the last two years. I also find this unsurprising. I’ve only published one paper at NIPS since then (that was at NIPS 2012; the group’s NIPS 2013 submissions were both rejected, although I think my overall ‘hit rate’ for NIPS success is still around 50%).

Details of the Recruitment Process

Below are all the gritty details in terms of how things actually panned out in practice for reviewer recruitment. This might be useful for other people chairing conferences in the future.

Area Chair Recruitment

The first stage is invitation of area chairs. To ensure we got the correct distribution of expertise among area chairs, we invited in waves. Max and Zoubin gave us information about the subject distribution of the previous year’s NIPS submissions, which gave us a rough number of area chairs required for each area. We had compiled a list of 99 candidate area chairs by mid January 2014; coverage here matched the subject coverage of the previous year’s conference. The Area Chairs are experts in their field: the majority are people that either Corinna or I have worked with directly or indirectly, and others have a long track record of organising workshops and demonstrating thought leadership in their subject area. It’s their judgement on which we’ll be relying for paper decisions. As capable and active researchers they are in high demand for a range of activities (journal editing, program chairing other conferences, organizing workshops etc.). This, combined with the demands of everyday life (including family illnesses, newly born children etc.), means that not everyone can accept the demands on time that being an area chair makes. As well as being involved in reviewer recruitment, assignment and paper discussion, area chairs need to be available for video conference meetings to discuss their allocation and make final recommendations on their papers, all across periods of the summer when many are on vacation. Of our original list of 99 invites, 56 were available to help out. This then allowed us to refocus on areas where we’d missed out on Area Chairs. By early March we had a list of 57 further candidate area chairs; of these, 36 were available to help out. Finally we recruited a further 3 Area Chairs in early April, targeted at areas where we felt we were still short of expertise.

Reviewer Recruitment

Reviewer recruitment consists of identifying suitable people and inviting them to join the reviewing body. This process is completed in collaboration with the Area Chairs, who nominate reviewers in their domains. For NIPS 2014 we were targeting 1400 reviewers to account for our duplication of papers and the anticipated increase in submissions. There is no unified database of machine learning expertise, and the history of who reviewed in which years for NIPS is currently not recorded. This means that, year to year, we are typically only provided with those people who agreed to review in the previous year as our starting point for compiling this list. From February onwards Corinna and I focussed on increasing this starting number. NIPS 2013 had 1120 reviewers and 80 area chairs; these names formed the core starting point for invitations. Further, since I program chaired AISTATS in 2012 we also had the list of reviewers who’d agreed to review for that conference (400 reviewers, 28 area chairs). These names were also added to our initial list of candidate reviewers (although, of course, some of these names had already agreed to be area chairs for NIPS 2014 and there were many duplicates in the lists).

Sustaining Expertise in the Reviewing Body

A major concern for Corinna and me was to ensure that we had as much expertise in our reviewing body as possible. Because of the way that reviewer names are propagated from year to year, and the fact that more senior people tend to be busier and therefore more likely to decline, many well known researchers’ names weren’t in this initial list. To rectify this we took from the web the lists of Area Chairs for all previous NIPS conferences going back to 2000, all ICML conferences going back to 2008 and all AISTATS conferences going back to 2011. We could have extended this search to COLT, COSYNE and UAI also. Back in 2000 there were only 13 Area Chairs at NIPS; by the time I first did the job in 2005 there were 19. Corinna and I both served on the last Program Committee to hold a physical meeting, in 2006, when John Platt was Program Chair. I remember having an above-average allocation of about 50-60 papers as Area Chair that year: I had about 20 papers on Gaussian processes and many more in dimensionality reduction, mainly on spectral approaches. Corinna also had a lot of papers that year because she was dealing with kernel methods. A more typical load, I think, was 30-40, and reviewer load was probably around 6-8. The physical meeting consisted of two days in a conference room discussing every paper in turn as a full program committee. That was also the last year of a single program chair. The early NIPS program committees read mainly as a “who’s who of machine learning”, and it sticks in my mind how carefully each chair went through each of the papers that were around the borderline of acceptance. Many papers were re-read at that meeting. Overall, incorporating the Area Chairs from these conferences added 160 new names to the list of candidate reviewers, giving us around 1600 candidate reviewers in total.
Note that the sort of reviewing expertise we are after is not only the technical expertise necessary to judge the correctness of a paper. We are looking for reviewers who can judge whether the work is going to be of interest to the wider NIPS community and whether its ideas are likely to have significant impact. These latter two judgements are perhaps more subjective, and may require more experience than the first. However, the quality of papers submitted to NIPS is very high, and a very large portion of those submitted are technically correct. The objective of NIPS is not then to select the papers that are the ‘most technical’, but those that are likely to have an influence on the field. This is where understanding of likely impact is so important. To this end, Max and Zoubin introduced an ‘impact’ score, with the precise intent of reminding reviewers to think about this aspect. However, if the focus is too much on the technical side, then a paper that is highly complex from a technical standpoint, but less likely to influence the direction of the field, may be more likely to be accepted than a paper that contains a potentially very influential idea but doesn’t present a strong technical challenge. Ideally then, a paper should have a distribution of reviewers who aren’t purely experts in the particular technical domain from which the paper arises, but are also informed experts in the wider context of where the paper sits. The role of the Area Chair is also important here.

The next step in reviewer recruitment was to involve the Area Chairs in adding to the list in areas where we had missed people. This is also an important route for new and upcoming NIPS researchers to become involved in reviewing. We provided Area Chairs with access to the list of candidate reviewers and asked them to add names of experts who they would like to recruit but weren’t currently in the list. This led to a further 220 names.

At this point we had also begun to invite reviewers. Reviewer invitation was done in waves. We started with the first wave of around 1600-1700 invites in mid-April. At that point, the broad form of the Program Committee was already resolved. Acceptance rates for reviewer invites indicated that we weren’t going to hit our target of 1400 reviewers with our candidate list. By the end of April we had around 1000 reviewers accepted, but we were targeting another 400 reviewers to ensure we could keep reviewer load low.

A final source of candidates was from Chris Hiestand. Chris maintains the NIPS database of authors and presenters on behalf of the NIPS foundation, which gave us another potential source of reviewers. We considered all authors who had 2 or more NIPS papers since 2007. We’d initially intended to restrict this number to 3, but that gained us only 91 more new candidate reviewers (because most of the names were in our candidate list already); relaxing this constraint to 2 led to 325 new candidate reviewers. These additional reviewers were invited at the end of April. However, even with this group, we were likely to fall short of our target.

Our final group of reviewers came from authors who published either at NIPS 2013 or NIPS 2012. However, authors who have published only one paper are not necessarily qualified to review at NIPS. For example, the author may be a collaborator from another field. There were 697 authors who had one NIPS paper in 2012 or 2013 and were not in our current candidate list. For these 697 authors, we felt it was necessary to go through each author individually, checking their track record through web searches (DBLP and Google Scholar as well as web pages) and ensuring they had the necessary track record to review for NIPS. This process resulted in an additional 174 candidate reviewer names. The remainder we either were unable to identify on the web (169 people) or had a track record where we couldn’t be confident about their ability to review for NIPS without a personal recommendation (369 people). This final wave of invites went out at the beginning of May and also included new reviewer suggestions from Area Chairs, as well as invites to candidate Area Chairs who had not been able to commit to Area Chairing but might have been able to commit to reviewing. Again, we wanted to ensure the expertise of the reviewing body was as highly developed as possible.

This meant that by the submission deadline we had 1390 reviewers in the system. By 15th July this number had increased slightly, because during paper allocation Area Chairs recruited additional specific reviewers to handle particular papers where they felt the available reviewers didn’t have the correct expertise. This means that currently we have exactly 1400 reviewers. This total comes from around 2255 invitations to review.

Overall, reviewer recruitment took up a very large amount of time, distributed over many weeks. Keeping track of who had been invited already was difficult, because we didn’t have a unique ID for our candidate reviewers. We have a local SQLite database that indexes on email, and we try to check for matches based on names as well. Most of these checks are done in Python code which is now available on the github repository here, along with IPython notebooks that did the processing (with identifying information removed). Despite care taken to ensure we didn’t add potential reviewers twice to our database, several people received two invites to review. Very often, they also didn’t notice that they were separate invites, so they agreed to review twice for NIPS. Most of these duplications were picked up at some point before paper allocation, and they tended to arise for people whose names can be rendered in multiple ways (e.g. because of accents) or who have multiple email addresses (e.g. due to a change of affiliation).
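A minimal sketch of the kind of duplicate check described above, using an in-memory SQLite table keyed on email with an accent-stripped name comparison. The table schema and the `add_reviewer` helper are illustrative assumptions, not the actual code from the repository:

```python
import sqlite3
import unicodedata

def normalize_name(name):
    """Strip accents and case so e.g. an accented and unaccented
    rendering of the same name compare equal."""
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(c for c in decomposed
                   if not unicodedata.combining(c)).lower()

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE reviewers (email TEXT PRIMARY KEY, name TEXT, norm_name TEXT)')

def add_reviewer(email, name):
    """Add a candidate reviewer, flagging likely duplicates by
    email or by normalized name."""
    norm = normalize_name(name)
    cur = conn.execute(
        'SELECT email FROM reviewers WHERE email = ? OR norm_name = ?',
        (email.lower(), norm))
    match = cur.fetchone()
    if match:
        return 'possible duplicate of ' + match[0]
    conn.execute('INSERT INTO reviewers VALUES (?, ?, ?)',
                 (email.lower(), name, norm))
    return 'added'

print(add_reviewer('jose@uni-a.edu', 'José Pérez'))    # → added
print(add_reviewer('jperez@uni-b.edu', 'Jose Perez'))  # → possible duplicate of jose@uni-a.edu
```

The second call is caught by the name match even though the email differs, which is exactly the multiple-addresses case that caused the duplicate invitations.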


Firstly, NIPS uses the CMT system for conference management. In an ideal world, choice of management system shouldn’t dictate how you do things, but in practice particularities of the system can affect our choices. CMT doesn’t store a unique profile for conference reviewers (unlike, for example, EasyChair, which stores every conference you’ve submitted to or reviewed/chaired for). This means that from year to year, information about the previous year’s reviewers isn’t necessarily passed in a consistent way between program chairs. Corinna and I requested that the CMT set-up for our year copy across the reviewers from NIPS 2013, along with their subject areas and conflicts, to try to alleviate this. The NIPS program committee in 2013 consisted of 1120 reviewers and 80 area chairs. Corinna and I set a target of 1400 reviewers and 100 area chairs. This was to account for (a) an increase in submissions of perhaps 10% and (b) duplication of papers for independent reviewing at a level of around 10%.