Proceedings of Machine Learning Research

Back in 2006, when the wider machine learning community was becoming aware of Gaussian processes (mainly through the publication of the Rasmussen and Williams book), Joaquin Quinonero Candela, Anton Schwaighofer and I organised the Gaussian Processes in Practice workshop at Bletchley Park. We planned a short proceedings for the workshop, but when I contacted Springer's LNCS proceedings, a rather dismissive note came back with an associated prohibitive cost. Given that the ranking of LNCS wasn't (and never has been) that high, this seemed a little presumptuous on their part. In response I contacted JMLR and asked if they'd ever considered a proceedings track. The result was that I was asked by Leslie Pack Kaelbling to launch the proceedings track.

JMLR isn't just open access: there is also no charge to authors. It is hosted on servers at MIT and managed by the community.

We launched the proceedings in March 2007 with the first volume, from the Gaussian Processes in Practice workshop. Since then there have been 38 volumes, including two currently in the pipeline. The proceedings publishes several leading conferences in machine learning, including AISTATS, COLT and ICML.

From the start we felt that it was important to share the branding of JMLR with the proceedings, to show that the publication was following the same ethos as JMLR. However, this led to the rather awkward name: JMLR Workshop and Conference Proceedings, or JMLR W&CP. Following discussion with the senior editorial board of JMLR we now feel the time is right to rebrand with the shorter “Proceedings of Machine Learning Research”.

As part of the rebranding process, the editorial team for the Proceedings of Machine Learning Research (which consists of Mark Reid and myself) is launching a small consultation exercise looking for suggestions on how we can improve the service for the community. Please feel free to leave comments on this blog post, or via Facebook or Twitter, to give us your feedback!

NIPS: Decision Time

Thursday 28th August

In the last two days I've spent nearly 20 hours in teleconferences; my last scheduled teleconference will start in about half an hour. Given the available 25 minutes, it seemed to make sense to try to put down some thoughts about the decision process.

The discussion period has been constant: there is a stream of incoming queries from Area Chairs, requests for advice on additional reviewers, and questions about how to resolve deadlocked or disputed reviews. Corinna has handled many of these.

Since the author rebuttal period, all the papers have been distributed into Google spreadsheets, which are updated daily. They contain paper titles, reviewer names, quality scores, calibrated scores, a probability of accept (under our calibration model), a list of bot-compiled potential issues, as well as columns for accept/reject and poster/spotlight. Area chairs have been working in buddy pairs, ensuring that a second set of eyes can rest on each paper. For those papers around the borderline, or with contrasting reviews, the discussion period really can have an effect; we see this when calibrating the reviewer scores: over time the reviewer bias reduces and the scores become more consistent. For this reason we allowed this period to go on a week longer than originally planned, and we've been compressing our teleconferences into the last few days.
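
For the curious, here is a minimal sketch of what such a reviewer calibration might look like: each observed score is treated as paper quality plus a per-reviewer bias, fitted by regularised least squares. This is an illustration rather than the actual model we used, and all the names in it are made up.

```python
# Illustrative only: a simple additive calibration model where each score is
# modelled as (paper quality) + (reviewer bias), fitted by ridge-regularised
# least squares. Not the actual model used for NIPS.
import numpy as np

def calibrate(scores):
    """scores: iterable of (paper_id, reviewer_id, score) triples."""
    scores = list(scores)
    papers = sorted({p for p, r, s in scores})
    reviewers = sorted({r for p, r, s in scores})
    p_idx = {p: i for i, p in enumerate(papers)}
    r_idx = {r: i for i, r in enumerate(reviewers)}
    # One indicator column per paper (quality) and per reviewer (bias).
    X = np.zeros((len(scores), len(papers) + len(reviewers)))
    y = np.array([s for p, r, s in scores], dtype=float)
    for n, (p, r, s) in enumerate(scores):
        X[n, p_idx[p]] = 1.0
        X[n, len(papers) + r_idx[r]] = 1.0
    lam = 1.0  # small ridge term keeps the quality/bias split identifiable
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    quality = dict(zip(papers, w[:len(papers)]))
    bias = dict(zip(reviewers, w[len(papers):]))
    return quality, bias
```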

Most teleconferences consist of two buddy pairs coming together to discuss their papers. Perhaps ideally the pairs would have a similar subject background, but constraints of time zone and the fact that there isn’t a balanced number of subject areas mean that this isn’t necessarily the case.

Corinna and I have been following a similar format: listing the papers from highest scoring to lowest, and starting at the top. For each paper, if it is a confident accept, we try to identify whether it might be a talk or a spotlight. This is where the opinion of a range of Area Chairs can be very useful. For uncontroversial accepts that aren't nominated for orals we spend very little time. This proceeds until we start reaching borderline papers, those in the 'grey area': typically papers with an average score around 6. They fall broadly into two categories: those where the reviewers disagree (e.g. scores of 8, 6, 4), and those where the reviews are consistent but the reviewers, perhaps, feel underwhelmed (scores of 6, 6, 6). Area chairs will often work hard to try to get one of the reviewers to 'champion' a paper: it's a good sign if a reviewer has been prepared to argue the case for a paper in the discussion. However, the decisions in this region are still difficult. It is clear that we are rejecting some very solid papers, for reasons of space and because of the overall quality of submissions. It's hard for everyone to be on the 'distributing' end of this system, but at the same time, we've all been on the receiving end of it too.

In this difficult 'grey area' for acceptance, we are looking for sparks in a paper that push it over the edge to acceptance. So what sort of thing catches an area chair's eye? A new direction is always welcome, but often leads to higher variance in the reviewer scores: not all reviewers are necessarily comfortable with the unfamiliar. But if an area chair feels a paper is taking the machine learning field somewhere new, then even if the paper has some weaknesses (e.g. in evaluation, or in giving context and detailed derivations) we might be prepared to overlook them. We look at the borderline papers in some detail, scanning the reviews, looking for words like 'innovative', 'new directions' or 'strong experimental results'. If we see these then as program chairs we definitely become more attentive. We all remember papers presented at NIPS in the past that led to revolutions in the way machine learning is done. Both Corinna and I would love to have such papers at 'our' NIPS.

A paper in a more developed area will be expected to do a more rounded job in terms of setting the context and performing the evaluation, and to hit a higher standard overall.

It is often helpful to have an extra pair of eyes (or even two pairs) run through the paper. Each teleconference call normally ends with a few follow-up actions for a different area chair to look through a paper or clarify a particular point. Sometimes we also call in domain experts, who may have already produced four formal reviews of other papers, just to get clarification on a particular point. This certainly doesn't happen for all papers, but those with scores around 7, 6, 6 or 6, 6, 6 or 8, 6, 4 often get this treatment. Much depends on the discussion and content of the existing reviews, but there are still, often, final checks that need carrying out. From a program chair's perspective, the most important thing is that the Area Chair is comfortable with the decision, and I think most of the job is acting as a sounding board for the Area Chair's opinion, which I try to reflect back to them. In the same manner as rubber duck debugging, just vocalising the issues sometimes causes them to be crystallised in the mind. Ensuring that Area Chairs are calibrated to each other is also important. The global probabilities of accept from the reviewer calibration model really help here. As we go through papers I keep half an eye on those, not to influence the decision on a particular paper so much as to ensure that at the end of the process we don't have a surplus of accepts. At this stage all decisions are tentative, but we hope not to have to come back to too many of them.

Monday 1st September

Corinna finished her last video conference on Friday; Saturday, Sunday and Monday (Labor Day) were filled with making final decisions on accepts, then talks and finally spotlights. Accepts were hard: we were unable to take all the papers that were possible accepts, as we would have gone way over our quota of 400. We also had to make a decision on duplicated papers where the decisions were in conflict; more details of this to come at the conference. Remembering what a pain it was to do the schedule after the acceptances, and following advice from Leon Bottou that the talk program should emerge to reflect the accepted posters, we finalised the talk and spotlight program by placing talks and spotlights directly into the schedule. We had to hone the talks down to 20 from about 40 candidates, and for spotlights we squeezed in 62 from over a hundred suggestions. We spent three hours in teleconference each day, as well as preparation time, across the Labor Day weekend putting together the first draft of the schedule. It was particularly impressive how quickly area chairs responded to our follow-up queries from the teleconference notes, particularly those in the US who were enjoying the traditional last weekend of summer.

Tuesday 2nd September

I had an all-day meeting in Manchester for a network of researchers focussed on mental illness. It was really good to have a day discussing research, my first in a long time. I thought very little about NIPS until, on the train home, I had a little look at the shape of the conference. I actually ended up looking at a lot of the papers we rejected, many from close colleagues and friends. I found it a little depressing. I have no doubt there is a lot of excellent work there, and I know how disappointed my friends and colleagues will be to receive those rejections. We did an enormous amount to ensure that the process was right, and I have every confidence in the area chairs and reviewers. But at the end of the day, you know that you will be rejecting a lot of good work. It brought to mind a thought I had at the allocation stage. When we had the draft allocation for each area chair, I went through several of them sanity checking the quality of the allocation. Naturally, I checked those associated with area chairs who are closer to my own areas of expertise. I looked through the paper titles, and I couldn't help but think what a good workshop each of those allocations would make. There would be some great ideas and some partially developed ideas. There would be some really great experiments and some weaker experiments. But there would be a lot of debate at such a workshop. None, or very few, of the papers would be uninteresting. There would certainly be errors in papers, but that's one of the charms of a workshop: there's still a lot more to be said about an idea when it's presented at a workshop.

Friday 5th September

Returning from an excellent two-day UCL-Duke workshop. There is a lot of curiosity about the NIPS experiment, but Corinna and I have agreed to keep the results embargoed until the conference.

Saturday 6th September

Area chairs had until Thursday to finalise their reviews in the light of the final decisions, and also to raise any concerns they had about those decisions. My own experience of area chairing is that you can have doubts about your reasoning when you are forced to put pen to paper and write the meta-review. We felt it was important not to rush the final process, to allow any of those doubts to emerge. In the end, the final program has 3 or 4 changes from the draft we first distributed on Monday night, so there may be some merit in this approach. We had a further 3-hour teleconference today to go through the meta-reviews, with a particular focus on those for papers around the decision boundary. Other issues, such as comments in the wrong place (the CMT interface can be fairly confusing: 3% of meta-reviews were actually placed in the box meant for notes to the program chairs), were also covered. Our big concern was whether the area chairs had written a review consistent with our final verdict. A handy learning task would have been to build a sentiment model to predict accept/reject from the meta-review.
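
Just for illustration, here is the kind of thing I have in mind: a hypothetical bag-of-words sentiment classifier over meta-review text using scikit-learn. The example reviews and labels below are invented, not real data.

```python
# Hypothetical sketch: TF-IDF bag-of-words plus logistic regression over
# meta-review text. The example reviews and labels below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

meta_reviews = ["Clear accept, reviewers praise the strong experiments.",
                "Reviewers remained unconvinced after the rebuttal."]
decisions = [1, 0]  # 1 = accept, 0 = reject

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(meta_reviews, decisions)
print(model.predict(["Innovative idea but the evaluation is weak."]))
```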

Monday 8th September 

Our plan had been to release reviews this morning, but we were still waiting for a couple of meta-reviews to be tidied up and had an outstanding issue on one paper. I write this with CMT 'loaded' and ready to distribute decisions. However, when I preview the emails the variable fields are not filled in. If I hit 'send' I would send 5,000 emails that start "Dear $RecipientFirstName$", which sounds somewhat impersonal … although perhaps more critical is that the authors would be informed of the fate of paper "$Title$", which may lead to some confusion. CMT are in a different time zone, 8 hours behind. Fortunately, it is late here, so there is a good chance they will respond in time …

Tuesday 9th September

I was wide awake at 6:10 despite going to sleep at 2 am. I always remember when I was Area Chair with John Platt that he would be up late answering emails and then out of bed again 4 hours later doing it again. A few final checks and the all-clear: everything is there. I pressed the button at 6:22 … the emails are still going out and it is 10:47. 3854 of the 5615 emails have been sent … one reply so far, an out-of-office email from China. Time to make a coffee …

Final Statistics

1678 submissions
414 papers accepted
20 papers for oral
62 for spotlight
331 for poster
19 rejected without review

Epilogue to Decision Mail: So what was wrong with those variable names? I particularly like the fact that something different was wrong with each one. $RecipientFirstName$ and $RecipientEmail$ are not available in the "Notification Wizard", whereas they are in the normal email sending system. Then I got the other variables wrong, $Title$->$PaperTitle$ and $PaperId$->$PaperID$, but since neither of the two I knew to be right was working, I assumed there was something wrong with the whole variable substitution system … rather than it being that (at least) two of the variable types just happen to be missing from this wizard … CMT responded nice and quickly though … that's one advantage of working late.

Epilogue on Acceptances: At the time of the conference there were only 411 papers presented because three were withdrawn. Withdrawals were usually due to some deeper problem the authors had found in their own work, perhaps triggered by comments from reviewers. So in the end there were 411 papers accepted and 328 posters.

Author Concerns

So the decisions have been out for a few days now, and of course we have had some queries about our processes. Every one has been pretty reasonable, and the authors' frustration is understandable when three reviewers have argued for accept but the final decision is to reject. This is an issue with 'space-constrained' conferences. Whether a paper gets through in the end can depend on subjective judgements about the paper's qualities. In particular, we've been looking for three components: novelty, clarity and utility. Papers with borderline scores (and borderline here might mean that the average score is in the weak-accept range) are examined closely. The decision about whether such a paper is accepted necessarily comes down to judgement, because for a paper to score this highly the reviewers won't have identified a particular problem with it. What comes through is how novel the paper is, how useful the idea is, and how clearly it's presented. Several authors seem to think that the latter should be downplayed. As program chairs, we don't necessarily agree. It's true that it is a great shame when a great idea is buried in poor presentation, but it's also true that the objective of a conference is communication, and therefore clarity of presentation definitely plays a role. However, it's clear that all three criteria are a matter of academic judgement: that of the reviewers, the area chair and the quad groups in the teleconferences. All the evidence we've seen is that reviewers and area chairs did weigh these aspects carefully, but that doesn't mean that all their decisions can be shown to be right, because they are often a matter of perspective. Naturally authors are upset when what feels like a perfectly good paper is rejected on more subjective grounds. Most of the queries are on papers where this is felt to be the case.

There has also been one query on process: whether we did enough to evaluate these criteria, for those papers in the borderline area, before author rebuttal. Authors are naturally upset when the area chair raises such issues in the final decision's meta-review when those points weren't there before. Personally I sympathise with both authors and area chairs in this case. We made some effort to encourage authors to identify such papers before rebuttal (we sent out attention reports that highlighted probable borderline papers), but our main efforts at the time were spent chasing missing, inappropriate or insufficient reviews. We compressed a lot into a fairly short time, and it was also a period when many are on holiday. We were very pleased with the performance of our area chairs, but I think it's also unsurprising if an area chair didn't have time to carefully think through these aspects before author rebuttal.

My own feeling is that the space constraint on NIPS is rather artificial, and a lot of these problems would be avoided if it wasn't there. However, there is a counter-argument that suggests that to be a top quality conference NIPS has to have a high reject rate: NIPS is used in tenure cases within the US, and these statistics are important there. I reject these ideas: I don't think the role of a conference is to allow people to get promoted in a particular country, nor is that the role of a journal; they are both involved in the communication and debate of scientific ideas. However, I do not view the program chair's role as reforming the conference 'in their own image'. You also have to consider what NIPS means to its different participants.

NIPS as Christmas

I came up with an analogy for this which has NIPS in the role of Christmas (you can substitute Thanksgiving, Chinese New Year, or your favourite traditional feast). In the UK Christmas is a traditional holiday about which people have particular expectations, some of them major (there should be turkey for Christmas dinner) and some of them minor (there should be an old Bond movie on TV). These expectations have changed over time. The Victorians used to eat goose, the Christmas tree was introduced from Germany through Prince Albert's influence in the Royal Household, and they also didn't have James Bond; I think they used Charles Dickens instead. However, you can't just change Christmas 'overnight': it needs to be a smooth transition. You can make lots of arguments about how Christmas could be a better meal, or that presents make the occasion too commercial, but people have expectations, so the only way to make change is slowly, taking small steps in the right direction. For any established successful venture this approach makes a lot of sense. There are many more ways to fail than to be successful, and the rough argument is that if you are starting from a point of success you should be careful about how quickly you move, because you are likely to end up in failure. However, not moving at all also leads to failure. I think this year we've introduced some innovations and an analysis of the process that will hopefully lead to improvements. We certainly aren't alone in these innovations; each NIPS before us has done the same thing (I'm a particular fan of Zoubin and Max's publication of the reviews). Whether we did this well or not, like those borderline papers, is a matter for academic judgement. In the meantime I (personally) will continue to try to enjoy NIPS for what it is, whilst wondering about what it could be and how we might get there. I also know that as a community we will continue to innovate, launching new conferences with new models for reviewing (like ICLR).

Paper Allocation for NIPS

With luck we will release papers to reviewers early next week. The paper allocations are being refined by area chairs at the moment.

Corinna and I thought it might be informative to give details of the allocation process we used, so I'm publishing it here. Note that this automatic process just gives the initial allocation. The current stage we are in is moving papers between Area Chairs (in response to their comments) whilst they also do some refinement of our initial allocation. If I find time I'll also tidy up the Python code that was used and publish it as well (in the form of an IPython notebook).

I wrote the process down in response to a query from Geoff Gordon. So the questions I answer are imagined questions from Geoff. If you like, you can picture Geoff asking them like I did, but in real life, they are words I put into Geoff’s mouth.

  •  How did you allocate the papers?

We ranked all paper-reviewer matches by a similarity and allocated each paper-reviewer pair from the top of the list, rejecting an allocation if the reviewer had a full quota, or the paper had a full complement of reviewers.
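
In code, the process looks roughly like the following sketch. The variable names are illustrative; the real code was a pandas-based IPython notebook rather than this exact function.

```python
# Illustrative sketch of the greedy allocation: walk down the ranked list of
# paper-reviewer similarities, skipping full reviewers and full papers.
def allocate(similarity, reviewer_quota, reviewers_per_paper):
    """similarity: dict mapping (paper, reviewer) -> match score."""
    n_reviewers = {}   # reviewers assigned so far, per paper
    n_papers = {}      # papers assigned so far, per reviewer
    assignment = []
    for (paper, reviewer), score in sorted(similarity.items(),
                                           key=lambda kv: kv[1], reverse=True):
        if n_papers.get(reviewer, 0) >= reviewer_quota:
            continue  # reviewer already has a full quota
        if n_reviewers.get(paper, 0) >= reviewers_per_paper:
            continue  # paper already has a full complement of reviewers
        assignment.append((paper, reviewer))
        n_papers[reviewer] = n_papers.get(reviewer, 0) + 1
        n_reviewers[paper] = n_reviewers.get(paper, 0) + 1
    return assignment
```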

  • How was the similarity computed?

The similarity consisted of the following weighted components:

s_p = 0.25 * primary subject match
s_s = 0.25 * bag-of-words match between primary and secondary subjects
m = 0.5 * TPMS score (rescaled to lie between 0 and 1)

  •  So how were the bids used?

Each of the similarity scores was multiplied by 1.5^b, where b is the bid: for "eager" b=2, "willing" b=1, "in a pinch" b=-1, "not willing" b=-2, and no bid b=0. So the final score used in the ranking was (s_p + s_s + m) * 1.5^b.
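
Putting the components together, the score for a paper-reviewer pair looks roughly like this (how the inputs are rescaled to [0, 1] is glossed over here):

```python
# The match score for a paper-reviewer pair, following the formulas above.
# How the inputs are rescaled to [0, 1] is glossed over here.
def match_score(primary_match, subject_overlap, tpms, bid):
    """bid: +2 eager, +1 willing, 0 no bid, -1 in a pinch, -2 not willing."""
    s_p = 0.25 * primary_match    # 1.0 if the primary subjects agree, else 0.0
    s_s = 0.25 * subject_overlap  # bag-of-words overlap of subject areas
    m = 0.5 * tpms                # TPMS score rescaled to lie in [0, 1]
    return (s_p + s_s + m) * 1.5 ** bid
```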

  • But how did you deal with the fact that different reviewers used the bidding in different ways?

The rows and columns were crudely normalized by the *square root* of their standard deviations.
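
Assuming the similarities sit in a pandas DataFrame with papers as rows and reviewers as columns, the crude normalization looks something like this sketch (the real code may have differed in details):

```python
# Illustrative sketch: similarities in a pandas DataFrame with papers as rows
# and reviewers as columns, each crudely divided by the square root of its
# row/column standard deviation.
import numpy as np
import pandas as pd

def crude_normalize(S: pd.DataFrame) -> pd.DataFrame:
    S = S.div(np.sqrt(S.std(axis=1)), axis=0)  # normalize each paper (row)
    S = S.div(np.sqrt(S.std(axis=0)), axis=1)  # normalize each reviewer (column)
    return S
```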

  • So what about conflicting papers?

Conflicting papers were given similarities of -inf.

  • How did you ensure that ‘expertise’ was evenly distributed?

We split the reviewing body into two groups. The first group of 'experts' were those people with two or more NIPS papers since 2007 (thanks to Chris Hiestand for providing this information). This was about 1/3 of the total reviewing body. We allocated these reviewers first, to a maximum of one 'expert' per paper. We then allocated the remainder of the reviewing body to the papers, up to a maximum of 3 reviewers per paper.
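
Roughly, the two-pass allocation looks like the following sketch; the quotas and data structures are my assumptions for illustration.

```python
# Illustrative sketch of the two-pass allocation: experts first (at most one
# per paper), then the rest of the reviewing body up to three reviewers per
# paper. Quotas and data structures are assumptions.
def allocate_two_pass(similarity, experts, reviewer_quota=5):
    load, assigned, assignment = {}, {}, []

    def greedy_pass(pairs, paper_cap):
        for (paper, reviewer), score in sorted(pairs.items(),
                                               key=lambda kv: kv[1], reverse=True):
            if load.get(reviewer, 0) >= reviewer_quota:
                continue
            if assigned.get(paper, 0) >= paper_cap:
                continue
            assignment.append((paper, reviewer))
            load[reviewer] = load.get(reviewer, 0) + 1
            assigned[paper] = assigned.get(paper, 0) + 1

    expert_pairs = {k: v for k, v in similarity.items() if k[1] in experts}
    other_pairs = {k: v for k, v in similarity.items() if k[1] not in experts}
    greedy_pass(expert_pairs, paper_cap=1)  # first pass: one 'expert' per paper
    greedy_pass(other_pairs, paper_cap=3)   # second pass: fill up to three reviewers
    return assignment
```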

  • One or more of my papers has fewer than three reviewers; how did that happen?

When forming the ranking to allocate papers, we only retained paper-reviewer matches scoring in the upper half of the list. This was to ensure that we didn't drop too far down the ranked list of similarities. After passing once through the ranked list of scores, some papers were still left unallocated.

  • But you didn't leave papers unallocated to area chairs, did you?

No, we needed all papers to have an area chair, so for area chairs we continued to allocate these ‘inappropriate papers’ to the best matching area chair with remaining quota, but for reviewers we left these allocations ‘open’ because we felt manual intervention was appropriate here.

  • Was anything else different about area chair allocation?

Yes, we found there was a tendency for high-bidding area chairs to fill up their allocation quickly compared to low-bidding area chairs, meaning low-bidding/low-similarity area chairs weren't getting a very good allocation. To try to distribute things more evenly, we used a 'staged quota' system. We started by allocating area chairs five papers each, then ten, then fifteen, and so on. This meant that even if an area chair had the top 25 similarities in the overall list, many of those papers would still be matched to other area chairs. Our crude normalization was also designed to prevent this tendency. Perhaps a better idea still would be to rank similarities on a per-reviewer basis and use this rank as the score instead of the similarity itself, although we didn't try this.
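
A sketch of the staged quota idea, with the per-chair cap growing in steps of five as described above (the surrounding structure is my assumption):

```python
# Illustrative sketch of the 'staged quota' allocation for area chairs: the
# per-chair cap grows in steps of five, so no chair fills up before others
# have had a chance at well-matched papers.
def staged_allocation(similarity, final_quota, step=5):
    assignment, load, has_chair = [], {}, set()
    quota = step
    while quota <= final_quota:
        for (paper, chair), score in sorted(similarity.items(),
                                            key=lambda kv: kv[1], reverse=True):
            if paper in has_chair:              # each paper needs only one area chair
                continue
            if load.get(chair, 0) >= quota:     # staged cap: 5, then 10, then 15 ...
                continue
            assignment.append((paper, chair))
            load[chair] = load.get(chair, 0) + 1
            has_chair.add(paper)
        quota += step
    return assignment
```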

  • Did you do the allocations for the bidding in the same way?

Yes, we did the bidding allocations in a similar way, apart from two things. Firstly, the similarity score was different: we didn't have a separate match to the primary keyword. This led to problems for reviewers who had one dominant primary keyword and many less important secondary keywords. Secondly, the allocated papers were distributed in a different way. Each paper was allocated (for bidding) to those area chairs who were in the top 25 scores for that paper. This led to quite a wide variety in the number of papers you saw for bidding, but each paper was (hopefully) seen at least 25 times.

  • That’s for area chairs, was it the same for the bidding allocation for reviewers?

No, for reviewers we wanted to restrict the number of papers that each reviewer would see. We decided each reviewer should see a maximum of 25 papers, so we did something more similar to the 'preliminary allocation' that's just been sent out. We went down the list allocating a maximum of 25 papers per reviewer, and ensuring each paper was viewed by 17 different reviewers.
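
The two bidding allocations can be sketched as follows. The 25-chair, 25-paper and 17-viewer parameters come from the text above; everything else is illustrative.

```python
# Illustrative sketches of the two bidding allocations. The 25/17 parameters
# are from the text above; the data structures are assumptions.
def bidding_for_area_chairs(similarity, top_k=25):
    """Each paper is shown for bidding to the area chairs holding its top-k scores."""
    by_paper = {}
    for (paper, chair), score in similarity.items():
        by_paper.setdefault(paper, []).append((score, chair))
    return {paper: [chair for _, chair in sorted(pairs, reverse=True)[:top_k]]
            for paper, pairs in by_paper.items()}

def bidding_for_reviewers(similarity, max_per_reviewer=25, viewers_per_paper=17):
    """Greedy pass capping each reviewer at 25 papers, aiming for 17 viewers per paper."""
    load, viewers, allocation = {}, {}, []
    for (paper, reviewer), score in sorted(similarity.items(),
                                           key=lambda kv: kv[1], reverse=True):
        if load.get(reviewer, 0) >= max_per_reviewer:
            continue
        if viewers.get(paper, 0) >= viewers_per_paper:
            continue
        allocation.append((paper, reviewer))
        load[reviewer] = load.get(reviewer, 0) + 1
        viewers[paper] = viewers.get(paper, 0) + 1
    return allocation
```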

  • Why did you do it like this? Did you research into this?

We never (originally) intended to intervene so heavily in the allocation system, but with this year's record numbers of submissions and reviewers the CMT database was failing to allocate. This, combined with time delays between Sheffield/New York/Seattle, was causing delays in getting papers out for bidding, so at one stage we split the load: Corinna worked with CMT to achieve an allocation and Neil worked on coding the intervention described above. The intervention was finished first. Once we had a rough-and-ready system working for bids we realised we could have finer control over the allocation than we'd get with CMT (for example, trying to ensure that each paper got at least one 'expert'), so we chose to continue with our approach. There may certainly be better ways of doing this.

  • How did you check the quality of the allocation?

The main approach we used for checking allocation quality was to check the allocation of an area chair whose domain we knew well, and ensure that the allocation made sense, i.e. we looked at the list of papers and judged whether it made ‘sense’.

  • That doesn’t sound very objective, isn’t there a better way?

We agree that it's not very objective, but then again people seem to evaluate topic models like that all the time, and a topic model is a key part of this system (the TPMS matching service). The other approach was to wait until people complained about their allocation. There were only a few very polite complaints at the bidding stage, but these led us to realise we needed to upweight the similarities associated with the primary keyword. We found that some people chose one very dominant primary keyword and many less important secondary keywords. These reviewers were not getting a very focussed allocation.

  • How about the code?

The code was written in Python using pandas, in the form of an IPython notebook.

And finally …

Thanks to all the reviewers and area chairs for their patience with the system and particular thanks to Laurent Charlin (TPMS) and the CMT support for their help getting everything uploaded.