Legislation for Personal Data: Magna Carta or Highway Code?

Karl Popper is perhaps one of the most important thinkers from the 20th century. Not purely for his philosophy of science, but for giving a definitive answer to a common conundrum: “Which comes first, the chicken or the egg?”. He says that they were simply preceded by an ‘earlier type of egg’. I take this to mean that the answer is neither: they actually co-evolved. What do I mean by co-evolved? Well broadly speaking there once were two primordial entities which weren’t very chicken-like or egg-like at all, over time small changes occurred, supported by natural selection, rendering those entities unrecognisable from their origins into two of our most familiar foodstuffs of today.

I find the process of co-evolution remarkable, and to some extent unimaginable, or certainly it seems to me difficult to visualise the intermediate steps. Evolution occurs by natural selection: selection by the ‘environment’, but when we refer to co-evolution we are clarifying that this is a complex interaction. The primordial entities effect the environment around them, therefore changing the ‘rules of the game’ as far as survival is concerned. In such a convolved system certainties about the right action disappear very quickly.

What use are chickens and eggs when talking about personal data? Well, Popper used the question to illustrate a point about scientific endeavour. He was talking about science and reflecting on how scientific theories co-evolve with experiments. However, that’s not the point I’d like to make here. Co-evolution is very general, one area it arises is when technological advance changes society to such an extent that existing legislative frameworks become inappropriate. Tim Berners Lee has called for a Magna Carta for the digital age, and I think this is a worthy idea, but is it the right idea? A digital bill of rights may be the right idea in the longer run, but I don’t think we are ready to draft it yet. My own research is machine learning, the main technology underpinning the current AI revolution. A combination of machine learning, fast computers, and interconnected data means that the technological landscape is changing so fast that it is effecting society around us in ways that no one envisaged twenty years ago.

Even if we were to start with the primordial entities that presaged the chicken and the egg, and we knew all about the process of natural selection, could we have predicted or controlled the animal of the future that would emerge? We couldn’t have done. The chicken exists today as the product of its environmental experience, an experience that was unique to it. The end point we see is one of is highly sensitive to very small perturbations that could have occurred at the beginning.

So should we be writing legislation today which ties down the behaviour of future generations? There is precedent for this from the past. Before the printing press was introduced, no one would have begrudged the monks’ right to laboriously transcribe the books of the day. Printing meant it was necessary to protect the “copy rights” of the originator of the material. No one could have envisaged that those copyright laws would also be used to protect software, or digital music. In the industrial revolution the legal mechanism of ‘letters patent’ evolved to protect creative insight. Patents became protection of intellectual property, ensuring that inventors’ ideas could be shared under license. These mechanisms also protect innovation in the digital world. In some jurisdictions they are now applied to software and even user interface designs. Of course even this legislation is stretched in the face of digital technology and may need to evolve, as it has done in the past.

The new legislative challenge is not in protecting what is innovative about people, but what is commonplace about them. The new value is in knowing the nature of people: predicting their needs and fulfilling them. This is the value of interconnection of personal data. It allows us to make predictions about an individual by comparing him or her to others. It is the mainstay of the modern internet economy: targeted advertising and recommendation systems. It underpins my own research ideas in personalisation of health treatments and early diagnosis of disease. But it leads to potential dangers, particularly where the uncontrolled storage and flow of an individual’s personal information is concerned. We are reaching the point where some studies are showing that computer prediction of our personality is more accurate than that of our friends and relatives. How long before an objective computer prediction of our personality can outperform our own subjective assessment of ourselves? Some argue those times are already upon us. It feels dangerous for such power to be wielded unregulated by a few powerful groups. So what is the answer? New legislation? But how should it come about?

In the long term, I think we need to develop a set of rules and legislation, that include principles that protect our digital rights. I think we need new models of ownership that allow us to control our private data. One idea that appeals to me is extending data protection legislation with the right not only to view data held about us, but to also ask for it to be deleted. However, I can envisage many practical problems with that idea, and these need to be resolved so we can also enjoy the benefits of these personalised predictions.

As wonderful as some of the principles in the Magna Carta are, I don’t think it provides a good model for the introduction of modern legislation. It was actually signed under duress: under a threat of violent revolution. The revolution was threatened by a landed gentry, although the consequences would have been felt by all. Revolutions don’t always end well. They occur because people can become deadlocked: they envisage different futures for themselves and there is no way to agree on a shared path to different end points. The Magna Carta was also a deal between the king and his barons. Those barons were asking for rights that they had no intention of extending within their fiefdoms. These two characteristics: redistribution of power amongst a powerful minority, with significant potential consequences for the a disenfranchised majority, make the Magna Carta, for me, a poor analogy for how we would like things to proceed.

The chicken and the egg remind us that the actual future will likely be more remarkable than any of us can currently imagine. Even if we all seek a particular version of the future this version of the future is unlikely to ever exist in the form that we imagine. Open, receptive and ongoing dialogue between the interested and informed parties is more likely to bring about a societal consensus. But can this happen in practice? Could we really evolve a set of rights and legislative principles which lets us achieve all our goals? I’d like to propose that rather than taking as our example a mediaeval document, written on velum, we look to more recent changes in society and how they have been handled. In England, the Victorians may have done more than anyone to promote our romantic notion of the Magna Carta, but I think we can learn more by looking at how they dealt with their own legislative challenges.

I live in Sheffield, and cycle regularly in the Peak District national park. Enjoyment of the Peak Park is not restricted to our era. At 10:30 on Easter Monday in 1882 a Landau carriage, rented by a local cutler, was heading on a day trip from Sheffield to the village of Tideswell, in the White Peak. They’d left Sheffield via Ecclesall Road, and as they began to descend the road beneath Froggatt Edge, just before the Grouse Inn they encountered a large traction engine towing two trucks of coal. The Landau carriage had two horses and had been moving at a brisk pace of four and a half miles an hour. They had already passed several engines on the way out of Sheffield. However, as they moved out to pass this one, it let out a continuous blast of steam and began to turn across their path into the entrance of the inn. One of the horses took fright pulling the carriage up a bank, throwing Ben Deakin Littlewood and Mary Coke Smith from the carriage and under the wheels of the traction engine. I cycle to work past their graves every day. The event was remarkable at the time, so much so that is chiselled into the inscription on Ben’s grave.

The traction engine was preceded, as legislation since 1865 had dictated, by a boy waving a red flag. It was restricted to two and a half miles an hour. However, the boy’s role was to warn oncoming traffic. The traction engine driver had turned without checking whether the road was clear of overtaking traffic. It’s difficult to blame the driver though. I imagine that there was quite a lot involved in driving a traction engine in 1882. It turned out that the driver was also preoccupied with a broken wheel on one of his carriages. He was turning into the Grouse to check the wheel before descending the road.

This example shows how legislation can sometimes be extremely restrictive, but still not achieve the desired outcome. Codification of the manner in which a vehicle should be overtaken came later, at a time when vehicles were travelling much faster. The Landau carriage was overtaking about 100 meters after a bend. The driver of the traction engine didn’t check over his shoulder immediately before turning, although he claimed he’d looked earlier. Today both drivers’ responsibilities are laid out in the “Highway Code”. There was no “Mirror, Signal, Manoeuvre” in 1882. That came later alongside other regulations such as road markings and turn indicators.

The shared use of our road network, and the development of the right legislative framework might be a good analogy for how we should develop legislation for protecting our personal privacy. No analogy is ever perfect, but it is clear that our society both gained and lost through introduction of motorised travel. Similarly, the digital revolution will bring advantages but new challenges. We need to have mechanisms that allow for negotiated solutions. We need to be able to argue about the balance of current legislation and how it should evolve. Those arguments will be driven by our own personal perspectives. Our modern rules of the road are in the Highway Code. It lists responsibilities of drivers, motorcyclists, cyclists, mobility scooters, pedestrians and even animals. It gives legal requirements and standards of expected behaviour. The Highway Code co-evolved with transport technology: it has undergone 15 editions and is currently being rewritten to accommodate driverless cars. Even today we still argue about the balance of this document.

In the long term, when technologies have stabilised, I hope we will be able to distill our thinking to a bill of rights for the internet. But such a document has a finality about it which seems inappropriate in the face of technological uncertainty. Calls for a Magna Carta provide soundbites that resonate and provide rallying points. But they can polarise, presaging unhelpful battles. Between the Magna Carta and the foundation of the United States the balance between the English monarch and his subjects was reassessed through the English Civil War and the American Revolution. Wven tI don’t think we can afford such discord when drafting the rights of the digital age. We need mechanisms that allow for open debate, rather than open battle. Before a bill of rights for the internet, I think we need a different document. I’d like to sound the less resonant call for a document that allows for dialogue, reflecting concerns as they emerge. It could summarise current law and express expected standards of behaviour. With regular updating it would provide an evolving social contract between all the users of the information highway: people, governments, businesses, hospitals, scientists, aid organisations. Perhaps instead of a Magna Carta for the internet we should start with something more humble: the rules of the digital road.

This blog post is an extended version of an written for the Guardian’s media network: “Let’s learn the rules of the digital road before talking about a web Magna Carta”

Proceedings of Machine Learning Research

Back in 2006 when the wider machine learning community was becoming aware of Gaussian processes (mainly through the publication of the Rasmussen and WIlliams book). Joaquin Quinonero Candela, Anton Schwaighofer and I organised the Gaussian Processes in Practice workshop at Bletchley Park. We planned a short proceedings for the workshop, but when I contacted Springer’s LNCS proceedings, a rather dismissive note came back with an associated prohibitive cost. Given that the ranking of LNCS wasn’t (and never has been) that high, this seemed a little presumptuous on their part. In response I contacted JMLR and asked if they’d ever considered a proceedings track. The result was that I was asked by Leslie Pack Kaelbling to launch the proceedings track.

JMLR isn’t just open access, but there is no charge to authors. It is hosted by servers at MIT and managed by the community.

We launched the proceedings in March 2007 with the first volume from the Gaussian Processes in Practice workshop. Since then there have been 38 volumes including two volumes in the pipeline. The proceedings publishes several leading conferences in machine learning including AISTATS, COLT and ICML.

From the start we felt that it was important to share the branding of JMLR with the proceedings, to show that the publication was following the same ethos as JMLR. However, this led to the rather awkward name: JMLR Workshop and Conference Proceedings, or JMLR W&CP. Following discussion with the senior editorial board of JMLR we now feel the time is right to rebrand with the shorter “Proceedings of Machine Learning Research”.

As part of the rebranding process the editorial team for the Proceedings of Machine Learning Research (which consists of Mark Reid and myself) is launching a small consultation exercise looking for suggestions on how we can improve the service for the community. Please feel free to leave comments on this blog post or via Facebook or Twitter to let us have feedback!

Beware the Rise of the Digital Oligarchy

The Guardian’s media network published a short article I wrote for them on 5th March. They commissioned an article of about 600 words, that appeared on the Guardian’s site, but the original version I wrote was around 1400. I agreed a week’s exclusivity with the Guardian, but now that’s up, the longer version is below (it’s about twice as long).

On a recent visit to Genova, during a walk through the town with my colleague Lorenzo, he pointed out what he said was the site of the world’s first commercial bank. The bank of St George, located just outside the city’s old port, grew to be one of the most powerful institutions in Europe, it bankrolled Charles V and governed many of Genova’s possessions on the republic’s behalf. The trust that its clients placed in the bank is shown in records of its account holders. There are letters from Christopher Columbus to the bank instructing them in the handling of his affairs. The influence of the bank was based on the power of accumulated capital. Capital they could accumulate through the trust of a wealthy client base. The bank was so important in the medieval world that Machiavelli wrote that “if even more power was ceded by the Genovan republic to the bank, Genova would even outshine Venice amongst the Italian city states.” The Bank of St George was once one of the most influential private institutions in Europe.

Today the power wielded by accumulated capital can still dominate international affairs, but a new form of power is emerging, that of accumulated data. Like Hansel and Grettel trailing breadcrumbs into the forest, people now leave a trail of data-crumbs wherever we travel. Supermarket loyalty cards, text messages, credit card transactions, web browsing and social networking. The power of this data emerges, like that of capital, when it’s accumulated. Data is the new currency.

Where does this power come from? Cross linking of different data sources can give deep insights into personality, health, commercial intent and risk. The aim is now to understand and characterize the population, perhaps down to the individual level. Personalization is the watch word for your search results, your social network news feed, your movie recommendations and even your friends. This is not a new phenomenon, psychologists and social scientists have always attempted to characterize the population, to better understand how to govern or who to employ. They acquired their data by carefully constructed questionnaires to better understand personality and intelligence. The difference is the granularity with which these characterizations are now made, instead of understanding groups and sub-groups in the population, the aim is to understand each person. There are wonderful possibilities, we should  better understand health, give earlier diagnoses for diseases such as dementia and provide better support to the elderly and otherwise incapacitated people. But there are also major ethical questions, and they don’t seem to be adequately addressed by our current legal frameworks. For Columbus it was clear, he was the owner of the money in his accounts. His instructions to the bank tell them how to distribute it to friends and relations. They only held his capital under license. A convenient storage facility. Ownership of data is less clear. Historically, acquiring data was expensive: questionnaires were painstakingly compiled and manually distributed. When answering, the risk of revealing too much of ourselves was small because the data never accumulated. Today we leave digital footprints in our wake, and acquisition of this data is relatively cheap. It is the processing of the data that is more difficult.

I’m a professor of machine learning. Machine learning is the main technique at the heart of the current revolution in artificial intelligence. A major aim of our field is to develop algorithms that better understand data: that can reveal the underlying intent or state of health behind the information flow. Already machine learning techniques are used to recognise faces or make recommendations, as we develop better algorithms that better aggregate data, our understanding of the individual also improves.

What do we lose by revealing so much of ourselves? How are we exposed when so much of our digital soul is laid bare? Have we engaged in a Faustian pact with the internet giants? Similar to Faust, we might agree to the pact in moments of levity, or despair, perhaps weakened by poor health. My father died last year, but there are still echoes of him on line. Through his account on Facebook I can be reminded of his birthday or told of common friends. Our digital souls may not be immortal, but they certainly outlive us. What we choose to share also affects our family: my wife and I may be happy to share information about our genetics, perhaps for altruistic reasons, or just out of curiosity. But by doing so we are also sharing information about our children’s genomes. Using a supermarket loyalty card gains us discounts on our weekly shop, but also gives the supermarket detailed information about our family diet. In this way we’d expose both the nature and nurture of our children’s upbringing. Will our decisions to make this information available haunt our children in the future? Are we equipped to understand the trade offs we make by this sharing?

There have been calls from Elon Musk, Stephen Hawking and others to regulate artificial intelligence research. They cite fears about autonomous and sentient artificial intelligence that  could self replicate beyond our control. Most of my colleagues believe that such breakthroughs are beyond the horizon of current research. Sentient intelligence is  still not at all well understood. As Ryan Adams, a friend and colleague based at Harvard tweeted:

Personally, I worry less about the machines, and more about the humans with enhanced powers of data access. After all, most of our historic problems seem to have come from humans wielding too much power, either individually or through institutions of government or business. Whilst sentient AI does seem beyond our horizons, one aspect of it is closer to our grasp. An aspect of sentient intelligence is ‘knowing yourself’, predicting your own behaviour. It does seem to me plausible that through accumulation of data computers may start to ‘know us’ even better than we know ourselves. I think that one concern of Musk and Hawking is that the computers would act autonomously on this knowledge. My more immediate concern is that our fellow humans, through the modern equivalents of the bank of St George, will be exploiting this knowledge leading to a form of data-oligarchy. And in the manner of oligarchies, the power will be in the hands of very few but wielded to the effect of many.

How do we control for all this? Firstly, we need to consider how to regulate the storage of data. We need better models of data-ownership. There was no question that Columbus was the owner of the money in his accounts. He gave it under license, and he could withdraw it at his pleasure. For the data repositories we interact with we have no right of deletion. We can withdraw from the relationship, and in Europe data protection legislation gives us the right to examine what is stored about us. But we don’t have any right of removal. We cannot withdraw access to our historic data if we become concerned about the way it might be used. Secondly, we need to increase transparency. If an algorithm makes a recommendation for us, can we known on what information in our historic data that prediction was based? In other words, can we know how it arrived at that prediction? The first challenge is a legislative one, the second is both technical and social. It involves increasing people’s understanding of how data is processed and what the capabilities and limitations of our algorithms are.

There are opportunities and risks with the accumulation of data, just as there were (and still are) for the accumulation of capital. I think there are many open questions, and we should be wary of anyone who claims to have all the answers. However, two directions seem clear: we need to both increase the power of the people; we need to develop their understanding of the processes. It is likely to be a fraught process, but we need to form a data-democracy: data governance for the people by the people and with the people’s consent.

Neil Lawrence is a Professor of Machine Learning at the University of Sheffield. He is an advocate of “Open Data Science” and an advisor to a London based startup, CitizenMe, that aims to allow users to “reclaim their digital soul”.

Blogs on the NIPS Experiment

There are now quite a few blog posts on the NIPS experiment, I just wanted to put a place together where I could link to them all. It’s a great set of posts from community mainstays, newcomers and those outside our research fields.

Just as a reminder, Corinna and I were extremely open about the entire review process, with a series of posts about how we engaging the reviewers and processing the data. All that background can be found through a separate post here.

At the time of writing there is also still quite a lot of twitter traffic on the experiment.

List of Blog Posts

What an exciting series of posts and perspectives!
For those of you that couldn’t make the conference, here’s what it looked like.
And that’s just one of 5 or six poster rows!

The NIPS Experiment

Just back from NIPS where it was really great to see the results of all the work everyone put in. I really enjoyed the program and thought the quality of all presented work was really strong. Both Corinna and I were particularly impressed by the work that put in by oral presenters to make their work accessible to such a large and diverse audience.

We also released some of the figures from the NIPS experiment, and there was a lot of discussion at the conference about what the result meant.

As we announced at the conference the consistency figure was 25.9%. I just wanted to confirm that in the spirit of openness that we’ve pursued across the entire conference process Corinna and I will provide a full write up of our analysis and conclusions in due course!

Some of the comment in the existing debate is missing out some of the background information we’ve tried to generate, so I just wanted to write a post that summarises that information to highlight its availability.

Scicast Question

With the help of Nicolo Fusi, Charles Twardy and the entire Scicast team we launched a Scicast question a week before the results were revealed. The comment thread for that question already had an amount of interesting comment before the conference. Just for informational purposes before we began reviewing Corinna forecast this figure would be 25% and I forecast it would be 20%. The box plot summary of predictions from Scicast is below.


Comment at the Conference

There was also an amount of debate at the conference about what the results mean, a few attempts to answer this question (based only on the inconsistency score and the expected accept rate for the conference) are available here in this little Facebook discussion and on this blog post.

Background Information on the Process

Just to emphasise previous posts on this year’s conference see below:

  1. NIPS Decision Time
  2. Reviewer Calibration for NIPS
  3. Reviewer Recruitment and Experience
  4. Paper Allocation for NIPS

Software on Github

And finally there is a large amount of code available on a github site for allowing our processes to be recreated. A lot of it is tidied up, but the last sections on the analysis are not yet done because it was always my intention to finish those when the experimental results are fully released.

NIPS: Decision Time

Thursday 28th August

In the last two days I’ve spent nearly 20 hours in teleconferences, my last scheduled conference will start in about 1/2 an hour. Given the available 25 minutes it seemed to make sense to try and put down some thoughts about the decision process.

The discussion period has been constant, there is a stream of incoming queries from Area Chairs, requests for advice on additional reviewers, or how to resolve deadlocked or disputing reviews. Corinna has handled many of these.

Since the author rebuttal period all the papers have been distributed to google spreadsheet lists which are updated daily. They contain paper titles, reviewer names, quality scores, calibrated scores, a probability of accept (under our calibration model), a list of bot-compiled potential issues as well as columns for accept/reject and poster/spotlight. Area chairs have been working in buddy pairs, ensuring that a second set of eyes can rest on each paper. For those papers around the borderline, or with contrasting reviews, the discussion period really can have an affect, we see when calibrating the reviewer scores: over time the reviewer bias is reducing and the scores are becoming more consistent. For this reason we allowed this period to go on a week longer than originally planned, and we’ve been compressing our teleconferences into the last few days.

Most teleconferences consist of two buddy pairs coming together to discuss their papers. Perhaps ideally the pairs would have a similar subject background, but constraints of time zone and the fact that there isn’t a balanced number of subject areas mean that this isn’t necessarily the case.

Corinna and I have been following a similar format. Listing the papers from highest scoring first, to lowest scoring, and starting at the top. For each paper, if it is a confident accept, we try and identify if it might be a talk or a spotlight. This is where the opinion of a range of Area Chairs can be very useful. For uncontroversial accepts that aren’t nominated for orals we spend very little time. This proceeds until we start reaching borderline papers, those in the ‘grey area’: typically papers with an average score around 6. They fall broadly into two categories: those where the reviewers disagree (e.g. scores of 8,6,4), or those where the review are consistent but the reviewers , perhaps, feel underwhelmed (scores of 6,6,6). Area chairs will often work hard to try and get one of the reviewers to ‘champion’ a paper: it’s a good sign if a reviewer has been prepared to argue the case for a paper in the discussion. However, the decisions in this region are still difficult. It is clear that we are rejecting some very solid papers, for reasons of space and because of the overall quality of submissions. It’s hard for everyone to be on the ‘distributing’ end of this system, but at the same time, we’ve all been on the receiving end of it too.

In this difficult ‘grey area’ for acceptance, we are looking for sparks in a paper that push it over the edge to acceptance. So what sort of thing catches an area chair’s eye? A new direction is always welcome, but often leads to higher variance in the reviewer scores. Not all reviewers are necessarily comfortable with the unfamiliar. But if an area chair feels a paper is taking the machine learning field somewhere new, then even if the paper has some weaknesses (e.g. in evaluation or giving context and detailed derivations etc) then we might be prepared to overlook this. We look at the borderline papers in some detail, scanning the reviews, looking for words like ‘innovative’, ‘new directions’ or ‘strong experimental results’. If we see these then as program chairs we definitely become more attentive. We all remember papers presented at NIPS in the past that lead to revolutions in the way machine learning is done. Both Corinna and I would love to have such papers at ‘our’ NIPS.

A paper that is a more developed area will be expected to have done a more rounded job in terms of setting the context and performing the evaluation. Papers in a more developed area will be expected to hit a high level in terms of their standards.

It is often helpful to have an extra pair of eyes (or even two pairs) run through the paper. Each teleconference call normally ends with a few follow up actions for a different area chair to look through a paper or clarify a particular point. Sometimes we also call in domain experts, who may have already produced four formal reviews of other papers, just to get clarification on  particular point. This certainly doesn’t happen for all papers, but those with scores around 7,6,6 or 6,6,6 or 8,6,4 often get this treatment. Much depends on the discussion and content of the existing reviews, but there are still, often, final checks that need carrying out. From a program chair’s perspective, the most important thing is that the Area Chair is comfortable with the decision, and I think most of the job is acting as a sounding board for the Area Chair’s opinion, which I try to reflect back to them. In the same manner as rubber duck debugging, just vocalising the issues sometimes causes them to be crystallised in the mind. Ensuring that Area Chairs are calibrated to each other is also important. The global probabilities of accept from the reviewer calibration model really help here. As we go through papers I keep half an eye on those, not to influence the decision of a particular paper so much as to ensure that at the end of the process we don’t have a surplus of accepts. At this stage all decisions are tentative, but we hope not to have to come back to too many of them.

Monday 1st September

Corinna finished her last video conference on Friday, Saturday, Sunday and Monday (Labor Day) were filled with making final decisions on accepts, then talks and finally spotlights. Accepts were hard, we were unable to take all the papers that were possible accept, as we would have gone way over our quota of 400. We had to make a decision on duplicated papers where the decisions were in conflict, more details of this to come at the conference. From remembering what a pain it was to do the schedule after the acceptances, and also following advice from Leon Bottou that the talk program emerges to reflect the accepted posters, we finalized the talk and spotlight program whilst putting talks and spotlights directly into the schedule. We had to hone the talks down to 20 from about 40 candidates and spotlights we squeezed in 62 from over a hundred suggestions. We spent three hours in teleconference each day, as well as preparation time, across Labor Day weekend putting together the first draft of the schedule. It was particularly impressive how quickly area chairs responded to any of our follow up queries to our notes from the teleconferences. Particularly those in the US who were enjoying the traditional last weekend of summer.

Tuesday 2nd September

I had an all day meeting in Manchester for the a network of researchers focussed on mental illness. It was really good to have a day discussing research, my first in a long time. I thought very little about NIPS until on the train home, I thought to have a little look at the conference shape. I actually ended up looking at a lot of the papers we rejected, many from close colleagues and friends. I found it a little depressing. I have no doubt there is a lot of excellent work there, and I know how disappointed my friends and colleagues will be to receive those rejections. We did an enormous amount to ensure that the process was right, and I have every confidence in the area chairs and reviewers. But at the end of the day, you know that you will be rejecting a lot of good work. It brought to mind a thought I had at the allocation stage. When we had the draft allocation to each area chair, I went through several of them sanity checking the quality of the allocation. Naturally, I checked those associated with area chairs who are closer to my own areas of expertise. I looked through the paper titles, and I couldn’t help but think what a good workshop each of those allocations would make. There would be some great ideas, some partially developed ideas. There would be some really great experiments and some weaker experiments. But there would be a lot of debate at such workshop. None or very few of the papers would be uninteresting: there would certainly be errors in papers, but that’s one of the charms of a workshop, there’s still a lot more to be said about an idea when it’s presented at a workshop.

Friday 5th September

Returning from an excellent two day UCL-Duke workshop. There is a lot of curiosity about the NIPS experiment, but Corinna and I have agreed to keep the results embargoed until the conference.

Saturday 6th September

Area chairs had until Thursday to finalise their reviews in the light of the final decisions, and also to raise any concerns they had about the final decisions. My own experience of area chairing is that you can have doubts about your reasoning when you are forced to put pen to paper and write the meta review. We felt it was important to not rush the final process to allow any of those doubts to emerge. In the end, the final program has 3 or 4 changes from the draft we first distributed on Monday night, so there may be some merit in this approach. We had a further 3 hour teleconference today to go through the meta-reviews, with a particular focus on those for papers around the decision boundary. Other issues such as comments in the wrong place (the CMT interface can be fairly confusing, 3% of meta reviews were actually placed in the box meant for notes to the program chairs) were also covered. Our big concern was if the area chairs had written a review consistent with our final verdict. A handy learning task would have been to build a sentiment model to predict accept/reject from the meta review.

Monday 8th September 

Our plan had been to release reviews this morning, but we were still waiting for a couple of meta-reviews to be tidied up and had an outstanding issue on one paper. I write this with CMT ‘loaded’ and ready to distribute decisions. However, when I preview the emails the variable fields are not filled in (if I hit ‘send’ I would send 5,000 emails that start “Dear $RecipientFirstName$, which sounds somewhat impersonal … although perhaps more critical is that the authors would be informed of the fate of paper “$Title$,” which may lead to some confusion. CMT are on a different time zone, 8 hours behind. Fortunately, it is late here, so there is a good chance they will respond in time …

Tuesday 9th September

I was wide awake at 6:10 despite going to sleep at 2 am. I always remember when I was Area Chair with John Platt that he would be up late answering emails and then out of bed again 4 hours later doing it again. A few final checks and the all clear for everything is there. Pressed the button at 6:22 … emails are still going out and it is 10:47. 3854 of the 5615 emails have been sent … one reply which was an out of office email from China. Time to make a coffee …

Final Statistics

1678 submissions
414 papers accepted
20 papers for oral
62 for spotlight
331 for poster
19 rejected without review

Epilogue to Decision Mail:  So what was wrong with those variable names? I particularly like the fact that something different was wrong with each one. $RecipientFirstName$ and $RecipientEmail$ are  not available in the “Notification Wizard”, whereas they are in the normal email sending system. Then I got the other variables wrong, $Title$->$PaperTitle$ and $PaperId$->$PaperID$, but since neither of the two I knew to be right were working I assumed there was something wrong with the whole variable substitution system … rather than it being that (at least) two of the variable types just happen to be missing from this wizard … CMT responded nice and quickly though … that’s one advantage of working late.

Epilogue on Acceptances: At the time of the conference there were only 411 papers presented because three were withdrawn. Withdrawals were usually due to some deeper problem authors had found in there own work, perhaps triggered by comments from reviewers. So in the end there were 411 papers accepted and 328 posters.

Author Concerns

So the decisions have been out for a few days now, and of course we have had some queries about our processes. Every one has been pretty reasonable, and their frustration is understandable when three reviewers have argued for accept but the final decision is to reject. This is an issue with ‘space-constrained’ conferences. Whether a paper gets through in the end can depend on subjective judgements about the paper’s qualities. In particular, we’ve been looking for three components to this: novelty, clarity and utility. Papers with borderline scores (and borderline here might be that the average score is in the weak accept range) are examined closely. The decision about whether the paper is accepted at this point necessarily must come down to judgement, because for a paper to get scores this high the reviewers won’t have identified a particular problem with the paper. The things that come through are how novel the paper is, how useful the idea is, and how clearly it’s presented. Several authors seem to think that the latter should be downplayed. As program chairs, we don’t necessarily agree. It’s true that it is a great shame when a great idea is buried in poor presentation, but it’s also true that the objective of a conference is communication, and therefore clarity of presentation definitely plays a role. However, it’s clear that all these three criteria are a matter of academic judgement: that of the reviewers, the area chair and the quad groups in the teleconferences. All the evidence we’ve seen is that reviewers and area chairs did weigh these aspects carefully, but that doesn’t mean that all their decisions can be shown to be right, because they are often a matter of perspective. Naturally authors are upset when what feels like a perfectly good paper is rejected on more subjective grounds. Most of the queries are on papers where this is felt to be the case.

There has also been one query on process, and whether we did enough to evaluate on these criteria, for those papers in the borderline area, before author rebuttal. Authors are naturally upset when the area chair raises such issues in the final decision’s meta review, but these points weren’t there before. Personally I sympathise with both authors and area chairs in this case. We made some effort to encourage authors to identify such papers before rebuttal (we sent out attention reports that highlighted probable borderline papers) but our main efforts at the time were chasing missing and inappropriate or insufficient reviews. We compressed a lot into a fairly short time, and it was also a period when many are on holiday. We were very pleased with the performance of our area chairs, but I think it’s also unsurprising if an area chair didn’t have time to carefully think through these aspects before author rebuttal.

My own feeling is that the space constraint on NIPS is rather artificial, and a lot of these problems would be avoided if it wasn’t there. However, there is a counter argument that suggests that to be a top quality conference NIPS has to have a high reject rate. NIPS is used in tenure cases within the US and these statistics are important there. Whilst I reject these ideas: I don’t think the role of a conference is to allow people to get promoted in a particular country, nor is that the role of a journal: they are both involved in the communication and debate of scientific ideas. However, I do not view the program chair roles as reforming the conference ‘in their own image’. You have to also consider what NIPS means to the different participants.

NIPS as Christmas

I came up with an analogy for this which has NIPS in the role of Christmas (you can substitute Thanksgiving, Chinese New Year, or your favourite traditional feast). In the UK Christmas is a traditional holiday about which people have particular expectations, some of them major (there should be Turkey for Christmas Dinner) and some of them minor (there should be an old Bond movie on TV). These expectations have changed over time.  The Victorians used to eat Goose and the Christmas tree was introduced from Germany by Prince Albert’s influence in the Royal Household, and they also didn’t have James Bond, I think they used Charles Dickens instead. However, you can’t just change Christmas ‘overnight’, it needs to be a smooth transition. You can make lots of arguments about how Christmas could be a better meal, or that presents make the occasion too commercial, but people have expectations so the only way to make change is slowly. Taking small steps in the right direction. For any established successful venture this approach makes a lot of sense. There are many more ways to fail than be successful and I think that the rough argument is that if you are starting from a point of success you should be careful about how quickly you move because you are likely end up in failure. However, not moving at all also leads to failure. I think this year we’ve introduced some innovations and an analysis of the process that will hopefully lead to improvements. We certainly aren’t alone in these innovations, each NIPS before us has done the same thing (I’m a particular fan of Zoubin and Max’s publication of the reviews). Whether we did this well or not, like those borderline papers, is a matter for academic judgement. In the meantime I (personally) will continue to try to enjoy NIPS for what it is, whilst wondering about what it could be and how we might get there. I also know that as a community we will continue to innovate, launching new conferences with new models for reviewing (like ICLR).

Reviewer Calibration for NIPS

One issue that can occur for a conference is differences in interpretation of the reviewing scale. For a number of years (dating back to at least NIPS 2002) mis-calibration between reviewers has been corrected for with a model. Area chairs see not just the actual scores of the paper, but also ‘corrected scores’. Both are used in the decision making process.

Reviewer calibration at NIPS dates back to a model first implemented in 2002 by John Platt when he was an area chair. It’s a regularized least squares model that Chris Burges and John wrote up in 2012. They’ve kindly made their write up available here.

Calibrated scores are used alongside original scores to help in judging the quality of papers.

We also knew that Zoubin and Max had modified the model last year, along with their program manager Hong Ge. However, before going through the previous work we first of all approached the question independently. However, the model we came up with turned out to be pretty much identical to that of Hong, Zoubin and Max, and the approach we are using to compute probability of accepts was also identical. The model is a probabilistic reinterpretation of the Platt and Burges model: one that treats the bias parameters and quality parameters as latent variables that are normally distributed. Marginalizing out the latent variables leads to an ANOVA style description of the data.

The Model

Our assumption is that the score from the jth reviewer for the ith paper is given by

y_{i,j} = f_i + b_j + \epsilon_{i, j}

where f_i is the objective quality of paper i and b_j is an offset associated with reviewer j. \epsilon_{i,j} is a subjective quality estimate which reflects how a specific reviewer’s opinion differs from other reviewers (such differences in opinion may be due to differing expertise or perspective). The underlying ‘objective quality’ of the paper is assumed to be the same for all reviewers and the reviewer offset is assumed to be the same for all papers.

If we have n papers and m reviewers then this implies n + m + nm values need to be estimated. Of course, in practice, the matrix is sparse, and we have no way of estimating the subjective quality for paper-reviewer pairs where no assignment was made. However, we can firstly assume that the subjective quality is drawn from a normal density with variance \sigma^2

\epsilon_{i, j} \sim N(0, \sigma^2 \mathbf{I})

which reduces us to n + m + 1 parameters. The Platt-Burges model then estimated these parameters by regularized least squares. Instead, we follow Zoubin, Max and Hong’s approach of treating these values as latent variables. We assume that the objective quality, f_i, is also normally distributed with mean \mu and variance \alpha_f,

f_i \sim N(\mu, \alpha_f)

this now reduces us to $m$+3 parameters. However, we only have approximately $4m$ observations (4 papers per reviewer) so parameters may still not be that well determined (particularly for those reviewers that have only one review). We therefore also assume that the reviewer offset is a zero mean normally distributed latent variable,

b_j \sim N(0, \alpha_b),

leaving us only four parameters: \mu, \sigma^2, \alpha_f and \alpha_b. When we combine these assumptions together we see that our model assumes that any given review score is a combination of 3 normally distributed factors: the objective quality of the paper (variance \alpha_f), the subjective quality of the paper (variance \sigma^2) and the reviewer offset (variance \alpha_b). The a priori marginal variance of a reviewer-paper assignment’s score is the sum of these three components. Cross-correlations between reviewer-paper assignments occur if either the reviewer is the same (when the cross covariance is given by \alpha_b) or the paper is the same (when the cross covariance is given by $\alpha_f$). With a constant mean coming from the mean of the ‘subjective quality’, this gives us a joint model for reviewer scores as follows:

\mathbf{y} \sim N(\mu \mathbf{1}, \mathbf{K})

where \mathbf{y} is a vector of stacked scores $\mathbf{1}$ is the vector of ones and the elements of the covariance function are given by

k(i,j; k,l) = \delta_{i,k} \alpha_f + \delta_{j,l} \alpha_b + \delta_{i, k}\delta_{j,l} \sigma^2

where i and j are the index of the paper and reviewer in the rows of \mathbf{K} and k and l are the index of the paper and reviewer in the columns of \mathbf{K}.

It can be convenient to reparameterize slightly into an overall scale $\alpha_f$, and normalized variance parameters,

k(i,j; k,l) = \alpha_f(\delta_{i,k} + \delta_{j,l} \frac{\alpha_b}{\alpha_f} + \delta_{i, k}\delta_{j,l} \frac{\sigma^2}{\alpha_f})

which we rewrite to give two ratios: offset/objective quality ratio, \hat{\alpha}_b and subjective/objective ratio \hat{\sigma}^2 ratio.

k(i,j; k,l) = \alpha_f(\delta_{i,k} + \delta_{j,l} \hat{\alpha}_b + \delta_{i, k}\delta_{j,l} \hat{\sigma}^2)

The advantage of this parameterization is it allows us to optimize \alpha_f directly through maximum likelihood (with a fixed point equation). This leaves us with two free parameters, that we might explore on a grid.

We expect both $\mu$ and $\alpha_f$ to be very well determined due to the number of observations in the data. The negative log likelihood is

\frac{|\mathbf{y}|}{2}\log2\pi\alpha_f + \frac{1}{2}\log \left|\hat{\mathbf{K}}\right| + \frac{1}{2\alpha_f}\mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}

where |\mathbf{y}| is the length of \mathbf{y} (i.e. the number of reviews) and \hat{\mathbf{K}}=\alpha_f^{-1}\mathbf{K} is the scale normalised covariance. This negative log likelihood is easily minimized to recover

\alpha_f = \frac{1}{|\mathbf{y}|} \mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}

A Bayesian analysis of alpha_f parameter is possible with gamma priors, but it would merely shows that this parameter is extremely well determined (the degrees of freedom parameter of the associated Student-t marginal likelihood scales will the number of reviews, which will be around |\mathbf{y}| \approx 6,000 in our case.

We can set these parameters by maximum likelihood and then we can remove the offset from the model by computing the conditional distribution over the paper scores with the bias removed, s_{i,j} = f_i + \epsilon_{i,j}. This conditional distribution is found as

\mathbf{s}|\mathbf{y}, \alpha_f,\alpha_b, \sigma^2 \sim N(\boldsymbol{\mu}_s, \boldsymbol{\Sigma}_s)


\boldsymbol{\mu}_s = \mathbf{K}_s\mathbf{K}^{-1}\mathbf{y}


\boldsymbol{\Sigma}_s = \mathbf{K}_s - \mathbf{K}_s\mathbf{K}^{-1}\mathbf{K}_s

and \mathbf{K}_s is the covariance associated with the quality terms only with elements given by,

k_s(i,j;k,l) = \delta_{i,k}(\alpha_f + \delta_{j,l}\sigma^2).

We now use \boldsymbol{\mu}_s (which is both the mode and the mean of the posterior over \mathbf{s}) as the calibrated quality score.

Analysis of Variance

The model above is a type of Gaussian process model with a specific covariance function (or kernel). The variances are highly interpretable though, because the covariance function is made up of a sum of effects. Studying these variances is known as analysis of variance in statistics, and is commonly used for batch effects. It is known as an ANOVA model. It is easy to extend this model to include batch effects such as whether or not the reviewer is a student or whether or not the reviewer has published at NIPS before. We will conduct these analyses in due course. Last year, Zoubin, Max and Hong explored whether the reviewer confidence could be included in the model, but they found it did not help with performance on hold out data.

Scatter plot of Quality Score vs Calibrated Quality Score

Scatter plot of Quality Score vs Calibrated Quality Score

Probability of Acceptance

To predict the probability of acceptance of any given paper, we sample from the multivariate normal that gives the posterior over \mathbf{s}. These samples are sorted according to the values of \mathbf{s}, and the top scoring papers are considered to be accepts. These samples are taken 1000 times and the probability of acceptance is computed for each paper by seeing how many times the paper received a positive outcome from the thousand samples.