Legislation for Personal Data: Magna Carta or Highway Code?

Karl Popper is perhaps one of the most important thinkers from the 20th century. Not purely for his philosophy of science, but for giving a definitive answer to a common conundrum: “Which comes first, the chicken or the egg?”. He says that they were simply preceded by an ‘earlier type of egg’. I take this to mean that the answer is neither: they actually co-evolved. What do I mean by co-evolved? Well broadly speaking there once were two primordial entities which weren’t very chicken-like or egg-like at all, over time small changes occurred, supported by natural selection, rendering those entities unrecognisable from their origins into two of our most familiar foodstuffs of today.

I find the process of co-evolution remarkable, and to some extent unimaginable, or certainly it seems to me difficult to visualise the intermediate steps. Evolution occurs by natural selection: selection by the ‘environment’, but when we refer to co-evolution we are clarifying that this is a complex interaction. The primordial entities effect the environment around them, therefore changing the ‘rules of the game’ as far as survival is concerned. In such a convolved system certainties about the right action disappear very quickly.

What use are chickens and eggs when talking about personal data? Well, Popper used the question to illustrate a point about scientific endeavour. He was talking about science and reflecting on how scientific theories co-evolve with experiments. However, that’s not the point I’d like to make here. Co-evolution is very general, one area it arises is when technological advance changes society to such an extent that existing legislative frameworks become inappropriate. Tim Berners Lee has called for a Magna Carta for the digital age, and I think this is a worthy idea, but is it the right idea? A digital bill of rights may be the right idea in the longer run, but I don’t think we are ready to draft it yet. My own research is machine learning, the main technology underpinning the current AI revolution. A combination of machine learning, fast computers, and interconnected data means that the technological landscape is changing so fast that it is effecting society around us in ways that no one envisaged twenty years ago.

Even if we were to start with the primordial entities that presaged the chicken and the egg, and we knew all about the process of natural selection, could we have predicted or controlled the animal of the future that would emerge? We couldn’t have done. The chicken exists today as the product of its environmental experience, an experience that was unique to it. The end point we see is one of is highly sensitive to very small perturbations that could have occurred at the beginning.

So should we be writing legislation today which ties down the behaviour of future generations? There is precedent for this from the past. Before the printing press was introduced, no one would have begrudged the monks’ right to laboriously transcribe the books of the day. Printing meant it was necessary to protect the “copy rights” of the originator of the material. No one could have envisaged that those copyright laws would also be used to protect software, or digital music. In the industrial revolution the legal mechanism of ‘letters patent’ evolved to protect creative insight. Patents became protection of intellectual property, ensuring that inventors’ ideas could be shared under license. These mechanisms also protect innovation in the digital world. In some jurisdictions they are now applied to software and even user interface designs. Of course even this legislation is stretched in the face of digital technology and may need to evolve, as it has done in the past.

The new legislative challenge is not in protecting what is innovative about people, but what is commonplace about them. The new value is in knowing the nature of people: predicting their needs and fulfilling them. This is the value of interconnection of personal data. It allows us to make predictions about an individual by comparing him or her to others. It is the mainstay of the modern internet economy: targeted advertising and recommendation systems. It underpins my own research ideas in personalisation of health treatments and early diagnosis of disease. But it leads to potential dangers, particularly where the uncontrolled storage and flow of an individual’s personal information is concerned. We are reaching the point where some studies are showing that computer prediction of our personality is more accurate than that of our friends and relatives. How long before an objective computer prediction of our personality can outperform our own subjective assessment of ourselves? Some argue those times are already upon us. It feels dangerous for such power to be wielded unregulated by a few powerful groups. So what is the answer? New legislation? But how should it come about?

In the long term, I think we need to develop a set of rules and legislation, that include principles that protect our digital rights. I think we need new models of ownership that allow us to control our private data. One idea that appeals to me is extending data protection legislation with the right not only to view data held about us, but to also ask for it to be deleted. However, I can envisage many practical problems with that idea, and these need to be resolved so we can also enjoy the benefits of these personalised predictions.

As wonderful as some of the principles in the Magna Carta are, I don’t think it provides a good model for the introduction of modern legislation. It was actually signed under duress: under a threat of violent revolution. The revolution was threatened by a landed gentry, although the consequences would have been felt by all. Revolutions don’t always end well. They occur because people can become deadlocked: they envisage different futures for themselves and there is no way to agree on a shared path to different end points. The Magna Carta was also a deal between the king and his barons. Those barons were asking for rights that they had no intention of extending within their fiefdoms. These two characteristics: redistribution of power amongst a powerful minority, with significant potential consequences for the a disenfranchised majority, make the Magna Carta, for me, a poor analogy for how we would like things to proceed.

The chicken and the egg remind us that the actual future will likely be more remarkable than any of us can currently imagine. Even if we all seek a particular version of the future this version of the future is unlikely to ever exist in the form that we imagine. Open, receptive and ongoing dialogue between the interested and informed parties is more likely to bring about a societal consensus. But can this happen in practice? Could we really evolve a set of rights and legislative principles which lets us achieve all our goals? I’d like to propose that rather than taking as our example a mediaeval document, written on velum, we look to more recent changes in society and how they have been handled. In England, the Victorians may have done more than anyone to promote our romantic notion of the Magna Carta, but I think we can learn more by looking at how they dealt with their own legislative challenges.

I live in Sheffield, and cycle regularly in the Peak District national park. Enjoyment of the Peak Park is not restricted to our era. At 10:30 on Easter Monday in 1882 a Landau carriage, rented by a local cutler, was heading on a day trip from Sheffield to the village of Tideswell, in the White Peak. They’d left Sheffield via Ecclesall Road, and as they began to descend the road beneath Froggatt Edge, just before the Grouse Inn they encountered a large traction engine towing two trucks of coal. The Landau carriage had two horses and had been moving at a brisk pace of four and a half miles an hour. They had already passed several engines on the way out of Sheffield. However, as they moved out to pass this one, it let out a continuous blast of steam and began to turn across their path into the entrance of the inn. One of the horses took fright pulling the carriage up a bank, throwing Ben Deakin Littlewood and Mary Coke Smith from the carriage and under the wheels of the traction engine. I cycle to work past their graves every day. The event was remarkable at the time, so much so that is chiselled into the inscription on Ben’s grave.

The traction engine was preceded, as legislation since 1865 had dictated, by a boy waving a red flag. It was restricted to two and a half miles an hour. However, the boy’s role was to warn oncoming traffic. The traction engine driver had turned without checking whether the road was clear of overtaking traffic. It’s difficult to blame the driver though. I imagine that there was quite a lot involved in driving a traction engine in 1882. It turned out that the driver was also preoccupied with a broken wheel on one of his carriages. He was turning into the Grouse to check the wheel before descending the road.

This example shows how legislation can sometimes be extremely restrictive, but still not achieve the desired outcome. Codification of the manner in which a vehicle should be overtaken came later, at a time when vehicles were travelling much faster. The Landau carriage was overtaking about 100 meters after a bend. The driver of the traction engine didn’t check over his shoulder immediately before turning, although he claimed he’d looked earlier. Today both drivers’ responsibilities are laid out in the “Highway Code”. There was no “Mirror, Signal, Manoeuvre” in 1882. That came later alongside other regulations such as road markings and turn indicators.

The shared use of our road network, and the development of the right legislative framework might be a good analogy for how we should develop legislation for protecting our personal privacy. No analogy is ever perfect, but it is clear that our society both gained and lost through introduction of motorised travel. Similarly, the digital revolution will bring advantages but new challenges. We need to have mechanisms that allow for negotiated solutions. We need to be able to argue about the balance of current legislation and how it should evolve. Those arguments will be driven by our own personal perspectives. Our modern rules of the road are in the Highway Code. It lists responsibilities of drivers, motorcyclists, cyclists, mobility scooters, pedestrians and even animals. It gives legal requirements and standards of expected behaviour. The Highway Code co-evolved with transport technology: it has undergone 15 editions and is currently being rewritten to accommodate driverless cars. Even today we still argue about the balance of this document.

In the long term, when technologies have stabilised, I hope we will be able to distill our thinking to a bill of rights for the internet. But such a document has a finality about it which seems inappropriate in the face of technological uncertainty. Calls for a Magna Carta provide soundbites that resonate and provide rallying points. But they can polarise, presaging unhelpful battles. Between the Magna Carta and the foundation of the United States the balance between the English monarch and his subjects was reassessed through the English Civil War and the American Revolution. Wven tI don’t think we can afford such discord when drafting the rights of the digital age. We need mechanisms that allow for open debate, rather than open battle. Before a bill of rights for the internet, I think we need a different document. I’d like to sound the less resonant call for a document that allows for dialogue, reflecting concerns as they emerge. It could summarise current law and express expected standards of behaviour. With regular updating it would provide an evolving social contract between all the users of the information highway: people, governments, businesses, hospitals, scientists, aid organisations. Perhaps instead of a Magna Carta for the internet we should start with something more humble: the rules of the digital road.

This blog post is an extended version of an written for the Guardian’s media network: “Let’s learn the rules of the digital road before talking about a web Magna Carta”

Proceedings of Machine Learning Research

Back in 2006 when the wider machine learning community was becoming aware of Gaussian processes (mainly through the publication of the Rasmussen and WIlliams book). Joaquin Quinonero Candela, Anton Schwaighofer and I organised the Gaussian Processes in Practice workshop at Bletchley Park. We planned a short proceedings for the workshop, but when I contacted Springer’s LNCS proceedings, a rather dismissive note came back with an associated prohibitive cost. Given that the ranking of LNCS wasn’t (and never has been) that high, this seemed a little presumptuous on their part. In response I contacted JMLR and asked if they’d ever considered a proceedings track. The result was that I was asked by Leslie Pack Kaelbling to launch the proceedings track.

JMLR isn’t just open access, but there is no charge to authors. It is hosted by servers at MIT and managed by the community.

We launched the proceedings in March 2007 with the first volume from the Gaussian Processes in Practice workshop. Since then there have been 38 volumes including two volumes in the pipeline. The proceedings publishes several leading conferences in machine learning including AISTATS, COLT and ICML.

From the start we felt that it was important to share the branding of JMLR with the proceedings, to show that the publication was following the same ethos as JMLR. However, this led to the rather awkward name: JMLR Workshop and Conference Proceedings, or JMLR W&CP. Following discussion with the senior editorial board of JMLR we now feel the time is right to rebrand with the shorter “Proceedings of Machine Learning Research”.

As part of the rebranding process the editorial team for the Proceedings of Machine Learning Research (which consists of Mark Reid and myself) is launching a small consultation exercise looking for suggestions on how we can improve the service for the community. Please feel free to leave comments on this blog post or via Facebook or Twitter to let us have feedback!

Beware the Rise of the Digital Oligarchy

The Guardian’s media network published a short article I wrote for them on 5th March. They commissioned an article of about 600 words, that appeared on the Guardian’s site, but the original version I wrote was around 1400. I agreed a week’s exclusivity with the Guardian, but now that’s up, the longer version is below (it’s about twice as long).

On a recent visit to Genova, during a walk through the town with my colleague Lorenzo, he pointed out what he said was the site of the world’s first commercial bank. The bank of St George, located just outside the city’s old port, grew to be one of the most powerful institutions in Europe, it bankrolled Charles V and governed many of Genova’s possessions on the republic’s behalf. The trust that its clients placed in the bank is shown in records of its account holders. There are letters from Christopher Columbus to the bank instructing them in the handling of his affairs. The influence of the bank was based on the power of accumulated capital. Capital they could accumulate through the trust of a wealthy client base. The bank was so important in the medieval world that Machiavelli wrote that “if even more power was ceded by the Genovan republic to the bank, Genova would even outshine Venice amongst the Italian city states.” The Bank of St George was once one of the most influential private institutions in Europe.

Today the power wielded by accumulated capital can still dominate international affairs, but a new form of power is emerging, that of accumulated data. Like Hansel and Grettel trailing breadcrumbs into the forest, people now leave a trail of data-crumbs wherever we travel. Supermarket loyalty cards, text messages, credit card transactions, web browsing and social networking. The power of this data emerges, like that of capital, when it’s accumulated. Data is the new currency.

Where does this power come from? Cross linking of different data sources can give deep insights into personality, health, commercial intent and risk. The aim is now to understand and characterize the population, perhaps down to the individual level. Personalization is the watch word for your search results, your social network news feed, your movie recommendations and even your friends. This is not a new phenomenon, psychologists and social scientists have always attempted to characterize the population, to better understand how to govern or who to employ. They acquired their data by carefully constructed questionnaires to better understand personality and intelligence. The difference is the granularity with which these characterizations are now made, instead of understanding groups and sub-groups in the population, the aim is to understand each person. There are wonderful possibilities, we should  better understand health, give earlier diagnoses for diseases such as dementia and provide better support to the elderly and otherwise incapacitated people. But there are also major ethical questions, and they don’t seem to be adequately addressed by our current legal frameworks. For Columbus it was clear, he was the owner of the money in his accounts. His instructions to the bank tell them how to distribute it to friends and relations. They only held his capital under license. A convenient storage facility. Ownership of data is less clear. Historically, acquiring data was expensive: questionnaires were painstakingly compiled and manually distributed. When answering, the risk of revealing too much of ourselves was small because the data never accumulated. Today we leave digital footprints in our wake, and acquisition of this data is relatively cheap. It is the processing of the data that is more difficult.

I’m a professor of machine learning. Machine learning is the main technique at the heart of the current revolution in artificial intelligence. A major aim of our field is to develop algorithms that better understand data: that can reveal the underlying intent or state of health behind the information flow. Already machine learning techniques are used to recognise faces or make recommendations, as we develop better algorithms that better aggregate data, our understanding of the individual also improves.

What do we lose by revealing so much of ourselves? How are we exposed when so much of our digital soul is laid bare? Have we engaged in a Faustian pact with the internet giants? Similar to Faust, we might agree to the pact in moments of levity, or despair, perhaps weakened by poor health. My father died last year, but there are still echoes of him on line. Through his account on Facebook I can be reminded of his birthday or told of common friends. Our digital souls may not be immortal, but they certainly outlive us. What we choose to share also affects our family: my wife and I may be happy to share information about our genetics, perhaps for altruistic reasons, or just out of curiosity. But by doing so we are also sharing information about our children’s genomes. Using a supermarket loyalty card gains us discounts on our weekly shop, but also gives the supermarket detailed information about our family diet. In this way we’d expose both the nature and nurture of our children’s upbringing. Will our decisions to make this information available haunt our children in the future? Are we equipped to understand the trade offs we make by this sharing?

There have been calls from Elon Musk, Stephen Hawking and others to regulate artificial intelligence research. They cite fears about autonomous and sentient artificial intelligence that  could self replicate beyond our control. Most of my colleagues believe that such breakthroughs are beyond the horizon of current research. Sentient intelligence is  still not at all well understood. As Ryan Adams, a friend and colleague based at Harvard tweeted:

Personally, I worry less about the machines, and more about the humans with enhanced powers of data access. After all, most of our historic problems seem to have come from humans wielding too much power, either individually or through institutions of government or business. Whilst sentient AI does seem beyond our horizons, one aspect of it is closer to our grasp. An aspect of sentient intelligence is ‘knowing yourself’, predicting your own behaviour. It does seem to me plausible that through accumulation of data computers may start to ‘know us’ even better than we know ourselves. I think that one concern of Musk and Hawking is that the computers would act autonomously on this knowledge. My more immediate concern is that our fellow humans, through the modern equivalents of the bank of St George, will be exploiting this knowledge leading to a form of data-oligarchy. And in the manner of oligarchies, the power will be in the hands of very few but wielded to the effect of many.

How do we control for all this? Firstly, we need to consider how to regulate the storage of data. We need better models of data-ownership. There was no question that Columbus was the owner of the money in his accounts. He gave it under license, and he could withdraw it at his pleasure. For the data repositories we interact with we have no right of deletion. We can withdraw from the relationship, and in Europe data protection legislation gives us the right to examine what is stored about us. But we don’t have any right of removal. We cannot withdraw access to our historic data if we become concerned about the way it might be used. Secondly, we need to increase transparency. If an algorithm makes a recommendation for us, can we known on what information in our historic data that prediction was based? In other words, can we know how it arrived at that prediction? The first challenge is a legislative one, the second is both technical and social. It involves increasing people’s understanding of how data is processed and what the capabilities and limitations of our algorithms are.

There are opportunities and risks with the accumulation of data, just as there were (and still are) for the accumulation of capital. I think there are many open questions, and we should be wary of anyone who claims to have all the answers. However, two directions seem clear: we need to both increase the power of the people; we need to develop their understanding of the processes. It is likely to be a fraught process, but we need to form a data-democracy: data governance for the people by the people and with the people’s consent.

Neil Lawrence is a Professor of Machine Learning at the University of Sheffield. He is an advocate of “Open Data Science” and an advisor to a London based startup, CitizenMe, that aims to allow users to “reclaim their digital soul”.

Blogs on the NIPS Experiment

There are now quite a few blog posts on the NIPS experiment, I just wanted to put a place together where I could link to them all. It’s a great set of posts from community mainstays, newcomers and those outside our research fields.

Just as a reminder, Corinna and I were extremely open about the entire review process, with a series of posts about how we engaging the reviewers and processing the data. All that background can be found through a separate post here.

At the time of writing there is also still quite a lot of twitter traffic on the experiment.

List of Blog Posts

What an exciting series of posts and perspectives!
For those of you that couldn’t make the conference, here’s what it looked like.
And that’s just one of 5 or six poster rows!

NIPS Reviewer Recruitment and ‘Experience’

Triggered by a question from Christoph Lampert as a comment on a previous blog post on reviewer allocation, I thought I’d post about how we did reviewer recruitment, and what the profile of reviewer ‘experience’ is, as defined by their NIPS track record.

I wrote this blog post, but it ended up being quite detailed, so Corinna suggested I put the summary of reviewer recruitment first, which makes a lot of sense. If you are interested in the details of our reviewer recruitment, please read on to the section below ‘Experience of the Reviewing Body’.

Questions

As a summary, I’ve imagined two questions and given answers below:

  1. I’m an Area Chair for NIPS, how did I come to be invited?
    You were personally known to one of the Program Chairs as an expert in your domain who had good judgement about the type and quality of papers we are looking to publish at NIPS. You have a strong publication track record in your domain. You were known to be reliable and responsive. You may have a track record of workshop organization in your domain and/or experience in area chairing previously at NIPS or other conferences. Through these activities you have shown community leadership.
  2. I’m a reviewer for NIPS, how did I come to be invited?
    You could have been invited for one of several reasons:

    • you were a reviewer for NIPS in 2013
    • you were a reviewer for AISTATS in 2012
    • you were personally recommended by an Area Chair or a Program Chair
    • you have been on a Program Committee (i.e. you were an Area Chair) at a leading international conference in recent years (specifically NIPS since 2000, ICML since 2008, AISTATS since 2011).
    • you have published 2 or more papers at NIPS since 2007
    • you published at NIPS in either 2012 or 2013 and your publication track record was personally reviewed and approved by one of the Program Chairs.

Experience of The Reviewing Body

That was the background to Reviewer and Area Chair recruitment, and it is also covered below, in much more detail than perhaps anyone could wish for! Now, for those of you that have gotten this far, we can try and look at the result in terms of one way of measuring reviewer experience. Our aim was to increase the number of reviewers and try and maintain or increase the quality of the reviewing body. Of course quality is subjective, but we can look at things such as reviewer experience in terms of how many NIPS publications they have had. Note that we have purposefully selected many reviewers and area chairs who have never previously published at NIPS, so this is clearly not the only criterion for experience, but it is one that is easily available to us and given Christoph’s question, the statistics may be of wider interest.

Reviewer NIPS Publication Record

Firstly we give the histograms for cumulative reviewer publications. We plot two histograms, publications since 2007 (to give an idea of long term trends) and publications since 2012 (to give an idea of recent trends).

Reviewer Publications 2007

Histogram of NIPS 2014 reviewers publication records since 2007.

Our most prolific reviewer has published 22 times at NIPS since 2007! That’s an average of over 3 per year (for comparison, I’ve published 7 times at NIPS since 2007).

Looking more recently we can get an idea of the number of NIPS publications reviewers have had since 2012.

Histogram of NIPS 2014 reviewers publication records since 2012.

Impressively the most prolific reviewer has published 10 papers at NIPS over the last two years, and intriguingly it is not the same reviewer that has published 22 times since 2007. The mode of 0 reviews is unsurprising, and comparing the histograms it looks like about 200 of our reviewing body haven’t published in the last two years, but have published at NIPS since 2007.

Area Chair Publication Record

We have got similar plots for the Area Chairs. Here is the histogram since 2007.

Area Chair Publications 2007

Histogram of NIPS 2014 Area Chair’s publication records since 2007.

Note that we’ve selected 16 Area Chairs who haven’t published at NIPS before. People who aren’t regular to NIPS may be surprised at this, but I think it reflects the openness of the community to other ideas and new directions for research. NIPS has always been a crossroads between traditional fields, and that is one of it’s great charms. As a result, NIPS publication record is a poor proxy for ‘experience’ where many of our area chairs are concerned.

Looking at the more recent publication track record for Area Chairs we have the following histogram.

Area Chair Publications 2012

Histogram of NIPS 2014 Area Chair’s publication records since 20012.

Here we see that a considerable portion of our Area Chairs haven’t published at NIPS in the last two years. I also find this unsurprising. I’ve only published one paper at NIPS since then (that was NIPS 2012, the groups’ NIPS 2013 submissions were both rejected—although I think my overall ‘hit rate’ for NIPS success is still around 50%).

Details of the Recruitment Process

Below are all the gritty details in terms of how things actually panned out in practice for reviewer recruitment. This might be useful for other people chairing conferences in the future.

Area Chair Recruitment

The first stage is invitation of area chairs. To ensure we got the correct distribution of expertise in area chairs, we invited in waves. Max and Zoubin gave us information about the subject distribution of the previous year’s NIPS submissions. This then gave us a rough number of area chairs required for each area. We had compiled a list of 99 candidate area chairs by mid January 2014, coverage here matched the subject coverage from the previous year’s conference. The Area Chairs are experts in their field, the majority of the Area Chairs are people that either Corinna or I have worked with directly or indirectly, others have a long track record of organising workshops and demonstrating thought leadership in their subject area. It’s their judgement on which we’ll be relying for paper decisions. As capable and active researchers they are in high demand for a range of activities (journal editing, program chairing other conferences, organizing workshops etc). This combined with the demands on our everyday lives (including family illnesses, newly born children etc) mean that not everyone can accept the demands on time that being an area chair makes. As well as being involved in reviewer recruitment, assignment and paper discussion. Area chairs need to be available for video conference meetings to discuss their allocation and make final recommendations on their papers. All this across periods of the summer when many are on vacation. Of our original list of 99 invites, 56 were available to help out. This then allowed us to refocus on areas where we’d missed out on Area Chairs. By early March we had a list of 57 further candidate area chairs. Of these 36 were available to help out. Finally we recruited a further 3 Area Chairs in early April, targeted at areas where we felt we were still short of expertise.

Reviewer Recruitment

Reviewer recruitment consists of identifying suitable people and inviting them to join the reviewing body. This process is completed in collaboration with the Area Chairs, who nominate reviewers in their domains. For NIPS 2014 we were targeting 1400 reviewers to account for our duplication of papers and the anticipated increase in submissions. There is no unified database of machine learning expertise, and the history of who reviewed in what years for NIPS is currently not recorded. This means that year to year, we are typically only provided with those people that agreed to review in the previous year as our starting point for compiling this list. From February onwards Corinna and I focussed on increasing this starting number. NIPS 2013 had 1120 reviewers and 80 area chairs, these names formed the core starting point for invitations. Further,  since I program chaired AISTATS in 2012 we also had the list of reviewers who’d agreed to review for that conference (400 reviewers, 28 area chairs). These names were also added to our initial list of candidate reviewers (although, of course, some of these names had already agreed to be area chairs for NIPS 2014 and there were many duplicates in the lists).

Sustaining Expertise in the Reviewing Body

A major concern for Corinna and I was to ensure that we had as much expertise in our reviewing body as possible. Because of the way that reviewer names are propagated from year to year, and the fact that more senior people tend to be busier and therefore more likely to decline, many well known researcher names weren’t in this initial list. To rectify this we took from the web the lists of Area Chairs for all previous NIPS conferences going back to 2000, all ICML conferences going back to 2008 and all AISTATS conferences going back to 2011. We could have extended this search to COLT, COSYNE and UAI also. Back in 2000 there were only 13 Area Chairs at NIPS, by the time that I first did the job in 2005 there were 19 Area Chairs. Corinna and I worked together at the last Program Committee to have a physical meeting in 2006 when John Platt was Program Chair. I remember having an above-average allocation of about 50-60 papers as Area Chair that year. I had papers on Gaussian processes (about 20) and many more in dimensionality reduction, mainly on spectral approaches. Corinna also had a lot of papers that year because she was dealing with kernel methods. Although I think a more typical load was 30-40, and reviewer load was probably around 6-8. The physical meeting consisted of two days in a conference room discussing every paper in turn as a full program committee.  That was also the last year of a single program chair. The early NIPS program committees mainly read as a “who’s who of machine learning”, and it sticks in my mind how carefully each chair went through each of the papers that were around the borderline of acceptance. Many papers were re-read at that meeting. Overall 160 new names were added to the list of candidate reviewers from incorporating the Area Chairs from these meetings, giving us around 1600 candidate reviewers in total. Note that the sort of reviewing expertise we are after is not only the technical expertise necessary to judge the correctness of the paper. We are looking for reviewers that can judge whether the work is going to be of interest to the wider NIPS community and whether the ideas in the work are likely to have significant impact. The latter two areas are perhaps more subjective, and may require more experience than the first. However, the quality of papers submitted to NIPS is very high, and the number that are technically correct is a very large portion of those submitted. The objective of NIPS is not then to select those papers that are the ‘most technical’, but to select those papers that are likely to have an influence on the field. This is where understanding of likely impact is so important. To this end, Max and Zoubin introduced an ‘impact’ score, with the precise intent of reminding reviewers to think about this aspect. However, if the focus is too much on the technical side, then maybe a paper that is highly complex from a technical stand-point, but less unlikely to have an influence on the direction of the field, is more likely to be accepted than a paper that contains a potentially very influential idea that doesn’t present a strong technical challenge. Ideally then, a paper should have a distribution of reviewers who aren’t purely experts in the particular technical domain from where the paper arises, but also informed experts in the wider context of where the paper sits. The role of the Area Chair is also important here. The next step in reviewer recruitment was to involve the Area Chairs in adding to the list in areas where we had missed people. This is also an important route for new and upcoming NIPS researchers to become involved in reviewing. We provided Area Chairs with access to the list of candidate reviewers and asked them to add names of experts who they would like to recruit, but weren’t currently in the list. This led to a further 220 names.

At this point we had also begun to invite reviewers. Reviewer invitation was done in waves. We started with the first wave of around 1600-1700 invites in mid-April. At that point, the broad form of the Program Committee was already resolved. Acceptance rates for reviewer invites indicated that we weren’t going to hit our target of 1400 reviewers with our candidate list. By the end of April we had around 1000 reviewers accepted, but we were targeting another 400 reviewers to ensure we could keep reviewer load low.

A final source of candidates was from Chris Hiestand. Chris maintains the NIPS data base of authors and presenters on behalf of the NIPS foundation. This gave us another potential source of reviewers. We considered all authors that had 2 or more NIPS papers since 2007. We’d initially intended to restrict this number to 3, but that gained us only 91 more new candidate reviewers (because most of the names were in our candidate list already), relaxing this constraint to 2 led to 325 new candidate reviewers. These additional reviewers were invited at the end of April. However, even with this group, were likely to fall short of our target.

Our final group of reviewers came from authors who published either at NIPS 2013 or NIPS 2012. However, authors that have published only one paper are not necessarily qualified to review at NIPS. For example, the author may be a collaborator from another field. There were 697 authors who had one NIPS paper in 2012 or 2013 and were not in our current candidate list. For these 697 authors, we felt it was necessary to go through each author individually, checking their track record on through web searches (DBLP and Google Scholar as well as web pages) and ensuring they had the necessary track record to review for NIPS. This process resulted in an additional 174 candidate reviewer names. The remainder we either were unable to identify on the web (169 people) or they had a track record where we couldn’t be confident about their ability to review for NIPS without a personal recommendation (369 people).  This final wave of invites went out at the beginning of May and also included new reviewer suggestions from Area Chairs and invites to candidate Area Chairs who had not been able to commit to Area Chairing, but may have been able to commit to reviewing. Again, we wanted to ensure the expertise of the reviewing body was as highly developed as possible.

This meant that by the submission deadline we had 1390 reviewers in the system. On 15th July this number has increased slightly. This is because during paper allocation, Area Chairs have recruited additional specific reviewers to handle particular papers where they felt that the available reviewers didn’t have the correct expertise. This means that currently, we have 1400 reviewers exactly. This total number of reviewers comes from around 2255 invitations to review.

Overall, reviewer recruitment took up a very large amount of time, distributed over many weeks. Keeping track of who had been invited already was difficult, because we didn’t have a unique ID for our candidate reviewers. We have a local SQLite data base that indexes on email, and we try to check for matches based on names as well. Most of these checks are done in Python code which is now available on the github repository here, along with IPython notebooks that did the processing (with identifying information removed). Despite care taken to ensure we didn’t add potential reviewers twice to our data base, several people received two invites to review. Very often, they also didn’t notice that they were separate invites, so they agreed to review twice for NIPS. Most of these duplications were picked up at some point before paper allocation and they tended to arise for people whose names could be rendered in multple ways (e.g. because of accents)  who have multiple email addresses (e.g. due to change of affiliation).

 

Firstly, NIPS uses the CMT system for conference management. In an ideal world, choice of management system shouldn’t dictate how you do things, but in practice particularities of the system can affect our choices. CMT doesn’t store a uniques profile for conference reviewers (unlike for example EasyChair which stores every conference you’ve submitted to or reviewed/chaired for). This means that from year to year information about the previous years reviewers isn’t necessarily passed in a consistent way between program chairs. Corinna and I requested that the CMT set up for our year copied across the reviewers from NIPS 2013 along with their subject areas and conflicts to try and alleviate this. The NIPS program committee in 2013 consisted of 1120 reviewers and 80 area chairs. Corinna and I set a target of 1400 reviewers and 100 area chairs. This was to account for (a) increase in submissions of perhaps 10% and (b) duplication of papers for independent reviewing at a level of around 10%.

Open Data Science

Not sure if this is really a blog post, it’s more of a ‘position paper’ or a proposal, but it’s something that I’d be very happy to have comment on, so publishing it in the form of a blog seems most appropriate.

We are in the midst of the information revolution and it is being driven by our increasing ability to monitor, store, interconnect and analyse large interacting sets of data. Industrial mechanisation required a combination of coal and heat engine. Informational mechanisation requires the combination of data and data engines. By analogy with a heat engine, which takes high entropy heat energy, and converts it to low entropy, actionable, kinetic energy, a data engine is powered by large unstructured data sources and converts them to actionable knowledge. This can be achieved through a combination of mathematical and computational modelling and the combination of required skill sets falls across traditional academic boundaries.

Outlook for Compaines

From a commercial perspective companies are looking to characterise consumers/users in unprecedented detail. They need to characterize their users’ behavior in detail to

  1. provide better service to retain users,
  2. target those users with commercial opportunities.

These firms are competing for global dominance, to be the data repository. They are excited by the power of interconnected data, but made nervous about the natural monopoly that it implies. They view the current era as being analogous to the early days of ‘microcomputers’: competing platforms looking to dominate the market. They are nervous of the next stage in this process. They foresee the natural monopoly that the interconnectedness of data implies, and they are pursuing it with the vigour of a young Microsoft. They are paying very large fees to acquire potential competitors to ensure that they retain access to the data (e.g. Facebook’s purchase of Whatsapp for $19 billion) and they are acquiring expertise in the analysis of data from academia either through direct hires (Yann LeCun from NYU to Facebook, Andrew Ng from Stanford to found a $300 million Research Lab for Baidu) or purchasing academic start ups (Geoff Hinton’s DNNResearch from Toronto to Google, the purchase of DeepMind by Google for $400 million). The interest of these leading internet firms in machine learning is exciting and a sign of the major successes of the field, but it leaves a major challenge for firms that want to enter the market and either provide competing or introduce new services. They are debilitated by

  1. lack of access to data,
  2. lack of access to expertise.

 

Science

Science is far more evolved than the commercial world from the perspective of data sharing. Whilst its merits may not be
universally accepted by individual scientists, communities and funding agencies encourage widespread sharing. One of the most significant endeavours was the human genome project, now nearly 25 years old. In computational biology there is now widespread sharing of data and methodologies: measurement technology moves so quickly that an efficient pipeline for development and sharing is vital to ensure that analysis adapts to the rapidly evolving nature of the data (e.g. cDNA arrays to Affymetrix arrays to RNAseq). There are also large scale modelling and sharing challenges at the core of other disciplines such as astronomy (e.g. Sarah Bridle’s GREAT08 challenge for Cosmic Lensing) and climate science. However, for many scientists their access to these methodologies is restricted not by lack of availability of better methods, but through technical inaccessibility. A major challenge in science is bridging the gap between the data analyst and the scientist. Equipping the scientist with the fundamental concepts that will allow them to explore their own systems with a complete mathematical and computational toolbox, rather than being constrained by the provisions of a commercial ‘analysis toolbox’ software provider.

Health

Historically, in health, scientists have worked closely with clinicians to establish the cause of disease and, ideally, eradicate
them at source. Antibiotics and vaccinations have had major successes in this area. The diseases that remain are

  1. resulting from a large range of initial causes; and as a result having no discernible target for a ‘magic bullet’ cure (e.g. heart disease, cancers).
  2. difficult to diagnose at early stage, leading to identification only when progress is irreversible (e.g. dementias) or
  3. coevolving with our clinical advances developments to subvert our solutions (e.g. C difficile, multiple drug resistant tuberculosis).

Access to large scale interconnected data sources again gives the promise of a route to resolution. It will give us the ability to better characterize the cause of a given disease; the tools to monitor patients and form an early diagnosis of disease; and the biological
understanding of how disease agents manage to subvert our existing cures. Modern data allows us to obtain a very high resolution,
multifaceted perspective on the patient. We now have the ability to characterise their genotype (through high resolution sequencing) and their phenotype (through gene and protein expression, clinical measurements, shopping behaviour, social networks, music listening behaviour). A major challenge in health is ensuring that the privacy of patients is respected whilst leveraging this data for wider societal benefit in understanding human disease. This requires development of new methodologies that are capable of assimilating these information resources on population wide scales. Due to the complexity of the underlying system, the methodologies required are also more complex than the relatively simple approaches that are currently being used to, for example, understand commercial intent. We need more sophisticated and more efficient data engines.

International Development

The wide availability of mobile telephones in many developing countries provides opportunity for modes of development that differ considerably from the traditional paths that arose in the past (e.g. canals, railways, roads and fixed line telecommunications). If countries take advantage of these new approaches, it is likely that the nature of the resulting societies will be very different from those that arose through the industrial revolution. The rapid adoption of mobile money, which arguably places parts of the financial system in many sub-saharan African countries ahead of their apparently ‘more developed’ counterparts, illustrates what is possible. These developments are facilitated by low capital cost of deployment. They are reliant on the mobile telecommunications architecture and the widespread availability of handsets. The ease of deployment and development of mobile phone apps, and the rapidly increasing availability of affordable smartphone handsets presents opportunities that exploit the particular advantages of the new telecommunications ecosystem. A key strand to our thinking is that these developments can be pursued by local entrepeneurs and software developers (to see this in action check out the work of the AI-DEV group here). The two main challenges for enabling this to happen are mechanisms for data sharing that retain the individual’s control over their data and the education of local researchers and students. These aims are both facilitated by the open data science agenda.

Common Strands to these Challenges

The challenges described above have related strands to them that can be summarized in three areas:

  1. Access to data whilst balancing the individual’s right to privacy alongside the societal need for advance.
  2. Advancing methodologies: development of methodologies needed to characterize large interconnected complex data sets
  3. Analysis empowerment: giving scientists, clinicians, students, commercial and academic partners the ability to analyze their own data using the latest methodological advances.

The Open Data Science Idea

It now seems absurd to posit a ‘magic bullet cure’ for the challenges described above across such diverse fields, and indeed, the underlying circumstances of each challenge is sufficiently nuanced for any such sledge hammer to be brittle. However, we will attempt to describe a philosophical approach, that when combined with the appropriate domain expertise (whether that’s cultural, societal or technical)  will aim to address these issues in the long term.

Microsoft’s quasi-monopoly on desk top computing was broken by open source software. It has been estimated that the development cost of a full Linux system would be $10.8 billion dollars. Regardless of the veracity of this figure, we know that
several leading modern operating systems are based on open source (Android is based on Linux, OSX is based on FreeBSD). If it weren’t for open source software, then these markets would have been closed to Microsoft’s competitors due to entry costs. We can do much to celebrate the competition provided by OSX and Android and the contributions of Apple and Google in bringing them to market, but the enablers were the open source software community. Similarly, at launch both Google and Facebook’s architectures, for web search and social networking respectively, were entirely based on open source software and both companies have contributed informally and formally to its development.

Open data science aims to bring the same community resource assimilation together to capitalize on underlying social driver of this phenomenon: many talented people would like to see their ideas and work being applied for the widest benefit possible. The modern internet provides tools such as github, IPython notebook and reddit for easily distribution and comment on this material. In Sheffield we have started making our ideas available through these mechanisms. As academics in open data science part of our role should be to:

  1. Make new analysis methodologies available as widely and rapidly as possible with as few conditions on their use as possible
  2. Educate our commercial, scientific and medical partners in the use of these latest methodologies
  3. Act to achieve a balance between data sharing for societal benefit and the right of an individual to own their data.

We can achieve 1) through widespread distribution of our ideas under flexible BSD-like licenses that give commercial, scientific and medical partners as much flexibility to adapt our methods and analyses as possible to their own circumstances. We will achieve 2) through undergraduate courses, postgraduate courses, summer schools and widespread distribution of teaching materials. We will host projects from across the University from all departments. We will develop new programs of study that address the gaps in current expertise. Our actions regarding 3) will be to support and advise initiatives which look to return to the individual more control of their own data. We should do this simultaneously with engaging with the public on what the technologies behind data sharing are and how they will benefit society.

Summary

Open data science should be an inclusive movement that operates across traditional boundaries between companies and academia. It could bridge the technological gap between ‘data science’ and science. It could address the barriers to large scale analysis of health data and it will build bridges between academia and companies to ease access to methodologies and data. It will make our ideas publicly available for consumption by the individual; in developing countries, commercial organisations and public institutes.

In Sheffield we have already been actively pursuing this agenda through different strands: we have been making software available for over a decade, and now are doing so with extremely liberal licenses. We are running a series of Gaussian process summer schools, which have included roadshows in UTP, Colombia (hosted by Mauricio Alvarez) and Makerere University, Uganda (hosted by John Quinn). We have organised workshops targeted at Big Data and we are making our analysis approaches freely available. We have organised courses locally in Sheffield in programming targeted at biologists (taught by Marta Milo) and have begun a series of meetings on Data Science (speakers have included Fernando Perez, Fabian Pedregosa, Michael Betancourt and Mike Croucher). We have taught on the ML Summer School and at EBI Summer Schools focused on Computational Systems Biology. Almost all these activities have led to ongoing research collaborations, both for us and for other attendees. Open Data Science brings all these strands together, and it expands our remit to communicate using the latest tools to a wider cross section of clinicians and scientists. Driven by this agenda we will also expand our interaction with commercial partners, as collaborators, consultants and educators. We welcome other groups both in the UK and internationally in joining us in achieving these aims.

Facebook Buys into Machine Learning

Yesterday was an exciting day. First, at the closing of the conference, it was announced that I, along with the with my colleague Corinna Cortes (of Google), would be one of the Program Chairs of next year’s conference. This is a really great honour. Many of the key figures in machine learning have done this job before me. It will be a lot of work, in particular because Max Welling and Zoubin Ghahramani did such a great job this year.

Then,  in the evening, we had a series of industrially sponsored receptions, one from Microsoft, one from Amazon and one from Facebook. I didn’t manage to make the Microsoft reception but did attend those from Facebook and Amazon. The big news from Facebook was the announcement of a new AI research lab to be led by Yann Le Cun, a long time friend and colleague in machine learning. They’ve also recruited Rob Fergus, a rising star in computer vision and deep learning.

The event was really big news, presented to a selected audience  by my old friend and collaborator Joaquin Quinonero Candela.  Facebook had already recruited Marc’Aurelio Ranzato, so they are really committed to this area. Mark Zuckerberg was there to endorse the proceedings, but I was really impressed by the way he let Joaquin and Yann take centre stage. It was a very exciting evening.

I’d guessed a big announcement was coming so I climbed up onto the mezzanine level and took photos and recorded large parts from my mobile phone. They’d asked for an embargo until 8 am this morning, so I’m just posting about this now. I also cleared it with Facebook before posting, they asked that I remove the details of what Mark had to say,  principally because they wanted the main focus of this to be on Yann (which I think is absolutely right … well done Yann!).

A commitment of this kind is a great endorsement for the machine learning community. But there is a lot of work now to be done to fulfil the promise recognized. Today we (myself, James Hensman from Sheffield and  Joaquin and Tianshi from Facebook) are running a workshop on Probabilistic Models for Big Data.

Mark Zuckerberg will be attending the deep learning workshop. The methods that are going to presented at these workshops will hopefully (in the long term) deal with some of the big issues that will face us when taking on these challenges. In the Probabilistic Models workshop we’ve already heard great talks from David Blei (Princeton), Max Welling (Amsterdam), Zoubin Gharamani (Cambridge) as well as some really interesting poster spotlights. This afternoon we will hear from Yoram Singer (Google), Ralf Herbrich (Amazon) and Joaquin Quinonero Candela (Facebook). I think the directions laid out at the workshop will be addressing the challenges that face us in the coming years to fulfil the promise that Mark Zuckerberg and Facebook have seen in the field.

Very proud to be a program chair for next year’s event, wondering if we will be able to sustain the level of excitement we’ve had this year.

 

Update: I’ve written a piece in “the conversation” about the announcement here:

https://theconversation.com/are-you-an-expert-in-machine-learning-facebook-is-hiring-21439