Review appears in The Times Higher Education Supplement, May 9, 2003, No. 1588, p26.

Give your computer's IQ a boost

Journal of Machine Learning Research
MIT Press.

The Journal of Machine Learning Research was founded in 2001, in its words "as an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning." Machine learning is a fascinating and important area of computer science: basically, can computers learn? It is both a theoretical and practical area, where computer programs are developed and run on data to see what they can learn. There are enormous and widespread practical applications for machine learning, from labour negotiation, medical treatment and agriculture to generally making computers more 'intelligent.' The particular slant of the journal will make it interesting for theoreticians, biologists and psychologists, who are interested in how animals and humans learn and what the theoretical limits to learning are. There are even applications in machine learning for countering terrorism. And as the World Wide Web fills up with vast amounts of unstructured information, we need all the help we can get to learn how to use it effectively. It says something about the huge relevance of machine learning that one managing editor of this journal works at Google and the other has published papers on financial markets.

Machine learning makes a difference, and makes a lot of money worldwide. Yet the JMLR has a free web site, and it only costs $75 ($101 outside the US) for an annual individual print subscription, a small fraction of the cost of subscribing to a conventional science journal. The journal runs like a collective, with MIT Press taking just the paper print rights, so costs are minimised. Turn-around time for authors is dramatically reduced. If somebody, say, in the third world wants to know anything up-to-date and rigorous about machine learning, this is the definitive place to reference. This is an excellent model for all journals to copy, especially in science. As one of the editors says: "What is the role of the scientist in academic publishing? Doing the publishing!"

The bulk of the journal's papers are devoted to discussing and evaluating learning methods. I was interested to see how ideas talked about in the journal actually worked, because that's really the whole point. So, as the journal is available on-line, I looked at every paper and then emailed the authors to ask them about their ideas. After a few weeks I had over 100 replies. I drafted this review, and then bounced it off the editorial board and the authors again. The enthusiasm of authors for their work was impressive; I had replies covering every paper published.

I asked whether the system described in each paper was available. Of course, some papers were theoretical; some replies said my question was irrelevant. Of the remainder, about a third specifically said their systems were unavailable. Their systems were private, commercially confidential, or incomplete in some way. Consider some quotes from replies I got: "Unfortunately, I do not have the system in a state where I can give it away right now" and "We don't have the data ready to be published". Further quotes are quite revealing about authors' attitudes. Somehow research, even stuff published in the journal, isn't considered public: "The system is a research prototype developed in my group, and is not appropriate for public dissemination" and "The implementations we had were very much 'research code', and not suitable for public consumption".

My informal survey suggests some authors have a relaxed regard for scientific virtues: reproducibility, testability, and availability of data, methods and programs -- the openness and attention to detail that support other researchers. It's a widespread problem in computer science generally. I'm guilty, too. We programmers tend not to keep the equivalent of lab books, and reconstructing what we have done is often unnecessarily hard. As I wrote elsewhere, there can be problems with publishing work that is not rigorously supported. It is the computer science equivalent of fudging experimental data -- whether this really matters for the progress of science is another question.

Then there is the problem of who owns the work. As one author put it: "We have not had the time to turn our experimental code into something other people can use (and anyway our employers wouldn't like to see things given away)." Certainly there needs to be a balance between science and protecting intellectual property; it's a big problem, as turning research ideas into code that really works might involve a company that then owns it. On the other hand, there is no reason why open source code cannot be made freely and immediately available, at least to the depth the ideas are discussed in the papers. And it is possible: look at sites like the GNU-licensed open source Weka machine learning project, which provides a framework in which people can give and take shared work. Many other sites have papers, code, demos and data too.

The Journal of Machine Learning Research does try to encourage authors to add electronic appendices with source code, data, demonstrations: anything, as the journal puts it, that will make life easier or more interesting for readers and researchers who follow in the authors' footsteps. Some authors do an excellent job, but spreading the good practice is an uphill struggle! Machine learning will change our uses of computers dramatically, so let's hope the journal achieves its goals with more and more success.

Harold Thimbleby is Director of UCLIC, the UCL Interaction Centre, and Gresham Professor of Geometry. He is a Royal Society-Wolfson Research Merit Award Holder.

This review (which was copied to the editorial board and authors) stimulated debate. I'm grateful for the wide range of feedback; the review considerably improved as a result. Some responded with their own horror stories, and some said that science progresses anyway...
Friday, May 9, 2003