Neither Green nor Gold
11 February 2013
Text mining, open data and open access publishing all hang together as the emerging future for research across a wide range of disciplines – the “Fourth Paradigm”.
At our recent “Facing Forwards” breakfast seminar, Sunil Vadera gave a fascinating overview of his research group’s work in data mining, and the extraordinary opportunities that are emerging from our ability to search and sort massive amounts of digital data extremely rapidly. Shortly after this, I was in London for the Westminster Forum’s conference, “Open Access Research and the Future for Academic Publishing”. And in my own discipline, the latest edition of World Archaeology is dedicated to Open Archaeology. Text mining, open data and open access publishing all hang together as the emerging future for research across a wide range of disciplines – the “Fourth Paradigm”.
The driver for these changes in how we work is, of course, the massive expansion of digital data, and our ability to store, search and gain rapid access to information on cheap, smart devices – almost anywhere. Last year, total global digital information is estimated to have been 2.7 zettabytes, a 48% increase from 2011. By 2020, it has been predicted that 35 zettabytes of new digital data will be created each year. One zettabyte is one billion terabytes (a one terabyte hard drive costs about £60 and weighs under 200 grams – unless you are doing something unusual, it will store everything that you could possibly read or write in your lifetime).
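For readers who want to check the scale of these figures, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, takes "last year" to mean 2012 from the date of this post, and the variable names are my own rather than anything from the sources cited.

```python
# Decimal (SI) storage units, expressed in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

# One zettabyte expressed in terabytes: one billion.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# Figures quoted in the text (assumed here to refer to 2012).
total_2012_zb = 2.7       # estimated global digital information
growth_on_2011 = 0.48     # a 48% increase from 2011

# Implied 2011 total, taking the 48% figure at face value.
implied_2011_zb = total_2012_zb / (1 + growth_on_2011)

# Number of one-terabyte drives that would be needed to hold the 2012 total.
drives_needed = total_2012_zb * terabytes_per_zettabyte

print(f"1 zettabyte = {terabytes_per_zettabyte:,} terabytes")
print(f"Implied 2011 total: about {implied_2011_zb:.2f} zettabytes")
print(f"One-terabyte drives for the 2012 total: about {drives_needed:,.0f}")
```

On these assumptions, the 2012 estimate corresponds to roughly 2.7 billion one-terabyte drives, and the implied 2011 total is a little over 1.8 zettabytes.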
When my generation trained as academics, we were taught to formulate a hypothesis and then to go out and hunt down the data to prove or disprove it. Now, we are swamped by immense quantities of digital information. Our challenge is to find what is relevant, and our opportunity lies in inducing meaning from the patterns and associations that take shape through massive and dynamic data sets.

For any research community today, there is a close and vital connection between working in this new digital world and academic publishing. Here, three things are critical. First, the copy-of-record of a peer-reviewed paper is a key anchor point, establishing secure stepping stones through the virtual world. Second, on-line publishing is increasingly requiring live links to the data sets on which arguments are based, allowing testing of conclusions and cumulative advances in interpretations. Third, text and data mining – crucial to building a semantic web of meaning – requires that all academic publishing be Open Access.
An Open Data/Open Access academic world is now essential to keep pace. Looking back again at the research methodologies of the 1970s and 1980s, it was a reasonable requirement of every PhD student that they submit a comprehensive overview of all relevant previously published material – the “literature review”, usually Chapter Two of the dissertation. Today, more than two new peer-reviewed papers in Medicine are published on-line every minute. Without some sort of automated, semantic search ability, recovering copies-of-record through Open Access, the task of assembling relevant literature is impossible: as Douglas Kell and his colleagues have put it, “each newly published paper is cast adrift and essentially lost at sea”.
Last year’s Royal Society report provided an excellent example of how the availability of Open Data and Open Access to research results is a significant public benefit. In May 2011 there was an outbreak of a severe gastro-intestinal infection in Hamburg, rapidly resulting in 4000 cases across Europe and the US and 50 deaths. All those affected tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. This strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This was followed up by bioinformaticians on four continents, and the genome was assembled, and openly available, within 24 hours. Within a further week, more than twenty reports had been posted on a dedicated Open Access site, and these analyses led to the discovery of the bacterium’s virulence and resistance attributes, enabling appropriate antibiotics to be identified. By July – two months after the initial outbreak in Hamburg – copy-of-record papers had been published based on this work.

Open Access publishing is itself a complex, and currently controversial, issue. The “Green” versus “Gold” debate, though, is misleading. The imperative is to get to a point where all the costs of publishing, whether negligible or requiring developed mechanisms for meeting Article Processing Charges (APCs), are fully met up front so that copies-of-record can be made freely available under arrangements such as the Creative Commons CC-BY-NC licence. This was our key argument in the Finch Group report, and the case has been remade in a recent – excellent – posting by Stuart Shieber, Harvard’s Director of the Office of Scholarly Communication.
The Royal Society – which has been close to scholarly publishing since the seventeenth century – puts it this way: “if search engines represent a first generation of tools to sort networked knowledge, semantic analysis represents a second generation which not just identifies lists of documents (or databases) but also the relationships between them. This is an exciting prospect not just because the ability to mash data together can produce new knowledge … but also promises as yet unpredictable changes to scholarly practice”.
—————————————————————————————————————————————-
Attwood, Teresa, Douglas Kell, Philip McDermott, James Marsh, Steve Pettifer and David Thorne. Review article: Calling International Rescue: knowledge lost in literature and data landslide! Biochem. J. 424, 317–333. 2009.
Martin Hall, “Neither Green nor Gold”. Open Access Research and the Future for Academic Publishing. Westminster Higher Education Forum, February 2013. Available at www.salford.ac.uk/vc
Stuart Shieber, “Why Open Access is Better for Scholarly Societies”. Jan 29 2013: http://blogs.law.harvard.edu/pamphlet/2013/01/29/why-open-access-is-better-for-scholarly-societies/
Open Archaeology. World Archaeology, 44:4, 2012.
Microsoft. The Fourth Paradigm in Practice. Science@Microsoft, 2012.
Royal Society. Science as an Open Enterprise. The Royal Society Science Policy Centre report 02/12. 2012.

February 11th, 2013 at 9.03 pm
MARTIN HALL: “The “Green” versus “Gold” debate, though, is misleading. The imperative is to get to a point where all the costs of publishing, whether negligible or requiring developed mechanisms for meeting Article Processing Charges (APCs), are fully met up front so that copies-of-record can be made freely available under arrangements such as the Creative Commons CC-BY-NC licence. This was our key argument in the Finch Group report, and the case has been remade in a recent – excellent – posting by Stuart Shieber, Harvard’s Director of the Office of Scholarly Communication.”
STUART SHIEBER: “Do you have a pointer to something saying that I support the Finch approach? If so, I’m happy to answer it directly — in the negative when it comes to both their lack of support for green and poorly designed approach to gold support.” (Feb 3 2013, personal communication.)
February 12th, 2013 at 11.59 am
Stevan – here are two quotations from Stuart Shieber’s paper which make the point about the significance of moving to full Open Access to copy-of-record. The Finch Report, however imperfect, was about the transition to this. “Open-access journals don’t charge for access, but that doesn’t mean they eschew revenue entirely. Open-access journals are just selling a different good, and therefore participating in a different market. Instead of selling access to readers (or the readers’ proxy, the libraries), they sell publisher services to the authors (or to the authors’ proxy, their research funders). In fact there are now over 8,500 open-access journals listed in the Directory of Open Access Journals. Some of them have been mentioned already on this panel: Linguistic Discovery, Semantics and Pragmatics. The majority of existing open-access journals, like those journals, don’t charge authorside article-processing charges (APCs). But in the end APCs seems to me the most reasonable, reliable, scalable, and efficient revenue mechanism for open-access journals. This move from reader-side subscription fees to author-side APCs has dramatic ramifications for the structure of the market that the publisher participates in”. And later: “So journals compete for authors in a way they don’t for readers, and this competition leads to much greater efficiency. Open-access publishers are highly motivated to provide better services at lower price to compete for authors’ article submissions. We actually see evidence of this competition on both price and quality happening in the market.”