March 28th, 2010

Serial Series, Part 2

By Rob Giampietro

In 2002 Stanford University launched a “community reading project” called Discovering Dickens, making Dickens’s novel Great Expectations available in its original part-issue format and asking Stanford alumni and other members of the Stanford community to read along, exactly as Victorians first did, with the serial version that appeared from December 1860 to August 1861. In 2004, as Discovering Dickens readers were enjoying A Tale of Two Cities, Stanford joined the newly formed Google Print Library Project, along with the University of Michigan, Harvard, Oxford, and the New York Public Library. A year later, the program would become known as the Google Books Partner Program, or, more simply, Google Books.

At the launch of Google Books, Google’s intent was to scan and make available 15 million books within ten years. By 2008, just four years into the project, 7 million books had already been scanned. When books are scanned, their words are automatically converted by Google’s optical character recognition (OCR) software into searchable text. Occasionally this conversion fails: the OCR software either cannot recognize some text or, after checking its results against standard spelling and grammar rules, is not confident in its conversion. The only way to convert these wayward words and phrases is to introduce human eyes into the system. This September, Google did just that with the purchase of reCAPTCHA.

ReCAPTCHA was invented by Luis von Ahn, who also helped invent the CAPTCHA, a test that can tell whether a user is a human or a computer. CAPTCHAs are effective at blocking spam, verifying accounts, and a variety of other online tasks. Von Ahn’s original CAPTCHA presented a randomized set of letters warped in such a way that a computer could not read them, though humans easily could.

A few years ago, von Ahn started thinking about the time people were wasting filling out CAPTCHAs. It bothered him. About 200 million CAPTCHAs are solved every day, and each takes about ten seconds to solve, so collectively people spend more than 500,000 hours a day solving CAPTCHAs. What if this time could be harnessed for the global good? Von Ahn found a way: instead of random letters, his new system, reCAPTCHA, presents users with two English words, one known and the other unknown. The unknown words are pulled randomly from a pool of scanned words that OCR cannot convert.
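The back-of-the-envelope arithmetic above can be reproduced directly; the two figures are the rough estimates quoted in the paragraph, not measured data.

```python
# Rough estimates quoted above (assumptions, not measurements):
SOLVES_PER_DAY = 200_000_000   # CAPTCHAs solved worldwide each day
SECONDS_PER_SOLVE = 10         # approximate time to solve one CAPTCHA

# Total human effort per day, converted from seconds to hours.
hours_per_day = SOLVES_PER_DAY * SECONDS_PER_SOLVE / 3600
print(f"{hours_per_day:,.0f} hours of human effort per day")
# prints roughly 555,556 — comfortably more than half a million hours a day
```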

Users solving reCAPTCHAs require the same amount of time as before—ten seconds—to recognize and type these two words. But now, every test produces a human user confirmation and the digitization of an unknown word. ReCAPTCHA digitizes 45 million words a day, or about 4 million books a year. In addition to the words reCAPTCHA digitizes for Google Books, reCAPTCHA’s other significant source of unknown words comes from the archive of the New York Times.
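The two-word scheme can be sketched in a few lines. This is an illustration only, not reCAPTCHA’s actual implementation: the function name, the vote tally, and the threshold of three agreeing answers are all assumptions made here for clarity.

```python
from collections import Counter

def check_recaptcha(known_word, typed_known, typed_unknown, votes):
    """Sketch of the two-word scheme: the known word verifies the user is
    human; the unknown word's answer is collected as one vote toward its
    digitization. `votes` tallies prior human answers for this word."""
    if typed_known.strip().lower() != known_word.lower():
        return False, None  # failed the human test; discard both answers
    votes[typed_unknown.strip().lower()] += 1
    # Illustrative rule: accept a transcription once three humans agree.
    word, count = votes.most_common(1)[0]
    accepted = word if count >= 3 else None
    return True, accepted

votes = Counter()
for answer in ["serial", "serial", "serial"]:
    is_human, accepted = check_recaptcha("dickens", "dickens", answer, votes)
# accepted == "serial" once three answers agree
```

Each call thus does double duty, exactly as the paragraph above describes: one comparison confirms a human, and one tally moves an unreadable scanned word toward a trusted transcription.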

The case of reCAPTCHA once again underscores the fact that text takes time. Even the seemingly insignificant act of parroting back some random letters or words occupies us, collectively, for hundreds of thousands of hours every day. But while the typical production of text is made by one or a few writers producing words serially in sentences one after another, reCAPTCHA has its millions of users producing text randomly, separating words from their proper context and syntax and presenting them to us based on their ambiguous form and unlikely transcription instead. Rather than invention, reCAPTCHA’s method is algorithmic. And rather than originality, reCAPTCHA’s reason for generating words boils down to one thing: verification.

Verification is also central to the snarl of issues surrounding the legality of the Google Books project more generally. Many works it has scanned, like Dickens’s writings, were already free of copyright and in the public domain long before the project started. (Mark Twain’s The Adventures of Huckleberry Finn, which entered the public domain in 1942, was first published in 1884. Dickens died in 1870.) However, many of the works Google Books has scanned are still under copyright, and Google has scanned them anyway in an attempt to make them more accessible—similar to a “card catalog,” according to Google. Authors’ and publishers’ rights groups have objected and sued Google to stop it from scanning works under active copyright. For another large segment of the books Google has scanned, the copyright status is simply unknown. So-called “orphan” works, under copyright but now out of print, are those works for which, after a “reasonable effort” has been made to locate a current copyright holder, no such person can be found. On one hand, Google must attempt to verify whether or not a current copyright holder exists.

On the other, it must verify to the court that it has been exhaustive in conducting its search in order to make the book available to users of Google Books. And this two-part effort has led to what the New York Times described earlier this year as “A Google Search of a Distinctly Retro Kind.” The article continues,

Since the copyright holders can be anywhere and not necessarily online—given how many books are old or out of print—it became obvious that what was needed was a huge push in that relic of the pre-Internet age: print. So while there is a large direct-mail effort, a dedicated Web site about the settlement in 36 languages and an online strategy of the kind you would expect from Google, the bulk of the legal notice spending—about $7 million of a total of $8 million—is going to newspapers, magazines, even poetry journals, with at least one ad in each country. These efforts make this among the largest print legal-notice campaigns in history. That Google is in the position of paying for so many print ads “is hilarious—it is the ultimate irony,” said Robert Klonoff, dean of Lewis & Clark Law School in Portland, Ore.

Klonoff’s comment is apt. In its attempt to digitize all the world’s books, Google has not only been forced to search for what it cannot find, but the company, which made its billions by serving relevant advertisements to users of its search engine, must now spend millions placing similar ads in tiny publications that its Google Books service (and the scanning of books more generally) may ultimately render obsolete.

For readers of Discovering Dickens, Google’s hundreds of little text advertisements may seem reminiscent of the ads scattered throughout the original part-issues of Dickens’s serial works, each of which included 16 pages of advertising flanking 32 pages of original text. The benefits of the “Invisible Spine Supporter” and “Dr. Locock’s Female Wafers” were proclaimed alongside entreaties urging buyers to purchase “Alpaca Umbrellas” and “Children’s Frock Coats and Pelisses.” It was a bazaar inside of Bleak House, a marketplace within Martin Chuzzlewit. For Dickens’s publishers, his text provided a perfect vehicle for additional advertising revenue. But, with the aid of the recently developed idea of copyright, Dickens’s text would soon become a commodity of its very own.

Serial Series is a six-part meditation on the production of text from the text’s point of view. It was written serially and published serially during the three-week run of Dexter Sinister’s The First/Last Newspaper, a project for Performa 09. The second broadsheet, including this piece (titled TIME CAPTCHA’D FOR GLOBAL GOOD?), can be downloaded here.