OCLC’s planned search engine: The pros and cons of Web crapola filtering by librarians

Was All in the Family---the TV show where Archie Bunker raged against black people, Jews and other supposed inferiors---the first entertainment outlet to use the word "crapola"? And did that in turn pave the way for respectables to be able to utter "crap" in public? Damned if I know. I simply recall hearing a cable commentator say such things. If I put the above on the Web as absolute fact, then I'd be spreading, er, crapola---unverified information that may or may not be accurate, even if Google picks it up because of TeleRead's prominence in its niche.

Weighted toward sites popular with librarians

But now a planned search engine, from OCLC, Syracuse University and the University of Washington, will strive to limit search results to librarian-vetted links. As announced yesterday:

"Reference Extract is envisioned as a Web search experience similar to those provided by the world's most popular search engines. However, unlike other search engines, Reference Extract will be built for maximum credibility of search results by relying on the expertise of librarians. Users will enter a search term and receive results weighted toward sites most often used by librarians at institutions such as the Library of Congress, the University of Washington, the State Library of Maryland, and over 2,000 other libraries worldwide."

Yes, there are overlaps with TeleRead, my proposal for well-stocked national digital library systems in the U.S. and other countries. I love the idea of helping people find better information than what Google links often provide. Check out a 2004 TeleRead item, Free Hate site gets Rank #1 for word ‘Jew’ on Google—while Anne Frank’s Diary is verboten for free use. Some white hats organized a counter-linking campaign, but even now, the hate site Jew Watch claims the third- and fourth-highest results (no links provided here, thanks). A bigot like Archie Bunker would feel at home with such beauts as “Jewish Communist Rulers and Killers” and “Jewish Mind Control Mechanisms.” I can also think of quackish health cures unwittingly promoted by Google’s algorithms.

So we badly need search engines with more reliable crapola screening than the company’s PageRank system. Information literacy is a laudable goal, but K-12 students are hardly born with it. For adults in search of trustworthy information, moreover, especially on crucial matters such as health, the planned search engine could save time and maybe even lives.

The downsides

Still, librarians are hardly omniscient gods. When it comes to Internet and e-book matters, I can remember some appalling misinformation spread by librarians. How about a former ALA president’s bizarre take on bloggers, for example? As a group librarians are far more trustworthy than stockbrokers or politicians or, yes, the blog world at large, but even then I would exercise caution. Librarians are humans with their own sets of prejudices on various topics, including, yes, e-books, which some see as a threat to libraries’ existing business models.

Another downside is that so far, at least, the embryonic search engine project hardly has the money to match Google’s reach. The news release says “the planning phase of this project is funded through a $100,000 grant from the John D. and Catherine T. MacArthur Foundation.” A pittance, given the scope of the job ahead. Granted, more money will probably be on the way. OCLC is no minnow and “has provided computer-based cataloging, reference, resource sharing, eContent, preservation, library management and Web services to 60,000 libraries in 112 countries and territories.” But Google’s capitalization is north of $100 billion.

A positive just the same

Just the same, the planned site is a positive. Google could try to compete with its own librarian-oriented search service with heavy-duty crapola filtering, but it’s a profit-making company reliant on advertising, and this very fact might skew the search results. I say we really need a mix of the two approaches, and I love the idea of librarians entering the search game in a major way rather than simply lazing back and entrusting their fates so heavily to Google.

Related: An existing site, Librarians’ Internet Index, a portal with the tag line “Web sites you can trust.” It will be interesting to see how closely its recommended sites jibe with a list of those played up by Reference Extract. Will there be any cooperation between the two organizations? How much and in what forms? True, the crapola filters work in different ways (LII uses old-fashioned human selections, directly), but these sites are more or less in the same territory.

Reminders/disclosures: I own a tiny amount of Google stock as a long-term retirement investment and am not a librarian, although I’ve been writing on digital libraries in a TeleRead context since the early 1990s.

8 Comments on OCLC’s planned search engine: The pros and cons of Web crapola filtering by librarians

  1. I have to say I don’t understand this at all. Suppose I go to my university library…there are literally thousands of books in there that are filled with absolute nonsense. The public library down the street is even worse.

    Would a librarian stand up and say “we will not purchase any books for our collection that are not thoroughly vetted for factual accuracy by a special committee of librarians”? If not, how are web-based sources any different?

  2. The simple and existing solution is to use searching based on social bookmarking, i.e., del.icio.us or similar. Results for some topics are vastly superior to Google searches, since to be included, somebody had to consider the site worthy of bookmarking in the first place.

    Of course, not everybody using del.icio.us is a librarian, but they are often experts or active participants in the topics they bookmark (which is better IMHO).

  3. Fascinating post, David!

    Librarians aren’t perfect, but most of us* will tell you that you should never rely on just one source…or even two. And no two librarians approach a question the same way. So I think the more options that are available to us, the better chance we have of getting a satisfactory answer to our question.

    *I haven’t worked in a library in a long time, but I do have an MLS and have worked the reference desk in a variety of libraries.

  4. Garson O'Toole // November 9, 2008 at 6:09 pm //

    Guidance toward “credible” websites can be useful when it does not block access to the supposedly “non-credible” websites. Yet there is a more serious problem that is not addressed by filtering or re-ordering web search results that are currently available. There is a vast “deep web” or “hidden web” that is not effectively indexed. Some of the most valuable and important storehouses of information are locked up. I have mentioned JSTOR in the past as an example of a database that should be accessible to all but is restricted.

    Regarding “crapola,” here is some data from the open web. The word is used by Sammy Glick, the anti-hero or villain of the 1941 novel “What Makes Sammy Run?” by Budd Schulberg. The “corrosive Hollywood novel” was a “runaway best seller” according to a New York Times article from 1998. The work was turned into television programs in 1949 and 1959. It was even made into a Broadway musical in 1964 that was revived in 2006. However, I do not know if the term “crapola” was used in any of these productions.

    The high-profile novel arguably placed the word “crapola” into the acceptable entertainment lexicon for some readers and writers. “All in the Family” premiered on the CBS television network in 1971. Admittedly, a top-ranked television sitcom probably has a wider reach and a more visceral impact than a bestselling novel.

    How accurate is this data? Does it come from credible sources acceptable to OCLC, Syracuse University, and the University of Washington? Who knows? The existence of the word “crapola” in “What Makes Sammy Run?” can be checked with Google Book Search. The New York Times quotes are from an article titled “’41 Best Seller Is Back and Clawing” by Ralph Blumenthal dated August 11, 1998. The television and Broadway production dates are from Wikipedia.

    There are discrepancies. Blumenthal in the Times says “a television version of the book was broadcast in two parts by NBC in 1960.” Wikipedia says “On September 27 and October 4, 1959, on NBC Sunday Showcase, Larry Blyden starred as Sammy Glick in a two-part television broadcast on NBC-TV.” In this case I think Wikipedia is more accurate than the New York Times.

  5. Garson, Paula, Brian and RJH:

    G: Why, of course—Google Book Search, duh! Many thanks for the find. As it happens, What Makes Sammy Run? is among my favorite novels, but I didn’t recall seeing The Word there. By the way, GBS revealed a 1939 usage in a story by Whit Burnett in Story Magazine and apparently even an earlier example from Richard Brinsley Sheridan’s 1935 book Heavenly Hell (alas, the long URL somehow won’t paste in). And there may be older ones. But on the written page, the Schulberg use may have counted the most.

    Paula: Nice hearing from you. I remain a big fan of The Writing Show and was sorry to hear of the heart attack of commentator Jeff DeRego. Open heart surgery, too, just like me? Ouch! I thought I’d used up the current quota for those things among Net-lovin’ writers. But then I see Jeff beat me to it. If he does have a fund to help pay medical expenses—that’s my possibly incorrect recollection—please pass on the details for TeleBlog readers. I enjoyed his dissection of NaNoWriMo, by the way. Wish I’d seen it before I did mine; I’d have linked.

    Brian: I agree with PaulB. If the matter is crucial, you of course want more than one source—maybe many. Although zillions of books are full of crapola, especially obsolete information, the medium as a whole is probably more trustworthy than the Web. Lots of exceptions! But librarians are at least supposed to care about accuracy and the rest.

    RJH: Interesting idea, although I would like to know the credentials of the participants (just one factor!).

    Thanks,
    David

  6. Garson O'Toole // November 9, 2008 at 9:56 pm //

    Thanks for your response David. I also found the earlier search results that you cite via Google Book Search (GBS) but they seem to be problematic.

    Here is an opportunity to illustrate some of the ways in which GBS is not the comprehensive library tool that our society should have. It is a potent and flawed instrument in its current incarnation.

    The metadata supplied by GBS is sometimes frustratingly inaccurate. For example you mention that “crapola” had a “1939 usage in a story by Whit Burnett in Story Magazine”. However, Whit Burnett was the founder and editor of Story magazine, and this means that he probably did not write the piece that contains the word “crapola”. The snippet that Google displays does not even show the target word. This cropping problem is common with GBS snippets, and it makes verification more difficult.

    Many times the dates given by GBS for magazines and journals are inaccurate. Often GBS gives the founding date of a magazine and that is irrelevant when the precise issue date is desired. Sometimes GBS gives the start date of a volume that contains a multi-month or multi-year span of magazine issues. Story magazine was founded in 1931 according to Wikipedia, so the 1939 date might be accurate, but the absurdly restrictive snippet view does not allow one to see the title page. Also the contents page data is cropped with “14 other sections not shown”.

    The Richard Brinsley Sheridan reference cannot be double-checked at all because the message “Sorry, this page’s content is restricted” is displayed when further exploration is attempted. GBS finds the term “crapola” in numerous Italian language works of the 1800s and 1700s. I do not know what the word means in Italian. Seeing context is important because the word might appear in a non-English phrase with an alternative denotation.

    I decided to present the citation within “What Makes Sammy Run” because I knew that the novel had a high cultural impact. As you suggest “on the written page, the Schulberg use may have counted the most.” Of course it is possible that there is an earlier influential use.

  7. Excellent points, Garson! Your reference to Google’s inadequacies actually ties in with my latest plea for a well-stocked national digital library system. Google, as a private company, is more likely to take shortcuts. Not that Google is the only possible villain here. Maybe some publishers got in the way, too, in various cases. Still, if Burnett didn’t write the piece in question, Google is very possibly the one to blame. Here’s to entrusting our culture to a truly professional library system, not just the private sector alone!

    Thanks,
    David
    (thoroughly sold on the use of the Schulberg example)

  8. The recent Google settlement with publishers and the Authors Guild could improve Google Book Search (GBS) immensely in the U.S. The agreement, if approved in court, will allow Google to display up to 20 percent of in-copyright, out-of-print books according to the official Google blog.

    Searchers will be able to view the title page, the copyright page, the table of contents, and a substantial slice of context while adhering to the 20 percent cap. Hence verifying dates and ascertaining author identities will be easier.

    The current irritating flaws in GBS can be evaluated in the framework of its remarkable power and its ongoing evolution, I think. Regarding the propensity to “take shortcuts,” I think that private companies, public companies, non-profit companies, and governments of every variety sometimes take shortcuts. As your last sentence suggests, it is a poor idea to depend on one sector or institution alone.

    Ultimately people should be able to see full library e-books from home and should be able to download them. I think we agree on this. You have been admirably, presciently, and articulately enunciating this for many years.

