The future of facts (and the rise of fact servers)

The Wikipedia had to freeze the George W. Bush entry a few weeks ago because people were altering it to suit their political viewpoints at an alarming rate. So, the editors pared the page down to the non-controversial "core" of facts. There was still a lot of information there — much more than merely "He was born, he drank, he became president" — and occasional acknowledgements of controversies, such as whether Bush satisfactorily completed his National Guard service.

But, most interesting to me, towards the top, on the right, the Wikipedia ran one of the staples of its biographical entries: A fact box.

[Image: Bush fact box from Wikipedia]

I find this two-tiered view of facts, quite common in reference works, fascinating. And in the context of a bottom-up work such as the Wikipedia, in the midst of a dust-up over what constitutes a factual account of the life of W, you have to ask: What's happening to facts?

I don't like facts and I never have. Psychologically, metaphysically and sociologically, I'm uncomfortable in their stern, disapproving, Cheney-like presence.

Psychologically, I freeze when I have to recite one. They are, for me, simply opportunities to be wrong in public. My hesitation is noticeable, leading people to think I must be struggling to make up the fact, which actually is frequently the case. That's why JOHO has been 100% fact free since its inception. That's my pledge to you.

I also have a metaphysical problem with facts. Of course I understand that there's a real world that existed before I was born and into which I will be buried (or smudged, depending on the cause of my demise). But facts aren't the same thing as reality. They are one way reality — the way the world is apart from our awareness of it — shows itself to us. Without us, the universe would carry on fine, but facts wouldn't emerge from the darkness. Because experience is cultural, facts are cultural artifacts: They're expressed in language, they have a grammar, they are deeply contextual. Facts don't like us saying that, but it's true: "The Titanic sank in 1912" is only a fact because of a context that implicitly includes an understanding of how names stand for things, a decision to mark time by trips around the sun, a convention that numbers years from the birth of a guy I don't care much about, and a historical-cultural context that says that the sinking of a large ship is worth making an explicit proposition about.

Now, you probably snort at that line of thought because you think I'm running from the pure, brutal "Look, it happened!" that facts express. But I'm not. It was sad when the great ship went down (down to the bottom of the...), and it happened on a date we agree on. But facts are not context-free meteors that slam into our planet unbidden. They are instead a way of conjuring up the world in one of its infinite facets. They are a way of speaking, a form of rhetoric, and thus should not be treated as if they are the end-all of thought and discussion. But, sociologically, that's often how they're used: They are the knuckle sandwich of rhetoric. Facts are, of course, peculiarly important, but they are not the only peculiar and important things we say to one another. And they are not quite as reality-based, muscular and manly as they pretend. Inside every fact is a value struggling to get out.

I Love Facts

To forestall rants about how I don't believe in facts and think that, for example, the date the Titanic went down is subject to debate, let me state for the record: The Titanic sank on April 15, 1912. We should reject any explanation of facts that lets someone claim that the date of its sinking is up for grabs, relative or unknowable. Facts are crucial in disciplines I care a lot about, including science and journalism. Nevertheless, facts are a form of understanding and a form of rhetoric, and thus they are always infected with slimy humanity.

So, when the Web started heating up the Internet, I was among those who thought that we were going to see a merging of voice and facts, and, more particularly, voice and objectivity. (Objectivity is the mood in which we get all factual.) To a greater extent than I'd hoped, that's happening: Just read your 50 favorite blogs. Many Big-Time Journalists go to absurd lengths to hide their political sympathies — one editor boasts he doesn't even vote — but it's reversed on blogs: If we don't know who you're voting for, how can we trust what you write?

And yet... There are classes of facts I don't want wrapped in voice. If I post a question about the battery life of a laptop, I'll trust the people who write in response more than I trust the computer company's site, but I trust the company site more for the dimensions of the machine. The company is liable for its answer in a way that a random blogger isn't; if I have to buy a new carrying case because the number was wrong, the blogger can say, "Sorry, dude, I misread the measuring tape," whereas I'll expect the company to compensate me one way or another.

Similarly, I count on mainstream newspapers to provide fact-based stories that "cover" an event: I don't expect in the foreseeable future to be counting on webloggers to tell me how many troops attacked Samara, how this was coordinated with other simultaneous battles, or how many civilians were killed. Of course I expect bloggers to fact check the media's ass but good, which implies that I don't have full confidence in the media's ability to deliver the facts. (PS: there's no such thing as "the" facts because which facts are relevant is not itself a matter of fact.) But covering events seems to require the type of centralization that only a news bureau can provide. (Hint: Any sentence of mine that is of the form "only a _____ can provide" is likely to turn false particularly quickly.) Further, news organizations stand behind their stories in a way that someone talking over the virtual back fence doesn't have to. (Of course, sometimes the news media stand behind their stories Rather longer than they should.)

The role of facts in discourse may look immutable, but it is exactly the sort of thing that can change; I've been reading Foucault recently and it's startling how such deep structures can transform rapidly. (It's also startling how unbelievably brilliant Foucault was.) I don't know what will happen, but my hunch is that we are heading towards commoditizing facts, driving down their value so that they no longer differentiate one source from another. For example, take the table of Bush facts at the Wikipedia. With the right API, the Wikipedia could become a Fact Server that delivers the undisputed facts about any of its 1,000,000+ topics to any application that asks politely, making facts cheaper than popcorn.

Now, it would be irresponsible for a fact server to serve up dubious or putative facts, but if it only serves the commoditized facts, it won't have all that much value. So, perhaps fact servers will deliver facts along with metadata about how reliable the facts are: It's 0.99 certain that Bush was born in 1946 but it's 0.4 that he completed his National Guard duty. Will this sharpen the line between the two tiers of facts (the reliability of lower-class facts will always be subject to argument while 0.99s are beyond serious dispute) or will it tar all facts with the welcome brush of human fallibility?
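To make the idea of facts-plus-reliability a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the Fact class, the serve and tier functions, and the 0.99 cutoff are hypothetical, and nothing here describes an API the Wikipedia actually offers. The two confidence numbers are the ones used in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    claim: str
    confidence: float  # 0.0 to 1.0; who assigns this number is left open


# Hypothetical cutoff for "beyond serious dispute".
UNDISPUTED = 0.99

# The two example facts and their confidences come from the essay.
FACTS = [
    Fact("George W. Bush", "born in 1946", 0.99),
    Fact("George W. Bush", "completed National Guard duty", 0.4),
]


def serve(subject, min_confidence=0.0):
    """Return every stored fact about subject at or above min_confidence."""
    return [
        f for f in FACTS
        if f.subject == subject and f.confidence >= min_confidence
    ]


def tier(fact):
    """Label a fact as belonging to the upper or lower tier."""
    return "undisputed" if fact.confidence >= UNDISPUTED else "disputed"


if __name__ == "__main__":
    for fact in serve("George W. Bush"):
        print(fact.claim, fact.confidence, tier(fact))
```

An app that wants only the commoditized core would call serve("George W. Bush", min_confidence=0.99); one that wants the arguments too would leave the threshold at zero and show the number next to each claim.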

There are bunches of other questions, many of which take on an Hegelian cast. For example, the Wikipedia fact box gives Bush's date of birth but not his race. That's because our culture does not count race as relevant (haha!), and, no, you can't always tell from the photo. The Wikipedia fact box also does not state who W's parents are, yet in some cultures knowing your parentage is as important as knowing the year you were born. But, if Wikipedia acts as a fact server, it won't have to decide which 0.99 facts to include in the fact box. It will simply serve up all facts the requesting app wants. Thus, Bush's date of birth, race and parentage will show up as equal; if your culture values parentage, your app will make a big deal of that. If some other culture considers listing the date of birth to be a type of ageism, its apps will ignore that datum. Undoubtedly, some app will find intense value in the 0.99 fact that Bush is white. So, the commoditization of facts may result in the formation of cultural fact boxes that divide us on the basis of a consensus core of 0.99s that we all agree on: Cultures united in a core of commoditized facts from which they select the fact boxes that divide us. Weird. Or is it the way the world has always implicitly worked?
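Here is a rough sketch of how the choice of fact box could move from the server to the requesting app. The store, the field names and the serve_facts function are all hypothetical; the point is only that the server holds every high-confidence fact without ranking them, and each app asks for the ones its culture cares about.

```python
# A hypothetical store of high-confidence facts, keyed by topic.
# The server keeps all of them and ranks none of them.
FACT_STORE = {
    "George W. Bush": {
        "date_of_birth": "1946",
        "race": "white",
        "parents": "George H. W. Bush and Barbara Bush",
    },
}


def serve_facts(topic, wanted):
    """Return only the facts the requesting app asked for, in its order."""
    facts = FACT_STORE.get(topic, {})
    return {name: facts[name] for name in wanted if name in facts}


# An app from a culture that puts parentage first.
parentage_app = serve_facts("George W. Bush", ["parents", "date_of_birth"])

# An app from a culture that treats listing a birth date as ageism.
ageism_averse_app = serve_facts("George W. Bush", ["parents", "race"])

if __name__ == "__main__":
    print(parentage_app)
    print(ageism_averse_app)
```

Both apps draw on the same core; the fact boxes they build from it are what differ.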

The delivery of facts with probabilities as part of them could lead to unpredictable consequences. Building doubt into facts could transform their rhetorical and social role. Will we recognize facts as being as perpetually subject to argument as are opinions? Will their source of authority become an integral part of them, as opposed to being an outside reference? Will the recognition that they're socially conditioned degrade them so that all facts are equal, no matter how contradictory or stupid — appending a huge "Whatever!" to all factual discussions? Are we heading towards a more sophisticated, nuanced way of thinking that will put facts in their place, or towards a new age of stupidity and obstinacy? And in the new world of facts, what will be the sound of voices conversing and voices testifying?

I believe we are currently inventing a new and important life for facts. We just don't yet know what it will be.

The end of data?

Here's an idea for the book I am perpetually working on working on. (No, that's not a typo. I've been working for over a year on a proposal that would enable me to work on the book.)

There used to be a difference between data and metadata. Data was the suitcase and metadata was the name tag on it. Data was the folder and metadata was its label. Data was the contents of the book and metadata was the Dewey Decimal number on its spine. But, in the Third Age of Order (see the previous issue), everything is becoming metadata.

For example, imagine you're at a large corporation doing a Third Order treatment of its digital library of research articles. Instead of (or, in addition to) designing a large, complex, hierarchical taxonomy, you focus on adding enough metadata to each article so that people will be able to sort and classify them any which way they want. If someone wants to find all the articles that talk about hydrocarbons written in Italian in 1965 and that have more than 30 footnotes, they'll be able to. If someone wants to make a browsable hierarchy based not on topic but on gender or on the number of co-authors, they'll be able to. You build enriched objects first so your users can forever after taxonomize the way they want to, instead of the way you think they'll want to.
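A small sketch of what "enriched objects first" might look like in Python. The Article class, its fields and the sample records are invented for illustration; a real library would have far more metadata and a real query engine. The idea is that no hierarchy is built in advance: any question or any browsable grouping is computed on demand from the metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    topics: set
    language: str
    year: int
    footnotes: int
    authors: list


# Made-up sample records, for illustration only.
ARTICLES = [
    Article("Sample A", {"hydrocarbons"}, "Italian", 1965, 42, ["Author One"]),
    Article("Sample B", {"hydrocarbons"}, "Italian", 1965, 12,
            ["Author Two", "Author Three"]),
    Article("Sample C", {"polymers"}, "English", 1971, 35,
            ["Author Four", "Author Five"]),
]


def select(articles, predicate):
    """Return the articles for which predicate(article) is true."""
    return [a for a in articles if predicate(a)]


def browse_by(articles, key):
    """Group article titles under whatever key the user chooses."""
    groups = defaultdict(list)
    for a in articles:
        groups[key(a)].append(a.title)
    return dict(groups)


# Hydrocarbons, in Italian, from 1965, with more than 30 footnotes.
hits = select(
    ARTICLES,
    lambda a: "hydrocarbons" in a.topics
    and a.language == "Italian"
    and a.year == 1965
    and a.footnotes > 30,
)

# A browsable grouping by number of co-authors rather than by topic.
by_author_count = browse_by(ARTICLES, lambda a: len(a.authors))

if __name__ == "__main__":
    print([a.title for a in hits])
    print(by_author_count)
```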

Now take a closer look at these information objects. They look like contents tagged with lots of metadata, but in fact they're all metadata. If I'm looking for an article about hydrocarbons written by Barbara Rodriguez, then the article's topic ("hydrocarbons") and author's name ("Rodriguez, Barbara") are metadata, and the content is the data. But, I could just as well be trying to remember the name of the author who wrote an article that included the phrase "Hydrocarbons are the burros of the cosmos" sometime in the 1960s, in which case the content and date are metadata and the author's name is the data. What's data and what's metadata depends on the person doing the asking.
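The same point as a sketch in Python: one record, and a single hypothetical ask function in which whatever you already know plays the role of metadata and whatever you want back plays the role of data. The record's exact year is an assumption (the example above says only "sometime in the 1960s"), and the function and field names are made up.

```python
# One record. Nothing in it is marked as "data" or "metadata".
LIBRARY = [
    {
        "author": "Rodriguez, Barbara",
        "topic": "hydrocarbons",
        "year": 1965,  # assumed; the essay says only "sometime in the 1960s"
        "content": "... Hydrocarbons are the burros of the cosmos ...",
    },
]


def ask(library, wanted, **known):
    """Use the known fields as metadata; return the wanted field as data.

    Each keyword argument maps a field name to a test on that field's value.
    """
    return [
        record[wanted]
        for record in library
        if all(test(record[field]) for field, test in known.items())
    ]


# Topic and author are the metadata; the content is the data.
contents = ask(
    LIBRARY,
    "content",
    topic=lambda t: t == "hydrocarbons",
    author=lambda a: a.startswith("Rodriguez"),
)

# Content and date are the metadata; the author is the data.
authors = ask(
    LIBRARY,
    "author",
    content=lambda c: "burros of the cosmos" in c,
    year=lambda y: 1960 <= y <= 1969,
)

if __name__ == "__main__":
    print(contents)
    print(authors)
```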

So, in the Third Age of Order, all data is metadata. Contents are labels. Data is all surface and no insides. It's all handles and no suitcase. It's a folder whose content is just another label. It's all sticker and no bumper.

Why does this matter? It changes the primary job of information architects. It makes stores of information more useful to users. It enables research that otherwise would be difficult, thus making our culture smarter overall. But, most interestingly (at least to me), this does the ol' Einsteinian reverse flip to Aristotle. Aristotle assumed that of the 10 categories by which one could understand a thing, one must be primary: Where that thing fits into the tree of knowledge. So, you could say that Alcibiades is made of flesh or lived in Greece, but if you really want to understand him, you have to say that he is an animal of a particular kind. But, now that everything is metadata, no particular way of understanding something is any more inherently valuable than any other; it all depends on what you're trying to do. The old frameworks of knowledge and authority are getting a pretty good shake.

Right? Wrong? Old? Obvious? Pointless? Stop me before I make a fool of myself to someone not as nice as you...

My friend Robert Morris, who teaches computer science at U. Mass Boston and who has always been unnecessarily generous to me with what he knows, says that the above is pretty much old news:

The short answer is that in the business, nobody anymore contends there is a difference between data and metadata other than in a context such as you mention, namely the metadata is usually that part which helps you locate and use the other part and which you can often ignore if you already know those things.

Bob points to Life Science IDs (LSIDs) as an example of a standard that does sort of distinguish data from metadata.

An LSID is an immutable, permanent, globally unique key to a piece of information. The LSID spec requires that getData always return the same bytes for the entire future of the universe, whereas getMetadata may return things about the information that could change.
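As a way of picturing that contract, here is a toy sketch in Python. It is not the LSID resolver interface: the class, the Pythonized method names get_data and get_metadata, the update_metadata helper and the example identifier are all assumptions for illustration. It only shows the split described above: the bytes are fixed when the record is created, while the metadata can change afterwards.

```python
class LsidRecordSketch:
    """Toy model of the contract: fixed bytes, changeable metadata."""

    def __init__(self, lsid, data, metadata):
        self._lsid = lsid
        self._data = bytes(data)          # immutable copy, never reassigned
        self._metadata = dict(metadata)   # private copy that may change later

    @property
    def lsid(self):
        return self._lsid

    def get_data(self):
        """Always return the same bytes for this identifier."""
        return self._data

    def get_metadata(self):
        """Return a copy of the current metadata, which may differ over time."""
        return dict(self._metadata)

    def update_metadata(self, **changes):
        """Hypothetical helper: revise what is said about the data."""
        self._metadata.update(changes)


if __name__ == "__main__":
    # The identifier below is made up for illustration.
    record = LsidRecordSketch(
        "urn:lsid:example.org:specimens:12345",
        b"ACGT",
        {"label": "first draft"},
    )
    before = record.get_data()
    record.update_metadata(label="revised", curator="someone")
    print(record.get_data() == before)
    print(record.get_metadata())
```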

LSIDs are being supported by the Interoperable Informatics Infrastructure Consortium (I3C). An LSID server sits in front of your database or application so you can continue to use your existing infrastructure.

Sounds like the architecture for a life sciences fact server...