When Organizing Big Data Content, It's Okay To Be Messy
I recently listened to an O'Reilly programming videocast interview with Travis Lowdermilk about his new book, User-Centered Design: A Developer's Guide to Building User-Friendly Applications. During the videocast, Travis notes that according to Jakob Nielsen, you need only 5 users to identify most of the problems with a UI design.
Lowdermilk acknowledges that the idea has been challenged by Jared Spool and others (see When and Why 5 Test Users Isn't Enough). And to be fair, Nielsen's theory is more nuanced. Nielsen says if you have 15 research participants, it's best to divide them into 3 smaller studies of 5 each and iterate your design after each small group, thereby maximizing the value of your research.
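For readers curious where the "5 users" figure comes from, here is a minimal sketch of the formula Nielsen uses in his own write-up of the idea: the share of problems found by n testers is 1 - (1 - L)^n, where L is the share a single tester uncovers. The default of 0.31 is the average Nielsen reports across the projects he studied; treat it as an assumption, since your own product may differ. The function name is mine, not his.

```python
def share_of_problems_found(n_users: int, per_user_rate: float = 0.31) -> float:
    """Expected share of usability problems found by n_users testers.

    per_user_rate is the share of problems one tester uncovers on average
    (0.31 is the figure Nielsen reports; an assumption for any other product).
    """
    return 1 - (1 - per_user_rate) ** n_users


if __name__ == "__main__":
    # Compare a few study sizes, including the 5 and 15 mentioned above.
    for n in (1, 3, 5, 15):
        print(f"{n:>2} users: {share_of_problems_found(n):.0%}")
```

The curve flattens quickly after 5 users, which is why Nielsen suggests spending 15 participants on three rounds of 5 rather than on one large round.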
But the exact sample size isn't my point here. I assume the UX or HCI proponents latched onto the 5-users-only theory as a means of simplifying user research. Rather than asking developers to orchestrate a massive user research study involving hundreds of users and then quantifying scores of data points, and then tabulating the responses in some meaningful and intelligent way to arrive at a conclusion, you just have to gather 5 users.
My point, rather, is that people use samples rather than wholes because the full data is too massive, unwieldy, and difficult to gather and manage. However, with the trend toward big data, the convention of working with samples rather than the full spectrum of data may become a tradition we leave behind.
Enter Big Data
I've recently been listening to Big Data: A Revolution That Will Transform How We Live, Work, and Think, which is quite an interesting book. The authors, Viktor Mayer-Schonberger and Kenneth Cukier, explain some of the difficulties of collecting information from whole groups. For example, they relate the challenges in census taking and in collecting information about flu epidemics.
Apparently it took more than a decade to collect information for the 1880 census. By the time the data was gathered, the results were already out of date. Similarly, during the H1N1 outbreak, the Centers for Disease Control tried to take and process incoming reports to track the spread of the outbreak, but the reports were slow. Google, by contrast, found that it could predict the path of the outbreak in near real-time by correlating millions of search queries about H1N1 with location (see Google Flu Trends).
Mayer-Schonberger and Cukier assert that we aren't restricted to using samples (and hoping they're representative) in order to analyze information. With millions of people liking posts on Facebook, millions more uploading and tagging videos on YouTube, millions posting tweets about what they think or what's happening, and the incredible processing power of computers, it's possible to grab all this data and analyze it for insights.
You don't have to capture a 5% sample and hope that it's representative enough to predict the whole. Mayer-Schonberger and Cukier don't make any mention of Nielsen's 5-users-only theory about UX research, but I'm guessing that scenarios that depend on massive extrapolation from an extremely small sample will become obsolete in favor of big data crunching scenarios.
How can you tell if your prototype works well or not? Just as big data might listen for engine hum vibrations and other noise patterns to predict end-of-life factors for engines, web researchers might capture raw keystrokes and other usability patterns from eye-tracking movements to mouse clicks and navigation patterns (most of which are captured in the browser) and then correlate this massive amount of information to evaluate different prototypes.
In other words, the 102,000 eye twitches, 55,000 keystrokes, 3,500 mouse clicks, 27,000 page loads, 3,500 browser crashes, 3,000 page freezes, 45,000 seconds on each page, and 133,000 different paths through the site collected from 100 users might tell you something different from a couple of users saying, "Gee, I'm not sure if I like this screen or not" and "I like the color of this button."
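To make the idea a little more concrete, here is a rough sketch of how raw usage events might be tallied per prototype before any deeper analysis. Everything in it is hypothetical: the event log format, the field names "prototype" and "type", and the event names are made up for illustration, and a real study would pull events from whatever analytics tooling you actually use.

```python
from collections import Counter


def tally_events(event_log):
    """Count events per (prototype, event_type) pair.

    event_log is an iterable of dicts with hypothetical keys
    "prototype" and "type".
    """
    return Counter((event["prototype"], event["type"]) for event in event_log)


# A tiny made-up log; a real one would hold thousands of rows.
event_log = [
    {"prototype": "A", "type": "click"},
    {"prototype": "A", "type": "page_freeze"},
    {"prototype": "B", "type": "click"},
    {"prototype": "B", "type": "click"},
]

counts = tally_events(event_log)

if __name__ == "__main__":
    for (prototype, event_type), total in sorted(counts.items()):
        print(prototype, event_type, total)
```

Counts like these are only a starting point, but comparing freezes, crashes, and click totals across prototypes is the kind of evidence a handful of opinions can't provide.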
In fact, you could run usability studies post-release as well, capturing a ton of information each time people use a product. Researchers can conduct sentiment analysis with the millions of social media posts and other feedback to determine overall trends about what "sucks" and what "rocks."
Finding Insights in Big Data
What can you do with all of this data you collect? Discover unanticipated insights, apparently.
According to Mayer-Schonberger and Cukier, once you start analyzing big data, you start seeing surprises. For example, Oren Etzioni, a big data researcher, discovered that airline prices don't always increase as the departure date gets closer (see Farecast --> Bing/Travel). There are some interesting patterns that tend to arise when you look at large amounts of information.
Some researchers crunched big data from sumo matches and discovered irregularities that pointed to match rigging. Others analyzed credit card transactions to identify irregularities in usage that tipped off officials to fraud (in New Jersey).
Mayer-Schonberger and Cukier assert that rather than looking for causal patterns that help you predict results based on underlying factors, big data moves us more toward probability. We can say that given the occurrence of X, it's likely that Y will also be present -- regardless of what's causing X or Y.
One of the most intriguing applications is DNA sequencing. Given X pattern, it's likely that Y disease results, and so on. Never mind what causes X or Y -- what matters is that X leads to a likelihood of Y when you analyze millions of data points.
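The "given X, Y is likely" reasoning boils down to counting how often Y shows up among the records that contain X. Here is a minimal sketch of that calculation. The records and the labels "X" and "Y" are invented for illustration, and the function says nothing about why X and Y occur together, which is exactly the point the authors make.

```python
def conditional_probability(records, x, y):
    """Estimate P(y present | x present) from an iterable of feature sets.

    Returns None when x never appears, since the estimate is undefined.
    """
    with_x = 0
    with_both = 0
    for features in records:
        if x in features:
            with_x += 1
            if y in features:
                with_both += 1
    if with_x == 0:
        return None
    return with_both / with_x


# Made-up observations: each set lists what was seen in one record.
records = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"}, set()]

if __name__ == "__main__":
    print(conditional_probability(records, "X", "Y"))
```

With only five records the estimate is meaningless in practice; the approach becomes interesting when the counts run into the millions.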
Messy Is Okay Because It's More Accurate
How does big data apply to the tech comm profession? Mayer-Schonberger and Cukier explain that neatly classifying all the data into precise buckets is not really possible with big data. Instead, big data uses a more messy and imprecise tagging methodology (for example, tagging photos on Yahoo's Flickr site).
Why tagging? The authors explain that in the days before big data, sampling was the only means of analyzing the information. And with the small sample, we could be careful to exclude outliers, to assess and classify all the sample data in very clean, neat ways. You could put all the books in a library into various card catalogs sorted by title, author, or subject.
But big data is much more massive, and as such, it doesn't fit into neat little folders and organization systems. Yahoo tried to classify the content on the web into logical, well-organized folders, but the content quickly overflowed and got too messy. Organization on the web, if done at all, is usually done via ad hoc tags created by users to accomplish a variety of purposes.
Mayer-Schonberger and Cukier explain:
"The imprecision inherent in tagging is about accepting the natural messiness of the world. It is an antidote to more precise systems that try to impose a false sterility on the hurly burly of reality, pretending that everything under the sun fits into neat rows and columns. There are more things in heaven and earth than are dreamt of in that philosophy." (chapter 3, 28 min.)
In other words, when you take in all the data, you get the whole gnarly mess of information. Its messiness is characteristic of the realness of the data. You can't throw out the oddball assets and other nonstandard types as you might do with a "random" sample.
Tagging content allows you to scale the content in a much easier way. Rather than trying to stuff content into one folder after another, with tags you can create multiple views, and your content automatically finds its way to the view designated for the tag.
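As a small illustration of how tags produce multiple views without any folders, here is a sketch that builds a tag-to-articles index. The article titles and tags are made up, and the function name is hypothetical; the point is only that one article can appear in several views at once, which a single folder can't offer.

```python
from collections import defaultdict


def build_tag_index(articles):
    """Map each tag to the list of article titles that carry it."""
    index = defaultdict(list)
    for title, tags in articles.items():
        for tag in tags:
            index[tag].append(title)
    return index


# Hypothetical content: each article carries any number of ad hoc tags.
articles = {
    "Installing plugins": ["wordpress", "how-to"],
    "Theme basics": ["wordpress", "design"],
    "Color theory": ["design"],
}

views = build_tag_index(articles)

if __name__ == "__main__":
    for tag, titles in views.items():
        print(tag, "->", titles)
```

Adding a new tag to an article creates or extends a view automatically; nothing has to be moved or reorganized.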
But whether a typical documentation scenario fits a big data situation is another question. Most likely it doesn't. Either way, in the larger landscape of the web, the "tags" on your content are just the keywords in your article title.
"Wordpress" is a great example of a tag. While the documentation for WordPress partly lives on the WordPress Codex, there's a much wider distribution of content across the web, organized only by the tag "wordpress" (thank goodness it's not a two-word name with umpteen syntax options!).
As we move into an era of big data, I think we'll have to say goodbye to traditional classification systems and embrace a more chaotic, dynamic, messy body of content -- one that doesn't look very organized at all, but one where users find answers because the answers are there, buried somewhere in the pile of big data.
About Tom Johnson
I'm a technical writer / API doc specialist based in the Seattle area. In this blog, I write about topics related to technical writing and communication, such as software documentation, API documentation, visual communication, information architecture, writing techniques, plain language, tech comm careers, and more.