The Semantic Web and Content Findability: Interview with Patrick Warren [Organizing Content 14]
In a previous comment, Larry Kunz wondered if the semantic web might be useful in helping users find the content they're looking for. I decided to ask Patrick Warren to write a guest post on the semantic web based on questions I asked him. There's a lot of material to ingest and think about in Patrick's responses. They definitely get the ball rolling in a new direction for this series.
What's your current role and job title (or area of interest)?
I'm about to start a brand new assignment where my title will be Business Process Management Analyst. In this role I will be establishing and teaching business process mapping and modeling best practices, standards and tool usage, to assist business areas in documenting and modeling their business processes as part of a continuous improvement initiative and indirectly the SmartGrid initiative.
What's the semantic web and why does it matter for technical communicators?
To answer this question in the simplest way possible, I will borrow bits and pieces directly from some Wikipedia terms and definitions.
At the core of the Semantic Web is a set of W3C specifications and standards: Resource Description Framework (RDF), RDF Schema (RDFS) and Web Ontology Language (OWL). RDF was originally designed as a metadata (data about data) data model. A data model in software engineering is an abstract model that describes how data are represented and accessed. Data models formally define data elements and relationships among data elements for a (specific) domain of interest. RDF has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats, for example, RDF/XML.
The RDF data model is similar to classic conceptual modeling approaches such as Entity-Relationship or Class diagrams, as it is based upon the idea of making statements about resources (in particular Web resources) in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.
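To make the subject-predicate-object idea concrete, here is a minimal sketch in plain Python. It is not RDF itself, only the shape of the data model; the `help:` and `app:` names and the predicates are invented for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate, and an object."""
    subject: str
    predicate: str
    object: str


# Hypothetical statements about a help topic; every name here is illustrative.
statements = [
    Triple("help:PrintDialog", "describes", "app:PrintFeature"),
    Triple("help:PrintDialog", "hasTitle", "Printing a document"),
    Triple("help:PrintDialog", "relatedTo", "help:PageSetup"),
]


def describe(subject: str, triples: list[Triple]) -> dict[str, list[str]]:
    """Collect every predicate/object pair asserted about one subject."""
    description: dict[str, list[str]] = {}
    for t in triples:
        if t.subject == subject:
            description.setdefault(t.predicate, []).append(t.object)
    return description


print(describe("help:PrintDialog", statements))
```

Each tuple is one small statement about a resource, and the full description of that resource is simply the set of statements that share its subject.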
This mechanism for describing resources is a major component in what is proposed by the W3C's Semantic Web activity: an evolutionary stage of the World Wide Web in which automated software can store, exchange, and use machine-readable information distributed throughout the Web, in turn enabling users to deal with the information with greater efficiency and certainty.
Source: Resource Description Framework, http://en.wikipedia.org/wiki/Resource_Description_Framework
Side Note: (Many of us have already been using RDF in the form of RSS feeds. RSS version 1 is RDF, called RDF Site Summary, but was based on an early working draft of the RDF standard, and was not compatible with the final RDF Recommendation.)
RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called Resource Description Framework (RDF) vocabularies, intended to structure RDF resources. Many RDFS components are included in the more expressive language Web Ontology Language (OWL). The RDFS vocabulary builds on the limited vocabulary of RDF.
Source: RDF Schema, http://en.wikipedia.org/wiki/RDFS
The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies. They are characterized by formal semantics and RDF/XML-based serializations for the Semantic Web. OWL started as a research-based revision of DAML+OIL aimed at the semantic web.
The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.
—World Wide Web Consortium, W3C Semantic Web Activity
Source: Web Ontology Language, http://en.wikipedia.org/wiki/Web_Ontology_Language
So going back to the description of RDF we have the subject-predicate-object, something technical communicators use every day in their work. It is also the smallest possible element that can represent a bit or piece of knowledge, one that works across languages and can also be interpreted by machine-based intelligence or computer systems. As writers or wordsmiths, technical communicators form sentences and paragraphs using a natural language such as English, composed of multiple subject-predicate-object strings, strung together to describe and define a 'thing' or resource.
In RDF we would essentially perform the same process, but instead of sentences and paragraphs, we would create a collection of RDF statements composed of individual subject-predicate-objects, commonly called triples. Each individual subject, predicate and object is then replaced by an individual Uniform Resource Identifier (URI; the familiar URL is one type of URI), which is a string of characters used to identify a name or a resource on the Internet. Such identification enables interaction with representations of the resource over a network (typically the World Wide Web) using specific protocols.
Source: Uniform Resource Identifier, http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
For example, I could write a series of sentences and paragraphs describing the resource, The Society for Technical Communication or STC. I can also create a collection of triples that essentially does the same thing, and with the added benefit of using URIs, I can link it to other resources on the web.
<http://en.wikipedia.org/wiki/Technical_communication> or even better <http://dbpedia.org/page/Technical_communication>
I would continue building upon and expanding this collection of RDF statements or triples, further defining what a Chapter and SIG is, using their respective URIs for each Chapter and SIG, and on and on until I was satisfied that everything that I knew or could describe about STC was captured, creating a representation of my knowledge of that subject and its relationships to other objects and resources on the web.
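As a rough sketch of what the start of such a collection might look like, the following Python builds three triples about STC and prints them in N-Triples syntax, one statement per line. The FOAF and Dublin Core terms are real vocabulary URIs, and the DBpedia link is the one mentioned above. The `http://example.org/org/STC` identifier is a placeholder assumed for illustration, not an official URI, and the choice of predicates is only one possibility.

```python
# Real, widely used vocabulary namespaces.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Placeholder identifier for STC (an assumption, not an official URI).
STC = "http://example.org/org/STC"


class Literal(str):
    """Marks a value as literal text rather than a URI."""


stc_triples = [
    (STC, RDF_TYPE, FOAF + "Organization"),
    (STC, FOAF + "name", Literal("Society for Technical Communication")),
    (STC, DCTERMS + "subject", "http://dbpedia.org/page/Technical_communication"),
]


def term(value):
    """Format one term in N-Triples syntax: quoted text for literals, <uri> otherwise."""
    if isinstance(value, Literal):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


print(to_ntriples(stc_triples))
```

Describing Chapters and SIGs would mean appending more tuples to the same list, each with its own URI as the subject or object.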
Where possible I would try to use and leverage existing RDF vocabularies and URIs, such as those from the Dublin Core (DC), FOAF, SKOS, DOAC, etc. as well as RDFS. As I further refined the descriptions and elements, I may eventually move to using OWL (via Protégé) instead of RDF or RDFS, and create and publish a formal ontology. A good example would be for the term 'Technical Communication', which is our specific knowledge domain. This could then be connected to other ontologies and knowledge domains, creating a web of Linked Data, such as what is seen and under development in the Linking Open Data community project.
This also lays the ground work for the creation of rich and interactive visualizations using both 2D and 3D models for the representation of concepts, content, information and knowledge. A simple example of this is similar to Visual Thesaurus. Other examples of this type of 'rich visualization' can be seen in the recent outcomes of the Netflix Prize Movie Similarity Visualization, BellKor's Pragmatic Chaos, The Landscape of Movies, and The Ensemble. More about this and the current activities can be read here: DMSIG – Advances in Ensemble Learning from the $1,000,000 Netflix Prize Contest 6/28/10.
How can the semantic web help solve some of the problems of content findability, especially with help information?
As you can see in the previous description of the underlying concepts, elements and components that comprise the Semantic Web and semantic technologies and vocabularies, it's all about defining and describing resources and, just as important, the relationships and connectivity between those resources.
A help system or file is essentially the same thing. We use combinations of text, screen captures, and in some cases animations and video of a 'thing' or resource, such as a software application, and how to make use of that application or 'thing' to accomplish a specific task or activity. It is an information or knowledge model, designed for a very specific purpose: how to use some 'thing'.
Another essential part of any help system is connecting its various pieces and establishing relationships among them, which provides the navigational elements for that system of information. With help systems this is typically accomplished with a Table of Contents, an index of keywords for that specific application domain, a basic search, and in some cases a task- or activity-based approach to constructing it.
Semantic technologies such as RDF would allow for the creation of new navigational methods and models, since we are defining the relationships between resources in a much more structured, granular and refined way. Search could be expanded and performed on additional predicates, the relationships between the subject and object resources internal to that particular help system and topic, and quite possibly to other external resources.
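A search on predicates can be sketched in a few lines of Python. This is an illustration only, assuming a tiny in-memory list of triples; the `topic:` and `task:` names and the predicates are hypothetical, and a real system would use a triple store and a query language such as SPARQL.

```python
# Hypothetical triples for a small help system; all names are illustrative.
help_triples = [
    ("topic:printing", "explainsTask", "task:print-document"),
    ("topic:printing", "requires", "topic:page-setup"),
    ("topic:page-setup", "explainsTask", "task:set-margins"),
    ("topic:export-pdf", "requires", "topic:page-setup"),
    ("topic:export-pdf", "seeAlso", "http://example.org/pdf-reference"),
]


def find(triples, subject=None, predicate=None, obj=None):
    """Return the triples that match the given parts; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Which topics depend on the page setup topic?
dependents = [s for s, _, _ in find(help_triples, predicate="requires", obj="topic:page-setup")]
print(dependents)

# Everything the help system says about exporting to PDF, including external links.
print(find(help_triples, subject="topic:export-pdf"))
```

The same `find` call answers questions a keyword index cannot, such as "what must I read before this topic?", because the relationship itself is part of the data.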
Navigation also becomes much more inherent, since we defined these relationships as part of the content creation process and underlying structure, much in the same way we create a TOC or index today. Eventually more could be automated and accomplished programmatically through the use of complex and robust algorithms and Natural Language Processing (NLP) of the resulting text and content.
An example of this is Swoogle, a search engine for Semantic Web ontologies, documents, terms and data published on the Web.
Swoogle employs a system of crawlers to discover RDF documents and HTML documents with embedded RDF content. Swoogle reasons about these documents and their constituent parts (e.g., terms and triples) and records and indexes meaningful metadata about them in its database.
Swoogle provides services to human users through a browser interface and to software agents via RESTful web services. Several techniques are used to rank query results inspired by the PageRank algorithm developed at Google but adapted to the semantics and use patterns found in semantic web documents.
Source: Swoogle, http://en.wikipedia.org/wiki/Swoogle
Before any of this can happen though, there is one major change that I see being needed concerning help development applications. That is the existing separation of the help development platform or system and the resulting help system or file that is generated. In effect we end up creating something that is rigid and static, completely separated from the system that created it. This creates correction, update and expansion issues, since the creation process needs to be reinitiated, changes or additions collected, tracked, managed and added and a new help file or system compiled and generated.
If instead we moved to a new model of help development, one where the application and resulting system were united as one continuous whole, we could create a dynamic and interactive system, one that can be updated and improved in real-time via comments and feedback from users. Additionally, we could employ much more contemporary user interfaces and methods of navigation, track and monitor user behaviors for improving the system over time, and allow users (or the system) to create personalized usage profiles that they could return to, similar to what Amazon and others are now using in the eCommerce domain.
You mentioned that what's next might be "RDFa in XHTML files or RDF/XML auto-generated via PHP, Perl or Python." Can you expand on that? What is RDF and why would it be helpful for technical writers?
I described RDF above. RDFa, or Resource Description Framework in attributes, is a W3C Recommendation that adds a set of attribute-level extensions to XHTML for embedding rich metadata (RDF) within Web documents. The RDF data model mapping enables its use for embedding RDF triples within XHTML documents; it also enables the extraction of RDF model triples by compliant user agents. As you may already know, XHTML (Extensible Hypertext Markup Language) is a family of XML markup languages that mirror or extend versions of the widely used Hypertext Markup Language (HTML), the language in which web pages are written.
Today's scripting languages, such as Perl, Python and PHP, have XML libraries and extensions that allow for the programmatic and dynamic creation of web pages, both HTML and XHTML, the latter of which is also being extended to include RDFa. This then provides semantic web-based parsing engines, like Swoogle mentioned previously, with additional RDF metadata to index, further define and make inferences about that web page as a specific resource and source of information and knowledge.
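As a small sketch of that kind of programmatic generation, the following Python function emits an XHTML fragment whose `about` and `property` attributes carry RDFa metadata using Dublin Core terms. The function name, the topic URI and the sample text are assumptions made up for this example; only the RDFa attributes and the Dublin Core namespace come from the published standards.

```python
from xml.sax.saxutils import escape, quoteattr

# Dublin Core element set namespace, bound to the "dc" prefix in the markup.
DC_NS = "http://purl.org/dc/elements/1.1/"


def topic_to_rdfa(uri, title, creator, body):
    """Render one help topic as an XHTML fragment annotated with RDFa."""
    return (
        f"<div xmlns:dc={quoteattr(DC_NS)} about={quoteattr(uri)}>\n"
        f'  <h1 property="dc:title">{escape(title)}</h1>\n'
        f'  <p>By <span property="dc:creator">{escape(creator)}</span></p>\n'
        f"  <p>{escape(body)}</p>\n"
        f"</div>"
    )


# Hypothetical topic; the URI and names are placeholders.
fragment = topic_to_rdfa(
    "http://example.org/help/printing",
    "Printing & page setup",
    "A. Writer",
    "Choose File > Print to open the print dialog.",
)
print(fragment)
```

A browser shows this as an ordinary heading and paragraphs, while an RDFa-aware parser can read off two triples: the topic has a `dc:title` and a `dc:creator`.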
These are typically combined with the Apache HTTP Web server and MySQL to provide what is called a WAMP (Windows, Apache, MySQL, Python, Perl or PHP), LAMP (Linux, Apache, MySQL, Python, Perl or PHP) or MAMP (Mac OS X, Apache, MySQL, Python, Perl or PHP) stack, which forms the foundation for many of the recent open-source web applications, such as WordPress, Joomla and Drupal.
RDF/XML is one type of serialization of an RDF graph or group of RDF triples, expressed as an XML document. This provides the ability to create content and data feeds, both to and from web sites that can be parsed and reused in other applications or on other web sites. Some web applications now provide for the creation and consumption of RDF/XML as a feature.
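A minimal sketch of auto-generating RDF/XML with Python's standard `xml.etree.ElementTree` module might look like this. The topic URI and field values are placeholders; the RDF and Dublin Core namespaces are the real ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def to_rdfxml(uri, title, creator):
    """Serialize two Dublin Core statements about one resource as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri}
    )
    ET.SubElement(description, f"{{{DC_NS}}}title").text = title
    ET.SubElement(description, f"{{{DC_NS}}}creator").text = creator
    return ET.tostring(root, encoding="unicode")


# Hypothetical help topic used only for illustration.
xml_text = to_rdfxml("http://example.org/help/printing", "Printing a document", "A. Writer")
print(xml_text)
```

Because the result is ordinary XML, another site or application can parse it with any XML library and reuse the statements, which is what makes it suitable for feeds.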
To implement some of these ideas, would technical writers need to adopt web platforms such as Drupal?
Some of us have witnessed the birth of the web, and still others have participated in creating it -- at first using simple text editors, writing HTML by hand back in the mid-'90s, and later using systems and applications that programmatically generated the desired web pages, the layout, and look and feel, moving us further and further away from having to type and code manually.
These same tools and processes have continued to evolve, providing us with the next generation of tools and improved processes of content creation we use today. For me it was a natural progression, being fed up and totally dissatisfied with commercial offerings concerning both content creation and the management and delivery of that content. But I am in one of two distinct camps of technical communication, one in which the content is never intended to be converted to print; it is for all intents and purposes 100% online delivery and consumption. This affords me a higher level of flexibility in my choice of tools than those who create printed materials have.
So today's technical writers are again left with a choice: continue to use antiquated tools and processes, or forge ahead into using what may possibly be the next evolution in tools for our trade and profession. Of the dozens of web-based applications I have tested and reviewed in the past three years or so, Drupal had what I was looking for as far as flexibility and extensibility. With it I can create various forms of web-based content management systems for many types of information, from simple news sites, to requirements management systems, to full-blown knowledge systems spanning an entire company or enterprise, all the while linking and connecting the multiple information silos that still exist in most companies today.
Drupal is also one of very few web-based applications breaking ground in the RDF, Semantic Web, and Linked Data arenas -- initially with free third-party contributed modules and more recently by absorbing the functionality provided by those modules into the core of Drupal. The next release of Drupal, version 7, will be Semantic Web enabled to a certain extent right out of the box. So will OpenPublish, a specialized distribution of Drupal aimed at a variety of media outlet sites, including magazines, newspapers, journals, trade publications, broadcast, wire service, multimedia sites and membership publications.
I predict that there will be a growing need for technical writers with experience and knowledge of these types of publication systems and processes, on various levels.
Where do you get so much enthusiasm for the Semantic Web? You seem really passionate about it.
I began my career 26 years ago as an electrical engineer, predominantly with a focus on microprocessors, microprocessor-based systems and software development. I eventually moved into full-time technical writing and training, but remained heavily involved in the engineering and manufacturing of microprocessor-based, software-intensive computer information systems and in defining and creating standardized development methodologies.
I have always visualized data, content, processes, information and knowledge as one big, continuously evolving dynamic web, network or super system; a system-of-systems. Nothing ever exists alone or in a vacuum; everything (and everyone) is connected in one way or another, in multiple and evolving ways, to something else. These new Semantic Web tools and technologies now allow me to bring to life the visualization that I have had all along.