Part 1: Linked Data Introduction
Background – The Web of Documents
Everyone who has ever used a browser is familiar with the World Wide Web that we’ve been enjoying for many years. This Web, really a web of documents, has provided a foundation for sharing previously unimaginable amounts of information, yet some key implementation details ultimately limit its usefulness. The Web represents information as text on pages. It was designed to allow humans to read, filter out redundant information, and infer meaning based on the natural language used, the context of the information, and the existing knowledge of the reader. In other words, we humans glean data from the web pages that we read. Furthermore, the meaning of relationships between different pieces of information in the Web is never explicit. Again, we humans infer that meaning; HTML just isn’t expressive enough to represent typed relationships between defined entities.
The document-centric Web does contain a lot of useful data though. The problem is, due to the way it is represented, we can’t do as much as we’d like with that data. And just as problematic, if not more so, a lot of the data that we could use to help us answer our questions simply isn’t published.
When we search the Web we rely on algorithms employed by search engine indexers to provide links to documents that the indexer believes are relevant, and which may or may not contain the information we seek. We trust the algorithms not to exclude useful and relevant information from the result set. We are then expected to filter out any remaining irrelevant information and to combine information from different pages in order to arrive at answers to our questions.
Imagine now being able to perform a search that allows us to expect accurate and relevant answers to even the most complex of the questions that we may ask. Sound good? That’s what can be expected when instead of searching a web of documents, we search a web of structured data.
Linked Data
Linked Data is a phrase that refers to a set of best practices for establishing a web of data, that is, for publishing and connecting data on the Web. Linked Data can be read by machines, has explicitly defined meaning, can be linked to other data, and can in turn be linked to from other data.
When we describe ‘meaning’ in the context of Linked Data we are referring to describing data in a form that is understandable by computers.
Linked Data has four main supporting principles:
- Use URIs to identify things
- Use HTTP URIs so that you can de-reference / look up those things
- When someone de-references a URI, provide some useful information about that thing
- Include links to other URIs, so that additional, related things can be discovered
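To make the second and third principles a little more concrete, here is a minimal Python sketch of what de-referencing might look like from the client side. It only prepares an HTTP GET request that asks for an RDF representation of a thing; it does not send it. The URI http://example.org/people/alice and the function name build_lookup_request are illustrative assumptions, and whether a given server honours the Accept header is entirely up to that server.

```python
from urllib.request import Request


def build_lookup_request(thing_uri: str) -> Request:
    """Prepare (but do not send) an HTTP GET asking for RDF that describes a thing."""
    return Request(
        thing_uri,
        # Prefer Turtle, fall back to RDF/XML (standard HTTP content negotiation).
        headers={"Accept": "text/turtle, application/rdf+xml;q=0.9"},
        method="GET",
    )


# Hypothetical URI, used for illustration only.
request = build_lookup_request("http://example.org/people/alice")
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would actually perform the lookup. What comes back, and whether it contains links to further URIs, depends on how the publisher has set up the server.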
Just as the Web relies on Uniform Resource Identifiers (URIs) and the Hypertext Transfer Protocol (HTTP) to provide a hugely scalable architecture for linking documents (HTML pages) regardless of where those documents are physically located, Linked Data uses the same underpinning technology stack to provide the same scalability for linking structured data, regardless of where that data is located.
Representing data as RDF (Resource Description Framework) allows machines to understand the data and how it is related to other data.
RDF describes data and relationships as triples, each consisting of a subject, a predicate, and an object, and allows us to build a web of data.
You can think of a triple as being the structure of a simple sentence. For example, let’s take the sentence ‘Tim Berners-Lee created HTML’.
In this sentence, the subject is ‘Tim Berners-Lee’, the object is ‘HTML’ and the predicate is ‘created’ and describes the relationship between the subject and the object. Remember that we can’t easily describe these typed relationships between entities with the web of documents.
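With RDF, subjects are identified by URIs (RDF also allows anonymous "blank" nodes, which we set aside here). Objects can be either URIs of related resources or simple literals such as a string, a date, or a number (e.g. in “HTML was invented in 1990”, the object is the literal 1990).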
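The shape of a triple can be sketched in a few lines of Python. This is only an illustration of the data structure, not how any particular RDF library models it. The class names and the http://example.org/ namespace, including the terms created and yearInvented, are hypothetical and do not come from a real vocabulary.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: Union[str, int]


@dataclass(frozen=True)
class Triple:
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]  # the "object" of the triple, in RDF terms


EX = "http://example.org/"  # hypothetical namespace, for illustration only

statements = [
    # 'Tim Berners-Lee created HTML': the object is another resource.
    Triple(URI(EX + "Tim_Berners-Lee"), URI(EX + "created"), URI(EX + "HTML")),
    # 'HTML was invented in 1990': the object is a plain literal.
    Triple(URI(EX + "HTML"), URI(EX + "yearInvented"), Literal(1990)),
]

for t in statements:
    kind = "URI" if isinstance(t.obj, URI) else "literal"
    print(f"{t.subject.value} {t.predicate.value} {t.obj.value} ({kind} object)")
```

Note that the object of the first statement is the subject of the second. That shared URI is what lets individual triples join up into a web of data.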
Predicates are also identified by URIs, which are collected in vocabularies. Each vocabulary groups the predicates that describe types of relationships between data for a given domain. Vocabularies help you understand data and its properties more quickly. For example, OSLC (Open Services for Lifecycle Collaboration) has a vocabulary that describes properties of, and typical relationships between, lifecycle resources. A set of common vocabularies has been established (e.g. Dublin Core: http://dublincore.org/documents/dcmes-xml/, Friend Of A Friend (FOAF): http://xmlns.com/foaf/0.1/). When publishing Linked Data, it is best practice to check whether your data can be represented using terms from existing vocabularies.
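Because every predicate is just a URI inside a vocabulary's namespace, a short name such as dc:creator is simply shorthand for a full URI. The Python sketch below shows that expansion under simple assumptions: the expand function and the PREFIXES table are illustrative names, and the two namespaces are the Dublin Core elements namespace and the FOAF namespace.

```python
# Namespace URIs for two common vocabularies.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name: str, prefixes: dict = PREFIXES) -> str:
    """Turn a short name like 'dc:creator' into the full predicate URI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local_name


print(expand("dc:creator"))
print(expand("foaf:name"))
```

Two publishers who both use dc:creator end up using exactly the same URI, which is what makes their data comparable without any prior agreement between them.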
In N3 / Turtle (two popular serialization formats for RDF) the triple for ‘Tim Berners-Lee created HTML’ might look like:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://en.wikipedia.org/wiki/HTML>
    dc:title "HTML" ;
    dc:creator <http://www.w3.org/People/Berners-Lee/card> .
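The Turtle above is a compact way of writing two separate statements about the same subject: one giving its title and one giving its creator. As a rough sketch, the Python below writes both out in full, one triple per line, in the style of the N-Triples format. The helper name ntriples_line is an assumption made for illustration, and the escaping covers only backslashes and double quotes, so it is not a complete serializer.

```python
DC = "http://purl.org/dc/elements/1.1/"
HTML_PAGE = "http://en.wikipedia.org/wiki/HTML"


def ntriples_line(subject: str, predicate: str, obj: str, obj_is_uri: bool) -> str:
    """Format one triple as a single line: URIs in angle brackets, literals in quotes."""
    if obj_is_uri:
        obj_text = f"<{obj}>"
    else:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_text = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {obj_text} ."


lines = [
    # The title is a literal string.
    ntriples_line(HTML_PAGE, DC + "title", "HTML", obj_is_uri=False),
    # The creator is another resource, identified by its own URI.
    ntriples_line(HTML_PAGE, DC + "creator",
                  "http://www.w3.org/People/Berners-Lee/card", obj_is_uri=True),
]

for line in lines:
    print(line)
```

Written this way, it is easier to see that the prefix and the semicolon in Turtle are only abbreviations; the underlying data is still a list of subject, predicate, object statements.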
One key benefit of RDF is that it can represent any kind of data. With RDF, we need never be concerned that we won’t be able to represent some as yet unforeseen data – in other words, traditional concerns around data models (as encountered with relational databases) aren’t a worry to us.
Linking Open Data
The best-known example of Linked Data is the Linking Open Data (LOD) project, which was created to identify data sets that exist in the public domain, publish them, and link them using the Linked Data principles, in order to create a publicly accessible web of data.
As of March 2012, this web of open data comprised over 52 billion RDF triples. Still, these numbers are relatively small when we consider just how much data is estimated to be stored in all the world’s databases (over 3 petabytes). Whilst a lot of this data will be private, huge amounts could be published to the web of open data, and as the web of open data grows, we will be able to answer more and more questions that we previously could not, or at least not easily.
Figure 1: Visualization of the Linked Open Data Cloud
Part 2 of this blog series discusses Linked Data in the systems and software engineering lifecycle.
Benjamin Williams
Senior Product Manager, Rational Systems Platform
bwilliams@uk.ibm.com
Hi Benjamin,
Very good blog about Linked Open Data (LOD). Let me ask you some questions:
* In the context of RTC, could we talk about LOD, given that the Linked Data in RTC is password protected?
* URIs are central to LOD. Wouldn’t it be good to avoid the server name and port (9443) in our RTC URIs?
Thanks in advance
Hello shufnagl.
Thank you for the kind words.
Typically no, data in lifecycle development tools such as RTC would not be categorized as open.
As you mention, the data most often requires authentication to access.
Often the data is proprietary, confidential and commercially sensitive. In most industries the data stores are not accessible outside company firewalls.
Linked Open Data is just one example of the application of Linked Data principles.
In part 2 of this series of blog entries (scheduled for the beginning of July) I talk about the application of Linked Data principles across the development lifecycle.
We call this Linked Lifecycle Data.