Scalable Integration of Linked Data

Introduction

The goal of this tutorial is to introduce, motivate and detail techniques for integrating heterogeneous structured data from across the Web. Inspired by the growth in Linked Data publishing, the tutorial aims to educate Web researchers and practitioners about this new publishing paradigm. It will show how Linked Data enables uniform access to, and parsing and interpretation of, data, and how this new wealth of structured data can be exploited to create new applications or enhance existing ones.

As such, the tutorial will focus on Linked Data publishing and related Semantic Web technologies and standards, introducing scalable techniques for crawling, indexing and automatically integrating structured heterogeneous Web data through reasoning.


Content

Monday, October 24th

Session 1: Introduction to RDF and Linked Data

The first session gives an overview of RDF and Linked Data publishing. We will discuss the RDF data model and Linked Data principles for publishing RDF data on the Web. In particular, this session will cover:

  • RDF rationale and basics
  • Linked Data principles and introduction
  • Current adoption and trends in Linked Data
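
As a rough illustration of the RDF data model covered in this session, the sketch below holds a graph as a set of (subject, predicate, object) triples and looks up values by subject and predicate. The example.org URIs and the objects helper are made up for illustration; only the FOAF namespace is a real vocabulary.

```python
# A tiny RDF-like graph as a set of (subject, predicate, object) triples.
# The example.org URIs are illustrative placeholders, not real data.
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"

graph = {
    (EX + "alice", FOAF + "name", '"Alice"'),
    (EX + "alice", FOAF + "knows", EX + "bob"),
    (EX + "bob", FOAF + "name", '"Bob"'),
}


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples with this subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Because both resources are named with URIs, the object of one triple
# can be the subject of another: this is what links the data together.
friends = objects(graph, EX + "alice", FOAF + "knows")
friend_names = [name for f in friends for name in objects(graph, f, FOAF + "name")]
```

Here friend_names follows the foaf:knows link from one resource to another and reads the name attached to the target.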

Session 2: Scalable Linked Data Crawling

This session gives an overview of the state of the art in efficient data-retrieval techniques, including novel challenges and techniques for crawling Linked Data from the Web. We will present the architecture of a crawler for small to medium-sized datasets, ranging up to several hundred million triples. In particular, this session will cover:

  • Linked Data location, access and crawling
  • Scalable crawling techniques and algorithms
  • Description of the open-source LDSpider Linked Data crawler
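
The basic idea behind crawling Linked Data can be sketched as a breadth-first traversal: fetch a document, extract its triples, and queue the documents behind any URIs mentioned in them. The sketch below is a simplification under stated assumptions and is not LDSpider's algorithm or API: FAKE_WEB, fetch and crawl are hypothetical names, and the "Web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag

KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Hypothetical stand-in for the Web: document URL -> triples served there.
FAKE_WEB = {
    "http://example.org/a": [
        ("http://example.org/a#me", KNOWS, "http://example.org/b#me"),
    ],
    "http://example.org/b": [
        ("http://example.org/b#me", KNOWS, "http://example.org/c#me"),
    ],
    "http://example.org/c": [],
}


def fetch(url):
    """Pretend to dereference a URL; a real crawler would issue an HTTP GET."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, max_docs):
    """Breadth-first crawl from the seed documents, fetching at most max_docs."""
    frontier = deque(seeds)
    seen = set(seeds)
    fetched, triples = [], []
    while frontier and len(fetched) < max_docs:
        url = frontier.popleft()
        fetched.append(url)
        for s, p, o in fetch(url):
            triples.append((s, p, o))
            for term in (s, o):
                if term.startswith("http"):
                    # Strip the fragment to get the document to dereference.
                    doc, _fragment = urldefrag(term)
                    if doc not in seen:
                        seen.add(doc)
                        frontier.append(doc)
    return fetched, triples


fetched, triples = crawl(["http://example.org/a"], max_docs=10)
```

A production crawler would additionally need politeness delays per host, content negotiation, redirect handling and a disk-backed frontier; none of these are shown here.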

Session 3: Scalable RDF Indexing Techniques

This session presents scalable techniques for indexing and querying local repositories of Linked Data. We will discuss the standardised SPARQL query language and thereafter discuss the state of the art in RDF storage with respect to research, directions and applications. In particular, this session will cover:

  • Overview and challenges
  • Introduction to the SPARQL standard
  • Scalable/distributed RDF indexing systems (e.g., YARS2, etc.)
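
One common idea in RDF indexing is to store the same triples under several orderings so that any triple pattern with bound terms can be answered by a direct lookup instead of a full scan. The sketch below is a minimal in-memory illustration of that idea and is not the design of YARS2 or of any other system named above; the TripleIndex class and its methods are hypothetical.

```python
from collections import defaultdict


class TripleIndex:
    """In-memory triple store with three orderings: SPO, POS and OSP."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a variable."""
        if s is not None:
            # Subject bound: look up in the SPO ordering.
            for p2, objs in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            # Predicate bound, subject free: look up in the POS ordering.
            for o2, subjs in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjs:
                        yield (s2, p, o2)
        elif o is not None:
            # Only the object bound: look up in the OSP ordering.
            for s2, preds in self.osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            # Nothing bound: full scan.
            for s2, by_p in self.spo.items():
                for p2, objs in by_p.items():
                    for o2 in objs:
                        yield (s2, p2, o2)


index = TripleIndex()
index.add("ex:alice", "foaf:knows", "ex:bob")
index.add("ex:bob", "foaf:knows", "ex:carol")
index.add("ex:alice", "foaf:name", '"Alice"')

# Roughly the pattern behind: SELECT ?s ?o WHERE { ?s foaf:knows ?o }
knows = sorted(index.match(p="foaf:knows"))
```

The trade-off is space for lookup speed: every triple is stored three times. Systems working at scale typically use sorted, compressed on-disk structures rather than nested dictionaries.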

Session 4: Reasoning: Motivation and Overview

This session gives an introduction to the RDFS and OWL (2) standards and to rule-based reasoning, with heavy emphasis on motivating reasoning for the Linked Data use case and for integrating heterogeneous data from a large number of diverse sources. We also introduce algorithms which incorporate information about the provenance of data during reasoning to ensure robustness in the face of noisy or impudent remote data. In particular, this session will cover:

Session 5: Scalable Distributed Reasoning over MapReduce

This session presents scalable distributed reasoning using the MapReduce distribution framework, enabling high performance over a cluster of commodity hardware. This session details the MapReduce framework (employed by Google and Yahoo, among others) and the award-winning WebPIE system which integrates optimised execution strategies for rules supporting a (pragmatic) fragment of OWL semantics.

  • MapReduce architecture
  • Core optimisations and approach for distributed reasoning
  • Implementing RDFS
  • Extension to pD* (OWL-Horst)
  • Hands-on: how to launch the reasoner on a cluster and on the Amazon cloud
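
To give a flavour of how an RDFS rule can be phrased as a map and a reduce step, the sketch below simulates one rule in a single Python process: if x has type C and C is a subclass of D, then x has type D. This is an illustration only, not WebPIE's implementation and not the Hadoop API; the function names and the ex: identifiers are hypothetical, and none of the optimisations discussed in the session are shown.

```python
from collections import defaultdict

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_phase(triples):
    """Emit (class, tagged value) pairs so both sides of the join share a key."""
    for s, p, o in triples:
        if p == TYPE:
            yield o, ("instance", s)
        elif p == SUBCLASS:
            yield s, ("superclass", o)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Join the instances of a class with its superclasses."""
    instances = [v for tag, v in values if tag == "instance"]
    superclasses = [v for tag, v in values if tag == "superclass"]
    for x in instances:
        for d in superclasses:
            yield (x, TYPE, d)


def type_closure(triples):
    """Repeat the map/reduce job until no new triples are derived."""
    known = set(triples)
    while True:
        derived = set()
        for key, values in shuffle(map_phase(known)).items():
            derived.update(reduce_phase(key, values))
        new = derived - known
        if not new:
            return known
        known |= new


data = {
    ("ex:rex", TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}
closure = type_closure(data)
```

On this input the loop runs the job three times: the first pass derives that ex:rex is an ex:Mammal, the second that it is an ex:Animal, and the third finds nothing new. Reducing the number of such passes is one of the things a distributed reasoner has to optimise.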

Session 6: Implementing a LarKC Workflow

This session gives attendees hands-on experience in building scalable Linked Data applications. Some of the technologies presented in the previous sessions will be put together using a scalable workflow engine tailored for Linked Data: the Large Knowledge Collider (LarKC).

  • Workflow overview and rationale
  • LarKC platform overview
  • Hands-on: Building a LarKC Workflow for crawling and reasoning over Linked Data