Integrating Linked Data Technologies

Jans Aasman

This page describes a half-day tutorial conducted at the 2010 Extended Semantic Web Conference in Heraklion, Greece, May 30 - June 3. The fee for participation in the tutorial sessions is not included in the regular registration fee.

 

Abstract

There is an explosion of linked RDF datasets in the life sciences domain. A typical RDF dataset published on the web covers one particular domain and contains an ontology describing its data, a set of instances, and possibly some explicit owl:sameAs relations to instances in other datasets. In practice, exploring these data sources is far from trivial. The domain expert has to study each dataset to discover what classes it contains and what properties each class carries. Unfortunately, not all datasets come with full ontologies that make this easy. Most interesting problems require combining a large number of these datasets and then creating queries and analysis programs that touch multiple sources. This tutorial will discuss techniques for exploring linked datasets that lack even simple class descriptions, and datasets that do contain at least rdf:type statements, and then show how to use existing ontologies together with the output of these techniques to create an enriched schema space for data mining. We will work with visual tools to quickly understand what is in the various datasets, how they are linked, and how to automatically create SPARQL queries. A thorough overview of working with an RDF store will be provided.

Introduction

The goal of this half-day tutorial is to teach attendees how to express queries that involve RDFS++ reasoning, geospatial primitives, temporal primitives, and social network concepts. We assume that attendees have some introductory knowledge of RDF(S), SPARQL, entity extraction, spatial and temporal concepts, and social network analysis. We will use W3C-standards-based technology (AllegroGraph) for this tutorial, but the concepts learned will transfer to other Semantic Web solutions. Attendees who bring their laptops will receive a version of the software so they can work through the tutorial interactively on their own.

Tutorial

  1. (15 minutes) Basic RDF Store Operations
    1. Creating, opening and closing an RDF store
    2. Adding triples manually or by loading from an RDF file
    3. Deleting triples
    4. Getting triples out of a triple store
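    The four operations above can be sketched in a few lines. This is not AllegroGraph's own API; it is a minimal illustration using the open-source rdflib library, with a made-up example.org namespace.

```python
# Basic RDF store operations, sketched with rdflib as a stand-in
# for a commercial triple store's client library.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()                                        # create/open a store
g.add((EX.alice, FOAF.name, Literal("Alice")))     # add a triple manually
# g.parse("data.rdf", format="xml")                # or load from an RDF file

for s, p, o in g.triples((EX.alice, None, None)):  # get triples out
    print(s, p, o)

g.remove((EX.alice, FOAF.name, Literal("Alice")))  # delete a triple
print(len(g))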
  2. (5 minutes) Named Graphs: working with the fourth element of an RDF statement

    Many of the RDF stores now in existence are quad stores but for historical reasons we still call them triple stores.

  3. (20 minutes) SPARQL: the W3C standard query language for the Semantic Web.

    SPARQL's syntax resembles SQL but is more powerful, enabling queries over multiple disparate, remote or local data sources. It offers a relational, pattern-based approach to retrieving data from an RDF store.

  4. (40 minutes) RDFS++ Reasoning

    Description logic and OWL reasoners are good at handling (complex) ontologies. They are usually complete (they give all the possible answers to a query), but their execution times become unpredictable once the number of individuals grows into the millions. In practice, business intelligence is done over large numbers of individuals, so more scalable reasoning is needed. RDFS++ reasoning supports all the RDFS predicates and some of OWL's. It is not complete, but its performance is fast and predictable. We'll cover the supported predicates:

    1. rdf:type and rdfs:subClassOf
    2. rdfs:range and rdfs:domain
    3. rdfs:subPropertyOf
    4. owl:sameAs
    5. owl:inverseOf
    6. owl:TransitiveProperty
  5. (20 minutes) Prolog: Creating rules and querying the triple store with Prolog.

    RDF and OWL are powerful in themselves, but for more complex Business Intelligence questions you need a rule-based language. We will show how to use RDF-Prolog.

  6. (10 minutes) Full-text indexing and how to query it with SPARQL and Prolog
  7. (10 minutes) Range Queries: all numeric types, dates, telephone numbers, etc.
  8. (20 minutes) Social Network Analysis

    Learn how to apply social network analysis algorithms. We will show examples of how to find relationships between people, compute the strength of groups, etc. The algorithms we discuss include both classical search algorithms and specifically social-network measures: depth-first, breadth-first, bidirectional, best-first, and A* search; in-degree, out-degree, and nodal degree; ego group selection; density; actor degree centralization and group degree centralization; actor closeness and group closeness centrality; actor betweenness and group betweenness centrality; and cliques. Most of these algorithms are described in Social Network Analysis [1].
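    Two of the simplest measures on a toy undirected social graph, in plain Python with no graph library: breadth-first search for hop distance, and density (edges present divided by edges possible). The people and links are invented.

```python
# Breadth-first search and density on a small undirected graph.
from collections import deque

edges = [("ann", "bob"), ("bob", "cal"), ("cal", "dia"), ("ann", "cal")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def hops(start, goal):
    """Breadth-first search: minimum number of hops from start to goal."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

n = len(adj)
density = 2 * len(edges) / (n * (n - 1))  # undirected: 4 of 6 possible edges

print(hops("ann", "dia"))  # 2 (ann -> cal -> dia)
print(density)
```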

  9. (20 minutes) Spatial - Temporal Reasoning

    Learn how to write simple spatial-temporal queries. We will first show how to add longitude, latitude, and time information to the RDF store so that our geotemporal primitives work as efficiently as possible. Then we will show how to use basic primitives such as spatial and temporal bounding boxes in queries.

  10. (20 minutes) Graph Visualization Techniques
  11. (10 minutes) Requirements for running an RDF Store in the Enterprise
    1. Backups
    2. Replication
    3. Failover
    4. Instrumentation
    5. Management
  12. (25 minutes) Exercise 1: Combining Google News, Entity Extraction, and Linked Data in an RDF Store

    Entity extraction is a technology for extracting entities such as names, locations, dates, and industry-specific terminology from text. We’ll work with an RDF dataset produced by running a professional entity extractor over Google News. For this exercise, the scraper takes all of the main categories in Google News on a particular day, finds all of the subcategories, and then scrapes five articles within each subcategory, resulting in about 700 articles. We’ll open the RDF triple store and link all the people and places mentioned in the news articles to the public linked datasets DBpedia and Geonames. We’ll graphically visualize the results of queries on this dataset that are currently impossible with regular search engines.

  13. (25 minutes) Exercise 2: Scientific Business Intelligence, discovery in combined bioinformatics data sets

    We’ll work with an RDF dataset combining the publicly available DrugBank, DailyMed, SIDER, Diseasome, and ClinicalTrialDB datasets. We’ll create additional links between the texts in ClinicalTrialDB and specific drugs, diseases, targets, and side effects. With that we’ll do some very interesting discovery in ways that are currently impossible without semantic technology.

 

References

1. Wasserman, S. and Faust, K.: Social Network Analysis. 2006.

 

Bio on the Presenter

Jans Aasman

Jans Aasman, CEO of Franz Inc., started his career as an experimental and cognitive psychologist, earning his PhD in cognitive science with a detailed model of car-driver behavior built using Lisp and Soar. He has spent most of his professional life in telecommunications research, specializing in intelligent user interfaces and applied artificial intelligence projects. From 1995 to 2004 he was also a part-time professor in the Industrial Design department of the Technical University of Delft. Franz Inc. is the leading supplier of commercial, persistent, and scalable RDF database products, which provide the storage layer for powerful reasoning and ontology-modeling capabilities in Semantic Web applications. Dr. Aasman is a frequent conference speaker at events such as the Semantic Technologies Conference, the International Semantic Web Conference, JavaOne, Linked Data Planet, INSA, GeoWeb, ICSC, RuleML, and DEBS.

Copyright © 2014 Franz Inc., All Rights Reserved