This is a question I have heard a couple of times.
( warning: this is a dense technical blog post. It was an email but I decided to post it here. Sorry in advance ).
Well, there are several ways of doing this. But, conceptually, you need to understand that there are three distinct steps in the process, each of which can be accomplished in several ways. ( Note: I use DBPedia as a "toy" example, but I use the same approach to create BI and EDI workflows for real-life apps ).
The 3 steps are: Linking, Importing and Querying.
== Step 1 ==
Linking ( or aligning ) your data with DBPedia's
This is generally about mapping your URIs to those of "equivalent" concepts in the DBPedia namespace. The result of this process is usually a set of owl:sameAs triples, but it could be another sort of alignment ( ABox or TBox ).
For large datasets, tools like Google Refine ( now known as OpenRefine ) can give you a hand with the mapping.
There are other approaches as well.
Finally, for simple literal ID alignments ( like EAN-UCC, SKU, ISBNs, etc ), a simple SPARQL query using string comparison heuristics will work just fine.
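For the literal ID case, the same string-comparison idea can also be sketched outside the store. The Python below is only an illustration: the function names, the normalization rule ( drop everything that is not a letter or digit, then upper-case ) and the example.org URIs are assumptions made for the example, not part of any tool mentioned here. It takes two URI-to-ID mappings and emits owl:sameAs links as N-Triples lines.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize_id(raw):
    """Drop every character that is not a letter or digit, then upper-case."""
    return re.sub(r"[^0-9A-Za-z]", "", raw).upper()


def align_by_id(local_ids, remote_ids):
    """Return owl:sameAs N-Triples lines for resources sharing a normalized ID.

    Both arguments map a resource URI to its literal ID ( ISBN, SKU, etc ).
    """
    # Index the remote side by normalized ID so each local ID is one lookup.
    remote_index = {}
    for uri, raw in remote_ids.items():
        remote_index.setdefault(normalize_id(raw), []).append(uri)

    triples = []
    for local_uri, raw in sorted(local_ids.items()):
        for remote_uri in sorted(remote_index.get(normalize_id(raw), [])):
            triples.append(f"<{local_uri}> <{OWL_SAME_AS}> <{remote_uri}> .")
    return triples


if __name__ == "__main__":
    # Hypothetical data: one local book and two remote candidates.
    local = {"http://example.org/book/1": "978-0-316-76948-8"}
    remote = {
        "http://example.org/remote/A": "9780316769488",
        "http://example.org/remote/B": "9780000000000",
    }
    for line in align_by_id(local, remote):
        print(line)
```

Exact matching after normalization is deliberately conservative. Fuzzier heuristics will find more links but also more false positives, so review the output before loading it.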
Note to self: Keep an eye on Geospatial Alignment tools.
== Step 2 ==
Importing the data you need from DBPedia
Once you have created one or more owl:sameAs relationships or some other kind of alignment data, you will most certainly want to exploit the result by issuing queries that consider the sum of both datasets. Again, there are several options here and the final strategy will depend on your queries, desired response times, etc. The main question is whether you will load some fraction of DBPedia into your system and, if so, how much of it.
Let me walk you through an example.
If you just want to enrich certain entities ( for example, stealing the labels for cities or for music bands ), then the cheapest and easiest way is to insert only those triples.
One common way to do this in Virtuoso is to take advantage of the built-in sponger ( which is a fancy name for a Linked Data adapter ).
Here's one technique that works pretty well and can be reused in many scenarios.
The following is Virtuoso SPASQL, but it could be written as pure SPARQL over HTTP as well.
sparql clear graph <XXX>;
sparql define get:soft "soft" select * from <XXX> where { ?s ?p ?o } ;
Where XXX is the URI of a SPARQL construct query.
Say what?
OK. Let's slow down a bit. Suppose you want to retrieve all the labels and descriptions available for Paul McCartney. You can play around with SPARQL in DBPedia and you would probably come up with something like the following:
prefix res: <http://dbpedia.org/resource/>
prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?label ?abstract where { res:Paul_McCartney dbpedia-owl:abstract ?abstract; rdfs:label ?label }
( Go see an HTML representation of this query's results here )
Nice. That's the data you want to add to your app. But how do you store query results into your local Quad Store?
You don't. What you want to store is not the "tabular" select query results, but the data itself. The Triples. Or subgraph if you wish.
No problem, SPARQL construct to the rescue.
prefix res: <http://dbpedia.org/resource/>
prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct
{ res:Paul_McCartney rdfs:label ?label; dbpedia-owl:abstract ?abstract }
where
{ res:Paul_McCartney dbpedia-owl:abstract ?abstract; rdfs:label ?label }
Now, if you run the above query in the default SPARQL endpoint UI for DBPedia ( http://dbpedia.org/sparql ) you will get back an N3/TTL file containing the raw triples for you to insert.
Click here to try it ( Note: your browser will most likely ask you to download a file. Accept and then take a look inside... Yes! triples! ).
The uncompressed URI for the file is pretty long, but it contains your query and some other parameters, effectively exposing a REST API for SPARQL execution ( this transparent HTTP magic is, in fact, an integral part of SPARQL ).
FYI, the URI looks like this ( I removed the prefixes to save some space ):
http://dbpedia.org/sparql?query={{prefixes-go-here}}construct+%0D%0A%7B+res%3APaul_McCartney+rdfs%3Alabel+%3Flabel%3B+dbpedia-owl%3Aabstract+%3Fabstract+%7D+%0D%0Awhere%0D%0A%7B+res%3APaul_McCartney+dbpedia-owl%3Aabstract+%3Fabstract%3B+rdfs%3Alabel+%3Flabel++%7D&format=text%2Frdf%2Bn3
Notice the "format=text/rdf+n3" parameter at the end.
So, remember that XXX placeholder above? This is the URI you should use there. Virtuoso will then perform a GET request, download the file, figure out it is a valid RDF serialization and insert it.
Of course, manually composing such a query for each and every one of your aligned resources is a PITA. So I would suggest building a simple script or stored procedure that does the work for you.
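As a starting point, here is a minimal sketch of such a script in Python. It builds the long construct URI for a given resource and wraps it in the two SPASQL statements shown earlier ( the XXX placeholder ). The function names and the template are assumptions made for this example. It also writes the resource as a full URI instead of using the res: prefix, and it only generates text: running the statements against Virtuoso is left to you.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

# Same construct query as above, with the resource URI as a parameter.
CONSTRUCT_TEMPLATE = """prefix dbpedia-owl: <http://dbpedia.org/ontology/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct {{ <{uri}> rdfs:label ?label; dbpedia-owl:abstract ?abstract }}
where {{ <{uri}> dbpedia-owl:abstract ?abstract; rdfs:label ?label }}"""


def construct_url(resource_uri, endpoint=DBPEDIA_ENDPOINT):
    """Return the endpoint URL that runs the construct query and asks for N3."""
    query = CONSTRUCT_TEMPLATE.format(uri=resource_uri)
    # urlencode percent-encodes the query, so the URL contains no < or >.
    return endpoint + "?" + urlencode({"query": query, "format": "text/rdf+n3"})


def spasql_import(resource_uri):
    """Return the two SPASQL statements that (re)load the resource's subgraph."""
    graph = construct_url(resource_uri)
    return [
        f"sparql clear graph <{graph}>;",
        f'sparql define get:soft "soft" select * from <{graph}> where {{ ?s ?p ?o }} ;',
    ]


if __name__ == "__main__":
    for statement in spasql_import("http://dbpedia.org/resource/Paul_McCartney"):
        print(statement)
```

Loop over the DBPedia URIs found in your owl:sameAs links and you get one pair of statements per aligned resource.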
OK. I hope you are not too dizzy by now. This example may seem like overkill at first, but if you think about it for a while, we are actually doing something very simple yet powerful here. We are taking a subgraph of a remote RDF dataset and "importing" it into your local environment. This is unique to Linked Data due to its use of URIs ( no collisions ), a triple-based KRF ( no need to create tables, just add data ) and, finally, its transparent use of the HTTP protocol.
Hopefully you can build on this to come up with more complex workflows. I have built really crazy things using these simple pieces. Of course, this is not the only way. In fact, there are thousands of combinations and tools at your disposal. Some ideas include:
- Downloading TTL files using wget, scripts, etc and loading using stored procedures ( faster in some scenarios )
- Syncing to/from a remote graph using Virtuoso's RDF Graph Replication feature ( this is very robust and efficient as it is based on time-tested and industry-standard SQL replication functionality and uses an optimized "changeset-based" messaging protocol )
- Downloading a complete DBPedia dump and loading it into your machine. This can be used during the alignment process, for example, to run demanding queries, etc.
- Using Federated SPARQL when it becomes available ( basically, forget about importing, just use multiple endpoints and let the SPARQL engine do the guessing ).
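To give the federated option a concrete shape: the SPARQL 1.1 Federated Query specification defines a SERVICE keyword that sends part of a query to a remote endpoint. The sketch below only builds such a query as a string. The function name is made up for the example, and whether your engine supports SERVICE, and how well it performs, is something you will have to check yourself.

```python
def federated_label_query(remote_endpoint="http://dbpedia.org/sparql"):
    """Build a query that follows local owl:sameAs links and fetches labels remotely."""
    return f"""prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?local ?label where {{
  ?local owl:sameAs ?remote .
  service <{remote_endpoint}> {{ ?remote rdfs:label ?label }}
}}"""


if __name__ == "__main__":
    print(federated_label_query())
```

The owl:sameAs pattern is matched locally, and only the label lookup is delegated to the remote endpoint.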
== Step 3 ==
Querying, of course ;)
I almost forgot.
If the result of your alignment was a set of owl:sameAs links, you should remember to turn on Virtuoso's owl:sameAs inference:
sparql define input:same-as "yes" select * where ...
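If you generate your queries from a script, a tiny helper can add the pragma for you. This is just a sketch: the helper name and the example.org resource are hypothetical, and it assumes the query is sent as a SPASQL statement like the one above.

```python
SAME_AS_PRAGMA = 'define input:same-as "yes"'


def with_same_as(query):
    """Wrap a SPARQL query in a SPASQL statement with owl:sameAs inference on."""
    body = query.strip().rstrip(";").strip()
    return f"sparql {SAME_AS_PRAGMA} {body} ;"


if __name__ == "__main__":
    # Hypothetical local resource that has an owl:sameAs link to DBPedia.
    query = (
        "select ?label where "
        "{ <http://example.org/artist/42> "
        "<http://www.w3.org/2000/01/rdf-schema#label> ?label }"
    )
    print(with_same_as(query))
```

With the pragma in place, asking for the labels of your own URI should also return the labels you imported for its DBPedia twin.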