Thursday, September 11, 2008

Virtuoso + Cypher = Dream Team, or Check Your Foundation!

I'm not an easy customer when it comes to technology selection. I've come across a lot of seemingly interesting frameworks, etc., that had a great vision but were just done wrong. Not having a piece is a lot better than having a piece that's done wrong.

The Semantic Web is one of those things that can easily be done wrong if you're not careful. A good example is how people were adding data to the semweb before the rise of Linked Data. At that time, people were just putting Turtle and N3 in .zip files and posting them to the web; we didn't know any better. Time and experience have led to better design philosophies. One of the very few companies I've encountered in this space that is doing the Semantic Web the right way is OpenLink. I was cynical when I was first exposed to some of their products, like the data browser and the query builder, because it's not easy to separate those who execute right from those who don't. Let me deviate for a second here and give you two examples, SIMILE and Sesame (these aren't negative reviews or anything, as both of these projects are profoundly innovative, just some shortcomings I've experienced which may be improved in future versions).

The SIMILE guys have some really powerful concepts oozing from their labs. Their Longwell/PiggyBank browser was the first true RDF browser, and it was the first time I saw anyone implement faceted browsing. When I heard that they had released Exhibit to take these core concepts and streamline the framework for quick and easy development, I immediately jumped on it and used it as a result set browser for Cypher (which at the time lacked a UI for browsing result sets). They had the concepts down, but the implementation turned out to be wrong in ways that, if not corrected, will make Exhibit unusable in any real Semantic Web app. The problems include:
  • JSON input: Exhibit takes JSON data and uses it to generate the browser. This is great, since lots of Web 2.0 apps speak JSON. But it's cumbersome when Semantic Web apps want to display the results of SPARQL queries (variable/value pairs called tuples). Since Sesame has a built-in JSON formatter for result sets, I only needed to extend it to suit my needs. Good, right? Well, in order to produce JSON output that had the properties Exhibit requires (i.e. label, id, etc.), plus the 'evidence' properties (i.e. the properties which cause a resource to be included in the result set), I had to generate a SPARQL query containing about a dozen variables, some of which had multiple values for each result. That's where you run into the permutation problem: every combination of values becomes its own row, which can lead to tens of thousands of rows representing only a few actual resources. As the response time progressively increased, I needed a solution quickly. Also, Exhibit didn't like the fact that I was sending URIs as property names, and had a problem displaying them in the interface. The JSON tweaking took about 24 precious hours to complete, for code that eventually was tossed anyway.
  • No sense of linked data: Because Exhibit wouldn't recognize the URIs I sent it as Web-accessible URLs, I was not able to simply click a link and get a new results browser.
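
To make the permutation problem from the first bullet concrete, here is a minimal Java sketch. It is not Cypher or Sesame code; the variable names and values are made up for illustration. It only counts how many rows a Select query would return for a single resource when several of the selected variables are multi-valued, assuming the patterns are independent so that every combination of values produces its own row.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: counts the rows a Select query would return for ONE
// resource when several selected variables have multiple values.
class PermutationSketch {

    // Maps a (hypothetical) variable name to the values bound for one resource.
    // Independent patterns combine as a cross product, so the counts multiply.
    static long rowsForResource(Map<String, List<String>> valuesPerVariable) {
        long rows = 1;
        for (List<String> values : valuesPerVariable.values()) {
            // Assumption: an empty variable is treated as optional and still
            // contributes a single (unbound) row instead of zeroing the result.
            rows *= Math.max(1, values.size());
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> person = new LinkedHashMap<>();
        person.put("label", List.of("Sherman Monroe"));
        person.put("knows", List.of("a", "b", "c", "d"));
        person.put("interest", List.of("x", "y", "z"));
        person.put("homepage", List.of("h1", "h2"));
        // 1 * 4 * 3 * 2 rows describe a single resource.
        System.out.println(rowsForResource(person)); // prints 24
    }
}
```

With a dozen variables instead of four, the same multiplication is how a handful of resources can balloon into tens of thousands of rows.
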
OpenLink Data Explorer
In an attempt to remedy the Exhibit problems, I asked Kingsley if there was a way we could solve the permutation issue with tuples, perhaps in the Virtuoso query processor itself. He then gave me a huge epiphany: SPARQL Construct queries. Construct was one of the areas of SPARQL that I understood but had probably never actually used, because I didn't really see the value in being able to make a graph from a query when you already had the tuples resulting from a Select query and could just construct your own graph from those results. But Kingsley pointed me to a page which took a SPARQL Construct query as an argument and then allowed me to browse the resulting graph. No change was needed to Cypher, except to replace the Select query with the Construct query (i.e. add a construct clause to the query whose paths are basically the select paths), and BOOM!, I was browsing my Cypher result set!! The entire integration took about 30 mins to implement and test, and it took that long only because I made the change in Cyparkler and, encouraged by this, decided to go ahead and enhance it to support Describe and Ask as well. The ease of point-and-play integration is what Web 3.0 is all about.
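
As a rough illustration of the Select-to-Construct swap described above, the sketch below builds a Construct query whose template simply repeats the triple patterns from the where clause. This is not the actual Cypher or Cyparkler code. The class and method names are hypothetical, and it assumes the where clause consists of plain triple patterns only (a Construct template cannot contain things like FILTER or OPTIONAL).

```java
import java.util.List;

// Hypothetical helper: turns the paths of a query into a SPARQL Construct query.
class ConstructSketch {

    // prologue: PREFIX declarations; triplePatterns: plain triple patterns
    // without a trailing dot, e.g. "?person foaf:knows ?friend".
    static String toConstruct(String prologue, List<String> triplePatterns) {
        // The same patterns are used as both the graph template and the where clause.
        String body = String.join(" .\n  ", triplePatterns);
        return prologue + "\n"
                + "CONSTRUCT {\n  " + body + "\n}\n"
                + "WHERE {\n  " + body + "\n}";
    }

    public static void main(String[] args) {
        String query = toConstruct(
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
                List.of("?person foaf:knows ?friend", "?friend foaf:name ?name"));
        System.out.println(query);
    }
}
```

Because the result is a graph rather than a table of tuples, a resource with many values for one property shows up as one node with many edges, not as one row per combination.
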

The Semantic Web community has not yet agreed on a standard analogous to JDBC or ODBC, i.e. a standard for making transactions against and managing triple/quad stores in a platform-agnostic way, which makes integration between frameworks difficult. So your framework winds up getting "wedded" to one platform or another. Cypher and Sesame have been married now for about 7 years. Sesame is not only a triple store but, more importantly, a framework for making transactions with and manipulating triple stores in Java. Jena is another. Sesame's Storage and Inference Layer (SAIL) API allows for plugging in other types of back-end triple stores.

Problem: Virtuoso is a great product, but was notorious for requiring you to write straight SQL or SPARQL, use stored procedures, or use the proprietary VSP language to interact with it. I.e., there was no API or API implementation available to help bootstrap my Virtuoso integration.

Solution: We simply wrote a Sesame Repository implementation which 1) connects to an instance of Virtuoso via JDBC, and 2) connects to any SPARQLEndpoint via the SPARQL protocol. Now I get all the benefits of Virtuoso while leveraging all the framework code I have for Sesame.
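
For a feel of what talking to a SPARQL endpoint involves at the wire level, here is a small Java sketch. It is not the Repository implementation mentioned above, and it uses no Sesame or Virtuoso classes. It only shows how a query can be sent as an HTTP GET by URL-encoding it into a `query` parameter, as the SPARQL protocol specifies. The class name and the endpoint URL are placeholders.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the GET request URI for a SPARQL protocol query.
class SparqlRequestSketch {

    static URI queryUri(String endpoint, String sparql) {
        // Form-encode the query text (spaces become '+', braces are percent-escaped).
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        // Append to an existing query string if the endpoint URL already has one.
        String separator = endpoint.contains("?") ? "&" : "?";
        return URI.create(endpoint + separator + "query=" + encoded);
    }

    public static void main(String[] args) {
        // http://example.org/sparql is a placeholder, not a real endpoint.
        URI uri = queryUri("http://example.org/sparql", "ASK { ?s ?p ?o }");
        System.out.println(uri);
        // prints http://example.org/sparql?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
    }
}
```

A repository wrapper would issue this request with an HTTP client and parse the response. The point here is only that any endpoint speaking the protocol can be reached the same way, which is what makes one implementation work against many data sources.
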

Result: Once upon a time, Cypher could only store data to and query Sesame HTTPRepositories out of the box. The problem is that Cypher is greedy about data: the more repositories it can connect to, the better its performance. The number of publicly available Sesame HTTPRepositories serving data live on the web can be counted on one hand, so the demos were... well, boring. But now Cypher can attach to any one of the growing number of SPARQLEndpoints or Virtuoso databases coming online, right out of the box, by configuring a connection in the startup properties file. This was a huge development because, overnight, I went from asking things like who knows Sherman Monroe (from my FOAF flat file) and Sherman's address (from my FOAF flat file) to fun stuff like the presidents who were influenced by people who played in films (from DBpedia).

