Monrai Blog

News about Cypher, Semantic Web, Natural Language Processing, and Computational Linguistics

Saturday, September 20, 2008

How to Insert Web Links Using Ubiquity and the Summon Command

I recently complained about the tedious labor involved in linking to stuff, such as the links within this article. So I created a Ubiquity Firefox (http://tinyurl.com/5m9lhb) command that allows you to get the link representing any text and insert it into a web page. Install the command here. For example, if I want to point to the show Firefly (http://tinyurl.com/6g3zfj), it takes only one step:

1. Call Ubiquity Firefox (http://tinyurl.com/5m9lhb) using the ctrl+space shortcut (ctrl+alt on a Mac), then type summon followed by your text, e.g. summon Firefly.

And that's it. When you press enter, the link for your text is automatically inserted into the page you're viewing. You get a list of about 10 choices, and choice #1 is inserted by default. If you want to choose another from the list, just type pick number after your input: e.g. summon bill clinton pick 3 will insert the 3rd link from the list of possible links to pages that represent Bill Clinton. The first item in the list is always the page returned by Google's I'm Feeling Lucky (http://tinyurl.com/6qs8fq) search.

This command is great for quickly inserting links into email (http://tinyurl.com/692yha), blogs (http://tinyurl.com/56u7ys), blog comments, etc. Optionally, you can just select some text in the page, invoke Ubiquity, and type summon this to replace that text with the link.
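
For the curious, here is roughly the shape such a command takes in Ubiquity's JavaScript API. This is a minimal sketch, not the actual command source: its lookup just builds Google's I'm Feeling Lucky redirect URL, whereas the real command pulls its choice list from several services (described below).

  CmdUtils.CreateCommand({
    name: "summon",
    takes: { "phrase": noun_arb_text },
    description: "Inserts a link to the page the phrase describes.",
    preview: function(pblock, input) {
      pblock.innerHTML = "Inserts a link for: " + input.text;
    },
    execute: function(input) {
      // Illustrative lookup: Google's I'm Feeling Lucky redirect for the
      // top page matching the phrase (the real command offers ~10 choices).
      var url = "http://www.google.com/search?btnI&q=" +
                encodeURIComponent(input.text);
      // Replace the selection (or insertion point) with the anchor.
      CmdUtils.setSelection("<a href=\"" + url + "\">" + input.text + "</a>");
    }
  });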

Under the hood it uses Ubiquity Firefox, Sindice (http://tinyurl.com/4fabtc) to do the URI lookup, the OpenLink (http://tinyurl.com/4dp9s2) Data Explorer to view the URI description, Google (http://tinyurl.com/3gsc6g) for I'm Feeling Lucky results, and delicious (http://tinyurl.com/4jguko) for bookmark search.

Nice to haves:
- search your personal dataspaces for links (delicious, etc)
- conditionally insert only the title link (for rich-text editors like Gmail) or the title plus a parenthetical URL (for non-rich-text editors like Twitter)

Update: here is another insert-link command; pretty cool stuff.


Thursday, September 11, 2008

Semantic Web Value Proposition

Dan Grigorovici has been blogging a lot lately about nailing down the Web 3.0 value proposition. I think that we as Semantic Web evangelists must also be good salesmen, and therefore need a good pitch. I think two of the big value propositions for Web 3.0 are: Automate Tedious Tasks and Serendipity/Knowledge.

Automation
The automation value prop came to mind as I experienced this real-world use case recently:

My nephew needed a video of Julius Caesar for his class, so I thought, cool, let's rent it. We then did the following:
  1. Collected the numbers of video stores in the city (via a hard copy of the Yellow Pages)
  2. Called each video store to check availability
  3. Held the line as the clerk manually searched their inventory
  4. Mapquested each store to determine the closest
My grandma's personal database automates this tedious process. It allows you to summon stores near me that have movies about Julius Caesar and get a result set of all such stores, plus the "proof" of why each store is in the results, i.e. the properties involved in the query and their values, including location, the movie, the movie's description, and any other property that caused it to appear in the results.
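
To make that concrete, here is the kind of query such a phrase might compile down to, written as a SPARQL string the way a summon-style command would carry it. Everything here is an assumption for illustration: the ex: vocabulary, the store data, and the idea of a personal endpoint are all made up.

  // Purely illustrative: the sort of SPARQL that "stores near me that
  // have movies about Julius Caesar" might compile to. The ex: vocabulary
  // and the personal endpoint are hypothetical.
  var query =
    "SELECT ?store ?location ?movie ?description WHERE { " +
    "  ?store a ex:VideoStore ; ex:location ?location ; ex:stocks ?movie . " +
    "  ?movie ex:about dbpedia:Julius_Caesar ; ex:description ?description . " +
    "}";
  // Every binding returned alongside ?store doubles as the "proof" of why
  // that store made the result set.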

Serendipity
The serendipity value prop is also important to articulate. I hear the terms data, information, and knowledge listed a lot, accompanied by someone's definition, so I will offer mine to help frame this point:
  • Data = a set of things (e.g. a list of shoes, a table of dates)
  • Information = statements about data (e.g. the price of this shoe is $60, your appointment is on this date)
  • Knowledge = statements about information (e.g. your appointment happens to be on the 3rd anniversary of the day you purchased this shoe)
Thus, the ability to have knowledge represented in the Semantic Web leads to the idea of serendipity, which (if I may borrow a term from information retrieval) helps us to increase the recall in our lives, recovering the missed opportunities and overlooked connections that can improve our quality of life and make us much more productive.

Serendipity is a concept I hear tossed around, so I will offer a concrete example of what it means. I was running some test input through Cypher, and summoned the authors who starred in films. I was surprised to see Adolf Hitler at the top of the list (the list was alphabetized). I naturally assumed this had to be a bug in the software, so I looked at the full results page, which included the proof. And there, I found that Adolf Hitler indeed was in several German propaganda films. That notion of learning that something I once thought unlikely was actually quite likely, that's serendipity.


Virtuoso + Cypher = Dream Team, or Check Your Foundation!

I'm not an easy customer when it comes to technology selection. I've come across a lot of seemingly interesting frameworks that had a great vision but were just done wrong, and not having a piece is a lot better than having a piece that's done wrong.

The Semantic Web is one of those things that can easily be done wrong if you're not careful. A good example is how people were adding data to the semweb before the rise of Linked Data. At that time, people were just putting Turtle and N3 in .zip files and posting them to the web; we didn't know any better. Time and experience have led to better design philosophies. And one of the very few companies I've encountered in this space that is doing the Semantic Web the right way is OpenLink. I was cynical when I was first exposed to some of their products, like the data browser and the query builder, because it's not easy to separate those who execute right from those who don't. Let me digress a second here and give you two examples, SIMILE and Sesame (these aren't negative reviews or anything, as both of these projects are profoundly innovative, just some shortcomings I've experienced which may be improved in future versions).

SIMILE
These guys have some really powerful concepts oozing from their labs; their Longwell/PiggyBank browser was the first true RDF browser, and it was the first time I saw anyone implement faceted browsing. When I heard that they released Exhibit to take these core concepts and streamline the framework for quick and easy development, I immediately jumped on it and used it as a result set browser for Cypher (which at the time lacked a UI for browsing result sets). They had the concepts down, but the implementation turned out to be wrong and, if not corrected, will make Exhibit unusable in any real Semantic Web app. The issues include:
  • JSON input: Exhibit takes JSON data and uses it to generate the browser. This is great, since lots of Web 2.0 apps speak in JSON. But it's cumbersome when Semantic Web apps want to display results of SPARQL queries (variable/value pairs called tuples). And since Sesame has a built-in JSON formatter for result sets, I only needed to extend it to suit my needs. Good, right? Well, in order to produce JSON output that had the required properties for Exhibit (i.e. label, id, etc), plus the 'evidence' properties (i.e. the properties which cause a resource to be included in the result set), I had to generate a SPARQL query containing about a dozen variables, some of which had multiple values for each result, in which case you encounter the permutation problem, which can lead to tens of thousands of rows representing only a few actual resources (see the sketch after this list). As the response time progressively increased, I needed a solution quickly. Also, Exhibit didn't like the fact that I was sending URIs as property names, and had a problem displaying them in the interface. The JSON tweaking took about 24 precious hours to complete, code that eventually was tossed anyway.
  • No sense of linked data: Because Exhibit wouldn't recognize the URIs I sent it as Web-accessible URLs, I was not able to simply click a link and get a new results browser.
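
To illustrate the permutation problem concretely: a Select query over several multi-valued variables returns the cross product of their values, so one resource fans out into many rows, while Exhibit wants one item per resource. A sketch (the film data and property names are made up; the item shape follows Exhibit's items/id/label convention):

  // One film with 2 actors and 2 genres comes back as 2 x 2 = 4 SELECT rows:
  var rows = [
    //  ?film                 ?actor           ?genre
    [ "dbpedia:Metropolis", "Fritz Rasp",    "Drama"  ],
    [ "dbpedia:Metropolis", "Fritz Rasp",    "Sci-Fi" ],
    [ "dbpedia:Metropolis", "Brigitte Helm", "Drama"  ],
    [ "dbpedia:Metropolis", "Brigitte Helm", "Sci-Fi" ]
  ];
  // ...which the formatter must fold back into a single Exhibit item:
  var exhibitData = {
    items: [
      { id: "dbpedia:Metropolis",
        label: "Metropolis",
        actor: [ "Fritz Rasp", "Brigitte Helm" ],
        genre: [ "Drama", "Sci-Fi" ] }
    ]
  };
  // With a dozen variables, a handful of resources can easily explode into
  // tens of thousands of rows before the folding step.
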
OpenLink Data Explorer
In order to attempt to remedy the Exhibit problems, I asked Kingsley if there was a way we could solve the permutation issue with tuples, perhaps in the Virtuoso query processor itself. He then gave me a huge epiphany: SPARQL Construct queries. Construct was one of the areas of SPARQL that I understood but had probably never actually used, because I didn't really see the value in being able to make a graph from a query: you already had the tuples resulting from a Select query, and could just construct your own graph from those results. But Kingsley pointed me to a page which took as an argument a SPARQL Construct query, then allowed me to browse the resulting graph. No change was needed to Cypher, except to replace the Select query with the Construct query (i.e. add a construct clause to the query whose paths are basically the select paths), and BOOM!, I was browsing my Cypher result set!! The entire integration took about 30 minutes to implement and test, and only that long because I made the change in Cyparkler, where (encouraged by this) I decided to go ahead and add support for Describe and Ask as well. The ease of point-and-play integration is what Web 3.0 is all about.
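
For concreteness, here is the shape of that Select-to-Construct rewrite, with the queries as plain strings. The query itself is illustrative, not Cypher's actual output:

  // A Select query returns tuple rows, which permute over multi-valued
  // variables:
  var select =
    "PREFIX dbo: <http://dbpedia.org/ontology/> " +
    "SELECT ?film ?director WHERE { ?film dbo:director ?director }";

  // The same graph pattern as a Construct query returns a graph instead,
  // so each resource stays a single node no matter how many values it has:
  var construct =
    "PREFIX dbo: <http://dbpedia.org/ontology/> " +
    "CONSTRUCT { ?film dbo:director ?director } " +
    "WHERE { ?film dbo:director ?director }";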

Sesame
The Semantic Web community has not yet agreed on a standard analogous to JDBC or ODBC, i.e. a standard for making transactions against and managing triple/quad stores in a platform-agnostic way, and that makes integration between frameworks difficult. So your framework winds up getting "wedded" to one platform or the other. Cypher and Sesame have been married now for about 7 years. Sesame is not only a triple store; more importantly, it is a framework for making transactions with and manipulating triple stores in Java. Jena is another. Sesame's API, called the Storage and Inference Layer (SAIL), allows for plugging in other types of back-end triple stores.

Problem: Virtuoso is a great product, but was notorious for requiring you to either write straight SQL or SPARQL, use stored procedures, or use the proprietary VSP language to interact with it. I.e., there was no API or API implementation available to help bootstrap my Virtuoso integration.

Solution: We simply wrote a Sesame Repository implementation which: 1) connects to an instance of Virtuoso via JDBC, and 2) connects to any SPARQL endpoint via the SPARQL protocol. Now I get all the benefits of Virtuoso while leveraging all the framework code I have for Sesame.
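
The wrapper itself is Java, but the second path rests on nothing more exotic than the SPARQL protocol: an HTTP GET against the endpoint with the query as a parameter. A browser-side sketch (the endpoint and query are examples; the format parameter is a Virtuoso convenience, other endpoints may want an Accept header instead):

  var endpoint = "http://dbpedia.org/sparql";
  var query = "SELECT ?p ?o " +
              "WHERE { <http://dbpedia.org/resource/Julius_Caesar> ?p ?o } " +
              "LIMIT 10";
  // Synchronous for brevity; a real client would go async.
  var req = new XMLHttpRequest();
  req.open("GET", endpoint +
    "?query=" + encodeURIComponent(query) +
    "&format=" + encodeURIComponent("application/sparql-results+json"), false);
  req.send(null);
  // The response is standard SPARQL results JSON: one variable/value
  // binding object per row.
  var bindings = JSON.parse(req.responseText).results.bindings;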

Result: Once upon a time, Cypher could only store data to and query Sesame HTTPRepositories out of the box, but the problem is that Cypher is greedy about data: the more repositories it can connect to, the better its performance. The number of publicly available Sesame HTTPRepositories serving data live on the web can be counted on one hand, so the demos were... well, boring. But now Cypher can attach to any one of the growing number of SPARQL endpoints or Virtuoso databases coming online, right out of the box, by configuring a connection in the startup properties file. This was a huge development, because overnight I went from asking things like who knows Sherman Monroe (from my FOAF flat file) and Sherman's address (from my FOAF flat file) to fun stuff like the presidents who were influenced by people who played in films (from dbpedia).


Semantic Web + Cypher + Ubiquity = My Grandma's Personal Database

The idea for Cypher came to me a few years ago. The vision was vivid and complete: replace my mouse and keyboard with a headset (mic and speakers), and allow me to talk to my computer to issue commands and summon "things". I remember drawing an interface for a web browser that had no buttons and no menu bars, just the content of the web page. Saying a link would click it. I called this embodiment "Lewy", why I know not, but it later became 'LUI', for Language Understanding Interface, or Linguistic User Interface. I was young and naive, and abruptly took a sabbatical from college without any idea of what was required to make this real, but one thing I was certain of... if I could imagine it, then it's completely possible. The resulting technology and its industry have since grown by leaps and bounds, and when I was turned on to Ubiquity, I saw the final piece of this vision beginning to be set into place. So let me talk about the first two pieces a little.

A Human Language Processor
The first requirement for LUI is a Human Language Processor. In my initial research, a great book called Symbolic Species made it clear to me that there are no short-cuts in NLP: if it's a "simple NL processor" then it's not really an NL processor, because by definition, Natural Language is highly complex. This basically meant that I would need to figure out what processes are taking place in the brain while you're reading the New York Times. The task of NLP is a task in cryptology, thus the name Cypher. After 8 years, we finally have a framework for processing sentences like humans do. This is a 'cry wolf' type of statement, because of the many past promises and ensuing failed attempts of people, companies, and institutions of learning in this space. That's why I don't blog so much; instead, I'd rather spend that time setting up demos and releasing code, then let the work speak for itself, in every sense of that term :) (ok, ok, I'll stop :) So that part is done.

A Universal Database
The Jetsons was a huge influence on me as a child. One of George's friends was an AI called RUDI (Referential Universal Data Index). RUDI seemed to know everything, the entire body of all human knowledge. The WWW is the closest embodiment of RUDI we have today, with Google being the main interface. Cypher (the first piece) is dependent on a subset of human knowledge, called linguistic knowledge (i.e. a RUDI for language). The types of questions Cypher would pose to this database are: what is the structure of a noun phrase? Does the verb 'marry' take a direct object? A preposition? How does one make the word 'ox' plural? ...all those language conventions you learned in elementary school. The WWW contains this data, but there is a problem: all the data is in human-readable form, and would require an AI to extract it, which puts Cypher in a catch-22. The solution: put all the required linguistic knowledge in a structured database. Done.

The next problem is critical mass. Most people don't realize this, but the amount of data a 3-year-old child has about language is astronomical!! It's nothing short of a miracle that children are able to acquire language. The number of rules for combinations, phrase grammar, lexical restrictions, etc. is innumerable, and a certain critical mass is required for Cypher to work on unrestricted text. Since we don't already have an AI to populate this linguistic database automatically, we will need someone to do it manually. Wikipedia has shown that a 'crowd-sourcing' approach is viable for this kind of task, and the Semantic Web provides a way to facilitate crowd-sourcing on a very large scale. The Semantic Web has been built, and the MetaLanguage Ontology (MLO) is now in its first official release. So that part is done.
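
To give a flavor of what 'linguistic knowledge in a structured database' means, here is the kind of lookup Cypher might issue against it, written as SPARQL strings. The mlo: and lex: terms below are invented for illustration; see the MLO release for the real vocabulary.

  // Hypothetical lookups against a linguistic knowledge base (property
  // names invented for illustration):
  var askTransitive =
    "ASK { lex:marry mlo:subcategorizes mlo:DirectObject }";
  var pluralOfOx =
    "SELECT ?plural WHERE { lex:ox mlo:pluralForm ?plural }";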

Vision Without Action is Dead
So now my computer can accept my plain language phrase, ask the Semantic Web for a strategy for processing it, submit a 'coded' version of the post-processed input to the Semantic Web, then get a response (either a set of statements in Semantic Web language, or a set of results using Semantic Web URLs). That alone is really fun, and even in playing with the demo, I was able to find some very interesting facts. But remember, the vision has two parts: summon things (i.e. the nouns... done), and execute my commands automatically (i.e. the verbs), which takes us back to the beginning of this article... enter Ubiquity.

Let's start with a practical example. At the beginning of this section, I wanted to reference Terrence Deacon's book. To do so, I had to:
  1. open a new tab, google "Symbolic Species"
  2. click the link (because Google's result set links are googlified)
  3. copy the URL from address bar
  4. take a breath
  5. nav back to this tab
  6. select the text for the link
  7. paste the URL
  8. repeat for all other links in article
... and this is 2008! The vision is to be able to summon a resource by saying Symbolic Species or Terrence Deacon's book and have my computer return the Amazon link (or whatever the net-citizens agree is the URL that represents that resource). This idea of 'summon a resource by description' is the piece of the vision that Ubiquity addressed. I wrote a prototype (just follow the Ubiquity instructions for installing it) which takes a natural language phrase (e.g. Terrence Deacon's book), and returns a table containing the list of 'answers'. It's only a "sound check" prototype; it only queries dbpedia for now. The vision for the plugin is to be able to select some text in a page, call a ubiq command like get this, or alternatively type get Terrence Deacon's book, then have Ubiquity insert the link into the page or editor. The plugin will allow you to do this for anything that can be described, so summoning from your personal dataspace + the global dataspace, things like: my sister's boyfriend's alma mater or my grandmother's birthday or my car's gas mileage, and in response it inserts the link or text representing that thing.
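
Here is roughly the shape such a get command could take, again as a minimal sketch rather than the prototype's actual source. The resolver URL and its response format are hypothetical stand-ins; the real prototype queries dbpedia.

  var lastCandidates = [];

  CmdUtils.CreateCommand({
    name: "get",
    takes: { "description": noun_arb_text },
    preview: function(pblock, input) {
      // Hypothetical resolver mapping a phrase to candidate URIs.
      CmdUtils.previewGet(pblock, "http://lookup.example.org/resolve",
        { phrase: input.text },
        function(candidates) {
          lastCandidates = candidates;
          pblock.innerHTML =
            "<ol><li>" + candidates.join("</li><li>") + "</li></ol>";
        }, "json");
    },
    execute: function(input) {
      // Insert choice #1 as a link in place of the typed/selected text.
      if (lastCandidates.length > 0)
        CmdUtils.setSelection(
          "<a href=\"" + lastCandidates[0] + "\">" + input.text + "</a>");
    }
  });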

So that part... in progress....


Cypher 1.9 now ready

The long-awaited Cypher 1.9 release is now ready for deployment, and is running live in the wild. The release will be deployed over the next couple of days; we thank everyone for their patience with this release, and with the many push-backs due to a huge overhaul of the framework.

I will be blogging about some of the new features that you can expect from this release, how this release is related to other interesting Semantic Web projects, and what is on the horizon for Cypher. Brb!!
