Monrai Blog

News about Cypher, Semantic Web, Natural Language Processing, and Computational Linguistics

Saturday, September 02, 2006

Centralized Approach Revisited

Here is a comment I posted to Nova Spivack's blog concerning Radar Networks' recent announcement about the scalability of their semantic database indexing technology:
For those of you who don't know, part of our system is a homegrown distributed grid server architecture for massive-scale semantic search. It's not the end-product, but it's something we need for our product. It's kind of our equivalent of Google's backend -- only semantically aware. Like Google, our distributed server architecture is designed to scale efficiently to large numbers of nodes and huge query loads. What's hard, and what's new about what we have done, is that we've accomplished this for much more complex data than the simple flat files that Google indexes.
I've reposted my comments here for archive purposes:

A few weeks ago, I blogged about how little confidence I had in centralized approaches to building semantic web databases. Giovanni Tummarello (dbin.org) wrote a great paper on the subject, and let me tell you, it's a challenging undertaking. The main challenge facing any centralized approach is what's known as the computational burden problem:
"On the WWW, the interaction is based on HTTP requests/replies that in the great majority of the cases will be of limited impact on the server (e.g serving a file). This means that, disregarding anomalous cases, both the computational resources and network traffic required by a HTTP request are bounded. On the contrary, “requests” on the semantic web are naturally expressed in query languages and, given the graph nature of RDF structured information, the complexity of execution is not bounded a priori as it is a function of the query type as well as the quantity and the structure of the data. In other words, whoever would decide to offer the ability to answer “arbitrary questions” on a SW, would easily open himself to “denial of service” situations even in the ideal, good faith usage."
Creating a centralized database that solves the computational burden problem is one of the holy grails of the semantic web. My hat goes off to you and your team for tackling and solving this problem. I had always predicted that P2P networks were the only feasible solution. Giovanni's approach is to periodically synchronize each peer's database, but only within small peer groups; once the data has been downloaded, the query is run against the local database, limiting the "damage" to the user's own resources. The obvious drawback is that no single peer has 100% visibility across the entire distributed database. So if the answer to a particular SPARQL query happens to be spread across triples on separate peers, and I haven't synced with each of those peers or I'm not in their groups, then I'm just up the creek.

The ideal repository would be centralized and would accept SPARQL with the speed and scalability of Google, which (correct me if I'm wrong) sounds to me like what you guys have achieved. Again, my jaw has dropped. For example, this will have serious ramifications for my work with Cypher, as my major Achilles' heel is the lack of a centralized repository of shared lexical descriptions (in RDF) collected from across the semantic web. If your service/framework could crawl, collect, and most importantly "cook" RDF lexical descriptions (that last item is what's lacking in current services like Swoogle), and if it can serve Cypher results for arbitrary SPARQL queries over the metadata of lexical entries, then you've just sped up natural language processing for the Semantic Web by about five years!
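To give a rough sense of what I mean by querying the metadata of lexical entries, the repository would need to answer something along these lines (the lex: vocabulary below is hypothetical, purely to sketch the idea):

    # Hypothetical lexicon vocabulary -- for illustration only
    PREFIX lex: <http://example.org/lexicon#>
    SELECT ?entry ?pos ?sense
    WHERE {
      ?entry lex:lemma        "run" ;
             lex:partOfSpeech ?pos ;
             lex:denotes      ?sense .
    }

A centralized store that could answer queries like this quickly, over lexical descriptions harvested and cooked from across the Semantic Web, is exactly the piece Cypher is missing today.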