Monrai Blog

News about Cypher, Semantic Web, Natural Language Processing, and Computational Linguistics

Sunday, August 27, 2006

Sunday Surfing: Java for .NET

A friend sent me a link to the IKVM framework a few weeks ago, and as the week wound down, I was finally able to look into it. For you Microsofties who build on .NET, and for the Java developers looking to interoperate with the Microsoft development world, IKVM looks to be a great solution. It provides a Java VM implemented in .NET, along with the Java core class libraries implemented in .NET. The payoff is that .NET applications can leverage Java libraries, and vice versa. There are of course other ways of interoperating, but this approach allows for tight integration, which is sometimes necessary in an integration project. There's no support for AWT/Swing, but I'm guessing 99.9% of the developers looking at this don't care. There is a potential project coming up in which I may get to use this stuff in at least a prototype environment, so I plan to post the results and experience.

Radar Networks Shows Some Skin

Nova Spivack's new venture, Radar Networks, is finally preparing to reveal the new and highly secretive (Web 2.0/Semantic Web/Mashup/PIM ???) project they've been working on for the last few years. I am really excited to hear they've gotten so far along in development, and am anticipating hearing just what this new technology platform they're building is. More importantly, what will its impact be on the Semantic Web (and ergo Cypher)? From the announcement:
...something happened that changed my mind about this recently. I had lunch with my friend Munjal Shah, the CEO of Riya, who has an investor, Peter Rip, in common with me. Listening to Munjal tell his stories about how he has blogged so openly about Riya's growth, even from way before their launch, and how that has provided him and his team with amazingly valuable community feedback, support, critiques, and new ideas, really got me thinking. Maybe it's time Radar Networks started telling a little more of its story? It seems like the team at Riya really benefitted from being so open. So although, we're still in stealth-mode and there are limits to what we can say at this point, I do think there are some aspects we can start to talk about, even before we've launched. And besides that our story itself is interesting -- it's the story of what it's like to build and work in a deep-technology play in today's venture economy.
Good to hear another Semantic Web company has found backing in the venture capital community. I'll be staying tuned.

Tuesday, August 22, 2006

RDF Radar and PingtheSemanticWeb

The creator of PingtheSemanticWeb.com has a post about a new Firefox plugin for detecting RDF on the web:

One of the new comer is the Semantic Radar wrote by Uldis Bojars. This plug-in for FireFox will notify you if it finds a FOAF, SIOC or DOAP RDF document on the web pages your surf.

The characteristic of semantic web documents is that they are not intended for humans, but for software agents (like search engines crawlers, personal agent software like Web Feed Readers, etc). The consequence is that humans do not see these documents, so no body really knows that the Semantic Web is growing and growing on the current Web.

This is the purpose of this new Semantic Radar: unveiling the Semantic Web to humans.

The Semantic Radar: much more than that

This plug-in is much more than that. Effectively, each time it detects one of these semantic web documents, it will notify PingtheSemanticWeb.com web service.

This is where the interaction between semantic web services and applications are starting to emerge. Now Web browsers will detect semantic web documents and notify a web service acting as a central repository for semantic web documents
I had the thought to extend Cypher to query the PingtheSemanticWeb.com service to detect Cypher datasets, and to notify the service when it has loaded new datasets created by the user. My question is, is there a way for my software to detect only the RDF documents it is concerned with (i.e. Cypher dataset documents)? If so, I think an excellent project would be to develop a simple ontology for wrapping Cypher dataset documents, basically to point to their location on the web and carry other metadata, and then to have Cypher download the datasets.
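To make the idea a bit more concrete, here is a rough Python sketch of the filtering step. It assumes the pinged RDF has already been parsed into plain (subject, predicate, object) tuples. The cy:Dataset class and cy:location property are hypothetical names for the wrapper ontology, which does not exist yet, and nothing here calls the actual PingtheSemanticWeb.com service.

```python
# Hypothetical wrapper-ontology terms; neither is defined anywhere yet.
RDF_TYPE = "rdf:type"
CYPHER_DATASET = "cy:Dataset"
CYPHER_LOCATION = "cy:location"


def find_datasets(triples):
    """Map each resource typed as cy:Dataset to its download location."""
    triples = list(triples)
    # First pass: collect every subject declared to be a Cypher dataset.
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == CYPHER_DATASET}
    # Second pass: keep location statements only for those subjects.
    return {s: o for s, p, o in triples if p == CYPHER_LOCATION and s in typed}


# Made-up sample data: one Cypher dataset mixed in with an unrelated FOAF document.
pinged = [
    ("http://example.org/numbers", RDF_TYPE, CYPHER_DATASET),
    ("http://example.org/numbers", CYPHER_LOCATION, "http://example.org/numbers.zip"),
    ("http://example.org/bob", RDF_TYPE, "foaf:Person"),
    ("http://example.org/bob", CYPHER_LOCATION, "http://example.org/ignored"),
]
print(find_datasets(pinged))
```

Only the first resource is returned, since the second one carries a location but is not typed as a dataset.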

Tuesday, August 15, 2006

50K Euro Compression Prize

Marcus Hutter has announced that a 50K euro purse will go to the developer of an algorithm which can compress the first 100MB of Wikipedia better than its predecessors:
Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 100MB of Wikipedia better than your predecessors, you(r compressor) likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs.
If anyone wins the prize using Cypher's impeccable pattern-matching capabilities, we'll humbly accept your gratitude :)

More from Ebiquity Blog.

Release 0.7.2

A new Cypher release is available. This is a bug fix release:

Version 0.7.2

Fixes: from 0.7.1

-- Hardcoded reference to smonroe login for Sesame server now removed

With previous versions, users had to create a Sesame account to match the account in the Cypher config file. This fix allows users to change the config file to match their own Sesame login info.

Update:

There was also a hardcoded reference to the two default Sesame repositories, which has now been found and fixed.

Monday, August 14, 2006

Centralized Approach?

I ran across a centralized RDF search engine, Swoogle. From the site:

Swoogle has a collection of over 1M error-free RDF documents collected from the Web and an additional ~700K documents that have embedded RDF, are malformed but appear to be RDF, or are no longer accessible. We’ve intentionally limited the number of simple RSS and FOAF documents in the current collection.

A centralized database has obvious benefits. In an ideal world, a Google would crawl RDF documents and serve up queries through one central interface. But RDF isn't HTML, nor does SPARQL lend itself to any sort of straightforward keyword mapping. Building a centralized database to process billions of open-ended queries per day is a mammoth undertaking. It appears that Google, which is perhaps the only company on the planet with enough imagination, incentive, and expertise to effectively build such a centralized database, is also the company most skeptical about the viability of the Semantic Web. The Semantic Web may also pose inherent threats to Google, which has built its empire on algorithms that attempt to address the deficiencies of the unstructured World Wide Web.

I therefore believe that the path of least resistance for bootstrapping the Semantic Web will be a P2P network, or at the very least, a hybrid of the centralized and P2P approaches. Swoogle seems like a great first attempt, and I'll be watching out for progress made by this and other centralized attempts, but I'd sooner bank on distributed P2P approaches.

Gartner's Hype Cycle


The industry research firm Gartner has announced its Emerging Technologies Hype Cycle for 2006, which analyses the maturity, impact and adoption speed of 36 technologies and trends over the next ten years. Among this year’s themes of technologies eliciting significant momentum is the Semantic Web. The list includes new or heavily hyped technologies, where organisations may be uncertain as to which will have most impact on their business.

Thursday, August 10, 2006

Release 0.7.0

A new release of Cypher is available. This is a feature enhancement release. Now Cypher can generate the integer representation of any arbitrary natural language number:

Version 0.7.0

Enhancements: from 0.6.9

-- added new NumberTranscoder_LITERAL; allows natural language numbers to generate an integer representation; the integer is wrapped in an RDF literal of type xsd:nonNegativeInteger or xsd:negativeInteger, making it consumable by semantic web applications.

-- added new number pattern grammar example to exploit number transcoder

There are also a couple of new grammar definition files which cover natural language numbers in English, e.g. Five hundred twenty eight million five. Extending them to cover numbers in other languages shouldn't be a problem. The extended example dataset covers numbers up to trestrigintillion (10^102 I think, but correct me if I'm wrong). Since so many people have been waiting for an online demo, I plan to set up the number transcoder as an intermediate online demo, especially since the input set in this case is finite.

I will post a more detailed explanation of the new dataset, most likely in an article on the main Monrai website. In the meantime, try starting Cypher and entering: Your Name is some long number, for example Chris is twenty two thousand forty nine. Then look at the output file. There should be an owl:sameAs triple near the top, and one object should be the number you said. The BE verb is set to output an owl:sameAs triple, but you can easily change it to set the subject's age (e.g. myonto:age). Also, conjunctions are not covered by the number patterns I wrote, so nine hundred and two won't match, but nine hundred two will match. I leave extending the example number pattern grammar to cover conjunctions as an exercise for the user.
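For readers who want a feel for the conversion itself, here is a minimal Python sketch of turning a natural language number into an integer and wrapping it as a typed literal. This is not how NumberTranscoder_LITERAL or the Cypher grammar files are implemented; the function names, the small scale table, and the handling of a leading "negative" or "minus" are all assumptions made for illustration. Like the example grammar, it rejects conjunctions.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Only a few scales are listed here; the real dataset goes much higher.
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9, "trillion": 10**12}


def words_to_int(phrase):
    """Convert e.g. 'twenty two thousand forty nine' to 22049."""
    words = phrase.lower().replace("-", " ").split()
    # Assumption: a leading 'negative' or 'minus' flips the sign.
    negative = bool(words) and words[0] in ("negative", "minus")
    if negative:
        words = words[1:]
    total = 0  # sum of the scale groups completed so far
    group = 0  # value of the group currently being read
    for word in words:
        if word in UNITS:
            group += UNITS[word]
        elif word in TENS:
            group += TENS[word]
        elif word == "hundred":
            group *= 100
        elif word in SCALES:
            total += group * SCALES[word]
            group = 0
        else:
            # Conjunctions such as 'and' end up here, as in the example grammar.
            raise ValueError(f"unrecognized number word: {word!r}")
    value = total + group
    return -value if negative else value


def to_typed_literal(value):
    """Wrap an integer as a typed literal string in Turtle-like notation."""
    datatype = "xsd:nonNegativeInteger" if value >= 0 else "xsd:negativeInteger"
    return f'"{value}"^^{datatype}'


print(words_to_int("twenty two thousand forty nine"))  # 22049
print(to_typed_literal(words_to_int("nine hundred two")))
```

Calling words_to_int("nine hundred and two") raises a ValueError, which mirrors the limitation described above.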

Natural language numbers are normally spoken as opposed to written/typed, so speech recognition systems are probably a more appropriate use case for this dataset.

Have fun!

ConceptNet

Remember the Open Mind Project? Well, I recently heard that a group at MIT has taken that commonsense database and created a .NET explorer as well as a Natural Language Processing framework. Here's more from the site:
The ConceptNet knowledgebase is a semantic network presently available in two versions: concise (200,000 assertions) and full (1.6 million assertions). Commonsense knowledge in ConceptNet encompasses the spatial, physical, social, temporal, and psychological aspects of everyday life. Whereas similar large-scale semantic knowledgebases like Cyc and WordNet are carefully handcrafted, ConceptNet is generated automatically from the 700,000 sentences of the Open Mind Common Sense Project – a World Wide Web based collaboration with over 14,000 authors.
There's a lot of talk in the docs about it using Microsoft IronPython, which I suppose is a derivation of Python. In my opinion, such common sense databases are akin to an RDF instance database. So while these types of databases don't explicitly offer the type of information Cypher needs to perform language processing, Cypher could be used to populate and query these databases using plain language. In addition, some data, such as type hierarchies, can be extracted from these sources to help build lexicons. You can expect more Cypher support of such common sense resources as they continue to gain momentum.

OpenCyc 1.0

Cycorp has released OpenCyc 1.0. The Cyc system is a database of common sense assertions (e.g. rain is wet, grass is found outdoors). A couple of years back, I wrote a Cyc microtheory transcoder as a sort of toy application for Cypher. The system translated natural language descriptions, phrases and questions into CycL microtheories and queries. But I couldn't get enough people interested to justify the work. Looks like I might be blowing the dust off that old code.

Here's more on the announcement from the OpenCyc website:

Release 1.0 of OpenCyc includes:

  • The entire Cyc ontology containing hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an upper ontology whose domain is all of human consensus reality.

  • English strings corresponding to all concept terms, to assist with search and display.

  • A compiled version of the Cyc Inference Engine and the Cyc Knowledge Base Browser.

  • Documentation and self-paced learning materials to help users achieve a basic- to intermediate-level understanding of the issues of knowledge representation and application development using Cyc.

  • A specification of CycL, the language in which Cyc (and hence OpenCyc) is written.

  • A specification of the Cyc API for application development.


I'm trying to find an RDF view of the Cyc database that actually exposes the knowledge using RDF semantics. If anyone knows of one, please let me know.

Tuesday, August 08, 2006

Release 0.6.9

There's a new Cypher release available:

Fixes: from 0.6.8

-- dynamic addition of FOAF entry for proper nouns not already entered in the database

Monday, August 07, 2006

Slashfacet Semantic Interface

Today, I ran across Slashfacet, a generic browser for heterogeneous semantic web repositories. The browser works on any RDFS dataset without any additional configuration. The interface controls change depending on the data being viewed. Reminds me a lot of Dbin. Here's the paper.

Sunday, August 06, 2006

Modified 'Star' Lexeme

I was testing the latest example dataset release, and discovered the following input didn't produce output:

Tom Hanks stars in The Terminal

After investigation, I noticed there was no word sense for 'star' which accounted for the in preposition-object construction. So I added it, and now the following works fine:

Tom Hanks stars in the Terminal --> RDF
the movies that star Tom Hanks --> SeRQL

As a side note, the datasets for the two 'movie' examples covered in the Cypher User Manual page are still on the way. I discovered a bug in how nominal clausal modifiers which are missing both the verb and subject are processed. This affects the pattern Actresses who played in movies with Tom Hanks. As a quick hack, however, I just treated the noun phrase as having one clausal modifier with two prepositional phrases, and it parses fine. In actuality, though, the last prepositional phrase is attached to the noun head movies: movies with Tom Hanks. And this is actually an abbreviation for: movies that are cast with Tom Hanks. The big difference is that the framework works best when frame slots are filled by clauses (i.e. verb lexemes), not nouns. By expanding the noun phrase's prepositional phrase into a clause, we can govern the semantics of the prepositional phrase just by referencing a verb. So, instead of adding a new feature to the movie lexeme to cover each possible prepositional phrase complement, we just find a verb which governs the semantics, in effect reusing other lexemes. A better explanation of this is on the way. Please be patient as I update the lexicon definition language to address this phenomenon.

Saturday, August 05, 2006

AOL's Data Collection

eBiquity has a post on AOL's release of a large-scale data collection of natural language questions and answers, which includes 20K manually annotated queries:

AOL Research has released some interesting data collections, including:

  • 20K hand labeled, classified queries
  • 3.5M web question answering queries (who, what, where, when …)
  • Query streams for 500K users over three months (20M queries)
  • Query arrival rates for queuing analysis
  • 2M queries against US Government domains

Additional datasets are promised in the future.

Would be nice if someone took time to evaluate the use of this data for Cypher.

Google's N-gram Set

eBiquity's semantic web blog has reported Google's announcement that it will share a gigantic n-gram dataset generated from a corpus of one trillion words from Web pages. Google found 1.1B five-word sequences that appear at least 40 times and 13.8M words that appear at least 200 times. The dataset will be distributed by the Linguistic Data Consortium.

Google describes its motivation as follows:

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.

I wonder if the data set will be annotated with linguistic information such as part of speech and word frequency. Sounds promising for semantic web data mining and hopefully will provide another corpus of data for Cypher and NLP.

Link Grammar Parser

I was contacted by one of the guys working on the Link Grammar Parser at Carnegie Mellon. The approach is one of the few (open systems) I've come across which attempts to derive a true semantic representation using a linguistic knowledge-based approach. I plan to download it, compare it with Cypher, and post the results.

Language of Thought

I thought this was a pretty cool piece of software. I haven't had a chance to play with it much, but I thought it might interest you:

i just visited ur site and think that the idea of understanding natural language and then translating it into a format suitable for the semantic web is a cool idea.im working on a project called Nelements that displays knowledge in the language of thought. in the future i was also planning to work on a Nelements translator that can translate natural language text into the language of thought.

Release 0.6.8

Cypher release 0.6.8 is a minor bug fix release. Among the issues fixed are:

-- fixed bug which prevented some Containers element constituents from matching
-- fixed bug which caused the console to stop listening for input
-- changed NounTranscoder's handling of possessive noun constructs; possessive noun phrases are now SeRQL queries which merge with the enclosing phrase, to accommodate such phrases as "my homes in Houston"

The release also includes some minor changes to the Hello World example data set.

Thursday, August 03, 2006

Wiki for Cypher

This week, I've been testing a new service Monrai will be launching in a few weeks: a database of lexicons, grammars, framenets, and RDF ontologies for Cypher. In the spirit of Wikipedia, the site will allow anyone to add and edit content in a shared dataset. The dataset will be available as a large zip file.

We may ask some of you to serve as beta testers before making the official announcement. More to come...

Tuesday, August 01, 2006

Transcography - Part 1

Cypher is based on a sub-discipline of natural language processing called Transcography, which was developed by Monrai with the goal of merging the field of natural language processing with the increasingly popular Semantic Web movement. Transcography is the process of parsing the phrase structure of a natural language construct, and translating the grammar output into a semantic representation. The output of each NL construct is three things: 1) a URI representation of the NL construct, 2) a set of one or more subject-predicate-object triples involving the URI, and 3) the set of all triples produced by sub-phrases. So, Cypher views any and all linguistic input as a URI + related triples. Knowing this is key to understanding why the Cypher lexicon is such a powerful NL resource.

As an example of transcographic output, consider the phrase: John's coach. The transcographic process produces a URI representing the phrase, for example: http://john.mysite.com/MrDouglass, and a set of triples representing the statements involved in the phrase:

{http://john.mysite.com/me} jo:isCoachedBy {http://john.mysite.com/MrDouglass}

Cypher leverages these triples to create either an RDF model or a SeRQL query. The mode of output depends on whether the NL construct is a clause or description, or a noun phrase or question. The triples of sub-phrases are recursively merged to produce a root graph representing the root NL phrase or clause. For example, consider: John's coach knows Martin. The URI produced will represent this clause (e.g. the URI of a reified RDF triple, or the URI of a semantic frame), and the graph will contain:

{qv:node1} foaf:knows {http://john.mysite.com/MartinCrump}

The URI qv:node1 represents a query variable of a SeRQL query which was serialized in RDF. This is because the phrase John's coach is a relational noun phrase, and thus an anaphoric reference. By reconstructing the SeRQL query for the variable (by following the links from qv:node1), and then executing the query, a program can retrieve the resource that represents John's coach at the time of the query. This technique is used because John may have a new coach at the time of the query. Transcography stipulates that any anaphoric reference be represented by a query variable (linked to the RDF representation of the SeRQL query) unless the program is ready to apply the variable's value (e.g. to present it to a human user in an interface).
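As a rough illustration of the URI + triples view, here is a small Python sketch. It is not Cypher's internal representation. The Transcoded class, the resolve function, and the ex:clause1 URI are hypothetical, and the RDF serialization of the SeRQL query is reduced to a single triple pattern with qv:node1 in the object position (my simplification).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Transcoded:
    """Output of one NL construct: a URI, its own triples, and its sub-phrases."""
    uri: str
    triples: List[Triple] = field(default_factory=list)
    sub_phrases: List["Transcoded"] = field(default_factory=list)

    def graph(self) -> List[Triple]:
        """Recursively merge this construct's triples with those of its sub-phrases."""
        merged = list(self.triples)
        for sub in self.sub_phrases:
            merged.extend(sub.graph())
        return merged


def resolve(variable: str, patterns: List[Triple], store: List[Triple]) -> Set[str]:
    """Bind a query variable by matching patterns that use it as the object."""
    candidates = None
    for s, p, o in patterns:
        if o != variable:
            continue
        matches = {obj for subj, pred, obj in store if (subj, pred) == (s, p)}
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()


# "John's coach": an anaphoric reference, so it stays a query variable.
coach = Transcoded(
    uri="qv:node1",
    triples=[("http://john.mysite.com/me", "jo:isCoachedBy", "qv:node1")],
)
# "John's coach knows Martin": the clause merges in the sub-phrase triples.
clause = Transcoded(
    uri="ex:clause1",  # hypothetical URI standing in for the clause
    triples=[("qv:node1", "foaf:knows", "http://john.mysite.com/MartinCrump")],
    sub_phrases=[coach],
)

# The data as it stands when the query is finally executed.
store = [("http://john.mysite.com/me", "jo:isCoachedBy", "http://john.mysite.com/MrDouglass")]
print(clause.graph())
print(resolve("qv:node1", clause.graph(), store))
```

If the store later records a different coach for John, the same graph resolves qv:node1 to the new resource, which is the point of keeping the reference as a variable until its value is needed.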

The word transcography is the combination of transcode, which means "to convert media from one format to another", and -graphy which is "writing or text representation produced in a specified manner or by a specified process". Thus the literal meaning is "text transcoding". Knowledge representation frameworks used in the process include RDF and Frame Semantics.