Saturday, August 18, 2007

NLP and Global Warming

Those of us who were at EMNLP-CONLL 2007 remember the "NLP and Global Warming" exchange between James Clarke, Jason Eisner, and Dan Bikel at the Q/A session of the Clarke and Lapata paper. The transcript of this funny conversation is now online, thanks to Jason.

I really liked Hal's ending remark.

Wednesday, August 15, 2007

People Search on the Web

Wired has an article about a people search engine that combines crawled and user-added content. From the few searches I did, it looks like this works better for celebrity names than for a regular person with web content. For instance, searching a name like "David Smith" produces these results. Of the top 10 results, only 3 actually contain the name "David Smith" or something close to it, and the first result is not one of them. Compare this with a general-purpose search engine like Google. Among a dozen random NLP/ML academic names (professors) I tried, it only got Jason Eisner and Tom Mitchell correct. One reason for this poor recall is probably that they don't get content from user home pages.
(Some of the sites this data is derived from include MySpace, Friendster, IMDB, Wikipedia, etc.)

Nevertheless, this website is representative of the interesting KDD-style problems that one could work on with people names. It is also interesting because the names we look for fall in the "long tail", without sufficient supporting data, which calls for clever machine learning techniques.

Sunday, August 12, 2007

Digital Reasoning awarded contextual similarity patent?

I was led to this article on Forbes via Damien's post. The article is about a company, Digital Reasoning, getting a patent on what sounded to me like contextual similarity. Their "white paper" makes reference to patent number 7249117 (via USPTO). Unlike research papers, the patent document was very difficult to read. I will get to it sometime later, but here is an extract from their press release about what their technology can do.

* Learn the meanings of words, classes of words, and other symbols based on how they are used in context in natural language
* Create and manipulate models of this "meaning" - i.e. the mathematical patterns of usage - including the detection of groups or similar categories of words or development of hierarchies or creation of relationships between words
* Improve the models based on human feedback or using other structured information after model construction
* The representation or sharing of this model or learning in an ontology, graph structure, or programming languages

Anyone from the ACL/ML/AI community can immediately recognize this and start citing their favorite papers on these topics starting from at least a decade ago. A promotional video from the company on YouTube can be found here. Excerpt from the video: "... We treat the text representation of human language as a signal ... ".
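For readers outside the community, here is a rough sketch of the textbook idea that the first bullet resembles: represent each word by counts of the words that appear around it, and compare words by the cosine of those count vectors. This is only my illustration of standard distributional similarity, not a description of what the patent or the company's product actually does. The toy corpus, the window size, and the function names `context_vectors` and `cosine` are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, invented for this illustration.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell sharply on monday",
]

vectors = context_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(sim_cat_dog, sim_cat_stocks)
```

In this toy corpus "cat" and "dog" share most of their contexts and come out as similar, while "cat" and "stocks" share none.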

I think everyone should stop taking patents seriously. Wishful thinking?

Thursday, August 2, 2007

Recommending scientific papers

I noticed a new feature in Citeseer which tries to suggest an "alternate document" for a paper.
Clearly it does not do what its name implies, and it doesn't show up for all papers. (Experimental?) So, an interesting question is: how does one recommend scientific papers? Something more than mere document similarity is required. If I am reading a CRF paper, then there is no point in listing all papers containing similar words. Just listing the nodes connected to the paper by inward and outward links in the citation graph won't suffice either. Ideal recommendations for a paper would depend on the role the user is playing. When I am reading a paper about some new topic, I would like to be pointed to the original papers on the topic, some recent papers on the topic, and maybe some survey papers or books. On the other hand, when I am writing a paper, I would like to be pointed to all papers related to the topic (recall is more important than precision here, to avoid reviewer comments on "missing references") in some magical order that puts the papers more relevant to my work on top. Also, these papers might not be related directly through citations. If there is recent related work in the Annals of Statistics, for instance, then it should show up when I am working on, say, approximate inference methods for graphical models. (Possible to deduce this from my previous queries?)
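To make the "role" idea a bit more concrete, here is a minimal sketch of a role-dependent ranker. It is a toy under stated assumptions, not a proposal for how Citeseer should work. In "reading" mode it boosts textually similar papers that are heavily cited (a crude stand-in for "original") or recent. In "writing" mode it returns every textually related paper, cited or not, ordered by similarity. The scoring formula, the paper records, and the function name `recommend` are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, papers, citations, role="reading", top_k=2):
    """Rank papers for `target` according to the user's role.

    papers:    id -> {"text": str, "year": int}
    citations: set of (citing_id, cited_id) pairs
    """
    target_vec = Counter(papers[target]["text"].lower().split())
    in_degree = Counter(cited for _, cited in citations)
    newest = max(p["year"] for p in papers.values())
    scores = {}
    for pid, paper in papers.items():
        if pid == target:
            continue
        sim = cosine(target_vec, Counter(paper["text"].lower().split()))
        if sim == 0.0:
            continue  # no textual overlap at all: never recommend
        if role == "reading":
            influence = math.log1p(in_degree[pid])       # heavily cited
            recency = 1.0 / (1 + newest - paper["year"])  # recent work
            scores[pid] = sim * (1.0 + influence + recency)
        else:
            scores[pid] = sim  # writing: keep everything related
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k] if role == "reading" else ranked


# Hypothetical records, invented for this illustration.
papers = {
    "mine": {"text": "approximate inference conditional random fields", "year": 2007},
    "crf": {"text": "conditional random fields sequence labeling inference", "year": 2001},
    "stats": {"text": "approximate inference graphical models variational", "year": 2007},
    "parsing": {"text": "dependency parsing projective trees", "year": 2006},
}
citations = {("mine", "crf"), ("parsing", "crf")}

reading_list = recommend("mine", papers, citations, role="reading")
writing_list = recommend("mine", papers, citations, role="writing")
print(reading_list, writing_list)
```

Note that the "stats" record has no citation link to "mine" at all, yet it still shows up in the writing list because of its text overlap, which is the Annals of Statistics situation described above. Deducing the role from previous queries is left out entirely.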

In spite of more information being present in a scientific paper than its text, recommending or ranking papers appears to be quite challenging.