tag:blogger.com,1999:blog-77322264772536864682024-03-13T08:26:05.181-07:00Misc Research StuffAn online notebook for my jottings on NLP and machine learning.Unknownnoreply@blogger.comBlogger59125tag:blogger.com,1999:blog-7732226477253686468.post-68771439549343924182009-07-03T13:01:00.000-07:002009-07-04T13:37:15.564-07:00Dealing with large scale graphs<div>To a hammer everything looks like a nail, but one great hammer to have in your toolbox is the graph. The <a href="http://www.aclweb.org/anthology/index.html">ACL anthology</a> alone lists more than 300 results for the query <a href="http://www.google.com/custom?q=%22graph+based%22&btnG=Search&hl=en&client=google-coop-np&cof=AH%3Aleft%3BCX%3AACL%2520Anthology%2520search%3BL%3Ahttp%3A%2F%2Fwww.google.com%2Fcoop%2Fintl%2Fen%2Fimages%2Fcustom_search_sm.gif%3BLH%3A65%3BLP%3A1%3BVLC%3A%23551a8b%3BGFNT%3A%23666666%3BDIV%3A%23cccccc%3B&cx=011664571474657673452%3A4w9swzkcxiy&adkw=AELymgVkmTTk4qTXJrDVPNTR6g4ViEj-nAg-Nqo7jvuyoligGcMvib0rqxKsTBfQt6QMeJ0oC2s2Qq0e-eV8IKdKmlbX_YDsfpSHlZuUWX9Baq88Tjxz24BaobmQZZo2_wTS3EFlrDDBAX9FfPCf-vKxUqOHxyN5yUZpAGfJtl8SdTQaJ0Kj02E&boostcse=0&sa=2">"graph based"</a>. Graph-based formalisms allow us to write down solutions in a succinct linear-algebra representation. However, implementing such solutions for large problems, or even for small datasets with blown-up graph representations, can be challenging in resource-limited environments. While some go for interesting <a href="http://snowbird.djvuzone.org/2007/abstracts/139.pdf">approximate solutions</a>, an alternative is to pool several limited-resource nodes into a map-reduce cluster and design a parallel algorithm that conquers scale with concurrency. This is easier said than done, since designing a parallel algorithm often requires a different perspective on the problem. But it is well worth the effort, as the new insights gained will reveal connections between things you already knew. 
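Since label propagation comes up in a moment, the core iteration is small enough to sketch. This is a toy dense-matrix version of the Zhu-Ghahramani loop, with a function name and details of my own choosing, not the scaled-up implementation from the paper:

```python
import numpy as np

def propagate_labels(W, Y, is_labeled, n_iter=200):
    """Toy dense-matrix label propagation (Zhu-Ghahramani style).

    W          : (n, n) symmetric non-negative affinity matrix (no isolated nodes)
    Y          : (n, k) one-hot label matrix; unlabeled rows are all zero
    is_labeled : (n,) boolean mask of seed nodes whose labels are clamped
    """
    # Row-normalize W into a transition matrix.
    T = W / W.sum(axis=1, keepdims=True)
    F = Y.astype(float)
    for _ in range(n_iter):
        F = T @ F                      # each node averages its neighbors' labels
        F[is_labeled] = Y[is_labeled]  # clamp the seed labels every iteration
    # Normalize rows into per-node label distributions.
    return F / F.sum(axis=1, keepdims=True)
```

Written this way, the resemblance to a power-iteration PageRank computation (repeated multiplication by a normalized adjacency matrix) is hard to miss, which is exactly the connection discussed below.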
For instance, in our <a href="http://www.textgraphs.org/ws09/index.html">TextGraphs 2009</a> paper, we started out scaling up <a href="http://learning.eng.cam.ac.uk/zoubin/papers/zgl.pdf">Label Propagation</a>, but eventually the connection to <a href="http://google.stanford.edu/~backrub/pageranksub.ps">PageRank</a> became obvious. To me this was a bigger learning moment than getting Label Propagation to work on large graphs. [<a href="http://www.clsp.jhu.edu/~delip/nocrawl/textgraphs09.pdf">Preprint Copy</a>]</div><div><br /></div><div>For the actual implementation, we used <a href="http://hadoop.apache.org/">Hadoop</a> (surprise!), although <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">bulk synchronous parallel models</a> make more sense given the locality of the operations in most graph algorithms.</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-33464497985138385892009-03-31T15:04:00.000-07:002009-03-31T15:26:24.994-07:00Sentiment Analysis is AI-HardIn a <a href="http://mags.acm.org/communications/200904/?pg=16">breezy article</a> on sentiment analysis, Alex Wright quotes Bo Pang saying:<br /><blockquote>We are dealing with sentiment that can be expressed in subtle ways.</blockquote>This is so true of the examples I've encountered at work, and my favorite is this one I saw on iTunes recently.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH5PcS2INajpnQcv6P9bJnQWd7P6KgLfxUi9BvUAtASQeSa1CTK1hL_O790ElfwtEwsyWg6ljy6ClBMEue_XfjOn_7Bf5lbR9wBQ8zRsrszeQn6WFKXeo4UomflxYTNR8z0OTxcnlFqSI0/s1600-h/Picture+4.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 61px;" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH5PcS2INajpnQcv6P9bJnQWd7P6KgLfxUi9BvUAtASQeSa1CTK1hL_O790ElfwtEwsyWg6ljy6ClBMEue_XfjOn_7Bf5lbR9wBQ8zRsrszeQn6WFKXeo4UomflxYTNR8z0OTxcnlFqSI0/s400/Picture+4.png" alt="" id="BLOGGER_PHOTO_ID_5319478826930339346" border="0" /></a><br />While I commend Alex for writing an informative yet accessible article on the topic, I disagree with the article's opinion that sentiment analysis is a series of "filters". That is clearly a euphemism. Any working sentiment analysis system is actually an engineering feat, often consisting of a series of hacks duct-taped together by glue code that handles special cases.<br /><br />The article also seems to suggest that extracting factual information is somehow easier than extracting opinions. I invite them to participate <a href="http://apl.jhu.edu/%7Epaulmac/kbp.html">here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-57938500175512770842009-02-28T09:02:00.000-08:002009-02-28T09:09:42.562-08:00On the way to Brewer's Art<div>Never mind how we got to this topic:</div><div><br /></div>me: Parsing is for fogies.<div>Markus: What?</div><div>Jason: I think he means crusty old linguists.</div><div>Markus: You should probably use a shallow parser.</div><div>me: I'm shallower than that; I use n-grams.</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-91639279983365918322008-12-19T11:45:00.000-08:002008-12-19T11:47:38.501-08:00EACL ReadingEACL 2009 <a href="http://www.eacl2009.gr/conference/acceptedpapers">accepted paper list</a> is up. 
Here's my reading list:<br /><br />WEAKLY SUPERVISED PART-OF-SPEECH TAGGING FOR RESOURCE-SCARCE LANGUAGES<br />Kazi Saidul Hasan and Vincent Ng<br /><br />USING CYCLES AND QUASI-CYCLES TO DISAMBIGUATE DICTIONARY GLOSSES<br />Roberto Navigli<br /><br />SYNTACTIC AND SEMANTIC KERNELS FOR SHORT TEXT PAIR CATEGORIZATION<br />Alessandro Moschitti<br /><br />SENTIMENT SUMMARIZATION: EVALUATING AND LEARNING USER PREFERENCES<br />Kevin Lerman, Sasha Blair-Goldensohn and Ryan McDonald<br /><br />PERSON IDENTIFICATION FROM TEXT AND SPEECH GENRE SAMPLES<br />Jade Goldstein-Stewart, Ransom Winder and Roberta Sabin<br /><br />OUTCLASSING WIKIPEDIA IN OPEN-DOMAIN INFORMATION EXTRACTION: WEAKLY-SUPERVISED ACQUISITION OF ATTRIBUTES OVER CONCEPTUAL HIERARCHIES<br />Marius Pasca<br /><br />GROWING FINELY-DISCRIMINATING TAXONOMIES FROM SEEDS OF VARYING QUALITY AND SIZE<br />Tony Veale, Guofu Li and Yanfen Hao<br /><br />GENERATING A NON-ENGLISH SUBJECTIVITY LEXICON: RELATIONS THAT MATTER<br />Valentin Jijkoun and Katja Hofmann<br /><br />CONTEXTUAL PHRASE-LEVEL POLARITY ANALYSIS USING LEXICAL AFFECT SCORING AND SYNTACTIC N-GRAMS<br />Apoorv Agarwal, Fadi Biadsy and Kathleen Mckeown<br /><br />COMPANY-ORIENTED EXTRACTIVE SUMMARIZATION OF FINANCIAL NEWS<br />Katja Filippova, Mihai Surdeanu, Massimiliano Ciaramita and Hugo Zaragoza<br /><br />ANALYSING WIKIPEDIA AND GOLD-STANDARD CORPORA FOR NER TRAINING<br />Joel Nothman, Tara Murphy and James R. CurranUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-4330628867690146572008-12-02T16:08:00.000-08:002008-12-05T20:19:12.083-08:00And we're back ...Sometime back I wrote about <a href="http://resnotebook.blogspot.com/2008/07/quick-scan-at-acl.html">Wordle</a> to visualize textual information using frequency counts. 
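The counting behind a Wordle-style cloud is worth making concrete. Here is a hypothetical sketch; the tokenizer and the tiny stopword list are illustrative placeholders, not Wordle's actual pipeline:

```python
import re
from collections import Counter

def top_terms(text, k=100,
              stopwords=frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})):
    """Return the k most frequent non-stopword tokens: the raw material
    of a frequency-based tag cloud."""
    # Crude tokenizer: lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords).most_common(k)
```

Swapping the unigram counter for a bigram counter is the obvious route to the collocation cloud argued for below.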
Change.gov, the Obama transition team's website, <a href="http://change.gov/newsroom/entry/join_the_discussion_daschles_healthcare_response/">uses it</a> on the comments submitted in response to their health care discussion. This is very interesting, but I think Wordle should display the top 100 collocations instead of the top 100 words. But then, we also learnt at the last ACL how to <a href="http://www.aclweb.org/anthology-new/P/P08/P08-1075.pdf">learn collocation information from unigram frequencies</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-47325914264867682482008-07-17T05:10:00.000-07:002008-07-17T05:28:42.926-07:00Too many cooks?Computational Linguistics is becoming like <a href="http://www.sciencemag.org/current.dtl">Science</a> or <a href="http://www.nature.com/nature/index.html">Nature</a>. For instance, see <a href="http://www.mitpressjournals.org/doi/abs/10.1162/coli.2008.07-055-R2-06-29">this paper</a> in the current issue: (In this case, the broth wasn't spoiled ;-)<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhp6AJMW8_ERpPkTIWdUnSwjocKnJK3sAlW6c1iFjXEwX7bE33yljuT91EVhB9lb91k-jLj2JyH8qcdQ-Hcep08SwtRyPtwyBNyKzrzS0UDyDCKqCd-pTwo8x6ZEsXlQQ9SNWV-RHifkbPP/s1600-h/Picture+2.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhp6AJMW8_ERpPkTIWdUnSwjocKnJK3sAlW6c1iFjXEwX7bE33yljuT91EVhB9lb91k-jLj2JyH8qcdQ-Hcep08SwtRyPtwyBNyKzrzS0UDyDCKqCd-pTwo8x6ZEsXlQQ9SNWV-RHifkbPP/s400/Picture+2.png" alt="" id="BLOGGER_PHOTO_ID_5223954801996340482" border="0" /></a>Guess which paper has the largest number of authors on the <a href="http://aclweb.org/anthology-new/">ACL anthology</a>?Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-15396811357609484292008-07-08T10:55:00.000-07:002008-07-08T11:34:33.508-07:00To theory 
or not to theoryI stumbled upon this paper "Reflections after Refereeing Papers for NIPS" by <a href="http://www.stat.berkeley.edu/%7Ebreiman/">Leo Breiman</a> that gives some really candid insights into theory papers. (Unfortunately, I could not find a soft copy to share, except <a href="http://direct.bl.uk/bld/PlaceOrder.do?UIN=026632341&ETOC=EN&from=searchengine">this link</a>.) Some noteworthy observations:<br /><blockquote>"No theorems" implies "No theory"<br /><br />"... more than 99% of the published papers are useless exercises."<br /><br />"Mathematical theory is not critical to development of machine learning."<br /><br />"Our fields would be better off with far fewer theorems, less emphasis on faddish stuff, and much more into scientific inquiry and engineering."<br /></blockquote><br />I really liked this article, especially coming from someone who has been working in theory all his life, but I would still prefer reading papers that give theoretical insight, however useless, to pages and pages of feature engineering & experimentation using classifier X on problem Y -- the current trend at ACL.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7732226477253686468.post-12649756335123690982008-07-07T06:49:00.000-07:002008-07-07T06:58:00.163-07:00A quick scan at ACL<a href="http://mendicantbug.com/">Mendicant Bug</a> reports on a new tag-cloud service called <a href="http://wordle.net/">Wordle</a>. Here is a look at this year's ACL; it gives a clear idea of what is going on! 
A larger image is available <a href="http://wordle.net/gallery/wrdl/55900/ACL2008">here</a>.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitj9EeBl-rJA4PyTYuZuspOB0mAGxFcDM5W0-lCn-7ijlw3jV2VkTLAG94knRiGqlo4bHj9-rTU3jUHkIgyUPFbt7LxrMlC9I9moWg_EeZfVT3paHFD1XKXTcdL6k1l1UXTctHL7jvqMnx/s1600-h/Picture+1.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitj9EeBl-rJA4PyTYuZuspOB0mAGxFcDM5W0-lCn-7ijlw3jV2VkTLAG94knRiGqlo4bHj9-rTU3jUHkIgyUPFbt7LxrMlC9I9moWg_EeZfVT3paHFD1XKXTcdL6k1l1UXTctHL7jvqMnx/s400/Picture+1.png" alt="" id="BLOGGER_PHOTO_ID_5220269682864239874" border="0" /></a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-30294586171448516752008-05-11T22:43:00.000-07:002008-05-11T22:51:39.251-07:00Powerset Natural Language Search<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjl5n67cOOrzweKimjrQLwqqJtUeI7j98KrwNygcFfZ3E25OmgRg6bYmx8in5gTpX3-QeZyu3Th-_dbRzpos2pvFmH6UMUlQOraXDNI5Cat9iTpKIK6lTzpiheC9_FiXdR2Hj2-Kknbllcd/s1600-h/Picture+1.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjl5n67cOOrzweKimjrQLwqqJtUeI7j98KrwNygcFfZ3E25OmgRg6bYmx8in5gTpX3-QeZyu3Th-_dbRzpos2pvFmH6UMUlQOraXDNI5Cat9iTpKIK6lTzpiheC9_FiXdR2Hj2-Kknbllcd/s400/Picture+1.png" alt="" id="BLOGGER_PHOTO_ID_5199364090099267794" border="0" /></a><br /><a href="http://www.powerset.com">Powerset</a>, a company we only remember seeing as conference sponsors, now actually has something working. After receiving an email from them, I tried out several queries. 
At best, it seems to answer most wh-questions and handle certain whole-part relations.<br /><br />Try out the <a href="http://www.google.com/search?q=Who+is+Bart+Simpson%27s+father%3F">same query on Google</a>.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7732226477253686468.post-84278480248921354272008-04-03T20:35:00.000-07:002008-04-03T20:38:09.460-07:00Writing styleThe sweetest thing ever written in a paper: "The reader who is unfamiliar with this field or who has allowed his or her facility with some of its concepts to fall into disrepair may profit from a brief perusal of Feller (1950) and Gallagher (1968)."<br /><br /> - Brown et al., "Class-based n-gram Models of Natural Language", Computational Linguistics, 1990Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7732226477253686468.post-18820412657734511652008-03-28T11:54:00.000-07:002008-03-28T12:05:14.441-07:00Searching ACL anthologyIf you look up the <a href="http://acl.ldc.upenn.edu/">ACL anthology</a> regularly, my friend <a href="http://www.clsp.jhu.edu/%7Emarkus/">Markus</a> has a nice Firefox search plugin for that. 
You can get that and others from <a href="http://mycroft.mozdev.org/download.html?category=14&country=WW&language=all">this page</a>.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitwfITu5NMCnNh7N9xZnwjCrc1xdtawib4wOhKHSzZyTutmidahf4j8fBQ0ammlE-KjXUkl2-658tjRLTu4Dpi5269kNbaAP643KBoIeR3luBcXs930BZzbgFqTsho30-GRVt1tXrq8VWA/s1600-h/Picture+1.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitwfITu5NMCnNh7N9xZnwjCrc1xdtawib4wOhKHSzZyTutmidahf4j8fBQ0ammlE-KjXUkl2-658tjRLTu4Dpi5269kNbaAP643KBoIeR3luBcXs930BZzbgFqTsho30-GRVt1tXrq8VWA/s400/Picture+1.png" alt="" id="BLOGGER_PHOTO_ID_5182870229223618370" border="0" /></a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-28488634928152455772008-03-27T21:26:00.000-07:002008-03-27T22:20:52.966-07:00ACL accepted papers<span style="font-size:100%;"><span style="font-family:georgia;">Hal posted a while back about the </span><a style="font-family: georgia;" href="http://nlpers.blogspot.com/2008/03/acl-papers-up.html">ACL accepted papers</a><span style="font-family:georgia;"> that I just read now -- I've been living under a rock for some time. You can get a printer friendly version </span><a style="font-family: georgia;" href="http://cs.jhu.edu/%7Edelip/misc/acl08.html">here</a><span style="font-family:georgia;">. 
I know, my paper did not make it to that list :(</span><br /><br /><span style="font-family:georgia;">New additions to my reading list:</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Distributional Identification of Non-Referential Pronouns</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Shane Bergsma, Dekang Lin and Randy Goebel</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >An Unsupervised Approach to Biography Production using Wikipedia</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Fadi Biadsy, Julia Hirschberg and Elena Filatova</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Resolving Personal Names in Email Using Context Expansion</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Tamer Elsayed, Douglas Oard and Galileo Namata</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Mining Wiki Resources for Multilingual Named Entity Recognition</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Alexander Richman and Patrick Schone</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Inducing Gazetteers for Named Entity Recognition by Large-scale Clustering of Dependency Relations</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Jun'ichi Kazama and Kentaro Torisawa</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Name Translation in Statistical Machine Translation - Learning When to Transliterate</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Ulf Hermjakob, Kevin Knight and Hal Daume</span><br /><br /><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >The Tradeoffs Between Open and 
Traditional Relation Extraction</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Michele Banko and Oren Etzioni</span><br /><br /><span style="font-family:georgia;">(Longest paper title)</span><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Dmitry Davidov and Ari Rappoport</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Finding Contradictions in Text</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Marie-Catherine de Marneffe, Anna Rafferty and Christopher Manning</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Extracting Question-Context-Answer Triples from Online Forums</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Shilin Ding, Gao Cong, Chin-Yew Lin and Xiaoyan Zhu</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Yoav Goldberg, Meni Adler and Michael Elhadad</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Extraction of Entailed Semantic Relations Through Syntax-based Comma Resolution</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Vivek Srikumar, Roi Reichart, Mark Sammons, Ari Rappoport and Dan Roth</span><br /><br /></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Learning Bigrams from Unigrams</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Xiaojin Zhu, Andrew Goldberg, Michael Rabbat and Robert Nowak</span><br /><br 
/></span><span style="font-style: italic;font-family:georgia;font-size:100%;" >Evaluating Roget's Thesauri</span><span style="font-size:100%;"><br /><span style="font-family:georgia;">Alistair Kennedy and Stan Szpakowicz</span><br /><br /><span style="font-style: italic;font-family:georgia;" >Randomized Language Models via Perfect Hash Functions</span><br /><span style="font-family:georgia;">David Talbot and Thorsten Brants</span><br /><br /><span style="font-style: italic;font-family:georgia;" >Solving Relational Similarity Problems Using the Web as a Corpus</span><br /><span style="font-family:georgia;">Preslav Nakov and Marti Hearst</span><br /></span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-49589678048146375322008-02-24T17:50:00.001-08:002008-02-24T17:50:28.820-08:00What do you do?<span style="font-family: trebuchet ms;font-family:georgia;" >As a grad student working on NLP, how do you explain what you are working on to friends and family? I inevitably end up referring to the Google search engine even though what I do is quite far from IR. Actually, that's not true. These days IR seems to consume everything, but that's another story.</span><br /><br /><span style="font-family: trebuchet ms;font-family:georgia;" >This reminds me of a funny conversation at CLSP recently:</span><br /><br /><span style="font-family: trebuchet ms;font-family:georgia;" >Sanjeev is telling us about an incident where a concerned parent of a young child with a speech disability asks him for his opinion. 
Apparently, she is confused about "Language and Speech Processing" in CLSP.</span><br /><br /><span style="font-family: trebuchet ms;font-family:georgia;" >Keith butts in: "Run a few more iterations of EM and he'll be fine."</span>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7732226477253686468.post-25118047547079075212008-02-14T01:23:00.000-08:002008-02-14T01:59:11.602-08:00A song on parsingWe all know <a href="http://cs.jhu.edu/%7Ejason/">Jason</a>'s love for parsing from <a href="http://cs.jhu.edu/%7Ejason/research.html">his work</a> but it takes a different level of dedication to write a Valentine's Day <a href="http://cs.jhu.edu/%7Ejason/fun/grammar-and-the-sentence/">song about parsing</a>.<br /><br />As Jason says, "Parsers just want to be appreciated, like everyone else."Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-64745632603460813232007-10-17T17:07:00.000-07:002007-10-17T17:12:39.058-07:00Funny bone<span style="font-family:trebuchet ms;">The frequentist exclaimed, "All your Bayes are belong to us!" to which the Bayesian responded, "Well, it depends."</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-34974172762781140802007-09-20T22:58:00.000-07:002007-09-20T23:09:21.150-07:00NIPS papers are out<span style="font-family: verdana;">For a full list see </span><a style="font-family: verdana;" href="http://nips07.stanford.edu/accepted_papers.html">here</a><span style="font-family: verdana;">. 
Some papers I want to read based on my current interests:</span><br /><br /><span style="font-family: verdana;">Random Projections for Manifold Learning</span><br /><span style="font-family: verdana;">Chinmay Hegde, Michael Wakin, Richard Baraniuk</span><br /><br /><span style="font-family: verdana;">The Distribution Family of Similarity Distances</span><br /><span style="font-family: verdana;">Gertjan Burghouts, Arnold Smeulders, Jan-Mark Geusebroek</span><br /><br /><span style="font-family: verdana;">Manifold Sculpting</span><br /><span style="font-family: verdana;">Michael Gashler, Dan Ventura, Tony Martinez</span><br /><br /><span style="font-family: verdana;">A learning framework for nearest neighbor search</span><br /><span style="font-family: verdana;">Lawrence Cayton, Sanjoy Dasgupta</span><br /><br /><span style="font-family: verdana;">Learning Bounds for Domain Adaptation</span><br /><span style="font-family: verdana;">John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman</span><br /><br /><span style="font-family: verdana;">Convex Relaxations of EM</span><br /><span style="font-family: verdana;">Yuhong Guo, Dale Schuurmans</span><br /><br /><span style="font-family: verdana;">A Randomized Algorithm for Large Scale Support Vector Learning</span><br /><span style="font-family: verdana;">Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan</span><br /><br /><span style="font-family: verdana;">Bundle Methods for Machine Learning</span><br /><span style="font-family: verdana;">Alex Smola, S V N Vishwanathan, Quoc Le</span><br /><br /><span style="font-family: verdana;">Regularized Boost for Semi-Supervised Learning</span><br /><span style="font-family: verdana;">Ke Chen, Shihai Wang</span><br /><br /><span style="font-family: verdana;">Learning the structure of manifolds using random projections</span><br /><span style="font-family: verdana;">Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, Nakul Verma</span><br /><br /><span 
style="font-family: verdana;">A complexity measure for intuitive theories</span><br /><span style="font-family: verdana;">Charles Kemp, Noah Goodman, Joshua Tenenbaum</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-36972594875799718022007-08-18T00:40:00.000-07:002007-08-18T00:54:25.570-07:00NLP and Global Warming<span style="font-family:verdana;">Those of us who were at EMNLP-CONLL 2007 remember the "NLP and Global Warming" exchange between James Clarke, Jason Eisner, and Dan Bikel at the Q/A session of the </span><a style="font-family: verdana;" href="http://acl.ldc.upenn.edu/D/D07/D07-1001.pdf">Clarke and Lapata paper</a><span style="font-family:verdana;">. The transcript of this funny conversation is now </span><a style="font-family: verdana;" href="http://www.cs.jhu.edu/%7Ejason/advice/conf/NLP-and-global-warming.html">online</a><span style="font-family:verdana;">, thanks to Jason.</span><br /><br /><span style="font-family:verdana;">I really liked Hal's ending remark.</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-65873532280439025042007-08-15T16:16:00.001-07:002007-08-18T01:03:14.191-07:00People Search on the Web<span style="font-family:verdana;">Wired has an </span><a style="font-family: verdana;" href="http://www.wired.com/techbiz/startups/news/2007/08/spock_reputation">article</a><span style="font-family:verdana;"> about </span><a style="font-family: verdana;" href="http://www.spock.com/">spock.com</a><span style="font-family:verdana;">, a people search engine that combines crawled and user-added content. From the few searches I did, it looks like this works better for celebrity names than for a regular person with web content. For instance, searching a name like "David Smith" produces these <a href="http://www.spock.com/q/David-Smith">results</a>. Of the top 10 results, only 3 actually have the name "David Smith" or something close to it, and the first result is not one of them. 
Compare this with a general purpose search engine like </span><a style="font-family: verdana;" href="http://www.google.com/search?source=ig&hl=en&q=David+Smith">Google</a><span style="font-family:verdana;">. Among a dozen random NLP/ML academic names (professors) I tried, it only got Jason Eisner and Tom Mitchell correct. One reason for this poor recall is probably that they don't get content from user home pages.</span><br /><span style="font-family:verdana;">(Some sites this data is derived from include MySpace, Friendster, IMDB, Wikipedia, ratemyprofessors.com, etc.)</span><br /><br /><span style="font-family:verdana;">Nevertheless, this website is representative of the interesting KDD-style problems one could tackle with people names. It is also interesting because the people names we look for fall in the "long tail", without sufficient data to go on, calling for clever machine learning techniques.</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-22681725071322342452007-08-12T12:30:00.000-07:002007-08-12T13:33:40.681-07:00Digital Reasoning awarded contextual similarity patent?<span style="font-family:verdana;">I was led to </span><a style="font-family: verdana;" href="http://www.forbes.com/businesswire/feeds/businesswire/2007/07/31/businesswire20070731005886r1.html">this article</a><span style="font-family:verdana;"> on Forbes via </span><a style="font-family: verdana;" href="http://www.inma.ucl.ac.be/%7Efrancois/blog/entries/entry_594.php">Damien's post</a><span style="font-family:verdana;">. The article is about a company, Digital Reasoning, getting a patent on what sounded to me like contextual similarity. Their "white paper" makes reference to a </span><a style="font-family: verdana;" href="http://tinyurl.com/2h6nz4">patent number 7249117</a><span style="font-family:verdana;"> (via USPTO). Unlike a research paper, the patent document was quite difficult to read. 
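For readers wondering what "contextual similarity" amounts to: in the distributional tradition it means comparing words by the contexts they occur in. A generic textbook sketch (my own toy illustration, certainly not the patented method):

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Represent each word by counts of the words co-occurring within
    +/- window positions of it: its distributional profile."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vecs[word][sent[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0
```

Words that keep the same company (say, two beverage names) end up with high cosine similarity between their context vectors.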
I will get to it sometime later, but here is an extract from their press release about what their technology can do.</span><br /><br /><blockquote>* Learn the meanings of words, classes of words, and other symbols based on how they are used in context in natural language<br />* Create and manipulate models of this "meaning" - i.e. the mathematical patterns of usage - including the detection of groups or similar categories of words or development of hierarchies or creation of relationships between words<br />* Improve the models based on human feedback or using other structured information after model construction <br />* The representation or sharing of this model or learning in an ontology, graph structure, or programming languages</blockquote><br /><br /><span style="font-family:verdana;">Anyone from the ACL/ML/AI community will immediately recognize this and can start citing their favorite papers on these topics, going back at least a decade. A promotional video from the company on YouTube can be found </span><a style="font-family: verdana;" href="http://www.youtube.com/watch?v=R5ihr4kx3dQ">here</a><span style="font-family:verdana;">. Excerpt from the video: "... We treat the text representation of human language as a signal ... ". </span><br /><br /><span style="font-family:verdana;">I think everyone should stop taking patents seriously. 
Wishful thinking?</span>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7732226477253686468.post-587666097409617462007-08-02T19:40:00.000-07:002007-08-02T21:25:18.661-07:00Recommending scientific papers<div style="font-family: verdana;">I noticed a new feature in Citeseer which tries to suggest an "alternate document" for a paper.<br /><div><img id="BLOGGER_PHOTO_ID_5094312320294261490" style="margin: 0px auto 10px; display: block; text-align: center; width: 463px; height: 94px;" alt="" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_5QM7WPCmAo-7zKV1omxVaPkvyPolTemm4729Wv-8lQMVBI9dPyRDc_Jqq3wX1rY_GcurcgFc1bD3IOxOizkSq2I9-lzcYjtVJqIxSmDGUkdUdb6OWfjSLElCFdh6INfxRS5Ca6SncBqO/s400/recommendation.jpg" border="0" />Clearly it does not do what it claims to do, and it doesn't show up for all papers. (Experimental?) So, an interesting question: how does one recommend scientific papers? Something more than mere document similarity is required. If I am reading a CRF paper, there is no point in listing all papers containing similar words. Just listing nodes connected to inward and outward links of the paper in the citation graph won't suffice either. Ideal recommendations for a paper would depend on the role the user is playing. When I am reading a paper about some new topic, I would like to get pointed to original papers on the topic, some recent papers on the topic, and maybe some survey papers or books. On the other hand, when I am writing a paper, I would like to be pointed to all papers related to the topic (recall is more important than precision here, to avoid reviewer comments about a "missing reference") in some magical order that puts the papers most relevant to my work on top. Also, these papers might not be directly related through citations. If there is a recent related work in the Annals of Statistics, for instance, then it should show up when I am working on, say, approximate inference methods for graphical models. 
(Possible to deduce this from my previous queries?)<br /><br />Even though a scientific paper carries more information than its text alone, recommending or ranking papers appears to be quite challenging.<br /><br /></div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-36854876136378991152007-07-26T22:14:00.000-07:002007-07-26T23:31:05.068-07:00Google allows data binging for researchersGoogle has now opened access to its search and MT systems for university researchers, per <a href="http://googleresearch.blogspot.com/2007/07/drink-from-firehose-with-university.html">today's announcement</a> on their <a href="http://googleresearch.blogspot.com/">research blog</a>. The search API documentation does not mention any restriction on the number of queries that can be posted (the earlier limit was 1000). Whatever the number is, I am guessing it will be large (<span style="font-style: italic;">Drinking from the firehose?</span>). However, the MT API allows 1000 queries per day, with the documentation hinting that this need not be a hard limit.<br /><br />Looking at the search API output, two things I really miss are the number of hits and the snippet for each search result. The number of hits has been used in several papers for <a href="http://portal.acm.org/citation.cfm?id=1073153">interesting</a> <a href="http://www.cwi.nl/%7Epaulv/papers/amdug.pdf">results</a>. The other useful feature is snippets. 
Every search result from Google is accompanied by a small snippet extracted from the page, as shown below for an example query "Dekang Lin".<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRqzPY64M46DrD_bGAUjtxQCLmHLU-QuEsSJHRLDedEOubRlEIbaRBoP-ptX-CrhspNpQO0ljp3KJBxo47UHWHCkLskacNKMxPx7UcaTGRvLRgkBMAXmzKVl9wriceRRIvhF3siGecU438/s1600-h/snippet.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRqzPY64M46DrD_bGAUjtxQCLmHLU-QuEsSJHRLDedEOubRlEIbaRBoP-ptX-CrhspNpQO0ljp3KJBxo47UHWHCkLskacNKMxPx7UcaTGRvLRgkBMAXmzKVl9wriceRRIvhF3siGecU438/s400/snippet.jpg" alt="" id="BLOGGER_PHOTO_ID_5091749865496056530" border="0" /></a>The information in the snippets can be used as informative features in tasks such as <a href="http://www.cs.jhu.edu/%7Engarera/publications/snippetsSEMEVAL07.pdf">person name disambiguation</a>. (BTW, Dekang is now at Google.)<br /><br />Despite these minor quibbles, these new APIs will be quite useful to all of us and will certainly result in more papers on <a href="http://portal.acm.org/citation.cfm?id=1245144">Googleology</a>.<br /><br /><span style="font-style: italic;">Later addition:</span> It turns out we can sort of get the counts by repeatedly executing the request and counting the search results (only ten results per request), but the API caps this at 100 requests. That means you could get a maximum of 1000 results, which is not quite the same as "<span style="">Results <b>1</b> - <b>10</b> of about <b>779,000,000</b>". </span>Though that number is approximate, it is still indicative of how strong the query is w.r.t. the web.
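One concrete use of such hit counts (from the "results" papers linked above) is the Normalized Google Distance of Cilibrasi and Vitanyi. A minimal sketch, assuming you already have a GoogleCount-style call that returns raw page-hit counts for a term and for a pair of terms:

```python
import math

def ngd(x_hits, y_hits, xy_hits, total_pages):
    """Normalized Google Distance (Cilibrasi & Vitanyi): a smaller
    value means the two terms co-occur on the web more often than
    their individual frequencies alone would suggest. All arguments
    are raw page-hit counts; `total_pages` is the (estimated) size
    of the index."""
    fx, fy, fxy = math.log(x_hits), math.log(y_hits), math.log(xy_hits)
    n = math.log(total_pages)
    return (max(fx, fy) - fxy) / (n - min(fx, fy))

# With made-up counts, a frequently co-occurring pair scores lower
# (closer): ngd(1000, 1000, 900, 1e9) < ngd(1000, 1000, 10, 1e9)
```

The hit counts and index size here are placeholders; in practice they would come from the search API (or from the result-counting workaround described above).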
For example, GoogleCount("Horse+animal") >> GoogleCount("Horse+truck").Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-29368467247481523572007-07-24T09:53:00.000-07:002007-07-24T09:58:26.666-07:00Readings from SIGIR 2007<span style=";font-family:trebuchet ms;font-size:100%;" ><a href="http://www.sigir2007.org/">SIGIR 2007</a> is happening now in Amsterdam!<br /><br />Latent Concept Expansion Using Markov Random Fields, Donald Metzler, Bruce Croft<br /><br />Random Walks on the Click Graph, Nick Craswell, Martin Szummer<br /><br />Towards Automatic Extraction of Event and Place Semantics from Flickr Tags, Tye Rattenbury, Nathaniel Good, Mor Naaman<br /><br />Clustering of Documents with Local and Global Regularization, Fei Wang, Changshui Zhang, Tao Li<br /><br />Detecting, Categorizing and Clustering Entity Mentions in Chinese Text, Wenjie Li, Donglei Qian, Chunfa Yuan, Qin Lu<br /><br />Principles of Hash-based Text Retrieval, Benno Stein<br /><br />DiffusionRank: A Possible Penicillin for Web Spamming, Haixuan Yang, Irwin King, Michael R. Lyu<br /><br />Context Sensitive Stemming for Web Search, Fuchun Peng, Nawaaz Ahmed, Xin Li, Yumao Lu<br /><br />Combining Content and Link for Classification using Matrix Factorization, Shenghuo Zhu, Kai Yu, Yun Chi, Yihong Gong<br /><br />ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs, Yang Liu, Jimmy Huang, Aijun An, Xiaohui Yu<br /><br />Heavy-Tailed Distributions and Multi-Keyword Queries, Arnd Konig, Surajit Chaudhuri, Liying Sui, Kenneth Church</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-72726243983972133952007-07-23T15:56:00.000-07:002007-07-23T22:17:27.903-07:00Readings from AAAI 2007<a style="font-family: trebuchet ms;" href="http://www.aaai.org/Conferences/AAAI/2007/aaai07program.php">AAAI 2007</a><span style="font-family:trebuchet ms;"> is now going on in </span>Vancouver<span style="font-family:trebuchet ms;">.
Here is my selection of NLP and Learning papers I would like to know more about.</span><br /><br /><span style="font-size:100%;"><span style="font-family:trebuchet ms;">Deriving a Large-Scale Taxonomy from Wikipedia, </span><span style="font-family:trebuchet ms;">Simone Paolo Ponzetto, Michael Strube</span><br /><br /><span style="font-family:trebuchet ms;">Relation Extraction from Wikipedia Using Subtree Mining, </span><span style="font-family:trebuchet ms;">Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka</span><br /><br /><span style="font-family:trebuchet ms;">Finding Related Pages Using Green Measures: An Illustration with Wikipedia, </span><span style="font-family:trebuchet ms;">Yann Ollivier, Pierre Senellart</span><br /><br /><span style="font-family:trebuchet ms;">Graph Partitioning Based on Link Distributions, </span><span style="font-family:trebuchet ms;">Bo Long, Mark (Zhongfei) Zhang, Philip S. Yu</span><br /><br /><span style="font-family:trebuchet ms;">Semi-supervised Learning by Mixed Label Propagation, </span><span style="font-family:trebuchet ms;">Wei Tong, Rong Jin</span><br /><br /><span style="font-family:trebuchet ms;">Semi-Supervised Learning with Very Few Labeled Training Examples, </span><span style="font-family:trebuchet ms;">Zhi-Hua Zhou, De-Chuan Zhan, Qiang Yang</span><br /><br /><span style="font-family:trebuchet ms;">Clustering with Local and Global Regularization, </span><span style="font-family:trebuchet ms;">Fei Wang, Changshui Zhang, Tao Li</span><br /><br /><span style="font-family:trebuchet ms;">Isometric Projection, </span><span style="font-family:trebuchet ms;">Deng Cai, Xiaofei He, Jiawei Han</span><br /><br /><span style="font-family:trebuchet ms;">Improving Similarity Measures for Short Segments of Text, </span><span style="font-family:trebuchet ms;">Wen-tau Yih, Christopher Meek</span><br /><br /><span style="font-family:trebuchet ms;">Topic Segmentation Algorithms for Text Summarization and Passage Retrieval: An 
Exhaustive Evaluation, </span><span style="font-family:trebuchet ms;">Gaël Dias, Elsa Alves, José Gabriel Pereira Lopes</span><br /><br /><span style="font-family:trebuchet ms;">Robust Estimation of Google Counts for Social Network Extraction, </span><span style="font-family:trebuchet ms;">Yutaka Matsuo, Hironori Tomobe, Takuichi Nishimura</span><br /><br /><span style="font-family:trebuchet ms;">Harvesting Relations from the Web - Quantifiying the Impact of Filtering Functions, </span><span style="font-family:trebuchet ms;">Sebastian Blohm, Philipp Cimiano, Egon Stemle</span><br /><br /><span style="font-family:trebuchet ms;">Template-Independent News Extraction Based on Visual Consistency, </span><span style="font-family:trebuchet ms;">Shuyi Zheng, Ruihua Song, Ji-Rong Wen</span><br /><br /><span style="font-family:trebuchet ms;">Comprehending and Generating Apt Metaphors: A Web-driven, Case-based Approach to Figurative Language, </span><span style="font-family:trebuchet ms;">Tony Veale, Yanfen Hao</span><br /><br /><span style="font-family:trebuchet ms;">Mobile Service for Reputation Extraction from Weblogs - Public Experiment and Evaluation, </span><span style="font-family:trebuchet ms;">Takahiro Kawamura, Shinichi Nagano, Masumi Inaba, Yumiko Mizoguchi</span><br /><br /><span style="font-family:trebuchet ms;">The Impact of Time on the Accuracy of Sentiment Classifiers Created from a Web Log Corpus, </span><span style="font-family:trebuchet ms;">Kathleen T. Durant, Michael D. 
Smith</span><br /><br /><span style="font-family:trebuchet ms;">Nectar: Learning by Combining Observations and User Edits, </span><span style="font-family:trebuchet ms;">Vittorio Castelli, Lawrence Bergman, Daniel Oblinger</span><br /><br /><span style="font-family:trebuchet ms;">Multi-Label Learning by Instance Differentiation, </span><span style="font-family:trebuchet ms;">Min-Ling Zhang, Zhi-Hua Zhou</span><br /><br /><span style="font-family:trebuchet ms;">Extracting Influential Nodes for Information Diffusion on a Social Network, </span><span style="font-family:trebuchet ms;">Masahiro Kimura, Kazumi Saito, Ryohei Nakano</span><br /><br /><span style="font-family:trebuchet ms;">Temporal and Information Flow Based Event Detection from Social Text Streams, </span><span style="font-family:trebuchet ms;">Qiankun Zhao, Prasenjit Mitra, Bi Chen</span><br /><br /><span style="font-family:trebuchet ms;">Analyzing Reading Behavior by Blog Mining, </span><span style="font-family:trebuchet ms;">Tadanobu Furukawa, Mitsuru Ishizuka, Yutaka Matsuo, Ikki Ohmukai, Koki Uchiyama</span><br /></span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-35817764114562991112007-07-18T12:57:00.000-07:002007-07-18T13:06:55.109-07:00Reading List from KDD 2007<span style="font-family:trebuchet ms;"><a href="http://www.kdd2007.com/">KDD 2007</a> will be held Aug 12-15 in the neighborhood, in San Jose. Here is my selection:<br /><br />"Extracting Semantic Relations from Query Logs", Ricardo Baeza-Yates and Alessandro Tiberi<br /><br /></span><span style="font-family:trebuchet ms;">"Efficient Incremental Clustering with Constraints", Ian Davidson, S.S. 
Ravi, and Martin Ester<br /><br /></span><span style="font-family:trebuchet ms;">"A Probabilistic Framework for Relational Clustering", Bo Long, Zhongfei Zhang, and Philip S. Yu<br /><br /></span><span style="font-family:trebuchet ms;">"Tracking Multiple Topics for Finding Interesting Articles", Raymond Pon, Alfonso Cardenas, David Buttler, and Terence Critchlow<br /><br /></span><span style="font-family:trebuchet ms;">"Feature Selection Methods for Text Classification", Anirban Dasgupta, Petros Drineas, Boulos Harb, Vanja Josifovski, and Michael Mahoney<br /><br /></span><span style="font-family:trebuchet ms;">"Hierarchical Mixture Models: a Probabilistic Analysis", Mark Sandler<br /><br /></span><span style="font-family:trebuchet ms;">"Information distance from a question to an answer", Xian Zhang, Yu Hao, Xiaoyan Zhu, and Ming Li<br /><br /></span><span style="font-family:trebuchet ms;">"Statistical Change Detection for Multi-Dimensional Data", Xiuyao Song, Mingxi Wu, Chris Jermaine, and Sanjay Ranka<br /><br /></span><span style="font-family:trebuchet ms;">"Constraint-Driven Clustering", Rong Ge, Martin Ester, Wen Jin, and Ian Davidson<br /><br /></span><span style="font-family:trebuchet ms;">"Enhancing Semi-Supervised Clustering: A Feature Projection Perspective", Wei Tang, Hui Xiong, Shi Zhong, and Jie Wu</span><br /><br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7732226477253686468.post-42244232500716925502007-07-17T18:40:00.000-07:002007-07-18T00:09:04.862-07:00NLP in India?I was surprised to see <span class="entry-author-name"><a href="http://blog.outerthoughts.com/">Alex</a>'s <a href="http://blog.outerthoughts.com/2007/07/link-nlp-the-indian-perspective/">post</a>, with which I don't fully agree.</span><br /><blockquote><span style="font-style: italic;">... 
because NLP is so underdeveloped in India, even undergraduate-level projects may be contributing to the cutting edge of research.</span><br /></blockquote>Turns out he was referring to <a href="http://technigal.wordpress.com/2007/07/16/natural-language-processing-the-indian-perspective/">this post</a> from an undergrad which tries to give "the Indian perspective", rather inaccurately. Having worked on NLP at one of the <a href="http://www.iitm.ac.in/">IIT</a>s, I am compelled to write from a grad student perspective. Sunayana's post is interesting as it brings out several issues in Indic computing.<br /><br />1. Lack of annotated data - corpora, treebanks, and aligned texts, which are the sinews and bones of any language processing system. Resources exist, largely due to the efforts of <a href="http://www.ciil.org/">CIIL</a>, various universities, and other government agencies, but these are dwarfed by the resources that exist for other languages, like English or the European languages.<br /><br />However, the rich morphology of Indian languages can be exploited to reduce the amount of annotated data required for certain tasks, for instance <a href="http://www.cse.iitb.ac.in/%7Epb/papers/ACL-2006-Hindi-POS-Tagging.pdf">POS tagging</a>.<br /><br />2. Encoding issues - As rightly pointed out by Sunayana, before the adoption of Unicode, several data sources were locked up in the fonts they used. But things are changing: there is more Indian-language content in Unicode today than ever. Websites like BBC and Wikipedia are producing a lot of content in Unicode for those interested in collecting monolingual, comparable corpora. 
A cursory glance at <a href="http://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics">Wikipedia statistics</a> shows that the number of articles in, say, <a href="http://stats.wikimedia.org/EN/ChartsWikipediaHI.htm">Hindi</a> or <a href="http://stats.wikimedia.org/EN/ChartsWikipediaTA.htm">Tamil</a> has more than doubled in the past six months.<br /><br />3. Visibility - While there has been an increasing trend to publish in reputed conferences like ICML or ACL, more participation is certainly desirable. <a href="http://www.ijcai-07.org/">IJCAI 2007</a> was held in India, and if you are around, I highly recommend submitting to (sub. deadline: Jul 31st) and/or attending <a href="http://www.ijcnlp2008.org/">IJCNLP 2008</a>.<br /><br />This is an exciting time to do NLP research on Indian languages. There are both corporate and government motivations, which translate into grants and support for universities. The group at IIT Bombay, for example, implemented and deployed local-language systems for helping farmers. Similar efforts have been made by other institutes. Microsoft Research at Bangalore and IBM Research at New Delhi and Bangalore are working on various projects on Indian languages, including speech recognition.<br /><br />At the end of all this, I must partially agree with the quote from Alex's blog. Yes, some undergrads do make brilliant contributions, simply because of what they have in their bones. This is true for any country or university.Unknownnoreply@blogger.com2