Tuesday, July 17, 2007

NLP in India?

I was surprised to see

... because NLP is so underdeveloped in India, even undergraduate-level projects may be contributing to the cutting edge of research.
It turns out he was referring to this post by an undergrad, which tries to give "the Indian perspective", rather inaccurately. Having worked on NLP at one of the IITs, I am compelled to write from a grad student's perspective. Sunayana's post is interesting, as it brings out several issues in Indic computing.

1. Lack of annotated data - corpora, treebanks, and aligned texts, which are the sinews and bones of any language processing system. Resources exist, largely due to the efforts of CIIL, various universities, and other government agencies, but these are dwarfed by the resources that exist for other languages, such as English or the European languages.

However, the rich morphology of Indian languages can be exploited to reduce the amount of annotated data required for certain tasks, such as POS tagging.

2. Encoding issues - As Sunayana rightly points out, before the adoption of Unicode, several data sources were locked up in the fonts they used. But things are changing: there is more Indian-language content in Unicode today than ever. Websites like BBC and Wikipedia are putting out a lot of content in Unicode for those interested in collecting monolingual or comparable corpora. A cursory glance at Wikipedia statistics shows that the number of articles in, say, Hindi or Tamil has more than doubled in the past six months.

3. Visibility - While there has been an increasing trend to publish in reputed conferences like ICML or ACL, more participation is certainly desirable. IJCAI 2007 was held in India, and if you are around, I highly recommend submitting to (submission deadline: July 31) and/or attending IJCNLP 2008.
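
On the encoding point above, here is a minimal sketch of how one might filter crawled text for a monolingual corpus once the content is in Unicode. It is only an illustration: the function name script_share and the 0.5 threshold in the usage note are my own assumptions, not part of any existing tool, and it relies on nothing beyond Python's standard unicodedata module. It measures what fraction of the letters and combining marks in a string belong to a given script, judged by the Unicode character name prefix.

```python
import unicodedata


def script_share(text: str, script: str) -> float:
    """Fraction of letters and combining marks in `text` whose Unicode
    character name starts with `script` (e.g. "DEVANAGARI", "TAMIL").

    Hypothetical helper for illustration only.
    """
    # Indic vowel signs are combining marks (category M*), not letters,
    # so count both L* and M* categories.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(1 for ch in relevant
               if unicodedata.name(ch, "").startswith(script))
    return hits / len(relevant)
```

A crawler could then keep a line for a Hindi corpus only if, say, script_share(line, "DEVANAGARI") is above 0.5. That cutoff is an arbitrary assumption and would need tuning on real data. Text locked in legacy font encodings would still need a font-specific conversion before a check like this is meaningful.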

This is an exciting time to do NLP research on Indian languages. There are both corporate and government motivations, which translate into grants and support for universities. The group at IIT Bombay, for example, implemented and deployed local-language systems to help farmers. Similar efforts have been undertaken by other institutes. Microsoft Research in Bangalore and IBM Research in New Delhi and Bangalore are working on various projects on Indian languages, including speech recognition.

At the end of all this, I must partially agree with the quote I took from Alex's blog. Yes, some undergrads do make brilliant contributions, simply because of what they have in their bones. This is true of any country or university.


Alexandre said...

Thanks "Res"? (Your profile link is missing).

You have good arguments and great links. I have updated my article to point to your blog. I also subscribed to your blog myself.

Any reason you are not on the ACL Wiki list? You certainly know of it, but it only grows by individuals adding themselves (and others) to it.

Also, if you have anything to contribute on Indian NLP to the "State of the art" NLP Wiki, I think it would be great.

delip said...

Thanks Alex. I sure will add myself to the wiki. I looked at the State of the art pages recently. Will be updating them soon.