The frequentist exclaimed, "All your Bayes are belong to us!" to which the Bayesian responded, "Well, it depends."
Wednesday, October 17, 2007
Thursday, September 20, 2007
For a full list see here. Some papers I want to read based on my current interests:
Random Projections for Manifold Learning
Chinmay Hegde, Michael Wakin, Richard Baraniuk
The Distribution Family of Similarity Distances
Gertjan Burghouts, Arnold Smeulders, Jan-Mark Geusebroek
Michael Gashler, Dan Ventura, Tony Martinez
A learning framework for nearest neighbor search
Lawrence Cayton, Sanjoy Dasgupta
Learning Bounds for Domain Adaptation
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Convex Relaxations of EM
Yuhong Guo, Dale Schuurmans
A Randomized Algorithm for Large Scale Support Vector Learning
Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan
Bundle Methods for Machine Learning
Alex Smola, S V N Vishwanathan, Quoc Le
Regularized Boost for Semi-Supervised Learning
Ke Chen, Shihai Wang
Learning the structure of manifolds using random projections
Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, Nakul Verma
A complexity measure for intuitive theories
Charles Kemp, Noah Goodman, Joshua Tenenbaum
Saturday, August 18, 2007
Those of us who were at EMNLP-CONLL 2007 remember the "NLP and Global Warming" exchange between James Clarke, Jason Eisner, and Dan Bikel at the Q/A session of the Clarke and Lapata paper. The transcript of this funny conversation is now online, thanks to Jason.
I really liked Hal's ending remark.
Wednesday, August 15, 2007
Wired has an article about spock.com, a people search engine that combines crawled and user-added content. From the few searches I did, it looks better suited to celebrity names than to regular people with web content. For instance, searching a name like "David Smith" produces these results. Of the top 10 results, only 3 actually have the name "David Smith" or something close to it, and the first result is not one of them. Compare this with a general-purpose search engine like Google. Among a dozen random NLP/ML academic names (professors) I tried, it only got Jason Eisner and Tom Mitchell correct. One reason for this poor recall is probably that they don't get content from user home pages.
(Some sites where this data is derived from include MySpace, Friendster, IMDB, Wikipedia, ratemyprofessors.com, etc.)
Nevertheless, this website is representative of the interesting KDD-style problems that one could work on with people names. It is also interesting because the people names we look for fall in the "long tail", without sufficient data to support them, which calls for clever machine learning techniques.
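The informal "David Smith" check above (3 name matches in the top 10) is just precision at 10. Here is a minimal sketch of that measurement. The helper names, the token-matching rule, and the result titles are all made up for illustration; none of them come from spock.com.

```python
def precision_at_k(results, is_relevant, k=10):
    """Fraction of the top-k results that are judged relevant."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for r in top if is_relevant(r)) / len(top)


def name_match(query):
    """Hypothetical relevance test: the title contains every token of the name."""
    tokens = query.lower().split()

    def matches(title):
        words = title.lower().split()
        return all(t in words for t in tokens)

    return matches


# Made-up result titles, for illustration only.
titles = [
    "Jane Roe", "David Smith", "John Doe", "D. Smithson", "David Smith Jr.",
    "Mary Major", "Richard Roe", "Smith David", "Ann Other", "Joe Bloggs",
]

print(precision_at_k(titles, name_match("David Smith")))
```

A real evaluation would need a better relevance judgment than token matching, since the hard part of people search is telling apart different people who share a name.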
Sunday, August 12, 2007
I was led to this article on Forbes via Damien's post. The article is about a company, Digital Reasoning, getting a patent on what sounded to me like contextual similarity. Their "white paper" makes reference to patent number 7249117 (via USPTO). Unlike research papers, the patent document was very difficult to read. I will get to it sometime later, but here is an extract from their press release about what their technology can do.
* Learn the meanings of words, classes of words, and other symbols based on how they are used in context in natural language
* Create and manipulate models of this "meaning" - i.e. the mathematical patterns of usage - including the detection of groups or similar categories of words or development of hierarchies or creation of relationships between words
* Improve the models based on human feedback or using other structured information after model construction
* The representation or sharing of this model or learning in an ontology, graph structure, or programming languages
Anyone from the ACL/ML/AI community can immediately recognize this and start citing their favorite papers on these topics starting from at least a decade ago. A promotional video from the company on YouTube can be found here. Excerpt from the video: "... We treat the text representation of human language as a signal ... ".
I think everyone should stop taking patents seriously. Wishful thinking?
Thursday, August 2, 2007
In spite of more information being present in a scientific paper than its text, recommending or ranking papers appears to be quite challenging.
Thursday, July 26, 2007
Google has now opened access to its search and MT systems to university researchers, according to today's announcement on their research blog. The search API documentation does not mention any restriction on the number of queries that can be posted for search (the earlier limit was 1000). Whatever the number is, I am guessing it will be large (drinking from the firehose?). However, the MT API allows 1000 queries per day, with the documentation hinting that this need not be a hard limit.
Looking at the search API output, two things I really miss are the number of hits and the snippet for each search result. The number of hits has been used in several papers for interesting results. The other useful feature is snippets. Every search result from Google is accompanied by a small snippet extracted from the page, as shown below for an example query "Dekang Lin".
The information in the snippets can be used as informative features in different tasks like this one in person name disambiguation. (BTW, Dekang is now at Google)
Despite these minor quibbles, these new APIs will be quite useful to all of us and will certainly result in more papers on Googleology.
Later addition: Turns out we can sort of get the counts by repeatedly executing the request (only ten results per request) and counting the results returned, but the API caps the number of requests at 100. That means you can get a maximum of 1000 results, which is not quite the same as "Results 1 - 10 of about 779,000,000". Though that number is approximate, it is still indicative of how strong the query is w.r.t. the web. For example, GoogleCount("Horse+animal") >> GoogleCount("Horse+truck").
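The counting workaround can be sketched as follows. This assumes the cap of 100 applies to the number of requests, which is what the 1000-result maximum implies. The `fetch_page` callable and the `fake_backend` stand-in are hypothetical and are not the actual Google API.

```python
def count_results(fetch_page, query, page_size=10, max_requests=100):
    """Count hits by paging until a short page or the request cap is reached.

    fetch_page(query, start) is a hypothetical callable that returns a list
    of at most page_size results beginning at offset start.
    """
    total = 0
    for i in range(max_requests):
        page = fetch_page(query, i * page_size)
        total += len(page)
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing more
    return total


def fake_backend(n_hits, page_size=10):
    """Stand-in for a real search API that has n_hits matching documents."""
    def fetch_page(query, start):
        return list(range(start, min(start + page_size, n_hits)))
    return fetch_page


print(count_results(fake_backend(37), "Horse+truck"))
print(count_results(fake_backend(779_000_000), "Horse+animal"))
```

With the cap in place, any query with more than 1000 hits saturates at the same count, so comparisons like GoogleCount("Horse+animal") >> GoogleCount("Horse+truck") only work when at least one of the two counts stays below the cap.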
Tuesday, July 24, 2007
SIGIR 2007 is happening now in Amsterdam!
Latent Concept Expansion Using Markov Random Fields, Donald Metzler, Bruce Croft
Random Walks on the Click Graph, Nick Craswell, Martin Szummer
Towards Automatic Extraction of Event and Place Semantics from Flickr Tags, Tye Rattenbury, Nathaniel Good, Mor Naaman
Clustering of Documents with Local and Global Regularization, Fei Wang, Changshui Zhang, Tao Li
Detecting, Categorizing and Clustering Entity Mentions in Chinese Text, Wenjie Li, Donglei Qian, Chunfa Yuan, Qin Lu
Principles of Hash-based Text Retrieval, Benno Stein
DiffusionRank: A Possible Penicillin for Web Spamming, Haixuan Yang, Irwin King, Michael R. Lyu
Context Sensitive Stemming for Web Search, Fuchun Peng, Nawaaz Ahmed, Xin Li, Yumao Lu
Combining Content and Link for Classification using Matrix Factorization, Shenghuo Zhu, Kai Yu, Yun Chi, Yihong Gong
ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs, Yang Liu, Jimmy Huang, Aijun An, Xiaohui Yu
Heavy-Tailed Distributions and Multi-Keyword Queries, Arnd Konig, Surajit Chaudhuri, Liying Sui, Kenneth Church
Monday, July 23, 2007
AAAI 2007 is now going on in Vancouver. Here is my selection of NLP and Learning papers I would like to know more about.
Deriving a Large-Scale Taxonomy from Wikipedia, Simone Paolo Ponzetto, Michael Strube
Relation Extraction from Wikipedia Using Subtree Mining, Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka
Finding Related Pages Using Green Measures: An Illustration with Wikipedia, Yann Ollivier, Pierre Senellart
Graph Partitioning Based on Link Distributions, Bo Long, Mark (Zhongfei) Zhang, Philip S. Yu
Semi-supervised Learning by Mixed Label Propagation, Wei Tong, Rong Jin
Semi-Supervised Learning with Very Few Labeled Training Examples, Zhi-Hua Zhou, De-Chuan Zhan, Qiang Yang
Clustering with Local and Global Regularization, Fei Wang, Changshui Zhang, Tao Li
Isometric Projection, Deng Cai, Xiaofei He, Jiawei Han
Improving Similarity Measures for Short Segments of Text, Wen-tau Yih, Christopher Meek
Topic Segmentation Algorithms for Text Summarization and Passage Retrieval: An Exhaustive Evaluation, Gaël Dias, Elsa Alves, José Gabriel Pereira Lopes
Robust Estimation of Google Counts for Social Network Extraction, Yutaka Matsuo, Hironori Tomobe, Takuichi Nishimura
Harvesting Relations from the Web - Quantifiying the Impact of Filtering Functions, Sebastian Blohm, Philipp Cimiano, Egon Stemle
Template-Independent News Extraction Based on Visual Consistency, Shuyi Zheng, Ruihua Song, Ji-Rong Wen
Comprehending and Generating Apt Metaphors: A Web-driven, Case-based Approach to Figurative Language, Tony Veale, Yanfen Hao
Mobile Service for Reputation Extraction from Weblogs - Public Experiment and Evaluation, Takahiro Kawamura, Shinichi Nagano, Masumi Inaba, Yumiko Mizoguchi
The Impact of Time on the Accuracy of Sentiment Classifiers Created from a Web Log Corpus, Kathleen T. Durant, Michael D. Smith
Nectar: Learning by Combining Observations and User Edits, Vittorio Castelli, Lawrence Bergman, Daniel Oblinger
Multi-Label Learning by Instance Differentiation, Min-Ling Zhang, Zhi-Hua Zhou
Extracting Influential Nodes for Information Diffusion on a Social Network, Masahiro Kimura, Kazumi Saito, Ryohei Nakano
Temporal and Information Flow Based Event Detection from Social Text Streams, Qiankun Zhao, Prasenjit Mitra, Bi Chen
Analyzing Reading Behavior by Blog Mining, Tadanobu Furukawa, Mitsuru Ishizuka, Yutaka Matsuo, Ikki Ohmukai, Koki Uchiyama
Wednesday, July 18, 2007
KDD 2007 will be held on Aug 12-15, in the neighborhood, in San Jose. Here is my selection:
"Extracting Semantic Relations from Query Logs", Ricardo Baeza-Yates and Alessandro Tiberi
"Efficient Incremental Clustering with Constraints", Ian Davidson, S.S. Ravi, and Martin Ester
"A Probabilistic Framework for Relational Clustering", Bo Long, Zhongfei Zhang, and Philip S. Yu
"Tracking Multiple Topics for Finding Interesting Articles", Raymond Pon, Alfonso Cardenas, David Buttler, and Terence Critchlow
"Feature Selection Methods for Text Classification", Anirban Dasgupta, Petros Drineas, Boulos Harb, Vanja Josifovski, and Michael Mahoney
"Hierarchical Mixture Models: a Probabilistic Analysis", Mark Sandler
"Information distance from a question to an answer", Xian Zhang, Yu Hao, Xiaoyan Zhu, and Ming Li
"Statistical Change Detection for Multi-Dimensional Data", Xiuyao Song, Mingxi Wu, Chris Jermaine, and Sanjay Ranka
"Constraint-Driven Clustering", Rong Ge, Martin Ester, Wen Jin, and Ian Davidson
"Enhancing Semi-Supervised Clustering: A Feature Projection Perspective", Wei Tang, Hui Xiong, Shi Zhong, and Jie Wu
Tuesday, July 17, 2007
... because NLP is so underdeveloped in India, even undergraduate-level projects may be contributing to the cutting edge of research.
Turns out he was referring to this post from an undergrad, which tries to give "the Indian perspective", rather inaccurately. Having worked on NLP at one of the IITs, I am compelled to write from a grad student perspective. Sunayana's post is interesting as it brings out several issues in Indic computing.
1. Lack of annotation data - corpora, treebanks, and aligned texts, which are the sinews and bones of any language processing system. Resources exist, largely due to the efforts of CIIL, various universities and other government agencies, but these are dwarfed by the resources that exist for other languages, like English or the European languages.
However, the rich morphology in Indian languages can be exploited to mitigate the amount of annotation data required for certain tasks, for instance POS tagging.
2. Encoding issues - As rightly pointed out by Sunayana, before the adoption of Unicode, several data sources were locked up in the fonts they used. But things are changing: there is more Indian language content in Unicode today than ever. Websites like BBC and Wikipedia are putting out a lot of content in Unicode for those interested in collecting monolingual or comparable corpora. A cursory glance at Wikipedia statistics shows that the number of articles in, say, Hindi or Tamil has more than doubled in the past six months.
3. Visibility - While there has been an increasing trend to publish in reputed conferences like ICML or ACL, more participation is certainly desirable. IJCAI 2007 was held in India, and if you are around, I highly recommend submitting to (submission deadline: Jul 31st) and/or attending IJCNLP 2008.
This is an exciting time to do NLP research on Indian languages. There are both corporate and government motivations, which translate into grants and support for universities. The group at IIT Bombay, for example, implemented and deployed local-language systems for helping farmers. Similar efforts have been made by other institutes. Microsoft Research in Bangalore, and IBM Research in New Delhi and Bangalore, are working on various projects on Indian languages, including speech recognition.
At the end of all this, I must partially agree with the quote I took from Alex's blog. Yes, some undergrads do make brilliant contributions, but that is simply because of what they have in their bones. This is true for any country or university.
Using data on 11,000 graduate students from 100 departments over a 20 year period, I test whether graduate student outcomes (graduation rates, time to degree, publication success, and initial job placement) differ based on a student’s gender and marital status. I find that married men have better outcomes across every measure than single men. Married women do no worse than single women on any measure and actually have more publishing success and complete their degree in less time. The outcomes of cohabiting students generally fall between those of single and married students.
Monday, July 2, 2007
John Langford recommends:
Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
Some papers I would like to read right away:
Discriminative Learning for Differing Training and Test Distributions
Steffen Bickel - Max Planck Institute for Computer Science, Germany
Michael Brückner - Max Planck Institute for Computer Science, Germany
Tobias Scheffer - Max Planck Institute for Computer Science, Germany
Sparse Eigen Methods by D.C. Programming
Bharath Sriperumbudur - University of California, San Diego, USA
David Torres - University of California, San Diego, USA
Gert Lanckriet - University of California, San Diego, USA
Graph Clustering With Network Structure Indices
Matthew J. Rattigan - University of Massachusetts Amherst, USA
Marc Maier - University of Massachusetts Amherst, USA
David Jensen - University of Massachusetts Amherst, USA
Fast and Effective Kernels for Relational Learning from Texts
Alessandro Moschitti - University of Trento, Italy
Fabio Massimo Zanzotto - University of Rome, Italy
Three New Graphical Models for Statistical Language Modelling
Andriy Mnih - University of Toronto, Canada
Geoffrey Hinton - University of Toronto, Canada
Simple, Robust, Scalable Semi-supervised Learning via Expectation Regularization
Gideon S. Mann - University of Massachusetts, USA
Andrew McCallum - University of Massachusetts, USA
The Rendezvous Algorithm: Multiclass Semi-Supervised Learning with Markov Random Walks
Arik Azran - University of Cambridge, UK
Information-Theoretic Metric Learning (one of the best paper awardees)
Jason V. Davis - University of Texas at Austin, USA
Brian Kulis - University of Texas at Austin, USA
Prateek Jain - University of Texas at Austin, USA
Suvrit Sra - University of Texas at Austin, USA
Inderjit S. Dhillon - University of Texas at Austin, USA
Agnostic Active Learning - not from ICML 2007, but exciting: it was discovered last year, and theoretical bounds were proved this year at ICML 2007.
A Bound on the Label Complexity of Agnostic Active Learning
Steve Hanneke - Carnegie Mellon University, USA
Stumbled on Alekh Agarwal's tech report on Kernels. A good survey on kernel methods that includes recent work on this topic.
Another place to begin would be Thomas Gartner's SIGKDD explorations survey paper.
Thursday, June 7, 2007
Monday, April 30, 2007
A GRAPH-BASED APPROACH TO NAMED ENTITY CATEGORIZATION IN WIKIPEDIA USING CONDITIONAL RANDOM FIELDS
Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto
A TOPIC MODEL FOR WORD SENSE DISAMBIGUATION
Jordan Boyd-Graber, Xiaojin Zhu and David Blei
BOOTSTRAPPING INFORMATION EXTRACTION FROM FIELD BOOKS
Sander Canisius and Caroline Sporleder
CROSS-LINGUAL DISTRIBUTIONAL PROFILES OF CONCEPTS FOR MEASURING
Saif Mohammad, Iryna Gurevych, Graeme Hirst and Torsten Zesch
CRYSTAL: ANALYZING PREDICTIVE OPINIONS ON THE WEB
Soo-Min Kim and Eduard Hovy
EXPLORATIONS IN AUTOMATIC BOOK SUMMARIZATION
Rada Mihalcea and Hakan Ceylan
LARGE SCALE NAMED ENTITY DISAMBIGUATION BASED ON WIKIPEDIA DATA
LEXICAL SEMANTIC RELATEDNESS WITH RANDOM GRAPH WALKS
Thad Hughes and Daniel Ramage
TOWARDS ROBUST UNSUPERVISED PERSONAL NAME DISAMBIGUATION
Ying Chen and James Martin
WORD SENSE DISAMBIGUATION INCORPORATING LEXICAL AND STRUCTURAL SEMANTIC
Takaaki Tanaka, Francis Bond, Timothy Baldwin, Sanae Fujita and Chikara
Sunday, April 15, 2007
The constant is defined as
For more details on the setup, refer here.
Thursday, March 22, 2007
Wednesday, March 21, 2007
Bafia (Mbam Cameroon) Wayumbe....
Bagesu (Central Africa) Watulire?
Bagesu (Central Africa) [answer] Natulire nili mlahi
Bajawa (Indonesia) ['where are you going'] Male de?
Bakitara (Central Africa) [morning] Oirwota?
Bakitara (Central Africa) [answer] Ndabanta
Bakitara (Central Africa) [after absence] Mirembe
Bakweri (Cameroon) [morning] O wusi
Balanta (Guinea-Bissau) Abala, lite utchole
Balinese (Bali) Om swastyastu
Balinese (Bali) [reply] Om shanti shanti shanti
Balti (India, Pakistan) Yang chi halyo?
Balti (India, Pakistan) [answer] Lyakhmo
That was "hello" in some languages. Jennifer Runner has this page with "Hello" and other pleasantries in a large number of languages. Don't forget to check her Internet Language Resources page.
Sunday, March 18, 2007
Computing Semantic Similarity between Skill Statements for Approximate Matching
Feng Pan and Robert Farrell
Extracting Appraisal Expressions
Kenneth Bloom, Shlomo Argamon and Navendu Garg
Unsupervised Resolution of Objects and Relations on the Web
Alexander Yates and Oren Etzioni
Near-Synonym Choice in an Intelligent Thesaurus
Using Wikipedia for Automatic Word Sense Disambiguation
An integrated approach to measuring Semantic Similarity between Words using Information available on the Web
Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka
Improving Relation Extraction Using Domain Information
Alfio Massimiliano Gliozzo, Marco Pennacchiotti and Patrick Pantel
High-Performance, Language-Independent Morphological Segmentation
Sajib Dasgupta and Vincent Ng
A Systematic Exploration of The Feature Space for Relation Extraction
Jing Jiang and ChengXiang Zhai
Data-Driven Graph Construction for Semi-Supervised Graph-Based Learning in NLP
Andrei Alexandrescu and Katrin Kirchhoff
Towards Domain-Independent Information Extraction from Web Tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Kroepl, Bernhard Pollak
Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds
A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng
Scaling Up All-Pairs Similarity Search
Roberto Bayardo, Yiming Ma, Ramakrishnan Srikant
Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography
Lars Backstrom, Cynthia Dwork, Jon Kleinberg
Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs
Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai
Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
Using Google Distance to weight approximate ontology matches
Risto Gligorov, Zharko Aleksovski, Warner ten Kate, Frank van Harmelen
Papers to read
1. Banko and Brill, ACL 2001
2. Deepak Ravichandran, ACL 2005
- Delip Rao at 8:38 PM
Friday, January 26, 2007
After much struggling with my Ubuntu, I finally got Beryl running on my laptop.
Getting it to work with ATI was always a problem (for me) until this guide.
Incidentally, this is just a few days from the Vista launch. Who cares about Vista anymore?
Thursday, January 25, 2007
Ever wondered what happens when a message is translated from one language to another, and so on, and finally back to the source language?
Even a simple sentence, I am fine, gets distorted as:
They are much bond.
They are much plugging.
They are covering much.
It uses BabelFish underneath; I wouldn't be surprised if Google or any other MT system showed similar output.
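For anyone who wants to try the same experiment on another MT system, here is a minimal sketch of the round trip. The `translate(text, src, dst)` callable is hypothetical and stands for whatever MT system you have access to; it is not the BabelFish interface. The `tagging_translate` function below is a toy that only records the hops.

```python
def round_trip(text, chain, translate):
    """Translate text through each language in chain, then back to the first.

    translate(text, src, dst) is a hypothetical callable wrapping an MT system.
    Returns the list of outputs after every hop; the last one is back in the
    source language.
    """
    path = list(chain) + [chain[0]]
    outputs = []
    for src, dst in zip(path, path[1:]):
        text = translate(text, src, dst)
        outputs.append(text)
    return outputs


def tagging_translate(text, src, dst):
    """Toy stand-in for an MT system: it only records which hop was taken."""
    return f"{text} [{src}->{dst}]"


for step in round_trip("I am fine", ["en", "fr", "de"], tagging_translate):
    print(step)
```

Keeping every intermediate output, rather than only the final one, makes it easier to see at which hop the meaning starts to drift.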
Thursday, January 18, 2007
Machine Learning. Where are we heading? Tom Mitchell says it all - machine learning != statistics.
How to write a machine learning paper?