Tuesday, December 26, 2006

Book Review: Geometry and Meaning

Title: Geometry and Meaning
Author: Dominic Widdows
URL : http://infomap.stanford.edu/book/

This is an excellent book for learning about IR, accessible even to high school students. However, if you have already read a few papers in this field, reading it is a waste of time, except for the jewels in boxes scattered throughout the book.

Thursday, December 21, 2006

semantic ambiguity?

(c) Bill Watterson

lectures on cognitive computing

A series of twelve lectures from the conference on cognitive computing at IBM's Almaden
Research lab with topics ranging from memory to consciousness to thought.

Lecture videos


A good review article on smoothing by Stan Chen and Roni Rosenfeld.
A thorough treatment can be found here.

Wednesday, December 20, 2006

Accuracy vs. Perplexity

If model A has higher accuracy than model B, does it necessarily imply
perplexity(A) < perplexity(B)?

Jason's reply:

No, that is not implied.
Accuracy = how correct is the highest-probability hypothesis?
Perplexity = how probable is the correct hypothesis?
(or more generally, how probable is the observed data?)

So they are really measuring different things.
Accuracy is what you really care about, in a sense,
but (1) it is only defined if you have supervised data,
(2) it requires an evaluation method for measuring degree
of correctness, (3) it is usually not a continuous function
of the parameters (since an epsilon change in the parameters
may not change which hypothesis has the highest probability)
and is therefore hard to optimize.

I usually recommend reporting both, which has become
the convention in speech recognition, where people report
WER (word error rate) and perplexity.
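
A toy illustration of the point above (my own sketch, not part of Jason's reply): the labels, probabilities, and helper names below are all made up. Model A is barely right on three examples and confidently wrong on one; model B is barely wrong on two and confidently right on two. A wins on accuracy, yet B wins on perplexity.

```python
import math

def accuracy(predictions, gold):
    """Fraction of examples whose highest-probability label is the gold label."""
    correct = sum(1 for dist, y in zip(predictions, gold)
                  if max(dist, key=dist.get) == y)
    return correct / len(gold)

def perplexity(predictions, gold):
    """exp of the average negative log-probability assigned to the gold labels."""
    log_prob = sum(math.log(dist[y]) for dist, y in zip(predictions, gold))
    return math.exp(-log_prob / len(gold))

def binary(p_correct, y):
    """Hypothetical helper: a two-label distribution giving p_correct to label y."""
    other = "b" if y == "a" else "a"
    return {y: p_correct, other: 1.0 - p_correct}

gold = ["a", "b", "a", "b"]

# Model A: barely right three times, confidently wrong once.
model_a = [binary(p, y) for p, y in zip([0.51, 0.51, 0.51, 0.01], gold)]
# Model B: barely wrong twice, confidently right twice.
model_b = [binary(p, y) for p, y in zip([0.49, 0.49, 0.90, 0.90], gold)]

for name, model in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(model, gold), round(perplexity(model, gold), 2))
```

The single confident mistake dominates A's perplexity, while accuracy does not care how confident a wrong answer was.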

Finding concordances on the web

A cool website that does this: http://www.webcorp.org.uk
It allows searching with patterns but is awfully slow.

TKDE special issue on semantic web

It's here ... slurrrp! yum yum.

Interesting papers to add to my general reading list:
1. From Wrapping to Knowledge
2. Mining Generalized Associations of Semantic Relations from Textual Web Content
3. A Taxonomy Learning Method and Its Application to Characterize a Scientific Web Community

You can access all articles here

Monday, December 18, 2006

New IR book

Just discovered this book by Chris Manning, Prabhakar Raghavan, and Hinrich Schütze.
This is going to be on my reading list for the IR course next spring.

Hidden Markov Models

After a brief venture into developing HMMs for sequence labeling in the NLP class, I am planning to use the HTK toolkit for more fun!

Get it today from: http://htk.eng.cam.ac.uk/
A tutorial style manual on HTK can be obtained here.

Also don't forget to read Hal's wonderful writeup on sequence labeling.

Update: If you are planning to write an HMM tagger of your own, in addition to the above handout, have a look at the following:

1. A practical Part-of-Speech tagger
A general introduction with the right mix of math and implementation details.
2. Equations for Part-of-Speech tagging
Derives all equations for PoS tagging using HMMs from first principles
(smoothing & EM included)

Though not related to HMMs, TagChunk by
Hal Daume is another approach to sequence labeling (software included).
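
For anyone writing their own HMM tagger, here is a minimal sketch of the decoding step (Viterbi) for a bigram HMM. It is my own illustration and not taken from the handouts above: the tag set and all probabilities are invented toy numbers, and flooring unseen events at a tiny constant is only a crude stand-in for the proper smoothing those handouts describe.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Probabilities are plain dicts; unseen events are floored at `floor`
    (an assumption made here to keep the sketch short, not real smoothing).
    """
    if not words:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best tag path ending in tag t at word i
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((s, best[i - 1][s] + lp(trans_p[s].get(t, 0.0))) for s in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model: every number below is made up for illustration.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.6, "walk": 0.4},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

Working in log space avoids underflow on long sentences; in a real tagger the three tables would be estimated from a tagged corpus (or with EM) and smoothed.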

Adversarial Information Retrieval 2007


1. AIRWeb'07 Topics
2. Web Spam Challenge
3. Timeline
4. Organizers and Program Committee


Adversarial Information Retrieval addresses tasks such as gathering,
indexing, filtering, retrieving and ranking information from collections
wherein a subset has been manipulated maliciously. On the Web, the
predominant form of such manipulation is "search engine spamming" or
spamdexing, i.e., malicious attempts to influence the outcome of ranking
algorithms, aimed at getting an undeserved high ranking for some items
in the collection.

We solicit both full and short papers on any aspect of adversarial
information retrieval on the Web. Particular areas of interest include,
but are not limited to:

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging

Proceedings of the workshop will be included in the ACM Digital Library.
Full papers are limited to 8 pages; work-in-progress papers will be permitted 4 pages.

For more information, see


This year, we are introducing a novel element: a Web Spam Challenge for
testing web spam detection systems. We will be using the WEBSPAM-UK2006
collection for Web Spam Detection.

The collection includes a large set of web pages, a web graph, and
human-provided labels for a set of hosts. We will also provide a set of
features extracted from the contents and links in the collection, which
may be used by the participant teams in addition to any automatic
technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions
(normal/spam) for all unlabeled hosts in the collection. Predictions
will be evaluated and results will be announced at the AIRWeb 2007 workshop.

For more information, see


- 7 February 2007: E-mail intention to submit a workshop paper
(optional, but helpful)
- 15 February 2007: Deadline for workshop paper submissions
- 15 March 2007: Notification of acceptance of workshop papers
- 30 March 2007: Camera-ready copy due
- 30 March 2007: Challenge submissions due
- 8 May 2007: Date of workshop


Saturday, December 16, 2006

eye candy for developers

This is not directly related to research, but if you are like me and spend a long time in front of the screen staring at the console (coding or even playing hangman!), check out the cool new fonts that come with Vista. Although I have been Windows-free (just like my lab) for a year now, these Vista fonts are something I can vouch for. Search Google for "six new vista fonts" to download them. Since they are TrueType (TTF) fonts, they work perfectly on my Ubuntu machine (not sure if this is right, but heck, I enjoy the font. Thanks Bill!).

I have been using the Consolas font for my Gnome terminal and it rocks!

MIT EECS Matlab tutorial

Nice introduction to Matlab

Graphical models reading group

A collection of some basic reading material on the subject. Warning: it's old (2004).

Thursday, December 14, 2006

The Dragon Toolkit

The Dragon Toolkit is a Java-based development package for academic research use in language modeling (LM) and information retrieval (IR). Language modeling has recently emerged as an attractive new framework for text information retrieval and text mining (TM). However, most free Java-based search engines, such as Lucene, do not support LM very well. The Lemur toolkit is designed for LM and IR but is written in C and C++, which may be a hindrance to people who prefer Java programming. Basically, the Dragon Toolkit is tailored for researchers who work on large-scale LM and IR and prefer Java programming. Moreover, unlike Lucene and Lemur, it provides built-in support for semantic-based IR and TM. The Dragon Toolkit seamlessly integrates and implements a set of NLP tools, which enable the toolkit to index text collections with various representation schemes including words, phrases, ontology-based concepts and relationships. However, to minimize the learning time, we intentionally keep the package small and simple. The toolkit lacks some features, including distributed IR and cross-language IR, which are part of the Lemur toolkit.

How to Cite Dragon Toolkit

If you are using the Dragon Toolkit for research work, please cite it in your published papers:

Zhou, X., Zhang, X., and Hu, X., The Dragon Toolkit, Data Mining & Bioinformatics Lab, iSchool at Drexel University, http://www.ischool.drexel.edu/dmbio/dragontool

Download Dragon Toolkit

Get the Dragon Toolkit source code and binary libraries (including external libraries) and necessary supporting data. Click http://www.ischool.drexel.edu/dmbio/dragontool/default.asp to download.

Graph-based Methods for Natural Language Processing

NAACL/HLT 2007 Workshop
Graph-based Methods for Natural Language Processing


Rochester, NY, April 26, 2007

Recent years have shown an increased interest in bringing the field of
graph theory into Natural Language Processing. In many NLP
applications entities can be naturally represented as nodes in a graph
and relations between them can be represented as edges. Recent
research has shown that graph-based representations of linguistic
units as diverse as words, sentences and documents give rise to novel
and efficient solutions in a variety of NLP tasks, ranging from part
of speech tagging, word sense disambiguation and parsing to
information extraction, semantic role assignment, summarization and
sentiment analysis.

This workshop builds on the success of the first TextGraphs workshop at
HLT-NAACL 2006. The aim of this workshop is to bring together researchers
working on problems related to the use of graph-based algorithms for natural
language processing and on the theory of graph-based methods.
It will address a broader spectrum of research areas to foster
exchange of ideas and help to identify principles of using the graph
notions that go beyond an ad-hoc usage.
Unveiling these principles will give rise to applying generic graph
methods to many new problems that can be encoded in this framework.

We invite submissions of papers on graph-based methods applied to
NLP-related problems. Topics include, but are not limited to:

- Graph representations for ontology learning and word sense disambiguation
- Graph algorithms for Information Retrieval, text mining and understanding
- Graph matching for Information Extraction
- Random walk graph methods and Spectral graph clustering
- Graph labeling and edge labeling for semantic representations
- Encoding semantic distances in graphs
- Ranking algorithms based on graphs
- Small world graphs in natural language
- Semi-supervised graph-based methods
- Statistical network analysis and methods for NLP

Submission format:

Submissions will consist of regular full papers of max. 8 pages and
short papers of max. 4 pages, formatted following the NAACL 2007
guidelines. Papers should be submitted using the online submission
form: http://www.cs.rochester.edu/meetings/hlt-naacl07/workshops.shtml

Important dates:

- Regular paper submission: January 29
- Short paper submissions: February 4
- Notification of acceptance: February 22
- Camera-ready papers: March 1
- Workshop: April 26

AAAI 2007 track on AI and the Web

AAAI 2007 (July 22-26, Vancouver, Canada) will have a special
technical track on Artificial Intelligence and the Web. The
track seeks research papers on AI techniques, systems and
concepts involving or applied to the Web. Papers should
describe Web related research or clearly explain how the
work addresses problems, opportunities or issues underlying
the Web or Web-based systems. See [1] for suggested topics
and more track information and [2] for information on the
conference and details on submitting. Relevant deadlines are:

- Jan 25: student abstracts
- Feb 1: technical paper abstracts
- Feb 2: doctoral consortium applications
- Feb 6: technical papers
- Feb 27: nectar and senior member papers
- Apr 3: intelligent systems demo proposals

[1] http://cs.umbc.edu/aaai07/
[2] http://www.aaai.org/Conferences/AAAI/aaai07.php

CoNLL Shared Task 2007: multilingual dependency parsing

CoNLL Shared Task 2007


Keeping up the successful tradition, the Conference on
Computational Natural Language Learning (CoNLL) 2007 will
as usual include a shared task. For the second year running,
the task will be multilingual dependency parsing. The first
call for participation is scheduled to appear later in
December, with release of training data in late January and
submission of test results in late March. The CoNLL
conference is scheduled to take place in June 2007.

The website for the shared task will be

Enquiries about the shared task can be sent to

The organizers

Joakim Nivre
Johan Hall
Sandra Kübler
Ryan McDonald
Jens Nilsson
Sebastian Riedel
Deniz Yuret

The congressional speech corpus

The "congressional speech" corpus and associated graph information
used in our "Get out the vote: Determining support or opposition from
Congressional floor-debate transcripts" EMNLP 2006 paper is now available.

Specifically, the data includes speeches as individual documents,
together with:

* automatically-derived labels for whether the speakers supported
the legislation under discussion or not, allowing for
experiments with this kind of sentiment analysis

* indications of which debate each speech comes from (and the
position within the debate), allowing for consideration of
conversational structure

* indications of by-name references between speakers, allowing for
experiments with agreement classification (if one determines the
"true" labels from the support/oppose labels assigned to the
pair of speakers in question)

* the edge weights and other information we derived to create the
graphs we used for our experiments upon this data, facilitating
implementation of alternative graph-based classification methods
upon the graphs we constructed

The download site is:

Matt Thomas, Bo Pang, and Lillian Lee