Thursday, July 26, 2007

Google allows data binging for researchers

Google has now opened access to its search and MT systems to university researchers, according to today's announcement on its research blog. The search API documentation does not mention any restriction on the number of queries that can be posted (the earlier limit was 1000). Whatever the number is, I am guessing it will be large (drinking from the firehose?). The MT API, however, allows 1000 queries per day, with the documentation hinting that this need not be a hard limit.

Looking at the search API output, two things I really miss are the number of hits and the snippet for each search result. The number of hits has been used in several papers for interesting results. Snippets are the other useful feature: every search result from Google is accompanied by a small snippet extracted from the page, as shown below for the example query "Dekang Lin".

The information in the snippets can be used as informative features in different tasks, like this one in person name disambiguation. (BTW, Dekang is now at Google.)

Despite these minor quibbles, these new APIs will be quite useful to all of us and will certainly result in more papers on Googleology.

Later addition: Turns out we can sort of get the counts by repeatedly executing the request (only ten results per request) and counting the search results, but the API caps the number of such requests at 100. That means you can get a maximum of 1000 results, which is not quite the same as "Results 1 - 10 of about 779,000,000". Though that number is approximate, it is still indicative of how strong the query is w.r.t. the web. For example, GoogleCount("Horse+animal") >> GoogleCount("Horse+truck").
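
For the curious, the paging workaround above can be sketched roughly like this. This is only an illustration, not the actual API client: `fetch_page` is a hypothetical callable standing in for one search request, `make_fake_backend` is a made-up stand-in so the sketch runs offline, and the two constants are just the numbers quoted in this post (ten results per request, at most 1000 results overall).

```python
from typing import Callable, List

RESULTS_PER_REQUEST = 10  # from the post: ten results per request
MAX_REQUESTS = 100        # from the post: 100 requests, i.e. at most 1000 results


def count_results(query: str, fetch_page: Callable[[str, int], List[str]]) -> int:
    """Count hits by paging until a short page arrives or the cap is hit.

    fetch_page(query, start) is a hypothetical function returning the
    results at offsets start .. start + 9 for the query.
    """
    total = 0
    for request_index in range(MAX_REQUESTS):
        start = request_index * RESULTS_PER_REQUEST
        page = fetch_page(query, start)
        total += len(page)
        if len(page) < RESULTS_PER_REQUEST:
            break  # a short (or empty) page means there is nothing more
    return total


def make_fake_backend(true_hits: int) -> Callable[[str, int], List[str]]:
    """Build an offline stand-in for the search service (illustration only)."""
    def fetch_page(query: str, start: int) -> List[str]:
        end = min(start + RESULTS_PER_REQUEST, true_hits)
        return [f"{query}#{i}" for i in range(start, end)]
    return fetch_page


if __name__ == "__main__":
    # A rare query is counted exactly; a popular one saturates at the cap.
    print(count_results("rare query", make_fake_backend(37)))
    print(count_results("popular query", make_fake_backend(779_000_000)))
```

The catch is visible right away: any query with more than 1000 hits reports the same count, so the comparison between "Horse+animal" and "Horse+truck" would be lost for popular queries.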
