Wednesday, July 22, 2009

SIGIR Day 1

Opening session: Susan Dumais' Salton Award Lecture

Dumais walked through a personal history of her work. She spent some time talking about the problem of re-finding: her data shows that 50-80% of web visits are re-visits and 30-50% of queries are re-queries, which speaks to a tremendous opportunity to use that knowledge to help users more effectively. In experiments with re-finding on the desktop she discovered that while queries to a search box tend to be short with few operators, if you give people a rich faceted interface, they will use it. Being able to augment the results with new criteria and re-sort based on those criteria is key. She also pointed out that some attributes are highly resource dependent. For example, which "date" should you use? For a meeting: the time the meeting is scheduled for; for an email: when it was sent; for a photo: when it was created; for a document: when it was last changed, etc. She then spoke about some work on personalization, where prior clicks alone got predictions right only 45% of the time, whereas adding dwell time and other session information got to 75%.

On web search at 15: "There is something amazing in that you can have 20 billion pages of diverse uncurated stuff and find anything!"

Although web search is amazingly effective, many tasks are still hard, and much remains to be done. "Search in the future will look nothing like today's simple search engine interfaces. If in 10 years we are still using a rectangular box and a list of results, I should be fired." [Mar 2007]

She spoke of thinking beyond traditional IR boundaries and identified several challenges:
  • information dynamics: information changes a lot and people revisit pages a lot. Queries have different temporal dynamics and pages have different rates of change. Including this information in the user interfaces would be very helpful. What if changes in search results from the last time you searched were highlighted in some way? What if changes in a page you re-visit were highlighted in some way? She gave by way of example a browser plug-in she has been experimenting with, which showed her that there was a free continental breakfast announcement added to the SIGIR page. (Which I missed, not having this fancy plug-in, drat!) (A toy sketch of the page-diff idea appears after this list.)
  • data as a critical resource: research on user dynamics depends on data such as large scale web query logs, which are hard to get, and harder to share. We can also look at operational systems as an experimental resource using A/B testing (as Amazon does, for example).
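Roughly, the plug-in idea boils down to diffing the page you see now against the copy you saw last time. A toy sketch of that (mine, not hers), using Python's difflib:

```python
# A toy version of the "what changed since my last visit?" idea, assuming we
# kept the text of the page from the previous visit.
import difflib

def highlight_changes(old_text, new_text):
    """Return the lines added or removed between two visits to a page."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""
    )
    # Keep only real additions/removals, not the diff headers.
    return [line for line in diff
            if line[:1] in "+-" and not line.startswith(("+++", "---"))]

# E.g. the SIGIR page grows a breakfast announcement between visits:
old = "Welcome to SIGIR 2009\nRegistration opens at 8am"
new = "Welcome to SIGIR 2009\nRegistration opens at 8am\nFree continental breakfast at 7am"
print(highlight_changes(old, new))   # ['+Free continental breakfast at 7am']
```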
In the Q&A someone raised the poverty of the web search box, which hasn't changed visually in 25 years. She remarked: "We are not nailing search anywhere. You're right. We are training ourselves to ask questions that those resources can answer." She then went on to remark that a lot has changed behind the scenes, but transparency matters and matters a lot more for re-finding: if you give people the tools to tell you what they know about what they are looking for they will be more successful.

Classification and Clustering

"Context-Aware Query Classification"
Derek Hao Hu (Hong Kong University of Science and Technology)
presenting for a long list of authors from the university and Microsoft

The basic idea was to apply a fairly simple statistical model that uses the previous query, and the links clicked from it, to predict the category of the current query, with categories drawn from some web dictionary or taxonomy such as Wikipedia. The idea is to figure out whether by "bass" you mean the fish or the instrument, for example.
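A hedged toy version of what I took away (not the authors' actual model): score categories for the current query, then reweight by a transition prior learned from the category the previous query pointed at.

```python
# Toy context-aware query classification: combine per-query category scores
# with a transition prior P(category | previous category).  Both inputs are
# assumed to come from elsewhere (a taxonomy lookup and query-log sessions).
def classify_in_context(category_scores, prev_category, transitions):
    """
    category_scores: dict category -> P(category | query text alone)
    transitions:     dict (prev_category, category) -> P(category | prev_category)
    """
    combined = {}
    for cat, p_query in category_scores.items():
        p_context = transitions.get((prev_category, cat), 1e-6)   # smoothed
        combined[cat] = p_query * p_context
    total = sum(combined.values()) or 1.0
    return {cat: score / total for cat, score in combined.items()}

# "bass" after a fishing-related query should lean toward the fish sense.
scores = {"Fish": 0.5, "Music": 0.5}
transitions = {("Fishing", "Fish"): 0.8, ("Fishing", "Music"): 0.1}
print(classify_in_context(scores, "Fishing", transitions))
```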

"Refined Experts: Improving Classification in Large Taxonomies"
Paul Bennett (Microsoft) and Nam Nguyen (Cornell). Paul Bennett presenting.

Interesting data points:
If you have a hierarchy of classes, using hierarchical training is (slightly) better than flat training. What this means is that if your taxonomy says class A contains subclasses A1, A2, and A3, then the training set for A1 is the things in A, with the things in A1 as the positive examples. The point is, for hierarchical training you don't put things from B in as negative examples of A1.

Significantly better is something he calls Refinement and Refined Experts.

The basic idea for Refinement is to use cross-validation over the training data to predict which documents belong in a class, and then use both the predicted and the actual members to do the training lower down.

The basic idea for Refined Experts is to add "metafeatures," which are predictions from the child classes, as part of the input feature vectors for the parent class. This means augmenting the training data for class A with metafeatures derived from A1, A2, etc. So you train from the bottom up, using cross-validation at the leaves to add features to the training sets for the larger classes.

So there is a bottom up training pass followed by a top-down training pass.

The intuition behind this is that it cuts off miscategorization early at the top, and that some things may be easier to distinguish at the leaves than in the context of the larger classes at the top.

The Refined Experts algorithm led to a 20-30% improvement in F1.
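As I understood it, the metafeature step looks roughly like this. This is a sketch for a two-level taxonomy using scikit-learn; the paper's actual classifiers, folds, and thresholds differ.

```python
# Sketch: augment a parent class's features with cross-validated predictions
# from its children (the "metafeatures" of Refined Experts).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_parent_with_metafeatures(X, y_parent, child_labels):
    """
    X:            document feature matrix (n_docs x n_features)
    y_parent:     binary labels for the parent class A
    child_labels: dict like {"A1": y_a1, "A2": y_a2} of binary labels per child
    """
    metafeatures = []
    for _, y_child in sorted(child_labels.items()):
        child_clf = LogisticRegression(max_iter=1000)
        # Cross-validated child predictions, so the parent never sees a child
        # classifier scored on its own training folds.
        preds = cross_val_predict(
            child_clf, X, y_child, cv=5, method="predict_proba"
        )[:, 1]
        metafeatures.append(preds.reshape(-1, 1))
    # Bottom-up pass: augment the parent's features with the child predictions.
    X_aug = np.hstack([X] + metafeatures)
    return LogisticRegression(max_iter=1000).fit(X_aug, y_parent)
```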

"Dynamicity vs. Effectiveness: Studying Online Clustering for Scatter/Gather"
Weimao Ke et al (University of North Carolina) Weimao Ke presenting

The central observation here, albeit based on a very small study, is that you can pre-compute clusters statically and then just do tree cutting to select the appropriate clusters at run time, and get results (both in terms of precision/recall and user experience) that are just as good as those you get from computing the clusters dynamically from the search results or from selected clusters of search results. Given that clustering algorithms run from O(kn) to O(n^2) (n = #documents, k = #clusters), being able to run this offline could be a win. The main downside, of course, is that this doesn't play well with a highly dynamic document set.
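A minimal sketch of the offline idea, assuming a static collection and scipy's hierarchical clustering; the study's actual tree-cutting strategy may well differ.

```python
# Offline: build one cluster tree over the whole collection.
# Online: cut the precomputed tree and show only the clusters the query hit.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

doc_vectors = np.random.rand(1000, 50)          # stand-in for TF-IDF vectors
tree = linkage(doc_vectors, method="average", metric="cosine")

def clusters_for_results(result_ids, n_clusters=10):
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    selected = {}
    for doc_id in result_ids:
        selected.setdefault(labels[doc_id], []).append(doc_id)
    return selected   # cluster id -> retrieved docs falling in that cluster

print(clusters_for_results([3, 17, 256, 990]))
```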

Web 2.0


"A Statistical Comparison of Tag and Query Logs"
Mark J. Carman (University of Lugano), Mark Baillie (University of Strathclyde), Robert Gwadera, Fabio Crestani (University of Lugano) Mark Carman presenting

Given that you are interested in personalization, you need personalized judgements of relevance. You could use query logs for this, perhaps, but what if they aren't available? Perhaps you can use tag data as a proxy. The question is: is that really valid?

This work reported on an experiment to compare the vocabulary distributions of queries, tags, and content to see how well they match. It compared AOL query-log data against tag data from Delicious and categories from the Open Directory Project (DMoz).
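The flavor of the comparison, in toy form (the paper's statistics are more careful than a single cosine score): build per-URL term distributions from queries and tags, then measure how similar they are.

```python
# Compare the term distribution of queries that led to a URL against the term
# distribution of tags users assigned to that URL.
from collections import Counter
import math

def cosine(dist_a, dist_b):
    dot = sum(dist_a[t] * dist_b[t] for t in dist_a.keys() & dist_b.keys())
    norm_a = math.sqrt(sum(v * v for v in dist_a.values()))
    norm_b = math.sqrt(sum(v * v for v in dist_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy data: query terms that clicked on a URL vs. tags users gave that URL.
query_terms = Counter("apple iphone review apple phone review".split())
tag_terms = Counter("apple iphone gadget review".split())
print(cosine(query_terms, tag_terms))
```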

The conclusions:
For the tags and queries for a URL:
  • the vocabularies are similar
  • term distributions are correlated but not identical
  • the similarity doesn't seem to depend on category or popularity
  • content vocabulary more similar to queries than to tags
  • query/tag distributions more similar to each other than to content
In the Q&A it was pointed out that given that the documents are selected based on the query terms, the closeness of query terms to document terms is unsurprising.

"Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications"
Chen Lin (Fudan University), Jiang-Ming Yang , Rui Cai , Xin-Jing Wang (Microsoft Research, Asia), Wei Wang (Fudan University), Lei Zhang (Microsoft Research Asia)
Chen Lin presenting

This talk was all about figuring out interesting things about threaded discussions. The central assumption was that you have no information about who replied to whom in the thread.

The model involves post/term and topic/term matrices, and applies a minimization algorithm to a function that combines structural (reply-to) and semantic (topic) terms to infer topic/post relationships. The assumptions are that a thread may cover several topics, that each post can be approximated as a linear combination of previous posts, that a post is related to only a few topics, and that each post is related to only a few other posts.
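A toy rendering of the "each post is a sparse combination of earlier posts" assumption, substituting an off-the-shelf Lasso for the paper's own optimization. post_vectors is assumed to be an (n_posts x n_terms) TF-IDF matrix in thread order.

```python
# For each post, reconstruct its term vector from earlier posts; the sparse,
# non-zero weights point at likely reply-to targets.
import numpy as np
from sklearn.linear_model import Lasso

def infer_reply_candidates(post_vectors, alpha=0.1):
    n_posts = post_vectors.shape[0]
    candidates = {0: []}                      # the root post replies to nobody
    for i in range(1, n_posts):
        earlier = post_vectors[:i].T          # terms x earlier-posts
        model = Lasso(alpha=alpha, positive=True, max_iter=10000)
        model.fit(earlier, post_vectors[i])
        candidates[i] = [j for j, w in enumerate(model.coef_) if w > 0]
    return candidates
```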

This model of the semantics and structure of a thread was then tested against other techniques on slashdot and Apple discussion forums for various purposes. The first was to reconstruct reply relationships, based on the assumption that a reply is similar to the post it is replying to. The technique did better than the various baselines (reply to nearest, reply to root, basic document (term) similarity, etc.). It also did well for junk identification (against term frequency, SVM, and others), and for identifying experts (by finding hubs in the graph of replies).

The interesting thing is that the structural terms dominated, mathematically, which says that if you actually know what the reply-to structure is (which you do for both slashdot and the Apple forums, for example -- that provided the basis for evaluating the quality), you've won half the battle in solving some of these problems.

"Enhancing Cluster Labeling Using Wikipedia"
David Carmel, Haggai Roitman, Naama Zwerdling (IBM Haifa Research Lab)
David Carmel presenting

The problem here is to come up with a label for a cluster of documents. The standard technique is to identify the "important terms" associated with the cluster using various statistical techniques. The problem is that such terms may be statistically useful, but they may not actually help humans figure out what the cluster is about. Further, a good label may not even appear in the actual content.

They did a small study to see whether an ODP category label could be recovered as an important term from the content; in only 15% of the categories did the label appear in the top 5 terms. Example: the "Electronics" category gets terms like electron, coil, voltage, tesla, etc. -- good terms, but none of them good as a label.

Their solution: use Wikipedia:
Use the cluster's content to search Wikipedia, and use metadata from Wikipedia to produce labels. The overall process is simple enough: cluster, get important terms using conventional techniques, take those as a query against Wikipedia (boosting the weights of the more important terms), extract metadata (title, category) from the top N relevant documents, and then choose the best candidate. For judging the best categories, they found that you need to avoid retrieving too many relevant documents from Wikipedia, and that a ranking system based on propagating scores (over the linkage of documents, labels, and terms) worked better than one based on average mutual information. The technique seems fairly robust in the face of noise: even at 40% noise the results were reasonable.
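The pipeline, roughly, in sketch form. Here search_wikipedia() is a hypothetical helper; the paper used its own index and a more elaborate score-propagation step for ranking candidates.

```python
# Label a cluster by querying Wikipedia with its important terms and scoring
# candidate labels from the titles and categories of the top hits.
from collections import Counter

def label_cluster(important_terms, weights, top_n=20):
    # 1. Turn the cluster's important terms into a weighted (Lucene-style) query.
    query = " ".join(f"{term}^{weights[term]:.2f}" for term in important_terms)

    # 2. Retrieve a *modest* number of relevant Wikipedia pages; too many hurt.
    pages = search_wikipedia(query, limit=top_n)   # hypothetical helper

    # 3. Collect candidate labels from titles and categories, letting
    #    higher-ranked pages contribute more.
    candidates = Counter()
    for rank, page in enumerate(pages):
        score = 1.0 / (rank + 1)
        candidates[page["title"]] += score
        for category in page["categories"]:
            candidates[category] += score

    # 4. The best-scoring candidate becomes the cluster label.
    return candidates.most_common(1)[0][0]
```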

Final thoughts:
Wikipedia is great for the topics it covers, but it doesn't cover everything.

Similar techniques could be used for other mining tasks: word sense disambiguation, semantic relatedness of fragments, coreference resolution, multi-lingual alignment or retrieval, query expansion, entity ranking, clustering and categorization, etc. etc.

Efficiency



"Compressing Term Positions in Web Indexes"
Hao Yan, Shuai Ding, Torsten Suel (Polytechnic Institute of NYU)
Torsten Suel presenting

The paper looked at various techniques for compressing positions in an inverted index by exploiting clustering effects in relatively short documents (web pages). It then looked at some issues with query processing with position data.

The key observation is that positions in text are clustered; they are not evenly distributed. This observation can be exploited by specialized compression schemes (cf. IPC (Moffat/Stuiver 1996), LLRUN, and others). The problem is that these don't work so well with short documents.

A second observation is that by the time we store positions we already know the term frequencies and the document size. Various experiments were run with adaptive encoding techniques: variants of Rice coding that use frequency and document size to divide the document into buckets and determine the basis of the coding, with and without adaptation from bucket to bucket; using regression to modify the encodings; and using Huffman coding with different tables chosen based on various features.
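To make one of the adaptive ideas concrete, here is a stripped-down Rice coder for position gaps that picks its parameter from the document length and term frequency we already store. The parameter rule here is my own rough choice, not the paper's.

```python
# Rice-code the gaps between a term's positions, with the Rice parameter
# derived from document length and term frequency.
import math

def rice_encode_positions(positions, doc_len, term_freq):
    # Store gaps between successive positions rather than absolute positions.
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    # Golomb-style rule of thumb: the expected gap is about doc_len / term_freq.
    mean_gap = max(doc_len / term_freq, 1.0)
    k = max(int(round(math.log2(0.69 * mean_gap))), 0)   # Rice parameter 2^k
    bits = []
    for g in gaps:
        q, r = divmod(g, 1 << k)
        code = "1" * q + "0"                 # quotient in unary
        if k:
            code += format(r, f"0{k}b")      # remainder in k fixed bits
        bits.append(code)
    return "".join(bits), k

encoded, k = rice_encode_positions([4, 9, 13, 40], doc_len=500, term_freq=4)
print(k, encoded)
```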

Different terms behaved differently, but the compression gains from all these techniques were disappointingly modest. Maybe a completely different approach is needed: storing approximate positions, or something else. Or maybe we should arrange to avoid decompressing most position data: if positions are only used to re-rank the top N results (for a position boost), for example, there is little difference in result quality.

One comment: given the large size of the collections these days, storing a few extra parameters to improve modeling is not a big deal.


"Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce"
Jimmy Lin (University of Maryland)

The paper looked at computing pairwise document similarity using MapReduce with various algorithms and parameter settings; the data set was a collection of Medline documents. The take-home message was that in a MapReduce environment, the cost of keeping track of intermediate data and shipping it around the network was great enough that the brute-force approach of just computing the dot products of all pairs of term vectors actually took less time; and since it produces an exact result rather than an estimate (which is what the other approaches produce), it gave better answers as well.
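Modulo the MapReduce plumbing, the brute-force baseline boils down to an exact all-pairs dot product over (normalized) term vectors; a single-machine sketch:

```python
# Exact all-pairs cosine similarity over dense TF-IDF vectors.
import numpy as np

def pairwise_similarity(doc_vectors):
    """doc_vectors: (n_docs x n_terms) matrix of TF-IDF weights."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    normalized = doc_vectors / np.maximum(norms, 1e-12)
    return normalized @ normalized.T         # (n_docs x n_docs), exact

docs = np.random.rand(100, 2000)             # stand-in for Medline TF-IDF vectors
sims = pairwise_similarity(docs)
```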

On the other hand, given that the study platform was Hadoop, which was evolving as the study was going on (so that the same test on different days would give very different times), the absolute performance measures here are probably not meaningful.

"Efficiency Trade-Offs in Two-Tier Web Search Systems"
Ricardo Baeza-Yates , Vanessa Murdock (Yahoo!), Claudia Hauff (University of Twente)
Ricardo Baeza-Yates presenting

The problem context for this paper is a search system which has a small local server (cheap, fast) and a larger 2nd tier server (expensive, slow). Queries that cannot be answered by the local server must be referred to the 2nd tier server.

Approach 1: Search the local server and if that does not give you good results then refer the search to the tier 2 server
Problem: you end up waiting longer than if you sent to tier 2 in the first place.

Approach 2: Search both in parallel and then merge or merge only if you need to.
Problem: Increases load on tier 2 and so is more costly.

Solution: Predict which queries can be answered locally, and send each query to tier 1 only or to both tiers in parallel, depending on the prediction. The paper presented a mathematical model for deciding when this is worth doing, and results from a sample system looked reasonable.

Result assessor:
If you failed to predict that you needed tier 2, but you did need it: you have to wait while the query is sent on to tier 2; the query is serialized (as in approach 1).
If you predicted you needed tier 2 but you didn't: you added extra load to tier 2, but you didn't have to wait for it to come back.

Predictor:
Pre-retrieval prediction: train an SVM on precision in the top 20 results (P@20), using features from the literature.
Post-retrieval assessment: if the local tier returns enough results, the answer is considered good enough.

You can improve this a lot by caching false negatives.
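Putting the pieces together, the dispatch logic looks roughly like this. This is a sketch: the tier objects, the predictor function, and the merge policy are all assumptions of mine, not the paper's API.

```python
# Two-tier dispatch with a false-negative cache.  tier1/tier2 are assumed to
# expose .search(query) -> list of results; predict_needs_tier2 stands in for
# the SVM-based pre-retrieval predictor.
def merge(local_results, remote_results):
    seen, merged = set(), []
    for r in local_results + remote_results:   # naive merge, keep first copy
        if r not in seen:
            seen.add(r)
            merged.append(r)
    return merged

def answer_query(query, tier1, tier2, predict_needs_tier2, fn_cache, min_results=20):
    if query in fn_cache or predict_needs_tier2(query):
        # Predicted (or previously observed) that local results won't suffice:
        # query both tiers in parallel (sequential here for simplicity).
        return merge(tier1.search(query), tier2.search(query))

    results = tier1.search(query)
    if len(results) >= min_results:            # post-retrieval check: good enough
        return results

    # The predictor was wrong: pay the extra round trip now, and remember the
    # query so we go straight to both tiers next time.
    fn_cache.add(query)
    return merge(results, tier2.search(query))
```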

A question came up about what the caching of whole results does to this model. It isn't clear.