Thursday, December 24, 2009

Drat! A Reason to Boot to Vista

Some months back I acquired a nice, practically brand-new quad-core box. The fellow I got it from was kind enough to install it as a dual-boot with Fedora and Vista.

What a pleasure it has been not to have to lug the laptop home and use it to tunnel in to work to get meaningful work done from home, without worrying about whether things will fail in mysterious ways, and being able to configure my own imap server if I want to. Like all Vista machines, that laptop functions like an old VW Bug with funky quirks that don't apply to any other such machine ("reverse doesn't work, and you have to be gentle nudging it into second or it will stall"). In my case, the cursed laptop refuses to install anything except (mirabile dictu, touch wood) Firefox updates.

In any event, after lo! these many months, I have never once felt anything close to the need to actually boot to Vista on my new box. Until yesterday, dang.

This is all my son's fault. He was playing with a fractal drawing tool I had installed, and wanted more. Since he mostly uses a Windows box, he went looking for something that would run on Windows. And he found a truly astonishing bit of donation-ware: Incendia. If you are remotely interested in computer art, or fractals, go check it out and toss the author some money, because he deserves it. Or check out some of the samples. Here's one my son did.

Bad news for me: this doesn't run on Linux, and really, nothing remotely close to it runs on Linux. So if I want to play too, I have to boot to Vista. Of course, it is Vista, and in this case I have to log in as administrator to run the program or suffer outrageous flicker.

Sunday, November 22, 2009

When Irish Eyes Are Not Smiling

France 2 Ireland 1 FIFA 0



Someone once told me that the perfect game was one that had a close score, a lot of chances, at least one switch in the lead, at least one moment of controversy, and a come-from-behind triumph.

Well, by that measure, the World Cup playoff between France and Ireland must be some kind of gem. I guess we forgot about the part "and the outcome not decided by cheating or an officiating mistake."

There has been much ink spilled, and all manner of ugly truths put about on this issue, from which many wild conclusions have followed.

Ugly truth (courtesy Roy Keane): France is going to the World Cup, get over it.

Ah Roy, politic as ever. And still burning with anger over his treatment by the FAI, apparently.

Ugly truths (Roy Keane, again): Look at how well you're developing your youth squads, Ireland: why are you even in the playoffs? Look at all the chances you didn't finish: why is the game so close that one bad call makes you lose the game? Look at how you defended the set-piece: why are you letting Henry goal-side, why are you letting the ball bounce in the box, where's my goal-keeper?

Roy is obviously a graduate of the blame-the-victim school of grief counseling. That said, he has a point. Aidy Boothroyd said something similar when Attwell (may his name live in infamy forever) handed Reading a goal for a ball that never saw the inside of the posts: you should take your own chances and not put yourself in a position where one bad call makes the difference.

Ugly truth: Why is Ireland even playing France? Why did France get special seeding by FIFA changing the rules midstream? Maybe France should have been playing Portugal. (Not that the seeding helped Russia.)

FIFA deserves any black eye it gets for its continual bending of the rules to the benefit of the favoured few top teams. This seems a common failing of soccer federations: UEFA does it, the English FA does it, the MLS is a hopeless case in this regard. It all stems from a shocking misapprehension of a key attraction of the sport: the real chance that underdogs can win on any given day.

Ugly truth: What excuse do the officials have for being out of position on a set piece?

Well, they weren't out of position, actually. The linesman was down by the corner, and the ref was back outside of the area so he could get a good view of the shocking amount of pushing and pulling going on.

Ugly truth: "Everyone cheats, it is up to the refs to spot it"

The first half may be an ugly truth, although it must count as a poke in the eye to all the honest players out there, but the second half is an unwarranted conclusion.

And the most obvious ugly truth of all, courtesy Henry, trying to save his reputation: Yes, I handled the ball; I think the fair thing would be a replay.

Yeah, right. Like that was ever going to happen. If you were so gung ho on fair play, where were you at the time?

Some have concluded from this swamp that soccer needs video replays to eliminate error.

Folks, soccer needs video replay about as much as it needs TV timeouts. The best proposal I have heard has a fifth official watching video replays who can "raise his flag" to the ref, if he thinks the officials on the field have missed anything, leaving it up to the center ref to decide whether to accept that call or not. Such a system might be quite workable, and may stop FIFA from suffering the ultimate humiliation of seeing a World Cup Final decided by cheating. My feeling is, it will neither satisfy the purists who want the game to flow without endless stoppages and revisionist decision-making, nor those who want to ensure there are no "incorrect" calls, ever. Some wrong decisions seem so trivial that it would be pointless to object in the flow, until they turn out to end up in a goal. Example: half-way line throw-in goes to the wrong team, who quickly launch the ball thirty yards upfield, forward gets behind the defender, bam! goal. (Real example, real game, by the way.) Now what? Rewind the last 15 seconds of play for a throw-in decision? Or review every throw-in decision in real time?

This game does not need to follow the NFL down the path to perdition where there is no such thing as cheating, only calculated application of specific infractions for which there is, somewhere in the 300 page rule book, a specified penalty. Do we really want soccer to become a game where you need to be advised that "if a push happens in the last twenty yards during the last 3 minutes of the game, and there are no timeouts left, the aggrieved team has an option of calling a bonus timeout, unless it is Tuesday, or a flying monkey is spotted in the stands" (or whatever the latest gerrymandered NFL rule is, I can't keep 'em straight). I say not.

What this game needs is a dose of sportsmanship. Stop calling cheating "simulation" for starters. Put some real teeth into the "ungentlemanly conduct" law -- if you're going to toss yellows about for foul language (another thing "everyone does"), how about for cheating? Any attempt to con the officials should be dealt with very harshly. If you're going to add video-based fines and bans for fouls, how about for cheating? If Henry knew he would miss the first three games of the World Cup (which is about all France can expect to play, based on their recent form), he may not have fessed up in the heat of the moment, but at least France would suffer a realistic penalty for their actions. Or maybe: proactive red and yellow cards: you should have got a yellow in your last match but you conned the ref, fine, have one for the start of your next game.

Full disclosure: My husband is half-Irish. Much dark muttering into Guinness followed last week's debacle.

Wednesday, July 22, 2009

SIGIR Day 1

Opening session: Susan Dumais' Salton Award Lecture

Dumais walked through a personal history of her work. She spent some time talking about the problem of re-finding: her data shows that 50-80% of web visits are re-visits and 30-50% of queries are re-queries, which speaks to a tremendous opportunity to use that knowledge to help users more effectively. In experiments with re-finding on the desktop she discovered that while queries to a search box tend to be short with few operators, if you give people a rich faceted interface, they will use it. Being able to augment the results with new criteria and re-sort based on those criteria is key. She also pointed out that some attributes are highly resource dependent. For example, which "date" should you use? For a meeting: the time the meeting is scheduled for; for an email: when it was sent; for a photo: when it was created; for a document: when it was last changed, etc. She then spoke about some work on personalization, where prior clicks alone only got to predictions that were right 45% of the time, whereas adding in prior clicks plus dwell time plus other session information got to 75%.

On web search at 15: "There is something amazing in that you can have 20 billion pages of diverse uncurated stuff and find anything!"

Although web search is amazingly effective, many tasks are still hard, and much remains to be done. "Search in the future will look nothing like today's simple search engine interfaces. If in 10 years we are still using a rectangular box and a list of results, I should be fired." [Mar 2007]

She spoke of thinking beyond traditional IR boundaries and identified several challenges:
  • information dynamics: information changes a lot and people revisit pages a lot. Queries have different temporal dynamics and pages have different rates of change. Including this information in the user interfaces would be very helpful. What if changes in search results from the last time you searched were highlighted in some way? What if changes in a page you re-visit were highlighted in some way? She gave by way of example a browser plug-in she has been experimenting with, which showed her that there was a free continental breakfast announcement added to the SIGIR page. (Which I missed, not having this fancy plug-in, drat!)
  • data as a critical resource: research on user dynamics depends on data such as large scale web query logs, which are hard to get, and harder to share. We can also look at operational systems as an experimental resource using A/B testing (as Amazon does, for example).
In the Q&A someone raised the poverty of the web search box, which hasn't changed visually in 25 years. She remarked: "We are not nailing search anywhere. You're right. We are training ourselves to ask questions that those resources can answer." She then went on to remark that a lot has changed behind the scenes, but transparency matters and matters a lot more for re-finding: if you give people the tools to tell you what they know about what they are looking for they will be more successful.

Classification and Clustering

"Context-Aware Query Classification"
Derek Hao Hu (Hong Kong University of Science and Technology)
presenting for a long list of authors from the university and Microsoft

The basic idea was to apply a fairly simple statistical model to use the previous query and previously clicked links from that query to predict the category for the current one, using a query classification based on some web dictionary or taxonomy such as Wikipedia. The idea is to figure out if by "bass" you are talking about the fish or the instrument, for example.
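
Just to fix the idea in my own head, here is a toy sketch of that kind of context carry-over (entirely my illustration, with a made-up two-category "taxonomy" and carry-over weight; the paper's actual statistical model is far richer):

    # Toy illustration of context-aware query classification: combine lexical
    # evidence from the current query with a prior carried over from the
    # previous query's category. The tiny "taxonomy" and the carry-over weight
    # are invented for illustration only.
    category_terms = {
        "music": {"bass", "guitar", "amp", "chord"},
        "fishing": {"bass", "lure", "rod", "lake"},
    }

    def classify(query, prev_category=None, carry_over=0.5):
        words = set(query.lower().split())
        scores = {}
        for cat, terms in category_terms.items():
            score = len(words & terms) / len(terms)   # lexical evidence
            if cat == prev_category:
                score += carry_over                   # context from the last query
            scores[cat] = score
        return max(scores, key=scores.get)

    print(classify("bass"))                           # a coin-flip on its own
    print(classify("bass", prev_category="fishing"))  # context resolves it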

"Refined Experts: Improving Classification in Large Taxonomies"
Paul Bennett (Microsoft) and Nam Nguyen (Cornell). Paul Bennett presenting.

Interesting data points:
If you have a hierarchy of classes, using hierarchical training is (slightly) better than flat training. What this means is that if you have a taxonomy that says class A contains subclasses A1, A2, and A3, the training set for A1 is things in A, with positive examples being things in A1. The point is, you don't put things from B in as negative examples of A1 to do hierarchical training.

Significantly better is something he calls Refinement and Refined Experts.

The basic idea for Refinement is to use cross-validation to predict which training documents belong in the class, and then use both the predicted and the actual memberships to do the training lower down.

The basic idea for Refined Experts is to add "metafeatures" that are predictions from the child classes as part of the input to feature vectors for the parent class. This means augmenting the training data for class A with metadata for A1, A2, etc.. So you train from the bottom up, and use cross-validation from the leaves to add to the training set for larger classes.

So there is a bottom up training pass followed by a top-down training pass.

The intuition behind this is early cut-off of miscategorization from the top, and that it may be easier to distinguish some things at the leaves than in the context of larger classes at the top.
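
Here is a rough sketch of how I picture the metafeature part working (my own reconstruction with off-the-shelf scikit-learn pieces -- the classifier choice, the 5-fold cross-validation, and the helper name are mine, not the authors'):

    # Toy reconstruction of the "Refined Experts" metafeature idea: train the
    # child-level classifiers first (with cross-validation), then feed their
    # predicted probabilities to the parent-level classifier as extra features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def train_with_metafeatures(X, child_labels, parent_labels):
        """X: dense document-feature matrix; child_labels/parent_labels: one
        leaf-level and one top-level class label per document."""
        child_labels = np.asarray(child_labels)
        child_classes = sorted(set(child_labels))
        # Bottom-up pass: cross-validated one-vs-rest predictions per child class,
        # so the parent never sees a prediction made on its own training row.
        meta = np.zeros((X.shape[0], len(child_classes)))
        for j, c in enumerate(child_classes):
            y = (child_labels == c).astype(int)
            clf = LogisticRegression(max_iter=1000)
            meta[:, j] = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
        # Top-down pass: the parent classifier trains on the original features
        # augmented with the child metafeatures.
        parent_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X, meta]), parent_labels)
        # Final child classifiers, retrained on all the data for use at prediction time.
        child_clfs = {c: LogisticRegression(max_iter=1000).fit(X, (child_labels == c).astype(int))
                      for c in child_classes}
        return parent_clf, child_clfs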

The Refined Experts algorithm led to a 20-30% improvement in F1.

"Dynamicity vs. Effectiveness: Studying Online Clustering for Scatter/Gather"
Weimao Ke et al (University of North Carolina) Weimao Ke presenting

The central observation here, albeit based on a very small study, is that you can pre-compute clusters statically and then just do tree cutting to select the appropriate clusters at run time, and get results (both in terms of precision/recall and user experience) that are just as good as you get from computing the clusters dynamically from the search results and from selected clusters of search results. Given that clustering algorithms run from O(kn) to O(n^2) (n=#documents, k=#classes), being able to run this offline could be a win. The main down-side, of course, is that this doesn't play well with a highly dynamic document set.
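
Roughly what I have in mind (a sketch of the offline-tree, online-cut idea using scipy's hierarchical clustering, not the authors' system; the random document vectors and the choice of k are placeholders):

    # Offline: build a cluster tree over the whole collection once.
    # Online: cut the precomputed tree and group the current results by cluster.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    doc_vectors = np.random.rand(1000, 50)          # stand-in for real document vectors
    tree = linkage(doc_vectors, method="average", metric="cosine")

    def clusters_for_results(result_ids, k=5):
        """Cut the precomputed tree into k flat clusters and group the current
        search results by the cluster each one falls into."""
        labels = fcluster(tree, t=k, criterion="maxclust")
        groups = {}
        for doc_id in result_ids:
            groups.setdefault(labels[doc_id], []).append(doc_id)
        return groups

    print(clusters_for_results([3, 17, 42, 256, 900]))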

Web 2.0


"A Statistical Comparison of Tag and Query Logs"
Mark J. Carman (University of Lugano), Mark Baillie (University of Strathclyde), Robert Gwadera, Fabio Crestani (University of Lugano) Mark Carman presenting

Given that you are interested in personalization, you need personalized judgements of relevance. You could use query logs for this, perhaps, but what if they aren't available? Perhaps you can use tag data as a proxy for that. The question is: is this really valid?

This work reported on an experiment to compare the vocabulary distributions from queries, tags, and content to see how closely they match. It compared AOL query log data against tag data from Delicious and categories from the Open Directory Project (DMoz).

The conclusions:
For the tags and queries for a URL:
  • the vocabularies are similar
  • term distributions are correlated but not identical
  • the similarity doesn't seem to depend on category or popularity
  • content vocabulary more similar to queries than to tags
  • query/tag distributions more similar to each other than to content
In the Q&A it was pointed out that given that the documents are selected based on the query terms, the closeness of query terms to document terms is unsurprising.
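
For my own notes, a minimal sketch of this kind of vocabulary comparison (my illustration with toy data, using the Jensen-Shannon distance as the yardstick; the paper ran its own statistical comparisons over real logs):

    # Build term distributions for the queries, tags, and content associated with
    # a URL, then measure how similar the distributions are.
    from collections import Counter
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def term_distribution(texts, vocab):
        counts = Counter(w for t in texts for w in t.lower().split())
        v = np.array([counts[w] for w in vocab], dtype=float)
        return v / v.sum()

    queries = ["bass fishing tips", "best bass lures"]
    tags    = ["fishing", "bass", "outdoors"]
    content = ["a guide to bass fishing lures and tips for outdoors anglers"]

    vocab = sorted({w for text in queries + tags + content for w in text.lower().split()})
    q, t, c = (term_distribution(x, vocab) for x in (queries, tags, content))
    print("query vs tag    :", jensenshannon(q, t))
    print("query vs content:", jensenshannon(q, c))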

"Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications"
Chen Lin (Fudan University), Jiang-Ming Yang, Rui Cai, Xin-Jing Wang (Microsoft Research Asia), Wei Wang (Fudan University), Lei Zhang (Microsoft Research Asia)
Chen Lin presenting

This talk was all about figuring out interesting things about threaded email discussions. The central assumption was that you had no information about who replied to whom in the thread.

The model involves post/term and topic/term matrices, with a minimization algorithm applied to a function that combines structural (reply-to) and semantic (topic) terms to infer topic/post relationships. The assumptions are that a thread may have several topics, each post is approximated as a linear combination of previous posts, a post is related to only a few topics, and each post is related to only a few other posts.
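
To get a feel for the "each post is a sparse combination of earlier posts" assumption, here is a toy approximation using an off-the-shelf Lasso (entirely my own illustration; the paper's sparse-coding formulation jointly models topics and structure, which this does not):

    # Regress each post's term vector on the term vectors of earlier posts;
    # a large, sparse, non-negative coefficient suggests a reply-to link.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Lasso

    posts = [
        "my new router keeps dropping the wifi connection",
        "which firmware version are you on? try upgrading it",
        "upgraded the firmware, the wifi connection still drops",
        "completely unrelated rant about the weather",
    ]
    X = TfidfVectorizer().fit_transform(posts).toarray()

    for i in range(1, len(posts)):
        model = Lasso(alpha=0.01, positive=True).fit(X[:i].T, X[i])
        likely_parents = [j for j, w in enumerate(model.coef_) if w > 0.05]
        print(f"post {i} looks like a reply to:", likely_parents)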

This model of the semantics and structure of a thread was then tested against other techniques on slashdot and Apple discussion forums for various purposes. The first was to construct reply relationships, based on the assumption that the reply is similar to the post it is replying to. The technique did better than the various baselines (reply to nearest, reply to root, basic document (term) similarity, etc.). It also did well for junk identification (against term frequency, SVM, and others), and for identifying the experts (by finding hubs in the graphs of replies).

The interesting thing is that the structural terms dominated, mathematically, which says that if you actually know what the reply-to structure is (which you do for both slashdot and the Apple forums, for example -- that provided the basis for evaluating the quality), you've won half the battle in solving some of these problems.

"Enhancing Cluster Labeling Using Wikipedia"
David Carmel, Haggai Roitman, Naama Zwerdling (IBM Haifa Research Lab)
David Carmel presenting

The problem here is to come up with a label for a cluster of documents. The standard technique is to identify the "important terms" associated with the cluster using various statistical techniques. The problem is that such terms may be statistically useful, but they may not actually help humans figure out what the cluster is about. Further, a good label may not even appear in the actual content.

They did a small study to look at whether an ODP category label could be identified as an important term from the content; in only 15% of the categories did the label appear among the top 5 terms. Example: the "Electronics" category gets terms electron, coil, voltage, tesla, etc. -- good terms, but none good as a label.

Their solution: use Wikipedia:
Use the cluster's content to search Wikipedia, and use metadata from Wikipedia to produce labels. The overall process is simple enough: cluster, get important terms using conventional techniques, take those as a query to Wikipedia (boosting the weights of the terms that were more important), extract metadata (title, category) from the top N relevant documents, and then choose the best. For judging the best categories, they found that you needed to avoid getting too many relevant documents from Wikipedia, and a ranking system based on propagating scores (based on the linkage of documents and labels and terms) worked better than one based on average mutual information. The technique seems fairly robust in the face of noise: even at 40% noise the results were reasonable.
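
A back-of-the-envelope sketch of the pipeline (mine, not theirs: I'm hitting the public MediaWiki search API and just taking article titles as candidate labels, rather than their own Wikipedia index and score-propagation ranking):

    # Take the cluster's important terms, search Wikipedia with them, and treat
    # the titles of the top hits as candidate human-readable labels.
    import requests

    def candidate_labels(important_terms, n=5):
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "list": "search",
                "srsearch": " ".join(important_terms),
                "srlimit": n,
                "format": "json",
            },
            timeout=10,
        )
        return [hit["title"] for hit in resp.json()["query"]["search"]]

    # e.g. candidate_labels(["electron", "coil", "voltage", "tesla"]) should
    # surface article titles that make far better labels than the raw terms.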

Final thoughts:
Wikipedia is great for the topics it covers, but it doesn't cover everything.

Similar techniques could be used for other mining tasks: word sense disambiguation, semantic relatedness of fragments, coreference resolution, multi-lingual alignment or retrieval, query expansion, entity ranking, clustering and categorization, etc. etc.

Efficiency



"Compressing Term Positions in Web Indexes"
Hao Yan, Shuai Ding, Torsten Suel (Polytechnic Institute of NYU)
Torsten Suel presenting

The paper looked at various techniques for compressing positions in an inverted index by exploiting clustering effects in relatively short documents (web pages). It then looked at some issues with query processing with position data.

The key observation is that positions in text are clustered; they are not evenly distributed. This observation can be exploited for specialized compression schemes (cf. IPC (Moffat/Stuiver 1996), LLRUN, and others). The problem is that these don't work so well with short documents.

A second observation is that when storing positions we already know the term frequencies and document size. Various experiments were done with adaptive encoding techniques: variants of Rice coding that use information about frequency and document size to divide the document into buckets and determine the basis of the coding, with and without adaptation from bucket to bucket; using regression to modify the encodings; and using Huffman coding with different tables, chosen based on various features.
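
For reference, the basic building block looks something like this (a plain Rice coding of position gaps, with the parameter guessed from document length and term frequency -- my simplified sketch, not any of the paper's adaptive schemes):

    # Rice-code the gaps between a term's positions; the parameter k is picked
    # from quantities the index already stores (document length, term frequency).
    def rice_encode(gaps, k):
        """Encode each gap as unary(quotient) + k-bit remainder, divisor 2**k."""
        bits = []
        for g in gaps:
            q, r = divmod(g, 1 << k)
            bits.append("1" * q + "0" + format(r, f"0{k}b"))
        return "".join(bits)

    def pick_k(doc_len, term_freq):
        """Crude parameter choice: aim the divisor near the average gap."""
        avg_gap = max(doc_len // max(term_freq, 1), 2)
        return max(avg_gap.bit_length() - 1, 1)

    positions = [3, 9, 11, 14, 40]          # positions of one term in one document
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    print(rice_encode(gaps, pick_k(doc_len=50, term_freq=len(positions))))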

Different terms behaved differently, but the compression gains from all these techniques were disappointingly modest. Maybe a completely different approach is needed: storing approximate positions, or something else. Maybe we need to arrange to avoid decompressing most position data: if you only use positions for the top N results (for a position boost), there is little difference in the results, for example.

One comment: given the large size of the collections these days, storing a few extra parameters to improve modeling is not a big deal.


"Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce"
Jimmy Lin (University of Maryland)

The paper looked at computing pair-wise document similarity using MapReduce and various algorithms, with various parameter settings. The data set was some Medline documents. The take-home message was that in a MapReduce environment, the cost of keeping track of intermediate data and shipping it around the network was great enough that the brute-force approach of just computing the dot products on all pairs of term vectors actually took less time, and since it was an exact result rather than an estimate (which is what the other approaches were) it gave better answers as well.

On the other hand, given that the study platform was Hadoop, which was evolving as the study was going on (so that the same test on different days would give very different times), the absolute performance measures here are probably not meaningful.
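
The brute-force baseline itself fits in a few lines; here's a local, in-memory sketch of the idea (not the paper's Hadoop job, and the documents are stand-ins):

    # Build tf-idf vectors (L2-normalised by default) and take all pairwise dot
    # products, which gives the cosine similarity for every document pair.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "effects of aspirin on cardiovascular disease",
        "aspirin dosage and heart attack prevention",
        "gene expression in fruit fly development",
    ]
    X = TfidfVectorizer().fit_transform(docs)
    similarity = X @ X.T
    print(similarity.toarray().round(2))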

"Efficiency Trade-Offs in Two-Tier Web Search Systems"
Ricardo Baeza-Yates, Vanessa Murdock (Yahoo!), Claudia Hauff (University of Twente)
Ricardo Baeza-Yates presenting

The problem context for this paper is a search system which has a small local server (cheap, fast) and a larger 2nd tier server (expensive, slow). Queries that cannot be answered by the local server must be referred to the 2nd tier server.

Approach 1: Search the local server and if that does not give you good results then refer the search to the tier 2 server
Problem: you end up waiting longer than if you sent to tier 2 in the first place.

Approach 2: Search both in parallel and then merge or merge only if you need to.
Problem: Increases load on tier 2 and so is more costly.

Solution: Predict which queries are local, and send to tier 1 only or to both in parallel, depending on the prediction. The paper presented a mathematical model for determining when this is worth doing, and a sample system performed acceptably.

Result assessor:
If you failed to predict that you needed tier 2, but you did need it: you have to wait while the query is sent on to tier 2; the query is serialized (as in approach 1).
If you predicted you needed tier 2 but you didn't, you added extra load to tier 2, but you don't have to wait for it to come back.

Predictor:
Train an SVM on precision in the top 20 results (P@20), using features from the literature for pre-retrieval prediction.
Post-retrieval assessment: if the local tier returns enough results, then it is OK.

You can improve this a lot by caching false negatives.
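
Pieced together, the routing policy looks roughly like this (my own sketch; the function names, the 20-result threshold, and the merge are placeholders rather than the paper's actual system, and the parallel case is modelled sequentially):

    # Route a query: if the predictor (or the false-negative cache) says the
    # local tier won't be enough, go to both tiers; otherwise try local first
    # and fall back, remembering the query so it isn't mis-routed next time.
    known_tier2_queries = set()        # cache of queries we wrongly kept local before

    def route(query, predict_local, search_local, search_tier2, min_results=20):
        if query in known_tier2_queries or not predict_local(query):
            return merge(search_local(query), search_tier2(query))
        results = search_local(query)
        if len(results) >= min_results:        # post-retrieval assessment
            return results
        known_tier2_queries.add(query)         # pay the serialization penalty once
        return merge(results, search_tier2(query))

    def merge(a, b):
        return a + [r for r in b if r not in a]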

A question came up about what the caching of whole results does to this model. It isn't clear.

Thursday, June 4, 2009

A Day at the Maker Faire

The Maker Faire has always been one of those things that sounded cool and that we've meant to go to, but hadn't got around to. This last weekend we went up and had a very tiring, but interesting, day out. We were a little hampered by the fact that we came in through the back (Caltrain) door, and didn't get a map. It turns out we missed two whole buildings, and didn't find the good food zone until after we had eaten some fairly lame sandwiches. Still, a good time was had by all.

The best part is talking with the individuals playing around with random hacks in their garage: "how does it work?" and "what is this?" always led to interesting conversations. There's really nothing so fun as listening to someone who is excited by what they are doing explaining it to you. Since we came in the back door, the first zone we entered was the musical hack zone, which immediately sold the boy child, who engaged in some in-depth discussions of types of pickups and digital effects on electric guitars. Our favourite here was the guy with the Wii controller in the head of his guitar and the flat-panel display in the body, wired up so that particular chords have particular imagery; flinging the guitar head down gave a wah-wah effect with a coordinated visual effect. Hitting a button gave a distorted sound, and a distorted image to match. Credit where credit is due: Ben Lewry, and he'll make one for you too, apparently.

Most interesting reward for lowest expectations: A small tent, looking for all the world like something you'd expect to find a Tarot reader in, tucked away among the 3-D cameras and glasses. Even the label wasn't particularly enticing: "holomagistics" or something of that sort. Inside the small, darkened tent was a small oscilloscope screen showing simple looped figures. Nothing too exciting, except the figures are 3-D. No fancy glasses. No need to stand in exactly the right spot. Amazing. If I understood correctly, this is done by reprogramming the scanning so that vertical scan lines alternate between different (computed) perspective views, and your brain performs the object fusion to give the sense of depth. Someone, somewhere, get this guy to think beyond "coordinated music and image displays" and invest.

The kinetic sculptures were engaging and beautiful (I love the displays in science museums of old brass-and-crystal machines; my favourite by far is the Fourier analysis machine in the London science museum, all gleaming spheres and dials), but my favourite art was Bulatov's metal sculptures. Cuteness points to the people with the candy-fab machine, slowly churning out sintered sugar creations. "Please don't eat the sculptures!" is not a sign you get to read every day.

We did appreciate the Exploratorium's little puzzles too, though. My favourite: hollow aluminum tube, about 4 or 5 feet long, and a magnet. Drop the magnet through the tube, and it takes a suspiciously long time to fall through. This is the Exploratorium, mind, so we set aside our initial suspicions of some kind of sleight-of-hand.

Somewhere late in the afternoon, we decided that we were well and truly tired out, so we sought out the Tesla coils, had a nice chat with the fellows there about their plans to build 12-story tall Tesla coils in the Nevada desert, and watched the lightning show. Even the 12 foot scaled down versions, running well below capacity, were impressive enough, and a good way to sign off the day.





Friday, April 10, 2009

Invidious comparisons

What with World Cup Qualifying, the UEFA Champions League, the CONCACAF Champions League, the Premiership, MLS, and the UEFA Cup, I've been spending a lot of hours watching football the last few weeks.

While it is true that it is hard to beat the top 4 Premiership teams for skill, as a competition the Premier League has become downright boring. It is telling that in any given season the question of which team will come in 5th provides the most drama, and that the current spark of interest in games like Man Utd vs Aston Villa is based mainly on the fact that Chelsea and Man United have been showing cracks in their invulnerability. The UEFA Champions League has essentially become the Premier League with the role of whipping boy played by various French teams. Oh, what a shock! Liverpool against Chelsea. There is a charm in watching Chelsea dink their little triangles up and down the field and in watching Ronaldo perform his little steppy-steppy moves, but honestly, is there really much surprise in watching Man Utd win on a late goal? Again? Yawn.

It used to be that the Euro matches provided the same pleasure that the World Cup did: watching teams you'd barely heard of and learning about new fantastic players you'd never seen before. Given that the same dozen top teams swap around the same top players, you don't get that so much, except in some of the early rounds.

On the other hand, the World Cup qualifiers have been rather fun, especially given the plethora of new European teams to enjoy: Kazakhstan has a young and lively team that's rather fun to watch, if somewhat overmatched -- who knew? We'll miss you in the final.

Other teams we'll miss: Gosh, Man City versus Hamburg. Two teams charging up and down the field going at it hammer and tongs. There was more pure entertainment in the first half-hour than in half a season at Stamford Bridge. Holy smokes!

The biggest entertainment surprise for me has been the CONCACAF Champions League semifinals. Sure, I'd heard of Cruz Azul and Atlante, but never really seen them play. And Santos Laguna? The Puerto Rico Islanders? Not on my radar. I confess to having watched the tournament only intermittently, largely to track the team-formerly-known-as-Earthquakes. The semi-finals were fantastic fun, however (at least for a neutral). I once heard the perfect soccer match described as having at least one change in the lead, at least one great goal, and at least one moment of controversy. Both the semi-finals obliged, although it was hard to beat the Atlante game for sheer drama: an extra-time (winning) penalty and a 3-red-card fight 3 minutes after that. Sad as I was that the Dynamo didn't make it this far, I'm looking forward to Cruz Azul vs Atlante in the final, which is something I hardly expected.

Speaking of the Dynamo, they came to town a couple weeks ago to play the team-currently-known-as-Earthquakes. It was just as well that Ching was off on World Cup duty, because if he had been the one finishing chances instead of Mullen, things might have turned out much worse for the home side. (I've whined before about the MLS continuing with a full program during qualification week: I see they now partly accommodate the national team by giving their wannabe Man Utd of the US the week off. Funny how LA always gets the special treatment.) It was about a half-hour into it, about the time Mullen skied his third "this is a goal, oh wait, no it isn't" chance, that I said to my son "I don't think this is going to stay scoreless" and he said "I think once there is one goal, there will be a boatload." Truer words were never spoken: 15 minutes later it was 3-2. It is odd and scary for a Yallop team to have such a fragile looking defense. It is going to be a long season if they don't tighten that up.

Wednesday, March 25, 2009

Season Two of the New-New-New-Quakes

Earthquakes 0 New England 1

Drat. On paper a midfield with Huckerby, Corrales, Convey, and Alvarez looks pretty hot, and there were brief flashes of brilliant football on Saturday. Mostly, however, we saw the classic weakness of the Quakes (both the new-new-new-Quakes and the new-new-Quakes under Yallop's tutelage previously), which is to say: nice approach work, lousy finishing. Or no finishing at all. Once again we revive the impassioned plea of Quake fans: "SHOOOOT!"

We also saw Huckerby spending far too much of the match completely disconnected from play, out on his own on the wing. Has someone told him he's not allowed to stray 5 yards from the touchline or something? Or has the loss of the one guy on the field capable of hitting him at full sprint with a 40-yard cross-field pass (Ronnie O'Brien) put a huge hole in the team?

On the whole, the Quakes played the better football, but neither team looks particularly good, and after the goal the Quakes looked quite ragged.

And, astonishingly enough, this is the first time in many years of season tickets that I have got seriously rained on at a Quakes match. Cold rain down my neck just about summed up this opener.

Next up: The relabeled-highjacked-still-playing-in-a-college-stadium-I-notice-new-new-Quakes.

Thursday, March 12, 2009

More Security Theatre

Surely, with the California economy in freefall, the budget process hopelessly broken, and a water crisis about to cause major damage to both agriculture and fishing, Joel Anderson (R, 77th) has more important things to worry about than the fact that shock! you can see both ground-level and satellite pictures of buildings on the Internet.

But, no, one of the brilliant minds that held California's budget hostage for months and was a cheerleader for the misbegotten anti-vehicle-registration-tax has decided that his next mission is to censor Google Earth, and so is putting forward AB-255 (see http://www.cnn.com/2009/TECH/03/11/google.earth.censor.california/).

The idea is to make Google blur images of "important" buildings: government buildings, schools, churches, and medical facilities. Supposedly this is to fight terrorism. How a blurry image of a hospital on Google Earth will prevent terrorists from parking a truck bomb in front of the main entrance remains unclear. How this will prevent terrorists from going to the hospital's own web site which probably has useful things like maps to help people find their way and a street-view picture of their main entrance anyway also remains unclear.

Why churches? Can't God protect His own? Anderson states that churches and synagogues have been attacked. True. I don't see any evidence that tossing a firebomb through a window was aided and abetted in any way by the presence of a picture of said building on the Internet. Indeed, a map indicating the location was probably more useful. Shall we censor that next? And then the white pages listing giving the address?

Why not shopping malls? Theatres? Sporting venues? Don't people congregate there too?

It is ludicrous to suppose that this bill would prevent any terrorist attack, or that it would seriously hamper in any way the efforts of those intent on such attacks.

No, this is not about counter-terrorism at all. It is about the appearance of counter-terrorism. And what it is really about is taking a first step into censoring content on the Internet that some government busy-body finds objectionable. Anderson says that this would only be doing what other governments around the world have done. What do you want to bet that a year from now, the argument is that since we already censor pictures of government buildings, censoring information about government officials is just the next logical step in the "War on Terror"?

Wednesday, January 7, 2009

Fun with Statistics

I was working with my daughter on her AP Stats homework the other night. The section is all about using the normal model (the old Bell curve of infamy) based on information about proportions. Since normal distributions pop up fairly often in the real world, and proportions are a common way of expressing information about distributions, this is all good stuff with wide applicability.

It pays to be careful however. Not every distribution is a normal distribution (the distribution of ages in Gaza, for example, is strongly skewed toward the young) and not every sample is a reliable one. One thing I like about the presentation in this section is that all the problems require a check of a set of assumptions that should be met for the normal model to be applicable. Even better, some of the problems violate one or the other of the assumptions. Here are the conditions:
  1. The randomization condition: The sample needs to be an unbiased representative selection from the population.
  2. The 10% condition: The sample represents less than 10% of the population.
  3. Success/failure conditions: The sample needs to be large enough that the expected numbers of successes and failures are each at least 10. So if the sample size is n and the proportion is p, then both np >= 10 and n(1-p) >= 10.
Undoubtedly the last couple of assumptions will get mapped to some stronger mathematical basis involving confidence levels and the like later on (Statistics is annoyingly different from the rest of mathematics, being built from the top down instead of the bottom up) but this is a nice set of rules of thumb to apply to many statistical claims floating about out there.

So, given a proportion p and a sample size n that meet these conditions, what can we do? The fundamental operations all take place by computing areas under the normal curve to give probabilities. For example, given that 13% of the population is left-handed, an auditorium with 120 seats, of which 15 have left-handed desks, what is the probability that there will not be a left-handed desk available for some poor lefty?

If we know the number of standard deviations away from the mean this (15 lefties) represents, there are standard tables and fancy calculators that give the probability that a value is less than (or greater than, by subtracting from 1) this value. The standard deviation is easy enough to compute: σ = √(p(1-p)/n) ≈ 0.0307

Now we want to know P(#lefties > 15) based on this model. The z-score (number of standard deviations) is computed simply as well: z = (p̂ - p)/σ = (0.125 - 0.13)/0.0307 ≈ -0.163
Looking this up, we get a cumulative probability of about 0.435 that the sample proportion will be below 15/120, so the probability that it will be above is 1 - 0.435. So roughly 56% of the time, some poor lefty will have to do without a left-handed desk -- not so surprising once you notice that the expected number of lefties, np = 15.6, is already more than the 15 desks available.
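
A quick sanity check of those numbers (a little Python sketch using scipy; a stats table or a graphing calculator does the same job):

    # Check the normal-model answer, and compare it to the exact binomial.
    from math import sqrt
    from scipy.stats import norm, binom

    p, n, desks = 0.13, 120, 15
    sigma = sqrt(p * (1 - p) / n)            # ~0.0307
    z = (desks / n - p) / sigma              # ~ -0.163
    print("normal model, P(#lefties > 15):", norm.sf(z))          # ~0.56
    print("exact binomial, P(#lefties > 15):", binom.sf(desks, n, p))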

All this assumes that we meet the conditions in the first place, however. So, let's look. The success/failure condition is the easiest: np = 120*0.13 = 15.6 and n(1-p) = 120*0.87 = 104.4, both of which are greater than 10. Does the sample represent less than 10% of the population? Well, that depends on what you take the "population" to be. You have to be a little careful to avoid getting fooled. Is the population "all possible students everywhere"? Then the sample surely does represent less than 10% of the population. If this auditorium is for the exclusive use of the fine folks of the East Krumwich School for the Sinister Arts, maybe a better characterization of the population is "all possible students of EKSSA" and it could be that 120 is not less than 10% of the population. But in all likelihood, the 10% condition will be met, or the auditorium was a very bad investment. What about randomization? Again, it all depends. If this auditorium is at a highly selective college, then you have to ask whether the selection the college applied in accepting students biases the sample in an important way with respect to left-handedness.

The randomization condition is where the assumptions of normality can really head south. It is very easy to come up with samples that are biased in some way or other. In the recent election certain national polls were biased against younger voters because the sampling was conducted exclusively on land-lines, and a larger percentage of younger voters use cell-phones and have no land-line. Similarly, a survey of the number of people who make confidential information available over the web by looking at the number of FaceBook users who post confidential information is useless: people who aren't going to make confidential information available over the web are less likely to use FaceBook in the first place.

The take-home message here is, whenever you see some statement of the form "X% of Ys are Z", your very first question needs to be "how could the selection of Ys be biased?"

It is perhaps worth pointing out that there is also a fundamental assumption in applying the normal model in this way that you know what the proportion actually is. Since the usual way to come up with p is by performing some kind of statistical sampling, there is an uncomfortable circularity possible. Maybe the proportion of lefties in the world at large is 13%, but the proportion of lefties at an art school is going to be higher, because the population is different. (Or, alternatively, you can say that the selection of a sample of students in an art school auditorium is not an unbiased sampling of the population of students as a whole.)

Another interesting aspect of the normal model is that everything hinges on the variance. A distribution with greater variance is more flattened out, with larger tails; one with less variance is squished in towards the middle, with less in the tails. That is, loss of variance means reversion to the mean, and reversion to the mean means fewer extremes, and that means that the expected difference between two selections is less. At some level this is obvious (true by definition), but the implications can be interesting and are not always appreciated. Stephen Jay Gould devoted a whole book to this idea (Full House, ISBN 978-0609801406) with examples from such diverse arenas as baseball and the Cambrian explosion. The capsule summary is that in competitive arenas there is a long term tendency to reduce variance, which means that the difference between the worst and the best gets smaller, which means that you no longer see the 22-0 blow-outs in English league play that you did in the 1880s, and you can make the case that it was easier for Babe Ruth to rack up a lot of home runs than for Hank Aaron, because Babe Ruth got to play against relatively worse pitchers more of the time. The long term trend is towards mediocrity. (Although please note: the absolute mean may in fact be higher, and probably is, as there is a convergence on the techniques that get the most benefit to the limits of what is possible. So "mediocre" in the sense of "closer to the mean" not in the sense of "bad".)

I think you can make the case that this also appears to apply to the world of politics: as time goes by, the relative excellence of the candidates tends to get closer and we have fewer really bad candidates as well as fewer really outstanding candidates, and the result is closer elections. Where are the statesmen of the caliber of the founding fathers (set aside the bias from the rosy glow of the passage of time)? More unlikely to appear. On the plus side, grossly inadequate candidates are less likely as well. The only way to escape from the trap of mediocrity and small differences in excellence is to change the game and jump to a completely new distribution: take steroids, start leveraging new social media, get a fancy high-tech swimsuit.