5/24/12

talent court

Summary

Check out TalentCourt.

Introduction

This post expands upon the ideas of the quality game post. It is an idea dump that I'd like to turn into a paper at some point (note that some of this 'dump' was written by Allison Moberger, who by coincidence is currently at the top of the TalentCourt writing rankings [or was, when this post was originally written], so the bad writing here is probably mine).

Motivation

In short, we want real-time access to experts to perform micro-tasks. For example, we might want to hire an expert JavaScript programmer for the next 10 minutes to help write a utility method. Here are some example use-cases:

Logo design: Imagine designing a logo in the following way: pay 20 expert sketch artists $1 each for a 5-minute sketch. Then use traditional crowdsourcing to find the best 5 sketches. Then pay 5 expert Photoshop artists $5 each for a 20-minute mockup of the best sketches. Then use traditional crowdsourcing to find the best mockup, and pay 1 expert designer $20 for 1 hour to put the finishing touches on the best mockup. Now we've paid $65 to the experts, plus the cost of the crowdsourced voting rounds, for a total of roughly $100. It would be interesting to see how this logo compares to what is generated for a similar amount on a site like 99designs. The advantage of this approach is that nobody works without pay, which may be more efficient, e.g., more quality per dollar.

Micro-outsourcing: Max Goldman has built Collabode, a system that supports real-time collaborative programming. One use-case mentioned in the Collabode paper is micro-outsourcing, where a main programmer delegates tasks, like filling in method bodies of a new class, in real-time as part of their flow. It would be interesting to see if this style of programming actually works. One could also imagine a similar working style for writing, where a main writer writes an outline for the paragraphs of a paper, and then hires expert writers in real-time to flesh out the outline. This might allow the writer to spend more time thinking at a higher level about the overall ideas and organization of the paper, and "directing" the creation of the paper.

Self-repairing web services: Imagine that you have written a web service that periodically grabs data from one source (e.g. Rhapsody), munges it, and sends it to another service (e.g. last.fm). Now imagine that the data source changes the format of their data, which breaks some regular expressions in the parsing script, raising an exception. It might be cool if the system could handle the exception by automatically hiring an expert to repair the script.
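As a sketch of what that exception handler might look like (the hiring endpoint, its URL, and all of its parameters are hypothetical; no such API exists yet):

    import re
    import requests  # third-party HTTP client, used for the hypothetical hiring API

    def parse_track_listing(raw_html):
        # A brittle regex that assumes the data source's current markup.
        match = re.search(r'<td class="track">(.+?)</td>', raw_html)
        if match is None:
            raise ValueError("track pattern no longer matches source data")
        return match.group(1)

    def sync_once(raw_html):
        try:
            return parse_track_listing(raw_html)
        except ValueError as err:
            # Post a repair task with the failing input attached, so the hired
            # expert can see exactly what broke.
            requests.post("https://example.com/talentcourt/hire", json={
                "skill": "python-regex",
                "min_rating": 1800,  # hypothetical rating threshold
                "task": "repair parse_track_listing",
                "failing_input": raw_html,
                "error": str(err),
            })
            return None  # skip this sync; retry after the script is patched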

Problem

The essential problem is this: if you are going to hire someone in real-time, you need some very fast method of determining whether they are qualified for your job, and currently there is no such method. There are lots of systems that try to do this, but they all fall short in some way:

Standardized tests: The most commonly used method of identifying skilled applicants is the standardized test. Nearly everyone has been subject to this, as the use of tests like the SAT, ACT, GRE, MCAT, and LSAT is widespread throughout all levels of the academic community. If an applicant performs poorly on a standardized test, the admissions office can simply dismiss the application and move on to others. However, there are downsides to standardized testing: tests are expensive to create and administer, difficult to keep secret, and easily gameable. Rather than learning the math or language skills through life experience, many applicants take classes that teach specifically to the test, which can produce a high score even when the underlying knowledge is lacking. For online tests, cheating is even harder to prevent. For example, oDesk has standardized tests for many skills, but when I type "odesk java test" into Google, it auto-suggests "odesk java test answers", and the top link has the answers.

Ratings: Sites like eBay and oDesk allow users to rate prior transactions. On oDesk especially, this creates a "cold-start" problem: one needs work to get a rating, but needs a good rating to get work. Ratings also suffer from "grade inflation": since anything less than the best rating could hurt a candidate's future prospects, employers feel pressure to hand out the best rating unless something went very badly. Ratings are also static; a past employer's rating tells a new potential employer nothing about any skills the candidate has gained since that job ended. Finally, ratings do not include a notion of how hard a task was, or how closely related it is to the current employer's needs.

Portfolio: Artists and designers often present a portfolio of their work that employers can use to judge their skill and style. Unfortunately, portfolios can also be gamed: people can post work that they didn't do. A more subtle and less intentional way of cheating is for someone to include work in a portfolio that they merely contributed to, without explaining exactly what they did. This is understandable in a close collaboration, where it is difficult to tease apart who contributed what, but the net effect in any case is that portfolios cannot always be trusted.

Interviews: Another option, frequently used by technical employers, is the expert interview; that is, applicants have a face-to-face interview with an expert in the field, whose knowledge allows them to evaluate the skill of the applicants. This also tends to be gameable to some degree, as a large portion of the internet seems devoted to preparing interviewees to answer "Microsoft questions". It also has a problem of scale: the time and energy required to find the best candidates grows prohibitively with the size of the applicant pool. Finally, it has a flaw at the meta-assessment level: how does one identify an expert interviewer, except through an interview with another expert? There is no good way to "assess the assessors".

The last two techniques are not really suitable for real-time hiring anyway, because they are subjective, and generally require a human to spend time converting the material in the portfolio or interview into a yes/no decision about whether to hire. In practice, this assessment can take more time than the micro-task itself.

Solution

Our essential idea is to build competitive games around different skills, and use an Elo-style rating system to encode the skill of the players. One way to think of these games is as games with a purpose, where the purpose of each game is to evaluate the skill of the players. Depending on the skill, the game may involve players doing some or all of the following:
  • generating questions
  • answering questions
  • comparing answers
Why do we think this will work? We see a number of potential advantages:
  1. By generating random questions, or having players generate new questions as part of the game, we can afford to keep the lifetime of questions very short, to mitigate the danger of people posting "cheat sheets" online.
  2. By having both players answer the same question, we avoid needing to know how difficult a question is: it is equally difficult for both players, and each player just needs to answer it better than their opponent.
  3. By having humans evaluate the answers, we allow for "essay questions", or questions with subjective answers, which guards against many issues with multiple choice questions.
  4. By making each contest small, we allow players to increase their score without blocking out a large chunk of their schedule (e.g., 5-minute games as opposed to a 30-minute or 1-hour test).
  5. Because the Elo rating system is dynamic, it notices when a player's skill increases (as opposed to standardized tests, which may make people wait a month or a year before retaking a test, presumably to prevent them from simply remembering the questions from last time); a minimal sketch of the update appears after this list.
  6. Again because of the nature of the Elo rating system, a player may not need to play many games before the system has a good idea of how skilled they are (similar to the adaptive questions in the GRE).
  7. Because scores are relative to other people, the Elo rating system can differentiate between people over a very broad range of skill levels, as opposed to standardized tests like the GRE math test, where many people get a perfect score and the test has no ability to discriminate between those people.
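To make points 5-7 concrete, the Elo machinery is tiny. Here is a minimal sketch in Python, with an illustrative K-factor of 32 (TalentCourt's actual parameters are not specified in this post):

    def elo_expected(rating_a, rating_b):
        # Probability that player A beats player B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def elo_update(winner, loser, k=32):
        # The winner gains exactly what the loser loses, and the gain shrinks
        # as the winner's expected score approaches 1.
        delta = k * (1.0 - elo_expected(winner, loser))
        return winner + delta, loser - delta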
What have we done so far?

TalentCourt currently includes a Writing game and a Drawing game. Each contest proceeds as follows: two users are shown a random prompt (3 random words for writing; 1 random word for drawing). Each user then has 5 minutes to write a short passage using all three words in some meaningful way, or to draw the given word. Then three different people are asked to select the best passage or sketch. We previously gave voters instructions like choose "the most natural-sounding paragraph", but the current version gives no instructions for voting. The entry with the most votes wins; the winner's score increases and the loser's score decreases. Scores are updated according to the Elo rating system (and will probably move to the TrueSkill algorithm soon).
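Since we will probably switch to TrueSkill, here is roughly what resolving a single contest might look like using the existing Python trueskill package (whether TalentCourt would actually use this package, and these exact data structures, is an assumption):

    from collections import Counter
    import trueskill  # PyPI package implementing the TrueSkill algorithm

    def resolve_contest(entry_owners, votes, ratings):
        # entry_owners: the two competing user ids; votes: the ids chosen by
        # the three voters; ratings: {user_id: trueskill.Rating}, updated in place.
        winner = Counter(votes).most_common(1)[0][0]
        loser = next(u for u in entry_owners if u != winner)
        # Unlike Elo's single number, a TrueSkill rating is a (mu, sigma) pair,
        # so the system's uncertainty about a player shrinks as games accumulate.
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(
            ratings[winner], ratings[loser])
        return winner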

Cheating

We want to use the scores from these games for hiring decisions, which provides a lot of incentive to cheat. We designed the games to mitigate various methods of cheating:

Perhaps the most obvious way to cheat at either game is to use Google. In the Writing game, one could enter the three words as a query to try to find a paragraph someone else has written using the words. In practice, the search results tend to be webpages that contain all 3 words without the words appearing close enough together to use as input to the game. Hopefully, under time pressure, finding a short passage with all 3 words in a sensible context is more difficult than writing one from scratch. If it becomes a problem, we could also use four words instead of three.

Similarly, in the Drawing game, one could search for the word and find an image, but the data uploaded to the server includes a sequence of painting strokes (rather than the raw pixel data), making it difficult to cheat by uploading a pre-drawn image. This does not guard against the possibility of a user tracing an image or drawing using another image as a reference. (Note: actually, the current implementation does upload the raw image data in addition to the strokes, but this is a temporary work-around for some technical issues.)
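We haven't described the actual wire format here, but one plausible shape for the stroke data (illustrative, not our actual implementation), along with a cheap server-side sanity check it enables:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Stroke:
        color: str                              # e.g. "#1a1a1a"
        width: float                            # brush width in pixels
        points: List[Tuple[float, float, int]]  # (x, y, t_ms) samples

    def plausible_drawing(strokes, round_ms=5 * 60 * 1000):
        # A forged upload needs a realistic stroke-by-stroke history, not just
        # pixels: timestamps must be non-decreasing and fit the 5-minute round.
        times = [t for s in strokes for (_, _, t) in s.points]
        return bool(times) and times == sorted(times) and times[-1] - times[0] <= round_ms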

A different way to cheat would be to game the votes, rather than the inputs. It is possible for users to have friends, or other accounts, vote for their entries; to help prevent this, we do not reveal the authors of each entry to voters until after their votes have been cast, but there is still the possibility that users could communicate outside the system. We also do not let users choose which contests to vote on, so in a liquid market, the likelihood of voting on a friend's work would hopefully be small.

A related way to cheat would be to have confederate competitors deliberately perform poorly. The design of the game makes this difficult for a number of reasons: users do not get to choose who to compete against, and are not told who they are competing against until the voting on their input is closed; the Elo rating system also makes it difficult to gain substantial rating increases this way, since large score increases come from defeating superior opponents, not from beating those who perform poorly.
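To put numbers on that last point (again with an illustrative K of 32): a 2000-rated player farming wins against a 1600-rated confederate gains almost nothing per game, while an upset against a 2400-rated player pays well.

    def elo_gain(own, opp, k=32):
        # Points gained by `own` for a win, under the standard Elo update.
        return k * (1.0 - 1.0 / (1.0 + 10 ** ((opp - own) / 400.0)))

    print(elo_gain(2000, 1600))  # ~2.9 points per win over a weak confederate
    print(elo_gain(2000, 2400))  # ~29.1 points for an upset of a stronger player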

Future Work

What ideas do we have for the future?

Programming Game: There are two ways this might work:

Way 1: Make it similar to the Writing game, but instead of words, use methods from the standard API of the language being tested, and ask people to write a short program that uses all 3 methods in some meaningful way.
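For instance, a Python contest prompt might be the three standard-library names sorted, enumerate, and str.join (a made-up prompt, just to illustrate), and a 5-minute entry might look like:

    # Hypothetical entry for the prompt: sorted, enumerate, str.join
    def ranked_list(names):
        # Produce a numbered ranking of the names in alphabetical order.
        lines = ["%d. %s" % (i + 1, name)
                 for i, name in enumerate(sorted(names))]
        return "\n".join(lines)

    print(ranked_list(["carol", "alice", "bob"]))
    # 1. alice
    # 2. bob
    # 3. carol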

Way 2: Have people generate programming interview questions. This idea is less baked: it is not clear whether people will be good at coming up with questions, so our first tests will involve simply asking programmers to come up with questions, to get an idea of what sorts of questions they'll ask.

Graphic Design Game: This could be similar to the sketch game, where there is an HTML-based drawing tool, except this one might include a set of shapes and text that the designer is allowed to move/scale/rotate and set the color of. The prompt might also include a word like "peaceful", "energetic", or "efficient" that the design should convey.
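A sketch of the data such a tool might record for each element (entirely speculative, mirroring the stroke format above):

    from dataclasses import dataclass

    @dataclass
    class Element:
        # One shape or text item placed by the designer; the allowed
        # operations map directly onto these fields.
        kind: str        # "rect", "ellipse", "text", ...
        x: float         # position (move)
        y: float
        scale: float     # (scale)
        rotation: float  # degrees (rotate)
        color: str       # (set the color)
        text: str = ""   # only used when kind == "text"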
