Tuesday, November 29, 2005

A Matter of Trust

There's an old saying that "Trust must be earned." As it stands right now on Amazon's Mechanical Turk, everyone starts out being trusted. That's very noble, but it causes problems when the script kiddie barbarians show up at the gate and they get waved in with the rest of us.

My experience as a parent and what I learned while studying for my education degree tell me that trust must be given to a child as they show more responsibility. If they stay within the box that defines the rules of the adult/child relationship, the box expands and they are given more freedom. If they step outside that box, it shrinks down to a more restrictive level until the child earns the trust to expand it again.

My computer science degree and my experience in the IT field allow me to define a way to reflect this in a system to govern hit acceptance in the MTurk environment. As I understand it, the method of accepting or rejecting hits is left to the client, so these are my thoughts on how to do that in a way that rewards success and limits the damage that auto-accept and submit scripts can do. I'm going to use the A9 Image Adjustment (IA) hits for this, but the method should work with any of the hit types seen so far on MTurk.

If we extrapolate my example of the adult/child relationship, then we have Amazon as the adult and the workers as the child. I'm not implying of course that the Turk Wurkers are children, but the relationship is similar because the adult/A9 has no reason to trust the child/worker when the relationship begins. If a method is put into place that allows trust by A9 to be measured and modified for each worker, I think many of the current problems with IA hits could be eliminated.

To begin, a variable has to be assigned to each Turker that measures Trust. This would be a simple integer value and would begin at a low number of say five (5) for new workers. A value of zero would imply that there is a neutral level of Trust between A9 and the worker. The value would increase as trust is gained and decrease as trust is removed. Any negative value would imply lack of trust and will be discussed later.

In order to determine whether you trust a new worker, you have to have a basis to judge them against. This requires a seed of absolute trust. For each group of images that you plan to test in the MTurk system, you have internal workers or admins select a small percentage of them and choose the definite correct answer to establish them as Trust Markers (TM). These sets of images can be placed semi-randomly in the work flow. Each correct answer will raise the Trust level of a worker by one point. The higher the Trust value is for a given worker, the less often these Trust Markers need to show up for that worker. In this way A9 would have to spend less on these types of images as more trust is gained. This is analogous to a company spending more on a worker when they are first hired in order to train them.

An incorrect answer to a TM would result in one point being subtracted from that worker's Trust value. Once Trust goes negative, the worker would no longer be allowed to accept any hits until Trust reaches zero again. Negative points could decay at a rate set by A9, so it could take an hour or a day before the worker could try again. The value could also be allowed to increment back to the starting point of five after a certain amount of time. This would allow for a bit more leniency.
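As a rough sketch of the bookkeeping described so far, here is how the Trust value might be tracked in Python. The class and method names are purely illustrative and are not anything MTurk or A9 actually exposes. The decay step is also an assumption: it moves a negative value one point toward zero each time some scheduler (not shown) calls it, at whatever rate A9 picks.

```python
from dataclasses import dataclass

STARTING_TRUST = 5  # starting value suggested in the post


@dataclass
class Worker:
    """Hypothetical record of one Turker's Trust value."""
    name: str
    trust: int = STARTING_TRUST

    def can_accept_hits(self) -> bool:
        # Zero is neutral and still allowed; any negative value is a lockout.
        return self.trust >= 0

    def record_trust_marker(self, correct: bool) -> None:
        # One point up for a correct Trust Marker, one point down for a miss.
        self.trust += 1 if correct else -1

    def decay_penalty(self) -> None:
        # Assumed decay rule: a negative value creeps back toward zero,
        # one point per call. Non-negative values are left alone.
        if self.trust < 0:
            self.trust += 1
```

A new worker who misses six Trust Markers in a row would drop from 5 to -1 and be locked out until one decay step brings the value back to zero.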

Just the addition of this functionality would severely hamper how much damage a scripter could do to the results, but one more feature is needed to eliminate the need to pay them for the random hits they did get correct before being locked out. I call this feature a Trust Lock.

A Trust Lock is created by taking a standard set of A9 images, including the "None of the Others" (NotO) image, and changing one of the images so that it reads "Submit This Image" in the same style as the NotO image. The worker would obviously be required to submit that particular image to answer the hit correctly. The Trust Lock would be dropped in much less frequently than the TMs, but answering the Trust Lock image set incorrectly would lead to an immediate Trust value of negative one (-1). This sounds harsh, but only a script or someone not paying attention would miss one of these.

In addition, all hits submitted since the last time you correctly answered a Trust Lock hit would be automatically rejected, whether they were answered correctly or not. Again, this sounds harsh but there's no reason to ever answer one incorrectly unless you're running an auto-accept script or working at a pace that is too high.
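The Trust Lock rule can be sketched as a small ledger that holds every submitted hit until the next Trust Lock is answered. This is only an illustration: the class, its fields, and the idea of tracking hits by a string ID are all assumptions made for the example. A correct answer releases the held hits for processing, and a wrong answer rejects them all and sets Trust to -1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkerLedger:
    """Hypothetical per-worker ledger of hits awaiting a Trust Lock."""
    trust: int = 5
    pending: List[str] = field(default_factory=list)   # submitted since last Trust Lock
    eligible: List[str] = field(default_factory=list)  # released for normal processing
    rejected: List[str] = field(default_factory=list)  # thrown out by a failed Trust Lock

    def submit_hit(self, hit_id: str) -> None:
        # Ordinary hits wait here until the next Trust Lock is answered.
        self.pending.append(hit_id)

    def answer_trust_lock(self, correct: bool) -> None:
        if correct:
            # Everything submitted before this Trust Lock can now be processed.
            self.eligible.extend(self.pending)
        else:
            # Everything since the last passed Trust Lock is rejected,
            # right or wrong, and the worker is locked out immediately.
            self.rejected.extend(self.pending)
            self.trust = -1
        self.pending.clear()
```

Holding hits in a pending list is one simple way to make the rejection automatic; a real system would presumably do this in its database rather than in memory.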

So let's use a few real world examples to walk through the methodology I just covered. First, let's say Johnny Turker heard about MTurk from his roommate and logs in and creates an account. He's assigned an initial Trust value of five (Trust = 5). Johnny then goes off and selects a group of IA hits to work on and starts turking.

At some point within the first 20 hits, Johnny is unknowingly presented a Trust Marker hit, and being inexperienced, gets it wrong. This drops his Trust value to four. At this point the MTurk system may decide to assign the next TM within 10 to 15 hits, since the trust level is less. If he had answered the first TM correctly, the MTurk system may wait until 20 to 25 more hits, depending on the algorithm used to determine how often these hits are presented to the worker. The system could also decide to immediately drop in a Trust Lock hit since missing the first TM that was presented could raise suspicions of him using a script.

Regardless, within the first 50 hits Johnny is presented with his first Trust Lock hit and he answers it correctly. At that point all the hits he submitted before the Trust Lock are eligible to be processed, while all the hits after this point will be processed when the next Trust Lock is answered correctly.

After doing about 100 hits, Johnny calls it a night and logs out with a Trust value of, say, six (6), since he improved how often he answered the TMs correctly. Later that night, Johnny falls under the influence of an evil script kiddie buddy down the hall in his dorm, who tells him he has a script that will randomly answer IA hits and make him lots of money while he sleeps.

Johnny installs the script and it starts running. Within 20 hits or so it encounters its first TM hit. It has a 1 in 7 chance of answering this correctly, which is possible but unlikely. Since Johnny's Trust value is still relatively low, he will also be presented with a Trust Lock hit soon. Between the two types of Trust hits, it is unlikely the script will run for very many hits before Johnny is locked out of accepting any more hits. This also makes it unlikely that A9 will have to pay him for the submitted hits, since they can know with some confidence that they're likely junk submittals.
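To put a number on "unlikely", here is a small calculation of how quickly a random-answer script gets caught by Trust Locks alone. It assumes the script picks uniformly among seven images on every hit (the same 1 in 7 odds mentioned for a TM) and that each Trust Lock is an independent guess. The function name and its parameters are made up for the example.

```python
from fractions import Fraction


def lockout_probability(trust_locks_seen: int, options: int = 7) -> Fraction:
    """Chance that a random guesser has failed at least one Trust Lock."""
    # Probability of guessing a single Trust Lock correctly.
    p_pass = Fraction(1, options)
    # The script survives only if it guesses every Trust Lock correctly.
    return 1 - p_pass ** trust_locks_seen


print(lockout_probability(1))  # 6/7
print(lockout_probability(2))  # 48/49
```

Under those assumptions the script is locked out at its first Trust Lock about 86% of the time, and by its second about 98% of the time.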

The Trust value also allows Amazon and A9 to remove more of these restrictions after Trust reaches a certain level. The restriction on hits being processed until a Trust Lock is passed could be removed after Trust reaches a value of 25 or whatever level is determined to be appropriate. The Trust level could also be used to allow access to other new types of hits that pay better or that are more sensitive to script manipulation.

A possible algorithm for how often a turker is presented with TMs could be (TL * 20) - random(1 ... TL * 10), where TL is the turker's Trust Level. So at a Trust Level of 10 a turker would see a TM within the next 100 to 199 hits, which is 200 minus a random number between 1 and 100. Trust Locks could occur more randomly, but one should probably be presented to the turker within a few hits of a TM being answered incorrectly.
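That spacing formula translates almost directly into Python. The function name and the optional random-generator argument are just for illustration. One assumption worth flagging: the formula has no valid range when the Trust Level is below 1, so this sketch raises an error there and leaves the low-trust schedule to some other rule.

```python
import random
from typing import Optional


def hits_until_next_tm(trust_level: int,
                       rng: Optional[random.Random] = None) -> int:
    """Number of hits before the next Trust Marker: TL*20 - random(1..TL*10)."""
    if trust_level < 1:
        # Assumption: the formula only applies to Trust Levels of 1 or more.
        raise ValueError("trust_level must be at least 1")
    if rng is None:
        rng = random.Random()
    # randint is inclusive on both ends, matching "between 1 and TL*10".
    return trust_level * 20 - rng.randint(1, trust_level * 10)


# At a Trust Level of 10 the result always falls between 100 and 199.
print(hits_until_next_tm(10))
```

Because the random part scales with the Trust Level too, a higher-trust turker gets both a longer wait and a wider window, which makes the next TM harder to predict.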

The existing qualifications can be combined with the Trust level to create new seeds of TM hits. If enough people with a certain level of Trust and a high level of accuracy agree on a certain image, it could be turned into a new TM hit. The Trust value could also be used to reduce the number of workers a hit has to be shown to in order to verify it. The higher the Trust value of a turker, the more weight is given to their response, so one Turker with a Trust value of 100 and an accuracy of 90% could replace multiple submittals by turkers with lower values.
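One way to picture the weighting is a trust-weighted vote. Everything specific here is an assumption made for the sketch: using the raw Trust value as the weight, ignoring workers at zero or below, and accepting an answer once its total weight reaches a threshold chosen by the requester. Accuracy is left out to keep the example short.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def weighted_consensus(responses: Iterable[Tuple[str, int]],
                       required_weight: int) -> Optional[str]:
    """Return the answer whose combined Trust meets the threshold, else None.

    responses holds (answer, trust) pairs, one per worker.
    """
    totals = defaultdict(int)
    for answer, trust in responses:
        if trust > 0:  # assumption: untrusted workers carry no weight
            totals[answer] += trust
    if not totals:
        return None
    # Pick the answer with the most accumulated Trust behind it.
    best = max(totals, key=totals.get)
    return best if totals[best] >= required_weight else None


# One high-trust turker settles the hit on their own...
print(weighted_consensus([("image 3", 100)], required_weight=60))  # image 3
# ...while a few low-trust turkers do not yet carry enough weight.
print(weighted_consensus([("image 3", 10), ("image 3", 12), ("image 5", 8)],
                         required_weight=60))  # None
```

With a rule like this, the hit would only need to be shown to more workers when the combined weight falls short of the threshold.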

This all leads me to a method where A9 could get more value out of their IA hits, but I'll leave that for the next gigantipost. Please feel free to comment on anything I missed or ways this method could be abused. I'll edit this post with any changes we come up with.


Brad Mecoli said...

this has to be one of the best solutions to this issue I've heard

all that college really paid off ;)

travisl said...

Impressive. I like your "choose this image" idea. I was thinking more along the lines of

Pick the McDonald's:

[] img=Forest
[] img=Forest
[] img=None of the Others
[] img=Forest
[] img=Perfectly centered McDonalds with address visible
[] img=Forest
[] img=Forest

But your two pronged approach sounds solid.

spliffy said...

that sounds like an excellent approach.

Alan said...

I think the type of hit you're talking about would be good for a Trust Marker, Travis. I'll try to clarify that section a little when I get a chance.