// @flow

import * as React from "react";

import Page from "../../../components/Page/Page";
import Section from "../../../components/Section/Section";
import Subsection from "../../../components/Subsection/Subsection";
import GridLayout from "../../../components/GridLayout/GridLayout";

const Microtasks = () => (
  <Page
    className="MicroTasks"
    title="Markets for microtasks with uncertain rewards"
    next={{
      url: "/team",
      title: "Team",
      description: "Our fledgling group",
    }}
  >
    <GridLayout>
      <Section title="Summary">
        <div>
          Suppose you want to improve the quality of a Wikipedia page and you
          have decided to spend $100 on this project within the next month. You
          don’t want to hire a single editor — instead, you would like to
          crowdsource this task to the Wikipedia community and reward members to
          the extent that their work improves your article. You make an
          announcement, and at the end of the month, you see that there have
          been about 500 edits to your page, some changing single characters,
          others making major revisions, some undoing previous edits. How much
          do you pay to whom?
        </div>
      </Section>
      <Section title="Motivation">
        <p>Here are some analogous situations:</p>
        <ul>
          <li>
            You have a repository on GitHub and you want to reward commits
            depending on how much they help your project achieve its goals.
          </li>
          <li>
            You want help brainstorming, say to solve a medical case, and people
            can see and build on previous suggestions.
          </li>
          <li>
            You want to outsource a logo design and contributors can see
            previous designs.
          </li>
          <li>
            You want to reward high-quality submissions and upvotes on a social
            news site such as Reddit or Hacker News.
          </li>
        </ul>
        <p>
          In general, whenever you want to incentivize contributions to a
          project where users can build on the previous state of work, you face
          a credit assignment problem.
        </p>
      </Section>
      <Section title="Why this is difficult">
        <p>
          What the situations above have in common is that individual
          contributions (or “tasks”) are of highly variable and uncertain value
          to the payer. Some edits to a Wikipedia article, changes to code, or
          incremental variations on an idea are much more helpful than others,
          but the exact value of each task is difficult to determine. This
          distinguishes them from essentially all tasks that are currently
          available on crowdsourcing platforms such as Mechanical Turk.
        </p>
        <p>
          At the same time, as the payer, you would like to give out rewards in
          proportion to how helpful a contribution is, so that market
          participants are incentivized to be as helpful as possible. However,
          the overhead of evaluation can easily be comparable to or larger than
          the value of the task itself; sometimes much larger. It is generally
          economically infeasible to evaluate each task to determine how much to
          pay.
        </p>
      </Section>
      <Section title="A baseline strategy: randomized evaluation">
        <p>
          In this situation, is it possible at all to set up a mechanism with
          the correct incentives? Consider the following strategy:
        </p>
        <ol>
          <li>
            With probability ε, evaluate the task in depth and decide what
            reward r it deserves.
          </li>
          <li>Pay r/ε if the task was evaluated and nothing otherwise.</li>
        </ol>
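        <p>
          By reducing ε, the expected evaluation cost can be made arbitrarily
          small. In expectation, every participant still gets paid the correct
          amount: a task worth r is paid r/ε with probability ε, so its expected
          payment is ε · (r/ε) = r. With ε = 0.01, for instance, a task worth $2
          has a 1% chance of being evaluated, in which case it pays $200.
        </p>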
        <p>
          Of course, the variance of rewards under this strategy can be
          extremely high, and real participants are not risk-neutral. But the
          fact that there is a strategy that reduces the cost of supervision
          while preserving the right incentives matters: it shows that we can
          view our goal as reducing the variance in our payments while
          introducing as little bias as possible.
        </p>
        <p>
          (I think I first encountered this view talking to Paul Christiano.)
        </p>
      </Section>
      <Section title="Predicting rewards using supervised machine learning">
        <p>
          What if we don’t want to only reward contributions that we evaluate in
          depth? A natural variation on the strategy above is to evaluate some
          fraction of tasks, pay exactly the elicited rewards for the tasks that
          we do evaluate, and predict the rewards for the tasks that we don’t
          evaluate.
        </p>
        <p>
          We can start by treating this as a supervised machine learning
          problem. The tasks where we do evaluate in depth serve as a training
          set. We’re trying to learn a function f: X → P that takes as input a
          set of task features x and returns a distribution on rewards p. When
          we need to decide on a reward, a first solution is to provide exact
          rewards where we do have evaluations, and to simply take the expected
          value of p otherwise.
        </p>
        <p>
          There’s a wide range of features we can use, most prominently the
          identity of the task author and associated information (such as their
          history of rewards) and judgments by other participants (likes,
          upvotes/downvotes, 1-5 star ratings, proposed rewards). We can also
          provide the content of the contribution, but making good use of
          metadata such as author identity is easier and will probably do most
          of the work in the near future.
        </p>
      </Section>
      <Section title="Complications and extensions">
        <Subsection title="The true value of a task may be difficult to determine">
          <p>
            Previously, we’ve talked abstractly about evaluating a subset of
            tasks in depth. In a first concrete attempt, we might simply ask the
            payer: “How much is this task worth to you?” However, this may be
            difficult to answer directly, even if we allow for plenty of time to
            reflect.
          </p>
          <p>
            One strategy we can apply here is to not just ask a single question
            about the reward for a task, but to ask many questions in order to
            triangulate the desired reward. For example:
          </p>
          <ul>
            <li>How helpful were the tasks as a whole?</li>
            <li>Which tasks were most helpful?</li>
            <li>Which tasks were least helpful?</li>
            <li>Is task x more helpful than task y?</li>
            <li>
              How much worse would you have been off if task x hadn’t happened?
            </li>
            <li>
              What is the next bigger group of tasks that x could be grouped
              with?
            </li>
            <li>How helpful was this entire group?</li>
          </ul>
          <p>
            If we define a formal semantics for each of these questions and a
            probabilistic model that relates their answers to each other and to
            the question “How much should we pay for task x?”, we can use answers
            to these questions to automatically infer the reward without ever
            asking explicitly about it (although we might want to do that as
            well, e.g. asking “The inferred reward is r — does that seem too
            high, too low, or just right?”).
          </p>
        </Subsection>
        <Subsection title="Task features may not be sufficient to make high-quality reward predictions">
          <p>
            There are two ways to address this. First, if we predict
            distributions on rewards, we can use the entropy of the predicted
            distribution to decide when to go with the expected value (when the
            distribution is fairly peaked around a single value) and when to
            gather more information.
          </p>
          <p>
            Second, we can incentivize the creation of predictive signals: we
            can reward contributors who provide information (such as informative
            likes, or accurate reward guesses) that helps reduce our uncertainty
            about what rewards to pay.
          </p>
        </Subsection>
        <Subsection title="Markets are anti-inductive">
          <p>
            If a feature is predictive of quality and is therefore used to
            incentivize good contributions, people will try to produce
            contributions that have this feature, independent of contribution
            quality, and it will tend to stop being predictive. If long comments
            tend to get the highest ratings on Hacker News and we use this fact
            to automatically reward the best comments, we’ll soon see long
            low-quality comments. In other words: people will try to cheat.
          </p>
          <p>
            This is{" "}
            <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">
              Goodhart’s law
            </a>
            :
          </p>
          <blockquote>
            Any observed statistical regularity will tend to collapse once
            pressure is placed upon it for control purposes.
          </blockquote>
          <p>
            We are indeed trying to apply machine learning in a setting where
            the underlying distribution is shifting, i.e. in a non-stationary
            setting. This will make the learning problem harder, and our
            predictions will be more uncertain, but I expect that there are
            techniques we can apply to get acceptable performance. Ongoing work
            in ML, perhaps along the lines of{" "}
            <a href="https://arxiv.org/abs/1607.03594">Kuleshov 2017</a>, will
            hopefully improve the situation over time.
          </p>
          <p>
            Rewarding highly predictive signals (such as upvotes by a particular
            user) can also help prevent some signals from deteriorating.
          </p>
        </Subsection>
        <Subsection title="Some tasks can only be evaluated in context">
          <p>
            For example, consider deletions in a document that someone made in
            preparation for larger edits, compared to deletions that are simply
            vandalism.
          </p>
          <p>There’s a lot we can do:</p>
          <ul>
            <li>
              We can show context (such as previous and subsequent tasks) when
              doing reward evaluations, and also use contextual features as
              additional inputs into our prediction algorithm.
            </li>
            <li>
              We can require contributors to batch tasks into semantically
              meaningful pieces, and give low rewards otherwise.
            </li>
            <li>
              We can require contributors to provide annotations and
              justifications for tasks that are otherwise hard to judge.
            </li>
            <li>
              We can evaluate larger groups of tasks to determine the value of
              smaller pieces (see the grouping questions under “The true value
              of a task may be difficult to determine” above).
            </li>
          </ul>
        </Subsection>
        <Subsection
          title="To arrange fair trades, consider both parties"
          isLast={true}
        >
          <p>
            If we want to arrange trades that are fair to both parties, or if we
            want to control how gains from trade are distributed more generally,
            it’s not enough to determine how the payer should value each task.
            In addition, we need to take into account that tasks vary in how
            much effort they require. For mutually beneficial trade, the
            payments need to be somewhere between what the tasks are worth to
            the payer and what the efforts cost workers. I have only considered
            the payer’s valuation above. For information work, I expect that
            workers will generally have a better grasp on effort than payers do
            on value, so I think it makes sense to start with value, but
            ultimately we want to get a handle on both.
          </p>
        </Subsection>
      </Section>
      <Section title="Related work">
        <p>
          This is a slightly generalized version of the ideas in the section on{" "}
          <a href="https://stuhlmueller.org/dialog-markets/chapters/reward-distribution.html#sec:approach-models">
            reward distribution using probabilistic models
          </a>{" "}
          in the{" "}
          <a href="https://stuhlmueller.org/dialog-markets/">
            original report on Dialog Markets
          </a>
          .{" "}
          <a href="https://sideways-view.com/2016/12/02/crowdsourcing-moderation-without-sacrificing-quality/">
            Virtual Moderation
          </a>{" "}
          is an application of similar ideas to the setting of moderating online
          communities.
        </p>
      </Section>
    </GridLayout>
  </Page>
);

export default Microtasks;
