Humans in the loop
One of the paradoxes of today's internet platforms is that they are vastly automated, with no human control over or involvement in what any given person sees, and yet they are also totally dependent on human behavior, because what they’re really doing is observing, extracting and inferring things from what hundreds of millions or billions of people do.
The genesis of this was PageRank. Instead of relying on hand-crafted rules to understand what each page might be about, or indexing the raw text, PageRank looks at what people have done or have said about that page. Who linked to it, what text did they use, and who linked to the people who linked to it? And at the other end of the pipe, Google gets every user to curate every set of search results by hand: it gives you 10 blue links and you tell Google which one was right. The same goes for Facebook: Facebook doesn't really know who you are, or what you're interested in, or what that piece of content is. It knows who you follow, what you press ‘like’ on, who else liked that and what else they liked and followed. Facebook is PageRank for people. The same applies, by extension, to YouTube: it never knew what the video was, only what people typed next to it and what else they watched and liked.
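To make the "who linked to the people who linked to it" idea concrete, here is a minimal sketch, assuming the textbook power-iteration form of PageRank with a damping factor of 0.85. The function name, the toy graph and the parameter values are illustrative assumptions, not a description of how Google's production system works, and the sketch ignores anchor text and click behavior entirely.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by who links to them.

    `links` maps each page to the list of pages it links to.
    This is a simplified, illustrative version of the algorithm.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "a" and "b" both link to "c", and "c" links to "d".
toy_web = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
ranks = pagerank(toy_web)
```

Nobody told the function what any page is about. Page "c" ends up ranked above "a" and "b" only because two other pages chose to link to it, and "d" benefits in turn from being linked to by a well-linked page.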
In effect, these systems are vast mechanical Turks. They don't know what anything is of itself - rather, they try to create, capture and channel human annotation around those things. They’re vast distributed computing systems in which the CPUs are people and the platform is the router and the interconnections. (This reminds me a little of the idea in the Hitchhiker’s Guide to the Galaxy that the whole Earth is actually a vast purpose-built computer and our daily lives are part of the calculations.)
This means that a lot of the system design is around finding the right points of leverage to apply people to an automated system. Do you capture activity that’s already happening? Google began by using the links that already existed. Do you have to stimulate activity in order to capture the value within it? Facebook had to create behaviors before it could use them. Can you apply your own people to some point of extreme leverage? This is Apple Music’s approach, with manually curated playlists matched automatically to tens of millions of users. Or do you have to pay people to do ‘all’ of it?
The original Yahoo internet directory was an attempt at the ‘pay people to do all of it’ approach - Yahoo paid people to catalogue the whole of the web. To begin with, this looked feasible, but as the web took off it quickly became an impossibly large problem, and by the time Yahoo gave up, the directory had passed 3m pages. The answer was PageRank. Conversely, Google Maps has humans (for now) driving cars with cameras along almost every street on earth and other humans looking at the pictures, and this is not an impossibly large problem - it’s just an expensive one. Google Maps is a private mechanical Turk. We’re exploring the same question now with human moderation of social content - how many tens of thousands of people do you need to look at every post, and how much can you automate that? Is this an impossibly large problem or just an expensive one?
If you look at these platforms as using billions of human beings to do the actual computation, this prompts two interesting questions: what does this tell us about abuse of the platforms, and how much might machine learning change all of this?
In the past, when we thought about abuse of computer systems, we thought about technical exploits of various kinds - stolen or weak passwords, unpatched systems, bugs, buffer overruns and SQL injection. We thought about ‘hackers’ finding gaps in the software engineering. But if YouTube or Facebook are distributed computer systems where the routers are old-fashioned software but the CPUs are people, then a bad actor thinks of finding exploits in the people as well as the software. Common cognitive biases become as important as common programming errors.
That is, there are two ways to rob a bank - you can bypass the alarm and pick the lock on the safe, or you can con the manager. These are both ways that your processing systems are failing, but now one of the processing systems is us. Hence, as I have written elsewhere, looking at Facebook’s recent strategic pivot to privacy and security, human moderation of the data on these platforms is conceptually very similar to the software virus scanners that boomed in response to malware on Windows two decades ago. One part of the computer watches another part to see if it’s doing something it shouldn’t.
Even without thinking about deliberate abuse, there are problems inherent in using human activity to analyse human activity. Once you start using the computer to analyse the computer, you risk creating feedback cycles. You can see this in the idea of filter bubbles, or ‘YouTube radicalisation’, or even SEO spam. Meanwhile, one of the problems that Facebook has faced is that sometimes having or generating more data degrades the value of the data. This is the newsfeed overload problem: you add 50 or 150 friends, and you share 5 or 10 things every day or so, but so do all of them, and so you have 1,500 items in your feed every day. Dunbar’s number + Zuckerberg’s law = overload ... which gets us to Goodhart’s Law.
“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” - Charles Goodhart
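The newsfeed overload arithmetic above is simple enough to write down. This is a back-of-the-envelope sketch using the figures from the text; the function names and the reading capacity of 100 items a day are hypothetical values chosen only for illustration.

```python
def daily_feed_items(friends, posts_per_friend_per_day):
    """Items arriving in a feed each day if every friend shares at the same rate."""
    return friends * posts_per_friend_per_day


def share_seen(read_capacity, items):
    """Fraction of the feed a person gets through, capped at 100%."""
    if items == 0:
        return 1.0
    return min(1.0, read_capacity / items)


low = daily_feed_items(friends=50, posts_per_friend_per_day=5)     # 250 items
high = daily_feed_items(friends=150, posts_per_friend_per_day=10)  # 1,500 items

# Hypothetical assumption: someone reads about 100 items a day.
seen = share_seen(read_capacity=100, items=high)
```

At the high end, that hypothetical reader would see under 7% of what was shared with them, so something has to choose which 7%.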
Machine learning is now changing where that point of leverage might be. It opens up to automation a class of question that would previously have needed people - perhaps millions or billions of people. But it doesn’t actually remove the people - it just moves them somewhere else, and makes them replicable.
In the past, if you had a photo of a cat, it could only be labeled ‘cat’ if you had enough users (or employees, or moderators) that someone would look at and label that particular image. Today, you don’t need any users to see that particular cat picture - you just need some other users, somewhere else, at some point in the past, to have labeled enough other cat pictures to generate a decent recognition model. So: how many problems could you previously only solve if you applied the aggregate behavior of millions or hundreds of millions of people, that you now could solve with machine learning, without having any users of your own? In effect, you’ve turned the mechanical Turk into something you can automate.
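A toy sketch of that shift, assuming a nearest-centroid classifier as a stand-in for a real image recognition model. The two-number "features" and the labels are invented for illustration; a real system would learn from pixels with a far larger model and far more labeled examples.

```python
import math
from collections import defaultdict


def train_centroids(labeled_examples):
    """Average the feature vectors that people labeled, one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled_examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Label a new item by its closest centroid; no human looks at this item."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical labels supplied by other people, somewhere else, in the past.
past_labels = [
    ([0.9, 0.8], "cat"),
    ([0.8, 0.9], "cat"),
    ([0.1, 0.2], "dog"),
    ([0.2, 0.1], "dog"),
]
model = train_centroids(past_labels)

# A new picture that nobody has ever looked at or labeled.
guess = predict(model, [0.85, 0.7])
```

The people have not gone away: every centroid is an average of human judgments. They have just been moved to training time, where their work can be reused on any number of new pictures.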
This is just another form of leveraging people: you need people to do the labelling (and to write the rules for how the people do the labelling). But we move the point of leverage, and change, perhaps radically, how many people we need, and so we change some of the ‘winner takes all’ effects. After all, these giant social platforms are vast collections of manually labeled data, so is the glass half empty or half full? Glass half full: they have the world's largest collection of manually labeled data (in their chosen domain). Glass half empty: it’s manually labeled.