AI, Apple and Google

(Note - for a good introduction to the history and current state of AI, see my colleague Frank Chen’s presentation here.)

In the last couple of years, magic started happening in AI. Techniques started working, or started working much better, and new techniques have appeared, especially around machine learning ('ML'), and when those were applied to some long-standing and important use cases we started getting dramatically better results. For example, the error rates for image recognition, speech recognition and natural language processing have collapsed to close to human rates, at least on some measurements. 

So you can say to your phone: 'show me pictures of my dog at the beach' and a speech recognition system turns the audio into text, natural language processing takes the text, works out that this is a photo query and hands it off to your photo app, and your photo app, which has used ML systems to tag your photos with ‘dog’ and 'beach’, runs a database query and shows you the tagged images. Magic. 

There are really two things going on here - you’re using voice to fill in a dialogue box for a query, and that dialogue box can run queries that might not have been possible before. Both of these are enabled by machine learning, but they’re built quite separately, and indeed the most interesting part is not the voice but the query. In fact, the important structural change behind being able to ask for ‘Pictures with dogs at the beach’ is not that the computer can find it but that the computer has worked out, itself, how to find it. You give it a million pictures labelled ‘this has a dog in it’ and a million labelled ‘this doesn’t have a dog’ and it works out how to work out what a dog looks like. Now, try that with ‘customers in this data set who were about to churn’, or ‘this network had a security breach’, or ‘stories that people read and shared a lot’. Then try it without labels ('unsupervised' rather than 'supervised' learning). 

Today you would spend hours or weeks in data analysis tools looking for the right criteria to find these, and you’d need people doing that work - sorting and resorting that Excel table and eyeballing for the weird result, metaphorically speaking, but with a million rows and a thousand columns.  Machine learning offers the promise that a lot of very large and very boring analyses of data can be automated - not just running the search, but working out what the search should be to find the result you want. 

That is, the eye-catching demos of speech interfaces or image recognition are just the most visible demos of the underlying techniques, but those have much broader applications - you can also apply them to a keyboard, a music recommendation system, a network security model or a self-driving car. Maybe. 

This is clearly a fundamental change for Google. Narrowly, image and speech recognition mean that it will be able to understand questions better and index audio, images and video better. But more importantly, it will answer questions better, and answer questions that it could never really answer before at all. Hence, as we saw at Google IO, the company is being recentred on these techniques. And of course, all of these techniques will be used in different ways to varying degrees for different use cases, just as AlphaGo uses a range of different techniques. The thing that gets the attention is ‘Google Assistant - a front-end using voice and analysis of your behaviour to try both to capture questions better and address some questions before they’re asked. But that's just the tip of the spear - the real change is in the quality of understanding of the corpus of data that Google has gathered, and in the kind of queries that Google will be able to answer in all sorts of different products. That's really just at the very beginning right now. 

The same applies in different ways to Microsoft, which (having missed mobile entirely) is creating cloud-based tools to allow developers to build their own applications on these techniques, and for Facebook (what is the newsfeed if not a machine learning application?), and indeed for IBM. Anyone who handles lots of data for money, or helps other people do it, will change, and there will be a whole bunch of new companies created around this. 

On the other hand, while we have magic we do not have HAL 9000 - we do not have a system that is close to human intelligence (so-called 'general AI'). Nor really do we have a good theory as to what that would mean - whether human intelligence is the sum of techniques and ideas we already have, but more, or whether there is something else. Rather, we have a bunch of tools that need to be built and linked together. I can ask Google or Siri to show me pictures of my dog on a beach because Google and Apple have linked together tools to do that, but I can't ask it to book me a restaurant unless they've added an API integration with Opentable. This is the fundamental challenge for Siri, Google Assistant or any chat bot (as I discussed here) - what can you ask? 

This takes us to a whole class of jokes often made about what does and does not count as AI in the first place: 

  • "Is that AI or just a bunch of IF statements?"
  • "Every time we figure out a piece of it [AI], it stops being magical; we say, 'Oh, that's just a computation
  • "AI is whatever isn't been done yet"

These jokes reflect two issues. The first is that it's not totally apparent that human intelligence itself is actually more than 'a bunch of IF statements', of a few different kinds and at very large scale, at least at a conceptual level. But the second is that this movement from magic to banality is a feature of all technology and all computing, and doesn't mean that it's not working but that it is. That is, technology is in a sense anything that hasn't been working for very long. We don't call electricity technology, nor a washing machine a robot, and you could replace "is that AI or just computation?" with "is that technology or just engineering?" 

I think a foundational point here is Eric Raymond's rule that a computer should 'never ask the user for any information that it can autodetect, copy, or deduce' - especially, here, deduce. One way to see the whole development of computing over the past 50 years is as removing questions that a computer needed to ask, and adding new questions that it could ask. Lots of those things didn't necessarily look like questions as they're presented to the user, but they were, and computers don't ask them anymore:

  • Where do you want to save this file?
  • Do you want to defragment your hard disk?
  • What interrupt should your sound card use?
  • Do you want to quit this application?
  • Which photos do you want to delete to save space?
  • Which of these 10 search criteria do you want to fill in to run a web search?
  • What's the PIN for your phone?
  • What kind of memory do you want to run this program in?
  • What's the right way to spell that word?
  • What number is this page?
  • Which of your friends' updates do you want to see? 

It strikes me sometimes, as a reader of very old science fiction, that scifi did indeed mostly miss computing, but it talked a lot about 'automatic'. If you look at that list, none of the items really look like 'AI' (though some might well use it in future), but a lot of them are 'automatic'. And that's what any 'AI' short of HAL 9000 really is - the automatic pilot, the automatic spell checker, the automatic hardware configuration, the automatic image search or voice recogniser, the automatic restaurant-booker or cab-caller... They're all clerical work your computer doesn't make you do anymore, because it gained the intelligence, artificially, to do them for you. 

This takes me to Apple. 

Apple has been making computers that ask you fewer questions since 1984, and people have been complaining about that for just as long - one user's question is another user's free choice (something you can see clearly in the contrasts between iOS and Android today). Steve Jobs once said that the interface for iDVD should just have one button: ‘BURN’. It launched Data Detectors in 1997 - a framework that tried to look at text and extract structured data in a helpful way - appointments, phone numbers or addresses. Today you'd use AI techniques to get there, so was that AI? Or a 'bunch of IF statements'? Is there a canonical list of algorithm that count as AI? Does it matter? To a user who can tap on a number to dial instead of copy & pasting, is that a meaningful question?

This certainly seems to be one way that Apple is looking at AI on the device. In iOS 10, Apple is sprinkling AI through the interface. Sometimes this is an obviously new thing, such as image search, but more often it's an old feature that works better or a small new feature to an existing application. Apple really does seem to see 'AI' as 'just computation'.

Meanwhile (and this is what gets a lot of attention) Apple has been very vocal that companies should not collect and analyse user data, and has been explicit that it is not doing so to provide any of these services. Quite what that means varies a lot. Part of the point of neural networks is that training them is distinct from running them. You can train a neural network in the cloud with a vast image set at leisure, and then load the trained system onto a phone and run it on local data without anything leaving the device. This, for example, is how Google Translate works on mobile - the training is done in advance in the cloud but the analysis is local. Apple says it's doing the same for Apple Photos - 'it turns out we don't need your photos of mountains to train a system to recognize mountains. We can get our own pictures of mountains'. It also has APIs to allow developers to run pre-trained neutral networks locally with access to the GPU. For other services it's using 'differential privacy', which uses encryption to obfuscate the data such that though it's collected by Apple and analyses at scale, you can't (in theory) work out which users it relates to. 

The sheer variety of places and ways that Apple is doing this, and the different techniques, makes it pretty hard to make categorical statements on the lines of 'Apple is missing this'. Apple has explicitly decided to do at least some of this with one hand tied behind its back, but it's not clear how many services that really affects, or how much. Maybe you don't need my photos of mountains, but how about training to recognize my son - is that being done on the device? Is the training data being updated? How much better is Google's training data? How much would it benefit from that? 

Looking beyond just privacy, this field is moving so fast that it's not easy to say where the strongest leads necessarily are, nor to work out which things will be commodities and which will be strong points of difference. Though most of the primary computer science around these techniques is being published and open-sourced, the implementation is not trivial - these techniques are not necessarily commodities, yet. But there's definitely a contrast with Apple's approach to chip design. Since buying PA Semi in 2008 (if not earlier) Apple has approach the design of the SOCs in its devices as a fundamental core competence and competitive advantage, and it now designs chips for itself that are unquestionably market-leading (which, incidentally, will be a major advantage when it launches a VR product). It's not clear whether Apple looks at 'AI' in the same way. 

(There's also, perhaps cynically, a 'power of defaults' issue here - if Google Photos is always 10-15% better than Apple Photos at object classification, will users notice beyond a certain level of shared accuracy? After all, Apple Maps has 3x more users than Google Maps on the iPhone and Google Maps is definitely better. And any Google lead be offset by, say, Apple's Photostream or other features layered on top? Again, little of this is clear yet.)  

A common thread for both Apple and Google, and the apps on their platforms, is that eventually many 'AI' techniques will be APIs and development tools across everything, rather like, say, location. 15 years ago geolocating a mobile phone was witchcraft and mobile operators had revenue forecasts for 'location-based services'. GPS and wifi-lookup made LBS just another API call: 'where are you?' became another question that a computer never has to ask you. But though location became just an API - just a database lookup - just another IF statement - the services created with it sit on a spectrum. At one end are things like Foursquare - products that are only possible with real-time location and use it to do magic. Slightly behind are Uber or Lyft - it's useful for Lyft to know where you are when you call a car, but not essential (it is essential for the drivers' app, or course). But then there's something like Instagram, where location is a free nice-to-have - it's useful to be able to geotag a photo automatically, but not essential and you might not want to anyway. (Conversely, image recognition is going to transform Instagram, though they'll need a careful taxonomy of different types of coffee in the training data). And finally, there is, say, an airline app, that can ask you what city you're in when you do a flight search, but really needn't bother. 

In the same way, there will be products that are only possible because of machine learning, whether applied to images or speech or something else entirely (no-one at all looked at location and thought 'this could change taxis"). There will be services that are enriched by it but could do without, and there will be things where it may not be that relevant at all (that anyone has realised yet). So, Apple offers photo recognition, but also a smarter keyboard and venue suggestions in the calendar app - it's sprinkled 'AI' all over the place, much like location. And, like any computer science tool, there will be techniques that are commodities and techniques that aren't, yet. 

All of this, so far, presumes that the impact of AI forms a sort of T-shaped model: there will be a vertical, search, in which AI techniques are utterly transformative, and then a layer across everything where it changes things (much as location did). But there's another potential model in which AI becomes a new layer for the phone itself - that it changes the interaction model and relocates services from within app silos to a new runtime of some sort. Does it change the layer of aggregation on the phone - it makes apps better, but does it change what apps are? That's potentially much more destabilizing to the model that Apple invented. 

Clearly in some cases the answer is 'yes'. At the very least, structural changes in what search means change the competitive landscape and destabilize the mix between Google's general purposes search and its vertical competitors: a Yelp search might become a Google question, or an answer offered before you've even asked. This is another case of removing a question - instead of Google offering you ten search results that it thinks might answer your question, it will try to give you the answer itself, and it will also try to give you that answer before you asked the question.   

More interestingly, an Uber or Lyft request, or an Opentable booking, might also be re-aggregated from an app to a suggestion or answer within a voice UI or, say, Maps. An app with one button - that asks one simple question - could become a request pretty easily, whether in Google Assistant, Siri, Apple or Googles Maps or a messaging app. in fact, one way to look at Apple's opening of APIs into Maps, Siri, Messenger and so on is as a counter to Google. Where Google will find you a cinema, restaurant or hotel itself, Apple will lean on developers to solve the same use case. Google Allo suggests a restaurant where Apple's iMessage gives you an Opentable plugin. 

How broad is this, though? Yes, you could tell Siri or Google Assistant to ‘show me all the new Instagram posts’, but why is putting that inside to a feed of responses to all your questions a better UI? Why is Google’s ML interface a better place to see this than chrome designed by Instagram? ML might (indeed, will) make the Facebook newsfeed better, but does it remove the difference between one-to-many and one-to-one communication channels? Why is the general-purpose rendering layer better than the special-purpose one? Does being subsumed into a general purpose ML layer change this? 

One could propose a rebundling because it allows an easier interface - your home screen could show you documents, emails and meetings of the day instead of you having to go into each app to deal with them. Perhaps ‘which app do you want to open next?’ is a question that can be moved - the car is ordered, the meeting accepted, the expense report approved. This is already what Facebook does for a whole section of interaction, and before ML - what shared posts to see, who to talk to, what news to read. But it’s not the only thing on the phone. And, again, we don't have HAL 9000. We don't actually have a system that knows you, and everything you want, and everything inside all of your apps, and we're not anywhere close to that. So the idea that Google can subsume everything you do on your phone into a single unified AI-based layer that sits on top looks rather like what one might call a 'Dr Evil plan' - it's too clever by half and needs technology (the killer laser satellite!) that doesn't actually exist.   

It seem to me that there are two things that make it hard to talk about the AI explosion. The first is that 'AI' is an impossibly broad term that implies we have a new magic hammer that turns every problem into a nail. We don't - we have a bunch of new tools that solve some, but not all, problems, and the promise of extracting new insight from all sorts of data pools will not always be met. It might be the wrong data, or the wrong insight. The second is that this field is growing and changing very fast, and things that weren't working now are, and new things are being discussed all the time. So we have excitement and bullshit, skepticism and vision, and a bunch of amazing companies being created. Some of this stuff will be in everything and you won't even notice it, and some of it will be the next Amazon.