Cameras, ecommerce and machine learning

Mobile means that, for the first time, pretty much everyone on earth will have a camera, taking vastly more images than were ever taken on film ('How many pictures?'). This feels like a profound change on a par with, say, the transistor radio making music ubiquitous.

Then, the image sensor in a phone is more than just a camera that takes pictures - it’s also part of new ways of thinking about mobile UIs and services ('Imaging, Snapchat and mobile'), and part of a general shift in what a computer can do ('From mobile first to mobile native'). 

Meanwhile, image sensors are part of a flood of cheap commodity components coming out of the smartphone supply chain that enable all kinds of other connected devices - everything from the Amazon Echo and Google Home to an August door lock or Snapchat Spectacles (and of course a botnet of hacked IoT devices). When combined with cloud services and, increasingly, machine learning, these are no longer just cameras or microphones but new endpoints or distribution for services - they’re unbundled pieces of apps ('Echo, interfaces and friction'). This process is only just beginning - it now seems that some machine learning use cases can be embedded into very small and cheap devices. You might train an ‘is there a person in this image?’ neural network in the cloud with a vast image set - but to run it, you can put it on a cheap DSP with a cheap camera, wrap it in plastic and sell it for $10 or $20. These devices will let you use machine learning everywhere, but also let machine learning watch or listen everywhere. 
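
To make that a little more concrete, here is a minimal sketch of what that pipeline might look like, in Python with TensorFlow - the tools, the 96x96 greyscale input, the layer sizes and the dataset are all illustrative assumptions, not anything a particular vendor actually ships:

```python
# A minimal sketch, not a product: train a tiny 'is there a person in this
# image?' classifier in the cloud, then shrink it into something that can run
# on a cheap DSP or microcontroller next to a cheap camera module.
# The 96x96 greyscale input, layer sizes and dataset are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # person / no person
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Trained in the cloud against a vast labelled image set (placeholder here):
# model.fit(train_images, train_labels, epochs=10)

# Then compressed into a quantised TensorFlow Lite file small enough to ship
# inside a $10-20 plastic gadget.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("person_detector.tflite", "wb") as f:
    f.write(converter.convert())
```

The point is in the last few lines - all the expensive work happens once, in the cloud, and what ends up on the device is a small file of weights sitting next to a cheap camera module.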

So, smartphones and the smartphone supply chain are enabling a flood of UX and device innovation, with machine learning lighting it all up. 

However, I think it’s also worth thinking much more broadly about what computer vision in particular might now mean - thinking about what it might mean that images and video will become almost as transparent to computers as text has always been. You could always search text for ‘dog’ but could never search pictures for a dog - now you’ll be able to do both, and, further, start to get some understanding of what might actually be happening. 
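
As a toy illustration of what ‘search pictures for a dog’ might mean in practice, an off-the-shelf pretrained classifier run over a folder of photos is already enough - the folder path and the crude label matching below are assumptions for the sake of the sketch:

```python
# A toy version of 'search pictures for a dog': run an off-the-shelf ImageNet
# classifier over a folder of photos and keep the ones it thinks contain a dog.
# The 'photos/' folder and the substring matching are assumptions for the sketch;
# ImageNet labels dogs by breed ('golden_retriever', 'beagle'...), so a real
# system would map its ~120 dog classes to a single 'dog' concept.
import glob
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNetV2(weights="imagenet")

def top_labels(path, top=5):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return [name for _, name, _ in decode_predictions(model.predict(x), top=top)[0]]

dog_photos = [p for p in glob.glob("photos/*.jpg")
              if any("dog" in label or "retriever" in label or "terrier" in label
                     for label in top_labels(p))]
print(dog_photos)
```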

We should expect that every image ever taken can be searched or analyzed, and some kind of insight extracted, at massive scale. Every glossy magazine archive is now a structured data set, and so is every video feed. With that incentive (and that smartphone supply chain), far more images and video will be captured. 

So, some questions for the future:

  • Every autonomous car will, necessarily, capture 360-degree HD video whenever it’s moving. Who owns that data, what else can you do with it beyond driving, and how do our ideas of privacy adjust?
  • A retailer can deploy cheap commodity wireless HD cameras throughout the store, or a mall operator throughout the mall, and finally know exactly what route every single person entering took through the building, and what they looked at, and then connect that to the tills for purchase data. How much does that change (surviving) retail?
  • What happens to the fashion industry when half a dozen static $100 cameras can tell you everything that anyone in Shoreditch wore this year - when you can trace a trend through social and street photography from its start to the mass market, and then look for the next emerging patterns?
  • What happens to ecommerce recommendations when a system might be able to infer things about your taste from your Instagram or Facebook photos, without needing tags or purchase history - when it can see your purchase history in your selfies?

Online retailers have been extremely good at retail as logistics, but much less good at retail as discovery and recommendation - much less good at showing you something you didn’t know you might like ('The Facebook of ecommerce'). I sometimes compare Amazon to Sears Roebuck a century ago - they let you buy anything you could buy in a big city, but they don’t let you shop the way you can in a big city. (I think this is also a big reason why ebook sales have flatlined - what do you buy?) 

Now, suppose you buy the last ten years’ issues of Elle Decoration on eBay and drop them into just the right neural networks, and then give that system a photo of your living room and ask which lamps it recommends? All those captioned photos, and the copy around them, are training data. And yet, if you don’t show the user an actual photo from that archive, just a recommendation based on it, you probably don’t need to pay the original print publisher itself anything at all. (Machine learning will be fruitful ground for IP lawyers.) We don’t have this yet, but we know, pretty much, how we might do it. We have a roadmap for recognizing some kind of preference, automatically, at scale. 
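
A rough sketch of one way this could work, treating a pretrained network as a generic image-embedding engine and the magazine scans as things to match against - the file names, and the choice of MobileNetV2 plus a simple nearest-neighbour index, are placeholders rather than a recipe:

```python
# A rough sketch of the magazine-archive idea: use a pretrained network as a
# generic image-embedding engine, index every scanned page, then match the
# user's living-room photo to the nearest pages - whose captions and copy name
# the lamps. The file paths, and the choice of MobileNetV2 plus a simple
# nearest-neighbour index, are placeholder assumptions.
import glob
import numpy as np
from sklearn.neighbors import NearestNeighbors
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing import image

embedder = MobileNetV2(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return embedder.predict(x)[0]           # a 1280-dimensional image vector

archive_paths = sorted(glob.glob("elle_decoration_scans/*.jpg"))   # hypothetical scans
index = NearestNeighbors(n_neighbors=5, metric="cosine")
index.fit(np.stack([embed(p) for p in archive_paths]))

# The user's photo goes in, the five most similar archive pages come out.
_, neighbours = index.kneighbors(embed("my_living_room.jpg").reshape(1, -1))
for i in neighbours[0]:
    print(archive_paths[i])
```

A real system would learn from the captions and copy as well, so it could name the lamp rather than just point at the pages it appears on - but even this crude version gives you recommendations with no tags and no purchase history.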

The key thing here is that the nice attention-grabbing demos of computer vision that recognize a dog or a tree, or a pedestrian, are just the first, obvious use cases for a fundamental new capability - to read images. And not just to read them the way humans can, but to read a billion and see the patterns. Among many other things, that has implications for a lot of retail, including parts not really affected by Amazon, and indeed for the $500bn spent every year on advertising.

Really, though, we don't know what all the implications might be. I've suggested a few of the crass, commercial possibilities that come out of this, but there are plenty of others. Science has already overturned some Old Master attributions and created others - might we find, or unfind, a Rembrandt?  Will we transcribe the Cairo Geniza in a decade instead of a century? When we can turn images into data, we’ll find lots of sets of images that we never really thought of as data before, and lots of problems that didn't look like image recognition problems.

From mobile first to mobile native

A couple of years ago internet companies moved from having a mobile team and a mobile strategy to what they called ‘mobile first’. Instead of building a product and deciding how and if it would work on mobile, new things are built for mobile by default, and don’t necessarily make their way back to the desktop. 

Now, though, I think we can see an evolution beyond ‘mobile first’. What happens if you just forget about the PC altogether? But also, what happens if you forget about featurephones? What happens if you presume all of the sophistication that a modern smartphone has and a PC does not, and if you also presume that, with 650m iPhones in use and 2.5bn smartphones in total, you can build a big company without thinking about the low end anymore?

There are a couple of building blocks to think about here. 

  • There is the image sensor as a primary input method, not just a way to take photographs, especially paired with touch. That image sensor is now generally the best ‘camera’ most people have ever owned, in absolute image quality, and is also presumed to be good for capture more or less anywhere.
  • There’s the presumption of a context that makes sound OK, both for listening and talking to your device - we’re not in an open-plan office anymore. 
  • There’s bandwidth (either LTE or wifi, which is half of smartphone use) that makes autoplaying video - indeed, video that might not even have a ‘play’ button or controls - banal, and live video trivial. I think a lot of video use now is effectively a replacement for HTML, or Flash - video as content, not as a live-action clip. 
  • With bandwidth there’s also battery, or a willingness to charge, and as this becomes the main device and is used at home, battery matters less. 
  • The personal device that’s always in your pocket makes the phone and its apps much more closely connected to you, and much more immediate, perhaps for sharing something small and personal that you’d never bother saving until you got home and turned on a PC, or (say) for watching a live stream that’s happening right now.
  • There’s a multi-tasking OS and ecosystem that lets you run lots of apps, try new ones by the dozen (helped by a common address book and photo library) and makes new apps free or cheap and (especially on iOS) safe.
  • And there are chips and software tools (especially now machine learning) that let you compress and stream live high-definition video in real time, broadcast it to millions of people and automatically layer funny effects on top - and make all of that seem like a commodity.

On this last point, it’s useful to think about just how many of these building blocks the crop of live video apps presume, and how many different reasons there are that it would be impossible to build the same thing on the desktop.

It strikes me that smartphones are both much more sophisticated and much easier to use than PCs, and certainly than the PC internet. They can do all of these things that you couldn’t do with the web browser/keyboard/mouse model, and that means not just more possibilities for publishers and developers but also far more for ordinary users - far more creation than ever happened on PCs. And there’s a mobile-native generation that takes this for granted, and will tell you which apparently hot apps (doing something that would have blown your mind in 2007) are only for little kids now. A child born when the iPhone was announced will be 10 years old in 2 months, after all. 

This change, from building on mobile ‘first’ to really leveraging what a billion or so high-end smartphones can do in 2016, reminds me a little of the ‘Web 2.0’ products of a decade or so ago. One (and only one) way you could characterize these is that they said: “you know, we don’t necessarily have to think about Lynx, and CGI scripts, and IE2, and dialup. We’ve evolved the web beyond the point that <IMG> tags were controversial and can make new assumptions about what will work, and that enables new ways to think about interfaces and services.”

In the same way, you could build a ‘mobile-first’ app today that would still make perfect sense on a desktop - indeed, you could mock up a smartphone app in Visual Basic. The original iPhone UI, and many major social apps today, could be navigated fine with a mouse and keyboard or even with a keyboard alone, pressing tab to go from button to button. If your eye is on all of those 2.5bn smartphones in use today and the 5bn that there'll be in a few years, that might be the right strategy. But it seems to me that building out from 'mobile native’ rather than up from ‘mobile first’ might be a good strategy too.

Echo, interfaces and friction

Mobile phones and then smartphones have been swallowing other products for a long time - everything from clocks to cameras to music players has been turned from hardware into an app. But that process also runs in reverse sometimes - you take part of a smartphone, wrap it in plastic and sell it as a new thing. This happened first in a very simple way, with companies riding on the smartphone supply chain to create new kinds of product with the components it produced, most obviously the GoPro. Now, though, there are a few more threads to think about. 

First, sometimes we're unbundling not just components but apps, and especially pieces of apps. We take an input or an output from an app on a phone and move it to a new context. So where a GoPro is an alternative to the smartphone camera, an Amazon Echo is taking a piece of the Amazon app and putting it next to you as you do the laundry. In doing so, it changes the context but also changes the friction. You could put down the laundry, find your phone, tap on the Amazon app and search for Tide, but then you’re doing the computer’s work for it - you’re going through a bunch of intermediate steps that have nothing to do with your need. Using Alexa, you effectively have a deep link directly to the task you want, with none of the friction or busywork of getting there. 

Next, and again removing friction, we’re removing or changing how we use power switches, buttons and batteries. You don’t turn an Echo or Google Home on or off, nor AirPods, a ChromeCast or an Apple Watch. Most of these devices don’t have a power switch, and if they do you don’t normally use it. You don’t have to do anything to wake them up. They’re always just there, present and waiting for you. You say ‘Hey Google’, or you look at your Apple Watch, or you put the AirPods in your ear, and that’s it. You don’t have to tell them you want them. (Part of this is 'ambient computing', but that doesn't capture a watch or earphones very well.)

Meanwhile charging, for those devices that do have batteries, feels quite different. We go from devices with big batteries that last hours or at best a day and take a meaningful amount of time to charge, to devices with very small batteries that charge very quickly and last a long time - days or weeks. The ratio of use to charging time is different. Even the Apple Watch, mocked as ‘a watch that needs to be charged!’, is now good for two days of normal use, which in practice means that, presuming you take it off at night, you never think about the battery at all. Again, this is all about friction, or perhaps mental load. You don’t have to think about cables and power management and switches and starting up - you don’t have to do the routine of managing your computer. (This is also some of the point of using an iPad instead of a laptop.)

A nicely polarising example of this is Apple’s AirPods, where the friction is not exactly removed but moved. You can complain that you have to charge your headphones, but you can also say that instead of plugging them in every time you listen (and muttering swearwords to yourself as you untangle the cable), you can just put them in your ears, and with 30 hours of battery life between the case and the pods themselves, you have a week or two of use. You fiddle with a cable and plug them in twice a month instead of every single time you use them. Apple hopes that's less friction - we'll see, but it's certainly different. 

A common thread linking all these little devices is this attempt to get rid of management, or friction, or, one could say, clerical work. That links the Apple Watch, ChromeCast, Echo and Home, Snapchat's Spectacles, AirPods and even perhaps the Apple Pencil. They try to reduce the clerical work that a computer or digital device or service makes you do before you can use it - charging, turning on, restarting, waking up, plugging in, choosing an app and so on. A smartphone interface reduces the management you do within the software (file management, settings and so on), but these devices reduce how much you have to manage the hardware itself. There’s a shift towards direct manipulation and interaction - less abstraction of buttons between you and the thing you want. They don’t ask you questions that only matter to the computer (“do you want me to wake up now? Am I charged enough?”). The device is transparent to the task. 

Of course, questions can be friction, but they’re also choice. So, if it's not just a microphone but an end-point for a cloud service, it’s also an end-point only for Google or Amazon's cloud service (and if I tell Alexa to “buy soap powder”, what brand does it pick, and why?). Where a GoPro is just a camera, Snapchat Spectacles are an end-point for Snapchat - and only for Snapchat. As platforms, Alexa or Google Home look a little like a feature-phone with a carrier deck, or a cable box - a sealed, subsidised device with centrally controlled services (Amazon probably wants to give Prime customers Echoes for free, or almost). Your choice of voice assistant is made when you choose to buy an Echo or a Home, and not afterwards (assuming you don’t buy both, and assuming they don’t argue with each other in your kitchen). 

That means that this is about reducing friction, yes, but it's also about the reach of cloud and web service companies, and how they think about a broader world in which the PC web is increasingly left behind, the smartphone OS is the platform and the platform is often controlled by their competitors, and how else they can build services beyond fitting into a smartphone API model that someone else defined. There’s an element of push from big companies with a strategic desire, at least as much as there is consumer pull. Google Home is an end-point for the Google Assistant, but so is the Allo messaging app, an Android watch or indeed an Android phone. Facebook hasn’t tried to make a device so far (beyond Oculus, which is a very different conversation, and a feature-phone partnership a long time ago), but like Google it has been circling around what the right run-time or touch-point might be beyond apps, most recently with the Messenger Bot platform.  

The final thing to think about here is how many of these devices are driven by some form of AI. The obvious manifestation of that is in voice assistants, which don’t need a UI beyond a microphone and speaker and so theoretically can be completely ‘transparent’. But since in fact we do not have HAL 9000 - we do not have general, human-level AI - voice assistants can sometimes feel more like IVRs, or a command line - you can only ask certain things, but there’s no indication of what. The mental load could be higher rather than lower. This was where Apple went wrong with Siri - it led people to think that Siri could answer anything when it couldn’t. Conversely, part of Amazon’s success with Alexa, I think, is in communicating how narrow the scope really is. Hence the paradox - voice looks like the ultimate unlimited, general-purpose UI, but actually only works if you can narrow the domain. Of course, this will get better, but in addition, one shouldn’t think of sound and voice as the only AI-based UI - we really haven't tried to see how such an appliance-type model could apply to images, for example. That gets especially interesting when one thinks that, say, a face recognition engine (or a voice/language engine) could be embedded in a small and very cheap device, with the data itself never leaving the device. So an alarm sensor could be a people sensor instead of an IR sensor - one that just sends out a binary ‘yes/no there are people’ signal. It might be battery powered and last for years. 
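
To make that people sensor concrete, this is roughly what its firmware loop might look like - the model file (reusing the kind of tiny network sketched earlier), the camera read and the radio send are all hypothetical placeholders; the real point is that only a single bit ever leaves the device:

```python
# A sketch of the 'people sensor' firmware loop: the model runs entirely on the
# device and only a yes/no ever leaves it. 'person_detector.tflite' reuses the
# kind of tiny network sketched earlier; read_camera_frame() and report() are
# hypothetical stand-ins for the camera module and the radio.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter   # pip install tflite-runtime

interpreter = Interpreter(model_path="person_detector.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def read_camera_frame():
    # Placeholder: return one 96x96 greyscale frame from the cheap camera.
    return np.zeros((96, 96, 1), dtype=np.float32)

def report(person_present):
    # Placeholder: send a single bit over the radio - never the image itself.
    print("person" if person_present else "no person")

while True:
    frame = np.expand_dims(read_camera_frame(), axis=0)
    interpreter.set_tensor(inp["index"], frame)
    interpreter.invoke()
    score = float(interpreter.get_tensor(out["index"])[0][0])
    report(score > 0.5)
    time.sleep(5)    # wake rarely enough that a small battery lasts for years
```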

Of course, Amazon already sells a small, battery-powered sensor that sends only a very simple signal - the Amazon Dash button. Is it easier to put an Echo in your laundry room, or a Dash? There’s a neat contrast here - these devices are either very smart or very dumb. They represent either the cutting edge of AI research (perhaps locally, perhaps as the end-point to the cloud), or the simplest device possible, and sometimes both at the same time, and both getting you more Tide.