The first decade of augmented reality

In February 2006, Jeff Han gave a demo of an experimental 'multitouch' interface, as a 'TED' talk. I've embedded the video below. Watching this today, the things he shows seem pretty banal - every $50 Android phone does this! - and yet the audience, mostly relatively sophisticated and tech-focused people, gasps and applauds. What is banal now was amazing then. And a year later, Apple unveiled the iPhone and the tech industry was reset to zero around multitouch.

Looking back at this a decade later, there were really four launches for multitouch. There was a point at which multitouch became an interesting concept in research labs, a point at which the first demos of what this might actually do started appearing in public, a point at which the first really viable consumer product appeared in the iPhone, and then, several years later, a point at which sales really started exploding, as the iPhone evolved and Android followed it. You can see some of that lag in the chart below - it took several years after the 2007 launch of the iPhone for sales to take off (even after the pricing model changed). Most revolutionary technologies emerge like this, in stages - it's rare for anything to spring into life fully formed. And in the meantime, there were parallel tracks that turned out to be the wrong approach - both Symbian in the west and iMode et al in Japan.

Today, I think augmented reality* is somewhere between points two and three - we've seen some great demos and the first prototypes, and though we don't yet have a mass-market commercial product, we're close.

So, Microsoft is shipping the HoloLens: this has really good positional tracking, has integrated the computing into the headset, though at the cost of making it bulky, has a very small field of view (much less than is suggested in Microsoft's marketing videos), and costs $3000. A second version is planned for, apparently, 2019. Apple is pretty clearly working on something, going by the hires and acquisitions that are public as well as the CEO's comments (I suspect the work on miniaturisation, power, radio and so on that's gone into the Apple Watch and AirPods is relevant here). There may be something else from Google, or perhaps Facebook or Amazon. And there is a range of smaller companies and startups doing interesting work.

Meanwhile Magic Leap (an a16z investment) is working on its own wearable technology and has released a series of videos showing what is already possible with such a device. I've embedded one of these below: the video is cool, but just as there was a huge difference between seeing a video of an iPhone and using one, there's a huge difference between seeing an AR video and wearing this, walking around and seeing things appear in front of your eyes. I've tried it - it's not too shabby.

All of this means that if we're at the 'Jeff Han' stage today, we're not far away from the iPhone 1 stage, and then, over the next decade or so, we could evolve to a truly mass-market product.

What might that look like for a billion people? 

The first level of AR is what we had with Google Glass - a screen floats things in space in front of you, but it's not connected to the world in any way. In fact, Google Glass was conceptually pretty similar to a smart watch, except that you looked up and to the right instead of down and to the left. It gave you a new screen but it didn't have any sense of what was in the world in front of you. With more advanced technology, though, we could extend this to make your entire field of view a sphere that could show windows, 3D objects or anything else, floating in space.

However, with what one might call 'true AR', or perhaps 'mixed reality', the device starts to have some awareness of your surroundings, and can place things in the world such that if you suspend disbelief you could imagine that they're really there. Unlike Google Glass, the headset will do 3D mapping of your surroundings and track your head position all the time. That would make it possible to place a 'virtual TV' on a wall and have it stay there as you walk around, or make the whole wall a display. Or you could put Minecraft (or Populous) on the coffee table and raise and lower mountains with your hands, as though you're modelling clay. By extension, anyone else wearing the same glasses would be able to see the same things - you could make a wall or conference table into a display and your whole team could use it at once, or you and your children could play god around the same Minecraft map. Or that little robot could hide behind the sofa, or you could hide it there for your children to find. (This has some overlap with some use cases for VR, of course, especially when people talk about adding external cameras.) This is mixed reality as a screen - or perhaps, as turning the world around you into an infinite screen.

But there's somewhere else to go with this, because you're still only really doing SLAM - you're mapping the 3D surfaces of a room, but not understanding it. Now, suppose I meet you at a networking event and I see your LinkedIn profile card over your head, or see a note from Salesforce telling me you're a key target customer, or a note from Truecaller saying you're going to try to sell me insurance and to walk away. Or, as Black Mirror suggested, you could just block people. That is, the cluster of image sensors in the 'glasses' isn't just mapping the objects around me but recognising them. This is the real augmentation of reality - you're not just showing things in parallel to the world but as part of it. Your glasses can show you the things that you might look at on a smartphone or a 2000 inch screen, but they can also unbundle that screen into the real world, and change it. So, there's a spectrum - on one hand, you can enrich (or pollute) the entire world by making everything a screen, but on the other, this might let you place the subtlest of cues or changes into the world - not just translate every sign into your own language when you travel, but correct 'American English' into English. If today people install a Chrome extension that replaces 'millennial' with 'snake people', what would an MR extension change? How about making your boss puke rainbows? (Fun is, after all, the most important real problem to solve.)

Do you wear the glasses all day, though, once they become small enough? If not, many of the more ambient applications don't work as well. You might, perhaps, have a combination of a watch or phone that's the 'always on' device, and glasses that you put on like reading glasses in the right context. That might address some of the social questions that Google Glass ran into - taking out a phone, looking at a watch or putting on a pair of glasses sends a clear signal that people can understand in a way that wearing a Google Glass in a bar did not.

This touches on a related question - do AR and VR merge? It's certainly possible, and they are doing related things with related engineering challenges. One challenge of doing both in one device is that VR, to place you into another world, needs to black out everything else, so the glasses need to be sealed around the edges, whereas AR does not need this. In parallel, the whole challenge of AR is to let the world through while occluding what you don't want (and it's probably not great in bright sunlight for a while), whereas VR wants to start with a black screen. With AR glasses, you can see the person's eyes. In a decade or two lots of things are possible, but for now these seem like different technology tracks. In the late 1990s we argued about whether 'mobile internet' devices would have a separate radio unit and screen, plus an earpiece, or perhaps a keyboard, or a clamshell with a keyboard and screen - we were in form-factor-discovery, and it took until 2007 (or later) to resolve on a single piece of glass. VR and AR may be in discovery for a while as well.

Then, how would you control and interact with things that have no physical presence? Are the physical controllers of VR enough? Is hand tracking good enough (without a clear view of your fingers)? Multitouch in smartphones means that we have direct physical interaction, so that we touch the thing we want on the screen instead of moving a mouse a foot or two away from it, but would we 'touch' AR objects that hover in the air? Is that a good 'all-day' interface model? Magic Leap can certainly create a sense of depth such that you believe you can touch things, but do you want an interface where your hand slides through what appears to be solid? Should we use voice instead - and how much does that constrain what you can do (even with perfect voice recognition, imagine trying to control a phone or computer entirely by talking to it)? Or is eye-tracking the key - if the glasses do iris-tracking, do you look at the thing you want and tap your watch to select it? These are of course all the same kinds of questions that had to be resolved for smartphones and indeed PCs before them, just like form-factors - as in 2000 or 1990, the answers aren't clear, and neither, actually, are the questions.

It does seem to me, though, that the more you think about AR as placing objects and data into the world around you, the more this becomes an AI question as much as a physical interface question. What should I see as I walk up to you in particular? LinkedIn or Tinder? When should I see that new message - should it be shown to me now or later? Do I stand outside a restaurant and say 'Hey Foursquare, is this any good?' or does the device's OS do that automatically? How is this brokered - by the OS, the services that you've added or by a single 'Google Brain' in the cloud? Google, Apple, Microsoft and Magic Leap might all have different philosophical attitudes to this, but it seems to me that a lot of it has to be automatic - to be AI - if you want it to work well. If one recalls the Eric Raymond line that a computer should never ask you something that it should be able to work out, then a computer that can see everything you see and know what you're looking at, and that has the next decade of development of machine learning to build on, ought to remove whole layers of questions that today we assume we have to deal with manually. So, when we went from the windows/keyboard/mouse UI model of desktop computers to the touch and direct interaction of smartphones, a whole layer of questions was removed - the level of abstraction changed. A smartphone doesn't ask you where to save a photo, or where you are when you order a car, or which email app to use, or (with fingerprint scanners) what your password is - it removes questions (and also choices). AR ought to be another step change in the same direction: it's about much more than having smartphone apps float in front of you in little square windows. Snapchat doesn't work like Facebook's desktop website, and an ambient, intangible, AI-led UI would change everything again.

In the meantime, the more that AR glasses are trying to understand the world around you (and you yourself), the more that they are watching, and sending some of what they see to a myriad of different cloud services, depending on the context, the use case and the application model. Is this a face and are you talking to it? Send it (or a compressed abstract of it - and yes, there are major bandwidth implications to all of this) to Salesforce, LinkedIn, Truecaller, Facebook and Tinder. A pair of shoes? Pinterest, Amazon and Net-a-Porter. Or, send everything to Google. And if everyone is bored during that meeting, do you log that to SuccessFactors? This poses some pretty obvious privacy and security issues. I suggested in another post that since an autonomous car is capturing HD 3D 360 degree video all of the time, a city full of them is the ultimate panopticon. So what happens if everyone is wearing AR glasses as well - would it even be possible to go on the run? And what if you get hacked? If your connected home is hacked you'll have poltergeists, but if your AR glasses are hacked you'll hallucinate.

Finally, a quite important question - how many people will have one of these? Is AR going to be an accessory that a subset of mobile phone users have (like, say, smart watches)? Or will every small town in Brazil and Indonesia have shops selling dozens of different $50 Chinese AR glasses where today they sell Androids? (What would bandwidth cost by then?) Again, it's very early to say. But the other useful argument from the late 1990s and early 2000s was around whether everyone would have the same kind of mobile data devices, or whether some people would have what we now call smartphones but most people would have what we now call 'feature phones', on a spectrum all the way down to simple devices with no camera or colour screen. In hindsight, that was like arguing about whether everyone would have PCs or some people would stick with word processors. The logic of scale and general purpose computing meant that first the PC and then the smartphone became the single universal device - of 5bn people with a mobile phone today, 2.5-3bn have a smartphone and it is clear that the overwhelming majority of the rest will follow. So do most people stick with smartphones and some (100m? 500m? 1bn?) move on to glasses as an accessory, or is this the new universal product? Any answer to that is a leap of imagination, not analysis. But then, so was saying in 1995 that everyone on earth would have a phone. 


* What, exactly, does 'augmented reality' mean? The term often gets used for things like Pokemon Go or Snapchat lenses (or even things like location-based museum audio guides), but here I'm talking about wearing a device over your eyes, that you look through, that places things into the real world - glasses, in effect. The term 'mixed reality' floats around here as well, but AR will do.