There’s no such thing as data

Technology is full of narratives, but one of the loudest is around something called ‘data’. AI is the future, and it’s all about data, and data is the future, and we should own it and maybe be paid for it, and countries need data strategies and data sovereignty. Data is the new oil!

This is mostly nonsense. There is no such thing as ‘data’, it isn’t worth anything, and it doesn’t really belong to you anyway.

Most obviously, ‘data’ is not one thing, but innumerable different collections of information, each of them specific to a particular application, that aren’t interchangeable. Siemens has wind turbine telemetry and Transport for London has ticket swipes, and you can’t use the turbine telemetry to plan a new bus route. If you gave both sets of data to Google or Tencent, that wouldn’t help them build a better image recognition system.

This might seem trivial put so bluntly, but it points to the uselessness of very common assertions, especially from people outside tech, along the lines of ‘China has more data’ or ‘America will have more data’ - more of what data? Meituan delivers 50m restaurant orders a day, and that lets it build a more efficient routing algorithm, but you can’t use that for a missile guidance system. You might not even be able to use it to build restaurant delivery in London. ‘Data’ does not exist as one single, unified thing, where you can add every row and table of every different kind to one giant pool and get more and more insight. Creating a ‘national data strategy’ is like demanding a ‘national spreadsheet strategy’ or ‘national SQL strategy’.

Of course, when people talk about ‘data’ they mostly really mean your data - your personal information and the things that you do on the internet, some of which is sifted, aggregated and deployed by technology companies. We want more privacy controls, but we also think we should have ownership of that data, wherever it is.

The trouble is, most of the meaning and hence the value in most of ‘your’ data is not in you but in all of the intersections with other people. What you post on Instagram means very little: the signal is in who liked your posts and what else they liked, in what you liked and who else liked it, and in who follows you, who else they follow and who follows them, and so on outwards in a mesh of interactions between a billion people. If I like your picture, that is not ‘my’ data or ‘your’ data alone, and it’s not worth much without the context of all the other likes and follows. You can’t take that with you, because it’s a lot of other people’s data (and privacy!) as well, and even if you could, you probably couldn’t plug it into TikTok, because TikTok has a different mesh and the users don’t overlap.

That is, for many of these systems the value isn’t in the ‘data’ at all but in the flow of activity around it - the meaning is not in the picture or video you post but in how the network reacts to it, and how the product creates and captures that reaction. You could see Instagram, TikTok or PageRank as vast mechanical Turks - we do not (yet) have AI that can understand what every page, picture or video is in itself, and so we need humans - all of us - in the loop somewhere, at the right point of leverage, liking, linking, clicking and watching (and, of course, creating). These are systems, not data, and the value is in the flow.

All of this prompted Tim O'Reilly to say that ‘data isn't oil - it's sand’ - data is valuable only in the aggregate of millions. Indeed, this can be true even on a simple cashflow basis - in Q1 2022 Meta made just 99 cents of free cashflow per daily active user per month.

This applies even to ‘personal’ data where you can meaningfully say that it’s ‘yours’. Your electricity usage is not about other people, but it’s not valuable by itself, only in the aggregate of all domestic electricity usage in south London or Brooklyn. And DeepMind’s researchers might be able to uncover some new and clinically important correlation from a million chest x-rays - but yours, by itself, doesn’t get them anything, and they didn’t feed those x-rays into AlphaGo. Again, data isn’t one thing.

We’ve been here before: today’s discussions around AI and around data look a lot like discussions around databases in the 1980s. We transform what we can do with information and what questions we can ask, and how organisations can function. When databases were new, we worried, and some of those worries were real, but no-one today asks if America has more SQL, or if it matters that SAP is German. No-one at Davos talks about ‘SQL colonialism’. These technologies are not national strategic assets - anyone can have them, but what for? Databases enabled just-in-time supply chains, and Walmart, and let Apple make iPhones in China - those are the strategic questions. The same for AI, and ‘data’ - it’s not the new oil, just more software, so what do you build with it?

A version of this essay appeared in the Financial Times this weekend.