Apple’s WWDC keynote unveiled a slate of new features arriving with iOS 10, most of them touting differential privacy, a statistical method that’s become a valuable tool for protecting user data.
The details of the system are complicated (there’s a more detailed explanation here) but in essence, it means adding randomized data to mask individual entries without meaningfully changing the aggregate result. That way, you can get a good idea of how many people are using a particular emoji without being able to tie any specific user to a specific emoji use. But even knowing the fundamentals, we were still left in the dark about what this system would mean for iOS 10.
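To make the idea concrete, here is a minimal sketch of randomized response, a textbook technique in the differential privacy family. Apple has not said this is the mechanism it uses; the function names, the 30 percent usage rate, and the 50 percent truth-telling probability are all assumptions chosen for illustration.

```python
import random

def randomize(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the real answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_rate(reports: list, p_truth: float) -> float:
    """Undo the noise in aggregate: E[observed] = p_truth * rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Hypothetical population: about 30% of 100,000 users really used the emoji.
rng = random.Random(0)
users = [rng.random() < 0.30 for _ in range(100_000)]

# Each phone sends only its noisy report, never the true value.
reports = [randomize(used, 0.5, rng) for used in users]
estimate = estimate_rate(reports, 0.5)
print(f"estimated usage rate: {estimate:.3f}")
```

Any single report here is as likely to be a coin flip as the truth, so it says little about the person who sent it. Across enough users, though, the noise averages out and the estimate lands close to the real rate.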
In the days after that keynote, Apple has begun to open up about exactly where differential privacy is being used, and how it’s changing the shape of data collection in iOS 10. Apple’s never been as aggressive about data collection as Google, Facebook, or even Amazon, but the new generation of data-driven AI services makes at least some level of collection a necessity. That leaves a gap between the data Apple needs and the data it’s willing to collect from users’ phones. In its early uses, differential privacy is what fills that gap.
As detailed in the iOS 10 Preview Guide, differential privacy is being used in four different parts of iOS 10. The first two parts have to do with Apple’s new messaging app, which is significantly more predictive than previous versions. The new emoji replacement feature will be driven in part by aggregate data from iPhones around the world. If lots of people start replacing the word “butt” with the peach emoji, you’ll begin to see that recommendation pop up in your messages. The same thing is true with the new version of predictive text. Apple has traditionally drawn from local messaging data, looking only at the texts you write and the words you choose. But the new and improved predictive text will make smarter predictions based on more context, and draw on lots of aggregated data to do it.
The other two uses have to do with search. One, called Lookup Hints, is built into Notes, but Spotlight’s search feature is the more important of the two. Since iOS 9, Spotlight has been able to return search suggestions from activities performed within apps. If you’re constantly playing Life of Pablo in Spotify, the iOS 9 version of the system might drop a link to the album’s Spotify page when you search “pablo” in Spotlight, based on data shared by the developers of the app. And crucially, iOS 9 keeps all that data stored locally, so the NSA can’t snoop on the system to find out if you’ve been listening to too much Kanye. Developers choose what data to share in the coding of each app, but they have a strong incentive to share more and have their programs featured more often in Spotlight.
iOS 10 makes that system global, but uses differential privacy to keep similar privacy protections in place. Now developers have the option of submitting public data about what users are doing in the app — so if enough Kanye fans are submitting data, you might get the same “pablo” recommendation even if you’ve never listened to the album. Developers still decide what to share, but Apple mixes each of the submissions with random noise, making it all but impossible to trace the trend back to a single Kanye listener.
Taken together, the privacy impact is mixed. In each place we see differential privacy being used, it comes with a significant expansion of the data that’s being collected. In most cases, the data involved could be quite sensitive. These are private conversations and actions in third-party apps, exactly the kind of data Apple is usually so vocal about keeping private. The company wouldn’t outright say that this data could only be collected because of the differential privacy protections, but it’s clear Apple wanted some system like it in place before it expanded the data practices.
That’s particularly important given another property differential privacy has: it’s difficult to tell how well it’s working from the outside. Unlike the clear black-and-white of encryption, differential privacy works in shades of grey, balancing the reliability of the aggregate information against its potential to identify specific users. That’s sometimes referred to as a privacy budget, a fixed allowance of privacy loss that engineers have to work within. But if you don’t work at Apple, it’s difficult to tell how strict that privacy budget really is. Apple insists it’s strict enough to prevent any re-identification, but we’re mostly left to take its word for it.
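The tradeoff behind a privacy budget can be sketched with a simple, assumed scheme: each user reports the truth with probability p_truth and otherwise flips a fair coin. In the standard formulation, the budget is a number usually written as epsilon, and a smaller epsilon means stronger privacy. Nothing below reflects Apple’s actual parameters, which the company has not published; the function names and sample values are hypothetical.

```python
import math

def epsilon(p_truth: float) -> float:
    """Privacy loss of the coin-flip scheme: the log of how much more likely a
    'yes' report is from a true 'yes' than from a true 'no'."""
    p_yes_given_yes = p_truth + (1 - p_truth) / 2
    p_yes_given_no = (1 - p_truth) / 2
    return math.log(p_yes_given_yes / p_yes_given_no)

def standard_error(p_truth: float, n: int, rate: float = 0.3) -> float:
    """Approximate error of the de-noised aggregate estimate from n reports,
    assuming a hypothetical true rate."""
    observed = p_truth * rate + (1 - p_truth) * 0.5
    return math.sqrt(observed * (1 - observed) / n) / p_truth

# Sweep the truth-telling probability to see privacy traded against accuracy.
for p in (0.1, 0.5, 0.9):
    print(f"p_truth={p}: epsilon={epsilon(p):.2f}, "
          f"error={standard_error(p, 10_000):.4f}")
```

Turning p_truth down shrinks epsilon, which is better for the individual, but inflates the error in the aggregate. Where a company sets that dial is exactly the part outsiders can’t see.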
It’s easy to make too much of that uncertainty. Even without a strict privacy budget, these measures are still worlds away from Gmail’s routine content scans or Facebook’s database of private message URLs. Apple just isn’t collecting as much data as its competitors, and the new systems mean much of the data will be protected in a way that’s legitimately unprecedented within the industry. None of that is a silver bullet, but it’s good news. This, roughly, is how Apple does Big Data: carefully, and with a few tricks no one’s ever seen before.