A Guide to Differential Privacy at Scale

By Sharmistha Chatterjee, Senior Manager Data Sciences at Publicis Sapient

Introduction

The growth of AI and machine learning (ML) has led researchers to study and stress the development of ethical AI solutions, with AI models and solutions able to provide:

  • Explainability & Interpretability
  • Privacy 
  • Fairness
  • Model Governance

The concept of privacy in ML has been extended from individual, standalone models to distributed training such as federated learning. Here, models trained locally on individual devices are sent to decentralized servers that aggregate the model weights and inferences from several devices to yield an aggregated ensemble model with greater privacy and trust. As extensive AI research has been translated into production solutions, hybrid ML models have found use in different industrial domains.

Such hybrid models, built using Differential Privacy at the application level and a secure transfer protocol at the transport level, have been used extensively to build scalable solutions in the following industry domains:

  • Distributed Control Systems
  • Industrial IoT Systems
  • Supply Chain
  • Retail
  • Financial Services
  • Health Care
  • Automotive and Electric Vehicles
  • Locomotives/Transportation Systems

The below figure illustrates use of Differential Privacy (DP) for different distributed industrial applications:

Image source

Researchers have tried to address DP use-cases, from general purpose statistical queries to special-purpose analytics tasks such as graph analysis, range queries, and analysis of data streams. Let’s take a quick look at some of the legacy DP systems:

Sl. No. | System | Functionality
1. | PINQ | Supports a LINQ-based query language, and implements the Laplace mechanism with a measure of global sensitivity
2. | Weighted PINQ | Extends PINQ to weighted datasets, graphs and implements a specialized mechanism for that setting.
3. | Airavat | Enforces differential privacy for MapReduce using the Laplace mechanism
4. | Fuzz | Enforces DP for functional programs, disallowing global static variables from adversaries
5. | GUPT | Sample & Aggregate framework with protection from side-channel attacks
6. | DJoin | DP for queries over distributed datasets with special cryptographic functions during query execution
7. | Privacy Integrated Data Stream Queries | The streaming extension to PINQ provides a programmatic interface for handling streaming data with DP

Source

This blog is structured around:

  1. Differential Privacy applications developed by the software giants Google, Apple and Microsoft
  2. Differential Privacy for Location Based Services
  3. Differential Privacy for IoT Sensors for Edge and Fog Computing

Our discussion remains confined to Local Differential Privacy, where each user is responsible for applying a differentially private mechanism to their own data by adding noise before the data leaves their device.

Local Differential Privacy (LDP) Based Solutions

RAPPOR by Google

RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), implemented in the Chrome browser, collects data from tens of millions of opt-in users per day. It tracks settings in the browser, e.g. home page and search engine, to help protect them from unwanted or malicious hijacking. It eliminates the need for a trusted third-party server and puts control over clients' data back into their own hands.

It works on the principle of randomized response, which provides a strong Differential Privacy (DP) guarantee. It collects statistics on sensitive topics by crowdsourcing from end users through client-side software, in settings where survey respondents wish to retain confidentiality. The RAPPOR mechanism is performed locally on the client and does not require a trusted third party. In terms of frequency, its guarantees are designed to hold even when data is collected from the same clients repeatedly (or even infinitely often).
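
To make the randomized response principle concrete, here is a minimal Python sketch for a single yes/no attribute. It is the textbook mechanism rather than RAPPOR's full pipeline, and the function names and the choice of epsilon are assumptions made for illustration.

```python
import math
import random

def randomized_response(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true bit with probability e^eps / (1 + e^eps), else its negation."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth

def estimate_true_fraction(reports: list[bool], epsilon: float) -> float:
    """Unbiased estimate of the fraction of clients whose true bit is 1."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = f * p + (1 - f) * (1 - p); solve for the true fraction f.
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)
```

No single report reveals a client's true answer with certainty, yet the aggregate fraction can still be recovered because the noise is removed in expectation on the server.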

RAPPOR can be used to collect statistics on categorical client properties, by having each bit in a client’s response represent whether, or not, that client belongs to a category.   

It functions by limiting the number of correlated categories, or Bloom filter hash functions, reported by any single client. This helps RAPPOR to maintain its differential-privacy (DP) guarantees even when statistics are collected on multiple aspects of clients. 
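
The client-side encoding described above can be sketched as follows. This is a simplified illustration, assuming SHA-256 based hash functions and the parameter names f, p and q for the noise levels; it omits cohorts and the server-side decoding, and it is not Chrome's actual implementation.

```python
import hashlib
import random

def bloom_bits(value: str, num_bits: int, num_hashes: int) -> list[int]:
    """Set num_hashes positions (possibly coinciding) in a num_bits Bloom filter."""
    bits = [0] * num_bits
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits

def permanent_response(bits: list[int], f: float, rng: random.Random) -> list[int]:
    """Noisy copy to be memoized: each bit becomes 1 w.p. f/2, 0 w.p. f/2, else stays."""
    out = []
    for b in bits:
        u = rng.random()
        out.append(1 if u < f / 2 else 0 if u < f else b)
    return out

def instantaneous_response(perm: list[int], p: float, q: float,
                           rng: random.Random) -> list[int]:
    """Per-report noise: send 1 w.p. q where the memoized bit is 1, w.p. p where it is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in perm]
```

The memoized permanent response is what protects a client who reports the same value many times, while the instantaneous layer makes individual reports harder to link. Keeping `num_hashes` small limits how many bits any single client can influence.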

The foremost limitations of RAPPOR are its slowness and limited range of application.

LDP in iOS, iPhone, Mac - Apple Products

Local Differential Privacy (LDP), as applied on Apple's platforms, works on the principle that statistical noise that is slightly biased can mask a user's individual data before it is shared with Apple. When many people submit similar data, the added noise averages out over the large number of data points, so that Apple can draw clear inferences and derive meaningful aggregate information.

It uses the principle of privacy budget (quantified by the parameter epsilon), and sets a strict limit on the number of contributions from a user in order to preserve their privacy. 
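
A privacy budget can be pictured as a simple counter on the device. The sketch below assumes basic sequential composition, where the epsilon values of successive contributions add up; the class name and the numbers used with it are hypothetical and are not Apple's actual settings.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def try_spend(self, epsilon: float) -> bool:
        """Record a contribution only if it fits in the remaining budget."""
        if epsilon <= 0 or self.spent + epsilon > self.total_epsilon + 1e-12:
            return False
        self.spent += epsilon
        return True
```

For example, with `PrivacyBudget(4.0)` two contributions at epsilon 2.0 are accepted and any further contribution in the same period is refused, which is how a cap on the number of contributions follows from the budget.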

Metrics and insights gathered this way improve the intelligence and usability of the following features, while also protecting the privacy of user activity in a given time period:

  • QuickType suggestions
  • Emoji suggestions
  • Lookup Hints 
  • Safari Energy Draining Domains
  • Safari Autoplay Intent Detection (macOS High Sierra)
  • Safari Crashing Domains (iOS 11)
  • Health Type Usage (iOS 10.2)

The Count Mean Sketch technique allows Apple to determine the most popular emoji to help design better ways to find and use our favorite emoji. 

The Count Mean Sketch and Hadamard Count Mean–based Sketch techniques for differential privacy rely on noise addition along with hash encoding using a series of mathematical functions.
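
The combination of hash encoding and noise addition can be sketched roughly as below. This follows the general shape of a Count Mean Sketch (random hash choice, one-hot encoding, bit flipping, debiased aggregation), but the SHA-256 based hash family, the function names and all parameter values are assumptions for illustration, not Apple's production code.

```python
import hashlib
import math
import random

def bucket(item: str, hash_index: int, m: int) -> int:
    """hash_index-th hash function, mapping an item to one of m buckets."""
    digest = hashlib.sha256(f"{hash_index}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def client_report(item: str, epsilon: float, k: int, m: int, rng: random.Random):
    """Encode item as a +/-1 one-hot vector under a random hash, then flip bits."""
    j = rng.randrange(k)
    vec = [-1] * m
    vec[bucket(item, j, m)] = 1
    flip_prob = 1.0 / (1.0 + math.exp(epsilon / 2.0))
    noisy = [-v if rng.random() < flip_prob else v for v in vec]
    return noisy, j

def build_sketch(reports, epsilon: float, k: int, m: int):
    """Server side: debias each report and add it to row j of a k x m sketch."""
    c = (math.exp(epsilon / 2.0) + 1.0) / (math.exp(epsilon / 2.0) - 1.0)
    sketch = [[0.0] * m for _ in range(k)]
    for noisy, j in reports:
        for col, v in enumerate(noisy):
            sketch[j][col] += k * (c / 2.0 * v + 0.5)
    return sketch

def estimate_count(item: str, sketch, n: int) -> float:
    """Average the item's cell over all rows and correct for hash collisions."""
    k, m = len(sketch), len(sketch[0])
    mean_cell = sum(sketch[row][bucket(item, row, m)] for row in range(k)) / k
    return (m / (m - 1.0)) * (mean_cell - n / m)
```

The server never sees an un-noised vector, and the estimate for any one item is only accurate when many clients contribute, which is the trade-off between privacy and utility discussed in this section.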

Apple's sketch and transform algorithms count the frequencies of items such as typing statistics. Apple uses the system to collect data from iOS and OS X users on popular emojis, popular words, which websites to mute, and where to play audio.

The server side is composed of an Ingestor and an Aggregator. Private information such as the IP address is stripped off, and records are processed in batches that remove metadata, such as the timestamps of the privatized records received, and separate the records by use case.

The architecture strives to balance privacy, utility, server computation overhead and device bandwidth, without lowering accuracy.

Local Differential Privacy in Microsoft Products

Microsoft has leveraged Local Differential Privacy (LDP) to protect user privacy in several applications, including telemetry in Windows (used by millions of users), advertiser queries in LinkedIn, suggested replies in Office, and manager dashboards in Workplace Analytics. Together with Harvard's OpenDP initiative, it has also released a differential privacy platform built around a pluggable open source library of differentially private algorithms.

The DP platform provided by Microsoft offers a host of components that can be flexibly configured, allowing big data specialists to use the right set of services for their environments.

Image source

The framework provides mechanism implementations based on mature differential privacy research, together with APIs for defining an analysis. Each analysis is supported by suitable validation techniques for estimating the total privacy loss.

Private telemetry data collection employs:

  • 1BitMean to collect numeric data, independently from different users
  • dBitFlip to collect (sparse) histogram data from each user independently, distributed across buckets.
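
A one-bit mean estimator in the spirit of 1BitMean can be sketched as follows. Each user holding a value in [0, max_value] sends a single biased bit and the collector rescales the bits to estimate the mean. The function names and the sample values in the usage note are assumptions for illustration.

```python
import math
import random

def one_bit_report(x: float, max_value: float, epsilon: float,
                   rng: random.Random) -> int:
    """Send a single bit whose probability of being 1 grows linearly with x."""
    e = math.exp(epsilon)
    prob_one = 1.0 / (e + 1.0) + (x / max_value) * (e - 1.0) / (e + 1.0)
    return 1 if rng.random() < prob_one else 0

def estimate_mean(bits: list[int], max_value: float, epsilon: float) -> float:
    """Unbiased estimate of the population mean from the collected bits."""
    e = math.exp(epsilon)
    total = sum((b * (e + 1.0) - 1.0) / (e - 1.0) for b in bits)
    return (max_value / len(bits)) * total
```

With, say, 20,000 users and epsilon 1.0, the estimate typically lands within a unit or two of the true mean on a 0 to 100 scale, even though each user discloses only one noisy bit.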

According to Microsoft’s DP design principles “the continuous collection of counter data is strong when the user's behavior remains approximately the same, varies slowly, or varies around a small number of values over the course of data collection”.

The memoization and perturbation techniques applied by Microsoft in the context of private collection of counters avoid substantial losses in accuracy or privacy, and function well when data is collected over long periods of time.

Challenges

The real challenge in scaling DP solutions lies in integrating them into real-world environments with highly customized data structures. They also need good KPIs for performance and reliability. Legacy DP systems require modification of existing data pipelines or replacement of database engines when they are integrated into ML systems and Big Data pipelines.

For edge computing the challenge lies in limitations of individual edge-servers in detecting threats. Edge servers are middleware components that can serve as a distillation point for sensitive data before they are sent to the cloud. Further, when edge servers become the centre of data processing and storage, then they may not have enough resources to protect themselves.

To develop a safe ecosystem for edge-computing, it’s essential to design a scalable, efficient, and decentralized security mechanism. 

Differential Privacy for Location Based Services

For automotive and related location-based services, real-time data processing and the preservation of location privacy are both essential to the driver's security. Edge computing is a promising way to offer efficient, differentially private location-based services, since computing and storage resources can be placed at the network edge. Edge nodes (e.g., roadside infrastructure and base stations) take over responsibility from remote servers for providing privacy-enabled solutions, ensuring location privacy within their coverage area while balancing utility and privacy.

The framework is built on two components, the Differentially Privacy-preserving Location-based Service Usage (Pri-LBS) mechanism and the Privacy Level Adjustment (PLA) module, which are designed to let vehicles request useful information based on a submitted location without revealing their true location. Any vehicle registered with the service first submits its location to the edge node. The edge node executes the privacy-preserving mechanism by generating a noisy location from a two-dimensional Gaussian distribution and using it to query the points of interest (POI). In addition, it filters the returned information, so that nothing beyond the vehicle's coverage area (circle S) is leaked to an attacker.
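
The two steps performed by the edge node, perturbing the location and filtering the results, might look like the sketch below. It assumes planar coordinates in arbitrary units and POIs given as (name, x, y) tuples; how sigma is calibrated to a privacy level is left open here, and all names are hypothetical.

```python
import math
import random

def perturb_location(x: float, y: float, sigma: float, rng: random.Random):
    """Edge node: add independent Gaussian noise to each planar coordinate."""
    return x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)

def filter_to_coverage(pois, centre, radius: float):
    """Keep only POIs inside the vehicle's coverage circle S before replying."""
    cx, cy = centre
    return [p for p in pois if math.hypot(p[1] - cx, p[2] - cy) <= radius]
```

A larger sigma hides the vehicle better but returns POIs that are less relevant, so an adjustable privacy level amounts to choosing sigma per request.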

This kind of architecture guarantees an acceptable trade-off between privacy and service-quality by adjustable selection of privacy level.

Differential Privacy in Location Based Services. Image source.

Differential Privacy for Databases and SQL Queries

CHORUS embeds an access-control differential privacy mechanism into the query before execution, and is primarily designed for statistical SQL queries. Uber uses CHORUS for its internal analytics tasks, enforcing differential privacy on the query output. The behaviour of DP-based query systems varies with the size of the dataset and with every single query. The privacy budget is managed by providing the query rewriters with an ε value apportioned to each query.

Its flexibility allows integration with any SQL database, with a processing range of up to 10,000 queries per day, without any modifications to the database or queries.

As CHORUS is DBMS-independent, any high-performance database can be plugged in to scale to big data. Further, it eliminates the need for post-processing, allowing easy integration into existing data processing pipelines.
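
The idea of embedding the mechanism in the query itself can be illustrated with a toy rewriter. The sketch below is not CHORUS: it handles only `COUNT(*)`, uses an in-memory SQLite database, and the names `laplace_noise`, `rewrite_count_query` and the `trips` table are hypothetical. It shows how a per-query epsilon turns into noise that the database adds during execution, so no post-processing step is needed.

```python
import random
import re
import sqlite3

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) sample: difference of two i.i.d. exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def rewrite_count_query(sql: str, epsilon: float) -> str:
    """Embed noise in the query; a COUNT has sensitivity 1, so scale = 1/epsilon."""
    noisy = f"COUNT(*) + laplace_noise({1.0 / epsilon!r})"
    return re.sub(r"COUNT\(\*\)", noisy, sql, flags=re.IGNORECASE)

conn = sqlite3.connect(":memory:")
# Register the noise sampler so the rewritten SQL can call it in the engine.
conn.create_function("laplace_noise", 1, laplace_noise)
conn.execute("CREATE TABLE trips (city TEXT)")
conn.executemany("INSERT INTO trips VALUES (?)",
                 [("Pune",)] * 40 + [("Delhi",)] * 60)

private_sql = rewrite_count_query(
    "SELECT COUNT(*) FROM trips WHERE city = 'Pune'", epsilon=0.5)
noisy_count = conn.execute(private_sql).fetchone()[0]
```

The analyst's original query text is untouched apart from the rewritten aggregate, which is why this style of system can sit in front of an existing database.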

Differentially private queries with CHORUS integration with GCP. Image source.

Scalable, Secured IoT Platform Architecture with Google Cloud Components and Differential Privacy

Google Cloud Platform provides cloud services composed of data ingestion, flow, storage and analytics services. These services are endowed with the following capabilities to provide easy integration interface to the IoT providers:

  • Seamless Data Ingestion and Device Control for improved interoperability
  • End to End Data Security enabling data and device protection
  • Easy device Onboarding to simplify deployment for security-enabled devices
  • Robust Scalability with cloud-based infrastructure
  • Better Insights with GCP’s advanced analytics infrastructure
  • Data Monetization through additional services and capabilities

One potential integration vendor for GCP is Intel's IoT Platform. Their joint reference architecture not only offers a scalable platform with edge devices, network and cloud components, but also provides the backbone for building an ecosystem of differentially private solutions for edge networks at enterprise level.

GCP components for DP in IoT infrastructure. Image source.

Differential Privacy for Federated Learning

Mobile-edge networks also make use of Differential Privacy through a private federated learning scheme (FedMEC), which partitions the model efficiently: a deep neural network is split into two parts, thereby offloading the heavy computation to the edge server.

The architecture includes a differentially private data perturbation mechanism that adds Laplacian random noise to the client-side features before they are uploaded to the edge server.

FedMEC relies on the mobile edge computing environment, where federated learning takes place in three phases: client, edge and server.

  1. Edge devices are assigned the task of simple and lightweight feature extraction and perturbation. Further, the client-side local neural network is assigned by the cloud server.
  2. The results computed from the original data from client-side are perturbed for privacy reasons before being transmitted to the edge server to protect the privacy contained in the raw data.
  3. All kinds of resource-hungry tasks are off-loaded to the edge servers and cloud center.
  4. A pre-trained deep neural network is initialized as a global model in the cloud. The network can be trained using public data which has a similar distribution with private data as the auxiliary information dataset.
  5. The pretrained global neural network can be layered and extracted along the last layer of the convolution layer, which can be sent to each client for feature extraction.
  6. The second phase, at the edge nodes, is composed of the dense layers, which update the model parameters by executing the forward and backward propagation procedures.
  7. Successive iterations for the model updates are aggregated and averaged at the cloud server.
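
The client-side perturbation step in the list above can be sketched as follows. This is only an illustration of adding Laplace noise to extracted features: the clipping bound, the way the noise scale is calibrated and the function names are assumptions, not details taken from the FedMEC scheme.

```python
import random

def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: difference of two i.i.d. exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_features(features: list[float], bound: float, epsilon: float,
                     rng: random.Random) -> list[float]:
    """Clip each feature to [0, bound], then add Laplace noise before upload."""
    clipped = [min(max(v, 0.0), bound) for v in features]
    # Changing one input can move every clipped feature by at most `bound`,
    # so the L1 sensitivity of the whole vector is len(features) * bound.
    scale = len(features) * bound / epsilon
    return [v + laplace(scale, rng) for v in clipped]
```

Only the perturbed vector leaves the device; the edge server then runs the dense layers on it, so the raw data and the un-noised features both stay on the client.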

Differential privacy in mobile-edge computing. Image source.

Differential Privacy at Edge Networks

IoT data analytics has been evolving with the local differential privacy obfuscation (LDPO) framework, which ensures data privacy and guarantees data utility for edge computing. The LDPO framework serves as one of the strong pillars of IoT infrastructure by aggregating and distilling the IoT data at the edge without disclosing users' confidential information.

The basic approach used by the LDPO framework is to add noise to prevent leakage of private information. One major challenge is that the noise might degrade data utility, which has led researchers to design an LDP-based data distillation model that minimizes the collection of personal data while maximising utility. This also reduces the amount of data being shared, by obfuscating the learned features using DP. The LDPO framework is built with the following components:

  • Feature distillation based data minimization - This process ensures learning of the most compact and useful features.
  • Data obfuscation with LDP - The aim is to perturb features to add privacy guarantee.
  • Data reconstruction - This process generates obfuscated IoT data from perturbed features.

The initial phase of feature distillation is composed of Feature Assignment and Feature Cloaking:

  • Feature Assignment - Any attribute within the dataset is anonymized to a k-bit string, using several hash functions, so that the transformation yields a unique string.
  • Feature Cloaking - With a careful selection of minmax filter F, each bit in F is randomized to 0 or 1 with a required probability to prevent sensitive attributes from being exposed.

Edge servers adopt randomized minmax filters to distil data from different IoT devices to represent the univariate or multivariate distribution.

The second phase involves Distillation-Based Local Differential Privacy, applied after the bit strings are distilled (or transformed) by the edge server. The randomness of the bit strings helps the edge server in the overall counting process for each bit.
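
The per-bit counting done by the edge server can be illustrated with the sketch below. It assumes each device flips every bit of its string independently with a known probability, which is a simplification of the cloaking step, and the function names are hypothetical.

```python
import random

def flip_bits(bits: list[int], flip_prob: float, rng: random.Random) -> list[int]:
    """Device side: flip each bit independently with probability flip_prob."""
    return [1 - b if rng.random() < flip_prob else b for b in bits]

def estimate_bit_counts(reports: list[list[int]], flip_prob: float) -> list[float]:
    """Edge side: unbiased per-position estimate of how many true bits were 1."""
    n = len(reports)
    observed = [sum(column) for column in zip(*reports)]
    # E[observed] = true * (1 - p) + (n - true) * p; solve for true.
    return [(c - n * flip_prob) / (1.0 - 2.0 * flip_prob) for c in observed]
```

Because the flip probability is known, the edge server can correct the raw counts in aggregate without learning any individual device's original string.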

Each edge server provides a privacy guarantee by adding an amount of random noise/obfuscation to the data from several connected crowdsourced IoT devices. The data from several edge servers are then ensembled. The obfuscation parameter determines the level of privacy guarantee, which may be stronger or weaker depending on the level of noise added to the IoT data. This parameter also influences the accuracy of the model.

The third phase is composed of Feature Rebuilding and Distribution Estimation. The first stage replays, on the edge servers, the hash functions used on the IoT devices in order to reconstruct the minmax filters. In the next stage, Distribution Estimation, the edge server uses the minmax filters as feature variables to estimate the univariate or multivariate distributions (depending on the features and use case); linear regression is one possible mechanism for the univariate case. With mutually independent minmax filter features, ML techniques like Lasso Regression can be used for multivariate distribution estimation.

Obfuscation Techniques for DP at Edge Networks. Image source.

Differential Privacy for Fog Computing

The primary objective of incorporating Differential Privacy in a Fog Computing architecture is to improve communication efficiency through data minimization with the use of auto-encoders, as well as to reduce the load on the cloud. What drives this privacy-enabled framework is its high scalability, elastic services, support for mobile devices and low latency.

Differential privacy data aggregation protocol in fog computing. Image source.

In this IoT-based fog computing architecture, the heavy computations are distributed to the edge of the network, relieving the cloud server of this burden. The architecture designs “an efficient data aggregation method that preserves the privacy of the users’ data and allows for multifunctional aggregation queries in an IoT setting”. Thus, the architecture tries to fulfill three primary objectives:

  1. Smooth and correct implementation of multifunctional aggregation 
  2. Guarantee the privacy of the collected user data, through DP
  3. Ensure the aggregation results are close to the results without privacy protection

In addition, the architecture is designed to send only the  aggregation results to the server rather than all the sensory data, thereby improving the communication efficiency. 

The privacy issues are handled by partitioning and separating the original data to two fixed fog nodes, delivered by the Fog center.

  • Fog nodes: The fog nodes are efficient devices for computing and storing data that extend the edge of the cloud service. These devices serve as storage to answer aggregation queries sent from the fog center. 
  • Fog center: The fog center is responsible for (a) transferring queries to the appropriate aggregation query set, to be answered by the fog nodes, (b) gathering the returned query results from the fog nodes, and (c) calculating the original query results and reporting them to the cloud server.
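
The split between two fog nodes and a fog center can be pictured with a small sum query. The sketch below assumes additive splitting of each reading and Laplace noise added once at the fog center; the actual protocol may partition data and place noise differently, the random mask is illustrative rather than cryptographically secure, and all names are hypothetical.

```python
import random

def split_reading(value: float, rng: random.Random) -> tuple[float, float]:
    """Split a reading into two additive shares, one per fog node."""
    mask = rng.uniform(-1000.0, 1000.0)
    return mask, value - mask

def fog_center_sum(readings: list[float], epsilon: float, sensitivity: float,
                   rng: random.Random) -> float:
    """Each fog node sums its shares; the fog center combines them and adds noise."""
    shares = [split_reading(v, rng) for v in readings]
    node_a_total = sum(a for a, _ in shares)
    node_b_total = sum(b for _, b in shares)
    scale = sensitivity / epsilon
    # Laplace(0, scale) sample as the difference of two exponential draws.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return node_a_total + node_b_total + noise
```

Neither fog node sees a complete reading, and only the single noisy aggregate travels on to the cloud server, which is where the communication saving comes from.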

Conclusion

In this blog we discussed differentially private applications at scale for modeling different business problems, ranging from a web browser to API-driven SaaS services, Location Based Services, and Edge and Fog Computing.

Scalable differentially private solutions for Industrial IoT (IIoT) can be deployed across edge, fog and cloud architectural layers, where the edge and fog layers complement each other.

Fog computing relies on centralized systems that can communicate with industrial gateways and embedded systems on a local area network (LAN). In contrast, edge computing performs much of the processing on embedded devices/platforms with direct communication to sensors and controllers.

The key to designing such differentially private systems lies in understanding the problem domain and how the business wants to formulate the DP-based machine learning use case within the architecture. Another important factor is support for personalized privacy budget allocation strategies for Personalized DP solutions, with event-level privacy (such as modeling privacy events of different sizes).

While considering the system architecture, other aspects to keep in mind are formalizing data storage for streaming data and defining data leakage prevention techniques, so that there is no data leakage in the pipeline. In addition, data minimization principles should be applied to iteratively build models with both numerical and categorical queries.

References

  1. Google Rappor
  2. Learning with Privacy at Scale
  3. Introducing the new differential privacy platform from Microsoft and Harvard’s OpenDP
  4. Chorus: Differential Privacy via Query Rewriting
  5. Machine Learning Differential Privacy With Multifunctional Aggregation in a Fog Computing Architecture
  6. Distilling at the Edge: A Local Differential Privacy Obfuscation Framework for IoT Data Analytics
  7. An Efficient Federated Learning Scheme with Differential Privacy in Mobile Edge Computing
  8. Seamless Edge-to-Cloud IoT Integration Speeds Time to Market
  9. Differential Privacy Techniques for Cyber Physical Systems: A Survey, IEEE Communications Surveys and Tutorials, Sep 2019  
  10. Global vs Local Differential Privacy