The growth of AI and machine learning (ML) has led researchers to study and emphasize the development of ethical AI solutions, with AI models and solutions able to provide:
The concept of privacy introduced for ML has been extended from individual, standalone models to distributed training such as federated learning. Here, models trained locally are sent to decentralized servers that aggregate model weights and inferences from several individual devices to yield an aggregated ensemble model with greater privacy and trust. As AI research is translated into production solutions, hybrid ML models are finding use in different industrial domains.
Such hybrid models, built using Differential Privacy at the application level and Secured Transfer Protocol at the transport level, have been used extensively to build scalable solutions in the following industry domains.
The figure below illustrates the use of Differential Privacy (DP) for different distributed industrial applications:
Researchers have tried to address DP use-cases, from general purpose statistical queries to special-purpose analytics tasks such as graph analysis, range queries, and analysis of data streams. Let’s take a quick look at some of the legacy DP systems:
SL NO | SYSTEM | FUNCTIONALITY |
1. | PINQ | Supports a LINQ-based query language, and implements the Laplace mechanism with a measure of global sensitivity |
2. | Weighted PINQ | Extends PINQ to weighted datasets, graphs and implements a specialized mechanism for that setting. |
3. | Airavat | Enforces differential privacy for MapReduce using the Laplace mechanism |
4. | Fuzz | Enforces DP for functional programs, disallowing global static variables from adversaries |
5. | GUPT | Sample & Aggregate framework with protection from side-channel attacks |
6. | DJoin | DP for queries over distributed datasets with special cryptographic functions during query execution |
7. | Privacy Integrated Data Stream Queries | The streaming extension to PINQ provides a programmatic interface for handling streaming data with DP |
This blog is structured around:
Our discussion remains confined to Local Differential Privacy, where each user is responsible for applying a differentially private mechanism to their own data by adding noise.
RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), implemented in the Chrome browser, collects data from tens of millions of opt-in users per day by tracking settings in the browser (e.g. home page, search engine), and protects those settings from unwanted or malicious hijacking. It eliminates the need for a trusted third-party server and puts control over clients' data back into their own hands.
It works on the principle of randomized response, which provides a strong Differential Privacy (DP) guarantee. It collects statistics on sensitive topics by crowdsourcing from end users through client-side software, in settings where survey respondents wish to retain confidentiality. The RAPPOR mechanism is performed locally on the client, and does not require a trusted third party. It is designed for settings where data is collected from clients repeatedly (or even infinitely often).
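To make the randomized response idea concrete, here is a minimal sketch of the classic one-bit version, not RAPPOR's full multi-stage scheme. The function names, the `epsilon` value and the simulated population are assumptions made purely for illustration.

```python
import math
import random


def randomized_response(truth, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if rng.random() < p_truth else not truth


def estimate_true_fraction(reports, epsilon):
    """Debias the observed fraction of 'yes' reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    # E[observed] = p*f + (1 - p)*(1 - f), solved for the true fraction f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
# Hypothetical population: 30% of 10,000 users hold the sensitive attribute.
true_bits = [i < 3000 for i in range(10000)]
reports = [randomized_response(bit, 1.0, rng) for bit in true_bits]
estimate = estimate_true_fraction(reports, 1.0)
```

No single report reveals a user's true answer with certainty, yet the aggregate estimate lands close to the true fraction because the noise is known and can be subtracted out.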
RAPPOR can be used to collect statistics on categorical client properties, by having each bit in a client’s response represent whether, or not, that client belongs to a category.
It functions by limiting the number of correlated categories, or Bloom filter hash functions, reported by any single client. This helps RAPPOR to maintain its differential-privacy (DP) guarantees even when statistics are collected on multiple aspects of clients.
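The sketch below shows how a categorical value might be encoded RAPPOR-style: hash the value into a small Bloom filter, apply a permanent (memoized) randomized response, then an instantaneous randomized response for each report. The parameter names `f`, `p` and `q`, the filter size and the SHA-256 based hashing are assumptions for illustration, not Chrome's actual configuration.

```python
import hashlib
import random


def bloom_bits(value, num_bits, num_hashes):
    """Set one bit per hash function; fewer hash functions means fewer correlated bits."""
    bits = [0] * num_bits
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % num_bits] = 1
    return bits


def permanent_randomized_response(bits, f, rng):
    """Computed once per value and memoized: each bit becomes 1 w.p. f/2, 0 w.p. f/2, else kept."""
    noisy = []
    for bit in bits:
        u = rng.random()
        if u < f / 2:
            noisy.append(1)
        elif u < f:
            noisy.append(0)
        else:
            noisy.append(bit)
    return noisy


def instantaneous_randomized_response(permanent_bits, p, q, rng):
    """Computed per report: send 1 w.p. q if the memoized bit is 1, w.p. p otherwise."""
    return [1 if rng.random() < (q if bit else p) else 0 for bit in permanent_bits]


rng = random.Random(1)
bloom = bloom_bits("https://example.com", num_bits=16, num_hashes=2)
permanent = permanent_randomized_response(bloom, f=0.5, rng=rng)
report = instantaneous_randomized_response(permanent, p=0.25, q=0.75, rng=rng)
```

Reusing the memoized `permanent` bits across reports is what keeps the guarantee from eroding when the same client is surveyed again and again.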
The foremost limitations of RAPPOR are its slowness and limited applicability.
Local Differential Privacy (LDP), as applied on Apple's platforms, works by adding slightly biased statistical noise that masks a user's individual data before it is shared with Apple. When many people submit similar data, the added noise averages out over the large number of data points, leaving Apple with a clear aggregate signal from which it can derive meaningful information.
It uses the principle of privacy budget (quantified by the parameter epsilon), and sets a strict limit on the number of contributions from a user in order to preserve their privacy.
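A privacy budget with a contribution cap can be pictured as a small piece of bookkeeping on the device. The class below is a hypothetical sketch: `PrivacyBudget`, its fields and the numbers used are illustrative assumptions, not Apple's implementation.

```python
class PrivacyBudget:
    """Tracks one user's epsilon spending and contribution count for a period."""

    def __init__(self, total_epsilon, max_contributions):
        self.total_epsilon = total_epsilon
        self.max_contributions = max_contributions
        self.spent = 0.0
        self.contributions = 0

    def try_spend(self, epsilon):
        """Return True and record the spend only if both limits still allow it."""
        if self.contributions >= self.max_contributions:
            return False
        if self.spent + epsilon > self.total_epsilon:
            return False
        self.spent += epsilon
        self.contributions += 1
        return True


# Hypothetical limits: total epsilon of 4.0 and at most two donations per period.
budget = PrivacyBudget(total_epsilon=4.0, max_contributions=2)
results = [budget.try_spend(2.0) for _ in range(3)]
```

Once either limit is reached, further records are simply not sent until the period resets.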
Suitable metrics and insights improve the intelligence and usability of the following applications. In addition, the approach protects the privacy of user activity in a given time period.
The Count Mean Sketch technique allows Apple to determine the most popular emoji to help design better ways to find and use our favorite emoji.
The Count Mean Sketch and Hadamard Count Mean Sketch techniques for differential privacy rely on noise addition along with hash encoding using a series of mathematical functions.
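The following is a rough sketch of how a Count Mean Sketch can work, following the general shape of the published technique: each client hashes its item with one randomly chosen hash function, one-hot encodes it, and flips bits at random; the server debiases and averages. The hash construction, the parameter values and the emoji names are assumptions for illustration only.

```python
import hashlib
import math
import random


def hash_index(item, row, width):
    """Hash `item` with the `row`-th hash function into [0, width)."""
    digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % width


def client_encode(item, epsilon, num_hashes, width, rng):
    """Pick one hash function, one-hot encode in {-1, +1}, flip each entry at random."""
    row = rng.randrange(num_hashes)
    vector = [-1] * width
    vector[hash_index(item, row, width)] = 1
    flip_prob = 1.0 / (1.0 + math.exp(epsilon / 2.0))
    noisy = [-v if rng.random() < flip_prob else v for v in vector]
    return noisy, row


def server_aggregate(records, epsilon, num_hashes, width):
    """Debias each privatized vector and add it to the sketch row the client chose."""
    c = (math.exp(epsilon / 2.0) + 1.0) / (math.exp(epsilon / 2.0) - 1.0)
    sketch = [[0.0] * width for _ in range(num_hashes)]
    for noisy, row in records:
        for i, v in enumerate(noisy):
            sketch[row][i] += num_hashes * (c * v / 2.0 + 0.5)
    return sketch


def estimate_count(item, sketch, num_records):
    """Average the item's cell across rows, then correct for expected hash collisions."""
    num_hashes, width = len(sketch), len(sketch[0])
    mean = sum(sketch[r][hash_index(item, r, width)] for r in range(num_hashes)) / num_hashes
    return (width / (width - 1)) * (mean - num_records / width)


rng = random.Random(0)
items = ["joy"] * 1200 + ["heart"] * 800  # hypothetical emoji usage
records = [client_encode(item, 4.0, 4, 128, rng) for item in items]
sketch = server_aggregate(records, 4.0, 4, 128)
joy_estimate = estimate_count("joy", sketch, len(records))
```

The server never sees which emoji a given user typed, only a noisy bit vector and a hash index, but the popular items still stand out in the aggregated sketch.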
Apple's sketch and transform techniques count the frequencies of items such as typing statistics. Apple uses the system to collect data from iOS and OS X users: popular emojis, popular words, and which websites to mute and where to play audio.
The system is composed of an Ingestor and an Aggregator. Private information such as the IP address is stripped off, and records are processed in a batch that removes metadata, such as the timestamps of privatized records received, and separates the records based on their use case.
The architecture strives to obtain a tradeoff among factors, including privacy, utility, server computation overhead, and device bandwidth, without lowering accuracy.
Microsoft has leveraged Local Differential Privacy (LDP) to protect user privacy in several applications, including telemetry in Windows (used by millions of users), advertiser queries in LinkedIn, suggested replies in Office, and manager dashboards in Workplace Analytics. It also offers a pluggable open source library of differentially private algorithms.
Microsoft's DP platform offers a host of components that can be flexibly configured, allowing big data specialists to use the right set of services for their environments.
The framework provides mechanism implementations based on mature differential privacy research, along with APIs for defining an analysis. Each analysis is supported by suitable validation techniques for estimating the total privacy loss.
The Telemetry private data collection employs:
According to Microsoft’s DP design principles “the continuous collection of counter data is strong when the user's behavior remains approximately the same, varies slowly, or varies around a small number of values over the course of data collection”.
The memoization and perturbation techniques applied by Microsoft for the private collection of counters avoid substantial losses in accuracy or privacy, and function well when data is collected over long periods of time.
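As a rough illustration of private counter collection with memoization, the sketch below uses a one-bit mean-estimation mechanism of the kind described in the local DP literature. Whether it matches Microsoft's deployed code is an assumption; `CounterClient` and the other names are hypothetical, and memoizing by exact counter value is a simplification of the rounding-based schemes used in practice.

```python
import math
import random


def one_bit_response(value, max_value, epsilon, rng):
    """Send a single bit whose probability of being 1 grows linearly with the counter."""
    e = math.exp(epsilon)
    p_one = 1.0 / (e + 1.0) + (value / max_value) * (e - 1.0) / (e + 1.0)
    return 1 if rng.random() < p_one else 0


class CounterClient:
    """Memoizes the response per counter value so repeated collection reuses the same bit."""

    def __init__(self, max_value, epsilon, rng):
        self.max_value = max_value
        self.epsilon = epsilon
        self.rng = rng
        self._memo = {}

    def report(self, value):
        if value not in self._memo:
            self._memo[value] = one_bit_response(value, self.max_value, self.epsilon, self.rng)
        return self._memo[value]


def estimate_mean(bits, max_value, epsilon):
    """Unbiased estimate of the average counter value from the collected bits."""
    e = math.exp(epsilon)
    return (max_value / len(bits)) * sum((b * (e + 1.0) - 1.0) / (e - 1.0) for b in bits)


rng = random.Random(0)
# Hypothetical fleet: 20,200 devices with counters 0..100 (true mean is 50).
clients = [CounterClient(100, 1.0, rng) for _ in range(20200)]
bits = [client.report(i % 101) for i, client in enumerate(clients)]
mean_estimate = estimate_mean(bits, 100, 1.0)
```

Because a device whose counter stays the same keeps sending the same memoized bit, repeated collection does not let an observer average away the noise.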
The real challenge in scaling DP solutions lies in integrating them into real-world environments with highly customized data structures. They also need good KPIs for performance and reliability. Legacy DP systems require modification of existing data pipelines or replacement of database engines when they are integrated into ML systems and Big Data pipelines.
For edge computing the challenge lies in limitations of individual edge-servers in detecting threats. Edge servers are middleware components that can serve as a distillation point for sensitive data before they are sent to the cloud. Further, when edge servers become the centre of data processing and storage, then they may not have enough resources to protect themselves.
To develop a safe ecosystem for edge-computing, it’s essential to design a scalable, efficient, and decentralized security mechanism.
For automotive and related location-based services, real-time data processing and location privacy are both essential for providing security to the driver. Edge computing can be seen as a promising way to offer efficient, differentially private location-based services, where computing and storage resources can be placed at the network edge. Edge nodes (e.g., roadside infrastructure and base stations) take over responsibility from remote servers for providing privacy-enabled solutions, ensuring location privacy within their coverage by balancing utility and privacy.
The framework is built around two services, the Differentially Privacy-preserving Location-based Service Usage (Pri-LBS) mechanism and the Privacy Level Adjustment (PLA) module, which are designed to let vehicles request useful information based on a submitted location without revealing their true location. Any vehicle registered with this service first submits its location to the edge node. The edge node executes the privacy-preserving mechanism by generating a noisy location, drawn from a two-dimensional Gaussian distribution, for querying the POI. In addition, it filters the useful information so that nothing other than the vehicle's coverage area (circle S) is leaked to an attacker.
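A minimal sketch of the edge node's role might look like the following. It assumes planar coordinates in meters, a hypothetical noise scale `sigma`, and an in-memory POI table; calibrating `sigma` to a formal privacy level (the job of the PLA module) is left out.

```python
import math
import random


def perturb_location(location, sigma, rng):
    """Add independent Gaussian noise to each coordinate (a 2-D Gaussian)."""
    x, y = location
    return (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))


def query_pois(center, pois, radius):
    """Stand-in for the POI service: names of all POIs within `radius` of `center`."""
    cx, cy = center
    return [name for name, (px, py) in pois.items()
            if math.hypot(px - cx, py - cy) <= radius]


def serve_request(true_location, pois, coverage_radius, sigma, rng):
    """Edge node: query with the noisy location, then keep only results inside circle S."""
    noisy = perturb_location(true_location, sigma, rng)
    # The POI service only ever sees the noisy location; the radius is widened
    # (an assumed heuristic) so that nearby POIs are unlikely to be missed.
    candidates = query_pois(noisy, pois, coverage_radius + 3.0 * sigma)
    in_coverage = set(query_pois(true_location, pois, coverage_radius))
    return noisy, [name for name in candidates if name in in_coverage]


pois = {"fuel": (100.0, 0.0), "cafe": (0.0, 300.0), "garage": (2000.0, 2000.0)}
rng = random.Random(3)
noisy_location, results = serve_request((0.0, 0.0), pois, 500.0, 50.0, rng)
```

A larger `sigma` hides the vehicle better but forces a wider query, which is the privacy versus service-quality trade-off the adjustable privacy level is meant to manage.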
This kind of architecture guarantees an acceptable trade-off between privacy and service-quality by adjustable selection of privacy level.
Differential Privacy in Location Based Services. Image source.
CHORUS is designed primarily for statistical SQL queries and embeds an access-control differential privacy mechanism into the query before execution. Uber uses CHORUS for its internal analytics tasks, enforcing differential privacy on the query output. DP based query systems vary with the size of the dataset with every single query. The privacy budget is managed by providing query rewriters with an ε value apportioned to each query.
Its flexibility allows integration with any SQL database, with a processing range of up to 10,000 queries per day, without any modifications to the database or queries.
As CHORUS is DBMS-independent, any high-performance database can be plugged in to scale to big data. Further, it eliminates the need for post-processing techniques, allowing easy integration into existing data processing pipelines.
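To illustrate the idea of embedding the mechanism into the query itself, here is a toy rewriter for `COUNT(*)` queries running against SQLite. This is not the CHORUS API: `rewrite_count_query`, the `laplace_noise` SQL function and the `trips` table are all hypothetical, and the even split of ε across queries is just one simple allocation strategy.

```python
import random
import sqlite3

rng = random.Random(42)


def laplace_noise(scale):
    """Laplace(0, scale) sample, drawn as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def rewrite_count_query(sql, epsilon):
    """Embed the Laplace mechanism into a COUNT(*) query (a count has sensitivity 1)."""
    scale = 1.0 / epsilon
    return sql.replace("COUNT(*)", f"COUNT(*) + laplace_noise({scale})", 1)


conn = sqlite3.connect(":memory:")
# Expose the noise sampler to SQL so the rewritten query runs unchanged in the database.
conn.create_function("laplace_noise", 1, laplace_noise)
conn.execute("CREATE TABLE trips (city TEXT)")
conn.executemany("INSERT INTO trips VALUES (?)", [("SF",)] * 5 + [("NYC",)] * 3)

total_epsilon = 1.0
queries = [
    "SELECT COUNT(*) FROM trips WHERE city = 'SF'",
    "SELECT COUNT(*) FROM trips WHERE city = 'NYC'",
]
# Apportion the overall budget evenly across the queries.
per_query_epsilon = total_epsilon / len(queries)
rewritten = [rewrite_count_query(q, per_query_epsilon) for q in queries]
noisy_counts = [conn.execute(q).fetchone()[0] for q in rewritten]
```

Because the noise is added inside the SQL statement, the analyst-facing pipeline receives only the noisy result and needs no separate post-processing step.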
Differentially private queries with CHORUS integration with GCP. Image source.
Google Cloud Platform provides cloud services composed of data ingestion, flow, storage and analytics services. These services are endowed with the following capabilities to provide easy integration interface to the IoT providers:
One potential integration vendor for GCP is Intel's IoT Platform. Their joint reference architecture not only offers a scalable platform with edge devices, network and cloud components, but also provides the backbone for building an ecosystem of differentially private solutions for edge networks at an enterprise level.
GCP components for DP in IoT infrastructure. Image source.
Mobile-edge networks also make use of Differential Privacy through a private federated learning scheme (FedMEC) that efficiently partitions the model: a deep neural network is split into two parts, offloading the heavy computation to the edge server.
The architecture includes a differentially private data perturbation mechanism that adds Laplacian random noise to the client-side features before they are uploaded to the edge server. FedMEC relies on the mobile edge computing environment, where federated learning takes place in three phases: client, edge and server.
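The client-side perturbation step could be sketched as follows: clip the features produced by the on-device part of the network, then add Laplace noise scaled to the clipped vector's L1 sensitivity. The clipping bound, the ε value and the function names are assumptions for illustration, not FedMEC's published parameters.

```python
import random


def clip(values, bound):
    """Clamp every feature into [-bound, bound] so the sensitivity is bounded."""
    return [max(-bound, min(bound, v)) for v in values]


def perturb_features(features, epsilon, bound, rng):
    """Add Laplace noise to clipped client-side features before upload."""
    clipped = clip(features, bound)
    # Two clipped vectors of length d can differ by at most 2 * bound * d in L1 norm.
    scale = 2.0 * bound * len(clipped) / epsilon
    return [v + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
            for v in clipped]


rng = random.Random(11)
# Hypothetical activations from the shallow layers that run on the device.
client_features = [0.2, -1.7, 0.9]
noisy_features = perturb_features(client_features, epsilon=2.0, bound=1.0, rng=rng)
```

Only `noisy_features` would leave the device; the edge server runs the remaining, heavier layers on these perturbed values.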
Differential privacy in mobile-edge computing. Image source.
IoT Data analytics has been evolving with the local differential privacy obfuscation (LDPO) framework, to ensure data privacy and guarantee data utility for edge-computing. The LDPO framework serves as one of the strong pillars of IoT infrastructure by aggregating and distilling the IoT data at the edge without disclosing user’s confidential information.
The basic approach used by the LDPO framework is to add noise to prevent loss of private information. One major challenge is that the noise might degrade data utility, which has led researchers to design an LDP-based data distillation model that limits the collection of personal data while maximizing utility. It also limits the amount of data being shared by obfuscating the learned features using DP. The LDPO framework is built with the following components:
The second phase involves Distillation-Based Local Differential Privacy, after the bit strings are distilled (or transformed) by the edge server. The randomness of the bit strings helps the edge server in the overall counting of each bit.
Each edge server provides a privacy guarantee by adding an amount of random noise (obfuscation) to the data from several connected crowdsourced IoT devices. The data from several edge servers are then ensembled. The obfuscation parameter determines the level of privacy guarantee, which may be stronger or weaker depending on the maximum level of noise added to the IoT devices' data. This parameter also influences the accuracy of the model.
The third phase is composed of Feature Rebuilding and Distribution Estimation. The initial step replays the hash functions used on the IoT devices on the edge servers to reconstruct the minmax filters. In the next step, Distribution Estimation, the edge server uses the minmax filters as the feature variables to estimate the univariate or multivariate distributions (depending on the features and use case); linear regression is one possible mechanism for the univariate case. With mutually independent minmax filter features, ML techniques like Lasso Regression can be used for multivariate distribution estimation.
Obfuscation Techniques for DP at Edge Networks. Image source.
The primary objective of incorporating Differential Privacy in a Fog Computing Architecture is to improve communication efficiency through data minimization with the use of auto-encoders, as well as to reduce load on the cloud. What drives this privacy-enabled framework is its high scalability, elastic services, support for mobile devices and low latency.
Differential privacy data aggregation protocol in fog computing. Image source.
In this IoT-based fog computing architecture, the heavy computations are distributed to the edge of the network, relieving the cloud server of this burden. The architecture designs “an efficient data aggregation method that preserves the privacy of the users’ data and allows for multifunctional aggregation queries in an IoT setting”. Thus, the architecture tries to fulfill three primary objectives:
In addition, the architecture is designed to send only the aggregation results to the server rather than all the sensory data, thereby improving the communication efficiency.
The privacy issues are handled by partitioning and separating the original data to two fixed fog nodes, delivered by the Fog center.
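The text does not spell out how the data is partitioned between the two fog nodes. One common way to realize such a split is additive secret sharing, sketched below as an assumption: each reading is cut into two random-looking shares, each fog node sums only its own shares, and only the two partial sums are combined. All names and the modulus are hypothetical, and any DP noise would be added separately.

```python
import random

MODULUS = 2 ** 32  # assumed to be larger than any possible aggregate


def split_reading(value, rng):
    """Split a sensor reading into two shares; neither share alone reveals the value."""
    share_a = rng.randrange(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b


def fog_node_sum(shares):
    """Each fog node aggregates only the shares it received."""
    return sum(shares) % MODULUS


def combine(sum_a, sum_b):
    """The server recovers the aggregate from the two partial sums."""
    return (sum_a + sum_b) % MODULUS


rng = random.Random(7)
readings = [21, 19, 23, 20]  # hypothetical sensor values
pairs = [split_reading(r, rng) for r in readings]
sum_a = fog_node_sum(a for a, _ in pairs)
sum_b = fog_node_sum(b for _, b in pairs)
total = combine(sum_a, sum_b)
```

Here the server receives two partial sums rather than four raw readings, which also reflects the communication saving described above.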
In this blog we discussed differentially private applications at scale for modeling different business problems, starting from a web browser and ranging to API-driven SaaS services, Location Based Services, and Edge and Fog Computing.
Differentially private scalable solutions for Industrial IoT (IIoT) can be deployed across edge, fog and cloud architectural layers, where the edge and fog layers complement each other.
Fog computing relies on centralized systems that can communicate with industrial gateways and embedded systems on a local area network (LAN). In contrast, edge computing performs much of the processing on embedded devices/platforms with direct communication to sensors and controllers.
The key to designing such differentially private systems is understanding the problem domain and how the business wants to map the DP based machine learning use case onto the architecture. Another important factor is support for personalized privacy budget allocation strategies for Personalized DP solutions, with event-level privacy (such as modeling privacy events of different sizes).
On the system architecture side, other aspects to keep in mind are formalizing data storage for streaming data and defining data leakage prevention techniques, so that there is no data leakage in the pipeline. In addition, data minimization principles should be applied to iteratively build models with both numerical and categorical queries.