Differentially private distributed mean estimation (DP-DME) is a key component in private federated learning. DP-DME (with \(n\) distributed users) traditionally employs either local DP (LDP) or central DP (CDP), achieving MSEs of \(O(1/n)\) and \(O(1/n^2)\), respectively; CDP attains the lower MSE by relying on a trusted party. Cryptographic protocols such as secure aggregation have been integrated into DP-DME systems to achieve CDP-level MSE without a trusted party, but they require multi-round protocols and incur significant communication and computational overhead. We propose CorDP-DME, an alternative DP-DME mechanism based on an information-theoretic framework with optimally correlated Gaussian noise, which effectively navigates the trade-off among privacy, accuracy, and robustness, bridging the gap between the LDP and CDP error bounds. CorDP-DME:
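The gain from correlated noise can be illustrated with a toy simulation. This is only a minimal sketch: the zero-sum construction below is an assumption for illustration, while the actual mechanism optimizes the noise correlation and accounts for dropouts and collusion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 10, 1.0
x = rng.normal(size=(n, d))             # users' local vectors

# LDP baseline: independent noise at every user -> mean MSE O(sigma^2/n).
ldp_est = (x + rng.normal(scale=sigma, size=(n, d))).mean(axis=0)

# Correlated noise: per-user noise has variance comparable to the LDP
# case, but the correlated components cancel in the sum, leaving a
# single residual Gaussian -> MSE O(sigma^2/n^2), the CDP rate.
g = rng.normal(scale=sigma, size=(n, d))
z = g - g.mean(axis=0)                  # zero-sum correlated component
z += rng.normal(scale=sigma, size=d) / n  # shared residual for privacy
cor_est = (x + z).mean(axis=0)

true_mean = x.mean(axis=0)
mse_ldp = ((ldp_est - true_mean) ** 2).mean()
mse_cor = ((cor_est - true_mean) ** 2).mean()
```

With these parameters the correlated-noise estimator's empirical MSE is roughly two orders of magnitude below the LDP baseline, reflecting the \(O(1/n)\) versus \(O(1/n^2)\) gap.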
In image retrieval and text-to-image generation, ensuring well-represented results for user queries is essential to avoid representational harms—instances where certain groups, individuals, or traits are misrepresented or excluded due to biases in data or model behavior. These biases can perpetuate stereotypes, result in unfair outcomes, and marginalize underrepresented communities. While prior research has explored fairness by enforcing equal or proportional representation across individual attributes such as race or gender, less attention has been given to intersectional groups, which are defined by combinations of multiple attributes (e.g., race and gender together). Our work addresses this gap by developing tools to measure and mitigate representational harms in both image retrieval and text-to-image generation. Specifically, we:
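A worst-case representation gap over intersectional groups gives a flavor of such a measurement. The attribute names, target proportions, and the max-deviation metric below are illustrative assumptions, not the specific measures developed in this work.

```python
from collections import Counter

# Toy retrieved set annotated with two attributes; intersectional
# groups are the joint (gender, age_group) combinations.
retrieved = [("f", "young"), ("m", "young"), ("m", "old"),
             ("m", "young"), ("f", "old"), ("m", "young")]
target = {("f", "young"): 0.25, ("f", "old"): 0.25,
          ("m", "young"): 0.25, ("m", "old"): 0.25}  # desired shares

counts = Counter(retrieved)
n = len(retrieved)
# Representation gap: worst-case deviation of a group's empirical
# share from its target, taken over all intersectional groups.
gap = max(abs(counts[g] / n - p) for g, p in target.items())
```

Note that each marginal attribute can look balanced while an intersectional group (here, young men at 3/6 of the results) is still over-represented, which is why intersectional measurement matters.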
Vector databases have become increasingly popular in ML/AI applications such as retrieval-augmented generation (RAG) and recommendation systems. They store high-dimensional embeddings of text, images, or video data and enable efficient retrieval by converting user queries into vector embeddings and ranking results based on similarity. Given the scale—often millions of entries—they rely on approximate nearest neighbor (ANN) search techniques for fast retrieval. In this work, we propose a novel method for performing ANN search while ensuring perfect privacy of user queries, effectively extending the classical notion of private information retrieval to the setting of vector databases. We introduce an information-theoretic formulation of the private ANN problem and present a scheme based on Reed-Solomon codes that guarantees perfect privacy without sacrificing the accuracy of the underlying non-private ANN algorithm. Our approach achieves lower communication costs compared to existing cryptographic protocols for private ANN search.
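The core idea of recovering exact similarity scores from private queries can be sketched with a two-server additive-sharing toy. This is not the proposed scheme: the actual construction uses Reed-Solomon codes over a finite field for perfect information-theoretic privacy; the sketch below only shows how exact scores are reconstructed from shares that individually reveal nothing useful about the query.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, k = 1000, 16, 5
db = rng.normal(size=(m, d))            # embeddings replicated at both servers
q = rng.normal(size=d)                  # user's private query embedding

# Split the query into two additive shares; each non-colluding server
# sees only one share, a masked version of q.
r = rng.normal(scale=10.0, size=d)
share1, share2 = r, q - r

# Each server answers with inner products against its share.
ans1, ans2 = db @ share1, db @ share2

# The user reconstructs exact scores (ans1 + ans2 == db @ q) and
# ranks locally, so the non-private ANN accuracy is unaffected.
scores = ans1 + ans2
topk = np.argsort(-scores)[:k]
```

Because the reconstruction is exact, the ranking matches the non-private computation, mirroring the "privacy without accuracy loss" property of the Reed-Solomon scheme.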
Private-Read-Update-Write (PRUW) is an information-theoretic framework that enables users to privately download (read) and update (write) sections of a data storage system without revealing the values of the updates or the sections accessed, while ensuring perfect accuracy. PRUW has applications in efficient variants of federated learning (FL) such as federated submodel learning (FSL) and FL with gradient sparsification. In FSL, users only download and update the sections of the model relevant to their limited data types, which reduces communication and computation costs; however, the accessed sections and update values can still disclose users' private information. Similarly, in FL with sparsification, users transmit only the most significant \(r\) fraction of updates, with the indices and values potentially revealing sensitive information. PRUW addresses these concerns with coding-theoretic tools that perfectly hide the values and indices of the downloaded and updated information while maintaining perfect accuracy. We propose two variants:
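For context, plain top-\(r\) gradient sparsification looks as follows; the transmitted indices and values in the last step are exactly the side information that PRUW hides. Parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 1000, 0.01                       # model dimension, sparsification rate
grad = rng.normal(size=d)               # a user's local gradient

# Keep only the r-fraction of coordinates with largest magnitude.
k = max(1, int(r * d))
idx = np.argpartition(np.abs(grad), -k)[-k:]
sparse = np.zeros(d)
sparse[idx] = grad[idx]

# In plain sparsified FL, the pairs (idx, grad[idx]) are sent to the
# server; both components can leak information about the user's data.
payload = list(zip(idx.tolist(), grad[idx].tolist()))
```

Since which coordinates are significant depends on the user's data, even the index set alone is sensitive, motivating schemes that hide indices and values simultaneously.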
Private information retrieval (PIR) has been widely studied in information theory to characterize the fundamental communication rates of perfectly private database retrieval. In PIR, a user downloads a required file from a database system that stores multiple files, without revealing which file was downloaded. We introduced and analyzed the following variants of PIR:
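The flavor of PIR is captured by the classical two-server XOR scheme, included here purely for illustration (a textbook construction, not one of our variants): each query is a uniformly random subset of file indices, so neither server alone learns anything about the desired index.

```python
import secrets

# Two non-colluding servers each hold the same N equal-length files.
files = [b"alpha", b"beta!", b"gamma", b"delta"]
N, i = len(files), 2                    # user privately wants file i

# Server 1 gets a uniformly random 0/1 query vector; server 2 gets
# the same vector with bit i flipped. Each query is marginally
# uniform, so neither server learns i.
q1 = [secrets.randbelow(2) for _ in range(N)]
q2 = q1.copy()
q2[i] ^= 1

def answer(q):
    """XOR of the files selected by the query bits."""
    out = bytes(len(files[0]))
    for bit, f in zip(q, files):
        if bit:
            out = bytes(a ^ b for a, b in zip(out, f))
    return out

# XORing the two answers cancels every file except file i.
rec = bytes(a ^ b for a, b in zip(answer(q1), answer(q2)))
```

The user recovers file \(i\) exactly while each server's view is independent of \(i\), which is the perfect-privacy guarantee the variants we study refine under richer system models.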