Differentially private (DP) synthetic data generation is becoming increasingly important for building large-scale data-driven systems that protect user privacy. Although access to public data has been shown to improve the privacy-utility trade-off in synthetic data generation empirically, existing approaches leverage public data only indirectly, through pre-processing (e.g., using pre-trained generative models) or post-processing steps (e.g., matching target statistics estimated from public datasets), while relying on domain-agnostic DP mechanisms. In this work, we lay the theoretical framework to study the principled incorporation of public data into DP mechanisms themselves, in an optimal manner. We introduce PubMix, a public-data-aware DP mechanism that can be used in histogram-based data synthesis pipelines.
Can LLMs communicate with each other without humans knowing? Or can we embed metadata, such as who generated a text, and when or where it was produced, directly into LLM outputs? Both questions point to the same challenge: encoding multiple bits of information in generated text while preserving natural language quality. We model this as a communication channel design problem between the hidden message generator and the detector, draw on ideas from information theory and coding theory, and develop efficient methods for multi-bit watermarking. In this work, we characterize the maximum number of bits that can be embedded per token, and introduce ArcMark, a multi-bit watermark construction that outperforms existing methods.
Differentially private distributed mean estimation (DP-DME) is a key component in private federated learning. DP-DME (with \(n\) distributed users) traditionally employs either local DP (LDP) or central DP (CDP), achieving MSEs of \(O(1/n)\) and \(O(1/n^2)\), respectively. CDP attains a lower MSE by relying on a trusted party. Cryptographic protocols such as secure aggregation have been integrated into DP-DME systems to achieve CDP-level MSE without a trusted party. However, they result in multiple-round protocols and involve significant communication and computational overhead. We propose CorDP-DME, an alternative DP-DME mechanism that is based on an information-theoretic framework that uses optimally correlated Gaussian noise, and effectively navigates the trade-off between privacy, accuracy and robustness, bridging the gap between LDP and CDP error bounds. CorDP-DME:
In image retrieval and text-to-image generation, ensuring well-represented results for user queries is essential to avoid representational harms—instances where certain groups, individuals, or traits are misrepresented or excluded due to biases in data or model behavior. These biases can perpetuate stereotypes, result in unfair outcomes, and marginalize underrepresented communities. While prior research has explored fairness by enforcing equal or proportional representation across individual attributes such as race or gender, less attention has been given to intersectional groups, which are defined by combinations of multiple attributes (e.g., race and gender together). Our work addresses this gap by developing tools to measure and mitigate representational harms in both image retrieval and text-to-image generation. Specifically, we:
Vector databases have become increasingly popular in ML/AI applications such as retrieval-augmented generation (RAG) and recommendation systems. They store high-dimensional embeddings of text, images, or video data and enable efficient retrieval by converting user queries into vector embeddings and ranking results based on similarity. Given the scale—often millions of entries—they rely on approximate nearest neighbor (ANN) search techniques for fast retrieval. In this work, we propose a novel method for performing ANN search while ensuring perfect privacy of user queries, effectively extending the classical notion of private information retrieval to the setting of vector databases. We introduce an information-theoretic formulation of the private ANN problem and present a scheme based on Reed-Solomon codes that guarantees perfect privacy without sacrificing the accuracy of the underlying non-private ANN algorithm. Our approach achieves lower communication costs compared to existing cryptographic protocols for private ANN search.
Private-Read-Update-Write (PRUW) is an information-theoretic framework that enables users to privately download (read) and update (write) sections of a data storage system without revealing the values of the updates or the sections accessed, while ensuring perfect accuracy. PRUW has applications in efficient variants of federated learning (FL) such as federated submodel learning (FSL) and FL with gradient sparsification. In FSL, the users only download and update sections of the model that can be updated by their limited data types, which reduces communication and computation costs; however, the accessed sections and update values can still disclose users' private information. Similarly, in FL with sparsification, users transmit only the most significant \(r\) fraction of updates, with the indices and values potentially revealing sensitive information. PRUW addresses these concerns by using coding theoretic tools that perfectly hides the values and the indices of the downloaded and updated information while maintaining perfect accuracy. We propose two variants:
Private information retrieval (PIR) has been widely studied in information theory to obtain the fundamental communication rates in perfectly private database retrieval. In PIR, a user downloads a required file from a database system that stores multiple files, without revealing what was downloaded. We introduced and analyzed the following variants of PIR:
See my Google Scholar for the most up-to-date list.