Research

PubMix: Towards Optimal Domain-Aware Privacy Mechanisms for Synthetic Data Generation

Differentially private (DP) synthetic data generation is becoming increasingly important for building large-scale data-driven systems that protect user privacy. Although access to public data has been shown empirically to improve the privacy-utility trade-off in synthetic data generation, existing approaches leverage public data only indirectly, through pre-processing (e.g., using pre-trained generative models) or post-processing steps (e.g., matching target statistics estimated from public datasets), while relying on domain-agnostic DP mechanisms. In this work, we lay the theoretical groundwork for incorporating public data, in a principled and optimal manner, into the DP mechanisms themselves. We introduce PubMix, a public-data-aware DP mechanism that can be used in histogram-based data synthesis pipelines.

Paper:

Optimal Domain-Aware Privacy Mechanisms for Synthetic Data Generation.
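For context, the kind of domain-agnostic baseline that the summary above contrasts with can be sketched as follows. This is not PubMix: it is a minimal histogram pipeline using the standard Laplace mechanism, and the function names (`dp_histogram`, `synthesize`), the toy records, and the choice of add/remove-one-record neighbors are assumptions made only for illustration.

```python
import random
from collections import Counter

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is a Laplace(0, scale) random variable.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram(records, domain, epsilon, rng):
    # `domain` must be fixed in advance (data-independent).
    # Assumption: neighboring datasets differ by adding or removing one
    # record, so exactly one bin changes by 1 (L1 sensitivity = 1).
    counts = Counter(records)
    return {v: counts.get(v, 0) + laplace(1.0 / epsilon, rng) for v in domain}

def synthesize(noisy_hist, n_samples, rng):
    # Post-processing only: clip negative counts, then sample values
    # in proportion to the clipped noisy counts.
    values = list(noisy_hist)
    weights = [max(noisy_hist[v], 0.0) for v in values]
    if sum(weights) <= 0:
        weights = [1.0] * len(values)
    return rng.choices(values, weights=weights, k=n_samples)

# Toy usage. random.Random is used for reproducibility only; a real
# deployment would need a cryptographically secure noise source.
rng = random.Random(0)
private = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
noisy = dp_histogram(private, ["a", "b", "c", "d"], epsilon=1.0, rng=rng)
synthetic = synthesize(noisy, 100, rng)
```

Note that the noise here ignores any public data entirely; public information could only enter before or after this step, which is the limitation the summary describes.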

HeavyWater and SimplexWater: LLM Watermarking

How can we distinguish between AI-generated and human-written text? One effective approach is through watermarking LLM outputs. Watermarking works by subtly altering the model’s next-token predictions in a way that remains imperceptible to users but can be detected by a verifier who holds a secret key. We frame this watermarking process as an optimization problem: finding the "optimal perturbation" to the token distribution according to a fixed score function that is accessible only with the secret key. For binary score functions, this reduces to a classic distance-maximizing code design problem in coding theory, followed by an optimal transport problem. We also extend the analysis to non-binary scores and derive a new watermarking scheme, based on scores drawn from heavy-tailed distributions, that outperforms existing watermarks. The main contributions of this work are:

Providing a generalized information-theoretic framework for analyzing watermarks, which includes many of the existing watermarks as special cases
Providing the optimal watermark with binary scores (SimplexWater) that maximizes detection with zero distortion
Establishing conditions under which a watermark based on arbitrary scores maximizes detection, and introducing the HeavyWater watermark, which outperforms all existing approaches.
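The general idea of a keyed score function with detection by a verifier can be illustrated with a toy example. The sketch below is neither SimplexWater nor HeavyWater: it uses a simple exponential tilt toward tokens with score 1 (which does distort the distribution, unlike the zero-distortion design described above), a uniform toy distribution in place of an LLM, and hypothetical helper names (`score`, `tilt`, `generate`, `z_score`).

```python
import hashlib
import hmac
import math
import random

def score(key, prev_token, token):
    # Binary score that can only be computed with the secret key.
    msg = f"{prev_token}:{token}".encode()
    return hmac.new(key, msg, hashlib.sha256).digest()[0] & 1

def tilt(probs, key, prev_token, delta):
    # Upweight tokens with score 1 by exp(delta), then renormalize.
    weights = [p * math.exp(delta * score(key, prev_token, t))
               for t, p in enumerate(probs)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(key, vocab_size, length, delta, rng):
    base = [1.0 / vocab_size] * vocab_size  # toy stand-in for an LLM
    tokens = [0]
    for _ in range(length):
        p = tilt(base, key, tokens[-1], delta)
        tokens.append(rng.choices(range(vocab_size), weights=p)[0])
    return tokens

def z_score(key, tokens):
    # Without a watermark, each score behaves like a fair coin flip,
    # so the number of hits is approximately Binomial(n, 1/2).
    n = len(tokens) - 1
    hits = sum(score(key, a, b) for a, b in zip(tokens, tokens[1:]))
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)

key = b"secret-key"
rng = random.Random(0)
marked = generate(key, vocab_size=50, length=200, delta=4.0, rng=rng)
plain = generate(key, vocab_size=50, length=200, delta=0.0, rng=rng)
z_marked, z_plain = z_score(key, marked), z_score(key, plain)
```

A verifier holding `key` would flag text whose z-score exceeds a chosen threshold; without the key, the scores look like unbiased coin flips.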

Paper:

HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions.

ArcMark: Multi-bit Information Embedding in LLM Outputs

Can LLMs communicate with each other without humans knowing? Or can we embed metadata, such as who generated a text, and when or where it was produced, directly into LLM outputs? Both questions point to the same challenge: encoding multiple bits of information in generated text while preserving natural language quality. We model this as a communication channel design problem between the hidden message generator and the detector, draw on ideas from information theory and coding theory, and develop efficient methods for multi-bit watermarking. In this work, we characterize the maximum number of bits that can be embedded per token, and introduce ArcMark, a multi-bit watermark construction that outperforms existing methods.

Paper:

ArcMark: Multi-bit LLM Watermark via Optimal Transport.

CorDP-DME: Correlated Privacy Mechanisms for Differentially Private Distributed Mean Estimation

Differentially private distributed mean estimation (DP-DME) is a key component in private federated learning. DP-DME (with \(n\) distributed users) traditionally employs either local DP (LDP) or central DP (CDP), achieving MSEs of \(O(1/n)\) and \(O(1/n^2)\), respectively. CDP attains a lower MSE by relying on a trusted party. Cryptographic protocols such as secure aggregation have been integrated into DP-DME systems to achieve CDP-level MSE without a trusted party. However, they result in multiple-round protocols and involve significant communication and computational overhead. We propose CorDP-DME, an alternative DP-DME mechanism that is based on an information-theoretic framework that uses optimally correlated Gaussian noise, and effectively navigates the trade-off between privacy, accuracy and robustness, bridging the gap between LDP and CDP error bounds. CorDP-DME:

is a single-round protocol
requires no trusted party
achieves significantly improved accuracy compared to LDP
incurs substantially lower communication and computational costs, and offers greater resilience against dropouts and colluding users, than secure aggregation.
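A short calculation suggests why correlating the users' noise can help. This is only a sketch under simplifying assumptions that are not taken from the paper: every user \(i\) adds zero-mean noise \(Z_i\) with the same variance \(\sigma^2\) and the same pairwise correlation \(\rho\), and the server simply averages the noisy reports \(x_i + Z_i\).

\[
\mathbb{E}\!\left[\Big(\frac{1}{n}\sum_{i=1}^{n}(x_i+Z_i)-\frac{1}{n}\sum_{i=1}^{n}x_i\Big)^{2}\right]
=\frac{1}{n^{2}}\Big(n\sigma^{2}+n(n-1)\rho\sigma^{2}\Big)
=\frac{\sigma^{2}}{n}\big(1+(n-1)\rho\big).
\]

Independent noise (\(\rho = 0\)) gives \(\sigma^2/n\), while negative correlation reduces the error, down to zero at the smallest admissible value \(\rho = -1/(n-1)\). How \(\sigma\) and \(\rho\) must be chosen to meet a DP guarantee in the presence of dropouts and colluding users is the substance of the work itself and is not attempted here.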

Papers:

Correlated Privacy Mechanisms for Differentially Private Distributed Mean Estimation.

Multi-Group Proportional Representation in Database Retrieval and Text-to-Image Generation

In image retrieval and text-to-image generation, ensuring well-represented results for user queries is essential to avoid representational harms—instances where certain groups, individuals, or traits are misrepresented or excluded due to biases in data or model behavior. These biases can perpetuate stereotypes, result in unfair outcomes, and marginalize underrepresented communities. While prior research has explored fairness by enforcing equal or proportional representation across individual attributes such as race or gender, less attention has been given to intersectional groups, which are defined by combinations of multiple attributes (e.g., race and gender together). Our work addresses this gap by developing tools to measure and mitigate representational harms in both image retrieval and text-to-image generation. Specifically, we:

Introduce Multi-Group Proportional Representation (MPR), a theoretically grounded metric for measuring intersectional group representation
Propose Multi-Group Aware Proportional Retrieval (MAPR), an image retrieval algorithm that improves both fairness and accuracy, outperforming existing fair retrieval approaches
Conduct a comprehensive analysis of representational gaps in current text-to-image generation models, and propose a method to mitigate these disparities.
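To make the idea of measuring intersectional representation concrete, here is a deliberately simplified sketch. It is not the MPR metric as defined in the papers; it only computes the largest gap between the share of each intersectional group in a retrieved set and in a reference population. The function names and the toy attribute tuples are assumptions for illustration.

```python
from collections import Counter

def group_proportions(items):
    # Each item is a tuple of attribute values, so a group is an
    # intersection of attributes, e.g. ("F", "A").
    counts = Counter(items)
    n = len(items)
    return {group: c / n for group, c in counts.items()}

def max_representation_gap(retrieved, reference):
    # Largest absolute difference in group share between the two sets.
    p = group_proportions(retrieved)
    q = group_proportions(reference)
    return max(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))

reference = [("F", "A"), ("F", "B"), ("M", "A"), ("M", "B")]
retrieved = [("F", "A"), ("F", "A"), ("M", "A"), ("M", "B")]
gap = max_representation_gap(retrieved, reference)
```

In this toy case each single attribute can look balanced or nearly so, while the group ("F", "B") is missing from the retrieved set altogether, which is the kind of intersectional gap a per-attribute check can miss.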

Papers:

Private Approximate Nearest Neighbor Search for Vector Database Querying

Vector databases have become increasingly popular in ML/AI applications such as retrieval-augmented generation (RAG) and recommendation systems. They store high-dimensional embeddings of text, images, or video data and enable efficient retrieval by converting user queries into vector embeddings and ranking results based on similarity. Given the scale—often millions of entries—they rely on approximate nearest neighbor (ANN) search techniques for fast retrieval. In this work, we propose a novel method for performing ANN search while ensuring perfect privacy of user queries, effectively extending the classical notion of private information retrieval to the setting of vector databases. We introduce an information-theoretic formulation of the private ANN problem and present a scheme based on Reed-Solomon codes that guarantees perfect privacy without sacrificing the accuracy of the underlying non-private ANN algorithm. Our approach achieves lower communication costs compared to existing cryptographic protocols for private ANN search.

Paper:

Private Approximate Nearest Neighbor Search for Vector Database Querying.

PRUW: Private Read-Update-Write for Efficient Federated Learning with Perfect Information-theoretic Privacy

Private-Read-Update-Write (PRUW) is an information-theoretic framework that enables users to privately download (read) and update (write) sections of a data storage system without revealing the values of the updates or the sections accessed, while ensuring perfect accuracy. PRUW has applications in efficient variants of federated learning (FL) such as federated submodel learning (FSL) and FL with gradient sparsification. In FSL, the users only download and update sections of the model that can be updated by their limited data types, which reduces communication and computation costs; however, the accessed sections and update values can still disclose users' private information. Similarly, in FL with sparsification, users transmit only the most significant \(r\) fraction of updates, with the indices and values potentially revealing sensitive information. PRUW addresses these concerns by using coding-theoretic tools that perfectly hide the values and the indices of the downloaded and updated information while maintaining perfect accuracy. We propose two variants:

structured PRUW, where the FL model is divided into pre-determined sections (as in FSL)
unstructured PRUW, where the selected content has no specific structure (as in the top \(r\) updates selected in gradient sparsification).

For both, we develop coding schemes using the properties of Lagrange polynomials and Cauchy-Vandermonde matrices, along with concepts from coded computing and private information retrieval (PIR) to achieve perfect privacy and accuracy. We show that the asymptotic communication cost of PRUW can be as low as twice that of the corresponding non-private read-write operations. Additionally, we provide the following extensions of basic PRUW:

Relaxing the perfect privacy and accuracy conditions in PRUW to characterize the rate-distortion and rate-privacy-storage trade-offs with corresponding achievable schemes.
PRUW schemes that are applicable to servers with arbitrary storage constraints.
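The role that Lagrange polynomials can play in hiding a value across several servers can be illustrated with standard polynomial secret sharing. This is a generic building block, not a PRUW scheme: the prime `P`, the function names, and the threshold convention (any `threshold` shares reveal nothing, any `threshold + 1` shares reconstruct) are assumptions chosen for the sketch.

```python
import random

P = 2**61 - 1  # a Mersenne prime, used as the field size

def share(secret, threshold, n_servers, rng):
    # Random polynomial of degree `threshold` with f(0) = secret.
    coeffs = [secret % P] + [rng.randrange(P) for _ in range(threshold)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n_servers + 1)]

def reconstruct(points):
    # Lagrange interpolation evaluated at x = 0.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

# random.Random is for reproducibility only; use a secure source in practice.
rng = random.Random(0)
update_a = share(1234, threshold=2, n_servers=5, rng=rng)
update_b = share(4321, threshold=2, n_servers=5, rng=rng)
# Servers can add shares locally; the sum of shares is a share of the sum.
summed = [(x, (ya + yb) % P) for (x, ya), (_, yb) in zip(update_a, update_b)]
```

The linearity shown in the last line is what makes such encodings convenient for aggregating updates without any single server seeing an individual value.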

Papers:

Private Information Retrieval

Private information retrieval (PIR) has been widely studied in information theory to obtain the fundamental communication rates in perfectly private database retrieval. In PIR, a user downloads a required file from a database system that stores multiple files, without revealing what was downloaded. We introduced and analyzed the following variants of PIR:

Semantic PIR, which broadens the scope of classical PIR by incorporating files with arbitrary lengths and retrieval probabilities. By deriving the capacity of semantic PIR, we demonstrated that semantic PIR always outperforms classical PIR. Our findings highlight the benefits of leveraging the natural differences in files to improve the communication rate, as opposed to adhering strictly to the classical PIR model.
Quantum PIR, which considers quantum communication channels instead of classical channels in PIR to improve the communication rates by a factor of two.
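The basic PIR setting described above can be illustrated with the textbook two-server XOR scheme. The sketch assumes two non-colluding servers that store identical copies of equal-length files; it is neither semantic nor quantum PIR and makes no attempt to reach optimal communication rates. All function names are chosen for illustration.

```python
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(files, query):
    # Each server XORs together the files selected by its query bits.
    answer = bytes(len(files[0]))
    for f, bit in zip(files, query):
        if bit:
            answer = xor_bytes(answer, f)
    return answer

def make_queries(n_files, wanted):
    # q1 is uniformly random; q2 differs from q1 only at the wanted index.
    # Each query on its own is uniformly distributed, so neither server
    # learns anything about `wanted`.
    q1 = [secrets.randbelow(2) for _ in range(n_files)]
    q2 = list(q1)
    q2[wanted] ^= 1
    return q1, q2

def retrieve(files, wanted):
    q1, q2 = make_queries(len(files), wanted)
    # All files except the wanted one cancel out in the XOR.
    return xor_bytes(server_answer(files, q1), server_answer(files, q2))

files = [b"aaaa", b"bbbb", b"cccc"]
results = [retrieve(files, i) for i in range(len(files))]
```

The user downloads two file-length answers to obtain one file, and the index stays hidden as long as the two servers do not compare queries.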

Papers:

Publications

See my Google Scholar for the most up-to-date list.

Journal Papers

Quantum X-Secure E-Eavesdropped T-Colluding Symmetric Private Information Retrieval

A. Aytekin, M. Nomeir, S. Vithana, and S. Ulukus

IEEE Transactions on Information Theory, vol. 71(5):3974–3988 May 2025

Information-Theoretically Private Federated Submodel Learning with Storage Constrained Databases

S. Vithana and S. Ulukus

IEEE Transactions on Information Theory, vol. 70(8):6041–6059 August 2024

Private Read-Update-Write with Controllable Information Leakage for Storage-Efficient Federated Learning with Top \(r\) Sparsification

S. Vithana and S. Ulukus

IEEE Transactions on Information Theory, vol. 70(5):3669–3692 May 2024

Deceptive Information Retrieval

S. Vithana and S. Ulukus

Entropy, vol. 26(3):244 March 2024

Private Read Update Write (PRUW) in Federated Submodel Learning (FSL): Communication Efficient Schemes With and Without Sparsification

S. Vithana and S. Ulukus

IEEE Transactions on Information Theory, vol. 70(2):1320–1348 February 2024

Private Information Retrieval and Its Applications: An Introduction, Open Problems, Future Directions

S. Vithana, Z. Wang, and S. Ulukus

IEEE BITS Magazine 2023

Semantic Private Information Retrieval

S. Vithana, K. Banawan, and S. Ulukus

IEEE Transactions on Information Theory, vol. 68(4):2635–2652 April 2022

Adaptive Hierarchical Clustering for Hyperspectral Image Classification: Umbrella Clustering

S. Vithana, M. Ekanayake, H. Ekanayake, A. Rathnayake, G. Jayatilaka, V. Herath, R. Godaliyadda, and P. Ekanayake

Journal of Spectral Imaging, vol. 8(a11) July 2019

Comparison of Two Algorithms for Land Cover Mapping Based on Hyperspectral Imagery

S. Vithana, R. Abeysekara, S. Oorloff, A. Rupasinghe, V. Herath, R. Godaliyadda, and P. Ekanayake

International Journal on Advances in ICT for Emerging Regions, vol. 11(1) July 2018

Conference Papers

HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions

D. Tsur, C. Long, C. Verdun, H. Hsu, C.-F. Chen, H. Permuter, S. Vithana, and F. P. Calmon

Conference on Neural Information Processing Systems (NeurIPS) December 2025

Differentially Private Distributed Mean Estimation with Constrained User Correlations

S. Vithana, V. Cadambe, F. Calmon, and H. Jeong

IEEE International Symposium on Information Theory (ISIT) 2025

Multi-Group Proportional Representation for Text-to-Image Models

S. Jung, A. Oesterling, C. M. Verdun, S. Vithana, T. Moon, and F. P. Calmon

Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Correlated Privacy Mechanisms for Differentially Private Distributed Mean Estimation

S. Vithana, V. R. Cadambe, F. P. Calmon, and H. Jeong

IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2025

Multi-Group Proportional Representation

A. Oesterling, C. Verdun, C. Long, A. Glynn, L. Paes, S. Vithana, M. Cardone, and F. P. Calmon

Conference on Neural Information Processing Systems (NeurIPS) December 2024

Measuring Representational Harms in Image Generation with a Multi-Group Proportional Metric

S. Jung, A. Oesterling, C. M. Verdun, S. Vithana, T. Moon, and F. P. Calmon

NeurIPS Workshop on Algorithmic Fairness Through the Lens of Metrics and Evaluation December 2024

Private Approximate Nearest Neighbor Search for Vector Database Querying

S. Vithana, M. Cardone, and F. P. Calmon

IEEE International Symposium on Information Theory (ISIT) July 2024

Asymmetric X-Secure T-Private Information Retrieval: More Databases is not Always Better

M. Nomeir, S. Vithana, and S. Ulukus

58th Annual Conference on Information Sciences and Systems (CISS) March 2024

Quantum Symmetric Private Information Retrieval with Secure Storage and Eavesdroppers

A. Aytekin, M. Nomeir, S. Vithana, and S. Ulukus

IEEE GLOBECOM Workshops December 2023

Private Membership Aggregation

M. Nomeir, S. Vithana, and S. Ulukus

IEEE Military Communications Conference (MILCOM) October 2023

Private Read Update Write (PRUW) with Heterogeneous Databases

S. Vithana and S. Ulukus

IEEE International Symposium on Information Theory (ISIT) June 2023

Rate-Privacy-Storage Trade-off in Federated Learning with Top \(r\) Sparsification (Best Paper Award)

S. Vithana and S. Ulukus

IEEE International Conference on Communications (ICC) May 2023

Model Segmentation for Storage Efficient Private Federated Learning with Top \(r\) Sparsification

S. Vithana and S. Ulukus

Conference on Information Sciences and Systems (CISS) March 2023

Private Federated Submodel Learning with Sparsification

S. Vithana and S. Ulukus

IEEE Information Theory Workshop (ITW) November 2022

Rate Distortion Trade-off in Private Read Update Write in Federated Submodel Learning

S. Vithana and S. Ulukus

Asilomar Conference on Signals, Systems, and Computers October 2022

Private Read Update Write (PRUW) with Storage Constrained Databases

S. Vithana and S. Ulukus

IEEE International Symposium on Information Theory (ISIT) June 2022

Efficient Private Federated Submodel Learning

S. Vithana and S. Ulukus

IEEE International Conference on Communications (ICC) May 2022

Semantic Private Information Retrieval from MDS Coded Databases

S. Vithana, K. Banawan, and S. Ulukus

IEEE International Symposium on Information Theory (ISIT) July 2021

Semantic Private Information Retrieval: Effects of Heterogeneous Message Sizes and Popularities

S. Vithana, K. Banawan, and S. Ulukus

IEEE Global Communications Conference (GLOBECOM) December 2020

A Semi-Supervised Algorithm to Map Major Vegetation Zones Using Satellite Hyperspectral Data

M. Ekanayake, H. Ekanayake, A. Rathnayake, S. Vithana, V. Herath, R. Godaliyadda, and M. P. B. Ekanayake

9th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) September 2018

Spectral-Spatial Hybrid Mechanism for Feature Detection Using Spectral Correlation

S. Oorloff, R. Abeysekara, S. Vithana, A. Rupasinghe, V. Herath, R. Godaliyadda, and P. Ekanayake

IEEE International Conference on Industrial and Information Systems (ICIIS) December 2017

Hyperspectral Imaging Based Land Cover Mapping Using Data Obtained by the Hyperion Sensor (Best Paper Award)

S. Vithana, R. Abeysekara, S. Oorloff, A. Rupasinghe, V. Herath, R. Godaliyadda, and P. Ekanayake

Seventeenth International Conference on Advances in ICT for Emerging Regions (IEEE ICTer) September 2017

Sajani Vithana
