Differentially Private Steering for Large Language Model Alignment

Goel, Anmol; Hu, Yaxi; Gurevych, Iryna; Sanyal, Amartya

arXiv:2501.18532 (cs)

[Submitted on 30 Jan 2025 (v1), last revised 20 Mar 2025 (this version, v2)]

Abstract: Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those samples. In this work, we present the first study of aligning LLM behavior with private datasets. We propose the Private Steering for LLM Alignment (PSA) algorithm, which edits LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance across alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing empirical privacy in LLM steering via activation editing. Our experiments support the theoretical guarantees: under this attack, PSA shows stronger empirical privacy than several existing non-private techniques.
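The abstract does not spell out how PSA works, so the following is only a generic sketch of one common way to privatize activation steering, not the authors' algorithm. It assumes a mean-difference steering vector built from paired positive and negative activations, with each per-pair difference clipped in L2 norm and Gaussian noise added to the average. The function names, the parameters `clip_norm` and `noise_multiplier`, and the noise scale are all illustrative assumptions; translating `noise_multiplier` into a concrete (ε, δ) guarantee depends on the adjacency notion and the privacy accounting, which this sketch does not perform.

```python
import math
import random


def clip(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def private_steering_vector(pos_acts, neg_acts, clip_norm, noise_multiplier, rng):
    """Noisy mean of clipped (positive - negative) activation differences.

    pos_acts, neg_acts: equal-length lists of activation vectors, one pair
    per private demonstration (hypothetical inputs for this sketch).
    Assumption: per-coordinate Gaussian noise with standard deviation
    noise_multiplier * clip_norm / n, where n is the number of pairs.
    """
    n = len(pos_acts)
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for pos, neg in zip(pos_acts, neg_acts):
        # Clipping bounds how much any single demonstration can move the mean.
        diff = clip([p - q for p, q in zip(pos, neg)], clip_norm)
        total = [t + d for t, d in zip(total, diff)]
    sigma = noise_multiplier * clip_norm / n
    return [t / n + rng.gauss(0.0, sigma) for t in total]


def steer(hidden, steering_vec, strength=1.0):
    """Add the (noisy) steering vector to a hidden state at inference time."""
    return [h + strength * s for h, s in zip(hidden, steering_vec)]
```

In this sketch, clipping bounds the influence of any single demonstration pair on the averaged vector, which is what allows calibrated noise to mask that pair's contribution. A larger `noise_multiplier` gives more privacy at the cost of a noisier steering direction, and averaging over more pairs shrinks the noise relative to the signal.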

Comments:	ICLR 2025 Camera Ready; Code: this https URL
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2501.18532 [cs.CL]
	(or arXiv:2501.18532v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.18532

Submission history

From: Anmol Goel
[v1] Thu, 30 Jan 2025 17:58:36 UTC (480 KB)
[v2] Thu, 20 Mar 2025 09:58:49 UTC (538 KB)
