Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

Mukherjee, Sourabrata; Ojha, Atul Kr.; McCrae, John P.; Dusek, Ondrej

arXiv:2502.04718 (cs)

[Submitted on 7 Feb 2025 (v1), last revised 23 Apr 2025 (this version, v2)]

Abstract: Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. As in other natural language processing (NLP) tasks, human evaluation is ideal but costly; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine both existing metrics and novel metrics drawn from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of large language models (LLMs) as tools for TST evaluation. Our findings highlight that newly applied advanced NLP metrics and LLM-based evaluations provide better insights than existing TST metrics. Our oracle ensemble approaches show even more potential.
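
To illustrate what meta-evaluation through correlation with human judgments can look like, the sketch below ranks two automatic metrics by how well their scores agree with human ratings on the same outputs. It is only a minimal illustration: the abstract does not say which correlation coefficient is used, so Spearman rank correlation is an assumption here, and the names `metric_a`, `metric_b`, and all score values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings for five system outputs, and the scores
# that two made-up automatic metrics assign to the same five outputs.
human = [1, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.40, 0.35, 0.80, 0.90],
    "metric_b": [0.90, 0.20, 0.50, 0.40, 0.10],
}

# Meta-evaluation: a metric is judged by its agreement with human ratings.
correlations = {name: spearman(human, s) for name, s in metric_scores.items()}
for name, rho in sorted(correlations.items(), key=lambda kv: -kv[1]):
    print(f"{name}: Spearman rho = {rho:.2f}")
```

In this toy setup, a metric with a higher correlation would be considered a more reliable proxy for human evaluation along the rated dimension (for example, style accuracy or content preservation).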

Comments:	Accepted at NAACL SRW 2025
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.04718 [cs.CL]
	(or arXiv:2502.04718v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.04718

Submission history

From: Sourabrata Mukherjee
[v1] Fri, 7 Feb 2025 07:39:17 UTC (10,661 KB)
[v2] Wed, 23 Apr 2025 04:06:56 UTC (1,524 KB)
