TKPO - Token-level Preference Self-Alignment Optimization for Multi-style Outline Controllable Generation
TKPO adopts token-level preference self-alignment optimization for multi-style (concise vs. comprehensive; objective vs. literary) outline generation, as depicted in the toy example below.
Specifically, we extend the Bradley-Terry model from pair-wise to list-wise comparison and apply it at the token level to exploit fine-grained preference signals. Unlike representative methods such as DPO, TKPO does not require response pairs; instead, we propose a controllable-attributes-driven method to construct reject samples for self-alignment. Experiments demonstrate that TKPO outperforms DPO by up to 19.28% while requiring only 56.25% of the training time.
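To make the list-wise extension concrete, below is a minimal, illustrative sketch of how a token-level list-wise Bradley-Terry (Plackett-Luce-style) objective could be computed from per-token policy and reference log-probabilities. This is not the repository's actual implementation: the function name, the `beta` temperature, the tensor shapes, and the masking scheme are all assumptions for illustration.

```python
import torch

def token_level_listwise_bt_loss(policy_logps, ref_logps, mask, beta=0.1):
    """Illustrative token-level list-wise Bradley-Terry (Plackett-Luce) loss.

    policy_logps, ref_logps: (K, T) per-token log-probs for K candidates of
        one prompt, where index 0 is the accepted outline and 1..K-1 are
        constructed reject samples, ordered from most to least preferred.
    mask: (T,) float tensor, 1.0 at positions valid for every candidate.
    """
    # Token-level implicit rewards: scaled log-ratio of policy to reference.
    rewards = beta * (policy_logps - ref_logps)                    # (K, T)

    K = rewards.size(0)
    loss = rewards.new_zeros(())
    for k in range(K - 1):
        # Per-position log-likelihood that candidate k ranks above all the
        # candidates it should beat (Plackett-Luce factorization).
        log_pl = rewards[k] - torch.logsumexp(rewards[k:], dim=0)  # (T,)
        loss = loss - (log_pl * mask).sum() / mask.sum().clamp(min=1.0)
    return loss / (K - 1)
```

In this sketch, with K = 2 and the per-token rewards summed over the sequence, the objective reduces to the familiar pair-wise DPO-style log-sigmoid loss, which is the sense in which a list-wise, token-level formulation generalizes it.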
Check out our paper to learn more: Token-level Preference Self-Alignment Optimization for Multi-style Outline Controllable Generation
- python 3.10.11
- pytorch 2.0.1
- transformers 4.43.2
- deepspeed 0.14.4
- llamafactory 0.8.4.dev0
- cuda 11.7
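As a quick sanity check (not part of the release), you can confirm that your installed versions roughly match the ones listed above:

```python
# Print the versions of the key dependencies and check CUDA visibility.
import torch, transformers, deepspeed

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
print("CUDA available:", torch.cuda.is_available())
```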
We curate two datasets (level-of-detail and language style) in our paper for outline controllable generation, both of which are already included in the `data` directory of this repo.
All the experiments are conducted on 8 GPUs. Replace `transformers/models/qwen2/modeling_qwen2.py` with our provided `modeling_qwen2.py`, then run `llamafactory-cli train qwen_sft_tkpo.yaml` for SFT.
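If you prefer to script the replacement step, a small helper along the following lines locates the installed `transformers` package and copies the provided file over the stock one before training; the source path is an assumption, so adjust it to where the repo's `modeling_qwen2.py` actually lives in your checkout.

```python
# Hypothetical helper: overwrite the installed Qwen2 modeling file with the
# modeling_qwen2.py shipped in this repo, then launch training separately
# with `llamafactory-cli train qwen_sft_tkpo.yaml`.
import os
import shutil
import transformers

src = "modeling_qwen2.py"  # the provided file (assumed to sit in the repo root)
dst = os.path.join(os.path.dirname(transformers.__file__),
                   "models", "qwen2", "modeling_qwen2.py")

shutil.copyfile(src, dst)
print(f"Replaced {dst} with {src}")
```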
If you find our code, data, models, or the paper useful, please cite the paper:
This work benefits from LLaMA-Factory and Qwen2.5. Thanks for their significant contributions to the community.