90 examples can beat 1K — P‑TTS uses principled instructional prompt augmentation to turn 90 AIME seeds into 900 high‑utility training examples, delivering strong reasoning with far less data.
P‑TTS expands a small, vetted seed set (90 AIME 2022–2024 problems) by wrapping each problem with principled instructions to elicit diverse reasoning traces from a teacher model (DeepSeek‑R1). We then fine‑tune Qwen2.5‑Instruct models on these augmented traces.
Principles used (unchanged question text; wrappers are prefixed/suffixed):
- Reward – e.g., "I'll tip $200,000 for a better solution!"
- Penalty – "You will be penalized if the answer is wrong."
- Correctness – "You MUST provide the correct answer."
- Step‑by‑Step – "Think step by step."
Data scales via augmentation multipliers m ∈ {1, 4, 5, 10}: 90 → 360 → 450 → 900.
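The augmentation step above can be sketched as follows. This is an illustrative assumption, not the paper's exact recipe: the wrapper keys and the cycling scheme for multipliers beyond the four base wrappers are hypothetical, but the key invariant from the text holds (the question itself is never modified, only prefixed):

```python
# Principled instruction wrappers from the text; phrasing copied from the list
# above. How extra variants beyond the 4 wrappers are produced is an assumption.
WRAPPERS = {
    "reward": "I'll tip $200,000 for a better solution!",
    "penalty": "You will be penalized if the answer is wrong.",
    "correctness": "You MUST provide the correct answer.",
    "step_by_step": "Think step by step.",
}

def augment(question: str, multiplier: int) -> list[str]:
    """Wrap one seed question with `multiplier` principled instructions.

    With m in {1, 4, 5, 10}, 90 seeds scale to 90 / 360 / 450 / 900 prompts;
    wrappers are cycled when m exceeds the wrapper count (an assumption).
    """
    keys = list(WRAPPERS)
    prompts = []
    for i in range(multiplier):
        wrapper = WRAPPERS[keys[i % len(keys)]]
        prompts.append(f"{wrapper}\n\n{question}")  # question text unchanged
    return prompts

seeds = ["Find the number of ordered pairs ..."]  # 90 AIME problems in practice
dataset = [p for q in seeds for p in augment(q, multiplier=10)]
assert len(dataset) == 10 * len(seeds)
```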
Benchmarks: AIME24, AIME25, MATH500, GPQA‑Diamond. Backbone: Qwen2.5‑Instruct (7B/14B/32B). Metric: accuracy (lm‑evaluation‑harness; greedy decoding).
| Model | #Train ex. | AIME24 | AIME25 | MATH500 | GPQA‑D | Avg. |
|---|---|---|---|---|---|---|
| P‑TTS‑32B | 900 | 73.33% | 53.33% | 94.20% | 60.61% | 70.35% |
| P‑TTS‑14B | 900 | 53.33% | 26.67% | 90.40% | 51.01% | 55.35% |
| P‑TTS‑7B | 900 | 43.33% | 26.67% | 84.20% | 41.92% | 49.03% |
The training dataset consists of 900 high-quality reasoning examples generated from 90 AIME seed problems. Each seed problem is augmented using principled instruction wrappers and processed through DeepSeek-R1 to create diverse reasoning traces.
Dataset Composition:
- Source: 90 AIME problems (2022-2024)
- Augmentation: 4 instruction wrapper types with reward variants
- Final Size: 900 training examples
Before training, you need to tokenize your raw dataset. Use the provided tokenization script:

```shell
# Run the tokenization script
python tokenize_data.py
```
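`tokenize_data.py` is the authoritative script; as a rough sketch of what such a step typically does, each (prompt, trace) pair is rendered into a chat-style string and mapped to token ids. The rendering markers and the whitespace "tokenizer" below are stand-ins so the example runs without downloading the real Qwen2.5 tokenizer:

```python
# Illustrative sketch only (assumption): the real script uses the Qwen2.5
# tokenizer; a tiny whitespace "tokenizer" stands in here.
def render_example(prompt: str, trace: str) -> str:
    # Chat-style rendering: user prompt followed by the teacher's reasoning trace.
    return f"<|user|>\n{prompt}\n<|assistant|>\n{trace}"

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Map each whitespace-separated token to a stable integer id.
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

vocab: dict[str, int] = {}
ids = tokenize(render_example("Think step by step.\n\nFind x.", "x = 3"), vocab)
```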
Training is run via train/sft.py. Invoke it through one of the train/sft*.sh scripts, or launch with train/launch.sh if you are on a SLURM cluster (edit the file for your cluster setup first).
Hardware Requirements:
- For 7B models: 4x A100 GPUs
- For 32B models: 6x B200 GPUs
Quick Start:

```shell
git clone https://github.com/simplescaling/s1.git
cd s1
pip3 install -r requirements.txt

# First tokenize your data
python tokenize_data.py

# Then run training
bash train/sft.sh
```
Note: Training scripts are adapted from simplescaling/s1 (Apache-2.0).
The script expects your training data in CSV format. Update the `train_file_path` variable in `sft.sh`:

```shell
--train_file_path="xx_tokenized.csv"
```
90 AIME seeds → apply 4 instruction wrappers (+ reward variants) →
query teacher (DeepSeek‑R1) → collect reasoning traces → fine‑tune Qwen2.5‑Instruct
```shell
# 1) Build wrapped prompts from seeds
python DataConstruction/build_prompt_variants.py \
  --input DataConstruction/seeds.csv \
  --out DataConstruction/variants.csv

# 2) Query the teacher model to collect reasoning traces
python DataConstruction/deepseek_query.py \
  --input DataConstruction/variants.csv \
  --out DataConstruction/DS_responses.csv

# 3) Combine into the full training dataset
python DataConstruction/combine_deepseek_data.py
```
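The combine step can be sketched as a join of the prompt variants with the teacher's responses. The column names (`id`, `prompt`, `response`) and the output schema are assumptions; check the actual CSV headers produced by the earlier scripts before relying on this:

```python
# Sketch of combine_deepseek_data.py (assumed column names: id, prompt, response).
import csv

def combine(variants_path: str, responses_path: str, out_path: str) -> int:
    """Join teacher responses back onto their wrapped prompts; return row count."""
    with open(variants_path, newline="") as f:
        variants = {row["id"]: row for row in csv.DictReader(f)}
    with open(responses_path, newline="") as f:
        responses = list(csv.DictReader(f))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "response"])
        writer.writeheader()
        n = 0
        for r in responses:
            v = variants.get(r["id"])
            if v is None:
                continue  # skip responses with no matching prompt variant
            writer.writerow({"prompt": v["prompt"], "response": r["response"]})
            n += 1
    return n
```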
```bibtex
@article{bsharat2025prompting,
  title={Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation},
  author={Bsharat, Sondos Mahmoud and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2510.09599},
  year={2025}
}
```