This repository contains the implementation of the framework described in TSDS: Data Selection for Task-Specific Model Finetuning.
Before running the project, ensure you have Python installed. You can download the latest version of Python from python.org.
- Clone the repository:
    git clone https://github.com/ZifanL/TSDS.git
    cd TSDS
- Install the required dependencies from the requirements.txt file:
    pip install -r requirements.txt
- (Optional) If you're using faiss-gpu, ensure you have the correct GPU drivers installed. Refer to the Faiss documentation for more information.
After installing the dependencies, you can run the project as follows using the toy data:
python tsds.py
In the output folder, the output file selected_candidate_indices.npy will contain the indices of the selected candidates.
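The saved indices can then be used to pull the selected examples out of your candidate pool. The snippet below is a minimal sketch, not part of the repository: the `candidates` list and the temporary file are hypothetical stand-ins so that it runs on its own, and it assumes the file holds an integer array of positions into the candidate set.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins so the sketch is self-contained. In practice,
# `candidates` is your own candidate pool (in the same order as the
# candidate embeddings) and the .npy file comes from the output folder.
candidates = [f"example {i}" for i in range(10)]
out_dir = tempfile.mkdtemp()
index_path = os.path.join(out_dir, "selected_candidate_indices.npy")
np.save(index_path, np.array([7, 2, 5]))

# Load the selected indices (assumed to be integer positions).
selected = np.load(index_path)

# Look up the corresponding candidate examples.
subset = [candidates[int(i)] for i in selected.ravel()]
print(f"{len(subset)} examples selected")
```

The lookup only makes sense if the candidate pool is kept in the same order that was used to compute the candidate embeddings.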
To run TSDS on your customized data, two embedding files are needed:
- An .npy file that stores the embeddings of the candidate examples. The shape of the array should be (number of candidates, embedding dimensions).
- An .npy file that stores the embeddings of the query examples. The shape of the array should be (number of query examples, embedding dimensions).

Change the file paths in config.yaml. Adjust the parameters in config.yaml as needed. The implementation uses faiss.IndexIVFFlat for approximate nearest neighbor search. To use a customized index, add it to faiss_helper.py and substitute FaissIndexIVFFlat in tsds.py.
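As an illustration of the expected file layout, the following sketch writes two embedding files with the shapes described above and checks them. It is an assumption-laden example rather than the repository's own tooling: the file names are hypothetical, the random vectors stand in for real model embeddings, and float32 is chosen because Faiss works with float32 arrays.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Random vectors stand in for real embeddings (assumption for illustration).
dim = 16  # embedding dimension; must be the same for both files
candidate_emb = rng.standard_normal((1000, dim)).astype(np.float32)
query_emb = rng.standard_normal((50, dim)).astype(np.float32)

# Hypothetical file names; point config.yaml at wherever you save them.
out_dir = tempfile.mkdtemp()
candidate_path = os.path.join(out_dir, "candidate_embeddings.npy")
query_path = os.path.join(out_dir, "query_embeddings.npy")
np.save(candidate_path, candidate_emb)
np.save(query_path, query_emb)

# Sanity check: both arrays are 2-D and share the embedding dimension.
loaded_candidates = np.load(candidate_path)
loaded_queries = np.load(query_path)
assert loaded_candidates.ndim == 2 and loaded_queries.ndim == 2
assert loaded_candidates.shape[1] == loaded_queries.shape[1]
print(loaded_candidates.shape, loaded_queries.shape)
```

Both sets of embeddings should come from the same encoder so that distances between query and candidate vectors are meaningful.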
Please cite our paper if you find this repo helpful in your work:
@inproceedings{
liu2024tsds,
title={{TSDS}: Data Selection for Task-Specific Model Finetuning},
author={Zifan Liu and Amin Karbasi and Theodoros Rekatsinas},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=wjbTHLUSzU}
}