This project is a flexible, extensible Python tool for transforming raw conversational data (TXT, PDF, DOCX) into high-quality JSONL datasets for fine-tuning large language models (LLMs).
- Multi-format Input: Supports `.txt`, `.pdf`, and `.docx` files.
- Automated Chunking: Segments long texts into coherent conversational pairs (see the sketch after this list).
- Customizable Prompt: Easily modify the system prompt for your use case.
- Error Handling: Robust error reporting for each chunk.
- Modern Pythonic CLI: Simple, interactive command-line interface.
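A minimal sketch of what such paragraph-based chunking could look like. The `chunk_text` name, the blank-line splitting, and the `max_chars` budget are illustrative assumptions, not the project's actual implementation:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters.

    Illustrative sketch only; the tool's real chunking strategy may differ.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks  # Note: a single oversized paragraph passes through unsplit.
```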
- Clone the repository:

  ```bash
  git clone https://github.com/O96a/data-refining-pipeline.git
  cd data-refining-pipeline
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the tool:

  ```bash
  python main.py
  ```
- Follow the prompts:
  - Enter your Google AI Studio API key.
  - (Optionally) edit the system prompt.
  - Enter the path to your input file (`.txt`, `.pdf`, or `.docx`).
- The output will be saved as a `.jsonl` file in the same directory as your input file, one JSON record per line (an illustrative record is shown below).
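The exact schema depends on the configured system prompt and the project's output format, so treat this record in the common chat-messages layout as an assumption rather than the tool's guaranteed output:

```json
{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Open Settings, choose Security, then select Reset Password."}]}
```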
Example session:

```text
Enter your Google AI Studio API key: <your-api-key>
Modify system prompt? (y/n): n
Enter the path to your input text file: dataset_samples.txt
Processing dataset_samples.txt with Gemini...
Processed: ...
Processing complete! Output saved as: gemini_output_dataset_samples.txt.jsonl
```
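Before fine-tuning, it is worth verifying that every line of the output parses as JSON. A minimal check, using the filename from the run above:

```python
import json

# Every line of a JSONL file must be a standalone JSON object.
with open("gemini_output_dataset_samples.txt.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} records without errors")
```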
You can fully customize the system prompt to fit your data cleaning, translation, or formatting needs. The default prompt is designed for general conversational data refinement.
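For example, a cleaning-and-translation prompt might read as follows (purely illustrative, not the shipped default):

```text
You are a data refinement assistant. For each conversation chunk, fix
spelling and grammar, translate any non-English turns into English,
remove personal information, and return only the refined
question/answer pair as JSON.
```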
- `.txt`: Standard UTF-8 text files.
- `.pdf`: Extracts text from all pages.
- `.docx`: Extracts text from all paragraphs.
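Extraction along these lines is typical for these formats. This sketch assumes the `pypdf` and `python-docx` packages, which may differ from the project's actual dependencies:

```python
from pathlib import Path

from docx import Document  # assumption: python-docx for .docx files
from pypdf import PdfReader  # assumption: pypdf for .pdf files


def extract_text(path: str) -> str:
    """Return the raw text of a .txt, .pdf, or .docx file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix}")
```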
Contributions, issues, and feature requests are welcome! Please open an issue or submit a pull request.
This project is licensed under the MIT License.