Datalore is a terminal tool for generating structured datasets from local files like PDFs, Word docs, images, and text. You point it at a file, describe the kind of dataset you want, and it extracts the content, uses semantic search to gather relevant context, applies your instructions through a generated schema, and outputs clean, structured data. Perfect for converting raw or unstructured local documents into ready-to-use datasets for training, analysis, or experimentation, all without manual formatting.
- accepts the path to a local directory containing files of any supported type (PDF, DOCX, JPG, TXT, etc.)
- extracts text from each file
- splits the content page-wise into smaller chunks
- randomly selects a chunk to use as a reference
- runs a semantic similarity search using Qdrant to find related chunks
- gathers similar chunks to build a context window
- formats the gathered context cleanly
- generates structured data using an instruction query and generated schema
- evolves and improves the dataset iteratively
- combines generated samples into a complete dataset
- exports the final dataset in CSV or JSON format via the terminal
This diagram shows how Datalore takes a local file and an instruction, extracts and understands the content, and turns it into a structured dataset.
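If you prefer reading code to diagrams, the retrieval half of the pipeline corresponds roughly to the sketch below, built on sentence-transformers and qdrant-client. Everything here is illustrative (the chunk contents, limits, and loop structure); it is not the repo's actual implementation, though the collection name, URL, and embedding model mirror the defaults shown later in the .env section.

import random
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Page-wise chunks produced by the extraction step (placeholder content).
chunks = ["text of page 1 ...", "text of page 2 ...", "text of page 3 ..."]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # EMBEDDING_MODEL default
vectors = model.encode(chunks)

client = QdrantClient(url="http://localhost:6333")  # QDRANT_URL default
client.recreate_collection(
    collection_name="knowledge_base",  # COLLECTION_NAME default
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)
client.upsert(
    collection_name="knowledge_base",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ],
)

# Randomly pick a reference chunk, then gather its nearest neighbours
# into a single context window for generation.
reference = random.choice(chunks)
hits = client.search(
    collection_name="knowledge_base",
    query_vector=model.encode(reference).tolist(),
    limit=5,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)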
Follow these steps to set up and run the project locally.
uv is required to manage the virtual environment and dependencies. You can download it from the official uv GitHub repository, which includes platform-specific installation instructions.
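On macOS and Linux, the standalone installer is typically a one-liner (check the uv README for the current command before running it):
curl -LsSf https://astral.sh/uv/install.sh | sh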
git clone https://github.com/Datalore-ai/datalore-localgen-cli.git
cd datalore-localgen-cli
Use uv to create a virtual environment:
uv venv
Activate the environment depending on your OS:
Windows:
.venv\Scripts\activate
macOS/Linux:
source .venv/bin/activate
Copy the example .env file and add your API keys:
cp .env.example .env
Open the .env file in a text editor and fill in the required fields:
OPENAI_API_KEY=your_openai_api_key_here
MISTRAL=your_mistral_api_key_here
# defaults
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=knowledge_base
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
Both API keys are required for the application to run; the remaining values can usually be left at their defaults.
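If you want a quick sanity check that the keys are actually picked up, a few lines like the following will do. This assumes the app reads .env via python-dotenv, a common pattern but not confirmed from the source:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "MISTRAL"):
    if not os.getenv(key):
        raise SystemExit(f"Missing {key} in .env")
print("All required keys are set.")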
Install required packages using:
uv pip install -r requirements.txt
Make sure you have Docker and Docker Compose installed, then start the required services (e.g., Qdrant) using:
docker-compose up --build
This builds and starts the necessary services; add the -d flag if you want them to run in the background.
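To confirm Qdrant is reachable on the default URL, a one-off probe with qdrant-client should return without raising:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # QDRANT_URL from .env
print(client.get_collections())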
Once the environment and services are ready, start the application:
python main.py
You're all set! The application will guide you through the dataset creation process step by step, and the final dataset will be saved in the output_files directory.
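For instance, a CSV export can be inspected with pandas (the file name below is illustrative; use whatever name your run prints):

import pandas as pd

df = pd.read_csv("output_files/dataset.csv")  # hypothetical file name
print(df.head())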
You can customize how the tool behaves using the configuration.py file, which exposes two parameters:
CONFIGURATION = {
"rows_per_context": 5, # Number of QAs or rows generated per chunk
"evolution_depth": 1, # How much transformation/evolution to apply (1 = minimal, 3 = very complex)
}
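For example, with rows_per_context left at 5, each retrieved context window contributes five rows, so ten context windows yield roughly fifty rows; raising evolution_depth toward 3 pushes every generated row through more rewriting passes, trading speed for complexity.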
If something here could be improved, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.