Datalore is a terminal tool for generating structured datasets from local files like PDFs, Word docs, images, and text. You point it at a file, describe the kind of dataset you want, and it extracts the content, uses semantic search to gather relevant context, applies your instructions through a generated schema, and outputs clean, structured data. Perfect for converting raw or unstructured local documents into ready-to-use datasets for training, analysis, or experimentation, all without manual formatting.
- accepts the path to a local directory containing files of any supported type (PDF, DOCX, JPG, TXT, etc.)
- extracts text from each file
- splits the content page-wise into smaller chunks
- randomly selects a chunk to use as a reference
- runs a semantic similarity search using Qdrant to find related chunks
- gathers similar chunks to build a context window
- formats the gathered context cleanly
- generates structured data using an instruction query and generated schema
- evolves and improves the dataset iteratively
- combines generated samples into a complete dataset
- exports the final dataset in CSV or JSON format via the terminal
This diagram shows how Datalore takes a local file and an instruction, extracts and understands the content, and turns it into a structured dataset.
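If you prefer reading code to diagrams, the retrieval half of the pipeline corresponds roughly to the sketch below, built on sentence-transformers and qdrant-client. Everything here is illustrative (the chunk contents, limits, and loop structure); it is not the repo's actual implementation, though the collection name, URL, and embedding model mirror the defaults shown later in the .env section.

import random
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Page-wise chunks produced by the extraction step (placeholder content).
chunks = ["text of page 1 ...", "text of page 2 ...", "text of page 3 ..."]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # EMBEDDING_MODEL default
vectors = model.encode(chunks)

client = QdrantClient(url="http://localhost:6333")  # QDRANT_URL default
client.recreate_collection(
    collection_name="knowledge_base",  # COLLECTION_NAME default
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)
client.upsert(
    collection_name="knowledge_base",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ],
)

# Randomly pick a reference chunk, then gather its nearest neighbours
# into a single context window for generation.
reference = random.choice(chunks)
hits = client.search(
    collection_name="knowledge_base",
    query_vector=model.encode(reference).tolist(),
    limit=5,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)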
Follow these steps to set up and run the project locally.
uv is required to manage the virtual environment and dependencies. You can download it from the official uv GitHub repository, which includes platform-specific installation instructions.
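On macOS and Linux, the standalone installer is typically a one-liner (check the uv README for the current command before running it):
curl -LsSf https://astral.sh/uv/install.sh | sh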
git clone https://github.com/Datalore-ai/datalore-localgen-cli.git
cd datalore-localgen-cli
Use uv to create a virtual environment:
uv venv
Activate the environment depending on your OS:
Windows:
.venv\Scripts\activate
macOS/Linux:
source .venv/bin/activate
Copy the example .env file and add your API keys:
cp .env.example .env
Open the .env file in a text editor and fill in the required fields:
OPENAI_API_KEY=your_openai_api_key_here
MISTRAL=your_mistral_api_key_here
# defaults
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=knowledge_base
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
Both API keys are required for the application to run; the remaining values can usually be left at their defaults.
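If you want a quick sanity check that the keys are actually picked up, a few lines like the following will do. This assumes the app reads .env via python-dotenv, a common pattern but not confirmed from the source:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "MISTRAL"):
    if not os.getenv(key):
        raise SystemExit(f"Missing {key} in .env")
print("All required keys are set.")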
Install required packages using:
uv pip install -r requirements.txt
Make sure you have Docker and Docker Compose installed, then start the required services (e.g., Qdrant) using:
docker-compose up --build
This builds and starts the necessary services; add the -d flag if you want them to run in the background.
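To confirm Qdrant is reachable on the default URL, a one-off probe with qdrant-client should return without raising:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # QDRANT_URL from .env
print(client.get_collections())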
Once the environment and services are ready, start the application:
python main.py
You're all set! The application will guide you through the dataset creation process step by step, and the final dataset will be saved in the output_files directory.
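For instance, a CSV export can be inspected with pandas (the file name below is illustrative; use whatever name your run prints):

import pandas as pd

df = pd.read_csv("output_files/dataset.csv")  # hypothetical file name
print(df.head())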
You can customize how the tool behaves using the configuration.py file, which exposes two parameters:
CONFIGURATION = {
"rows_per_context": 5, # Number of QAs or rows generated per chunk
"evolution_depth": 1, # How much transformation/evolution to apply (1 = minimal, 3 = very complex)
}
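For example, with rows_per_context left at 5, each retrieved context window contributes five rows, so ten context windows yield roughly fifty rows; raising evolution_depth toward 3 pushes every generated row through more rewriting passes, trading speed for complexity.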
If something here could be improved, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.