Merged
29 commits
858880a
feature: add file chunker microservice
densumesh Nov 8, 2024
c826a85
feature: add logging and write to clickhouse
densumesh Nov 8, 2024
7b79067
feature: added api key and file chunking status
densumesh Nov 8, 2024
c105be6
feature: add CLI to test pdf chunking
densumesh Nov 9, 2024
9ecffbe
cleanup: add .env.dist + use env's for model in pdf_chunk.rs
skeptrunedev Nov 11, 2024
6352b4e
feat: use pdla boxes for chunking
drew-harris Nov 9, 2024
f6e9633
cleanup: clippy
drew-harris Nov 11, 2024
72ca0ba
feat: add docker-compose.yml for local testing
drew-harris Nov 12, 2024
d717c1e
feature: pdf to md :)
densumesh Nov 13, 2024
052ccc9
feature: add page right after its been processed
densumesh Nov 15, 2024
a2a8a97
refactor: complete rename from file-chunker to pdf2md
skeptrunedev Nov 15, 2024
1c89f89
bugfix: include snippet to handle loading env's at runtime instead of…
skeptrunedev Nov 15, 2024
b34e5ba
feature: docker setup for pdf2md service
skeptrunedev Nov 15, 2024
fdb9298
feature: add CI workflow to push images for pdf2md service
skeptrunedev Nov 15, 2024
9e2a55c
feature: enable chunk bulk delete as an async operation where chunks …
skeptrunedev Nov 15, 2024
aed7531
feature: skeleton for pdf2md demo page
skeptrunedev Nov 16, 2024
4f3f464
feature: working CSS for demo-ui on pdf2md
skeptrunedev Nov 16, 2024
ac7c7fd
feature: finished navbar for pdf2md
skeptrunedev Nov 16, 2024
b5d7aa3
feature: skeleton upload form for pdf2md
skeptrunedev Nov 16, 2024
a1d8bda
feature: add health check + fix redoc
skeptrunedev Nov 18, 2024
97b5f88
feature: add skeleton template others can slot into
skeptrunedev Nov 18, 2024
c4bf0b6
feature: add api key functionality to demo-ui app
skeptrunedev Nov 18, 2024
2888eb0
feature: working API req on file form add
skeptrunedev Nov 18, 2024
41a51a5
feature: add advanced options and usage metrics to each chunk
densumesh Nov 19, 2024
1a9710f
cleanup: remove creating a file on the file_handler
cdxker Nov 17, 2024
29f46f3
feature: setup fileworker to communicate with pdf2md
cdxker Nov 18, 2024
15b7bbf
bugfix: correct page count using redis
cdxker Nov 19, 2024
d245cf0
feature: update to new route name
cdxker Nov 19, 2024
967d049
feature: make pdf2md ocr optional
cdxker Nov 19, 2024
3 changes: 2 additions & 1 deletion .env.server
@@ -46,4 +46,5 @@ VECTOR_SIZES="384,512,768,1024,1536,3072"
RUST_LOG="INFO"
BM25_ACTIVE="true"
FIRECRAWL_URL=https://api.firecrawl.dev
FIRECRAWL_API_KEY=fc-abdef**************
FIRECRAWL_API_KEY=fc-abdef**************
PDF2MD_URL="http://localhost:8081"
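
The new `PDF2MD_URL` entry tells the main server where the pdf2md service listens. Below is a minimal Rust sketch of how a caller might resolve that setting, assuming it is read from the process environment. The helper names (`normalize_base_url`, `pdf2md_base_url`, `join_path`) and the fallback-to-default behavior are illustrative assumptions, not code from this PR.

```rust
use std::env;

/// Default taken from the value this diff adds to `.env.server`.
const DEFAULT_PDF2MD_URL: &str = "http://localhost:8081";

/// Hypothetical helper: clean up a raw setting value.
/// Strips surrounding whitespace and double quotes, falls back to the
/// default when the value is missing or blank, and drops trailing slashes.
fn normalize_base_url(raw: Option<&str>) -> String {
    let value = raw
        .map(|v| v.trim().trim_matches('"'))
        .filter(|v| !v.is_empty())
        .unwrap_or(DEFAULT_PDF2MD_URL);
    value.trim_end_matches('/').to_string()
}

/// Hypothetical helper: read `PDF2MD_URL` from the process environment.
fn pdf2md_base_url() -> String {
    normalize_base_url(env::var("PDF2MD_URL").ok().as_deref())
}

/// Hypothetical helper: join a base URL and a path with exactly one slash.
fn join_path(base: &str, path: &str) -> String {
    format!("{}/{}", base, path.trim_start_matches('/'))
}
```

With `PDF2MD_URL` unset, `join_path(&pdf2md_base_url(), "/example")` yields `http://localhost:8081/example`. The `/example` path is a placeholder, not an actual pdf2md route.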
6 changes: 1 addition & 5 deletions .github/ISSUE_TEMPLATE/issue-template.md
@@ -13,11 +13,7 @@ assignees: ''

### Target(s)

<replace w/ one or more of the following options: `server`, `search`, `chat`>

### Requirement to close

<please describe what is required to close this issue here>
<replace w/ name of the service(s) which are associated with this issue>

### Community channels

149 changes: 149 additions & 0 deletions .github/workflows/push-pdf2md-server.yml
@@ -0,0 +1,149 @@
name: Create PDF2MD Docker Images

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref }}
  cancel-in-progress: true

on:
  workflow_dispatch:
  push:
    branches:
      - "main"
    paths:
      - "pdf2md/server/**"

jobs:
  pdf2md-server:
    name: Push PDF2MD Server image
    runs-on: ${{ matrix.runner }}
    strategy:
      matrix:
        runner: [blacksmith-8vcpu-ubuntu-2204]
        platform: [linux/amd64]
        exclude:
          - runner: blacksmith-8vcpu-ubuntu-2204
            platform: linux/arm64
          - runner: blacksmith-8vcpu-ubuntu-2204-arm
            platform: linux/amd64
    steps:
      - name: Checkout the repo
        uses: actions/checkout@v4

      - name: Setup buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            trieve/pdf2md-server
          tags: |
            type=raw,latest
            type=sha

      - name: Build and push Docker image
        uses: useblacksmith/build-push-action@v1.0.0-beta
        with:
          platforms: ${{ matrix.platform }}
          context: pdf2md/
          file: ./pdf2md/server/Dockerfile.pdf2md-server
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  chunk-worker:
    name: Push PDF2MD Chunk Worker image
    runs-on: ${{ matrix.runner }}
    strategy:
      matrix:
        runner: [blacksmith-8vcpu-ubuntu-2204]
        platform: [linux/amd64]
        exclude:
          - runner: blacksmith-8vcpu-ubuntu-2204
            platform: linux/arm64
          - runner: blacksmith-8vcpu-ubuntu-2204-arm
            platform: linux/amd64
    steps:
      - name: Checkout the repo
        uses: actions/checkout@v4

      - name: Setup buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            trieve/chunk-worker
          tags: |
            type=raw,latest
            type=sha

      - name: Build and push Docker image
        uses: useblacksmith/build-push-action@v1.0.0-beta
        with:
          platforms: ${{ matrix.platform }}
          context: pdf2md/
          file: ./pdf2md/server/Dockerfile.chunk-worker
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  supervisor-worker:
    name: Push PDF2MD Supervisor Worker image
    runs-on: ${{ matrix.runner }}
    strategy:
      matrix:
        runner: [blacksmith-8vcpu-ubuntu-2204]
        platform: [linux/amd64]
        exclude:
          - runner: blacksmith-8vcpu-ubuntu-2204
            platform: linux/arm64
          - runner: blacksmith-8vcpu-ubuntu-2204-arm
            platform: linux/amd64
    steps:
      - name: Checkout the repo
        uses: actions/checkout@v4

      - name: Setup buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            trieve/supervisor-worker
          tags: |
            type=raw,latest
            type=sha

      - name: Build and push Docker image
        uses: useblacksmith/build-push-action@v1.0.0-beta
        with:
          platforms: ${{ matrix.platform }}
          context: pdf2md/
          file: ./pdf2md/server/Dockerfile.supervisor-worker
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
4 changes: 3 additions & 1 deletion .gitignore
@@ -18,7 +18,7 @@ story_html.zip
testing.ipynb
output.json
temp.json
analytics/analytics-server/target
**/target
server/target
server/images
server/tantivy
@@ -92,4 +92,6 @@ server/migrations/2024-07-26-165058_move_config_to_table/down.sql
server/migrations/2024-07-26-165058_move_config_to_table/up.sql
dist/**


clients/python-sdk/dist
pdf2md/ch_migrations/chm.toml
6 changes: 5 additions & 1 deletion .vscode/settings.json
@@ -2,7 +2,11 @@
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter"
  },
  "rust-analyzer.linkedProjects": ["./server/Cargo.toml"],
  "rust-analyzer.linkedProjects": [
    "./server/Cargo.toml",
    "./pdf2md/server/Cargo.toml",
    "./pdf2md/cli/Cargo.toml"
  ],
  "rust-analyzer.showUnlinkedFileNotification": false,
  "rust-analyzer.server.path": "~/.cargo/bin/rust-analyzer",
  "python.analysis.typeCheckingMode": "basic",
27 changes: 27 additions & 0 deletions pdf2md/.env.dist
@@ -0,0 +1,27 @@
# Redis
REDIS_URL=redis://:thisredispasswordisverysecureandcomplex@localhost:6379
REDIS_PASSWORD=thisredispasswordisverysecureandcomplex

# Clickhouse
CLICKHOUSE_URL=http://localhost:8123
CLICKHOUSE_DB=default
CLICKHOUSE_USER=clickhouse
CLICKHOUSE_PASSWORD=password

# S3
S3_ENDPOINT=http://127.0.0.1:9000
S3_ACCESS_KEY=ZaaZZaaZZaaZZaaZZaaZ
S3_SECRET_KEY=ssssssssssssssssssssTTTTTTTTTTTTTTTTTTTT
S3_BUCKET=trieve

# S3 dockerfile auto-configuration
MINIO_ROOT_USER=rootuser
MINIO_ROOT_PASSWORD=rootpassword

# PDF2MD conversion worker services
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_KEY=
LLM_MODEL=gpt-4o-mini

# PDF2MD HTTP API server
API_KEY=admin
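
The template above leaves `LLM_API_KEY` empty, so a copied `.env` needs at least that value filled in before the conversion workers can call the LLM. Below is a minimal Rust sketch of parsing a file in this `KEY=VALUE` format and reporting keys that are missing or blank. It uses only the standard library; `parse_env` and `missing_keys` are hypothetical helpers written for illustration, not the loader the pdf2md service uses.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse `KEY=VALUE` lines, skipping blanks and `#` comments.
/// Splits on the first `=` only, so values such as Redis URLs stay intact.
fn parse_env(contents: &str) -> HashMap<String, String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| {
            (
                key.trim().to_string(),
                value.trim().trim_matches('"').to_string(),
            )
        })
        .collect()
}

/// Hypothetical helper: return the required keys that are absent or empty,
/// in the order they were requested.
fn missing_keys<'a>(vars: &HashMap<String, String>, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|key| vars.get(*key).map_or(true, |value| value.is_empty()))
        .collect()
}
```

Running `missing_keys` against the unmodified template with `LLM_API_KEY` in the required list would flag it, which makes a convenient startup check before any worker sends a request.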
39 changes: 39 additions & 0 deletions pdf2md/CONTRIBUTING.md
@@ -0,0 +1,39 @@
# Contributing to PDF2MD

## Setup ENV's

```bash
cd server
cp .env.dist .env
```

## Run dep processes

```bash
docker compose --profile dev up -d
```

## Run Server + Workers

We strongly recommend using tmux or another terminal multiplexer to manage the different processes.

```bash
cargo watch -x run #HTTP server
cargo run --bin supervisor-worker
cargo run --bin chunk-worker
```

## CLI

Make your changes then use the following to run:

```bash
cd cli
cargo run -- help #or other command instead of help
```

## Run tailwindcss server for demo UI

```bash
npx tailwindcss -i ./static/in.css -o ./static/output.css --watch
```