Carve

Carve is a high-performance, Arrow-native log parser that transforms structured logs into Apache Arrow format for fast analytics. It is written in Go, and the compiled binary has no external dependencies.

Features

  • Arrow-native output: Direct conversion to Apache Arrow IPC format
  • Regex-based parsing: Use named capture groups to define schema
  • Memory efficient: Configurable batch sizes prevent memory bloat
  • Stream processing: Handle large files with constant memory usage
  • Zero-copy interop: Works seamlessly with DuckDB, Polars, pandas, etc.
  • High performance: Optimized for throughput with built-in benchmarking

Installation

Build from source:

git clone https://github.com/TFMV/carve.git
cd carve
go build -o carve ./cmd/carve

Quick Start

Transform logs with a simple regex pattern:

# Parse structured logs into Arrow format
carve --pattern '^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)' \
      --input app.log \
      --output logs.arrow

# View the inferred schema
carve --pattern '^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)' \
      --schema

Usage

Basic Options

carve [options]

Options:
  --pattern string        Regex pattern with named capture groups (required)
  --input string          Input file (defaults to stdin)
  --output string         Output Arrow IPC file (required unless --schema)
  --flush-interval int    Rows per record batch (default 10000)
  --schema                Print inferred schema and exit
  --verbose               Verbose logging with warnings for non-matching lines
  --bench-report          Emit per-batch timing information
  --max-rows int          Limit processed input (0 = unlimited)
  --version               Print version and exit

Examples

Parse web server logs:

carve --pattern '^(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<size>\d+)' \
      --input access.log \
      --output access.arrow

Parse application logs with verbose output:

carve --pattern '^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) \[(?P<thread>[^\]]+)\] (?P<level>\w+) (?P<logger>\S+) - (?P<msg>.+)' \
      --input app.log \
      --output app.arrow \
      --verbose

Process from stdin with custom batch size:

tail -f /var/log/app.log | carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
                                 --output live.arrow \
                                 --flush-interval 1000

Analytics Integration

DuckDB

Query Arrow files directly in DuckDB:

-- Load and explore the data
SELECT * FROM 'logs.arrow' LIMIT 10;

-- Analyze error patterns
SELECT level, COUNT(*) as count 
FROM 'logs.arrow' 
GROUP BY level 
ORDER BY count DESC;

-- Time-based analysis
SELECT DATE_TRUNC('hour', CAST(timestamp AS TIMESTAMP)) as hour,
       COUNT(*) as events
FROM 'logs.arrow' 
WHERE level = 'ERROR'
GROUP BY hour
ORDER BY hour;

-- Find specific patterns
SELECT timestamp, message 
FROM 'logs.arrow'
WHERE message LIKE '%timeout%'
ORDER BY timestamp DESC;

Python (pandas/polars)

import pyarrow as pa
import pandas as pd

# Load with pandas
df = pd.read_feather('logs.arrow')
print(df.head())

# Or with polars for better performance
import polars as pl
df = pl.read_ipc('logs.arrow')
df.filter(pl.col('level') == 'ERROR').head()

Performance

Carve is optimized for high-throughput log processing:

Benchmarks

Run the built-in benchmarks:

cd pkg/carve
go test -bench=. -benchmem

Example results on a modern laptop:

BenchmarkSmallFile-8         5000    250000 ns/op    45000 B/op    120 allocs/op
BenchmarkMediumFile-8         500   2500000 ns/op   450000 B/op   1200 allocs/op  
BenchmarkLargeFile-8           50  25000000 ns/op  4500000 B/op  12000 allocs/op

Memory Usage

  • Streaming: Constant memory usage regardless of input size
  • Batching: Configurable batch sizes (1K-100K rows recommended); see the sketch below
  • Zero-copy: Arrow format enables zero-copy reads in analytics tools
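
The bounded-memory behavior comes from flushing a record batch every --flush-interval rows rather than buffering the whole input. The sketch below shows that loop using the library calls documented under API Usage (ExtractSchema, NewArrowWriter, ParseLine, Append, Rows, Flush); the surrounding scaffolding and the choice to read from stdin are illustrative assumptions, not Carve's internal code:

package main

import (
    "bufio"
    "os"
    "regexp"

    "carve/pkg/carve"
    "github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
    re := regexp.MustCompile(`^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)`)
    schema, err := carve.ExtractSchema(re)
    if err != nil {
        panic(err)
    }

    // Rows per record batch, mirroring --flush-interval.
    const flushInterval = 10000
    writer := carve.NewArrowWriter(schema, memory.DefaultAllocator, flushInterval)

    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        if vals := carve.ParseLine(sc.Text(), re); vals != nil {
            writer.Append(vals)
        }
        // Flush each full batch so memory stays bounded no matter how
        // large the input is.
        if writer.Rows() >= flushInterval {
            record := writer.Flush()
            // Consume the record here (e.g. hand it to an Arrow IPC
            // writer, as sketched after API Usage), then release it.
            record.Release()
        }
    }
}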

Throughput

Typical performance on structured logs:

  • Small files (<1MB): ~10MB/s
  • Medium files (1-100MB): ~50MB/s
  • Large files (100MB+): ~100MB/s

Performance varies with regex complexity and match rate.

Schema Inference

Carve automatically creates Arrow schemas from regex named capture groups:

# View schema for a pattern
carve --pattern '^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)' --schema

Output:

Schema (3 fields):
  0: timestamp (string)
  1: level (string)  
  2: message (string)

Notes:

  • All fields are currently typed as string
  • Field order matches the order of named groups in the regex
  • Unnamed capture groups are ignored (illustrated in the sketch below)
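
As an illustration of those rules, named capture groups can be enumerated in Go roughly as follows. This is a sketch of the idea, not necessarily Carve's exact implementation:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)`)

    idx := 0
    for _, name := range re.SubexpNames() {
        if name == "" {
            // Entry 0 (the whole match) and unnamed groups have no name
            // and are skipped.
            continue
        }
        // Every named group becomes a string field, in order of appearance.
        fmt.Printf("%d: %s (string)\n", idx, name)
        idx++
    }
}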

Error Handling

Non-matching Lines

By default, lines that don't match the pattern are silently skipped. Use --verbose to see warnings:

carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input mixed.log \
      --output clean.arrow \
      --verbose

Output:

[warn] line 42: does not match pattern
[warn] line 103: does not match pattern
processed 1000 lines, wrote 950 rows to clean.arrow
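
The same behavior is easy to reproduce when using Carve as a library, because ParseLine returns nil for lines that do not match (see API Usage below). A minimal sketch, where re, writer, and the imports are assumed to be set up as in that example:

matched, skipped := 0, 0
sc := bufio.NewScanner(os.Stdin)
for sc.Scan() {
    if vals := carve.ParseLine(sc.Text(), re); vals != nil {
        writer.Append(vals)
        matched++
    } else {
        // Non-matching lines are counted and skipped rather than failing the run.
        skipped++
        log.Printf("[warn] line %d: does not match pattern", matched+skipped)
    }
}
log.Printf("processed %d lines, wrote %d rows", matched+skipped, matched)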

Invalid Patterns

Carve validates regex patterns at startup:

# Invalid regex
carve --pattern '[invalid' --schema
# Error: failed to compile pattern: error parsing regexp: missing closing ]: `[invalid`

# No named groups  
carve --pattern '(\w+)' --schema
# Error: schema error: pattern must contain named capture groups

Validation

DuckDB Integration Test

The repository includes sample data for testing DuckDB integration:

# Generate sample Arrow file
carve --pattern '^(?P<ts>\d{4}-[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input testdata/sample.log \
      --output testdata/sample.arrow

# Test with DuckDB
duckdb -c "SELECT level, COUNT(*) FROM 'testdata/sample.arrow' GROUP BY level;"
duckdb -c "SELECT msg FROM 'testdata/sample.arrow' WHERE level = 'WARN';"

Performance Testing

Use the --bench-report and --max-rows flags for benchmarking:

# Monitor batch processing performance
carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input large.log \
      --output large.arrow \
      --bench-report \
      --flush-interval 5000

# Limit processing for testing
carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input huge.log \
      --output sample.arrow \
      --max-rows 10000 \
      --verbose

Development

Testing

Run the full test suite:

# Unit tests
go test ./pkg/carve

# Integration tests  
go test ./cmd/carve

# All tests with coverage
go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out

API Usage

Use Carve as a library:

package main

import (
    "regexp"
    "carve/pkg/carve"
    "github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
    re := regexp.MustCompile(`^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)`)
    
    // Extract schema
    schema, err := carve.ExtractSchema(re)
    if err != nil {
        panic(err)
    }
    
    // Create writer
    writer := carve.NewArrowWriter(schema, memory.DefaultAllocator, 1000)
    
    // Parse and append lines
    line := "2023-01-01 INFO Application started"
    if vals := carve.ParseLine(line, re); vals != nil {
        writer.Append(vals)
    }
    
    // Flush to get Arrow record
    if writer.Rows() > 0 {
        record := writer.Flush()
        defer record.Release()
        // Use record...
    }
}
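
To persist flushed records as a .arrow file that DuckDB, pandas, or Polars can read, the record returned by Flush can be handed to arrow-go's IPC file writer. A minimal continuation of the example above, with os and github.com/apache/arrow-go/v18/arrow/ipc added to the imports; the file name and error handling are illustrative:

    // Create the output file and wrap it in an Arrow IPC file writer
    // that shares the schema inferred from the regex.
    f, err := os.Create("logs.arrow")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    w, err := ipc.NewFileWriter(f, ipc.WithSchema(schema))
    if err != nil {
        panic(err)
    }
    defer w.Close()

    // record comes from writer.Flush() as in the example above.
    if err := w.Write(record); err != nil {
        panic(err)
    }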

Roadmap

Future enhancements:

  • Type inference: Automatic detection of numeric/timestamp fields
  • Multiple output formats: JSON, CSV, Parquet support
  • Streaming mode: Real-time processing with minimal latency
  • Advanced patterns: Multi-line log support
  • Compression: Built-in compression options

License

MIT License - see LICENSE file for details.


Carve - Transform logs into analytics-ready Arrow format with zero lock-in.
