Carve

Carve is a high-performance, Arrow-native log parser that transforms structured logs into Apache Arrow format for fast analytics. It is written in Go, and the compiled binary has no external dependencies.

Features

  • Arrow-native output: Direct conversion to Apache Arrow IPC format
  • Regex-based parsing: Use named capture groups to define schema
  • Memory efficient: Configurable batch sizes prevent memory bloat
  • Stream processing: Handle large files with constant memory usage
  • Zero-copy interop: Works seamlessly with DuckDB, Polars, pandas, etc.
  • High performance: Optimized for throughput with built-in benchmarking

Installation

Build from source:

git clone https://github.com/TFMV/carve.git
cd carve
go build -o carve ./cmd/carve

Quick Start

Transform logs with a simple regex pattern:

# Parse structured logs into Arrow format
carve --pattern '^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)' \
      --input app.log \
      --output logs.arrow

# View the inferred schema
carve --pattern '^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)' \
      --schema

Usage

Basic Options

carve [options]

Options:
  --pattern string        Regex pattern with named capture groups (required)
  --input string          Input file (defaults to stdin)
  --output string         Output Arrow IPC file (required unless --schema)
  --flush-interval int    Rows per record batch (default 10000)
  --schema                Print inferred schema and exit
  --verbose               Verbose logging with warnings for non-matching lines
  --bench-report          Emit per-batch timing information
  --max-rows int          Limit processed input (0 = unlimited)
  --version               Print version and exit

Examples

Parse web server logs:

carve --pattern '^(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<size>\d+)' \
      --input access.log \
      --output access.arrow

Parse application logs with verbose output:

carve --pattern '^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) \[(?P<thread>[^\]]+)\] (?P<level>\w+) (?P<logger>\S+) - (?P<msg>.+)' \
      --input app.log \
      --output app.arrow \
      --verbose

Process from stdin with custom batch size:

tail -f /var/log/app.log | carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
                                 --output live.arrow \
                                 --flush-interval 1000

Analytics Integration

DuckDB

Query Arrow files directly in DuckDB:

-- Load and explore the data
SELECT * FROM 'logs.arrow' LIMIT 10;

-- Analyze error patterns
SELECT level, COUNT(*) as count 
FROM 'logs.arrow' 
GROUP BY level 
ORDER BY count DESC;

-- Time-based analysis
SELECT DATE_TRUNC('hour', CAST(timestamp AS TIMESTAMP)) as hour,
       COUNT(*) as events
FROM 'logs.arrow' 
WHERE level = 'ERROR'
GROUP BY hour
ORDER BY hour;

-- Find specific patterns
SELECT timestamp, message 
FROM 'logs.arrow'
WHERE message LIKE '%timeout%'
ORDER BY timestamp DESC;

Python (pandas/polars)

import pyarrow as pa
import pandas as pd

# Load with pandas
df = pd.read_feather('logs.arrow')
print(df.head())

# Or with polars for better performance
import polars as pl
df = pl.read_ipc('logs.arrow')
df.filter(pl.col('level') == 'ERROR').head()

Performance

Carve is optimized for high-throughput log processing:

Benchmarks

Run the built-in benchmarks:

cd pkg/carve
go test -bench=. -benchmem

Example results on a modern laptop:

BenchmarkSmallFile-8         5000    250000 ns/op    45000 B/op    120 allocs/op
BenchmarkMediumFile-8         500   2500000 ns/op   450000 B/op   1200 allocs/op  
BenchmarkLargeFile-8           50  25000000 ns/op  4500000 B/op  12000 allocs/op

Memory Usage

  • Streaming: Constant memory usage regardless of input size
  • Batching: Configurable batch sizes (1K-100K rows recommended); see the sketch below
  • Zero-copy: Arrow format enables zero-copy reads in analytics tools
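
The bounded-memory behavior comes from flushing a record batch every --flush-interval rows rather than buffering the whole input. The sketch below shows that loop using the library calls documented under API Usage (ExtractSchema, NewArrowWriter, ParseLine, Append, Rows, Flush); the surrounding scaffolding and the choice to read from stdin are illustrative assumptions, not Carve's internal code:

package main

import (
    "bufio"
    "os"
    "regexp"

    "carve/pkg/carve"
    "github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
    re := regexp.MustCompile(`^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)`)
    schema, err := carve.ExtractSchema(re)
    if err != nil {
        panic(err)
    }

    // Rows per record batch, mirroring --flush-interval.
    const flushInterval = 10000
    writer := carve.NewArrowWriter(schema, memory.DefaultAllocator, flushInterval)

    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        if vals := carve.ParseLine(sc.Text(), re); vals != nil {
            writer.Append(vals)
        }
        // Flush each full batch so memory stays bounded no matter how
        // large the input is.
        if writer.Rows() >= flushInterval {
            record := writer.Flush()
            // Consume the record here (e.g. hand it to an Arrow IPC
            // writer, as sketched after API Usage), then release it.
            record.Release()
        }
    }
}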

Throughput

Typical performance on structured logs:

  • Small files (<1MB): ~10MB/s
  • Medium files (1-100MB): ~50MB/s
  • Large files (100MB+): ~100MB/s

Performance varies with regex complexity and match rate.

Schema Inference

Carve automatically creates Arrow schemas from regex named capture groups:

# View schema for a pattern
carve --pattern '^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)' --schema

Output:

Schema (3 fields):
  0: timestamp (string)
  1: level (string)  
  2: message (string)

Notes:

  • All fields are currently typed as string
  • Field order matches the order of named groups in the regex
  • Unnamed capture groups are ignored (illustrated in the sketch below)
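
As an illustration of those rules, named capture groups can be enumerated in Go roughly as follows. This is a sketch of the idea, not necessarily Carve's exact implementation:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`^(?P<timestamp>[^ ]+) (?P<level>\w+) (?P<message>.+)`)

    idx := 0
    for _, name := range re.SubexpNames() {
        if name == "" {
            // Entry 0 (the whole match) and unnamed groups have no name
            // and are skipped.
            continue
        }
        // Every named group becomes a string field, in order of appearance.
        fmt.Printf("%d: %s (string)\n", idx, name)
        idx++
    }
}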

Error Handling

Non-matching Lines

By default, lines that don't match the pattern are silently skipped. Use --verbose to see warnings:

carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input mixed.log \
      --output clean.arrow \
      --verbose

Output:

[warn] line 42: does not match pattern
[warn] line 103: does not match pattern
processed 1000 lines, wrote 950 rows to clean.arrow
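
The same behavior is easy to reproduce when using Carve as a library, because ParseLine returns nil for lines that do not match (see API Usage below). A minimal sketch, where re, writer, and the imports are assumed to be set up as in that example:

matched, skipped := 0, 0
sc := bufio.NewScanner(os.Stdin)
for sc.Scan() {
    if vals := carve.ParseLine(sc.Text(), re); vals != nil {
        writer.Append(vals)
        matched++
    } else {
        // Non-matching lines are counted and skipped rather than failing the run.
        skipped++
        log.Printf("[warn] line %d: does not match pattern", matched+skipped)
    }
}
log.Printf("processed %d lines, wrote %d rows", matched+skipped, matched)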

Invalid Patterns

Carve validates regex patterns at startup:

# Invalid regex
carve --pattern '[invalid' --schema
# Error: failed to compile pattern: error parsing regexp: missing closing ]: `[invalid`

# No named groups  
carve --pattern '(\w+)' --schema
# Error: schema error: pattern must contain named capture groups

Validation

DuckDB Integration Test

The repository includes sample data for testing DuckDB integration:

# Generate sample Arrow file
carve --pattern '^(?P<ts>\d{4}-[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input testdata/sample.log \
      --output testdata/sample.arrow

# Test with DuckDB
duckdb -c "SELECT level, COUNT(*) FROM 'testdata/sample.arrow' GROUP BY level;"
duckdb -c "SELECT msg FROM 'testdata/sample.arrow' WHERE level = 'WARN';"

Performance Testing

Use the --bench-report and --max-rows flags for benchmarking:

# Monitor batch processing performance
carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input large.log \
      --output large.arrow \
      --bench-report \
      --flush-interval 5000

# Limit processing for testing
carve --pattern '^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)' \
      --input huge.log \
      --output sample.arrow \
      --max-rows 10000 \
      --verbose

Development

Testing

Run the full test suite:

# Unit tests
go test ./pkg/carve

# Integration tests  
go test ./cmd/carve

# All tests with coverage
go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out

API Usage

Use Carve as a library:

package main

import (
    "regexp"
    "carve/pkg/carve"
    "github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
    re := regexp.MustCompile(`^(?P<ts>[^ ]+) (?P<level>\w+) (?P<msg>.+)`)
    
    // Extract schema
    schema, err := carve.ExtractSchema(re)
    if err != nil {
        panic(err)
    }
    
    // Create writer
    writer := carve.NewArrowWriter(schema, memory.DefaultAllocator, 1000)
    
    // Parse and append lines
    line := "2023-01-01 INFO Application started"
    if vals := carve.ParseLine(line, re); vals != nil {
        writer.Append(vals)
    }
    
    // Flush to get Arrow record
    if writer.Rows() > 0 {
        record := writer.Flush()
        defer record.Release()
        // Use record...
    }
}
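
To persist flushed records as a .arrow file that DuckDB, pandas, or Polars can read, the record returned by Flush can be handed to arrow-go's IPC file writer. A minimal continuation of the example above, with os and github.com/apache/arrow-go/v18/arrow/ipc added to the imports; the file name and error handling are illustrative:

    // Create the output file and wrap it in an Arrow IPC file writer
    // that shares the schema inferred from the regex.
    f, err := os.Create("logs.arrow")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    w, err := ipc.NewFileWriter(f, ipc.WithSchema(schema))
    if err != nil {
        panic(err)
    }
    defer w.Close()

    // record comes from writer.Flush() as in the example above.
    if err := w.Write(record); err != nil {
        panic(err)
    }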

Roadmap

Future enhancements:

  • Type inference: Automatic detection of numeric/timestamp fields
  • Multiple output formats: JSON, CSV, Parquet support
  • Streaming mode: Real-time processing with minimal latency
  • Advanced patterns: Multi-line log support
  • Compression: Built-in compression options

License

MIT License - see LICENSE file for details.


Carve - Transform logs into analytics-ready Arrow format with zero lock-in.
