Hi! First, thank you for this wonderful library! Being able to crunch data in Lisp makes my life much easier :) I have since written some optimizations to improve the performance, which is particularly helpful for handling relatively large (~1GB) data sets. I'm now looking to extract and contribute these patches, after sorting out some questions about portability/extensibility... Sorry for the noise, I hope this is not the wrong place to discuss!
What I have includes:

1. Call `duckdb-api:duckdb-row-count` in `translate-result` and allocate vectors with the right size at the beginning, eliminating `vector-push-extend` to avoid repeated realloc in the loop.
2. Use specialized CL arrays (e.g. `(array double-float)` instead of `(array t)`) when possible, and use the C function `memcpy` to do the copying instead of a Lisp `loop`.
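To make optimization 1 concrete, here is a minimal sketch of the before/after shape of the translation loop. All function names here (`get-value` in particular) are illustrative placeholders, not the actual cl-duckdb internals:

```lisp
;; Sketch of optimization 1. GET-VALUE stands in for whatever
;; accessor reads one element out of a result chunk.

;; Before: grow-on-demand; the adjustable vector is repeatedly
;; reallocated and copied as it fills up.
(defun collect-column-growing (chunk n-rows)
  (let ((result (make-array 0 :adjustable t :fill-pointer 0)))
    (loop :for i :below n-rows
          :do (vector-push-extend (get-value chunk i) result))
    result))

;; After: duckdb-row-count tells us N-ROWS up front, so we can
;; allocate once at the final size and fill by index.
(defun collect-column-preallocated (chunk n-rows)
  (let ((result (make-array n-rows)))
    (loop :for i :below n-rows
          :do (setf (aref result i) (get-value chunk i)))
    result))
```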
Optimization 1 is a free speed boost -- I think it's completely transparent and does not change any observable behavior!
Optimization 2 gives a massive improvement both immediately when loading the arrays and later when operating on them, e.g. for `double-float`s (it avoids consing billions of boxed floats). However, its portability/extensibility requires some discussion: different CL implementations may have different specialized array types, and C `memcpy` only works if the CL array and the C (DuckDB) array share the same element bit patterns. The code I use right now only considers SBCL/x86-64. What I have in mind is: introduce generic functions `allocate-result-array`, `translate-from-chunk` and `translate-to-chunk`, then specialize them on DuckDB types and select the optimized versions with read-time conditionals. I can only fill in specializations for `#+(and sbcl x86-64)` because that is what I use, but users of other platforms can fill in the rest.
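Roughly, the protocol I'm imagining could look like the sketch below. Every name, signature, and type keyword here is tentative and open to discussion, not existing cl-duckdb API; the SBCL fast path uses `sb-sys:with-pinned-objects`/`sb-sys:vector-sap` plus a CFFI `foreign-funcall` to `memcpy`, which is the standard SBCL idiom for bulk copies from foreign memory:

```lisp
;; Tentative sketch of the proposed protocol. Names, signatures, and
;; the :duckdb-double keyword are hypothetical.

(defgeneric allocate-result-array (duckdb-type length)
  (:documentation "Return a fresh CL array suited to DUCKDB-TYPE."))

(defgeneric translate-from-chunk (duckdb-type array chunk-ptr length)
  (:documentation "Fill ARRAY from the foreign data at CHUNK-PTR."))

;; Portable defaults: boxed (array t), element-by-element copy done
;; elsewhere in Lisp. Works on every implementation.
(defmethod allocate-result-array (duckdb-type length)
  (declare (ignore duckdb-type))
  (make-array length))

;; Fast path, only where the CL and C element bit patterns are known
;; to match (IEEE double-float on SBCL/x86-64).
#+(and sbcl x86-64)
(defmethod allocate-result-array ((duckdb-type (eql :duckdb-double)) length)
  (make-array length :element-type 'double-float))

#+(and sbcl x86-64)
(defmethod translate-from-chunk ((duckdb-type (eql :duckdb-double))
                                 (array simple-array) chunk-ptr length)
  ;; Pin the array so GC can't move it while memcpy runs, then copy
  ;; LENGTH doubles (8 bytes each) straight from the chunk.
  (sb-sys:with-pinned-objects (array)
    (cffi:foreign-funcall "memcpy"
                          :pointer (sb-sys:vector-sap array)
                          :pointer chunk-ptr
                          :unsigned-long (* length 8)
                          :pointer))
  array)
```

The nice property of this shape is that the portable default methods keep every platform working, while each `#+(and <impl> <arch>)` specialization is purely additive.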
How does that sound? Are such optimization patches welcome in general?