
Discussion: some optimizations for large datasets #67

@kchanqvq


Hi! First, thank you for this wonderful library! Being able to crunch data in Lisp makes my life much easier :) I have since written some optimizations to improve the performance, which is particularly helpful for handling relatively large (~1GB) data sets. I'm now looking to extract and contribute these patches, after sorting out some questions about portability/extensibility... Sorry for the noise, I hope this is not the wrong place to discuss!

What I have includes:

  1. Call duckdb-api:duckdb-row-count in translate-result and allocate vectors of the right size up front, eliminating vector-push-extend and the repeated reallocation it causes in the loop.
  2. Use specialized CL arrays (e.g. (array double-float) instead of (array t)) when possible, and use the C function memcpy to do the copying instead of a Lisp loop.

Item 1 is a free speed boost: I think it's completely transparent and does not change any observable behavior!
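To illustrate the first item, here is a minimal sketch in portable Common Lisp. It does not call DuckDB at all: `collect-growing`, `collect-preallocated`, and the `value-at` callback are hypothetical stand-ins for the loop in translate-result, and `row-count` plays the role of the value returned by duckdb-api:duckdb-row-count.

```lisp
;; Hypothetical stand-in for a query result: ROW-COUNT is known up front,
;; and VALUE-AT is a function returning the value at a given row index.

(defun collect-growing (row-count value-at)
  "Grow the vector as rows arrive; the storage may be reallocated several times."
  (let ((out (make-array 0 :adjustable t :fill-pointer 0)))
    (dotimes (i row-count out)
      (vector-push-extend (funcall value-at i) out))))

(defun collect-preallocated (row-count value-at)
  "Allocate once using the known row count, then fill by index."
  (let ((out (make-array row-count)))
    (dotimes (i row-count out)
      (setf (aref out i) (funcall value-at i)))))
```

Both functions return vectors with the same elements in the same order. One difference in this sketch is that the preallocated version returns a simple vector, while the growing version returns an adjustable vector with a fill pointer.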

Item 2 gives a massive improvement, both immediately when loading the arrays and later when operating on them. For double-floats, for example, it avoids consing billions of boxed floats. However, its portability/extensibility requires some discussion: different CL implementations may have different specialized array types, and C memcpy only works if the CL array and the C (duckdb) array have the same element bit patterns. The code I use right now only considers SBCL/x86-64. What I have in mind is to introduce generic functions allocate-result-array, translate-from-chunk and translate-to-chunk, then specialize them on duckdb types and select optimized versions with read-time conditionals. I can only fill in specializations for #+(and sbcl x86-64), since that is what I use, but users of other platforms can fill in the rest.
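As a rough sketch of how that dispatch could look, the following uses only standard CLOS and read-time conditionals. The generic function name allocate-result-array comes from the proposal above, but its argument list, the `:double` type keyword, and the fallback behavior are assumptions made for illustration, not the library's actual API. The memcpy step is left out because it depends on implementation-specific pinning and FFI details.

```lisp
;; Assumed signature: a keyword naming the DuckDB column type (hypothetical)
;; plus the number of rows to hold.
(defgeneric allocate-result-array (duckdb-type row-count)
  (:documentation "Return an array able to hold ROW-COUNT values of DUCKDB-TYPE."))

;; Portable fallback: a general vector that can hold any Lisp object.
(defmethod allocate-result-array (duckdb-type row-count)
  (declare (ignore duckdb-type))
  (make-array row-count :initial-element nil))

;; Optimized case, compiled in only on a platform where the element layout
;; has been checked against the C side.
#+(and sbcl x86-64)
(defmethod allocate-result-array ((duckdb-type (eql :double)) row-count)
  (make-array row-count :element-type 'double-float :initial-element 0d0))
```

On a platform without a matching `#+` method, every type falls through to the general vector, so behavior stays correct and only the speedup is lost. Evaluating `(upgraded-array-element-type 'double-float)` is one standard way to see whether an implementation offers a specialized array for a given element type.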

How does that sound? Are such optimization patches welcome in general?
