Description
We are capable of emitting rays in chunks between the pipeline steps.
current approach: partially chunked
generate and trace are chunked, but all rays are collected before writing to h5
```mermaid
graph LR
A["generate rays"]
A --> B["trace rays"]
B --> A
B --> C["collect rays"]
C --> D["write all rays to h5"]
```
proposed approach: fully chunked
the whole sequence of steps is executed for each chunk
limits the amount of resources required at any instant.
steps could be executed simultaneously (pipelined), potentially improving runtime performance
```mermaid
graph LR
A["generate rays"]
A --> B["trace rays"]
B --> C["collect rays"]
C --> D["write chunk of rays to h5"]
D --> A
```
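The fully chunked loop can be sketched in Python. Everything here is illustrative: the function names (`generate_rays`, `trace_rays`, `collect_rays`) and the Nx6 ray layout are assumptions standing in for the real rayx-core steps, and a plain list stands in for the h5 file.

```python
import numpy as np

CHUNK_SIZE = 1000
TOTAL_RAYS = 3500

def generate_rays(total, chunk_size):
    """Yield chunks of generated rays (stand-in: Nx6 position+direction)."""
    rng = np.random.default_rng(42)
    produced = 0
    while produced < total:
        n = min(chunk_size, total - produced)
        yield rng.normal(size=(n, 6))
        produced += n

def trace_rays(chunk):
    """Stand-in for tracing: propagate positions along directions."""
    traced = chunk.copy()
    traced[:, :3] += traced[:, 3:]
    return traced

def collect_rays(chunk):
    """Stand-in for collection, e.g. dropping absorbed rays."""
    return chunk

written = []  # stand-in for the h5 file
for chunk in generate_rays(TOTAL_RAYS, CHUNK_SIZE):
    chunk = trace_rays(chunk)
    chunk = collect_rays(chunk)
    written.append(chunk)  # "write chunk of rays to h5"

total_written = sum(len(c) for c in written)
```

At no point does the loop hold more than one chunk of rays; a pipelined version would overlap these steps across chunks instead of running them strictly in sequence.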
rayx-core
h5 chunking & compression should be considered to balance compute and I/O
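As a sketch of what that tuning could look like with h5py: the dataset is made resizable so chunks can be appended as they arrive, and the HDF5 chunk shape and gzip level shown here are placeholders that would need benchmarking, not recommended values.

```python
import os
import tempfile

import h5py
import numpy as np

rays = np.random.default_rng(0).normal(size=(10_000, 6))
path = os.path.join(tempfile.mkdtemp(), "rays.h5")

with h5py.File(path, "w") as f:
    # chunk shape and compression level should be benchmarked:
    # larger chunks favor throughput, smaller chunks favor partial reads
    dset = f.create_dataset(
        "rays",
        shape=(0, 6),
        maxshape=(None, 6),   # resizable along the ray axis
        chunks=(1024, 6),     # HDF5 chunk size
        compression="gzip",
        compression_opts=4,
    )
    # append chunk by chunk instead of writing everything at once
    for start in range(0, len(rays), 1024):
        chunk = rays[start:start + 1024]
        dset.resize(dset.shape[0] + len(chunk), axis=0)
        dset[-len(chunk):] = chunk

with h5py.File(path, "r") as f:
    stored = f["rays"][:]
```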
rayx-python
In order to make this feature available, rayx-python has to integrate into the pipeline; essentially, it replaces the write-to-h5
step. We probably have to create an API for the chunked processing.
```mermaid
graph LR
A["generate rays"]
A --> B["trace rays"]
B --> C["collect rays"]
C --> D["process chunk by python"]
D --> A
```
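One possible shape for such an API, entirely hypothetical: `trace_chunked` and its callback signature are assumptions for discussion, not existing rayx-python functions, and the body below merely fakes the generate/trace/collect steps.

```python
from typing import Callable

import numpy as np

def trace_chunked(n_rays: int, chunk_size: int,
                  process: Callable[[np.ndarray], None]) -> None:
    """Hypothetical API sketch: run generate/trace/collect per chunk and
    hand each collected chunk to a user-supplied callback, replacing the
    write-to-h5 step."""
    rng = np.random.default_rng(1)
    for start in range(0, n_rays, chunk_size):
        n = min(chunk_size, n_rays - start)
        chunk = rng.normal(size=(n, 6))  # stand-in for generate+trace+collect
        process(chunk)                   # "process chunk by python"

# usage: accumulate a statistic without holding all rays in memory
seen = {"count": 0}

def process(chunk: np.ndarray) -> None:
    seen["count"] += len(chunk)

trace_chunked(2500, 1000, process)
```

An iterator-based variant that yields chunks (`for chunk in trace_chunked(...)`) would be equally viable and arguably more Pythonic than a callback.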
reading large h5 from python (no rayx-python)
Filtering the data before making it accessible to the user reduces the amount of data held in memory at any instant, hopefully enough to read amounts of data that usually wouldn't fit into memory.
```python
import pandas as pd

# where-filtering requires the dataset to be stored in "table" format
df = pd.read_hdf("data.h5", key="mydata", where="temperature > 300")
```
If the amount of data is still too large, the data could be loaded, filtered, and processed slice by slice.
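A minimal sketch of the slice-by-slice pattern, using an in-memory NumPy array as a stand-in for the HDF5 dataset (with pandas/PyTables, `pd.HDFStore.select(..., chunksize=...)` provides the same iteration over a real file):

```python
import numpy as np

data = np.random.default_rng(2).normal(loc=300, scale=50, size=100_000)
SLICE = 10_000

kept = 0
total = 0.0
# load, filter, and process one slice at a time;
# only SLICE values are held at once
for start in range(0, len(data), SLICE):
    sl = data[start:start + SLICE]
    filtered = sl[sl > 300]          # "temperature > 300"
    kept += len(filtered)
    total += filtered.sum()

mean_above_300 = total / kept
```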
finally: processing chunks in python
e.g. creating a histogram
A histogram has a fixed number of bins, each representing accumulated data from a potentially large number of datapoints. We can feed chunks of datapoints into the histogram, limiting the amount of resources required at any instant.
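A sketch with NumPy: fixing the bin edges up front lets per-chunk counts be summed into one histogram, giving the same result as histogramming all the data at once.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=50_000)

bins = np.linspace(-4.0, 4.0, 65)         # fixed bin edges for all chunks
counts = np.zeros(len(bins) - 1, dtype=np.int64)

CHUNK = 5_000
for start in range(0, len(data), CHUNK):
    chunk = data[start:start + CHUNK]
    chunk_counts, _ = np.histogram(chunk, bins=bins)
    counts += chunk_counts                # accumulate into the fixed bins

# reference: histogram over the full dataset in one pass
full_counts, _ = np.histogram(data, bins=bins)
```

Only one chunk plus the (small, fixed-size) count array is in memory at a time, so the dataset size no longer bounds memory use.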