Tags: Eventual-Inc/Daft
# docs: add casting matrix (#5333)

## Changes Made

Add an updated casting matrix to our docs as a new "Casting" page. I checked the logic for each cast in `cast.rs` to see if we technically support it. A next step would be to actually test this matrix.

<img width="794" height="824" alt="image" src="https://github.com/user-attachments/assets/1ad0276e-95a5-4707-a78d-56ee7e7403df" />

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [x] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly
# refactor: add fragment_group_size to reduce lance scan task (#5261)

## Changes Made

When the number of fragments is large, the current implementation assigns one scan task to each fragment, which results in long planning times. This PR adds fragment filtering and fragment grouping to reduce the number of tasks.

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
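The grouping idea above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Daft's actual internals: `group_fragments` and `fragment_group_size` as a bare function argument are assumptions made for demonstration; in the real implementation, filtering and grouping happen inside the Lance scan planner.

```python
# Illustrative sketch: instead of emitting one scan task per fragment,
# chunk the (already filtered) fragment list into groups of
# `fragment_group_size`, yielding one scan task per group.

def group_fragments(fragment_ids: list[int], fragment_group_size: int) -> list[list[int]]:
    if fragment_group_size <= 0:
        raise ValueError("fragment_group_size must be positive")
    return [
        fragment_ids[i : i + fragment_group_size]
        for i in range(0, len(fragment_ids), fragment_group_size)
    ]

# 10 fragments with a group size of 4 -> 3 scan tasks instead of 10.
groups = group_fragments(list(range(10)), 4)
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With thousands of fragments, this turns planning from one task per fragment into one task per group, which is where the planning-time savings come from.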
# feat: add File.to_tempfile method and optimize range requests (#5226)

## Changes Made

Adds a new `.to_tempfile()` method on `daft.File`. Many APIs don't work with readable objects and instead expect literal file paths, so this allows better integration with such tools, for example docling:

```py
from docling.document_converter import DocumentConverter

@daft.func
def process_document(doc: daft.File) -> str:
    with doc.to_tempfile() as temp_file:
        converter = DocumentConverter()
        result = converter.convert(temp_file.name)
        return result.document.export_to_text()

df.select(process_document(F.file(df["url"]))).collect()
```

or whisper:

```py
import whisper

@daft.func(return_dtype=dt.list(dt.struct({
    "text": dt.string(),
    "start": dt.float64(),
    "end": dt.float64(),
    "id": dt.int64(),
})))
def extract_dialogue_segments(file: daft.File):
    """Transcribes audio using whisper."""
    with file.to_tempfile() as tmpfile:
        model = whisper.load_model("turbo")
        result = model.transcribe(tmpfile)
        segments = []
        for segment in result["segments"]:
            segment_obj = {
                "text": segment["text"],
                "start": segment["start"],
                "end": segment["end"],
                "id": segment["id"],
            }
            segments.append(segment_obj)
        return segments
```

### Notes for reviewers

I also had to add some internal buffering for HTTP-backed files. Previously, attempting a range request against a server that didn't support range requests raised an error (`416`). Now we first try a range request, and if we get a `416` we fall back to buffering the entire data.

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
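The 416 fallback described in the reviewer notes can be sketched in pure Python. This is a minimal sketch of the control flow only: `read_range`, `fetch_range`, `fetch_all`, and `RangeNotSupported` are illustrative names invented here, not Daft's actual HTTP source implementation.

```python
# Illustrative sketch of the range-request fallback: try a ranged read
# first; if the server rejects range requests (HTTP 416), download the
# whole object and slice it locally instead.

class RangeNotSupported(Exception):
    """Stands in for an HTTP 416 (Range Not Satisfiable) response."""

def read_range(fetch_range, fetch_all, start: int, end: int) -> bytes:
    # fetch_range(start, end) -> bytes; may raise RangeNotSupported.
    # fetch_all() -> bytes; downloads the entire object.
    try:
        return fetch_range(start, end)
    except RangeNotSupported:
        # Server returned 416: buffer everything and slice locally.
        return fetch_all()[start:end]

# A server that refuses range requests still serves partial reads:
data = b"hello world"

def no_ranges(start: int, end: int) -> bytes:
    raise RangeNotSupported

chunk = read_range(no_ranges, lambda: data, 0, 5)
print(chunk)  # b'hello'
```

The trade-off is that the first failed range request costs a full download, but subsequent reads can be served from the buffered copy.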
# docs: improve text readability on examples page (#5182)

## Summary

- Add darker overlay for image generation and document processing cards to improve text readability on light-colored cover images
- Maintain same gradient positioning as base overlay while increasing opacity values

## Before/After Screenshots

<img width="1070" height="945" alt="image" src="https://github.com/user-attachments/assets/7ef48940-fa07-4c14-a4a9-092d1e9bb274" />

<img width="1066" height="947" alt="image" src="https://github.com/user-attachments/assets/643bfbba-2b78-48ae-94ae-ae2039820cf8" />

## Test plan

- [x] Verify text is readable on all example cards
- [x] Check overlay doesn't obscure image details unnecessarily
- [x] Test responsive behavior on mobile

## Internal

Closes https://linear.app/eventual/issue/EVE-875/darken-the-background-overlay-for-the-text-for-examples
# ci: fix test-wheels job in build-wheel.yml (#5134)

## Changes Made

PyPI upload is failing on main due to the test setup. Fixing it here: https://github.com/Eventual-Inc/Daft/actions/runs/17446158050

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
# fix: Fix venv command for windows build (#5073)

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
# feat: support count(1) in dataframe and choose the cheap column (#4977)

## Changes Made

Previously, `count(1)` on a DataFrame was not supported:

```
In [49]: df.count(1).show()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[49], line 1
----> 1 df.count(1).show()

    [... skipping hidden 1 frame]

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:3011, in DataFrame.count(self, *cols)
   3008     raise ValueError("Cannot call count() with both * and column names")
   3010 # Otherwise, perform a column-wise count on the specified columns
-> 3011 return self._apply_agg_fn(Expression.count, cols)

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:2854, in DataFrame._apply_agg_fn(self, fn, cols, group_by)
   2852 groupby_name_set = set() if group_by is None else group_by.to_name_set()
   2853 cols = tuple(c for c in self.column_names if c not in groupby_name_set)
-> 2854 exprs = self._wildcard_inputs_to_expressions(cols)
   2855 return self._agg([fn(c) for c in exprs], group_by)

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:1596, in DataFrame._wildcard_inputs_to_expressions(self, columns)
   1594 """Handles wildcard argument column inputs."""
   1595 column_input: Iterable[ColumnInputType] = columns[0] if len(columns) == 1 else columns  # type: ignore
-> 1596 return column_inputs_to_expressions(column_input)

File /data00/code/tmp2/Daft/daft/utils.py:126, in column_inputs_to_expressions(columns)
    123 from daft.expressions import col
    125 column_iter: Iterable[ColumnInputType] = [columns] if is_column_input(columns) else columns  # type: ignore
--> 126 return [col(c) if isinstance(c, str) else c for c in column_iter]

TypeError: 'int' object is not iterable
```

With this PR:

```
In [5]: df.count(1).show()
╭────────╮
│ count  │
│ ---    │
│ UInt64 │
╞════════╡
│ 4      │
╰────────╯
(Showing first 1 of 1 rows)
```

There is also a performance improvement. Before this change, the benchmark results were:

```
Name (time in ms)                                       Min                 Max                Mean             StdDev              Median                IQR  Outliers       OPS  Rounds  Iterations
test_count_with_col_name[float_id-float]            18.2832 (1.0)       24.8761 (1.07)      20.1458 (1.0)     2.6823 (1.34)      19.3201 (1.0)       2.0831 (1.0)       1;1   49.6381 (1.0)       5       1
test_count_with_col_name[int_id-int]                18.6567 (1.02)      23.2190 (1.0)       21.0213 (1.04)    1.9988 (1.0)       21.3333 (1.10)      3.6173 (1.74)      2;0   47.5707 (0.96)      5       1
test_count_with_col_name[string_content-string]    314.5036 (17.20)    633.3133 (27.28)    434.0078 (21.54) 120.1076 (60.09)    414.9809 (21.48)   120.6531 (57.92)     1;0    2.3041 (0.05)      5       1
test_count_with_col_name[binary_content-binary]  1,836.5554 (100.45) 1,889.4419 (81.37)  1,864.8847 (92.57)  22.0420 (11.03)  1,857.1332 (96.12)    34.7329 (16.67)     2;0    0.5362 (0.01)      5       1
```

With this PR, the benchmark results are:

```
Name (time in ms)                                       Min                 Max                Mean             StdDev              Median                IQR  Outliers       OPS  Rounds  Iterations
test_count_with_col_name[int_id-int]                18.3580 (1.0)       25.2071 (1.00)      21.3138 (1.0)     2.7985 (1.68)      20.9401 (1.0)       4.5452 (2.61)      2;0   46.9180 (1.0)       5       1
test_count_with_col_name[float_id-float]            20.5159 (1.12)      25.1621 (1.0)       22.8351 (1.07)    1.6660 (1.0)       22.9081 (1.09)      1.7429 (1.0)       2;0   43.7923 (0.93)      5       1
test_count_with_col_name[binary_content-binary]     21.8774 (1.19)      27.1461 (1.08)      23.9783 (1.13)    2.1795 (1.31)      24.3946 (1.16)      3.2378 (1.86)      1;0   41.7043 (0.89)      5       1
test_count_with_col_name[string_content-string]     22.6195 (1.23)      30.9669 (1.23)      25.7760 (1.21)    3.5412 (2.13)      23.9304 (1.14)      5.4121 (3.11)      1;0   38.7958 (0.83)      5       1
```

The benchmark test cases are as follows:

```py
from __future__ import annotations

import pytest

import daft

PATH = "/tmp/test_count_1"

# Consolidated test configuration with built-in IDs
TEST_CASES = [
    pytest.param("int_id", "int", id="first_col_int"),
    pytest.param("float_id", "float", id="first_col_float"),
    pytest.param("string_content", "string", id="first_col_string"),
    pytest.param("binary_content", "binary", id="first_col_binary"),
]


def generate_data():
    # Adjust column order to ensure the target column is first
    data = {
        "int_id": [1] * 1000,
        "float_id": [0.001] * 1000,
        "string_content": ["a" * 100000] * 1000,
        "binary_content": [b"a" * 1000000] * 1000,
    }
    for case in TEST_CASES:
        param, _ = case.values
        col_order = [param] + ["int_id", "float_id", "string_content", "binary_content"]
        col_order = list(dict.fromkeys(col_order))
        df = daft.from_pydict({k: data[k] for k in col_order})
        path = f"{PATH}_{param}"
        df.write_parquet(path, write_mode="overwrite")


generate_data()


@pytest.mark.parametrize("col_name, _", [case.values[:2] for case in TEST_CASES])
def test_count_with_col_name(benchmark, col_name, _):
    """Benchmark count(*) with different first column layouts."""

    def operation():
        path = f"{PATH}_{col_name}"
        df = daft.read_parquet(path)
        return df.count().to_pydict()

    result = benchmark.pedantic(operation, rounds=5, warmup_rounds=1)
    assert result["count"][0] == 1000
```

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

Co-authored-by: Colin Ho <colin.ho99@gmail.com>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-22-159.us-west-2.compute.internal>
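The "choose the cheap column" part of the speedup can be sketched as follows. This is an illustrative sketch only: `cheapest_column` and the cost table are invented stand-ins, not Daft's actual cost model; the point is that for a row count, any column yields the same answer, so the planner can read whichever is cheapest.

```python
# Illustrative sketch: when counting rows, pick the column that is
# cheapest to read (e.g. a fixed-width int instead of a wide binary
# column). The byte widths here are rough stand-ins for a real cost model.

ILLUSTRATIVE_COST = {"int": 8, "float": 8, "string": 64, "binary": 256}

def cheapest_column(schema: dict[str, str]) -> str:
    # schema maps column name -> dtype name; unknown dtypes rank last.
    return min(schema, key=lambda name: ILLUSTRATIVE_COST.get(schema[name], 1024))

schema = {"binary_content": "binary", "string_content": "string", "int_id": "int"}
print(cheapest_column(schema))  # int_id
```

This matches the benchmark shape above: before the change, counting a table whose first column was wide (string/binary) paid for scanning that column; after, the count cost is roughly flat regardless of column layout.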
# fix: Always just use actor for flotilla scheduler (#4978)

## Changes Made

This fixes the problem where users try to run flotilla in an async context, e.g. Jupyter. I've been meaning to get rid of the 'local' flotilla runner anyway, because it won't work well if users submit multiple daft jobs to the same cluster: you would spin up multiple flotilla runners and multiple swordfish workers (per job) instead of reusing them.

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
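The reuse problem described above is essentially a process-wide singleton pattern. The sketch below is illustrative only: `SchedulerActor` and `get_or_create_scheduler` are invented names, not Daft's internals, and a real actor would live in the cluster rather than in-process.

```python
# Illustrative sketch: keep a single shared scheduler handle and reuse
# it across jobs, instead of creating a fresh runner per job.

_scheduler = None

class SchedulerActor:
    def __init__(self) -> None:
        self.jobs_submitted = 0

    def submit(self, job: str) -> int:
        # Track submissions so we can see that jobs share one scheduler.
        self.jobs_submitted += 1
        return self.jobs_submitted

def get_or_create_scheduler() -> SchedulerActor:
    global _scheduler
    if _scheduler is None:
        _scheduler = SchedulerActor()
    return _scheduler

# Two jobs get the same scheduler rather than two separate runners.
a = get_or_create_scheduler()
b = get_or_create_scheduler()
print(a is b)  # True
```

Routing everything through one named actor also sidesteps the async-context issue: callers talk to the actor instead of trying to start their own event-loop-bound runner.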