
Tags: Eventual-Inc/Daft

v0.6.5

docs: add casting matrix (#5333)

## Changes Made

Add an updated casting matrix to our docs as a new "Casting" page.

I checked the logic for each cast in `cast.rs` to see whether we technically support it. The next step would be to actually test this matrix.

<img width="794" height="824" alt="image"
src="https://github.com/user-attachments/assets/1ad0276e-95a5-4707-a78d-56ee7e7403df"
/>
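
For reference, casts are applied via `Expression.cast`; a minimal sketch of one cell in the matrix (illustration only, not part of this PR):

```py
import daft
from daft import DataType, col

df = daft.from_pydict({"x": [1, 2, 3]})
# Cast an Int64 column to Float64, one of the supported casts in the matrix.
df = df.with_column("x_float", col("x").cast(DataType.float64()))
df.show()
```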


## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly

v0.6.4

test: Temporarily remove Common Crawl integration test (#5296)

## Changes Made

Our credentialed IO role doesn't have the right permissions. Removing the test for now.

v0.6.3

refactor: add fragment_group_size to reduce lance scan task (#5261)

## Changes Made

When the number of fragments is large, the current implementation assigns one scan task per fragment, which results in long planning times. This change adds fragment filtering and fragment grouping to reduce the number of tasks.
<!-- Describe what changes were made and why. Include implementation
details if necessary. -->
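
As a rough illustration of the grouping (a hypothetical helper, not the actual Rust implementation), fragments are batched into groups of `fragment_group_size`, so one scan task covers several fragments instead of one:

```py
# Hypothetical sketch only; the real change lives in Daft's Lance scan planning,
# and `fragments` here is just a placeholder list.
def group_fragments(fragments: list, fragment_group_size: int) -> list[list]:
    """Batch fragments so one scan task handles up to `fragment_group_size` of them."""
    return [
        fragments[i : i + fragment_group_size]
        for i in range(0, len(fragments), fragment_group_size)
    ]

# 10 fragments with fragment_group_size=4 -> 3 scan tasks instead of 10.
print(len(group_fragments(list(range(10)), 4)))  # 3
```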

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

v0.6.2

feat: add File.to_tempfile method and optimize range requests (#5226)

## Changes Made

Adds a new `.to_tempfile()` method on `daft.File`.

Many APIs don't work with readable file-like objects and instead expect literal file paths, so this gives us better integrations with those tools, such as docling:
```py
import daft
from docling.document_converter import DocumentConverter

@daft.func
def process_document(doc: daft.File) -> str:
    # Materialize the file at a local temp path, since docling expects a real file path.
    with doc.to_tempfile() as temp_file:
        converter = DocumentConverter()
        result = converter.convert(temp_file.name)
    return result.document.export_to_text()

# `F` is assumed to be daft's functions namespace (e.g. `from daft import functions as F`).
df.select(process_document(F.file(df["url"]))).collect()
```

or whisper

```py
import daft
import whisper
from daft import DataType as dt

@daft.func(return_dtype=dt.list(dt.struct({
    "text": dt.string(),
    "start": dt.float64(),
    "end": dt.float64(),
    "id": dt.int64()
})))
def extract_dialogue_segments(file: daft.File):
    """Transcribes audio using whisper."""
    with file.to_tempfile() as tmpfile:
        model = whisper.load_model("turbo")

        # whisper expects a file path, so pass the temp file's name.
        result = model.transcribe(tmpfile.name)

        segments = []
        for segment in result["segments"]:
            segments.append({
                "text": segment["text"],
                "start": segment["start"],
                "end": segment["end"],
                "id": segment["id"],
            })

        return segments
```

### Notes for reviewers

I also had to add some internal buffering for HTTP-backed files. Previously, attempting a range request against a server that doesn't support them raised an error (`416`). Now we first try a range request, and if we get a `416` we fall back to buffering the entire file.
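
The fallback pattern looks roughly like this (a minimal sketch using `requests`, not the actual internal implementation):

```py
import requests

def read_range_with_fallback(url: str, start: int, end: int) -> bytes:
    """Try a ranged GET; if the server answers 416, buffer the whole body instead."""
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    if resp.status_code == 416:
        # The server rejected the range request, so download everything
        # and slice locally.
        return requests.get(url).content[start : end + 1]
    resp.raise_for_status()
    return resp.content
```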



## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

v0.6.1

docs: improve text readability on examples page (#5182)

## Summary
- Add darker overlay for image generation and document processing cards
to improve text readability on light-colored cover images
- Maintain same gradient positioning as base overlay while increasing
opacity values

## Before/After Screenshots
<img width="1070" height="945" alt="image"
src="https://github.com/user-attachments/assets/7ef48940-fa07-4c14-a4a9-092d1e9bb274"
/>

<img width="1066" height="947" alt="image"
src="https://github.com/user-attachments/assets/643bfbba-2b78-48ae-94ae-ae2039820cf8"
/>

## Test plan
- [x] Verify text is readable on all example cards
- [x] Check overlay doesn't obscure image details unnecessarily
- [x] Test responsive behavior on mobile

## Internal
Closes
https://linear.app/eventual/issue/EVE-875/darken-the-background-overlay-for-the-text-for-examples

v0.6.0

ci: fix test-wheels job in build-wheel.yml (#5134)

## Changes Made

PyPI upload is failing on main due to the test setup. Fixing it here
https://github.com/Eventual-Inc/Daft/actions/runs/17446158050

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

v0.5.22

fix: Fix venv command for windows build (#5073)

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

v0.5.21

docs: Add audio transcription example card (#5020)

## Changes Made

The spiciness continues

v0.5.20

feat: support count(1) in dataframe and choose the cheap column (#4977)

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->
Currently, `count(1)` on a DataFrame is not supported:
```
In [49]: df.count(1).show()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[49], line 1
----> 1 df.count(1).show()

    [... skipping hidden 1 frame]

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:3011, in DataFrame.count(self, *cols)
   3008     raise ValueError("Cannot call count() with both * and column names")
   3010 # Otherwise, perform a column-wise count on the specified columns
-> 3011 return self._apply_agg_fn(Expression.count, cols)

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:2854, in DataFrame._apply_agg_fn(self, fn, cols, group_by)
   2852     groupby_name_set = set() if group_by is None else group_by.to_name_set()
   2853     cols = tuple(c for c in self.column_names if c not in groupby_name_set)
-> 2854 exprs = self._wildcard_inputs_to_expressions(cols)
   2855 return self._agg([fn(c) for c in exprs], group_by)

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:1596, in DataFrame._wildcard_inputs_to_expressions(self, columns)
   1594 """Handles wildcard argument column inputs."""
   1595 column_input: Iterable[ColumnInputType] = columns[0] if len(columns) == 1 else columns  # type: ignore
-> 1596 return column_inputs_to_expressions(column_input)

File /data00/code/tmp2/Daft/daft/utils.py:126, in column_inputs_to_expressions(columns)
    123 from daft.expressions import col
    125 column_iter: Iterable[ColumnInputType] = [columns] if is_column_input(columns) else columns  # type: ignore
--> 126 return [col(c) if isinstance(c, str) else c for c in column_iter]

TypeError: 'int' object is not iterable
```

With this PR:
```
In [5]: df.count(1).show()
╭────────╮
│ count  │
│ ---    │
│ UInt64 │
╞════════╡
│ 4      │
╰────────╯
(Showing first 1 of 1 rows)
```

There is also a performance improvement. Before this change, the benchmark results were as follows:
```
-------------------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------------------
Name (time in ms)                                          Min                   Max                  Mean              StdDev                Median                 IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_count_with_col_name[float_id-float]               18.2832 (1.0)         24.8761 (1.07)        20.1458 (1.0)        2.6823 (1.34)        19.3201 (1.0)        2.0831 (1.0)           1;1  49.6381 (1.0)           5           1
test_count_with_col_name[int_id-int]                   18.6567 (1.02)        23.2190 (1.0)         21.0213 (1.04)       1.9988 (1.0)         21.3333 (1.10)       3.6173 (1.74)          2;0  47.5707 (0.96)          5           1
test_count_with_col_name[string_content-string]       314.5036 (17.20)      633.3133 (27.28)      434.0078 (21.54)    120.1076 (60.09)      414.9809 (21.48)    120.6531 (57.92)         1;0   2.3041 (0.05)          5           1
test_count_with_col_name[binary_content-binary]     1,836.5554 (100.45)   1,889.4419 (81.37)    1,864.8847 (92.57)     22.0420 (11.03)    1,857.1332 (96.12)     34.7329 (16.67)         2;0   0.5362 (0.01)          5           1
-----
```
With this PR, the results are as follows:
```
------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_count_with_col_name[int_id-int]                18.3580 (1.0)      25.2071 (1.00)     21.3138 (1.0)      2.7985 (1.68)     20.9401 (1.0)      4.5452 (2.61)          2;0  46.9180 (1.0)           5           1
test_count_with_col_name[float_id-float]            20.5159 (1.12)     25.1621 (1.0)      22.8351 (1.07)     1.6660 (1.0)      22.9081 (1.09)     1.7429 (1.0)           2;0  43.7923 (0.93)          5           1
test_count_with_col_name[binary_content-binary]     21.8774 (1.19)     27.1461 (1.08)     23.9783 (1.13)     2.1795 (1.31)     24.3946 (1.16)     3.2378 (1.86)          1;0  41.7043 (0.89)          5           1
test_count_with_col_name[string_content-string]     22.6195 (1.23)     30.9669 (1.23)     25.7760 (1.21)     3.5412 (2.13)     23.9304 (1.14)     5.4121 (3.11)          1;0  38.7958 (0.83)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

```
The benchmark test cases are as follows:
```
from __future__ import annotations

import pytest

import daft

PATH = "/tmp/test_count_1"

# Consolidated test configuration with built-in IDs
TEST_CASES = [
    pytest.param("int_id", "int", id="first_col_int"),
    pytest.param("float_id", "float", id="first_col_float"),
    pytest.param("string_content", "string", id="first_col_string"),
    pytest.param("binary_content", "binary", id="first_col_binary"),
]


def generate_data():
    # Adjust column order to ensure the target column is first
    data = {
        "int_id": [1] * 1000,
        "float_id": [0.001] * 1000,
        "string_content": ["a"* 100000] * 1000,
        "binary_content": [b"a" * 1000000] * 1000,
    }

    for case in TEST_CASES:
        param, _ = case.values
        col_order = [param] + ["int_id", "float_id", "string_content", "binary_content"]
        col_order = list(dict.fromkeys(col_order))

        df = daft.from_pydict({k: data[k] for k in col_order})
        
        path = f"{PATH}_{param}"
        df.write_parquet(path, write_mode="overwrite")

generate_data()

@pytest.mark.parametrize("col_name, _", [case.values[:2] for case in TEST_CASES])
def test_count_with_col_name(benchmark, col_name, _):
    """Benchmark count(*) with different first column layouts."""
    def operation():
        path = f"{PATH}_{col_name}"
        df = daft.read_parquet(path)
        return df.count().to_pydict()

    result = benchmark.pedantic(operation, rounds=5, warmup_rounds=1)
    assert result["count"][0] == 1000 

```
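
To illustrate the "choose the cheap column" idea (a hypothetical sketch of the heuristic, not Daft's actual planner code): when rewriting `count(*)`/`count(1)`, prefer a fixed-width column over wide variable-length columns like strings or binary.

```py
# Hypothetical illustration only; the real logic lives inside Daft's planner.
FIXED_WIDTH = {"int32", "int64", "float32", "float64", "bool", "date"}

def pick_cheap_count_column(schema: dict[str, str]) -> str:
    """`schema` maps column name -> dtype name; prefer a fixed-width column."""
    for name, dtype in schema.items():
        if dtype in FIXED_WIDTH:
            return name
    # Fall back to the first column if everything is variable-width.
    return next(iter(schema))

print(pick_cheap_count_column({"string_content": "string", "int_id": "int64"}))  # int_id
```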
## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

---------

Co-authored-by: Colin Ho <colin.ho99@gmail.com>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-22-159.us-west-2.compute.internal>

v0.5.19

fix: Always just use actor for flotilla scheduler (#4978)

## Changes Made

This fixes the problem where users try to run flotilla in an async context, e.g. Jupyter.

I've been meaning to get rid of the 'local' flotilla runner anyway, because it won't work well if users submit multiple Daft jobs to the same cluster: you would spin up multiple flotilla runners and multiple swordfish workers (per job) instead of reusing them.
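
The reuse pattern is roughly the standard Ray named-actor idiom (a sketch under assumed names, not Daft's actual scheduler code): a single detached, named actor is created once and then shared by subsequent jobs.

```py
import ray

@ray.remote
class FlotillaScheduler:  # hypothetical stand-in for the real scheduler actor
    def schedule(self, job):
        ...

# `get_if_exists=True` returns the existing named actor instead of spinning up
# a new scheduler per job, so multiple Daft jobs on one cluster share it.
scheduler = FlotillaScheduler.options(
    name="flotilla_scheduler",
    lifetime="detached",
    get_if_exists=True,
).remote()
```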

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)