Releases: Eventual-Inc/Daft
v0.6.1
What's Changed 🚀
✨ Features
- feat: expose image attribute as expression @Jay-ju (#4848)
- feat(flotilla): no shuffle for hash join if conditions are met @colin-ho (#5135)
- feat: .list.append Expression @srilman (#5159)
- feat: Base64 Encoding @srilman (#5158)
- feat: adds support for classify_text @rchowell (#5113)
- feat: Add Arrow IPC conversion for RecordBatches @srilman (#5143)
- feat: unnest param on @daft.func @kevinzwang (#5132)
🐛 Bug Fixes
- fix: Account for unschedulable udf actors @colin-ho (#4987)
- fix: Cleanup CLI Progress Bar Output @srilman (#5157)
- fix: flaky test test_transformers_image_embedder_other @kevinzwang (#5130)
🚀 Performance
📖 Documentation
- docs: improve text readability on examples page @ykdojo (#5182)
- docs: add TrendShift badge to README @ykdojo (#5181)
- docs: improve explode method documentation with null/empty list examples @ykdojo (#5164)
- docs: fix broken tutorial links and remove redundant file @ykdojo (#5154)
👷 CI
- ci: Reduce free disk space time @colin-ho (#5178)
- ci: re-add mac os unit tests on main @kevinzwang (#5163)
- ci: fix TPC-H benchmark workflows @kevinzwang (#5123)
- ci: Pipe unit test failures through duration aggregator @colin-ho (#5161)
- ci: remove macos from PR test suite @kevinzwang (#5142)
- ci: Aggregate test durations @colin-ho (#5129)
🔧 Maintenance
Full Changelog: v0.6.0...v0.6.1
v0.6.0
What's Changed 🚀
v0.6.0 marks the official release of our new Ray-based distributed engine, Flotilla! If you are already using the Ray runner, you do not need to change anything. Setting the DAFT_RUNNER=ray environment variable, or calling daft.context.set_runner_ray() within your Python program, will use Flotilla by default.
All operations except cross join, sort merge join, and pivot are currently supported. We will be working on adding support for them soon! If you need to use the legacy Ray runner, please set daft.set_execution_config(use_legacy_ray_runner=True).
💥 Breaking Changes
SQLCatalog was deprecated in v0.5 and is now removed, in favor of the bindings kwargs.
Before:
catalog = SQLCatalog({"test_data": df})
result = daft.sql("SELECT * FROM test_data", catalog=catalog)
After:
bindings = {"test_data": df}
result = daft.sql("SELECT * FROM test_data", **bindings)
- feat!: revert daft.func behavior on literal arguments @kevinzwang (#5087)
- revert!: "revert: Temporarily revert "Remove deprecated APIs for 0.6" @desmondcheongzx (#5084)
✨ Features
- feat(embed_text): Support LM Studio as a provider @desmondcheongzx (#5103)
- feat: Implement embed_image() @desmondcheongzx (#5101)
- feat!: revert daft.func behavior on literal arguments @kevinzwang (#5087)
- feat: Automatically grab embedding dimensions for sentence transformers @desmondcheongzx (#5078)
- feat: add mcap datasource reader @Jay-ju (#4727)
🐛 Bug Fixes
- fix: Undo skipcheck change @srilman (#5131)
- fix: fix youtube video reading @rchowell (#5126)
- fix: Remove flotilla fallback @colin-ho (#5114)
- fix: Add nulls in json reads if a line doesn't contain the field from the schema @colin-ho (#4993)
- fix: Check if UDFs are Serializable @srilman (#5091)
- fix: nightly property test @malcolmgreaves (#5076)
- fix: Handle Unserializable Errors in Process UDFs @srilman (#5075)
- fix: Implement Multi-Column Aggregations with List-like columns @srilman (#5017)
🚀 Performance
- perf: Implement count pushdown for parquet @desmondcheongzx (#5038)
- perf(flotilla): Use Worker Affinity with Pre-Shuffle Merge @srilman (#5112)
- perf: Split UDFs from Filters @srilman (#5070)
- perf(embed_text): Let Sentence Transformers select the best available device @desmondcheongzx (#5082)
♻️ Refactor
📖 Documentation
- docs: fix navigation labels to match section names @ykdojo (#5121)
- docs: fix flickering typewriter animation on overview page @ykdojo (#5118)
- docs: Add batch inference use case @desmondcheongzx (#5116)
- docs: Add docs for custom data sources and sinks @desmondcheongzx (#5115)
- docs: add dark mode support for Algolia DocSearch @ykdojo (#5109)
- docs: add noindex tag to non-stable pages @jaychia (#5105)
- docs: Add text guide @desmondcheongzx (#5102)
- docs: Improve installation instructions @desmondcheongzx (#5094)
- docs: More fixes to the overview page in light mode @desmondcheongzx (#5095)
- docs: Document write_turbopuffer in the user guide @desmondcheongzx (#5092)
👷 CI
- ci: fix test-wheels job in build-wheel.yml @kevinzwang (#5134)
- ci: Truncate the # of concurrent jobs in PR CI @srilman (#5122)
- ci: Run tests before publish @colin-ho (#5009)
- ci: Always run the unit-tests required check @colin-ho (#5119)
- ci: Do not skip postmerge tests @desmondcheongzx (#5096)
🔧 Maintenance
- chore: Add AGENTS.md @srilman (#5124)
- chore: Remove docs codeowners @desmondcheongzx (#5111)
- chore: Clean up write_turbopuffer guide @desmondcheongzx (#5093)
⏪ Reverts
- revert!: "revert: Temporarily revert "Remove deprecated APIs for 0.6" @desmondcheongzx (#5084)
Full Changelog: v0.5.22...v0.5.23
v0.5.22
What's Changed 🚀
💥 Breaking Changes
- refactor!: use struct datatype as daft representation of tuples @universalmind303 (#5030)
✨ Features
- feat: Add uv.lock to git @desmondcheongzx (#5065)
- feat: Add Hash Function Support for Decimal128, Time, Timestamp, Timestamptz Datatypes @Zyiqin-Miranda (#5026)
- feat: pushdown for lance scan @Jay-ju (#4710)
- feat: add lance merge_column task @Jay-ju (#5008)
- feat: Make the max parallel of scan tasks configurable for Native Runner @plotor (#5018)
- feat: basic generator udf @kevinzwang (#5036)
- feat: implements an openai provider with embed_text @rchowell (#4997)
- feat: daft.File object store support @universalmind303 (#5002)
🐛 Bug Fixes
- fix: Fix venv command for windows build @colin-ho (#5073)
- fix: add setuptools_scm to build wheel requirements @colin-ho (#5072)
- fix: Use cachebusting and range request fallback for HTTP requests to Hugging Face CDNs @desmondcheongzx (#5061)
- fix: Use async for starting and calling udf actors in flotilla @colin-ho (#5000)
- fix: Always refresh tqdm when updating total @colin-ho (#5033)
- fix: Fix docs build @desmondcheongzx (#5066)
- fix: require uv as prerequisite for development setup @ykdojo (#5059)
- fix: Add missing source command in Makefile install-docs-deps target @ykdojo (#5060)
- fix: Mermaid syntax error when enable explain analyze for Native Runner @plotor (#5052)
- fix: clean notebook output before running tests & tweak doc proc notebook @malcolmgreaves (#5055)
- fix: correct Modin query optimizer value in comparison tables @ykdojo (#4983)
- fix: skip credentialed tests if not from main @rchowell (#5048)
- fix: subprocess UDF inherits current process env @rchowell (#5047)
- fix: sql/spark read_iceberg and read_deltalake @kevinzwang (#5035)
- fix(blc): Disabled pipefail @rohitkulshreshtha (#5031)
♻️ Refactor
- refactor!: use struct datatype as daft representation of tuples @universalmind303 (#5030)
📖 Documentation
- docs: Make overview page legible for light mode @desmondcheongzx (#5067)
- docs: Move custom python code higher up in docs @desmondcheongzx (#5064)
- docs: Add better description in overview page @jaychia (#5063)
- docs: remove core_concepts.md and broken anchor link references @ykdojo (#5062)
- docs: fix formatting @rchowell (#4994)
- docs: remove runllm widget @ccmao1130 (#5056)
- docs: add reo script to docs @ccmao1130 (#5049)
- docs: fix broken UDF link due to core_concepts.md redirect @ykdojo (#5022)
- docs: fix typo "Github" --> "GitHub" @metonym (#5025)
- docs: fix df.limit link in quickstart.md @rockokw (#5013)
👷 CI
- ci: Don't run pr test suite on non-code changes fr @desmondcheongzx (#5057)
🔧 Maintenance
- chore: Remove deprecated APIs for 0.6 @colin-ho (#5050)
- chore: disable hugging face library progress bars @kevinzwang (#5040)
- chore: relax assertion in flaky sharding distribution test @Jay-ju (#5053)
- chore(dev): use pyproject.toml to manage the dev dependencies @xy-xin (#4849)
- chore: random the counter during creating DistributedActorPoolProject… @stayrascal (#5039)
⏪ Reverts
- revert: Temporarily revert "Remove deprecated APIs for 0.6" @desmondcheongzx (#5068)
Full Changelog: v0.5.21...v0.5.22
v0.5.21
What's Changed 🚀
✨ Features
- feat: Propagate morsel size top-down in swordfish @colin-ho (#4894)
- feat: DataFrame.write_huggingface @kevinzwang (#5015)
🐛 Bug Fixes
- fix(blc): Attempt to fix the broken link checker. @rohitkulshreshtha (#5010)
- fix: Print UDF stdout and Daft logs above the progress bar @srilman (#4861)
📖 Documentation
- docs: Add audio transcription example card @desmondcheongzx (#5020)
- docs: improve audio transcription example @universalmind303 (#4990)
- docs: Spice up the examples page @desmondcheongzx (#5019)
🔧 Maintenance
Full Changelog: v0.5.20...v0.5.21
v0.5.20
What's Changed 🚀
💥 Breaking Changes
- feat!: RowWiseUdf.eval for eager evaluation @kevinzwang (#4998)
✨ Features
- feat: support count(1) in dataframe and choose the cheap column @huleilei (#4977)
- feat: add clickhouse data sink @huleilei (#4850)
- feat: implement distributed sort in flotilla engine @ohbh (#4991)
- feat!: RowWiseUdf.eval for eager evaluation @kevinzwang (#4998)
- feat: basic read_huggingface functionality @kevinzwang (#4996)
- feat: support using max() and min() on list of boolean values @varun117 (#4989)
- feat: Flotilla pre-shuffle merge @colin-ho (#4873)
- feat: Flotilla into partitions @colin-ho (#4963)
- feat(optimizer): Add Lance count() pushdown optimization @huleilei (#4969)
- feat: adds video frame streaming source @rchowell (#4979)
- feat: Add offset support to Spark Connect @plotor (#4962)
- feat: new daft.File datatype @universalmind303 (#4959)
- feat: unify all Daft type to Python type conversions @kevinzwang (#4972)
🐛 Bug Fixes
- fix: Can translate sort in flotilla @colin-ho (#5005)
- fix: Lazily import pil in infer dtype @colin-ho (#5004)
- fix: Lazily import pyarrow when importing daft @colin-ho (#4999)
- fix: lance schema does not work @ddupg (#4940)
- fix: correct possessive apostrophe typo in README @ykdojo (#4984)
- fix: correct GitHub capitalization and add missing period in README @ykdojo (#4985)
- fix: ignore NotFound error of the non-first list during iter dir @stayrascal (#4891)
- fix: S3 multipart upload redirect to correct region @kevinzwang (#4865)
♻️ Refactor
📖 Documentation
👷 CI
- ci: Don't run pr test suite on non-code changes @desmondcheongzx (#4992)
- ci: No progress bar in CI @colin-ho (#4988)
🔧 Maintenance
Full Changelog: v0.5.19...v0.5.20
v0.5.19
What's Changed 🚀
We have a pretty crazy release this time around. Some especially notable features include:
- Interactive DataFrames in Jupyter Notebooks, with special support for some multimodal types
- An async API for LLM text generation, particularly with OpenAI
- A new .into_batches DataFrame API, the modern alternative to .into_partitions
- Adding support for the .offset / OFFSET operator across the engine. Thanks @plotor for the great work!
- Various Flotilla performance and reliability improvements
- Various casting improvements
✨ Features
- feat: Async open ai llm generate @colin-ho (#4879)
- feat: Add offset syntax support to SQL @plotor (#4707)
- feat: adds support for SQL GROUP BY column position @rchowell (#4955)
- feat: better dtype type inference @universalmind303 (#4973)
- feat: Casting from Python into struct or list types @srilman (#4957)
- feat: support creating partitioned tables in Iceberg via the Catalog interface. @redpheonixx (#4951)
- feat: implement into_batches operator on flotilla distributed engine @ohbh (#4958)
- feat: literal variants for (pretty much) all types @kevinzwang (#4947)
- feat: Add offset support to Flotilla Engine @plotor (#4918)
- feat: implement into_batches on the swordfish native daft runner @ohbh (#4935)
- feat: Flotilla broadcast join @colin-ho (#4867)
🐛 Bug Fixes
- fix: Always just use actor for flotilla scheduler @colin-ho (#4978)
- fix: Add handle for swordfish runtime stats manager @colin-ho (#4970)
- fix: Dedup lance read required columns @xloya (#4967)
- fix: Don't use wildcard for logical plan match in pushdown rules @colin-ho (#4945)
- fix: Coerce arrow schema for parquet decoding @colin-ho (#4948)
- fix: use associate type for swordfish into_batches operator state @ohbh (#4956)
- fix: raise error on invalid cross join parameters @rchowell (#4952)
- fix: interactive html fixes @colin-ho (#4943)
♻️ Refactor
📖 Documentation
- docs: update links in document processing example @ccmao1130 (#4946)
- docs: improve daft.func documentation and type inference @universalmind303 (#4942)
- docs: fix link for pandas @universalmind303 (#4941)
👷 CI
🔧 Maintenance
👋 New Contributors
- @redpheonixx made their first contribution in #4951
Full Changelog: v0.5.18...v0.5.19
v0.5.18
What's Changed 🚀
✨ Features
- feat: adds column set visitor and use in pushdowns @rchowell (#4929)
- feat: async @daft.func @universalmind303 (#4908)
- feat: Add offset operator support to DataFrame for Ray Runner @plotor (#4706)
- feat: model resource plumbing for inference expressions @rchowell (#4902)
- feat: Flotilla eager limit @colin-ho (#4887)
- feat: implement random repartition in flotilla distributed engine @ohbh (#4893)
- feat: Add support for specifying hash algorithm used in expression.hash() func. @Zyiqin-Miranda (#4640)
🐛 Bug Fixes
- fix: Batch RuntimeSubscriber updates for all nodes @srilman (#4932)
- fix: Column Ordering in UDF & Project Optimizations @srilman (#4923)
- fix: Refactor Progress Bar to be a RuntimeStatSubscriber @srilman (#4837)
🚀 Performance
- perf: scalar udfs use same optimizations as legacy udfs @universalmind303 (#4931)
- perf: improve @daft.func performance by ~27% @universalmind303 (#4920)
♻️ Refactor
- refactor: move literal to daft-core @kevinzwang (#4928)
📖 Documentation
- docs: Generate files for llms.txt @desmondcheongzx (#4937)
- docs: Pull colab examples into pages @desmondcheongzx (#4936)
- docs: Add embeddings generation example @desmondcheongzx (#4934)
- docs: improve navigation for functions doc pages @kevinzwang (#4924)
- docs: getdaft.io → daft.ai @ccmao1130 (#4926)
- docs: add video to examples @ccmao1130 (#4915)
- docs: adds coalesce to docs @rchowell (#4909)
- docs: fix doctest formatting errors @rchowell (#4911)
- docs: add docstrings to I/O & DataFrame methods (issue #4124) @TheOphige (#4854)
🔧 Maintenance
- chore: config isort known_third_party to fix import formatting errors @Jay-ju (#4840)
- chore: add warning on repartition in native runner @kevinzwang (#4910)
Full Changelog: v0.5.17...v0.5.18
v0.5.17
What's Changed 🚀
📖 Documentation
- docs: add examples to docs @ccmao1130 (#4903)
- docs: Fixing Link to Resource Requests from Managing Memory Usage Page @madvart (#4901)
🔧 Maintenance
- chore: change series literal to list @kevinzwang (#4896)
Full Changelog: v0.5.16...v0.5.17
v0.5.16
What's Changed 🚀
✨ Features
- feat: Interactive jupyter display @colin-ho (#4835)
- feat: supports passing a projection kwargs in select @rchowell (#4884)
- feat: Add offset operator support to DataFrame for Native Runner @plotor (#4582)
🐛 Bug Fixes
- fix: Return from streaming sink if channel closed @colin-ho (#4885)
- fix: No empty turbopuffer write @colin-ho (#4897)
- fix: Use planning config instead of env variables during filter pushdowns @desmondcheongzx (#4888)
- fix: azure storage resource url @kevinzwang (#4892)
🚀 Performance
- perf: Add parallel execution for Python UDFs with batched GIL acquisitions @universalmind303 (#4886)
📖 Documentation
- docs: Remove redirect for installation page @desmondcheongzx (#4895)
Full Changelog: v0.5.15...v0.5.16
v0.5.15
What's Changed 🚀
✨ Features
- feat: add openai provider in llm_generate function @huleilei (#4809)
- feat: Use shuffle_aggregation_default_partitions in flotilla aggregate @colin-ho (#4869)
- feat: abstract the interface of scan pushdown @Jay-ju (#4772)
- feat: Add get_or_infer_runner_type to support getting runner type from context @plotor (#4810)
- feat: support glob multiple path @stayrascal (#4811)
🐛 Bug Fixes
- fix(runtime): Reduce thread name length for compute and I/O threadpools @rohitkulshreshtha (#4877)
- fix: import ParamSpec from typing_extensions only for python < 3.10 @kevinzwang (#4878)
- fix: Allow file:/ schemes in read_iceberg @colin-ho (#4843)
📖 Documentation
- docs: Restructure docs to target users @desmondcheongzx (#4875)
🔧 Maintenance
Full Changelog: v0.5.14...v0.5.15