This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

[Proposal] Combination of arrow-rs and arrow2, deprecation of arrow2 repository #1429

Open
alamb opened this issue Mar 7, 2023 · 26 comments
Labels
feature A new feature help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged.

Comments

@alamb
Collaborator

alamb commented Mar 7, 2023

This proposal is a response to the apparent desire to unify the Rust Arrow development communities going forward, and it is my summary of the discussion on apache/arrow-rs#1176 (specifically apache/arrow-rs#1176 (comment)). There is a summary with some diagrams that may be helpful in https://docs.google.com/presentation/d/1cqQEpC-kJES2Mng152r_qZyaOqHjtb5YFuseSTWyulU/edit

I am trying to help the community with this proposal. Please provide your feedback -- I don't expect to proceed unless there is consensus that this is a desired path.

Current state

  1. Both arrow-rs and arrow2 offer Rust implementations of Apache Arrow
  2. arrow-rs is under active development, is governed by the Apache Software Foundation (ASF), and has adequate resources for maintenance
  3. arrow2 is (maybe?) struggling to find sufficient maintenance capacity
  4. It is challenging for users to build applications that mix and match libraries that use arrow-rs (e.g. datafusion) and arrow2 (e.g. pola.rs).

Proposal 1 End State (alternate options are listed below)

  1. arrow-rs continues to be maintained and released under the ASF guidelines
  2. The arrow2 codebase is moved into the existing apache repo https://github.com/apache/arrow-experimental-rs-arrow2 and deprecated after some period (e.g. 6 months)
  3. All relevant / needed functionality is ported from arrow2 to arrow-rs during the deprecation period

This will require users of arrow2 to either change their code or find alternate maintainers with sufficient capacity for arrow2 if they want a maintained version of arrow going forward.

Prior to the Deprecation Period

  1. The arrow2 code goes through the ASF IP clearance process
  2. arrow2 code is brought into the existing experimental repo (https://github.com/apache/arrow-experimental-rs-arrow2) with a deprecation note; https://github.com/jorgecarleitao/arrow2 is marked read only, directing users to the experimental repo.
  3. During the deprecation period, the community (whoever needs it) continues to maintain the arrow-experimental-rs-arrow2 repo; we can release non-official ASF releases, etc.

After the Deprecation Period

  1. https://github.com/apache/arrow-experimental-rs-arrow2 is marked as read only

Proposed Milestones

Milestone 1: Ergonomic (and Zero runtime cost) conversion

  1. @tustvold is creating a zero cost abstraction Strongly Typed ArrayData apache/arrow-rs#1799 (see diagrams in this presentation)
  2. @tustvold and/or I will then make a PR to arrow2 that replaces its use of Buffer/ForeignVec with PrimitiveArrayData, as described in Discussion: relationship / unification of arrow-rs and arrow2 going forward apache/arrow-rs#1176 (comment)

This will result in:

  1. arrow2 will depend on array-data rather than foreign_vec
  2. It will be ergonomic (and zero runtime cost) to convert (with From<..> impls) between the arrow-rs and arrow2 Array types
  3. People can mix and match io, kernels, etc (e.g. parquet with arrow2 and parquet2 with arrow)
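To make the "zero runtime cost" conversion idea concrete, here is a minimal sketch using only the standard library. All type names (`SharedPrimitiveData`, `ArrayA`, `ArrayB`) are hypothetical stand-ins and are not the real arrow-rs or arrow2 types; the assumption is simply that both array types wrap the same reference-counted data, so the `From` impls only move a pointer and never copy values.

```rust
use std::sync::Arc;

/// Hypothetical shared representation, standing in for the proposed
/// strongly typed array data. Not a real arrow-rs or arrow2 type.
#[derive(Debug, Clone)]
pub struct SharedPrimitiveData {
    values: Arc<Vec<i64>>,
}

/// Hypothetical stand-in for an arrow-rs style array.
#[derive(Debug, Clone)]
pub struct ArrayA {
    data: SharedPrimitiveData,
}

/// Hypothetical stand-in for an arrow2 style array.
#[derive(Debug, Clone)]
pub struct ArrayB {
    data: SharedPrimitiveData,
}

impl ArrayA {
    /// Takes ownership of the Vec; the values themselves are not copied.
    pub fn from_values(values: Vec<i64>) -> Self {
        Self { data: SharedPrimitiveData { values: Arc::new(values) } }
    }

    pub fn values(&self) -> &[i64] {
        self.data.values.as_slice()
    }
}

impl ArrayB {
    pub fn values(&self) -> &[i64] {
        self.data.values.as_slice()
    }
}

// Because both arrays wrap the same data type, converting in either
// direction just moves the Arc: no allocation and no copy of the values.
impl From<ArrayA> for ArrayB {
    fn from(a: ArrayA) -> Self {
        Self { data: a.data }
    }
}

impl From<ArrayB> for ArrayA {
    fn from(b: ArrayB) -> Self {
        Self { data: b.data }
    }
}
```

With types shaped like this, a caller could write `let b: ArrayB = a.into();` and the underlying values would stay at the same address after the conversion.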

Milestone 2: Deprecation period

  1. arrow2 is marked as deprecated, giving users fair warning
  2. Releases continue to be made, on demand, as today
  3. Users decide how they want to proceed after deprecation period is over (port entirely to arrow-rs, make other arrangements for arrow2 maintenance, etc)

Milestone 3: Deprecation is over, arrow2 is archived

  1. arrow2 repository is marked read only, with pointers to arrow-rs

Alternatives Considered

Proposal 2: deprecate arrow2 but do not run (explicit) IP clearance

  1. leave arrow2 in https://github.com/jorgecarleitao/arrow2 and do not run the IP clearance process

Pros: less work
Cons: potentially complicates the 'porting code from arrow2 to arrow' process as the IP provenance isn't as clear. However, since the entire arrow2 repo is Apache 2.0 licensed, it may be OK.

Proposal 3: arrow2 continues development outside the ASF

  1. leave arrow2 in https://github.com/jorgecarleitao/arrow2 and the arrow2 community steps up and contributes maintenance (reviewing / merging PRs, etc)
  2. We still complete the "easy interoperability" steps that allow ergonomic and zero cost conversion between arrow2 and arrow-rs

Pros: Users of arrow2 can continue to use it without any migration or effort
Cons: Unclear where the resources will come from

Proposal 4: Do nothing

@alamb
Collaborator Author

alamb commented Mar 7, 2023

I wonder if @jorgecarleitao @ritchie46 @b41sh or others can share their perspectives on this proposal.

@sundy-li
Collaborator

sundy-li commented Mar 7, 2023

Since arrow-rs already has the typed array, maybe it's more convenient to move forward with Proposal 2?

Arrow2 has many improvements and features that arrow-rs may lack; we can port them into arrow-rs via PRs. We may need 2 or 3 weeks to finish the migration in Databend (it will be done in another branch).

@jorgecarleitao jorgecarleitao added help wanted Extra attention is needed feature A new feature investigation Issues or PRs that are investigations. Prs may or may not be merged. labels Mar 7, 2023
@Amdahl-rs

Do nothing !!!

@ritchie46
Collaborator

Thanks for the overview @alamb.

I am in favor of Proposal 3, where the core buffers are in arrow-rs and we have arrow2 as an alternative API/kernels/IO layer on top of the officially maintained arrow spec. My spare cycles are very low, and going this path gives polars enough time to transition. Your proposal document looks very promising in that regard. 👍

I believe that this will hit the most important pain point in the community, that is that currently there is no way to interop with arrow-rs/arrow2 without paying a huge compilation cost.

@tustvold
Contributor

My spare cycles are very low, and going this path gives polars enough time to transition

Would you be willing to receive PRs to polars to help this transition along, or do you see there being fundamental blockers to this? I'm very motivated to get everyone working on the same implementation if possible, rather than splitting our efforts. The interop, in my mind, is just a means to this end.

@alamb
Collaborator Author

alamb commented Mar 15, 2023

Specifically, perhaps we can help show how to use parquet (rather than parquet2) in polars using this new interop -- and then let any other migration take its course

tustvold added a commit to tustvold/arrow2 that referenced this issue Mar 17, 2023
tustvold added a commit to tustvold/arrow2 that referenced this issue Apr 12, 2023
@sophiajt

Just checking in here. Has there been any further progress on the merger? We on the nushell team are hoping we can move onto the resulting library once it's ready.

stormasm added a commit to stormasm/nunotes that referenced this issue Jan 16, 2024
@tustvold
Contributor

tustvold commented Jan 16, 2024

I am not aware of any further work under this issue, the state of play as I understand it is that:

My understanding is that polars-arrow is intended to serve the needs of polars, and not as a general purpose arrow library (although @ritchie46 please correct me if I am wrong), and therefore workloads should look to switch over to using arrow-rs.

Edit: unless of course those workloads are integrated with polars, in which case I guess they switch to using polars-arrow??

@stormasm

stormasm commented Jan 16, 2024

I am not aware of any further work under this issue, the state of play as I understand it is that:

My understanding is that polars-arrow is intended to serve the needs of polars, and not as a general purpose arrow library (although @ritchie46 please correct me if I am wrong), and therefore workloads should look to switch over to using arrow-rs.

Edit: unless of course those workloads are integrated with polars, in which case I guess they switch to using polars-arrow??

@tustvold Thanks for your quick response!
Really appreciate the update.
It will help us on the nushell team plan ahead.

@sophiajt

@tustvold - Thanks for the quick and thorough response. It's much appreciated.

We'll reach out to @ritchie46 to chat about compatibility concerns we need to be aware of if we go with arrow-rs.

Thanks again for the help.

@alamb
Collaborator Author

alamb commented Jan 16, 2024

It appears that databend, another of the historical major users of arrow2, has also switched to arrow-rs: https://github.com/search?q=repo%3Adatafuselabs%2Fdatabend+arrow+language%3ATOML&type=code&l=TOML

@Dandandan
Collaborator

Maybe it makes sense to add a note to the readme of this repo explaining the status as well?

@MichaelScofield

We (GreptimeDB) have also switched to arrow-rs from arrow2.

@alamb
Collaborator Author

alamb commented Jan 17, 2024

Maybe it makes sense to add a note to the readme of this repo explaining the status as well?

Here is a proposal to add a note to the readme: #1606

@ozgrakkurt
Contributor

This library consistently outperforms the parquet crate's decode time (by almost 2x) when reading the same files for me. I am not sure it should be abandoned; maybe some people want to take over the maintenance.

Also, it has completely separate decoding and IO parts for parquet, which is very useful.

@alamb
Collaborator Author

alamb commented Jan 30, 2024

I agree it would be great to get some additional maintainers (or maybe figure out how to port whatever is working well for you to the parquet crate)

@ozgrakkurt
Contributor

I agree it would be great to get some additional maintainers (or maybe figure out how to port whatever is working well for you to the parquet crate)

Actually, sorry about the performance claim: I was parallelizing with arrow2. I'll try to parallelize the parquet version as well, but parquet seems to be slightly faster if neither of them is parallelized.

I'll try to separate decoding and IO with the parquet library as well.
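For readers comparing the two crates, the following is a minimal sketch of what "separating decoding and IO" and then parallelizing only the decode step can look like. It uses only the standard library; `read_chunks` and `decode_chunk` are hypothetical placeholders (a fake in-memory "file" of little-endian `u32` values), not functions from arrow2, parquet2, or the parquet crate. A fair benchmark would run both libraries in the same mode, either both serial or both parallel.

```rust
use std::thread;

/// Hypothetical "IO" step: fetch the raw bytes of each chunk (for Parquet
/// this would be something like a row group or column chunk). Here the
/// bytes are generated in memory instead of being read from a file.
pub fn read_chunks(num_chunks: usize, values_per_chunk: usize) -> Vec<Vec<u8>> {
    (0..num_chunks)
        .map(|c| {
            (0..values_per_chunk)
                .flat_map(|i| ((c * values_per_chunk + i) as u32).to_le_bytes())
                .collect::<Vec<u8>>()
        })
        .collect()
}

/// Hypothetical CPU-bound "decode" step: interpret bytes as little-endian u32.
pub fn decode_chunk(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Decode every chunk on the calling thread.
pub fn decode_serial(chunks: &[Vec<u8>]) -> Vec<Vec<u32>> {
    chunks.iter().map(|c| decode_chunk(c)).collect()
}

/// Decode every chunk on its own scoped thread. Because IO already happened,
/// the threads only borrow the bytes and do pure CPU work.
pub fn decode_parallel(chunks: &[Vec<u8>]) -> Vec<Vec<u32>> {
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|c| s.spawn(move || decode_chunk(c)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("decode thread panicked"))
            .collect()
    })
}
```

One thread per chunk is only reasonable for a handful of chunks; a real workload would more likely use a thread pool, which is left out here to keep the sketch dependency-free.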

@alamb
Collaborator Author

alamb commented Feb 6, 2024

Just to add a little clarity on this document.

Ideally there would be some sort of stronger declaration about this crate -- either that it was deprecated and urging people to move to a maintained crate, or that someone / a group was rallying a community to maintain it going forward.

However, given the current lack of community engagement / maintenance, from my perspective, the challenge is that it is not at all clear who would make such a decision and no one seems willing to put the time in to chart a path forward.

I am not sure if @jorgecarleitao, as the original author, has any thoughts to share on this matter.

@jorgecarleitao
Owner

jorgecarleitao commented Feb 7, 2024

Would marking this repo as archived be a reasonable action (besides the note in the README)?

@datapythonista
Contributor

Personally, I think it'd be helpful for many visitors that besides the repo being archived there is a short note in the README with what users are expected to use instead of this repo. And links to the official Arrow-rs, maybe the Polars copy, and a link to this issue too.

@infogulch

If we want to adjust the notice at the beginning of the readme, perhaps something like:

Important

This repository has been superseded by the official Apache Arrow repository, arrow-rs, and is no longer maintained.
See [Proposal] Combination of arrow-rs and arrow2, deprecation of arrow2 repository #1429 for the discussion of this change, including features and performance enhancements that were ported during the transition.

Polars, which was the original motivation for this project, is now maintaining its own pared-down clone of arrow2 in-tree.

> [!IMPORTANT]
> This repository has been superseded by the official Apache Arrow repository, [arrow-rs](https://github.com/apache/arrow-rs), and is no longer maintained.
> See [[Proposal] Combination of arrow-rs and arrow2, deprecation of arrow2 repository #1429](https://github.com/jorgecarleitao/arrow2/issues/1429) for the discussion of this change, including features and performance enhancements that were ported during the transition.
> 
> [Polars](https://github.com/pola-rs/polars), which was the original motivation for this project, is now maintaining its own pared-down clone of arrow2 in-tree.

@urvishdesai

urvishdesai commented Feb 7, 2024

Is there still a plan to include arrow2 arrays as a part of the arrow-rs library like it was discussed earlier in the form of something like PhysicalArray?

I was working on a PR to add the newly introduced binary_view and utf8_view (in the Arrow spec and C ABI) to arrow2. I have 90% of the code ready. I saw the deprecation notice and I am not sure how to proceed further.
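For context on what the view types involve, here is a minimal sketch of the 16-byte view layout as described in the Arrow columnar format: a 4-byte length, followed either by up to 12 inline bytes (zero padded) or by a 4-byte prefix, a 4-byte data buffer index, and a 4-byte offset. The function names `encode_view` and `decode_view` are hypothetical and are not part of arrow2 or arrow-rs; the sketch also assumes little-endian encoding and ignores validity bitmaps.

```rust
/// Encodes one value as a 16-byte view (hypothetical helper).
/// `buffer_index` and `offset` are only used when the value does not fit inline.
pub fn encode_view(value: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    // The format stores the length as a 32-bit signed integer.
    assert!(value.len() <= i32::MAX as usize, "value too long for a view");
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= 12 {
        // Short value: stored inline, the remaining bytes stay zero.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long value: 4-byte prefix, then where to find the full bytes.
        view[4..8].copy_from_slice(&value[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Resolves a view back to its bytes, reading from `buffers` for long values.
pub fn decode_view<'a>(view: &'a [u8; 16], buffers: &'a [Vec<u8>]) -> &'a [u8] {
    let len = u32::from_le_bytes([view[0], view[1], view[2], view[3]]) as usize;
    if len <= 12 {
        &view[4..4 + len]
    } else {
        let buffer_index =
            u32::from_le_bytes([view[8], view[9], view[10], view[11]]) as usize;
        let offset =
            u32::from_le_bytes([view[12], view[13], view[14], view[15]]) as usize;
        &buffers[buffer_index][offset..offset + len]
    }
}
```

The inline prefix is what lets comparisons often short-circuit without touching the data buffers, which is a large part of the appeal of these types.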

I feel that there is not enough clarity on what features have been added from arrow2 to arrow-rs, apart from the interoperability PRs for buffers, arrays and schema. Since the original plan was to merge arrow-rs with arrow2, I am not sure if we are reaping the benefits of arrow2 in terms of performance, safety and intuitive API usage, after the recent commits. Can we summarize the final changes and additions?

@jorgecarleitao @tustvold @alamb Thanks to you and everyone else who contributed to this repo! As an arrow2 user, I would want to know how to go about with my existing code without sacrificing performance and the safe & intuitive design of arrow2 over arrow-rs.

@alamb
Collaborator Author

alamb commented Feb 8, 2024

Is there still a plan to include arrow2 arrays as a part of the arrow-rs library like it was discussed earlier in the form of something like PhysicalArray?

There is no one actively working on this that I know of

I was working on a PR to add the newly introduced binary_view and utf8_view (in arrow spec and c ABI) to arrow2. I have 90% of the code ready. I saw the deprecation notice and I am not sure on how to proceed further.

I feel that there is not enough clarity on what features have been added from arrow2 to arrow-rs, apart from the interoperability PRs for buffers, arrays and schema. Since the original plan was to merge arrow-rs with arrow2, I am not sure if we are reaping the benefits of arrow2 in terms of performance, safety and intuitive API usage, after the recent commits. Can we summarize the final changes and additions?

I don't know of any summary of the additions made to arrow-rs but I would also be interested in any summary you are able to provide -- perhaps you can look at the current API in arrow-rs and judge for yourself if it suits your needs.

@jorgecarleitao @tustvold @alamb Thank you for all your and other contributors' efforts! As an arrow2 user, I would want to know how to go about with my existing code without sacrificing performance and the safe & intuitive design of arrow2 over arrow-rs.

Again, I would encourage you to look at arrow-rs and if there are particular things you find lacking in terms of performance or design, make proposals and work with us to improve its design.

For example, perhaps you would be interested in helping implement string view in arrow-rs -- I filed a ticket to track the work here: apache/arrow-rs#5374

@urvishdesai

urvishdesai commented Feb 8, 2024

I don't know of any summary of the additions made to arrow-rs but I would also be interested in any summary you are able to provide -- perhaps you can look at the current API in arrow-rs and judge for yourself if it suits your needs.

I am actively looking at both the libraries and will try to compile my insights into a document. I am happy to provide some kind of summary for the API difference as well.

However, what I am trying to say is that the note in the README is not enough for users to understand what to do with their existing arrow2 code. Perhaps @jorgecarleitao and @tustvold will be able to provide more info on the most recent changes that they have incorporated into both the repositories for interoperability. Also, does this mean that everyone should transition to arrow-rs, since there are talks to archive arrow2 and mark it as read-only?

@tustvold
Contributor

tustvold commented Feb 8, 2024

@tustvold will be able to provide more info on the most recent changes that they have incorporated into both the repositories for interoperability.

At this point the arrow-rs arrays contain strongly typed buffers, with zero-copy interoperability with Vec. There are some slight differences in generics compared with the arrow2 arrays, but broadly speaking the two array representations are now equivalent. Additionally zero-copy conversions were also added to arrow2 to facilitate interoperability between the projects.
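As an illustration of what "strongly typed buffers with zero-copy interoperability with Vec" means in practice, here is a minimal standard-library sketch. `TypedBuffer` is a hypothetical type written for this example and is not the arrow-rs or arrow2 buffer API; it only assumes that the buffer is a reference-counted `Vec<T>` plus an offset and length.

```rust
use std::sync::Arc;

/// Hypothetical typed buffer: shared ownership of a Vec<T> plus a window into it.
#[derive(Debug, Clone)]
pub struct TypedBuffer<T> {
    data: Arc<Vec<T>>,
    offset: usize,
    len: usize,
}

impl<T> From<Vec<T>> for TypedBuffer<T> {
    /// Takes ownership of the Vec; the elements are not copied.
    fn from(values: Vec<T>) -> Self {
        let len = values.len();
        Self { data: Arc::new(values), offset: 0, len }
    }
}

impl<T> TypedBuffer<T> {
    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }

    /// Zero-copy slice: shares the same allocation with a narrower window.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self { data: Arc::clone(&self.data), offset: self.offset + offset, len }
    }

    /// Gives the Vec back without copying when this buffer is the sole owner
    /// and covers the whole allocation; otherwise returns the buffer unchanged.
    pub fn into_vec(self) -> Result<Vec<T>, Self> {
        if self.offset != 0 || self.len != self.data.len() {
            return Err(self);
        }
        let TypedBuffer { data, offset, len } = self;
        Arc::try_unwrap(data).map_err(|data| TypedBuffer { data, offset, len })
    }
}
```

The design choice worth noting is the fallible `into_vec`: handing the allocation back is only free when nothing else shares it, so the sketch returns the buffer in the error case rather than silently copying.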

There are differences in the broader APIs provided, but in terms of functionality, the only thing I'm aware of that is supported by arrow2 but not (yet) by arrow-rs is Avro.

Also, does this mean that everyone should transition to arrow-rs, since there are talks to archive arrow2 and mark it as read-only?

I think in lieu of a long-term maintenance story for arrow2, this would be my recommendation.

@alamb
Collaborator Author

alamb commented Feb 8, 2024

Also, does this mean that everyone should transition to arrow-rs, since there are talks to archive arrow2 and mark it as read-only?

I think each user should make this decision based on what their needs are. Ideally we would help the decision by providing the type of information you are describing @urvishdesai.
