
feat(csv/unstable): infer column names from object arrays for stringify() #6122

Merged
merged 19 commits into from
Nov 14, 2024

Conversation

efekrskl
Contributor

@efekrskl efekrskl commented Oct 16, 2024

Closes #3857

Allows usages like the one below by deriving the column names from the elements of the given array.

const data = [{ a: 1 }, { a: 2 }, { b: 3 }];
stringify(data);
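For illustration, here is a minimal self-contained sketch of the inference being proposed (hypothetical helper names; this is not the actual @std/csv implementation). It derives the columns from the first element's keys and emits a header row plus one row per object:

```typescript
// Hypothetical sketch: derive columns from the first object's keys, then emit
// a header row followed by one row per object. Assumes values need no quoting.
function inferColumns(data: Record<string, unknown>[]): string[] {
  return data.length > 0 ? Object.keys(data[0]) : [];
}

function toCsv(data: Record<string, unknown>[]): string {
  const columns = inferColumns(data);
  const rows = data.map((obj) =>
    columns.map((col) => String(obj[col] ?? "")).join(",")
  );
  return [columns.join(","), ...rows].join("\n");
}

const data = [{ a: 1 }, { a: 2 }, { b: 3 }];
// Columns are inferred as ["a"]; the third object's `b` key is not emitted.
console.log(toCsv(data));
```

Note how, with first-element inference, keys that appear only in later objects are silently dropped; that trade-off is the subject of the discussion below.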

@efekrskl efekrskl requested a review from kt3k as a code owner October 16, 2024 22:21
@CLAassistant

CLAassistant commented Oct 16, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the csv label Oct 16, 2024
Comment on lines 76 to 89
await t.step(
{
name: "Invalid data, no columns",
fn() {
const data = [{ a: 1 }, { a: 2 }];
assertThrows(
() => stringify(data),
TypeError,
"No property accessor function was provided for object",
);
},
},
);
await t.step(
Contributor Author

Looks like these were duplicates; I removed both as they are no longer relevant.

@efekrskl efekrskl changed the title csv: stringify infer column names from object arrays feat(csv): infer column names from object arrays for stringify() Oct 16, 2024

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.

Project coverage is 95.84%. Comparing base (8522627) to head (774b2bb).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
csv/unstable_stringify.ts 79.31% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6122      +/-   ##
==========================================
- Coverage   96.48%   95.84%   -0.65%     
==========================================
  Files         535      517      -18     
  Lines       41285    41156     -129     
  Branches     6165     6074      -91     
==========================================
- Hits        39835    39447     -388     
- Misses       1406     1664     +258     
- Partials       44       45       +1     


@timreichen
Contributor

Maybe instead of inferring from just the first element, how about collecting headers from all objects?

@efekrskl
Contributor Author

Maybe instead of inferring from just the first element, how about collecting headers from all objects?

I chose this approach because it seems to be the more common pattern in similar libraries. I'm open to both approaches, though; I wonder what others think.

@Leokuma

Leokuma commented Oct 25, 2024

+1 for collecting all headers. If we don't, the user will be forced to either specify all columns manually or pass a complete object as the first element. There's a risk of losing data.

[
  {"city": "Paris", "area": 90},
  {"city": "Rome", "area": 50, "population": 43},
  {"city": "Tokyo", "altitude": 9},
]

The JSON above would lose data. This is a simple example, but in more complex use cases (e.g. data science), where there are dozens or hundreds of columns, it might not be obvious that you're losing data.

OTOH, if this approach has a performance impact, we should probably add to the docs that specifying columns manually allows the function to run faster.
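The alternative being argued for here can be sketched as a union of keys over all elements (a hypothetical illustration, not the PR's code):

```typescript
// Hypothetical sketch: collect the union of keys across all elements,
// preserving first-seen order, so no column is silently dropped.
function inferAllColumns(data: Record<string, unknown>[]): string[] {
  const columns = new Set<string>();
  for (const obj of data) {
    for (const key of Object.keys(obj)) columns.add(key);
  }
  return [...columns];
}

const cities = [
  { city: "Paris", area: 90 },
  { city: "Rome", area: 50, population: 43 },
  { city: "Tokyo", altitude: 9 },
];
console.log(inferAllColumns(cities));
// → ["city", "area", "population", "altitude"]
```

With this strategy the `population` and `altitude` columns from the example above survive, at the cost of a full pass over the data before any output can be produced.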

@efekrskl
Contributor Author

Thank you both for the feedback. I've updated the implementation to infer the columns from all array elements instead of the first one.

@BlackAsLight
Contributor

Personally I think the option to infer headers should only be done via the first object. Collecting all objects before outputting anything to infer all headers really removes the streaming aspect of it. As it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't have identical keys in each object isn't a problem. This inference is really for the people who are using identical keys in objects.

@timreichen
Contributor

Personally I think the option to infer headers should only be done via the first object. Collecting all objects before outputting anything to infer all headers really removes the streaming aspect of it. As it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't have identical keys in each object isn't a problem. This inference is really for the people who are using identical keys in objects.

But the same argument goes the other way around as well: as it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't want to wait until all headers are collected isn't a problem.

Why should the inference be just for people who are using identical keys in objects? I think this might feel like a bug if data is just omitted as @Leokuma pointed out.

@BlackAsLight
Contributor

But the same argument goes the other way around as well: as it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't want to wait until all headers are collected isn't a problem.

The way it is now forces them to always supply headers. It makes more sense for the change to only require headers when the objects don't all have the same keys. It doesn't make sense for inference to have a huge memory impact, essentially forcing everyone who doesn't want to pay that cost to continue manually supplying headers. If inference has such a huge impact, then it's pointless to add it.

Most people will be dealing with objects with identical headers, and those who aren't should be the ones forced to continue supplying headers manually. Manually supplying headers shouldn't be about performance, but about disambiguating the data source.
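The streaming concern can be made concrete with a small generator sketch (hypothetical, not the std/csv streaming API): with first-object inference the header row can be emitted as soon as the first element arrives, whereas union-of-all-keys inference would require buffering the entire input before yielding anything.

```typescript
// Hypothetical sketch: first-object inference keeps stringification streamable.
function* stringifyRows(
  data: Iterable<Record<string, unknown>>,
): Generator<string> {
  let columns: string[] | undefined;
  for (const obj of data) {
    if (columns === undefined) {
      // The header is known after the very first element; nothing is buffered.
      columns = Object.keys(obj);
      yield columns.join(",") + "\r\n";
    }
    yield columns.map((col) => String(obj[col] ?? "")).join(",") + "\r\n";
  }
}

// → ["a,b\r\n", "1,2\r\n", "3,4\r\n"]
console.log([...stringifyRows([{ a: 1, b: 2 }, { a: 3, b: 4 }])]);
```

Each yielded string can be written out immediately, which is the property that collecting headers from all elements would give up.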

@babiabeo
Contributor

babiabeo commented Oct 27, 2024

I agree. Inferring headers is just a way to help the user detect column names. It's very useful when all objects have the same headers, so users don't need to specify them again.

However, when objects have different fields, it's recommended to supply headers manually to avoid losing data, rather than relying on the inference.

@kt3k
Member

kt3k commented Oct 28, 2024

I'm in favor of inferring only from the first item. As @BlackAsLight points out, the performance impact looks too significant for large data.

@timreichen
Contributor

timreichen commented Oct 28, 2024

I'm in favor of inferring only from the first item. As @BlackAsLight points out, the performance impact looks too significant for large data.

While I agree that this will have a performance hit, selecting the keys of just the first item can lead to "buggy" behavior, because not everyone will expect special treatment of the first element. In that case, I think we should not implement this and keep explicit header declarations.

I did some benchmarks, though, and while the impact is large when comparing the inferColumns() functions side by side, it is not such a big hit when comparing the stringify() calls as a whole:

Benchmarks:

     CPU | Apple M1
Runtime | Deno 2.0.3 (aarch64-apple-darwin)

file:///std/csv/infer_columns_bench.ts

benchmark                                       time/iter (avg)        iter/s      (min … max)           p75      p99     p995
----------------------------------------------- ----------------------------- --------------------- --------------------------
infer columns all elements (1000 elements)              15.5 µs        64,660 ( 13.5 µs … 205.9 µs)  15.1 µs  17.4 µs  29.3 µs
infer columns first only (1000 elements)                10.7 ns    93,560,000 (  9.8 ns …  34.7 ns)  10.2 ns  23.7 ns  25.0 ns
infer columns all elements (10000 elements)            141.1 µs         7,087 (131.3 µs … 325.7 µs) 135.5 µs 278.1 µs 286.2 µs
infer columns first only (10000 elements)               10.8 ns    92,210,000 ( 10.0 ns …  29.0 ns)  10.3 ns  23.8 ns  24.7 ns
infer columns all elements (100000 elements)             1.4 ms         704.8 (  1.3 ms …   1.6 ms)   1.5 ms   1.5 ms   1.6 ms
infer columns first only (100000 elements)              10.7 ns    93,270,000 ( 10.0 ns …  43.4 ns)  10.2 ns  23.5 ns  24.3 ns
infer columns all elements (1000000 elements)           14.3 ms          70.0 ( 14.1 ms …  14.4 ms)  14.3 ms  14.4 ms  14.4 ms
infer columns first only (1000000 elements)             10.7 ns    93,250,000 (  9.8 ns …  33.6 ns)  10.2 ns  24.2 ns  25.8 ns


file:///std/csv/stringify_bench.ts

benchmark                                                 time/iter (avg)        iter/s      (min … max)           p75      p99     p995
--------------------------------------------------------- ----------------------------- --------------------- --------------------------
stringify with infering all elements (1000 elements)             192.6 µs         5,192 (178.1 µs … 528.8 µs) 185.7 µs 361.0 µs 376.5 µs
stringify with infering first only (1000 elements)               177.1 µs         5,645 (163.7 µs … 507.9 µs) 170.7 µs 350.4 µs 361.2 µs
stringify with infering all elements (10000 elements)              3.0 ms         333.5 (  2.7 ms …  19.9 ms)   3.0 ms   6.9 ms  19.9 ms
stringify with infering first only (10000 elements)                2.9 ms         348.9 (  2.5 ms …  13.5 ms)   3.0 ms   5.6 ms  13.5 ms
stringify with infering all elements (100000 elements)            43.4 ms          23.0 ( 39.3 ms …  59.2 ms)  42.6 ms  59.2 ms  59.2 ms
stringify with infering first only (100000 elements)              40.5 ms          24.7 ( 36.0 ms …  55.0 ms)  41.5 ms  55.0 ms  55.0 ms
stringify with infering all elements (1000000 elements)          473.9 ms           2.1 (437.8 ms … 524.9 ms) 490.7 ms 524.9 ms 524.9 ms
stringify with infering first only (1000000 elements)            460.8 ms           2.2 (419.8 ms … 508.2 ms) 469.6 ms 508.2 ms 508.2 ms

@Leokuma

Leokuma commented Oct 28, 2024

Defaulting to inferring from the first object sounds error-prone to me. It could lead to bugs and data loss. IMO that should happen only if the dev explicitly opts into that behavior. JSON stringify, YAML stringify, and TOML stringify don't arbitrarily discard data.

Personally, I see collecting all objects' props as the correct behavior, and we shouldn't sacrifice correctness for performance. But I'll try to run some benchmarks later to see how big the performance impact is on my machine.

@efekrskl
Contributor Author

I'm a bit late to my own party 😅 As I mentioned already, I'm not heavily leaning towards either solution, but if I had to pick, I would go with my first implementation: picking the headers from the first item. If I were an end user, I would expect to pass headers when my data is not uniform. I also like that this matches the approach users coming from the most-used CSV libraries would expect.

@efekrskl
Contributor Author

I reverted the implementation, so now we derive the columns from the first item. I feel this is the correct approach, and I would like to move forward with the PR. Just to go over the thought process one more time:

  • IMO, if one has non-uniform data, it is more natural to pass headers explicitly (in doing so, you "structure" your data)
  • Thanks @timreichen for taking the time to benchmark this. While the impact is indeed minimal, I feel this feature should have absolutely zero impact on performance
  • It seems to be the more common approach

@kt3k
Member

kt3k commented Nov 13, 2024

@efekrskl
Sorry for the delay in review. We currently have a policy of accepting every new feature as an 'unstable' feature first (ref https://github.com/denoland/std/blob/main/.github/CONTRIBUTING.md#suggesting-a-new-feature ) to minimize the impact on the stable APIs.

To align this PR with that policy, I moved the newly added part of this change to unstable_stringify.ts (which will be available from @std/csv/unstable-stringify). Does this change look reasonable to you?

@efekrskl
Contributor Author

@kt3k Of course :) looks great, thanks for the changes.

@kt3k kt3k changed the title feat(csv): infer column names from object arrays for stringify() feat(csv/unstable): infer column names from object arrays for stringify() Nov 14, 2024
Member

@kt3k kt3k left a comment


LGTM!

@kt3k kt3k merged commit 7490085 into denoland:main Nov 14, 2024
20 checks passed

Successfully merging this pull request may close these issues.

CSV - option to automatically infer headers
7 participants