
feat(csv/unstable): infer column names from object arrays for stringify() #6122

Merged
merged 19 commits into from
Nov 14, 2024

Conversation

efekrskl
Contributor

@efekrskl efekrskl commented Oct 16, 2024

Closes #3857

Allows usages like the one below by deriving the column names from the elements of the given array.

const data = [{ a: 1 }, { a: 2 }, { b: 3 }];
stringify(data);
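For illustration, here is a minimal self-contained sketch of the inference being proposed (hypothetical helper names; this is not the actual @std/csv implementation). It derives the columns from the first element's keys and emits a header row plus one row per object:

```typescript
// Hypothetical sketch: derive columns from the first object's keys, then emit
// a header row followed by one row per object. Assumes values need no quoting.
function inferColumns(data: Record<string, unknown>[]): string[] {
  return data.length > 0 ? Object.keys(data[0]) : [];
}

function toCsv(data: Record<string, unknown>[]): string {
  const columns = inferColumns(data);
  const rows = data.map((obj) =>
    columns.map((col) => String(obj[col] ?? "")).join(",")
  );
  return [columns.join(","), ...rows].join("\n");
}

const data = [{ a: 1 }, { a: 2 }, { b: 3 }];
// Columns are inferred as ["a"]; the third object's `b` key is not emitted.
console.log(toCsv(data));
```

Note how, with first-element inference, keys that appear only in later objects are silently dropped; that trade-off is the subject of the discussion below.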

@efekrskl efekrskl requested a review from kt3k as a code owner October 16, 2024 22:21
@CLAassistant

CLAassistant commented Oct 16, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the csv label Oct 16, 2024
Comment on lines 76 to 89
await t.step(
{
name: "Invalid data, no columns",
fn() {
const data = [{ a: 1 }, { a: 2 }];
assertThrows(
() => stringify(data),
TypeError,
"No property accessor function was provided for object",
);
},
},
);
await t.step(
Contributor Author

Looks like these were duplicates; I removed both as they are no longer relevant.

@efekrskl efekrskl changed the title csv: stringify infer column names from object arrays feat(csv): infer column names from object arrays for stringify() Oct 16, 2024

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.

Project coverage is 95.84%. Comparing base (8522627) to head (774b2bb).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
csv/unstable_stringify.ts 79.31% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6122      +/-   ##
==========================================
- Coverage   96.48%   95.84%   -0.65%     
==========================================
  Files         535      517      -18     
  Lines       41285    41156     -129     
  Branches     6165     6074      -91     
==========================================
- Hits        39835    39447     -388     
- Misses       1406     1664     +258     
- Partials       44       45       +1     


@timreichen
Contributor

Maybe instead of inferring from just the first element, how about collecting headers from all objects?

@efekrskl
Contributor Author

Maybe instead of inferring from just the first element, how about collecting headers from all objects?

I chose this approach because it seems to be the more common pattern in similar libraries. I'm open to both approaches, though; I wonder what others think.

@Leokuma

Leokuma commented Oct 25, 2024

+1 for collecting all headers. If we don't, the user will be forced to either specify all columns manually or pass a complete object as the first element. There's a risk of losing data.

[
  {"city": "Paris", "area": 90},
  {"city": "Rome", "area": 50, "population": 43},
  {"city": "Tokyo", "altitude": 9},
]

The JSON above would lose data. This is a simple example, but in more complex use cases (e.g. data science), where there are dozens or hundreds of columns, it might not be obvious that you're losing data.

OTOH, if this approach has a performance impact, we should probably add to the docs that specifying columns manually allows the function to run faster.
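The alternative being argued for here can be sketched as a union of keys over all elements (a hypothetical illustration, not the PR's code):

```typescript
// Hypothetical sketch: collect the union of keys across all elements,
// preserving first-seen order, so no column is silently dropped.
function inferAllColumns(data: Record<string, unknown>[]): string[] {
  const columns = new Set<string>();
  for (const obj of data) {
    for (const key of Object.keys(obj)) columns.add(key);
  }
  return [...columns];
}

const cities = [
  { city: "Paris", area: 90 },
  { city: "Rome", area: 50, population: 43 },
  { city: "Tokyo", altitude: 9 },
];
console.log(inferAllColumns(cities));
// → ["city", "area", "population", "altitude"]
```

With this strategy the `population` and `altitude` columns from the example above survive, at the cost of a full pass over the data before any output can be produced.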

@efekrskl
Contributor Author

Thank you both for the feedback. I've updated the implementation to infer the columns from all array elements instead of the first one.

@BlackAsLight
Contributor

Personally I think the option to infer headers should only be done via the first object. Collecting all objects before outputting anything to infer all headers really removes the streaming aspect of it. As it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't have identical keys in each object isn't a problem. This inference is really for the people who are using identical keys in objects.

@timreichen
Contributor

Personally I think the option to infer headers should only be done via the first object. Collecting all objects before outputting anything to infer all headers really removes the streaming aspect of it. As it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't have identical keys in each object isn't a problem. This inference is really for the people who are using identical keys in objects.

But the same argument goes the other way around as well: as it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't want to wait until all headers are collected isn't a problem.

Why should the inference be just for people who are using identical keys in objects? I think this might feel like a bug if data is just omitted as @Leokuma pointed out.

@BlackAsLight
Contributor

But the same argument goes the other way around as well: as it stands now, the user is already forced to manually supply all headers, so requiring them to do that when they don't want to wait until all headers are collected isn't a problem.

The way it is now forces them to always supply headers. It makes more sense for the change to only require headers when the objects don't all have the same keys. It doesn't make sense for inference to have a huge memory impact, essentially forcing everyone who doesn't want to pay that cost to continue manually supplying headers. If inference has such a huge impact, then it's pointless to add it.

Most people will be dealing with objects with identical headers, and those who aren't should be the ones forced to continue supplying headers manually. Manually supplying headers shouldn't be about performance, but about disambiguating the data source.
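The streaming concern can be made concrete with a small generator sketch (hypothetical, not the std/csv streaming API): with first-object inference the header row can be emitted as soon as the first element arrives, whereas union-of-all-keys inference would require buffering the entire input before yielding anything.

```typescript
// Hypothetical sketch: first-object inference keeps stringification streamable.
function* stringifyRows(
  data: Iterable<Record<string, unknown>>,
): Generator<string> {
  let columns: string[] | undefined;
  for (const obj of data) {
    if (columns === undefined) {
      // The header is known after the very first element; nothing is buffered.
      columns = Object.keys(obj);
      yield columns.join(",") + "\r\n";
    }
    yield columns.map((col) => String(obj[col] ?? "")).join(",") + "\r\n";
  }
}

// → ["a,b\r\n", "1,2\r\n", "3,4\r\n"]
console.log([...stringifyRows([{ a: 1, b: 2 }, { a: 3, b: 4 }])]);
```

Each yielded string can be written out immediately, which is the property that collecting headers from all elements would give up.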

@babiabeo
Contributor

babiabeo commented Oct 27, 2024

I agree. Inferring headers is just a way to help the user detect column names. It's very useful when all objects have the same headers, so users don't need to specify them again.

However, when objects have different fields, it's recommended to supply headers manually to avoid losing data, rather than relying on the inference.

@kt3k
Member

kt3k commented Oct 28, 2024

I'm in favor of inferring only from the first item. As @BlackAsLight points out, the performance impact looks too significant for large data.

@timreichen
Contributor

timreichen commented Oct 28, 2024

I'm in favor of inferring only from the first item. As @BlackAsLight points out, the performance impact looks too significant for large data.

While I agree that this will have a performance hit, selecting the keys of just the first item can lead to "buggy" behavior, because not everyone will expect special treatment of the first element. In that case, I think we should not implement this and keep explicit header declarations.

I did some benchmarks, though, and while the impact is large when comparing the inferColumns() functions side by side, it is not such a big hit when comparing the stringify() calls as a whole:

Benchmarks:

     CPU | Apple M1
Runtime | Deno 2.0.3 (aarch64-apple-darwin)

file:///std/csv/infer_columns_bench.ts

benchmark                                       time/iter (avg)        iter/s      (min … max)           p75      p99     p995
----------------------------------------------- ----------------------------- --------------------- --------------------------
infer columns all elements (1000 elements)              15.5 µs        64,660 ( 13.5 µs … 205.9 µs)  15.1 µs  17.4 µs  29.3 µs
infer columns first only (1000 elements)                10.7 ns    93,560,000 (  9.8 ns …  34.7 ns)  10.2 ns  23.7 ns  25.0 ns
infer columns all elements (10000 elements)            141.1 µs         7,087 (131.3 µs … 325.7 µs) 135.5 µs 278.1 µs 286.2 µs
infer columns first only (10000 elements)               10.8 ns    92,210,000 ( 10.0 ns …  29.0 ns)  10.3 ns  23.8 ns  24.7 ns
infer columns all elements (100000 elements)             1.4 ms         704.8 (  1.3 ms …   1.6 ms)   1.5 ms   1.5 ms   1.6 ms
infer columns first only (100000 elements)              10.7 ns    93,270,000 ( 10.0 ns …  43.4 ns)  10.2 ns  23.5 ns  24.3 ns
infer columns all elements (1000000 elements)           14.3 ms          70.0 ( 14.1 ms …  14.4 ms)  14.3 ms  14.4 ms  14.4 ms
infer columns first only (1000000 elements)             10.7 ns    93,250,000 (  9.8 ns …  33.6 ns)  10.2 ns  24.2 ns  25.8 ns


file:///std/csv/stringify_bench.ts

benchmark                                                 time/iter (avg)        iter/s      (min … max)           p75      p99     p995
--------------------------------------------------------- ----------------------------- --------------------- --------------------------
stringify with infering all elements (1000 elements)             192.6 µs         5,192 (178.1 µs … 528.8 µs) 185.7 µs 361.0 µs 376.5 µs
stringify with infering first only (1000 elements)               177.1 µs         5,645 (163.7 µs … 507.9 µs) 170.7 µs 350.4 µs 361.2 µs
stringify with infering all elements (10000 elements)              3.0 ms         333.5 (  2.7 ms …  19.9 ms)   3.0 ms   6.9 ms  19.9 ms
stringify with infering first only (10000 elements)                2.9 ms         348.9 (  2.5 ms …  13.5 ms)   3.0 ms   5.6 ms  13.5 ms
stringify with infering all elements (100000 elements)            43.4 ms          23.0 ( 39.3 ms …  59.2 ms)  42.6 ms  59.2 ms  59.2 ms
stringify with infering first only (100000 elements)              40.5 ms          24.7 ( 36.0 ms …  55.0 ms)  41.5 ms  55.0 ms  55.0 ms
stringify with infering all elements (1000000 elements)          473.9 ms           2.1 (437.8 ms … 524.9 ms) 490.7 ms 524.9 ms 524.9 ms
stringify with infering first only (1000000 elements)            460.8 ms           2.2 (419.8 ms … 508.2 ms) 469.6 ms 508.2 ms 508.2 ms

@Leokuma

Leokuma commented Oct 28, 2024

Defaulting to inferring from the first object sounds error-prone to me. It could lead to bugs and data loss. IMO that should happen only if the dev explicitly opts into that behavior. JSON stringify, YAML stringify, and TOML stringify don't arbitrarily discard data.

Personally, I see collecting all objects' props as the correct behavior, and we shouldn't sacrifice correctness for performance. But I'll try to run some benchmarks later to see how big the performance impact is on my machine.

@efekrskl
Contributor Author

I'm a bit late to my own party 😅 As I mentioned already, I'm not heavily leaning towards either solution, but if I had to pick, I would go with my first implementation: picking the headers from the first item. If I were an end user, I would expect to pass headers when my data is not uniform. I also like that this matches the approach users coming from the most-used CSV libraries would expect.

@efekrskl
Contributor Author

I reverted the implementation, so now we derive the columns from the first item. I feel this is the correct approach, and I would like to move forward with the PR. Just to go over the thought process one more time:

  • IMO, if one has non-uniform data, it is more natural to pass headers explicitly (in doing so, you "structure" your data)
  • Thanks @timreichen for taking the time to benchmark this. While the impact is indeed minimal, I feel this feature should have absolutely zero impact on performance
  • It seems to be the more common approach

@kt3k
Member

kt3k commented Nov 13, 2024

@efekrskl
Sorry for the delay in review. We currently have a policy of accepting every new feature as an 'unstable' feature first (ref https://github.com/denoland/std/blob/main/.github/CONTRIBUTING.md#suggesting-a-new-feature ) to minimize the impact on the stable APIs.

To align this PR with that policy, I moved the newly added part of this change to unstable_stringify.ts (which will be available from @std/csv/unstable-stringify). Does this change look reasonable to you?

@efekrskl
Contributor Author

@kt3k Of course :) looks great, thanks for the changes.

@kt3k kt3k changed the title feat(csv): infer column names from object arrays for stringify() feat(csv/unstable): infer column names from object arrays for stringify() Nov 14, 2024
Member

@kt3k kt3k left a comment


LGTM!

@kt3k kt3k merged commit 7490085 into denoland:main Nov 14, 2024
20 checks passed

Successfully merging this pull request may close these issues.

CSV - option to automatically infer headers
7 participants