feat(csv/unstable): infer column names from object arrays for stringify() #6122
Conversation
csv/stringify_test.ts (outdated)
await t.step(
  {
    name: "Invalid data, no columns",
    fn() {
      const data = [{ a: 1 }, { a: 2 }];
      assertThrows(
        () => stringify(data),
        TypeError,
        "No property accessor function was provided for object",
      );
    },
  },
);
await t.step(
Looks like these were duplicates; removed both as they are no longer relevant.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #6122 +/- ##
==========================================
- Coverage 96.48% 95.84% -0.65%
==========================================
Files 535 517 -18
Lines 41285 41156 -129
Branches 6165 6074 -91
==========================================
- Hits 39835 39447 -388
- Misses 1406 1664 +258
- Partials 44 45 +1

☔ View full report in Codecov by Sentry.
Maybe instead of inferring from just the first element, how about collecting headers from all objects?

I chose this approach as it feels like the more common pattern found in similar solutions. I'm open to both approaches though; I wonder what others think.
+1 for collecting all headers. If we don't, the user will be forced to either specify all columns manually or pass a complete object as the first element. There's a risk of losing data:

[
  {"city": "Paris", "area": 90},
  {"city": "Rome", "area": 50, "population": 43},
  {"city": "Tokyo", "altitude": 9},
]

The JSON above would lose data: with first-item inference, the population and altitude columns would be silently dropped. This is a simple JSON, but in more complex use cases (e.g. data science) where there are dozens or hundreds of columns, it might not be obvious that you're losing data. OTOH, if this approach has a performance impact, we should probably add to the docs that specifying columns manually allows the function to run faster.
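For illustration, a minimal sketch of what union-based inference could look like; the helper name is hypothetical and this is not code from the PR:

```ts
// Collect the union of keys across all objects: every column is preserved,
// at the cost of one full pass over the data before anything is emitted.
function inferColumnsFromAll(records: readonly Record<string, unknown>[]): string[] {
  const columns = new Set<string>();
  for (const record of records) {
    for (const key of Object.keys(record)) columns.add(key);
  }
  return [...columns];
}

// For the cities example above this yields
// ["city", "area", "population", "altitude"], so nothing is dropped.
```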
Thank you both for the feedback. I've updated the implementation to infer the columns from all array elements instead of the first one.
Personally I think the option to infer headers should only be done via the first object. Collecting all objects before outputting anything to infer all headers really removes the streaming aspect of it. As it stands now the user is already forced to manually supply all headers, so requiring them to do that when they don't have identical keys in each object isn't a problem. This inference is really for the people who are using identical keys in objects.
But the same argument goes the other way around as well: as it stands now the user is already forced to manually supply all headers, so requiring them to do that when they don't want to wait until all headers are collected isn't a problem. Why should the inference be just for people who are using identical keys in objects? I think this might feel like a bug if data is just omitted, as @Leokuma pointed out.
The way it is now forces them to always supply headers. It makes more sense for the change to only require headers when the objects don't all have the same headers. It doesn't make sense to make inference have a huge memory impact, essentially forcing everyone who doesn't want to incur that impact to continue manually supplying headers. If inference has such a huge impact then it's pointless to add it. Most people will be dealing with objects with identical headers, and those who aren't should be the ones forced to continue manually supplying headers. Manually supplying headers shouldn't be about performance, but about disambiguating the data source.
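For contrast with the sketch above, a hedged sketch of the first-item strategy described here (the helper name is an assumption):

```ts
// Only the first object is inspected, so the header row can be emitted
// before the remaining rows are visited, preserving streaming behavior.
function inferColumnsFromFirst(records: readonly Record<string, unknown>[]): string[] {
  return records.length > 0 ? Object.keys(records[0]) : [];
}
```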
I agree. Inferring headers is just a way to help the user detect column names. It's very useful when all objects have the same headers, so users don't need to specify them again. However, when objects have different fields, it's recommended to supply headers manually to avoid losing data, rather than relying on the inference.
I'm in favor of inferring only from the first item. As @BlackAsLight points out, the performance impact looks too significant for large data.
While I agree that this will have a performance hit, selecting the keys of just the first item can lead to "buggy" behavior, because not everyone will expect special treatment of the first element. I think in that case we should not implement this and keep explicit header declarations. I did some benchmarks though, and while the impact is big when comparing the inference step in isolation, the difference is much smaller for the overall stringify() call.

Benchmarks:

CPU | Apple M1
Runtime | Deno 2.0.3 (aarch64-apple-darwin)
file:///std/csv/infer_columns_bench.ts
benchmark time/iter (avg) iter/s (min … max) p75 p99 p995
----------------------------------------------- ----------------------------- --------------------- --------------------------
infer columns all elements (1000 elements) 15.5 µs 64,660 ( 13.5 µs … 205.9 µs) 15.1 µs 17.4 µs 29.3 µs
infer columns first only (1000 elements) 10.7 ns 93,560,000 ( 9.8 ns … 34.7 ns) 10.2 ns 23.7 ns 25.0 ns
infer columns all elements (10000 elements) 141.1 µs 7,087 (131.3 µs … 325.7 µs) 135.5 µs 278.1 µs 286.2 µs
infer columns first only (10000 elements) 10.8 ns 92,210,000 ( 10.0 ns … 29.0 ns) 10.3 ns 23.8 ns 24.7 ns
infer columns all elements (100000 elements) 1.4 ms 704.8 ( 1.3 ms … 1.6 ms) 1.5 ms 1.5 ms 1.6 ms
infer columns first only (100000 elements) 10.7 ns 93,270,000 ( 10.0 ns … 43.4 ns) 10.2 ns 23.5 ns 24.3 ns
infer columns all elements (1000000 elements) 14.3 ms 70.0 ( 14.1 ms … 14.4 ms) 14.3 ms 14.4 ms 14.4 ms
infer columns first only (1000000 elements) 10.7 ns 93,250,000 ( 9.8 ns … 33.6 ns) 10.2 ns 24.2 ns 25.8 ns
file:///std/csv/stringify_bench.ts
benchmark time/iter (avg) iter/s (min … max) p75 p99 p995
--------------------------------------------------------- ----------------------------- --------------------- --------------------------
stringify with infering all elements (1000 elements) 192.6 µs 5,192 (178.1 µs … 528.8 µs) 185.7 µs 361.0 µs 376.5 µs
stringify with infering first only (1000 elements) 177.1 µs 5,645 (163.7 µs … 507.9 µs) 170.7 µs 350.4 µs 361.2 µs
stringify with infering all elements (10000 elements) 3.0 ms 333.5 ( 2.7 ms … 19.9 ms) 3.0 ms 6.9 ms 19.9 ms
stringify with infering first only (10000 elements) 2.9 ms 348.9 ( 2.5 ms … 13.5 ms) 3.0 ms 5.6 ms 13.5 ms
stringify with infering all elements (100000 elements) 43.4 ms 23.0 ( 39.3 ms … 59.2 ms) 42.6 ms 59.2 ms 59.2 ms
stringify with infering first only (100000 elements) 40.5 ms 24.7 ( 36.0 ms … 55.0 ms) 41.5 ms 55.0 ms 55.0 ms
stringify with infering all elements (1000000 elements) 473.9 ms 2.1 (437.8 ms … 524.9 ms) 490.7 ms 524.9 ms 524.9 ms
stringify with infering first only (1000000 elements)     460.8 ms           2.2 (419.8 ms … 508.2 ms) 469.6 ms 508.2 ms 508.2 ms
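For reference, a hypothetical reconstruction of what infer_columns_bench.ts could look like; the actual benchmark file is not shown in this thread, so the data shape and sizes are assumptions:

```ts
// Synthetic rows with identical keys (shape and size are assumptions).
const data = Array.from({ length: 1_000 }, (_, i) => ({ a: i, b: i * 2, c: `${i}` }));

Deno.bench("infer columns all elements (1000 elements)", () => {
  // Union of keys across every row: one full pass over the data.
  const columns = new Set<string>();
  for (const record of data) {
    for (const key of Object.keys(record)) columns.add(key);
  }
});

Deno.bench("infer columns first only (1000 elements)", () => {
  // Keys of the first row only: constant time in the number of rows.
  Object.keys(data[0]);
});
```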
Defaulting to inferring from the first object sounds error prone to me. It could lead to bugs and data loss. IMO that should happen only if the dev explicitly opts for that behavior. JSON stringify, YAML stringify and TOML stringify don't arbitrarily discard data. Personally I see collecting all objects' props as the correct behavior, and we shouldn't sacrifice correctness in order to be performant. But I'll try to run some benchmarks later to see how big the performance impact is on my machine.
I'm a bit late to my own party 😅 As I mentioned already, I'm not heavily leaning towards either solution, but if I had to pick, I would go with my first implementation: picking the headers from the first item. If I were an end user, I would expect to pass headers already if my data were not uniform. I also like that this is the expected approach for users coming from the most-used CSV libraries.
I reverted the implementation, so now we derive the columns from the first item. I feel like this is the correct approach, and I would like to move forward with the PR. Just to go over the thought process one more time:
- Users with non-uniform data are already expected to pass headers explicitly, so inference mainly serves the common case of identical keys.
- Collecting headers from all elements costs a full pass over the data and removes the streaming aspect.
- Inferring from the first item matches the behavior of the most-used CSV libraries.
@efekrskl To align this PR with that policy, I moved the newly added part of this change to
@kt3k Of course :) Looks great, thanks for the changes.
LGTM!
Closes #3857
Allows usages like below by deriving the column names from the elements of the given array.
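A hedged example of the new behavior; the exact import specifier is an assumption (the feature ships as an unstable csv API):

```ts
import { stringify } from "@std/csv/stringify";

const data = [
  { name: "Rick", age: 70 },
  { name: "Morty", age: 14 },
];

// Before this change, `columns` had to be passed explicitly:
//   stringify(data, { columns: ["name", "age"] });
// Now the column names are inferred from the keys of the first object:
console.log(stringify(data));
// name,age
// Rick,70
// Morty,14
```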