Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implicit limitation on max column name length? #438

Closed
CodingCat opened this issue Jan 2, 2021 · 4 comments
Closed

implicit limitation on max column name length? #438

CodingCat opened this issue Jan 2, 2021 · 4 comments
Assignees
Labels
Milestone

Comments

@CodingCat
Copy link

CodingCat commented Jan 2, 2021

Hi, we recently noticed a bug in Spark 3.x which depends on the latest version of univocity parser. Basically, we found that there is an implicit limitation on column name length in univocity(1024 chars by default). if you added a header longer than the limitation, you will get NPE (you could see the detailed analysis in the Spark PR)

in univocity code base, you could add the following unit test to reproduce (to get that NPE error mentioned in Spark PR)

+       @Test
+       public void testSuperLongHeader() {
+               CsvWriterSettings settings = new CsvWriterSettings();
+               settings.getFormat().setLineSeparator("\n");
+               StringBuffer sb = new StringBuffer();
+               for (int i = 0; i < 1025; i++) {
+                       sb.append("a");
+               }
+               settings.setHeaders(sb.toString());
+               StringWriter out = new StringWriter();
+
+               CsvWriter writer = new CsvWriter(out, settings);
+               writer.writeHeaders();
+               List<String> row = new ArrayList<String>();
+               row.add("value 1");
+               row.add("value 2");
+               writer.writeRow(row);
+               writer.close();
+
+               assertEquals(out.toString(), "value 1,value 2\n");
+       }

NPE:

java.lang.NullPointerException: null
	at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
	at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
	at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
	at com.univocity.parsers.csv.CsvWriterTest.testSuperLongHeader(CsvWriterTest.java:638)

our question is: is such a limitation intentionally added? or it is actually a bug?

cc @HyukjinKwon @viirya

@CodingCat
Copy link
Author

cc @jbax

@CodingCat
Copy link
Author

ping @jbax

jbax added a commit that referenced this issue Jan 12, 2021
@jbax
Copy link
Member

jbax commented Jan 12, 2021

Fixed and will release a 2.9.1-SNAPSHOT version soon which you can use to test and confirm it's working.
Thank you!

@jbax jbax closed this as completed Jan 12, 2021
@jbax jbax added this to the 2.9.1 milestone Jan 12, 2021
@jbax jbax self-assigned this Jan 12, 2021
@jbax jbax added the bug label Jan 12, 2021
@HyukjinKwon
Copy link

Awesome!

HyukjinKwon pushed a commit to apache/spark that referenced this issue Jan 20, 2021
### What changes were proposed in this pull request?

upgrade univocity

### Why are the changes needed?

csv writer actually has an implicit limit on column name length due to univocity-parser 2.9.0,

when we initialize a writer https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211, it calls toIdentifierGroupArray which calls valueOf in NormalizedString.java eventually (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)

in that stringCache.get, it has a maxStringLength cap https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104 which is 1024 by default

more details at #30972 and uniVocity/univocity-parsers#438

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
existing UT

Closes #31246 from CodingCat/upgrade_univocity.

Authored-by: CodingCat <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit to apache/spark that referenced this issue Jan 20, 2021
### What changes were proposed in this pull request?

upgrade univocity

### Why are the changes needed?

csv writer actually has an implicit limit on column name length due to univocity-parser 2.9.0,

when we initialize a writer https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211, it calls toIdentifierGroupArray which calls valueOf in NormalizedString.java eventually (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)

in that stringCache.get, it has a maxStringLength cap https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104 which is 1024 by default

more details at #30972 and uniVocity/univocity-parsers#438

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
existing UT

Closes #31246 from CodingCat/upgrade_univocity.

Authored-by: CodingCat <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 7f3e952)
Signed-off-by: HyukjinKwon <[email protected]>
skestle pushed a commit to skestle/spark that referenced this issue Feb 3, 2021
### What changes were proposed in this pull request?

upgrade univocity

### Why are the changes needed?

csv writer actually has an implicit limit on column name length due to univocity-parser 2.9.0,

when we initialize a writer https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211, it calls toIdentifierGroupArray which calls valueOf in NormalizedString.java eventually (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)

in that stringCache.get, it has a maxStringLength cap https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104 which is 1024 by default

more details at apache#30972 and uniVocity/univocity-parsers#438

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
existing UT

Closes apache#31246 from CodingCat/upgrade_univocity.

Authored-by: CodingCat <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants