@oxaoo

@oxaoo oxaoo commented Oct 27, 2018

Support for duplicate header entries lets you process a CSV file without worrying about whether it contains duplicate headers.
It is enough to call CSVFormat.DEFAULT.withIgnoreDuplicateHeaderEntries(), which has to be the first call in the chain that builds the CSVFormat.

Why is this needed?
Here are two real-life examples.

  1. There is a known set of columns to extract data from, but no information about which other columns (possibly duplicates) the document contains or in what order.
    This feature avoids exceptions such as java.lang.IllegalArgumentException: The header contains a duplicate name when the contents of the document are not fully known and values have to be read by name.
    Example:
    Known column set: [A, B, D].
    Actual document column set: [Z, A, B, C, D, C]
    Updated header structure: Z->[0], A->[1], B->[2], C->[3, 5], D->[4]
    Summary: This approach avoids exceptions caused by columns that do not even take part in processing, while still allowing values to be read by name.

  2. There is a pivot table that aggregates other tables with partially identical column names, and an aggregate function has to be applied across the identically named columns.
    Example:
    Table1: [A, B, C]
    Table2: [B, C]
    Pivot table: [A, B, C, B, C]
    Task: compute an XOR over the duplicate columns
    Updated header structure: A->[0], B->[1, 3], C->[2, 4].
    Summary: This approach stores duplicates as an ordered set of indices, which makes it possible to compute xor(B[1], B[3]) & xor(C[2], C[4]).
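
The header structure from both examples can be sketched in plain Java, independent of Commons CSV. This is only an illustration of the name-to-indices idea, not the code in this pull request: the class `DuplicateHeaderSketch` and its methods `index` and `xor` are hypothetical names, and the cells are assumed to hold booleans, which the description does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Commons CSV or from this pull request.
final class DuplicateHeaderSketch {

    // Maps each header name to every column index at which it occurs, in column order.
    static Map<String, List<Integer>> index(String... header) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            positions.computeIfAbsent(header[i], name -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    // XORs the values of all columns that share the given name within one record.
    // Assumption: cells are booleans. The name must be present in the index.
    static boolean xor(Map<String, List<Integer>> positions, String name, boolean[] record) {
        boolean result = false;
        for (int column : positions.get(name)) {
            result ^= record[column];
        }
        return result;
    }

    public static void main(String[] args) {
        // Example 1: the duplicated column C is simply recorded twice.
        System.out.println(index("Z", "A", "B", "C", "D", "C"));
        // prints {Z=[0], A=[1], B=[2], C=[3, 5], D=[4]}

        // Example 2: pivot table [A, B, C, B, C] with one made-up record.
        Map<String, List<Integer>> pivot = index("A", "B", "C", "B", "C");
        boolean[] record = {true, true, false, false, true};
        System.out.println(xor(pivot, "B", record) & xor(pivot, "C", record));
        // prints true
    }
}
```

A `LinkedHashMap` keeps the names in first-seen order, so the printed structure matches the one written out in the examples.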

@coveralls

coveralls commented Oct 27, 2018

Coverage increased (+0.2%) to 95.294% when pulling 284035a on oxaoo:master into 0ab2b08 on apache:master.

@oxaoo
Author

oxaoo commented Dec 8, 2018

@garydgregory, why don't you take a look at this pull request?

@garydgregory
Member

garydgregory commented Dec 9, 2018

Instead of this complication, I would recommend using withSkipHeaderRecord and withHeader(String...). This lets you skip the first row, where your duplicate headers are a problem, and set the headers to values that make sense to your application.

For an example, see org.apache.commons.csv.CSVParserTest.testSkipHeaderOverrideDuplicateHeaders().
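
The idea behind this suggestion can be sketched without the library: discard the first row and address the columns by replacement names chosen by the caller. The following is a plain-Java illustration only, not how Commons CSV parses; `HeaderOverrideSketch` and `read` are hypothetical names, the sample data is made up, and the comma split deliberately ignores quoting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "skip the header row, supply your own names".
final class HeaderOverrideSketch {

    // Reads every line after the first and keys each cell by the caller-supplied name.
    static List<Map<String, String>> read(List<String> lines, String... names) {
        List<Map<String, String>> records = new ArrayList<>();
        // Start at 1: the first line holds the duplicate header names and is skipped.
        for (int i = 1; i < lines.size(); i++) {
            String[] cells = lines.get(i).split(",", -1); // simplification: no quoting support
            Map<String, String> record = new LinkedHashMap<>();
            for (int c = 0; c < names.length && c < cells.length; c++) {
                record.put(names[c], cells[c]);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Z,A,B,C,D,C", "z1,a1,b1,c1,d1,c2");
        // The two C columns get distinct replacement names.
        List<Map<String, String>> records = read(lines, "Z", "A", "B", "C1", "D", "C2");
        System.out.println(records.get(0).get("C1") + " " + records.get(0).get("C2"));
        // prints c1 c2
    }
}
```

The trade-off of this approach is that the caller has to know the column count and order in advance, which the first example in the pull request description says is not always the case.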

@garydgregory
Member

Please see the new methods in git master and 1.7-SNAPSHOT builds:

  • org.apache.commons.csv.CSVFormat.withAllowDuplicateHeaderNames()
  • org.apache.commons.csv.CSVFormat.withAllowDuplicateHeaderNames(boolean)

Does that work for your use case?

@LuckyIlam

Hi,
The pull request seems to be incomplete, because CSVFormat.validate:1671 does not allow duplicate header names?

@garydgregory
Member

Please provide a separate PR if you find a problem as we just released 1.7 but the site has not been updated yet.

@LuckyIlam

> Please provide a separate PR if you find a problem as we just released 1.7 but the site has not been updated yet.

Could you take a minute to look at that code? Is it normal that there is

    // validate header
    if (header != null) {
        final Set<String> dupCheck = new HashSet<>();
        for (final String hdr : header) {
            if (!dupCheck.add(hdr)) {
                throw new IllegalArgumentException(
                        "The header contains a duplicate entry: '" + hdr + "' in " + Arrays.toString(header));
            }
        }
    }

and not

    // validate header
    if (header != null && !allowDuplicateHeaderNames) {
        final Set<String> dupCheck = new HashSet<>();
        for (final String hdr : header) {
            if (!dupCheck.add(hdr)) {
                throw new IllegalArgumentException(
                        "The header contains a duplicate entry: '" + hdr + "' in " + Arrays.toString(header));
            }
        }
    }
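
The difference between the two variants can be checked in isolation. The following is a standalone sketch, not the actual CSVFormat.validate method: the class name `HeaderValidationSketch` is hypothetical, and the flag is passed as a parameter here rather than read from a field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone copy of the proposed check; not the library's own class.
final class HeaderValidationSketch {

    // Rejects duplicate header names unless the caller explicitly allows them.
    static void validate(String[] header, boolean allowDuplicateHeaderNames) {
        if (header != null && !allowDuplicateHeaderNames) {
            final Set<String> dupCheck = new HashSet<>();
            for (final String hdr : header) {
                // Set.add returns false when the name has been seen before.
                if (!dupCheck.add(hdr)) {
                    throw new IllegalArgumentException(
                            "The header contains a duplicate entry: '" + hdr + "' in " + Arrays.toString(header));
                }
            }
        }
    }

    public static void main(String[] args) {
        String[] header = {"A", "B", "C", "B", "C"};
        validate(header, true); // passes: duplicates are allowed
        try {
            validate(header, false);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
            // prints The header contains a duplicate entry: 'B' in [A, B, C, B, C]
        }
    }
}
```

Without the extra condition, the first call would throw as well, which is the behaviour questioned above.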

@garydgregory
Member

It would be best if you could provide a PR with a unit test that shows what is wrong...

@LuckyIlam

> It would be best if you could provide a PR with a unit test that shows what is wrong...

I can't push a branch to this repository; I am missing some authorization...

@garydgregory
Member

That's not how it works. Please read https://help.github.com/en/articles/about-pull-requests

@LuckyIlam

> That's not how it works. Please read https://help.github.com/en/articles/about-pull-requests

Are you using the "fork and pull model"?

> anyone can fork an existing repository and push changes to their personal fork without needing access to the source repository.

@LuckyIlam

Well, sorry for the disturbance and thanks for the help. This is my first time using GitHub; I finally opened a PR.

@garydgregory
Member

Closing, implemented by another PR, see https://issues.apache.org/jira/browse/CSV-264 and #114
