Detects "whitespace" as delimiter instead of "comma" #420

Closed
HMazharHameed opened this issue Oct 30, 2020 · 4 comments

Comments

@HMazharHameed

Hi,
I tried to parse this file, but the parser detects the space as the delimiter instead of the comma.
The problem seems to occur whenever the values are separated by a comma followed by a space.
Is there a setting to deal with such cases?

Sample Input.

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K
42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 40, United-States, >50K
37, Private, 280464, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 80, United-States, >50K
30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K
23, Private, 122272, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K
32, Private, 205019, Assoc-acdm, 12, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K
40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K
25, Self-emp-not-inc, 176756, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 35, United-States, <=50K
32, Private, 186824, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, United-States, <=50K
38, Private, 28887, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K
43, Self-emp-not-inc, 292175, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, >50K
40, Private, 193524, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K
54, Private, 302146, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K
35, Federal-gov, 76845, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, Black, Male, 0, 0, 40, United-States, <=50K
43, Private, 117037, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2042, 40, United-States, <=50K
59, Private, 109015, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K
56, Local-gov, 216851, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K
19, Private, 168294, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
39, Private, 367260, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K
49, Private, 193366, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K

@HMazharHameed HMazharHameed changed the title from 'Detects "space" as delimiter instead of "comma"' to 'Detects "whitespace" as delimiter instead of "comma"' Oct 30, 2020
@jbax
Member

jbax commented Nov 2, 2020

In the example you gave, there is no clear distinction between space and comma as delimiters (to the code that tries to figure out what delimiter you might have there). Both show up consistently across all rows.

You can try to solve this by passing a list of allowed delimiters, for example:

parserSettings.setDelimiterDetectionEnabled(true, ',', '|', ';', '\t');

I'll leave this open to try and figure out a better way to handle this case.
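
To illustrate why the two characters are hard to tell apart, here is a minimal sketch that counts candidate delimiters in the first two sample rows. It is not the univocity-parsers detection code; the class `DelimiterCounts` and its methods are hypothetical names used only for this example, and "occurs the same number of times in every row" is an assumed stand-in for whatever consistency measure the library really uses.

```java
import java.util.List;

// Illustrative only: NOT the univocity-parsers implementation.
final class DelimiterCounts {

    // Counts how many times `candidate` occurs in `row`.
    static int count(String row, char candidate) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == candidate) {
                n++;
            }
        }
        return n;
    }

    // True when `candidate` occurs at least once and the same number of
    // times in every row (an assumed, simplified notion of "consistent").
    static boolean isConsistent(List<String> rows, char candidate) {
        if (rows.isEmpty()) {
            return false;
        }
        int expected = count(rows.get(0), candidate);
        if (expected == 0) {
            return false;
        }
        for (String row : rows) {
            if (count(row, candidate) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First two rows of the sample input from this issue.
        List<String> rows = List.of(
            "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
            "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K");

        for (char c : new char[] {',', ' ', '-'}) {
            System.out.println("'" + c + "' consistent across rows: " + isConsistent(rows, c));
        }
    }
}
```

Every comma in these rows is followed by exactly one space, so both characters occur equally often per row, and a frequency-based heuristic alone has no reason to prefer one over the other. Restricting the candidate list, as suggested above, removes the space from consideration.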

@HMazharHameed
Author

Thank you for getting in touch with me.
Yeah, this file is one of the special cases I work with. I have already tried several approaches, and I will wait for your answer on a better, more dynamic solution.

@HMazharHameed
Author

HMazharHameed commented Dec 10, 2020

Hi,
In the file I showed earlier, the parser detects whitespace as the delimiter. For the file below, it detects the comma as the delimiter, which is correct, but it then appends an extra comma: it substitutes the whitespace with a comma and treats the comma that was already there as part of the cell value.

My observations:

  1. The parser assumes the comma is the delimiter by default and replaces whitespace with commas in this case.
  2. This happens when the values have differing amounts of whitespace. If the file has the same amount of whitespace before or after every cell value, the parser detects the delimiter correctly and does not replace whitespace with commas.

Sample Input

1.000, 2.000, 26986.500, 28217.400
2.000, 1.000, 39436.602, 40417.500
1.000, 1.000, 56504.199, 57633.000
1.000, 2.000, 0.000, 0.000
1.000, 2.000, 0.000, 0.000
2.000, 1.000, 223854.906, 224208.906
1.000, 1.000, 234827.094, 234872.906
2.000, 1.000, 242935.703, 243400.297
1.000, 1.000, 291422.187, 292854.812
1.000, 1.000, 0.000, 0.000
1.000, 1.000, 325835.594, 327295.812
2.000, 2.000, 330849.000, 333776.500
1.000, 2.000, 346442.906, 347863.312
1.000, 2.000
0.000, 11.000, 36181.000, 46595.699, 1.000
0.000, 12.000, 48054.398, 51497.602, 1.000
0.000, 12.000, 0.000, 0.000, 1.000
0.000, 12.000, 56339.102, 57802.699, 1.000
1.000, 11.000, 61023.699, 62411.102, 1.000
1.000, 12.000, 65707.898, 67088.797, 1.000
2.000, 11.000, 70421.398, 71927.500, 1.000
2.000, 12.000, 75163.102, 76616.898, 1.000
3.000, 11.000, 79828.797, 81223.500, 1.000
3.000, 12.000, 85949.203, 86075.703, 1.000
0.000, 11.000, 0.000
0.000, 12.000, 0.000
1.000, 11.000, 0.000
1.000, 11.000, 0.000
1.000, 11.000, 0.000
1.000, 12.000, 0.000
0.000, 11.000, 165422.297, 174058.000, 1.000
0.000, 12.000, 174072.406, 179033.500, 1.000
0.000, 12.000, 0.000, 0.000, 1.000
0.000, 12.000, 0.000, 0.000, 1.000
1.000, 11.000, 188491.203, 189916.406, 1.000
1.000, 12.000, 193203.906, 194739.094, 1.000
0.000, 11.000, 220495.703, 233841.406, 1.000
0.000, 12.000, 233864.000, 238646.797, 1.000
1.000, 11.000, 240117.906, 245865.500, 1.000
1.000, 12.000, 249333.797, 250740.906, 1.000
2.000, 11.000, 256761.797, 257932.500, 1.000
2.000, 12.000, 257945.797, 262812.687, 1.000
3.000, 11.000, 264271.000, 269978.000, 1.000
3.000, 12.000, 271353.187, 274820.906, 1.000
4.000, 11.000, 276473.187, 281976.687, 1.000
4.000, 12.000, 283523.812, 286868.594, 1.000
5.000, 11.000, 288448.312, 294017.812, 1.000
5.000, 12.000, 297854.000, 298910.906, 1.000
6.000, 11.000, 300499.906, 306017.187, 1.000
6.000, 12.000, 307581.594, 310994.594, 1.000
7.000, 11.000, 316626.312, 318068.312, 1.000
7.000, 12.000, 319569.406, 322962.187, 1.000
8.000, 11.000, 324508.906, 330053.594, 1.000
8.000, 12.000, 331681.187, 334986.312, 1.000
9.000, 11.000, 340626.406, 342086.500, 1.000
9.000, 12.000, 343612.094, 346974.687, 1.000
10.000, 11.000, 348545.312, 354106.094, 1.000
10.000, 12.000, 357613.906, 359008.500, 1.000
11.000, 11.000, 360543.594, 366078.594, 1.000
11.000, 12.000, 367618.500, 370994.000, 1.000

jbax added a commit that referenced this issue Dec 15, 2020
@jbax
Member

jbax commented Dec 15, 2020

Done. I've made adjustments so that whitespace is disregarded as a delimiter when there is a suitable alternative delimiter candidate.

I'll release a 2.9.1-SNAPSHOT build about an hour from now, which you can use to test. Thank you for using our parsers!
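
For readers who want a picture of that rule, here is a rough sketch of the idea. It is not the code from the referenced commit; `DelimiterChoice`, `pick`, and the "same count in every row" test are assumptions made for illustration only.

```java
import java.util.List;

// Illustrative only: NOT the change made in univocity-parsers 2.9.1.
final class DelimiterChoice {

    // Counts how many times `c` occurs in `row`.
    static int count(String row, char c) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    // True when `c` occurs at least once and equally often in every row
    // (an assumed, simplified notion of a "suitable" candidate).
    static boolean isConsistent(List<String> rows, char c) {
        if (rows.isEmpty()) {
            return false;
        }
        int expected = count(rows.get(0), c);
        if (expected == 0) {
            return false;
        }
        for (String row : rows) {
            if (count(row, c) != expected) {
                return false;
            }
        }
        return true;
    }

    // Returns the first consistent non-whitespace candidate. A consistent
    // whitespace candidate is used only when no such alternative exists.
    // Returns '\0' when no candidate is consistent.
    static char pick(List<String> rows, char... candidates) {
        char whitespaceFallback = '\0';
        for (char c : candidates) {
            if (!isConsistent(rows, c)) {
                continue;
            }
            if (!Character.isWhitespace(c)) {
                return c;
            }
            if (whitespaceFallback == '\0') {
                whitespaceFallback = c;
            }
        }
        return whitespaceFallback;
    }

    public static void main(String[] args) {
        // Two rows from the second sample in this issue.
        List<String> rows = List.of(
            "1.000, 2.000, 26986.500, 28217.400",
            "2.000, 1.000, 39436.602, 40417.500");
        System.out.println("picked: '" + pick(rows, ' ', ',') + "'");
    }
}
```

Under these assumptions, the comma wins over the space even when the space is listed first, while a file that is separated only by spaces still falls back to the space.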

@jbax jbax closed this as completed Dec 15, 2020
@jbax jbax added this to the 2.9.1 milestone Dec 15, 2020
@jbax jbax self-assigned this Dec 15, 2020