Lines ending with a delimiter confuse hfs-delimited #228
Comments
Actually, the lines I used had more than two tokens. I hope it can be reproduced anyway. |
Also, is there a way to disable quoting entirely on an hfs-delimited source tap? At the moment I just use some unlikely quote character like :quote "\u0003" but still have to strip that character from the input, along with delim chars |
Not sure on this one - you might check out the underlying implementation in Cascading? This would be a nice feature to add. |
@methylene you can pass in I'm going to close this issue for now. Feel free to re-open or msg me if this doesn't work for you. |
Hi Paul, thanks for the tip, that's definitely helpful. But I think it's only a workaround. Maybe there is a misunderstanding? In my input data there is no "missing" field, but an empty one at the end of some lines, right before the line break. In particular, there is no line in my input that doesn't have the correct number of delimiters. Using |
@methylene delimited expects an ISO legit delimited file. Your best bet is to just read it in with hfs-textline then parse it manually. Does that answer your question? |
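For reference, the manual-parsing route suggested here could look roughly like the sketch below. It is plain Clojure and does not touch the Cascalog or Cascading APIs; `parse-line` and its arity argument `n` are hypothetical names, not part of any library. It assumes a tab delimiter and no quoting. The negative limit passed to `clojure.string/split` is what keeps a trailing empty token instead of dropping it.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of Cascalog.
;; Splits a tab-delimited line into exactly n fields.
;; Empty tokens, including a trailing one, become nil.
;; Returns nil if the line does not contain exactly n tokens.
(defn parse-line
  [line n]
  ;; A limit of -1 keeps trailing empty strings; the default drops them.
  (let [tokens (str/split line #"\t" -1)]
    (when (= n (count tokens))
      (mapv not-empty tokens))))

(parse-line "0\t1\t" 3)       ;; => ["0" "1" nil]
(str/split "0\t1\t" #"\t")    ;; => ["0" "1"], the trailing empty token is lost
```

Such a function would be applied to each line read through hfs-textline; whether the resulting nil values bind cleanly depends on using nullable (`!`) variables in the query.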
Please explain how a CSV line with the correct number of tokens, but with the last token empty, violates an ISO standard? |
You mean, adding quote chars would make the problem go away? |
So yes, I have yet to try "a""t"b\t"" as input line, but I'm afraid that's not what you get as an output when using hfs-delimited or hfs-textline as a sink |
When using hfs-delimited or hfs-textline as a sink tap, empty strings are encoded in just the same way as null values, which can also be a problem |
Is this test case corresponding to your problem? Quantisan@cac0119 |
I think there's something wrong with that test case. Yes, it looks as if it confirms the issue. But the following fails, too:
(deftest delimited-empty-notlast-test
(fact
(io/with-fs-tmp [_ tmp]
(?- (hfs-textline tmp) ;; write line
[["Proin\t\thendrerit"]])
(fact "Test hfs-delimited where last element is empty"
(<- [?a ?b !c]
((hfs-delimited tmp :delimiter "\t") ?a ?b !c)) =>
(produces [["Proin" nil "hendrerit"]]))))) |
Sorry, it should have been !b, not ?b. The test describes the problem well. The following test succeeds, but fails without the
(deftest delimited-empty-token-test
(let [lines (->> [["0" "1" "2"] ["" "1" "2"]
["0" "" "2"] ["" "" "2"]
["0" "1" "" ] ["" "1" "" ]
["0" "" "" ] ["" "" "" ]]
(remove (comp empty? last))
(vec))]
(io/with-fs-tmp [_ tmp]
(?- (hfs-textline tmp) ;; write line
lines)
(fact (str "Test with nil at various positions")
(<- [!a !b !c]
((hfs-delimited tmp :delimiter "\t") !a !b !c)) =>
(produces (mapv (partial mapv not-empty) lines)))))) |
By the way, I haven't used midje before, but I wonder why all tests in cascalog.more-taps-test have two (fact) forms, the outer one having only one argument and no "=>" symbol? |
I have this problem as well.
I don't get any error, just no result at all. It's also not trapped. This is the code I use to parse it
So if I understand correctly, there is a test case, but no fix? I'm not sure what's going on here. It seems to matter which of these two lines I pick.
|
The cascading guys run their tests against Cascalog, so you should be Thanks,
Sam Ritchie (@sritchie) |
I deleted my comment because I realized they were talking about Cascading 2.1, while we're already using 2.5, so the thread is irrelevant. I also fixed my problem by using a nullable variable, so the empty last column doesn't cause the line to be excluded. |
@sritchie I've tried putting the |
It's not working, but here's a commit to add the test and bump the cascading version: 71d22c1 |
I've noticed I get an exception when I use something like
as a generator, and a.txt contains some lines where the second token is empty, so that the line ends in \t\n. In this case, I get an "operation added wrong number of fields" exception.
I'm not sure whether it's a cascading or cascalog problem.
The cascalog version is 2.0.0, the hadoop version is 1.1.2.