Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lines ending with a delimiter confuse hfs-delimited #228

Open
methylene opened this issue Jan 13, 2014 · 19 comments
Open

Lines ending with a delimiter confuse hfs-delimited #228

methylene opened this issue Jan 13, 2014 · 19 comments
Labels

Comments

@methylene
Copy link

I've noticed I get an exception when I use something like

((hfs-delimited "hfds:///a.txt") ?x ?y) 

as a generator, and a.txt contains some lines where the second token is empty, so that the line ends in \t\n. In this case, I get a "operation added wrong number of fields" exception.

I'm not sure whether it's a cascading or cascalog problem.

The cascalog version is 2.0.0, the hadoop version is 1.1.2.

@methylene
Copy link
Author

Acually the lines I used had more tokens than 2. I hope it can be reproduced anyway.

@methylene
Copy link
Author

Also, is there a way to disable quoting entirely on an hfs-delimited source tap? At the moment I just use some unlikely quote character like :quote "\u0003" but still have to strip that character from the input, along with delim chars

@sritchie
Copy link
Collaborator

Not sure on this one - you might check out the underlying implementation in Cascading? This would be a nice feature to add.

@Quantisan
Copy link
Collaborator

@methylene you can pass in :strict? false option into hfs-delimited and your missing field at the end would just be null. Regarding quote, I don't think you can disable it.

I'm going to close this issue for now. Feel free to re-open or msg me if this doesn't work for you.

@methylene
Copy link
Author

Hi Paul, thanks for the tip, that's definitly helpful. But I think it's only a workaround. Maybe there is a misunderstanding? In my input data there is no "missing" field, but an empty one at the end of some line, right before the line break. In particular, there is no line in my input that doesn't have the correct number of delimiters. Using :strict? false will resolve the error, but it also disables a reasonable sanity check on each line. So yes I think this should be reopened.

@Quantisan
Copy link
Collaborator

@methylene delimited expects an ISO legit delimited file. Your best bet is to just read it in with hfs-textline then parse it manually. Does that answer your question?

@Quantisan Quantisan reopened this Feb 5, 2014
@methylene
Copy link
Author

Please explain, how a csv line with the correct number of tokens, but the last token empty, violates an ISO standard?

@methylene
Copy link
Author

You mean, adding quote chars would make the problem go away?

@methylene
Copy link
Author

So yes, I have yet to try

"a""t"b\t""
instead of
a\tb\t

as input line, but I'm afraid that's not what you get as an output when using hfs-delimited or hfs-textline as a sink

@methylene
Copy link
Author

When using hfs-delimited or hfs-textline as a sink tap, empty strings are encoded just the same way as nulll values, that can also be a problem

Quantisan added a commit to Quantisan/cascalog that referenced this issue Feb 26, 2014
@Quantisan
Copy link
Collaborator

Is this test case corresponding to your problem? Quantisan@cac0119

@methylene
Copy link
Author

I think there's something wrong with that test case. Yes it looks as if it confirms the issue. But the following fails, too:

(deftest delimited-empty-notlast-test
  (fact
    (io/with-fs-tmp [_ tmp]
      (?- (hfs-textline tmp) ;; write line
          [["Proin\t\thendrerit"]])
      (fact "Test hfs-delimited where last element is empty"
        (<- [?a ?b !c]
            ((hfs-delimited tmp :delimiter "\t") ?a ?b !c)) =>
        (produces [["Proin" nil "hendrerit"]])))))

@methylene
Copy link
Author

Sorry, it should have been !b, not ?b. The test describes the problem well. The following test succeeds, but fails without the (remove (comp empty? last)) line:

(deftest delimited-empty-token-test
  (let [lines (->> [["0" "1" "2"] [""  "1" "2"]
                    ["0" ""  "2"] [""  ""  "2"]
                    ["0" "1" "" ] [""  "1" "" ]
                    ["0" ""  "" ] [""  ""  "" ]]
                   (remove (comp empty? last))
                   (vec))]
    (io/with-fs-tmp [_ tmp]
      (?- (hfs-textline tmp) ;; write line
          lines) 
      (fact (str "Test with nil at various positions") 
            (<- [!a !b !c]
                ((hfs-delimited tmp :delimiter "\t") !a !b !c)) =>
                (produces (mapv (partial mapv not-empty) lines))))))

@methylene
Copy link
Author

By the way, I haven't used midje before, but I wonder why all tests in cascalog.more-taps-test have two (fact) forms, the outer one having only one argument and no "=>" symbol?

@pepijndevos
Copy link

I have this problem as well.
My file uses commas and quotes, so stuff like

"col","another",""

I don't get any error, just no result at all. It's also not trapped.

This is the code I use to parse it

(defn log-source [path fields]
  (let [cfields (vec (map (partial str "?") fields))
        collect (mapfn [& values]
                  [(zipmap fields values)])
        log (hfs-delimited path
                   :outfields cfields
                   :delimiter ","
                   :compress? true
                   :strict? false)]
    (prn cfields)
    (<- [?map]
        (:trap (stdout))
        (:distinct false)
        (log :>> cfields)
        (collect :<< cfields :> ?map)
      )))

So if I understand correctly, ther is a testcase, but no fix?

I'm not sure what's going on here. It seems to matter which of these two lines I pick.

    (log :> ?field ?another _) ; works
    (log :> ?field ?another ?broken)
    (log :> ?field ?another !works) ; AHA!

@sritchie
Copy link
Collaborator

The cascading guys run their tests against Cascalog, so you should be
able to include it in your project.clj and override the cascalog
version. If it works, I'd love a pull req to bump the default cascading
version.

Thanks,
Sam

Pepijn de Vos mailto:[email protected]
March 25, 2014 4:21 AM

It seems this is fixed upstream?
https://groups.google.com/d/msg/cascading-user/9vy5H0gY8sk/kVM-YRPdycgJ

Can I use that with Cascalog?


Reply to this email directly or view it on GitHub
#228 (comment).

Sam Ritchie (@sritchie)
Paddleguru Co-Founder
703.863.8561
www.paddleguru.com http://www.paddleguru.com/
Twitter http://twitter.com/paddleguru// Facebook
http://facebook.com/paddleguru

@pepijndevos
Copy link

I deleted my comment because I realized they where talking about cascading 2.1, while we're already using 2.5, so the thread is irrelevant.

I also fixed my problem by using a Nullable variable, so the empty last column doesn't cause it to be excluded.
Not sure if @methylene still has problesm.

@methylene
Copy link
Author

@sritchie I've tried putting the [cascading/cascading-hadoop "2.6.0-wip-2"] into the cascalog-more-taps/project.clj in the d76a6b1 commit, appended the test above to cascalog-more-taps/test/cascalog/more-taps-test.clj, then ran the test using lein test :only cascalog.more-taps-test/delimited-empty-token-test. But the issue seems to be still there.

@methylene
Copy link
Author

It's not working, but here's a commit to add the test and bump the cascading version: 71d22c1

@sritchie sritchie added the Bug label Apr 19, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants