Conversation

@lyogavin (Contributor) commented Jun 2, 2013

At Yahoo we want to migrate some of our Hadoop-based pipelines to Spark. Our pipelines are mostly implemented in Hadoop Streaming. We hope we can reuse most of the Hadoop Streaming code so that we can reduce the migration effort and perhaps focus only on rewriting the performance bottlenecks. In our prototyping we saw that we can still get a significant performance improvement while reusing existing code. This should also be helpful for other users who currently use Hadoop and want to move to Spark.

Though Spark has rdd.pipe() to do work similar to Hadoop Streaming, we found that the following use cases still couldn't be supported.

  1. We have a problem reusing the reducer-side Hadoop Streaming code with pipe(). In our use case, after keyBy() or groupBy() the data for some keys can be very large. If we work around this with map() followed by pipe(), we need to build one big string that concatenates all the strings in a key's Seq[String], which consumes a large chunk of memory in some cases and does not seem like a scalable or performant solution (see the sketch after this list). Since Spark is such a memory-intensive system, optimizing memory usage is very important for us.
  2. Sometimes in Hadoop Streaming we need the distributed cache (-cacheFile, -cacheArchive) to distribute shared data to all TaskTrackers. We need a similar mechanism in Spark to do the same thing.
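
As a rough illustration of the first problem, a map()-then-pipe() workaround has to materialize every value of a key into a single string before anything reaches the external script (the names below are purely illustrative, not code from this pull request):

// Assumes an existing RDD such as lines: RDD[String] and a key function extractKey: String => String.
val grouped = lines.groupBy(extractKey)   // RDD[(String, Seq[String])]

// All values for a key are concatenated into one string so that pipe() can emit it;
// for a heavily skewed key this single string can be enormous.
val piped = grouped
  .map { case (key, values) => (key +: values).mkString("\t") }
  .pipe("my_streaming_reducer.py")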

In this pull request, we enhanced PipedRDD and rdd.pipe() to support the two features above. For the first, we allow the user of pipe() to provide a function that specifies how each element of the RDD is transformed and written to the child process. For the second, we allow the user to provide a Seq[String] that will also be piped to the shell process; users can use broadcast values to implement the equivalent of Hadoop Streaming's distributed cache.

@AmplabJenkins

Thank you for your pull request. An admin will review this request soon.

Member (review comment)

You can probably collapse the two (or even three) additional constructors by using default arguments.

Also, wrap lines at 100 char width.

@rxin (Member) commented Jun 3, 2013

Thanks, Gavin, for submitting this. This is pretty cool. I made some comments on the coding style (and asked for a little more documentation on the new parameters).

@JoshRosen (Member)

Is there anything that transform does that can't be done by just calling map() on your RDD prior to pipe()?

Similarly, why does transform have the side-effect of writing to an output stream? Why not make transform a pure function and perform the output writing in PipedRDD?

Contributor (review comment)

The delimiter should be configurable.

Contributor Author (review comment)

Good point. I'll add another parameter to allow the user to configure this.

@mridulm (Contributor) commented Jun 3, 2013

I agree with Josh; I'm not sure what I'm missing here: can't this just be shuffle, map, pipe?

@lyogavin (Contributor Author) commented Jun 3, 2013

@rxin Thanks, Reynold. I'll make the changes according to your comments later.

@lyogavin (Contributor Author) commented Jun 3, 2013

@JoshRosen Thanks, Josh. Sorry, I don't think I described the problem very clearly. The problem we had with directly using map() was that, in our use case, after keyBy() or groupBy() the data for some keys can be very large. If we solve it with map(), we need to build one big string that concatenates all the strings in a key's Seq[String]. I think this consumes a large chunk of memory in some cases and doesn't seem like a scalable or performant solution. Since Spark is such a memory-intensive system, optimizing memory usage is very important for us. I'm not sure whether there is a better way to solve the problem without constructing a very big string with map(); if you can think of one, I'm willing to stay with map().

Member (review comment)

Why is this called with out.println instead of out.print? Using out.print would give users the freedom to incrementally build up output on a single line, whereas out.println forces them to either convert the record into one huge string or to split it across multiple lines.

Member (review comment)

Actually, why not just pass the raw PrintWriter to transform? It would simplify transform's signature and provide a lot of flexibility without sacrificing performance.

This would efficiently support the case where you want to write a reducer's input as a key followed by one value per line (with appropriate delimiters between keys) because it wouldn't force you to append '\n' to the end of each value's string before printing it.
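
As a concrete sketch of this suggestion (illustrative only, not code from the PR; it assumes transform would be handed the PrintWriter wrapping the child process's stdin):

// Hypothetical transform for a grouped record: write the key on its own line, then one
// value per line, without ever concatenating the values into a single string.
val writeRecord: ((String, Seq[String]), java.io.PrintWriter) => Unit = {
  case ((key, values), out) =>
    out.println(key)                     // the key on its own line
    values.foreach(v => out.println(v))  // one line per value, streamed as we go
}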

Contributor Author (review comment)

I don't think support for customizing "\n" is urgently needed, and I think it's out of the scope of this pull request.

Also, I think it's actually a complicated topic. A good API should strike a good trade-off between flexibility and cohesion. If we expose PrintWriter to users, we totally lose control of how they use it: they might use any of the methods in the PrintWriter interface, some of which are not what we want them to use, and we would be tightly coupled to PrintWriter's implementation details. Exposing an abstract interface, as done here, is a better way to decouple those details. Hadoop abstracts this as org.apache.hadoop.streaming.io.InputWriter and allows almost anything to be customized by supplying a custom InputWriter implementation. So if we really want to support customization well, I think we can reuse org.apache.hadoop.streaming.io.InputWriter, as we do with InputFormat in hadoopFile. But again, I think that's out of the scope of this pull request.

@JoshRosen (Member)

In this implementation, the arguments are serialized and shipped to workers as part of each task. If the arguments are expected to be large, it might make sense to ship them using broadcast variables. This could be done by wrapping arguments in a broadcast variable inside the RDD.pipe() method, then changing PipedRDD's constructor to accept a Broadcast[Seq[String]].

I imagine that arguments will only be used for passing large values to the workers; for small values (e.g. input filenames), it makes more sense to pass them via environment variables or shell arguments. To avoid confusion about its purpose, is there a better name for this new mechanism than arguments?
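
A minimal sketch of that idea (assumed names only; a PipedRDD constructor taking a Broadcast is hypothetical here, not the PR's current code):

// Inside RDD[T]: wrap the possibly-large arguments in a broadcast variable so they are
// shipped to each node once instead of being serialized into every task closure.
def pipe(command: Seq[String], arguments: Seq[String]): RDD[String] = {
  val broadcastArgs = context.broadcast(arguments)
  new PipedRDD(this, command, broadcastArgs)   // hypothetical Broadcast[Seq[String]] constructor
}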

@mridulm (Contributor) commented Jun 3, 2013

What sort of key/value distribution are you targeting, specifically with respect to the memory-pressure comment?
Heavy skew of values to keys? A large number of unique keys? Something else?

@lyogavin (Contributor Author) commented Jun 3, 2013

@JoshRosen You are right. It was supposed to be Broadcast[Seq[String]]; I missed that and will add it.

@lyogavin (Contributor Author) commented Jun 3, 2013

@mridulm In our use case, some keys may have a huge number of values, so large that we don't even know whether they can all fit on one node even with this change. Hadoop has no problem processing this kind of case because its data is persisted. So the purpose of this patch is to make Spark more scalable for these kinds of use cases.

Member (review comment)

Since you have the 3rd pipe function with default arguments, I don't think you need the first two anymore?

Member (review comment)

I agree with this; we should consolidate these into one function.

Contributor Author (review comment)

@rxin @mateiz I agree with you. I actually tried to combine as many as I could. ;) But I ran into a weird issue when I defined default arguments for two overloaded versions of pipe(): "in class RDD, multiple overloaded alternatives of method pipe define default arguments." I don't know if you have run into this as well.

I tested it; it seems that when there is more than one overloaded version of a method, only one of them is allowed to have default parameters:

scala> class A() { def f(a:Int = 0, b:Int) =0;def f(a:Int, b:String) =0}
defined class A

scala> class A() { def f(a:Int = 0, b:Int) =0;def f(a:Int = 0, b:String) =0}
<console>:7: error: in class A, multiple overloaded alternatives of method f define default arguments.
       class A() { def f(a:Int = 0, b:Int) =0;def f(a:Int = 0, b:String) =0}

It looks like there are already complaints about this restriction: http://stackoverflow.com/questions/4652095/why-does-the-scala-compiler-disallow-overloaded-methods-with-default-arguments.

Do you know any better way to resolve this problem?

The reason I kept the first two versions of pipe() was that I wanted to keep pipe(String) and pipe(Seq[String]) available to maintain backward compatibility. I thought we'd better keep users' existing code working; otherwise, after this change they would need to modify their code and rebuild. Do you think we should preserve backward compatibility? What's your opinion?

Member (review comment)

For now, how about letting only the Seq[String] version of pipe have the "advanced" features?

Contributor Author (review comment)

This sounds good to me.

@rxin (Member) commented Jun 6, 2013

Hi Gavin,

Thanks for taking the effort to clean this up. I think the number of functions and constructors can each be reduced to one with what you have. You can just delete the first two def pipe definitions and move the default arguments from the second constructor to the first one.

Member (review comment)

Typo: trasnform should be transform.

@mateiz (Member) commented Jun 9, 2013

Hey Gavin, just a couple of comments on this:

  • Using default arguments would be good, as others pointed out
  • For the context, would it make sense to pass a sequence of broadcast variables instead of just one? I can imagine apps that would share part of the context across pipe() calls.

@lyogavin (Contributor Author)

@mateiz I want to make sure I understand your point correctly: did you mean Seq[Broadcast[U]]?

@mateiz (Member) commented Jun 13, 2013

Sorry, yes, that's what I meant. Do you think it would be useful?

@lyogavin (Contributor Author)

@mateiz I think it would be useful. The way we are using it now, we have some big arrays from collect() calls in other jobs. We combine them into one array, wrap it in a Broadcast, and then stream it into the pipe process. With Seq[Broadcast[U]] I think we can avoid the array merging. The only thing I'm not sure about is whether it makes things a little too complicated, but I think it's worth it. I'll change it accordingly.
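
To make the difference concrete (the parameter and variable names below are only illustrative): with a single broadcast the collected arrays must be merged first, whereas a Seq[Broadcast[U]] would let each collected result be passed along on its own.

// Current pattern: merge the collected arrays, then wrap the result in one broadcast.
val merged = sc.broadcast(resultA ++ resultB)   // resultA, resultB: Array[String] from collect()
rdd.pipe(cmd, pipeContext = merged)             // pipeContext is a hypothetical parameter name

// Proposed: no merging step; each result keeps its own broadcast variable.
val parts = Seq(sc.broadcast(resultA), sc.broadcast(resultB))
rdd.pipe(cmd, pipeContext = parts)              // pipeContext now a Seq[Broadcast[Array[String]]]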

@rxin (Member) commented Jun 14, 2013

Gavin, if you think this is getting too complicated, the other option is to put this in a Yahoo-internal class, YahooRddOps, and have RDDs implicitly converted into YahooRddOps (the way it is done for PairRDDFunctions).

@lyogavin (Contributor Author)

Hi Matei and Reynold, I do feel it's getting a little too complicated. I thought about it further and I think there's a better solution: we can remove the pipeContext and delimiter parameters and instead define another function parameter that lets the user pipe the context. So we'll have just two function parameters, one for the context data and one for each RDD element. This way the user can use the Broadcast and the delimiter in any way they want.

I have updated the code with this new solution and addressed the other feedback.

Sorry, I merged my local branch, which makes the history a little confusing to read. Please look at fb6d733 and 4508089.
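
For readers following the thread, a rough sketch of the shape this converged on (the parameter names printPipeContext and printRDDElement follow my reading of the merged commits; treat fb6d733, 4508089 and e6d1277 as authoritative):

// Sketch of the two-function design: one callback streams the shared context into the child
// process, the other streams each RDD element, both through a supplied line-printing function.
//   def pipe(command: Seq[String],
//            env: Map[String, String] = Map(),
//            printPipeContext: (String => Unit) => Unit = null,
//            printRDDElement: (T, String => Unit) => Unit = null): RDD[String]

// Example usage (sc, sideData: Seq[String], and grouped: RDD[(String, Seq[String])] are assumed
// to exist): stream a broadcast dictionary first, then each key followed by one value per line,
// never building a single concatenated string.
val dict = sc.broadcast(sideData)
val result = grouped.pipe(
  Seq("/path/to/reducer.sh"),
  printPipeContext = (printLine: String => Unit) => dict.value.foreach(printLine),
  printRDDElement = { case ((key, values), printLine) =>
    printLine(key)
    values.foreach(printLine)
  })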

@mateiz (Member) commented Jun 17, 2013

Hey Gavin,

That's a great idea! It really simplified things. I'm going to merge it in.

mateiz added a commit that referenced this pull request Jun 17, 2013
Enhance pipe to support more features we can do in hadoop streaming
@mateiz merged commit e6d1277 into mesos:master Jun 17, 2013
Member (review comment)

There is a typo here in the doc. Gavin, do you mind submitting a new pull request to fix this?

Contributor Author (review comment)

Sure. Thanks for pointing it out. I'll fix it.

mateiz added a commit that referenced this pull request Jun 19, 2013