What's new in 2.0

With 1.3 we added support for vectorized processing and structured data, and the feedback from users was encouraging. At the same time, we increased the complexity of the API. With this version we tried to define a synthesis between all the modes (record-at-a-time, vectorized and structured) present in 1.3, with the following goals:

  • bring the footprint of the API back to 1.2 levels.
  • make sure that, whichever corner of the API one is exercising, one can rely on simple properties and invariants; writing an identity mapreduce should be trivial.
  • encourage writing the most efficient and idiomatic R code from the start, as opposed to writing against a simple API first and then developing a vectorized version for speed.

This is how we tried to get there:

  • Hadoop data is no longer seen as a big list whose elements can be any pair (key and value) of R objects, but as an on-disk representation of a variety of R data structures: lists, atomic vectors, matrices or data frames, split according to the key. The data type is determined by the type of the R variable passed to the to.dfs function or returned by map and reduce, or is assigned based on the format (csv files are read as data frames, text as character vectors, JSON TBD). Each key-value pair holds a subrange of the data (a range of rows where applicable).
  • The keyval function is always vectorized. The data payload is in the value part of a key-value pair. The key is construed as an index to use in splitting the data for its on-disk representation, particularly as it concerns the shuffle operation (the grouping that comes before the reduce phase). The model, albeit with some differences, is the R function split. So if map returns keyval(1, matrix(...)), the second argument of some reduce call will be another matrix that has the matrix returned by map as a subrange of rows. If you don't want that to happen because, say, you need to sum all the smaller matrices together, not stack them, do not fret. Have your map function return keyval(1, list(matrix(...))) and on the reduce side do a Reduce("+", vv), where vv is the second argument to reduce. Get the idea? In the first case one is building a large matrix from smaller ones, in the second just collecting the matrices to sum them up. keyval(NULL, x) or, equivalently, keyval(x) means that we don't care how the data is split. This is not allowed in the map or combine functions, where defining the grouping is important.
  • As a consequence, all lists of key-value pairs have been ostracized from the API. A single keyval call is all that can, and needs to, be made in each map and reduce call.
  • The mapreduce function is always vectorized, meaning that each map call processes a range of elements or, when applicable, rows of data, and each reduce call processes all the data associated with the same key. Please note that we are always talking in terms of R dimensions, not numbers of on-disk records, which provides some independence from the exact format of the data.
  • The structured option, which converted lists into data frames, has no reason to exist any more. What started as a list will be seen by the user as a list, data frames as data frames, etc. throughout a mapreduce, removing the need for complex conversions.
  • In 1.3 the default serialization switched from a very R-friendly native format to a more efficient sequence.typedbytes when the vectorized option was on. Since that option doesn't exist any more, we need to explain what happens to serialization. We thought that transparency was too important to give up, and therefore R native serialization is used unless the alternative is totally transparent to the user. Right now an efficient alternative kicks in only for nameless atomic vectors, with the exclusion of character vectors. There is no need for you to know this unless you are using small key-value pairs (say 100 bytes or less) and performance is important. In that case you may find that using nameless non-character vectors gives you a performance boost. Further extensions of the alternate serialization will be considered based on use cases, but the goal is to keep them semantically transparent.
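
The split-based model described in the list above can be tried out in base R alone, without Hadoop or rmr2. What follows is a minimal sketch, assuming that grouping by key behaves roughly like split on the rows and that matrix values sharing a key are stacked by rows; the variable names (v, k, groups, m1, m2, vv) are made up for illustration and nothing here calls the rmr2 API.

```r
# Toy illustration in base R only; no rmr2 or Hadoop involved.

# A 4 x 2 matrix payload with one key per row, as in a vectorized key-value pair.
v <- matrix(1:8, nrow = 4)
k <- c("a", "b", "a", "b")

# Group the rows by key, roughly what the shuffle does before reduce
# (assumption: the real shuffle differs in details, as the text notes).
groups <- lapply(split(seq_len(nrow(v)), k),
                 function(i) v[i, , drop = FALSE])

# Two hypothetical map outputs that share the same key.
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)

# Case 1: the values are matrices. Values with the same key are stacked
# by rows, so a reduce call would see one larger 4 x 2 matrix.
stacked <- rbind(m1, m2)

# Case 2: the values are wrapped in lists. Values with the same key are
# concatenated into a list of matrices, which can then be summed.
vv <- c(list(m1), list(m2))
total <- Reduce("+", vv)
```

In the first case the row count grows with each contribution; in the second, total keeps the 2 x 2 shape of the individual matrices, which is the behavior wanted when summing rather than stacking.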

Other improvements

  • The source has been deeply refactored. The subdivision of the source into many files (IO.R extras.R local.R quickcheck-rmr.R basic.R keyval.R mapreduce.R streaming.R) suggests a modularization that is not complete and not enforced by the language, but helps reduce the complexity of the implementation.
  • The testing code (not the actual tests) has been factored out as a quickcheck package, inspired by Haskell's module of the same name. For now it is neither supported nor documented, but it belonged in a separate package. It is required only to run the tests.
  • Added useful functions like scatter, gather, rmr.sample, rmr.str and dfs.size, following up on user feedback.
  • When the reduce argument to mapreduce is NULL, the mapreduce job is map-only rather than having an identity reduce.
  • The war on boilerplate code continues. keyval(v), with a single argument, means keyval(NULL, v). When you provide a value that is not a keyval return value where one is expected, one is generated with keyval(v), where v is whatever argument was provided. For instance, to.dfs(matrix(...)) means to.dfs(keyval(NULL, matrix(...))).
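
The single-argument convention and the automatic wrapping of bare values can be mimicked in plain R. This is only a toy sketch: toy_keyval and as_toy_keyval are hypothetical helpers written for illustration, and they are not how rmr2 implements keyval.

```r
# Hypothetical stand-in for keyval: with one argument, that argument is
# treated as the value and the key defaults to NULL.
toy_keyval <- function(key, val) {
  if (missing(val)) {
    val <- key
    key <- NULL
  }
  structure(list(key = key, val = val), class = "toy_keyval")
}

# Hypothetical coercion step: wrap anything that is not already a
# key-value pair, leaving real pairs untouched.
as_toy_keyval <- function(x) {
  if (inherits(x, "toy_keyval")) x else toy_keyval(x)
}

m <- matrix(1:6, nrow = 3)
kv_from_value <- as_toy_keyval(m)       # bare value gets wrapped
kv_explicit   <- toy_keyval(NULL, m)    # explicit NULL key
same <- identical(kv_from_value, kv_explicit)
```

Under these assumptions both calls produce the same pair, which mirrors the equivalence between to.dfs(matrix(...)) and to.dfs(keyval(NULL, matrix(...))) stated above.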