[SPARK-22451][ML] Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories) #19666
Conversation
Test build #83481 has finished for PR 19666 at commit
smurching left a comment:
I like this idea but I'm a bit confused by some parts of the code/wondering if there's anything we can simplify - I've left a few questions/comments.
I'm wondering if there's a simple way to iterate over the subsets for a given unordered categorical feature instead of generating the subsets recursively (e.g. by iterating over a gray code). This might simplify the logic in traverseUnorderedSplits; however, I'm fine with the recursive approach if it's well-documented.
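For reference, iterating over subsets in Gray-code order is straightforward; here is a minimal Python sketch of the idea smurching suggests (an illustration only, not the Spark code — the function name `gray_code_subsets` is made up). Because successive masks differ in exactly one bit, a running statistics accumulator could be updated by adding or removing a single bin's stats per step, so the total work stays proportional to the number of subsets:

```python
def gray_code_subsets(num_categories):
    """Yield (subset_mask, flipped_bit) pairs in Gray-code order.

    Successive masks differ in exactly one bit; flipped_bit is the
    index of the bit that changed relative to the previous mask.
    """
    prev = 0
    for i in range(1, 1 << num_categories):
        mask = i ^ (i >> 1)                        # binary-reflected Gray code
        flipped = (mask ^ prev).bit_length() - 1   # index of the changed bit
        yield mask, flipped
        prev = mask

# Example: 3 categories -> masks visited in Gray order
masks = [m for m, _ in gray_code_subsets(3)]       # [1, 3, 2, 6, 7, 5, 4]
```

Note this would still need the same symmetry filtering as the recursive approach, since a mask and its complement describe the same split.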
```scala
val numBins = binAggregates.metadata.numBins(featureIndex)
val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)

val binStatsArray = Array.tabulate(numBins) { binIndex =>
```
Could you please add a comment explaining what this is? E.g.:
// Each element of binStatsArray stores pre-computed label statistics for a single bin of the current feature
```scala
(splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)

if (gainAndImpurityStats.gain > bestGain) {
  bestGain = gainAndImpurityStats.gain
  bestSet = set | new BitSet(numBins) // copy set
```
Why not use set.copy()?
The class does not support copy().
```scala
var bestLeftChildStats: ImpurityCalculator = null
var bestRightChildStats: ImpurityCalculator = null

traverseUnorderedSplits[ImpurityCalculator](numBins, null,
```
Could you please add a comment explaining what this does? E.g.:
// Computes the best split for the current feature, storing the result across the vars above
```scala
  categories
}

private[tree] def traverseUnorderedSplits[T](
```
Could you please add a docstring for this method, since it's a bit complicated?
Also, does traverseUnorderedSplits need to take a type parameter / two different closures as method arguments? AFAICT the use of a type parameter/closures here allow us to unit test this functionality on a simple example, but I wonder if we could simplify this somehow.
```scala
} else {
  subSet.set(binIndex)
  val leftChildCombNumber = combNumber + (1 << binIndex)
  // pruning: only need combNumber satisfy: 1 <= combNumber <= numSplits
```
If I understand correctly, the check if (leftChildCombNumber <= numSplits) helps us ensure that we consider each split only once, right?
Yes. For example, "00101" and "11010" are equivalent splits, so we should traverse only one of them.
Here I use the condition 1 <= combNumber <= numSplits to do the pruning. It simply filters out the other half of the splits.
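The enumeration with this pruning can be sketched in a few lines of Python (a standalone illustration under the PR's encoding, not the Spark code; `unordered_splits` is a made-up name). Each left-child set is encoded as a bitmask (the combNumber), and any branch whose mask would exceed numSplits is cut, since setting further (higher) bits can only increase the mask:

```python
def unordered_splits(num_bins):
    """Enumerate each unordered split of num_bins bins exactly once.

    A split and its mirror image are equivalent, so only masks with
    1 <= combNumber <= numSplits are visited, where
    numSplits = 2**(num_bins - 1) - 1.
    """
    num_splits = (1 << (num_bins - 1)) - 1
    results = []

    def dfs(bin_index, comb_number):
        if bin_index == num_bins:
            if 1 <= comb_number <= num_splits:   # skip the empty left child
                results.append(comb_number)
            return
        dfs(bin_index + 1, comb_number)          # current bin goes right
        left = comb_number + (1 << bin_index)
        if left <= num_splits:                   # pruning: mirror splits cut
            dfs(bin_index + 1, left)             # current bin goes left

    dfs(0, 0)
    return results

# 4 bins -> 2**(4-1) - 1 = 7 distinct splits (masks 1..7)
```

Every mask from 1 to numSplits appears exactly once, which matches the 2^(numCategories - 1) - 1 count in the PR description.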
@smurching I guess iterating over a gray code would have a higher time complexity, O(n * 2^n) (not very sure, maybe there are more efficient algorithms?), while the recursive traversal in my PR only needs O(2^n).
```scala
}

test("traverseUnorderedSplits") {
```
Since traverseUnorderedSplits is a private method, I wonder whether we can check the unordered splits on DecisionTree directly? For example, create a tiny dataset and generate a shallow tree (depth = 1?). I know such a test case is difficult (maybe impossible) to design, but it would focus on behavior instead of implementation.
So how can we test all possible splits to make sure the generated splits are all correct? If a tree is generated, only the best split remains.
I believe that unordered features will benefit a lot from the idea; however, I have two questions:
`dfs(binIndex + 1, combNumber, stats)`
Anyway, thanks for your good work, @WeichenXu123.
@facaiy Thanks for your review! I put more explanation on the design purpose of `traverseUnorderedSplits`.
Also cc @smurching Thanks!
Hi, I wrote a demo in Python. I'll be happy if it could be useful. For N bins, say [...]. Please correct me if I'm wrong. Thanks very much.

```python
#!/usr/bin/env python

def gen_splits(bins):
    if len(bins) == 1:
        return bins
    results = []
    partial_res = []
    gen_splits_iter(1, bins, partial_res, results)
    return results

def gen_splits_iter(dep, bins, partial_res, results):
    if partial_res:
        left_splits = partial_res[:]
        right_splits = [x for x in bins if x not in left_splits]
        results.append("left: {:20}, right: {}".format(str(left_splits), right_splits))
    for m in range(dep, len(bins)):
        partial_res.append(bins[m])
        gen_splits_iter(m + 1, bins, partial_res, results)
        partial_res.pop()

if __name__ == "__main__":
    print("first example:")
    bins = ["a", "b", "c"]
    print("bins: {}\n-----".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
    print("\n\n=============")
    print("second example:")
    bins = ["a", "b", "c", "d", "e"]
    print("bins: {}\n-----".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
```

Logs:

```
~/Downloads ❯❯❯ python test.py
first example:
bins: ['a', 'b', 'c']
-----
left: ['b']             , right: ['a', 'c']
left: ['b', 'c']        , right: ['a']
left: ['c']             , right: ['a', 'b']


=============
second example:
bins: ['a', 'b', 'c', 'd', 'e']
-----
left: ['b']             , right: ['a', 'c', 'd', 'e']
left: ['b', 'c']        , right: ['a', 'd', 'e']
left: ['b', 'c', 'd']   , right: ['a', 'e']
left: ['b', 'c', 'd', 'e'], right: ['a']
left: ['b', 'c', 'e']   , right: ['a', 'd']
left: ['b', 'd']        , right: ['a', 'c', 'e']
left: ['b', 'd', 'e']   , right: ['a', 'c']
left: ['b', 'e']        , right: ['a', 'c', 'd']
left: ['c']             , right: ['a', 'b', 'd', 'e']
left: ['c', 'd']        , right: ['a', 'b', 'e']
left: ['c', 'd', 'e']   , right: ['a', 'b']
left: ['c', 'e']        , right: ['a', 'b', 'd']
left: ['d']             , right: ['a', 'b', 'c', 'e']
left: ['d', 'e']        , right: ['a', 'b', 'c']
left: ['e']             , right: ['a', 'b', 'c', 'd']
```
In fact, I'm not sure whether the idea is right, so don't hesitate to correct me. I assume the algorithm requires O(2^{N-1}) complexity.
@facaiy Your idea also looks reasonable. So we can use the condition "exclude the first bin" to do the pruning (filter out the other half of the symmetric splits). This condition looks simpler than `1 <= combNumber <= numSplits`.
Thank you, @WeichenXu123. You can also use the condition "include the first bin" to filter the left splits. Perhaps it is better.
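The "include the first bin" convention can be sketched in Python with `itertools.combinations` (an illustration only; `splits_including_first_bin` is a made-up name). Fixing the first bin in the left child discards the mirror image of every split, so exactly 2^(N-1) - 1 splits remain for N bins:

```python
from itertools import combinations

def splits_including_first_bin(bins):
    """Enumerate unordered splits; the left child always contains bins[0].

    Mirror-image splits are thereby skipped, giving 2**(n-1) - 1 splits.
    """
    rest = bins[1:]
    splits = []
    # left = {bins[0]} plus a proper subset of the rest, so right is non-empty
    for size in range(0, len(rest)):
        for extra in combinations(rest, size):
            left = [bins[0]] + list(extra)
            right = [b for b in rest if b not in extra]
            splits.append((left, right))
    return splits

# 5 bins -> 2**4 - 1 = 15 splits, each with "a" in the left child
```

This produces the same 15 splits as the demo above for 5 bins, just with the roles of left and right swapped.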
Btw, this is going to conflict with #19433 a lot. @WeichenXu123 and @smurching, have you planned for merging one before the other?
Discussed with @jkbradley, I'll split up #19433 so that the parts of it that'd conflict with this PR (refactoring RandomForest.scala into utility classes) can be merged first.
OK. I will wait for @smurching to get the split-out parts of #19433 merged first, and then I will update this PR.
```scala
// Hack: If a categorical feature has only 1 category, we treat it as continuous.
// TODO(SPARK-9957): Handle this properly by filtering out those features.
if (numCategories > 1) {
  // Decide if some categorical features should be treated as unordered features,
```
Change `,` to `.`
What changes were proposed in this pull request?

We do not need to generate all possible splits for unordered features before aggregation.

Change `mixedBinSeqOp` (which runs on the executors) so that, for each unordered feature, we collect the same statistics as for ordered features. So for unordered features, we only need O(numCategories) space for each feature's statistics. After the driver side gets the aggregate result, it generates all possible split combinations and computes the best split.

This will reduce the decision tree aggregate size for each unordered feature from O(2^numCategories) to O(numCategories), where numCategories is the arity of this unordered feature. This also reduces the CPU cost on the executor side: the time complexity for each unordered feature drops from O(numPoints * 2^numCategories) to O(numPoints). This won't increase the time complexity of the unordered-feature best-split computation on the driver side.

How do we construct the statistics of the 2^(numCategories - 1) - 1 splits from the separate statistics of the numCategories bins?

I use a recursive function `traverseUnorderedSplits` to traverse all possible splits, and note that, during the traversal, the statistics for each split are accumulated, so the total time complexity stays O(2^n).

An example: suppose an unordered feature has categories [a, b, c, d]. Each split is encoded by a binary representation, e.g. "1010" represents the split "[a, c] vs [b, d]" (a "1" means the category is allocated to the left child). The `combNumber` of the split is the number whose binary representation matches it (note that the binary representation has its lower bits starting from the left). The `combNumber` will be used for pruning (explained below).

Now we want to traverse all possible splits. There are 2^numCategories possible binary representations, but only 2^(numCategories - 1) - 1 splits: we need to exclude the all-"0" and all-"1" representations, and of the remaining ones we need to cut half, because split "[a, c] vs [b, d]" and split "[b, d] vs [a, c]" are equivalent, so we only need to traverse one of them. The way I filter out the other half is via the `combNumber`: we only traverse the cases whose `combNumber` satisfies 1 <= combNumber <= numSplits.

The `traverseUnorderedSplits` function does this. First look at the `dfs` function inside `traverseUnorderedSplits`: it is recursive, and while running it computes the `combNumber` at the same time. When the current `combNumber` exceeds numSplits, it stops recursing.

The recursive `dfs` also accumulates the statistics. When generating a new split, for example "[a, c] vs [b, d]", we need to compute the statistics for the left child, which equal stats[a] + stats[c]. But I do not compute this accumulation after the split is generated, because that would make the algorithm's time complexity O(n * 2^n); instead I accumulate the stats during the recursion. You can check the `stats` param of `dfs` to see how it works.

So the two main reasons `traverseUnorderedSplits` is designed this way are: the pruning via 1 <= combNumber <= numSplits, and the incremental accumulation of statistics during the recursion.

How was this patch tested?

UT added.
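The whole scheme described above (per-bin aggregation, symmetry pruning, and incremental stats accumulation inside the recursion) can be sketched end-to-end in Python. This is a minimal illustration, not Spark's actual code: `best_unordered_split`, `bin_stats`, and `impurity` are made-up names, and the per-bin statistics are plain label-count vectors. The key point is that the left-child statistics are threaded through the recursion, so forming each split's statistics costs O(1) extra work per recursion step and the whole search stays O(2^n) instead of O(n * 2^n):

```python
def best_unordered_split(bin_stats, impurity):
    """Find the best split of an unordered feature from per-bin stats.

    bin_stats: list of per-bin label-statistic vectors (one per category).
    impurity:  function (left_stats, right_stats) -> gain; larger is better.
    Returns (combNumber of the best split, its gain).
    """
    num_bins = len(bin_stats)
    num_splits = (1 << (num_bins - 1)) - 1
    total = [sum(col) for col in zip(*bin_stats)]
    best = {"gain": float("-inf"), "mask": 0}

    def dfs(bin_index, comb_number, left_stats):
        if bin_index == num_bins:
            if 1 <= comb_number <= num_splits:
                # right-child stats are derived from the totals in O(k)
                right_stats = [t - l for t, l in zip(total, left_stats)]
                gain = impurity(left_stats, right_stats)
                if gain > best["gain"]:
                    best["gain"] = gain
                    best["mask"] = comb_number
            return
        dfs(bin_index + 1, comb_number, left_stats)      # bin -> right child
        new_comb = comb_number + (1 << bin_index)
        if new_comb <= num_splits:                       # symmetry pruning
            # accumulate this bin's stats into the left child incrementally
            new_stats = [l + s for l, s in zip(left_stats, bin_stats[bin_index])]
            dfs(bin_index + 1, new_comb, new_stats)      # bin -> left child

    dfs(0, 0, [0] * len(bin_stats[0]))
    return best["mask"], best["gain"]
```

For example, with three categories whose label counts are [3, 0], [0, 3], [3, 0] and a negated weighted-Gini gain, the best split isolates the middle category (combNumber 2, i.e. binary "010" read low-bit-first).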