Skip to content
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,7 @@ class IndexedRowMatrix @Since("1.0.0") (
toBlockMatrix(1024, 1024)
}


Copy link
Member

@viirya viirya Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the extra line.

/**
* Converts to BlockMatrix. Creates blocks of `SparseMatrix`.
* @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
Expand All @@ -112,6 +113,67 @@ class IndexedRowMatrix @Since("1.0.0") (
toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
}

/**
* Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style of the comments here and below is not correct. Can you fix it?

*/
def toBlockMatrixDense(): BlockMatrix = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to have both toBlockMatrix and toBlockMatrixDense for converting to BlockMatrix ?

Shall we combine them and have just one toBlockMatrix method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been going back and forth on this myself. I think converting to a BlockMatrix backed by dense matrices is better default behavior than one backed by sparse matrices, but the the current implementation of toBlockMatrix advertises that it converts to a BlockMatrix backed SparseMatrices, and I thought changing that could negatively affect people who want that behavior. I suppose we could add a default argument to toBlockMatrix like isSparse = true so that it would not break anyone's code but people would be able to convert to dense version if they wanted. What do you think of that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if DenseMatrix-backed is better default behavior for toBlockMatrix. Actually the rows in IndexedRowMatrix can be sparse or dense. Choose which one, SparkMatrix-backed or DenseMartix-backed, is totally depending on the use case.

Looks like toBlockMatrixDense is already a non-small function. Merging it with current toBlockMatrix might not a good idea. I'd keep it as it's now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think we can generalize the change to SparseMatrix-based BlockMatrix too. But maybe we can do it in following PR.

Copy link
Contributor Author

@johnc1231 johnc1231 Apr 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this comment. Moved my thoughts on this to new comment down at the bottom of the thread.

toBlockMatrixDense(1024, 1024)
}

/**
* Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
* @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
* a smaller value. Must be an integer value greater than 0.
* @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
* a smaller value. Must be an integer value greater than 0.
* @return a [[BlockMatrix]]
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a @Since annotation like toBlockMatrix. Although I doubt this can make in 2.2.0, you can set it to 2.2.0 temporarily. If there is a suggested version from committers, we can change it later.

def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
require(rowsPerBlock > 0,
s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
require(colsPerBlock > 0,
s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")

val m = numRows()
val n = numCols()
val lastRowBlockIndex = m / rowsPerBlock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the index of the final, smaller block, if any? I get it, but if m = 100 and n = 10 then this is 10, which is not the index of the last row block. There is no leftover smaller block and the last one is 9. I think the code works and I'm splitting hairs but wonder if this is clearer if it's the "remainder" block index or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Replaced word "last" with "remainder" and added a small clarifying comment.

val lastColBlockIndex = n / colsPerBlock
val lastRowBlockSize = (m % rowsPerBlock).toInt
val lastColBlockSize = (n % colsPerBlock).toInt
val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt

val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
val blockRow = ir.index / rowsPerBlock
val rowInBlock = ir.index % rowsPerBlock

ir.vector.toArray
.grouped(colsPerBlock)
.zipWithIndex
.map({ case (values, blockColumn) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: where you are writing ({ ... }) just write { ... }

((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
})
}).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rowsPerBlock, colsPerBlock)).map({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't miss anything, the parameters of GridPartitioner are wrong. Should be:

GridPartitioner(numRowBlocks, numColBlocks, rows.partitions.length)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. My code makes the assumption that there is a single block per partition, which is incorrect. Thanks for that.

case ((blockRow, blockColumn), itr) =>
val actualNumRows: Int =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't put a type on vals/vars unless it's important for clarity or needed for a cast

if (blockRow == lastRowBlockIndex) lastRowBlockSize else rowsPerBlock
val actualNumColumns: Int =
if (blockColumn == lastColBlockIndex) lastColBlockSize else colsPerBlock

val arraySize = actualNumRows * actualNumColumns
val matrixAsArray = new Array[Double](arraySize)
itr.foreach({ case (rowWithinBlock, values) =>
var i = 0
while (i < values.length) {
matrixAsArray.update(i * actualNumRows + rowWithinBlock, values(i))
i += 1
}
})
((blockRow, blockColumn), new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray))
})
new BlockMatrix(blocks, rowsPerBlock, colsPerBlock)
}

/**
* Converts this matrix to a
* [[org.apache.spark.mllib.linalg.distributed.CoordinateMatrix]].
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -89,11 +89,42 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {

test("toBlockMatrix") {
val idxRowMat = new IndexedRowMatrix(indexedRows)

// Tests when n % colsPerBlock != 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only tests a IndexedRowMatrix consisted of DenseVector. We should add another test to test the cases of SparseVector and the mix of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write more tests today.

val blockMat = idxRowMat.toBlockMatrix(2, 2)
assert(blockMat.numRows() === m)
assert(blockMat.numCols() === n)
assert(blockMat.toBreeze() === idxRowMat.toBreeze())

// Tests when m % rowsPerBlock != 0
val blockMat2 = idxRowMat.toBlockMatrix(3, 1)
assert(blockMat2.numRows() === m)
assert(blockMat2.numCols() === n)
assert(blockMat2.toBreeze() === idxRowMat.toBreeze())

intercept[IllegalArgumentException] {
idxRowMat.toBlockMatrix(-1, 2)
}
intercept[IllegalArgumentException] {
idxRowMat.toBlockMatrix(2, 0)
}
}

test("toBlockMatrixDense") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see you test newly added toBlockMatrixDense, do you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, you seem to have commented right on the toBlockMatrixDense tests. Originally, toBlockMatrix had only the tests marked with the comment // Tests when n % colsPerBlock != 0. I added the tests marked with // Tests when m % rowsPerBlock != 0 to toBlockMatrix, then used the same tests for the Dense version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see what you mean now, will fix.

val idxRowMat = new IndexedRowMatrix(indexedRows)

// Tests when n % colsPerBlock != 0
val blockMat = idxRowMat.toBlockMatrix(2, 2)
assert(blockMat.numRows() === m)
assert(blockMat.numCols() === n)
assert(blockMat.toBreeze() === idxRowMat.toBreeze())

// Tests when m % rowsPerBlock != 0
val blockMat2 = idxRowMat.toBlockMatrix(3, 1)
assert(blockMat2.numRows() === m)
assert(blockMat2.numCols() === n)
assert(blockMat2.toBreeze() === idxRowMat.toBreeze())

intercept[IllegalArgumentException] {
idxRowMat.toBlockMatrix(-1, 2)
}
Expand Down