Skip to content

Conversation

@jegonzal
Copy link
Contributor

This simple patch filters out extra whitespace entries.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22442 has started for PR 2996 at commit e028e84.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22442 has finished for PR 2996 at commit e028e84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22442/
Test FAILed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor (but since we need to run Jenkins again): filter(_.nonEmpty) is more readable

@jegonzal
Copy link
Contributor Author

Not sure why it failed the test. Is this an issue with the testing framework?

@jegonzal
Copy link
Contributor Author

The following implementation seems a bit more efficient but is needlessly complicated.

  val parsed = sc.textFile(path, minPartitions)
      .map(_.trim)
      .filter(line => !(line.isEmpty || line.startsWith("#")))
      .map { line =>
        val items = line.split(' ')
        val label = items.head.toDouble
        // Count the number of empty values                                                                                                                                                                     
        var i = 1
        var emptyValues = 0
        while (i < items.size) {
          if (items(i).isEmpty) emptyValues += 1
          i += 1
        }
        // Determine the number of non-zero entries                                                                                                                                                             
        val nnzs = items.size - 1 - emptyValues
        // Compute the indices                                                                                                                                                                                  
        val indices = new Array[Int](nnzs)
        val values = new Array[Double](nnzs)
        i = 1
        var j = 0
        while (i < items.size) {
          if (!items(i).isEmpty) {
            val indexAndValue = items(i).split(':')
            indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.                                                                                                                      
            values(j) = indexAndValue(1).toDouble
            j += 1
          }
          i += 1
        }
        // assert(j == nnzs)                                                                                                                                                                                    
        // val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item =>                                                                                                                     
        //   val indexAndValue = item.split(':')                                                                                                                                                                
        //   val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.                                                                                                                      
        //   val value = indexAndValue(1).toDouble                                                                                                                                                              
        //   (index, value)                                                                                                                                                                                     
        // }.unzip                                                                                                                                                                                              
        (label, indices, values)
      }

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22494 has started for PR 2996 at commit e0227ab.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22494 has finished for PR 2996 at commit e0227ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22494/
Test PASSed.

@asfgit asfgit closed this in c7ad085 Oct 30, 2014
@mengxr
Copy link
Contributor

mengxr commented Oct 30, 2014

LGTM. Merged into master. If the performance gain is worth the extra code complexity, we can switch to the new implementation. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants