Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve each_row_streaming memory usage #436

Closed

Conversation

will89
Copy link
Contributor

@will89 will89 commented Jul 14, 2018

I ran into this issue, #179, when working with an excel file that had 180k rows and the xml for the sheet file was about 386MB on disk. The issue seemed to have narrowed it down to doc.xpath forcefully loading the whole xml into memory. This changes extract_hyperlinks to use Roo::Utils.each_element in order to stream the xml file.

@coveralls
Copy link

coveralls commented Jul 14, 2018

Coverage Status

Coverage increased (+0.02%) to 94.045% when pulling 903dc65 on will89:issue/extract-hyperlinks-mem-consumption into 782420b on roo-rb:master.

@chopraanmol1
Copy link
Member

@will89 thank you for opening this PR. Can you provide a memory profiling and benchmark script with a sample file with masked data?

@will89
Copy link
Contributor Author

will89 commented Sep 11, 2018

@chopraanmol1 Thanks for taking a look at this! I will work on providing a benchmark script with a sample file. Any preferred way for generating the memory profile.

@chopraanmol1
Copy link
Member

@will89 you can use memory_profiler gem

@chopraanmol1 chopraanmol1 self-requested a review September 12, 2018 04:51
@tgturner
Copy link
Contributor

@will89 is it possible for you to provide an excel spreadsheet/case that demonstrates the benefit? I ran the following test case with a spreadsheet that I had on hand and these were the results

# frozen_string_literal: true

begin
  require "bundler/inline"
rescue LoadError => e
  $stderr.puts "Bundler version 1.10 or later is required. Please update your Bundler"
  raise e
end

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem "roo", path: "/Users/gturner/development/roo"
  gem "minitest"
  gem "memory_profiler"
end

require "roo"
require "minitest/autorun"
require "memory_profiler"

class BugTest < Minitest::Test
  def test_stuff
    # add your test here
    report = MemoryProfiler.report do
      sheet = Roo::Spreadsheet.open("/Users/gturner/Downloads/abm_codes.xlsx")
      sheet.each_row_streaming do |row|
        row[1]
      end
    end
    report.pretty_print
  end
end
# Master

Total allocated: 570191 bytes (4982 objects)
Total retained:  30925 bytes (569 objects)

allocated memory by gem
-----------------------------------
    264131  roo/lib
    161445  rubyzip-1.2.2
    143182  nokogiri-1.8.4
      1433  lib

# With PR

Total allocated: 577552 bytes (5492 objects)
Total retained:  30266 bytes (564 objects)

allocated memory by gem
-----------------------------------
    276874  roo/lib
    159463  rubyzip-1.2.2
    139782  nokogiri-1.8.4
      1433  lib     1433  lib

As you can see, no gain in allocated memory, and very little in retained.

@will89
Copy link
Contributor Author

will89 commented Sep 15, 2018

Yeah I'm trying to work on generating one in my free time. The only file I have immediately on hand has customer data in it. @tgturner

@will89
Copy link
Contributor Author

will89 commented Sep 16, 2018

I believe this file reproduces the issue,
foo.xlsx. @tgturner

@will89
Copy link
Contributor Author

will89 commented Sep 16, 2018

I modified the test a bit, I think the MemoryProfiler might have issues completing since my process memory went completely out of control with it enabled.

I used the code snippet from, https://stackoverflow.com/questions/7220896/get-current-ruby-process-memory-usage, to print the memory usage at certain times. I modified the Roo::Excelx::SheetDoc#hyperlinks to print the before and after memory usage of calling hyperlinks.

require "roo"
require "minitest/autorun"
require "memory_profiler"

class BugTest < Minitest::Test
  def test_stuff
    Roo::Excelx::SheetDoc.class_eval do
      def hyperlinks(relationships)
        @hyperlinks ||= begin
          pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
          puts "Before Extract #{pid} - #{size}"
          res = extract_hyperlinks(relationships)
          pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
          puts "After Extract #{pid} - #{size}"
          res
        end
      end
    end
    # add your test here
    sheet = Roo::Spreadsheet.open("foo.xlsx")
    sheet.each_row_streaming do |row|
      row[1]
      break
    end
    pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
    puts "#{pid} - #{size}"
  end
end

This is what I got when running with the file above.

# Master
Before Extract 28164 - 279,744
After Extract 28164 - 2,883,688
28164 - 2,888,068

# With PR
Before Extract 27936 - 177,476
After Extract 27936 - 274,868
27936 - 274,948

It's weird in that other very large excel files did not trigger this memory issue, so I'm not sure how to fully explain this. However, it seems that using xpath on any potentially large xml file is probably not advised.

@chopraanmol1
Copy link
Member

@will89 Result do look good, but I'm not sure if xpath is causing issue. Could you test another variant of extract_hyperlinks which uses xpath just like master but do not create intermediate array while building Hash.

So instead of Hash[hyperlinks.map{... [key, value]}] pattern to create hash, use hyperlinks.each{........ hsh[key] = value} like pattern (Used in your patch).

We could be sure if xpath is problematic or not if we compare the result of all 3 variant.

@will89
Copy link
Contributor Author

will89 commented Sep 16, 2018

Modified extract_hyperlinks to

    def extract_hyperlinks(relationships)
      return {} unless (hyperlinks = doc.xpath('/worksheet/hyperlinks/hyperlink'))
      result = {}

      hyperlinks.each do |hyperlink|
         if hyperlink.attribute('id') && (relationship = relationships[hyperlink.attribute('id').text])
           result[::Roo::Utils.ref_to_key(hyperlink.attributes['ref'].to_s)] = relationship.attribute('Target').text
         end
      end

      result
    end

Got

Before Extract 64671 - 176,592
After Extract 64671 - 3,535,648
64671 - 3536064

@chopraanmol1
Copy link
Member

@will89 Thank you for posting memory result. I've also verified this and observed similar result, but this new implementation is terribly slow(3x slow). I personally do like this solution, but I can't merge it in its current state given the performance slowdown.

I've also reviewed the samle xlsx file you provided if this file is similar to file you are using (no hyperlink in the file or you don't care about hyperlink for your application) then, I'll suggest you to use no_hyperlink option for now:

Roo::Spreadsheet.open("foo.xlsx", no_hyperlinks: true)

I'm not closing this PR, just incase you want to improve this PR with regards to performance as well. I'll also look into improving performance for Roo::Utils.each_element implemetation

@will89
Copy link
Contributor Author

will89 commented Sep 17, 2018

Thanks for pointing out that option! Is there more info on what it does? Is there an easy way to determine if I rely on hyperlinks in my application (like roo methods that are called that use them)?

In regards to this branch, is the branch 3x slower or was just extract_hyperlinks 3x slower? I just merged with latest master which I see has some new performance enhancements. Can you share the benchmarks?

@chopraanmol1
Copy link
Member

chopraanmol1 commented Sep 17, 2018

Sample xlsx file which includes hyperlink https://github.com/roo-rb/roo/blob/master/test/files/link.xlsx
You can check the above-specified file to see an example of value with a hyperlink..

When roo finds a hyperlinked cell it wraps value of the cell in Roo::Link which is a subclass of string.

Roo::Link class implements 3 methods url, href, to_uri. You can check your application for usage of this 3 methods in the context of cell's value or see if your applications check for value to be Roo::Link, if doesn't it will be safe to use no_hyperlinks option.

Benchmark Script:

require "roo"
require "minitest/autorun"
require "benchmark"

class BugTest < Minitest::Test
  def test_stuff
    Roo::Excelx::SheetDoc.class_eval do
      def hyperlinks(relationships)
        @hyperlinks ||= begin
          res = nil
          puts Benchmark.measure{
            res = extract_hyperlinks(relationships)
          }
          res
        end
      end
    end
    # add your test here
    sheet = Roo::Spreadsheet.open("foo.xlsx")
    sheet.each_row_streaming do |row|
      row[1]
      break
    end
    pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
    puts "#{pid} - #{size}"
  end
end

@will89
Copy link
Contributor Author

will89 commented Sep 17, 2018

I think you were right in that there was some performance issues in Roo::Utils.each_element. I think the way I called it was generating a new array for each node element in the tree. I modified it, 903dc65, to pull the array generation outside of the loop. Could you please try the benchmark with the latest version of this branch please?

@chopraanmol1
Copy link
Member

Approximate benchmark result of extract_hyperlinks when I originally tested was ~14 sec for master and ~42 sec for your orignal implementation. Considering noise your orignal implementation is 2.8-3.0x slower that that of master.

When I was benchmarking this I also tried tweaking Roo::Utils.each_element a bit including wrapping elements in array outside of block. Even though I don't remember result, it was still too slow.

@will89
Copy link
Contributor Author

will89 commented Sep 17, 2018

Interesting, your computer must be significantly better than mine :)
Running that test against the file I linked above I got the following:

gem "roo", github: 'will89/roo', branch: 'issue/extract-hyperlinks-mem-consumption'

 50.088052   4.120130  54.208182 ( 59.366028)
10988 - 287764

Finished in 235.568521s, 0.0042 runs/s, 0.0000 assertions/s.

gem "roo", github: 'roo-rb/roo', branch: 'master'

 28.649450  12.480391  41.129841 ( 45.337095)
11292 - 3446960

Finished in 220.953135s, 0.0045 runs/s, 0.0000 assertions/s.

When you tested this branch did it include this commit, 5283e01? I think that commit brought in a lot of the performance increases you brought into this gem recently, which thanks for doing that!

@will89
Copy link
Contributor Author

will89 commented Sep 17, 2018

Ah, I see now. I think I was looking at the wrong times. I was comparing the Finished from the test case instead of the output from the Benchmark.measure. 45 master vs 59 this branch.

@will89
Copy link
Contributor Author

will89 commented Sep 19, 2018

Do you know if attempting to use this interface from nokogiri,
https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX/Parser, to visit the xml would be faster than using the Nokogiri::XML::Reader that https://github.com/roo-rb/roo/blob/master/lib/roo/utils.rb#L87 uses?

@chopraanmol1
Copy link
Member

I've not looked into it yet. If you have some benchmark and memory comparison for both approach it can be great place to start with.

@will89
Copy link
Contributor Author

will89 commented Oct 12, 2018

Sorry for leaving this neglected for so long.

I have been unable to alter each_element to be fast enough for this implementation of extract_hyperlinks to not be a performance regression. However, I think the option of no_hyperlinks is probably a good enough workaround for the memory problem.

Is the modification to lib/roo/utils.rb worth keeping otherwise happy to close this PR? thoughts @chopraanmol1

@chopraanmol1
Copy link
Member

@will89 let's close this PR now. Since there is still scope of improvement you can open a new PR focusing on improving each_row_streaming's memory allocation and performance. Few of the things which you can do in this PR will be as follow:

  1. The modification you made in lib/roo/utils.rb in this PR.
  2. Lazily create
    excelx_type = [:numeric_or_formula, format.to_s]
    this array only when required (maybe create a method name numeric_or_formula_or(format) and call it a bit later when required). And you could also memoize the frozen array (assuming that format value are repetitive).
  3. You can also profile and find many more such optimization scope.

All of above if together make good performance & memory usage improvement we could go ahead with it.

I'll suggest benchmarking and profiling each_row_streaming method with no no_hyperlinks option.

@chopraanmol1
Copy link
Member

@will89 let me know if you're interested(or not) in opening another PR

@will89
Copy link
Contributor Author

will89 commented Oct 12, 2018

@chopraanmol1 Yeah, I'll try to make a follow on PR to address these things.

chopraanmol1 added a commit to chopraanmol1/roo that referenced this pull request Jan 22, 2019
Revisiting roo-rb#436 .

This Patch uses relationships data to determine if a sheet includes hyperlinks or not.

As extract_hyperlinks loads the whole document in memory it is quite problematic for each_row_streaming. This patch tries to skip extract_hyperlinks when not required.
@chopraanmol1
Copy link
Member

@will89 I've revisited one of the issues raised here in #488

With the linked patch you will no longer required to set no_hyperlinks option.

chopraanmol1 added a commit to chopraanmol1/roo that referenced this pull request Feb 1, 2019
Revisiting roo-rb#436 .

This Patch uses relationships data to determine if a sheet includes hyperlinks or not.

As extract_hyperlinks loads the whole document in memory it is quite problematic for each_row_streaming. This patch tries to skip extract_hyperlinks when not required.
aravindm pushed a commit to chobiwa/roo that referenced this pull request Jun 18, 2019
Revisiting roo-rb#436 .

This Patch uses relationships data to determine if a sheet includes hyperlinks or not.

As extract_hyperlinks loads the whole document in memory it is quite problematic for each_row_streaming. This patch tries to skip extract_hyperlinks when not required.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants