Improve each_row_streaming memory usage #436
@will89 thank you for opening this PR. Can you provide a memory profile and a benchmark script, along with a sample file with masked data?
@chopraanmol1 Thanks for taking a look at this! I will work on providing a benchmark script with a sample file. Is there a preferred way to generate the memory profile?
@will89 you can use the memory_profiler gem
@will89 is it possible for you to provide an Excel spreadsheet/case that demonstrates the benefit? I ran the following test case with a spreadsheet that I had on hand and these were the results
# frozen_string_literal: true
begin
  require "bundler/inline"
rescue LoadError => e
  $stderr.puts "Bundler version 1.10 or later is required. Please update your Bundler"
  raise e
end
gemfile(true) do
  source "https://rubygems.org"
  git_source(:github) { |repo| "https://github.com/#{repo}.git" }
  gem "roo", path: "/Users/gturner/development/roo"
  gem "minitest"
  gem "memory_profiler"
end
require "roo"
require "minitest/autorun"
require "memory_profiler"
class BugTest < Minitest::Test
  def test_stuff
    # add your test here
    report = MemoryProfiler.report do
      sheet = Roo::Spreadsheet.open("/Users/gturner/Downloads/abm_codes.xlsx")
      sheet.each_row_streaming do |row|
        row[1]
      end
    end
    report.pretty_print
  end
end
As you can see, no gain in allocated memory, and very little in retained.
Yeah I'm trying to work on generating one in my free time. The only file I have immediately on hand has customer data in it. @tgturner
I modified the test a bit, I think the MemoryProfiler might have issues completing since my process memory went completely out of control with it enabled. I used the code snippet from, https://stackoverflow.com/questions/7220896/get-current-ruby-process-memory-usage, to print the memory usage at certain times. I modified the
This is what I got when running with the file above.
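For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of printing the process's resident memory at chosen points. It is not the exact snippet used above: the helper names rss_kb and report_memory are made up for illustration, and it assumes either a Linux /proc filesystem or a ps command that accepts -o rss=.

```ruby
# Hypothetical helper: resident set size of the current process, in KB.
# Prefers /proc on Linux and falls back to `ps` elsewhere (an assumption
# about the environment, not something taken from the PR).
def rss_kb
  status = "/proc/#{Process.pid}/status"
  if File.readable?(status)
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

# Hypothetical helper: print a labelled memory reading.
def report_memory(label)
  puts "#{label}: #{rss_kb} KB RSS"
end

report_memory("before")
rows = Array.new(100_000) { |i| "row #{i}" } # stand-in for reading a sheet
report_memory("after building #{rows.size} strings")
```

Calling report_memory before opening the spreadsheet, after extract_hyperlinks, and after streaming the rows would give a rough picture without the overhead of a full allocation profiler.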
It's weird in that other very large excel files did not trigger this memory issue, so I'm not sure how to fully explain this. However, it seems that using |
@will89 The results do look good, but I'm not sure xpath is what's causing the issue. Could you test another variant of extract_hyperlinks that uses xpath just like master but does not create an intermediate array while building the Hash? So instead of the Hash[hyperlinks.map{... [key, value]}] pattern to create the hash, use a hyperlinks.each{........ hsh[key] = value} pattern (as used in your patch). We can be sure whether xpath is problematic or not if we compare the results of all 3 variants.
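To make the two hash-building patterns concrete, here is a small sketch on plain Ruby data. HYPERLINKS and RELATIONSHIPS are hypothetical stand-ins for the parsed nodes and relationship lookups, not Roo's real structures; the point is only that the first form allocates an array of [key, value] pairs before building the Hash, while the second fills the Hash directly.

```ruby
# Hypothetical stand-ins for hyperlink nodes and the relationship table.
HYPERLINKS = [
  { ref: "A1", id: "rId1" },
  { ref: "B2", id: "rId2" }
].freeze
RELATIONSHIPS = {
  "rId1" => "https://example.com/a",
  "rId2" => "https://example.com/b"
}.freeze

# Pattern 1: map builds an intermediate array of pairs, then Hash[] copies it.
def links_with_intermediate_array(hyperlinks, relationships)
  Hash[hyperlinks.map { |h| [h[:ref], relationships[h[:id]]] }]
end

# Pattern 2: insert into the hash as we go, with no intermediate array.
def links_without_intermediate_array(hyperlinks, relationships)
  hsh = {}
  hyperlinks.each { |h| hsh[h[:ref]] = relationships[h[:id]] }
  hsh
end

a = links_with_intermediate_array(HYPERLINKS, RELATIONSHIPS)
b = links_without_intermediate_array(HYPERLINKS, RELATIONSHIPS)
puts a == b
```

Both return the same Hash, so any memory difference between them isolates the cost of the intermediate array from the cost of xpath itself.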
Modified
Got
@will89 Thank you for posting the memory results. I've also verified this and observed similar results, but this new implementation is terribly slow (3x slower). I personally do like this solution, but I can't merge it in its current state given the performance slowdown. I've also reviewed the sample xlsx file you provided. If this file is similar to the file you are using (no hyperlinks in the file, or you don't care about hyperlinks in your application), then I suggest you use the no_hyperlinks option for now:
I'm not closing this PR, just in case you want to improve it with regard to performance as well. I'll also look into improving the performance of the Roo::Utils.each_element implementation.
Thanks for pointing out that option! Is there more info on what it does? Is there an easy way to determine if I rely on hyperlinks in my application (like roo methods that are called that use them)? In regards to this branch, is the branch 3x slower or was just |
Sample xlsx file which includes a hyperlink: https://github.com/roo-rb/roo/blob/master/test/files/link.xlsx When roo finds a hyperlinked cell, it wraps the value of the cell in Roo::Link, which is a subclass of String. The Roo::Link class implements 3 methods: url, href, and to_uri. You can check your application for usage of these 3 methods in the context of a cell's value, or see whether your application checks for a value being a Roo::Link. If it doesn't, it will be safe to use the no_hyperlinks option. Benchmark Script:
I think you were right that there were some performance issues in Roo::Utils.each_element. I think the way I called it was generating a new array for each node element in the tree. I modified it, 903dc65, to pull the array generation outside of the loop. Could you please try the benchmark with the latest version of this branch?
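As a rough illustration of that kind of change (one plausible reading, not the actual diff in 903dc65), the sketch below wraps the wanted element name in an array either on every node or once before the loop. The method names and NODE_NAMES are hypothetical; the real Roo::Utils.each_element walks an XML reader rather than an array of strings.

```ruby
# Hypothetical stand-in for the names of nodes seen while streaming a sheet.
NODE_NAMES = %w[row c v hyperlink c v hyperlink].freeze

# Wraps `wanted` on every iteration: when `wanted` is a String,
# Array(wanted) allocates a fresh one-element array per node.
def each_match_wrapping_inside(node_names, wanted)
  node_names.each do |name|
    yield name if Array(wanted).include?(name)
  end
end

# Wraps `wanted` once, before the loop, and reuses the same array.
def each_match_wrapping_outside(node_names, wanted)
  wanted = Array(wanted)
  node_names.each do |name|
    yield name if wanted.include?(name)
  end
end

inside = []
each_match_wrapping_inside(NODE_NAMES, "hyperlink") { |n| inside << n }
outside = []
each_match_wrapping_outside(NODE_NAMES, "hyperlink") { |n| outside << n }
puts inside == outside
```

The two versions yield the same nodes; on a sheet with millions of nodes the second simply avoids millions of short-lived arrays.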
The approximate benchmark result for extract_hyperlinks when I originally tested was ~14 sec for master and ~42 sec for your original implementation. Allowing for noise, your original implementation is 2.8-3.0x slower than master. When I was benchmarking this, I also tried tweaking Roo::Utils.each_element a bit, including wrapping elements in an array outside of the block. Even though I don't remember the result, it was still too slow.
Interesting, your computer must be significantly better than mine :)
gem "roo", github: 'will89/roo', branch: 'issue/extract-hyperlinks-mem-consumption'
gem "roo", github: 'roo-rb/roo', branch: 'master'
When you tested this branch, did it include this commit, 5283e01? I think that commit brought in a lot of the performance improvements you recently made to this gem, so thanks for doing that!
Ah, I see now. I think I was looking at the wrong times. I was comparing the Finished from the test case instead of the output from the Benchmark.measure. 45 master vs 59 this branch. |
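For reference, a minimal sketch of timing only the code of interest with Benchmark.measure from the standard library, so that test-runner setup and teardown (which the "Finished in" line includes) stay out of the number. simulated_work is a hypothetical stand-in for the extract_hyperlinks call.

```ruby
require "benchmark"

# Hypothetical stand-in for the code under test.
def simulated_work
  (1..200_000).reduce(0) { |sum, i| sum + i }
end

# Benchmark.measure returns a Benchmark::Tms with wall-clock (real)
# and CPU (utime, stime, total) timings for just the block.
timing = Benchmark.measure { simulated_work }
puts "real: #{timing.real.round(4)}s, cpu total: #{timing.total.round(4)}s"
```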
Do you know if attempting to use this interface from nokogiri, |
I've not looked into it yet. If you have a benchmark and memory comparison for both approaches, that would be a great place to start.
Sorry for leaving this neglected for so long. I have been unable to alter Is the modification to |
@will89 let's close this PR now. Since there is still scope for improvement, you can open a new PR focusing on improving each_row_streaming's memory allocation and performance. A few of the things you can do in the new PR are as follows:
If all of the above together make a good performance and memory usage improvement, we could go ahead with it. I suggest benchmarking and profiling the each_row_streaming method without the no_hyperlinks option.
@will89 let me know if you're interested (or not) in opening another PR
@chopraanmol1 Yeah, I'll try to make a follow on PR to address these things.
Revisiting roo-rb#436 . This Patch uses relationships data to determine if a sheet includes hyperlinks or not. As extract_hyperlinks loads the whole document in memory it is quite problematic for each_row_streaming. This patch tries to skip extract_hyperlinks when not required.
I ran into this issue, #179, when working with an Excel file that had 180k rows and the xml for the sheet file was about 386MB on disk. The issue seemed to have narrowed it down to doc.xpath forcefully loading the whole xml into memory. This changes extract_hyperlinks to use Roo::Utils.each_element in order to stream the xml file.
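The difference between the two approaches can be sketched with REXML, which ships with Ruby. This is a stand-in chosen so the example runs without extra gems; Roo itself uses Nokogiri, and none of the names below come from Roo. The sample XML is heavily simplified (real sheets use namespaces and an r:id attribute). The DOM version parses the whole document before querying it, while the stream version only reacts to hyperlink start tags as they go by.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Simplified, hypothetical sheet XML (no namespaces, plain id attribute).
SHEET_XML = <<~XML
  <worksheet>
    <sheetData>
      <row><c r="A1"><v>1</v></c></row>
    </sheetData>
    <hyperlinks>
      <hyperlink ref="A1" id="rId1"/>
      <hyperlink ref="B2" id="rId2"/>
    </hyperlinks>
  </worksheet>
XML

# DOM + XPath: the full tree is built in memory before the query runs.
def hyperlinks_via_dom(xml)
  doc = REXML::Document.new(xml)
  links = {}
  REXML::XPath.each(doc, "//hyperlink") do |el|
    links[el.attributes["ref"]] = el.attributes["id"]
  end
  links
end

# Streaming: the listener sees each start tag once and keeps only
# what it needs, so no tree for the whole sheet is retained.
class HyperlinkListener
  include REXML::StreamListener
  attr_reader :links

  def initialize
    @links = {}
  end

  def tag_start(name, attrs)
    return unless name == "hyperlink"
    @links[attrs["ref"]] = attrs["id"]
  end
end

def hyperlinks_via_stream(xml)
  listener = HyperlinkListener.new
  REXML::Document.parse_stream(xml, listener)
  listener.links
end

puts hyperlinks_via_dom(SHEET_XML) == hyperlinks_via_stream(SHEET_XML)
```

On a tiny document the two are indistinguishable; the memory gap described above only shows up when the sheet XML is hundreds of megabytes.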