Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to skip allocation for empty cell. #464

Merged
merged 1 commit into from
Dec 3, 2018

Conversation

chopraanmol1
Copy link
Member

Some excelx client (not sure which) creates a badly formatted file. I've attached one such file which when processed by roo will use a huge amount of memory.

Incase if you open & save such file in most of excelx client will fix the bad formated part (which can be observed with change in file size from ~5MB to ~37KB).
Since in production environment we can not add a constraint on excelx client used to generate the file, some badly formatted file will crash the server/process.

This changeset skips creation of empty cell and skips assignment of the key, value pair for such empty cell in the hash.

Benchmark result for the attached file.

Master:

Total allocated: 962056061 bytes (6385017 objects)
Total retained:  152051928 bytes (2114980 objects)

allocated memory by gem
-----------------------------------
 543193224  rubyzip-1.2.2
 263086610  roo/lib
 155773522  nokogiri-1.8.5
      1465  tmpdir
       880  racc
       320  weakref
        40  other

retained memory by gem
-----------------------------------
 152049144  roo/lib
      2000  nokogiri-1.8.5
       336  rubyzip-1.2.2
       208  tmpdir
       160  weakref
        80  racc

 20.050000   0.060000  20.110000 ( 20.127946)

After Patch:

Total allocated: 853271373 bytes (5337387 objects)
Total retained:  1362040 bytes (19720 objects)

allocated memory by gem
-----------------------------------
 543193224  rubyzip-1.2.2
 155773522  nokogiri-1.8.5
 154301922  roo/lib
      1465  tmpdir
       880  racc
       320  weakref
        40  other

retained memory by gem
-----------------------------------
   1359256  roo/lib
      2000  nokogiri-1.8.5
       336  rubyzip-1.2.2
       208  tmpdir
       160  weakref
        80  racc

  8.210000   0.060000   8.270000 (  8.276100)

Need to discuss following things before we can go ahead with this:

  1. What should be default value for newly added option (In long run default approach should be to not allocate empty cell). We need to determine if anyone depends upon allocation of empty cell which may break by default skipping option.
  2. Should this option be limited to non-streaming method. Since in streaming method objects (particulary empty cells and coordinate) can be GC'd it is not much of problem on current master (coordinates where meoimized in 2.7.1 so no GC). On side note each_row_streaming has pad_cells option which remove any incompatibility which could be introduced if we extend this option to streaming method.

bad-formated-file.xlsx

@chopraanmol1 chopraanmol1 force-pushed the excelx_skip_emty_cell branch from 25702db to cfc2ffb Compare October 9, 2018 06:42
@chopraanmol1 chopraanmol1 requested a review from tgturner October 9, 2018 06:42
@chopraanmol1
Copy link
Member Author

Once the discussion of mention point is completed, I'll fix and add a relevant test case.

This option skips creation of empty cell and assignment of key, value pair for such empty cell in hash.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 94.105% when pulling d68f002 on chopraanmol1:excelx_skip_emty_cell into 8f65109 on roo-rb:master.

@chopraanmol1 chopraanmol1 merged commit f463503 into roo-rb:master Dec 3, 2018
@chopraanmol1
Copy link
Member Author

NOTE: option empty_cells is not extended to streaming methods and is default'd to false (i.e. Roo won't allocate Roo::Excelx::Cell::Empty)

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Jan 20, 2019
pkgsrc change: add "USE_LANGUAGES= # none".

##  [2.8.0] 2019-01-18
### Fixed
- Fixed inconsistent column length for CSV [375](roo-rb/roo#375)
- Fixed formatted_value with `%` for Excelx [416](roo-rb/roo#416)
- Improved Memory consumption and performance [434](roo-rb/roo#434) [449](roo-rb/roo#449) [454](roo-rb/roo#454) [456](roo-rb/roo#456) [458](roo-rb/roo#458) [462](roo-rb/roo#462) [466](roo-rb/roo#466)
- Accept both Transitional and Strict Type for Excelx's worksheets [441](roo-rb/roo#441)
- Fixed ruby warnings [442](roo-rb/roo#442) [476](roo-rb/roo#476)
- Restore support for URL as file identifier for CSV [462](roo-rb/roo#462)
- Fixed missing location for Excelx's links [482](roo-rb/roo#482)

### Changed / Added
- Drop support for ruby 2.2.x and lower
- Updated rubyzip version for fixing security issue. Now minimal version is 1.2.1
- Roo::Excelx::Coordinate now inherits Array [458](roo-rb/roo#458)
- Improved Roo::HeaderRowNotFoundError exception's message [461](roo-rb/roo#461)
- Added `empty_cell` option which by default disable allocation for Roo::Excelx::Cell::Empty [464](roo-rb/roo#464)
- Added support for variable number of decimals for Excelx's formatted_value [387](roo-rb/roo#387)
- Added `disable_html_injection` option to disable html injection for shared string in `Roo::Excelx` [392](roo-rb/roo#392)
- Added image extraction for Excelx [414](roo-rb/roo#414) [397](roo-rb/roo#397)
- Added support for `1e6` as scientific notation for Excelx [433](roo-rb/roo#433)
- Added support for Integer as 0 based index for Excelx's `sheet_for` [455](roo-rb/roo#455)
- Extended `no_hyperlinks` option for non streaming Excelx methods [459](roo-rb/roo#459)
- Added `empty_cell` option to disable Roo::Excelx::Cell::Empty allocation for Excelx [464](roo-rb/roo#464)
- Added support for Integer with leading zero for Roo:Excelx [479](roo-rb/roo#479)
- Refactored Excelx code [453](roo-rb/roo#453) [477](roo-rb/roo#477) [483](roo-rb/roo#483) [484](roo-rb/roo#484)

### Deprecations
- Roo::Excelx::Sheet#present_cells is deprecated [454](roo-rb/roo#454)
- Roo::Utils.split_coordinate is deprecated [458](roo-rb/roo#458)
- Roo::Excelx::Cell::Base#link is deprecated [457](roo-rb/roo#457)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants