
Requiring mime/types accounts for 25% of all application RAM #94

Closed
schneems opened this issue Apr 13, 2015 · 21 comments
@schneems
Contributor

Using a Rails application (www.codetriage.com) that uses mime/types, the app at boot uses:

Application: 53.7695 mb

Without loading mime/types it uses:

Application: 38.9 mb

That's 52.168 - 38.9 # => 13.268 mb of savings, or about 25% of all RAM usage. This is a non-trivial amount of memory to use.

I've got some causes and some ideas, but I want some more eyes and some feedback before moving forwards.

Memory Causes

As far as I can tell, there are two main culprits causing the memory use.

1) Loading a large JSON blob. Parsing a 547 KB file as JSON and converting it to a hash takes a bunch of memory. This is done in the loader. The entire JSON blob and the resultant hash cannot fit in the existing heap, so Ruby must malloc more memory; unfortunately, Ruby never frees that memory back to the OS once it has been allocated. If you are running a large Rails app this isn't a concern, since those empty Ruby object slots will eventually be used; if you're running a really small service, however, this is a non-trivial cost. While we could optimize this, it likely won't have an impact for most applications.

2) Lots of large objects retained. The mime/types gem proactively generates and retains 1800+ objects, each holding quite a bit of data. I'm pretty sure this is where the bulk of the memory problems come from. Since we never release references to unused MIME types, we never get this memory back. The list of types is only going to get longer, yet the number of types actually used on a given system is dramatically smaller than the default set.

While it's currently possible to export and use a custom cache, the process isn't easy and most developers don't know the capability exists. For reference, the existing mechanism looks roughly like the sketch below.
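A minimal sketch, assuming the RUBY_MIME_TYPES_CACHE environment variable that mime-types (2.x) consults for a marshalled cache of the registry:

# Point the loader at a cache file: the first load writes it,
# subsequent loads read it back instead of re-parsing the data files.
ENV['RUBY_MIME_TYPES_CACHE'] = File.expand_path('tmp/mime-types.cache')
require 'mime/types'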

Potential Solutions to #2

Note: anything we do here will make loading mime/types slower; we're effectively trading speed for RAM savings, so we need to benchmark any solution thoroughly.

We don't necessarily have to make any of these trade-offs the default; we could offer them behind a flag. Ideally, though, we'd find a middle ground that is fast enough and keeps the total RAM impact under 5~10% (a number I made up).

Any option that lazily creates or evaluates MIME::Types could be enhanced by encouraging other libraries to explicitly declare the common types they expect to use.

Option Lazy load from JSON hash) Don't coerce the default data into MIME::Type objects. These objects expand the data stored in the default cache considerably and are very heavy. Instead, we could keep the parsed JSON hash in memory and scan it to lazily generate MIME::Type objects, so we only create what we need. The first time a MIME type is needed, it is coerced and retained so we never have to search for it again.

Viability: speed impact: minimal; decreased RAM impact: medium. Depending on common access patterns, and how we store and search the data, this could be fast. We would still be keeping a bunch of data in memory that we'll never use, but that's cheaper than what we're currently doing. We would retain duplicate info in two places, though we could either delete entries from the source data as they're coerced, or the duplication may be inconsequential. A minimal sketch of this option follows.
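A minimal sketch, assuming the parsed JSON is an array of hashes shaped like the loader's data; LazyTypeIndex is a made-up name, and MIME::Type.new(hash) is used the same way it appears elsewhere in this thread:

require 'json'
require 'mime/types'

# Raw hashes stay cheap; MIME::Type objects are only built (and memoized)
# for the types an application actually touches.
class LazyTypeIndex
  def initialize(json_path)
    @raw   = JSON.parse(File.read(json_path)) # array of plain hashes
    @types = {}                               # content type => MIME::Type
  end

  def [](content_type)
    @types[content_type] ||= begin
      hash = @raw.find { |h| h['content-type'] == content_type }
      hash && MIME::Type.new(hash)
    end
  end
end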

Option Distributed file store) We could get really fancy and create a ton of small files, each named for how it is accessed, so a lookup simply checks whether that file exists and reads its contents. If there are multiple common access patterns, we could have different directories with different file names that redirect or refer to another file containing the full info.

Viability: speed impact: depends; decreased RAM impact: large (good). We would literally only hold the objects we need in memory, so RAM use would be as close to minimal as possible. Reading from disk is really, really slow, though, so speed would probably suffer in most cases except those needing only one or two MIME types. Data scans would be prohibitively expensive here, so we might need to keep a stash of JSON data around on disk lest we access and read from 1800+ files. A sketch of the idea is below.
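To make the idea concrete, a sketch assuming one JSON file per simplified type under a hypothetical data/types/ tree (the path-encoding scheme is invented):

require 'json'
require 'mime/types'

# e.g. data/types/application%atom+xml.json holds every record whose
# simplified form is application/atom+xml.
def find_type(simplified)
  path = File.join('data/types', simplified.tr('/', '%') + '.json')
  return unless File.exist?(path)
  JSON.parse(File.read(path)).map { |hash| MIME::Type.new(hash) }
end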

Option Lazy JSON file) We lazily create each and every MIME::Type by loading the JSON file and searching for the entry we want whenever it isn't already in memory.

Viability: speed impact: large (bad); decreased RAM impact: large (good). This might not be so bad for some cases; for others it would be a world of hurt. It would help RAM more than the first option of storing the JSON in memory, but it gives us no random-access capability: every lookup requires a scan.

Option X) Hopefully there are other options I've not yet considered. Maybe we could mark the MIME::Types that are being used and provide some kind of MIME::Type.clean to undefine or remove references to types not in use? We would have to do that in conjunction with lazy loading, lest we be forced to load the whole dataset back into memory when a new MIME type gets referenced. Maybe we could use some binary data blob or store. It would be sweet to have a sqlite3 table to query against, but that would add undue complexity and dependencies to the project. There's no clear winner yet; go crazy and recommend something.

Next Steps

I'm interested in any concerns you, as the library maintainer, have with any or all of these plans. You know how people commonly use this gem, and maybe you could provide a top 5 (or whatever number) of use cases. I'm also interested in alternative solutions. If we can figure out one that makes sense to try, I'll be happy to work on a spike implementation so we can benchmark speed and memory use. Hopefully you're interested. Let me know what you think 😄

@sunnyrjuneja

👍 on lazy load from JSON hash. I don't know much about who the most popular users of the mime-types gem are, but I suspect that eagerly loading the top 10% of MIME types would cover the majority of use cases and dramatically reduce the memory footprint.

@halostatue
Member

👎 on lazy loading the types from the JSON file; that's not an option, because of the slowness.

Memory use is of paramount importance—see #83 by @jeremyevans (also mikel/mail#829) for a similar report. I have no information to indicate what the most commonly used types are, so I don’t want to predictively load anything.

I have a slightly different option (mostly because working from the JSON hash would require a fairly substantial change to MIME::Types and how it loads; currently it knows nothing about its source data—and that’s mostly a good thing, although it has a bug in #79), and I believe that the distributed file approach would be almost as bad as lazy loading from a JSON file.

What I'm leaning toward, and have been a little too busy to investigate, is essentially "mime-types-lite". Consider the canonical representation in YAML:

- !ruby/object:MIME::Type
  content-type: application/atom+xml
  friendly:
    en: Atom Syndication Format
  encoding: 8bit
  extensions:
  - atom
  references:
  - IANA
  - RFC4287
  - RFC5023
  - ! '{application/atom+xml=http://www.iana.org/assignments/media-types/application/atom+xml}'
  xrefs: !ruby/hash:MIME::Types::Container
    rfc:
    - rfc4287
    - rfc5023
    template:
    - application/atom+xml
  registered: true

What most users care about—based on some scanning of codebases on GitHub—is the content type, the extensions, and maybe the encoding. All of the other data is useful, but not for all applications. There’s a bunch of things that I’ll be removing in mime-types 3.0 because they’ve been deprecated for a while, and I have at least two issues (#45 and #67) asking for more information or at least a different organization of the same data for different purposes. I’m also increasingly convinced that the simplified type (and the sort based on that, per rest-client/rest-client#248) is probably a mistake.

In the short term, #64 looks like it may offer a substantial reduction in duplicated text (e.g., application would only be allocated once). Would something based on that be a good start, @schneems?
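As a rough illustration of why that deduplication helps (assuming #64 relies on frozen-literal interning, which MRI 2.1+ performs for .freeze'd string literals):

a = 'application'.freeze
b = 'application'.freeze
a.equal?(b) # => true: one String object, no matter how many types use it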

@schneems
Contributor Author

Thanks for all the feedback. The file store seems to be out of the question. I've played around more with some refactorings and I want to share my experiments, their results, and a suggestion.

JSON Parser

We can save memory by switching to Yajl as the JSON parser: parsing with Yajl uses about 3mb, compared to 6mb with the stdlib JSON.

require 'json'
require 'get_process_mem'
require 'yajl'

file_name = "/Users/richardschneeman/Documents/projects/mime-types/data/mime-types.json"

GC.start(full_mark: true, immediate_sweep: true)
before = GetProcessMem.new.mb
# array = JSON.parse(File.open(file_name, 'r:UTF-8:-') { |f| f.read })
array = Yajl::Parser.new.parse(File.new(file_name, 'r'))

GC.start(full_mark: true, immediate_sweep: true)
after = GetProcessMem.new.mb
puts "MEM Difference: #{after - before}"

Unfortunately, without doing anything else we'll see no savings; the act of creating and retaining MIME::Type objects pretty much guarantees a major GC and memory-allocation phase anyway.

Smaller Object Footprint

I experimented with a minimum viable object to store the data in. The idea is that once we actually need a type, we could promote it or do something else with it.

# Bare-bones value object holding the same fields as a MIME::Type.
class MicroMime
  attr_accessor :content_type, :encoding, :references,
                :xrefs, :registered, :extensions, :obsolete,
                :use_instead, :friendly, :signature,
                :system, :docs

  def initialize(hash)
    @content_type = hash["content-type".freeze]
    @encoding     = hash["encoding".freeze]
    @references   = hash["references".freeze]
    @xrefs        = hash["xrefs".freeze]
    @registered   = hash["registered".freeze]
    @extensions   = hash["extensions".freeze]
    @obsolete     = hash["obsolete".freeze]
    @use_instead  = hash["use-instead".freeze]
    @friendly     = hash["friendly".freeze]
    @signature    = hash["signature".freeze]
    @system       = hash["system".freeze]
    @docs         = hash["docs".freeze]
  end
end

Compared to the current MIME::Type, this saves us a good bit of memory.

%w[yajl get_process_mem].each { |lib| require lib }
ENV['RUBY_MIME_TYPES_LAZY_LOAD'] = "true"
require 'mime/types'

json   = File.read("data/mime-types.json")
before = GetProcessMem.new.mb
# array = Yajl::Parser.new.parse(json).map { |element| MicroMime.new(element) }
# array = Yajl::Parser.new.parse(json).map { |element| MIME::Type.new(element) }
GC.start(full_mark: true, immediate_sweep: true)
puts "memory use: #{GetProcessMem.new.mb - before} mb"

The results are pretty stark:

MIME::Type memory use: 10.6953125 mb
MicroMime memory use: 3.2421875 mb

Maybe we could investigate why the MIME::Type object is so large and slim it down, but that would take more time and expertise with the project than I've got. I already looked at it and nothing huge jumped out at me. It seems like minimizing the creation of these objects would serve our purposes equally well.

Truly lazy load

I've got a branch of code that stores the raw JSON from the file (manipulated slightly to make searching by content type easier). We can re-implement [], type_for, each, and count so that when each method is called, it checks whether the MIME::Type is already loaded and stored; if not, it searches the hash we built from the JSON blob for an entry that fits the requirements. If it finds one, it creates a MIME::Type object, adds it to the main container, and re-runs the original operation.

While I think this will eventually work, it feels a bit kludgy to me.

It's slow on the first call and fast on all the others. Here's the memory difference on startup:

array = Yajl::Parser.new.parse(json)
lazy = MIME::Types::Lazy.new(array)

# => memory use: 5.203125 mb

Here we're actually using more memory than creating and retaining the MicroMime objects. If I'm to continue down this path, I would switch to backing the lazy container with Ruby objects similar to the MicroMime ones above.

Suggestion

Use Yajl: it's faster and has a smaller memory footprint, and without it, even if we get savings somewhere else, we might lose them to the JSON parser. I also think we should switch to lazy creation of MIME::Types backed by a store of small Ruby objects (such as MicroMime).

Right now the logic for type_for is already slow (i.e. it already searches through objects); we could move it over to support MicroMime fairly easily and have it search through both. The [] method is harder, since it's expected to be fast. We could re-organize the data into a hash whose keys are the MIME::Type.simplified form of the content type and whose values are arrays of all matching MicroMime objects. The remaining methods, each and count, would be trivial to support. If we re-organize the data to match its most common access pattern, I think we can get this almost as fast, if not as fast, as the current version, with a significant savings in memory. A minimal sketch is below.
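A minimal sketch of that index (the variable name is made up; MicroMime is the class from my earlier comment):

require 'yajl'
require 'mime/types'

# Key by simplified content type so [] stays a hash lookup, not a scan.
# Values are arrays of cheap MicroMime objects, promoted on demand.
index = Hash.new { |h, k| h[k] = [] }

Yajl::Parser.new.parse(File.new('data/mime-types.json', 'r')).each do |hash|
  index[MIME::Type.simplified(hash['content-type'])] << MicroMime.new(hash)
end

index['application/atom+xml'] # => [#<MicroMime ...>]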

There may be other places to save memory, but going forward this looks like the sanest plan. Let me know what you think or if you have any questions.

@halostatue
Member

Thanks for pursuing this further. You’ve confirmed my suspicion that moving to a lighter-weight load of data is going to have a substantial impact, so I’m going to start designing in that direction for mime-types 3—which is where any change of this magnitude has to go.

Unfortunately, Yajl is almost certainly a non-starter, because mime-types has to work on all Rubies conforming to 1.9.2 or higher, and Yajl is a C binding (which excludes at least JRuby). This means I could either support MultiJSON (nope), support multiple JSON libraries in a way similar to MultiJSON (probably not), or provide some sort of configuration mechanism for passing in a JSON parser that works the way I use it. (Purely academically, it would be interesting to compare Yajl against Oj.) I am leaning toward the third option, but I have to think about it.

It also doesn't feel right, because mime-types is currently a no-dependency gem for its users. (I do not mind adding dependencies for people who are developing mime-types itself.)

There's one other idea, suggested back at RubyConf, that's a bit radical: what if the default mime-types registry were generated as Ruby files? It'd be a pretty substantial change…but not hard to test. The entire memory cost would be paid at load time, and while the total number of objects may not be lower, a few of the tricks you've pointed out in an earlier PR would make it fairly easy to keep the footprint down. If, in fact, a MIME::Type::Lite (or whatever it gets called) is what gets written out, and JSON parsing is only used when more data is desired (that is not, as far as I can tell, the average use case, but there are users who want more data)…this could be interesting.

@sunnyrjuneja

@halostatue Just curious, why is MultiJSON not an option?

@halostatue
Member

Main reason? Because mime-types is currently a no-dependencies-required gem, and I want to keep it that way. There's nothing about mime-types that should require anything beyond what is included with Ruby by default (at least for a sane Ruby like MRI or JRuby). Enabling advanced usage is one thing; requiring it by default for a library like mime-types is the wrong thing to do.

Beyond that? I don’t like MultiJSON as a user. The only reason it will ever show up in a project that I’m involved with is if a library that I use has chosen not to make a choice on a library; it will never be primary in the Gemfile. I don’t want to discourage people from using it, but I won’t use it by choice, and I’ll try to get it removed any time I can do so.

@sunnyrjuneja

@halostatue Sorry to bother you, but why don't you like MultiJSON as a user? I know you're probably busy, so if you're not interested in explaining, feel free to leave the question unanswered.

@halostatue
Member

I’m planning on being at RubyConf this fall; if you’re there, we can have that discussion. I’d like to keep this discussion focussed on the performance improvements suggested by @schneems—and how to keep them within the goals and constraints I have for mime-types as a library.

@halostatue
Member

I was wrong about where the suggestion of generating Ruby came from: it was made by @postmodern in #85, and it is the only reason that ticket is still open. It also suggests that adding unstated dependencies (apparently Rubinius does not include JSON in its standard library, though every other Ruby does) is potentially problematic.

@schneems
Contributor Author

@halostatue I tried extracting the storage of the MIME types into pure Ruby files, and weirdness happened. I pulled the experiments into a runnable benchmark: https://github.com/schneems/require_memory_size_benchmarks. It looks like minimizing memory would require splitting the requires into different files, and it ends up not saving us any memory versus what we're currently doing.

Note the "json" method in that example uses yajl which even if we can't use it here, has so far been my "best case" and I think we should shoot for near that memory footprint.

While I was at RailsConf, @jeremyevans recommended checking out SDBM, which is in the stdlib, as a way of storing and loading the data. I played around with it; below are some benchmarks and thoughts, please take a look.

Option) Store the info in an SDBM database and create objects as needed

The SDBM library (require 'sdbm') ships with the C Ruby standard library. It lets you store key-value string pairs on disk, keeping the data in one file and an index in another. Instead of trying to shrink the objects we retain, we could retain no objects at all and load from disk. While this sounds similar to the distributed file store above, it benefits from a proven, optimized implementation, and we don't have to write or maintain new disk-access code. The SDBM API is nice to work with and appears to be very efficient.

I would propose storing two SDBM databases: one keyed by simplified content types, the other keyed by extensions.

I used a script like this to prepare the two databases; pros and cons of the approach are listed below.

require 'sdbm'
require 'json'

file_name = File.expand_path("../data/mime-types.json", __FILE__)

file  = File.open(file_name, 'r:UTF-8:-') { |f| f.read }
array = JSON.parse(file)

require 'mime/types'

SDBM.open 'data/content-types' do |db|
  db.clear
  array.each do |hash|
    simplified = MIME::Type.simplified(hash["content-type"])
    db[simplified] ||= "[]"
    previous = JSON.parse(db[simplified])
    previous << hash
    db[simplified] = previous.to_json
  end
end

SDBM.open 'data/extensions' do |db|
  db.clear
  array.each do |hash|
    next unless hash["extensions"]
    hash["extensions"].each do |extension|
      db[extension] ||= "[]"
      previous = JSON.parse(db[extension])
      previous << MIME::Type.simplified(hash["content-type"])
      db[extension] = previous.to_json
    end
  end
end

Pros

  • Fast execution. We can lazy-load from the SDBM database on disk extremely quickly; i.e. when you call MIME::Types[] we do the lookup, return a new MIME::Type, and it's still quite quick. We could speed this up by caching the needed types in memory, but based on initial benchmarks it's fast enough (TM). I benchmarked this on an SSD; I'll need to do more benchmarks on a spinning disk, where it will be slower but hopefully still fast enough.

Here's an example of how I was doing lookups

@content_type_database = SDBM.open("data/content-types")
@extensions_database   = SDBM.open("data/extensions")

def find_by_content_type(type)
  array_string = @content_type_database[type] || "[]"
  JSON.parse(array_string).map {|hash|  MIME::Type.new(hash) }
end

def find_extension(ext)
  if types = @extensions_database[ext]
    JSON.parse(types).flat_map do |type|
      find_by_content_type(type)
    end
  end
end

The benchmarks are promising

require 'benchmark/ips'

Benchmark.ips do |bm|
  bm.report("find extension") { find_extension("html".freeze) }
  bm.report("find by type") { find_by_content_type("application/applefile".freeze) }
end

# Calculating -------------------------------------
#     find extension   4.195k i/100ms
#     find by type     8.332k i/100ms
# -------------------------------------------------
#     find extension  43.075k (±14.6%) i/s - 213.945k
#     find by type    86.857k (±13.1%) i/s - 433.264k
  • Fast boot: we remove the need to load and parse a JSON file every time you require 'mime/types'.
  • Low memory footprint: we only load the data we need, retaining nothing other than references to the SDBM files.

Cons

  • Large disk-space usage. The two SDBM databases I created were 22 mb and 35 mb on disk. IMHO this isn't a big deal: disk space is really cheap, and I would gladly trade 50mb of disk space for 20mb of RAM and a faster boot. It is a downside, though, and it will push anyone already close to a disk limit over it. On Heroku, "slug size" is a concern (a slug is a snapshot of your app that must be downloaded to a new machine every time a dyno boots), but slugs are compressed, and compressed these files are only around 304kb, so this shouldn't affect Heroku customers. It may affect other hosting environments.
  • SDBM appears to have a length limit on values. We can't just use it wholesale; we have to be conscious of its limitations.
  • Adding MIME types at runtime is difficult. When someone calls MIME::Types.add(), we want the new type to show up in the result set we search through. The simplest, most naive approach is to add it to the database and forget about it. That causes problems, because we'd be sharing the same database file across all apps on the machine; it would also let someone add a MIME type via #add, later remove that call, and have their code keep working locally but fail in production. That would be bad.

To work around this, we could copy the database to a tmp location on the first #add() call. That doubles the disk-space requirement but keeps the code very simple; it also makes the first call to add() much slower. I don't know if any OS features could help here; essentially I want copy-on-write behavior, but for files on disk. Alternatively, we could write a wrapper that stores these extra types separately from the database and have every call search through both. That minimizes disk space but makes the code much more difficult and error-prone; speed should be roughly the same.

require 'tmpdir'
require 'fileutils'
require 'benchmark/ips'

Benchmark.ips do |bm|
  bm.report("copy") do
    tmpdir = Dir.mktmpdir
    FileUtils.cp(%W{ data/extensions.dir data/extensions.pag data/content-types.dir data/content-types.pag }, tmpdir)
  end
end
# Calculating -------------------------------------
#  copy  1.000 i/100ms
#-------------------------------------------------
#  copy 10.254 (± 9.8%) i/s - 51.000

The copy technique is slow: it takes between 0.09 and 0.125 seconds. We'll earn a little of that back by not having to parse JSON and create objects, but not nearly all of it. Maybe there's a better or more efficient way to do the copying; at least it only has to happen once, no matter how many times add is called afterwards.

  • Work will be required to get this functioning across multiple Rubies. SDBM is in the MRI stdlib, and I manually tested with MRI 1.9.3 and 2.2.2; however, it doesn't look like SDBM is implemented in Rubinius or JRuby (at least not in the two versions I tried on my machine). If this is the only blocker, i.e. the approach otherwise looks really good going forward, we could potentially work around it.

Fin

Anywhoo, that's what I've been playing around with, any thoughts?

Also thanks to Jeremy for the recommendation, I didn't know about SDBM.

@jeremyevans
Contributor

@schneems Did you have a chance to benchmark how much this will slow down something like MIME::Types[/json/]? That currently has to iterate over the whole database, and I'm guessing it would be quite slow, especially without an SSD.

To speed that up, it may be advantageous to keep a file with one simplified MIME type per line. We could scan that file for all simplified types matching the regexp, then map over the matches to look up each one in the database. That should significantly decrease IO for non-pathological regexps (i.e. anything other than //), and I'm guessing this code would be IO-bound. We'd have to benchmark to see how much it helps. A rough sketch is below.
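A sketch of the line-scan idea (the file name is hypothetical; find_by_content_type is the lookup helper from the earlier comment):

# One simplified type per line, same keys as the content-types database.
def types_matching(regexp)
  matches = []
  File.foreach('data/simplified-types.txt') do |line|
    line.chomp!
    matches.concat(find_by_content_type(line)) if line =~ regexp
  end
  matches
end

types_matching(/json/) # scans one small text file, then hits SDBM per match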

I guess I never checked whether SDBM was Ruby code or a C extension, and figured Ruby. I'm surprised Rubinius and JRuby don't implement it, because they implement most of the rest of the stdlib, but maybe they just never got a request for it, since so few people use or know about SDBM. However, since mime-types should work on Rubinius and JRuby, we'd have to create an SDBM-compatible reader in pure Ruby and ship it with mime-types, which, while possible, would probably be a lot of work.

@halostatue The other idea @schneems and I discussed at RailsConf was a columnar storage approach. Basically, have the main mime-types file load MIME::Type objects containing just the type name, and have every other accessor load the file containing the data for that column. That way, if you only need, say, the type name and whether the type is binary (as the mail gem does), you only have to load two files, which will probably take a lot less memory. As far as I know, that approach has only been discussed; nobody has written code to validate it yet. What are your thoughts on it?

@halostatue halostatue self-assigned this Apr 25, 2015
@halostatue halostatue added this to the 3.0 milestone Apr 25, 2015
@halostatue
Member

@schneems & @jeremyevans: This leaves me with a lot to chew on. After I release mime-types 2.5, I'm going to start a 3.x branch where I can figure out how the data can be sorted out between minimal, current, and expanded. The first step is to get rid of the deprecated methods and data; that will probably cut memory usage by up to a third (we currently duplicate some data between references and xrefs). Then I can start identifying what we need now and what we don't, and maybe store that in a packed format that can be handled with Array#pack and String#unpack. I may even end up going back to something similar to the regexp parser I had for MIME::Types 1.x for the lightweight data, falling back to JSON or YAML for the heavier data.

My first guess at the required data is content type/subtype, extensions, and two binary flags (un/registered and text/binary). The extensions are required for anyone interacting with MIME::Types through CarrierWave or another uploader (including Restify). A toy illustration of such a packed format is below.
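A toy illustration of the pack/unpack idea (the field widths and flag layout here are invented, not a proposed format):

# One fixed-width record: 64-byte type, 16-byte primary extension, 1 flag byte.
record = ['application/atom+xml', 'atom', 0b11].pack('Z64Z16C')

type, ext, flags = record.unpack('Z64Z16C')
registered = flags & 0b01 != 0 # un/registered flag
binary     = flags & 0b10 != 0 # text/binary flag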

I think that 3.x is going to be a breaking change with some (lots of?) incompatibility in methods (to the point where I may even work on this in a mime-types-ng repo, with a decision on release to be made later); as I mentioned in an earlier comment, the simplified format is sort of giving me problems, and I need to completely change how #priority_sort works (that's going to be really important to figure out, because it's the piece that tells you the preferred types for given objects).

I'm uncomfortable with the SDBM approach because, while it will do better on MRI, a pure-Ruby SDBM reader would probably drag the performance of JRuby and Rubinius down substantially. I don't know enough about columnar stores to be able to implement one myself…and there are ~2k MIME types involved.

I’m also going to shift my approach to mime-types development as soon as I release 2.5.

@halostatue
Member

(And, BTW, I am jealous of the two of you having been at RailsConf. Hope to see you both at RubyConf in the fall—and I’m trying to figure out something to possibly propose a talk about this. Maybe this. Maybe one of several things I am doing at work.)

@jeremyevans
Contributor

I'll see if I can work up a proof of concept of my columnar store idea, and see whether it works (passes all tests) and provides enough memory savings to make the approach worthwhile.

I'm not sure I'll be able to make RubyConf this year, but I'll try.

@halostatue
Member

Sounds great. Just as a warning, I'm releasing 2.5 tonight (if at all possible), and then I'm going to move mime-types development to mime-types/ruby-mime-types, keeping this repo as a fork so people don't completely wonder where it has gone. I'm updating the documentation in the gem right now.

@schneems
Contributor Author

@jeremyevans It takes roughly 0.01 seconds to iterate through all the entries and search via regex:

@content_types_database.each.select { |k, _| k =~ regex }
                       .flat_map { |_, v| JSON.parse(v).map { |hash| MIME::Type.new(hash) } }

Or about 55 iterations per second

@halostatue I look forward to seeing you in San Antonio. If you're interested in getting a set of eyes on a title/abstract, or you want someone to bounce ideas off, I've got some experience and would be happy to help; you can shoot me an email at [email protected].

I think SDBM might be a non-starter unless easy and performant JRuby and Rubinius options come out of the woodwork.

Thanks to both of you for your time and responses!

@jeremyevans
Contributor

Here's my work in progress diff: http://pastie.org/pastes/10112762/text

Basic approach: use a plain-text file to store the content type and extensions for each MIME type in the JSON file, one line per type. Have supplementary data files for each separate MIME type attribute (currently implemented: friendly and encoding). The supplementary data files also use one line per type, keeping the same order as the base data file.

If an attribute getter is called on a MIME type instance and the attribute value has not been loaded yet, we load that attribute's value for all MIME types, then return the getter's value. We iterate over the file's lines using each_line to avoid loading the entire file into memory. A rough sketch of the idea is below.
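A minimal sketch of the lazy column load, not the actual diff (the file layout and the types array are assumptions):

# types: instances ordered exactly like the rows of the base data file;
# data/friendly is a hypothetical column file with one value per line.
def load_column(types, attribute)
  row = 0
  File.open("data/#{attribute}") do |f|
    f.each_line do |line| # streams line by line; the file is never fully in RAM
      types[row].instance_variable_set(:"@#{attribute}", line.chomp)
      row += 1
    end
  end
end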

The parse_mime_json script included with the diff parses the JSON file into the plain-text data files. The check_mime_json script checks for behavior differences. There is currently one MIME type where the behavior differs: text/directory - DEPRECATED by RFC6350. That looks like an existing bug to me, as it doesn't appear to be a real MIME type.

Memory difference:

Ruby by itself: 11752KB RSS
Current behavior with JSON file: 33972KB RSS
With diff, loading just the default data: 13816KB RSS
With diff, also loading the friendly and encoding data: 13920KB RSS

So that's about a 10x improvement on base memory use. Additionally, the time to load mime-types has also been reduced, by about 3.8x: from 0.38 seconds to 0.1 seconds on my machine.

I haven't actually started running the tests with this yet, since I need to handle the rest of the attributes first. But I'd like your thoughts before I continue further down this path.

I haven't added mutex locking around loading the supplementary data files to ensure thread safety, but I definitely plan to do that if this looks like a good approach.

@halostatue @schneems Your thoughts on this approach?

@halostatue
Member

If you rebase what you're doing against master (mime-types/ruby-mime-types now; update your remotes!) you should see this improved (I improved my parser to dump that crap into a new xrefs:notes field), as I just released 2.5.

Do note that the JSON file is not the source data; that's the YAML files in type-lists/. So if we do the columnar approach (which feels pretty good), we want to make it easy for the generator to work from those. I'm probably going to pull the source files out of the main repo and either submodule them or find another way to handle the data separately, because I've written an Io library that uses the same JSON data, and it turns out that someone wrote a Python library that uses the old 1.x text format (I may flex some Python muscles to get it working with the new data, although I'll probably skip right to the 3.x format we're talking about now).

@jeremyevans
Contributor

Unfortunately, I'm not seeing a significant difference after rebasing onto the current master branch; my initial checkout was only 4-5 hours old.

Since you feel pretty good about the columnar approach, I'll try to work tomorrow on filling out the remaining attributes, changing the data-file creation to use the YAML files instead of the JSON ones, adding rake tasks for creating the data files, fixing the remaining issues I discussed, and making sure all of the tests pass.

@jeremyevans
Contributor

I've added pull request #96, which implements the columnar storage idea. I believe it is backwards-compatible, and it offers significant memory savings (10x best case, 2x worst case).

@halostatue halostatue modified the milestone: 3.0 Apr 26, 2015
jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Jun 8, 2015
== 2.6.1 / 2015-05-25

* Bugs:
  * Make columnar store handle all supported extensions, not just the first.
  * Avoid circular require when using the columnar store.

== 2.6 / 2015-05-25

* New Feature:
  * Columnar data storage for the MIME::Types registry, contributed by Jeremy
    Evans (@jeremyevans). Reduces default memory use substantially (the mail
    gem drops from 19 MiB to about 3 MiB). Resolves
    {#96}[mime-types/ruby-mime-types#96],
    {#94}[mime-types/ruby-mime-types#94],
    {#83}[mime-types/ruby-mime-types#83]. Partially
    addresses {#64}[mime-types/ruby-mime-types#64]
    and {#62}[mime-types/ruby-mime-types#62].
* Development:
  * Removed caching of deprecation messages in preparation for mime-types 3.0.
    Now, deprecated methods will always warn their deprecation instead of only
    warning once.
  * Added a logger for deprecation messages.
  * Renamed <tt>lib/mime.rb</tt> to <tt>lib/mime/deprecations.rb</tt> to not
    conflict with the {mime}[https://rubygems.org/gems/mime] gem on behalf of
    the maintainers of the {Praxis Framework}[http://praxis-framework.io/].
    Provided by Josep M. Blanquer (@blanquer),
    {#100}[mime-types/ruby-mime-types#100].
  * Added the columnar data conversion tool, also provided by Jeremy Evans.
* Documentation:
  * Improved documentation and ensured that all deprecated methods are marked
    as such in the documentation.
* Development:
  * Added more Ruby variants to Travis CI.
  * Silenced deprecation messages for internal tools. Noisy deprecations are
    noisy, but that's the point.

== 2.5 / 2015-04-25

* Bugs:
  * David Genord (@albus522) fixed a bug in loading the MIME::Types cache where a
    container loaded from cache did not have the expected +default_proc+,
    {#86}[mime-types/ruby-mime-types#86].
  * Richard Schneeman (@schneems) provided a patch that substantially reduces
    unnecessary allocations.
* Documentation:
  * Tibor Szolár (@flexik) fixed a typo in the README,
    {#82}[mime-types/ruby-mime-types#82]
  * Fixed {#80}[mime-types/ruby-mime-types#80],
    clarifying the relationship of MIME::Type#content_type and
    MIME::Type#simplified, with Ken Ip (@kenips).
* Development:
  * Juanito Fatas (@JuanitoFatas) enabled container mode on Travis CI,
    {#87}[mime-types/ruby-mime-types#87].
* Moved development to a mime-types organization under
  {mime-types/ruby-mime-types}[https://github.com/mime-types/ruby-mime-types].
@SamSaffron commented Dec 13, 2016

FWIW, all that the mail gem needs, and all that rest-client needs, is a very minimal subset of MIME types, so I am making this:

https://github.com/discourse/mini_mime/blob/master/lib/mini_mime.rb

and will open PRs for the mail gem and rest-client to swap it in.

It is 100% lazy loaded, with binary search; a rough sketch of the idea is below.

cc @jeremyevans @schneems
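The binary-search-over-a-flat-file idea, roughly (the row width and file layout here are guesses, not mini_mime's actual format):

ROW_SIZE = 100 # fixed-width rows, sorted by key

def lookup(path, key)
  File.open(path) do |f|
    lo, hi = 0, (f.size / ROW_SIZE) - 1
    while lo <= hi
      mid = (lo + hi) / 2
      f.seek(mid * ROW_SIZE)            # jump straight to one row
      row = f.read(ROW_SIZE)
      case key <=> row.split(' ', 2).first
      when 0  then return row.strip     # only the matched row is materialized
      when -1 then hi = mid - 1
      else         lo = mid + 1
      end
    end
  end
end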
