automatic recompilation of stale cache files #12458

Merged
merged 1 commit into from
Aug 6, 2015

Conversation

stevengj
Member

@stevengj stevengj commented Aug 4, 2015

This fixes #12259 by automatically recompiling a stale cache file whenever you require a module (e.g. with a using statement) and a stale cache file is found. Currently, staleness is judged by timestamps; other mechanisms like checksums are left to future PRs.

For example, here is a typical session after I had done Base.compile(:PyPlot) and then updated a file in Compat:

julia> using PyPlot
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/Compat.ji for module Compat.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/LaTeXStrings.ji for module LaTeXStrings.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/PyPlot.ji for module PyPlot.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/PyCall.ji for module PyCall.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/Color.ji for module Color.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/FixedPointNumbers.ji for module FixedPointNumbers.

This patch supersedes #12445.

Like #12445, it adds a list of the dependency files to the .ji file, with a new include_dependency(path) function to manually supply non-include dependencies for a module.
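As a rough illustration of how a package might call include_dependency, the sketch below registers a data file that a module reads at load time. The module name SettingsDemo and the file settings.txt are hypothetical, and the syntax is that of current Julia rather than 0.4.

```julia
# Hypothetical module whose contents depend on a non-Julia file.
module SettingsDemo

# Hypothetical data file located next to this source file.
const SETTINGS_PATH = joinpath(@__DIR__, "settings.txt")

# Declare the file as a dependency so that a change to it marks the
# module's cache as stale. Outside of compilation the call has no effect.
# The isfile guard only keeps this sketch runnable when the file is absent.
isfile(SETTINGS_PATH) && include_dependency(SETTINGS_PATH)

# Read the file if present; an empty string stands in when it is missing.
const SETTINGS = isfile(SETTINGS_PATH) ? read(SETTINGS_PATH, String) : ""

end # module
```

Files pulled in with include are tracked automatically, so a call like this should only be needed for files the module reads by other means.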

(Also as discussed in #12445, I changed the serialization metadata to use big-endian rather than little-endian storage, for ease of reading back in via ntoh.)
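To show why big-endian storage pairs naturally with ntoh, here is a minimal sketch of writing and reading an integer in network byte order. The helper names write_be and read_be are made up for this example and are not part of the PR.

```julia
# Write an integer in big-endian (network) byte order, whatever the host order.
write_be(io::IO, x::Integer) = write(io, hton(x))

# Read an integer of type T that was stored in big-endian order.
read_be(io::IO, ::Type{T}) where {T<:Integer} = ntoh(read(io, T))

buf = IOBuffer()
write_be(buf, UInt32(0x01020304))
bytes = take!(buf)                        # most significant byte comes first
value = read_be(IOBuffer(bytes), UInt32)  # round-trips to the original value
```

Because hton and ntoh are no-ops on big-endian hosts and byte swaps on little-endian ones, the stored bytes are the same on every platform.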

Once #12448 lands, it will be easy to add other information about the dependencies (e.g. checksums) if that is judged necessary in the future, but I think it is better to develop that incrementally. (Merging a hash/checksum algorithm is a nontrivial task that is best left to a separate PR.)

Note that you still have to manually Base.compile a module at least once—this PR only automates the recompilation of the image. A separate PR can implement a @cacheable tag (or whatever) to allow modules to opt-in for automatic initial compilation.

@stevengj stevengj added the compiler:precompilation Precompilation of modules label Aug 4, 2015
@jakebolewski
Member

Looks great. Could we restrict the verbose info output to interactive sessions only?

@stevengj
Member Author

stevengj commented Aug 4, 2015

@jakebolewski, sure I guess that seems reasonable.

However, I think that may end up suppressing some of the info output even in "interactive" sessions, because some of the compiles are triggered within Base.compile, which executes in a separate non-interactive process. (e.g. the compilation of FixedPointNumbers above is triggered by the compilation of Color.) I take that back, all the info calls should be from the main process. Yep, this happens e.g. whenever recompilation is triggered by an updated file in the module, because then it calls Base.compile before requiring any other modules.

...Okay, what I can do is to have info output only if isinteractive() || 0 != ccall(:jl_generating_output, Cint, ()), which enables it during Base.compile sessions too.
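The condition above could be wrapped roughly as follows. This is only a sketch: the helper names should_report and report_recompile are hypothetical, and the message format simply mirrors the session output shown earlier.

```julia
# True in an interactive session, or while Julia is generating output
# (for example while writing a cache file).
should_report() = isinteractive() || 0 != ccall(:jl_generating_output, Cint, ())

# Hypothetical helper: print the recompilation notice only when appropriate.
function report_recompile(cachefile::AbstractString, mod::Symbol; io::IO=stderr)
    if should_report()
        println(io, "INFO: Recompiling stale cache file ", cachefile,
                " for module ", mod, ".")
    end
    return nothing
end
```

A non-interactive script that merely loads a module would then stay quiet, while both the REPL and a compile run would still print the notice.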

@johnmyleswhite
Member

So glad this looks to be ready for 0.4. Thank you, @stevengj

@KristofferC
Member

This is game changing!

@timholy
Member

timholy commented Aug 4, 2015

Yes, this really puts the icing on the cake. 0.4 is awesome.

@vtjnash
Member

vtjnash commented Aug 5, 2015

AppVeyor got confused (assigned two jobs the same number). AFAICT the build cannot be restarted from here, so you'll have to force-push a new SHA1 hash if you want to trigger a new build.

(also Travis is having issues with their Mac cluster, so those are backed up right now too)

@stevengj
Member Author

stevengj commented Aug 5, 2015

@vtjnash, I already pushed a couple of patches since the initial AppVeyor failure, and the same failure happened again.

@vtjnash
Member

vtjnash commented Aug 5, 2015

I had to manually increment the AppVeyor "next build" number to un-stick it. Builds should be working again now, however.

@ViralBShah
Member

We have all green now!

@ViralBShah ViralBShah added this to the 0.4.0 milestone Aug 5, 2015
@timholy
Member

timholy commented Aug 5, 2015

LGTM

@stevengj
Member Author

stevengj commented Aug 5, 2015

I'll merge by the end of the day if there are no objections.

// are include dependencies
void jl_serialize_dependency_list(ios_t *s)
{
size_t total_size = 0;
Contributor

indentation again

@stevengj
Member Author

stevengj commented Aug 5, 2015

(rebased to squash commits, merge with #12448, fix indentation)

stevengj added a commit that referenced this pull request Aug 6, 2015
automatic recompilation of stale cache files
@stevengj stevengj merged commit 55cfad9 into JuliaLang:master Aug 6, 2015
@stevengj stevengj deleted the ji_rebuild branch August 6, 2015 02:58
@timholy
Member

timholy commented Aug 6, 2015

Hooray!!

used via `include`. It has no effect outside of compilation.
```
"""
include_dependency
Contributor

Newly added functions do not appear to get correctly populated by doc/genstdlib.jl. I think you need to add an opening signature for this somewhere in the RST docs for it to work?

Member Author

I don't really understand the new doc system for the manual. @one-more-minute, can you clarify what we are supposed to do? I'd really rather just put the documentation inline in loading.jl.

Member

Putting the docs inline should work fine. As with the rest of the docstrings, genstdlib.jl will look for .. function:: include_dependency(...) and splice the doc string it finds there, so if that doesn't exist already it needs to be added (or there wouldn't be any way to know where the doc string should go).

When I have more free time I'm going to put up a bunch of examples of how to work with this, which will hopefully make it clearer.

@tkelman
Contributor

tkelman commented Aug 11, 2015

I'm already seeing timestamps cause issues here, editing packages and not seeing the corresponding .ji files get recompiled correctly. As I mentioned elsewhere I work over scp a lot editing files on remote machines, and the time stamps are not always getting preserved right. I can potentially cope by finding the right settings to tweak on every editor and ftp client multiplied by every machine I use, but this is not a very robust solution.

@StefanKarpinski
Member

I have to wonder if there's some hybrid hashing/timestamp solution to this.

@tkelman
Contributor

tkelman commented Aug 11, 2015

What would that look like? Check time stamp first, then check hash if equal? Doesn't seem like that would be all that beneficial over checking just the hashes.

Maybe opt in to "I work across multiple machines therefore need this to be more careful" somehow? Is there some way to make this user-configurable? I'm trying to override the recompile_stale function to a pessimistic always-recompile version (just as a start, for the sake of testing) via Base.recompile_stale(mod, cachefile) = Base.create_expr_cache(Base.find_in_path(string(mod)), cachefile) but that doesn't seem to do the job.

end
modules, files = cache_dependencies(io)
for f in files
if mtime(f) > cachefile_mtime
Member

Maybe instead of checking whether mtime(f) > cachefile_mtime this should be mtime(f) != cachefile_mtime to assure that the cache is in sync with the file content. Otherwise if you change machines and one of your clocks is skewed a file change might happen "in the past".

Contributor

cache_dependencies returns all of the included files, and compiling a .ji doesn't modify the mtime for each of those included files, so I suspect this would result in recompiling every time.
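The difference between the two comparisons can be seen in a small sketch that works on plain numbers instead of real files. The function names and the timestamps are invented for illustration.

```julia
# Rule used in the PR: stale if any dependency is newer than the cache file.
stale_if_newer(dep_mtimes, cache_mtime) = any(t -> t > cache_mtime, dep_mtimes)

# Proposed variant: stale if any dependency's mtime differs from the cache's.
stale_if_different(dep_mtimes, cache_mtime) = any(t -> t != cache_mtime, dep_mtimes)

# Sources last edited at t=100 and t=200, cache written afterwards at t=300.
dep_mtimes = [100.0, 200.0]
cache_mtime = 300.0

newer = stale_if_newer(dep_mtimes, cache_mtime)          # cache is up to date
different = stale_if_different(dep_mtimes, cache_mtime)  # fresh cache flagged as stale
```

Since the source files almost never share the cache file's own timestamp, the inequality test would report staleness on every load unless the per-file mtimes are stored separately.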

Member

I see what you mean. Somehow I took it that we store all those mtimes and if one of them changes we should recompile. We are clearly not doing that, sorry for the noise.

Contributor

With #12559, now we do store all the mtimes.

@pao
Member

pao commented Aug 11, 2015

> What would that look like? Check time stamp first, then check hash if equal? Doesn't seem like that would be all that beneficial over checking just the hashes.

"env.Decider('MD5-timestamp'): as of SCons 0.98, you can set the Decider function on an environment. MD5-timestamp says if the timestamp matches, don't bother re-MD5ing the file. This can give huge speedups. See the man page for info." (quoting https://bitbucket.org/scons/scons/wiki/GoFastButton)
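A timestamp-then-checksum decider in the spirit of the SCons option quoted above might look like the sketch below. It is not SCons code: the FileStamp type and function names are hypothetical, and CRC32c from Julia's standard library stands in for MD5 only because it is readily available.

```julia
using CRC32c  # standard-library checksum, used here only as a stand-in for MD5

# What gets recorded about a file when the cache is built (hypothetical type).
struct FileStamp
    mtime::Float64
    checksum::UInt32
end

# Record the current timestamp and content checksum of a file.
stamp(path::AbstractString) = FileStamp(mtime(path), open(crc32c, path))

# Timestamp-then-checksum decision: a matching timestamp skips hashing;
# otherwise the content checksum decides whether the file really changed.
function has_changed(path::AbstractString, old::FileStamp)
    mtime(path) == old.mtime && return false
    return open(crc32c, path) != old.checksum
end
```

The trade-off is that a file whose content changed while its timestamp stayed identical goes unnoticed; in exchange, the common unchanged case costs only a stat call.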

@stevengj
Member Author

The discussion in #12259 seemed to indicate that a checksum would not be too expensive. If you want one, the steps would be:

  • Pick a checksum algorithm and merge a decent implementation of it into base
  • Modify this dependency code to store the checksum in the .ji file and to check it when determining staleness
    • to ensure that the file is not edited between including the file and computing the checksum, you might want to modify include file to compute the checksum as it goes along. This would be most convenient with a C checksum implementation

I would check the timestamp before checking the checksum, mainly because it gives a little extra protection against the unlikely event of checksum collisions yielding false negatives. (I'm assuming we would use a checksum optimized for speed here, like CRC32, rather than a cryptographically secure checksum.) As @pao says, we can also store the timestamp in the file and check whether it matches, not just whether it is > mtime, as an optimization (although this only helps in the case where the file is stale, which is slow anyway).

@stevengj
Member Author

We could store the timestamps in the .ji file and then check whether they match, as @pao implied, rather than whether they are > mtime. That would (a) protect against most clock-skew problems, (b) protect against the case where you replace a module file with an older file, which also requires recompilation, and (c) be super-easy and quick to implement.
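A minimal sketch of that equality rule follows, assuming the recorded timestamps are available as a dictionary from path to mtime. The function name is_stale and the keyword current are hypothetical; current is injectable only so the rule can be tried without touching the disk.

```julia
# `recorded` maps each dependency path to the mtime seen at compile time.
# The cache is stale as soon as any current mtime differs from its record.
function is_stale(recorded::Dict{String,Float64}; current=mtime)
    return any(current(path) != t for (path, t) in recorded)
end

recorded = Dict("Compat.jl" => 1000.0)

unchanged      = is_stale(recorded; current = _ -> 1000.0)  # same mtime as recorded
replaced_older = is_stale(recorded; current = _ -> 900.0)   # older file swapped in
edited         = is_stale(recorded; current = _ -> 1100.0)  # file edited later
```

Unlike a greater-than comparison against the cache file's own mtime, this also catches the case where an older file replaces a newer one.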

@pao
Member

pao commented Aug 11, 2015

CRC32 is great for bit error detection, but I'm not sure about possibly large intentional changes (this site notes that crc32("plumless") == crc32("buckeroo")). I wouldn't go further than git on this though and figure SHA1 is the upper bound on complexity.

On a reasonably large scons-built project, I've never run into an MD5 hash collision causing a miscompile.

@vchuravy
Member

Instead of adding a hashing algorithm to base we could also use git_odb_hashfile [1] from libgit2.

[1] https://libgit2.github.com/libgit2/#HEAD/group/odb/git_odb_hashfile

@ScottPJones
Contributor

Remember, this is going to be done every time a module is loaded.
What worked very well, in a very high volume environment, in my past, was:

  1. check length
  2. check timestamp ==
  3. check major/minor versions
  4. check CRC-32

This isn't meant to prevent intentional changes, it is meant to keep an optimization from giving incorrect results. Doing MD5 or SHA starts cutting into the benefit of the optimization you were trying to get in the first place.
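The cheapest-first ordering described above can be sketched as a chain of short-circuiting comparisons. Everything here is an assumption made for illustration: the Fingerprint type, the use of a VersionNumber for the version check, and CRC32c from the standard library in place of CRC-32.

```julia
using CRC32c  # standard-library CRC32c, standing in for CRC-32

# Hypothetical fingerprint stored alongside a cached artifact.
struct Fingerprint
    size::Int64
    mtime::Float64
    version::VersionNumber
    crc::UInt32
end

# Capture the fingerprint of a file as it is right now.
fingerprint(path::AbstractString; version::VersionNumber=VERSION) =
    Fingerprint(filesize(path), mtime(path), version, open(crc32c, path))

# Cheapest checks first. `&&` short-circuits, so the file is only read for
# the CRC when length, timestamp, and version have all matched.
function is_current(path::AbstractString, fp::Fingerprint;
                    version::VersionNumber=VERSION)
    return filesize(path) == fp.size &&
           mtime(path) == fp.mtime &&
           version == fp.version &&
           open(crc32c, path) == fp.crc
end
```

With this ordering, most stale files are rejected by a stat call alone, and the full read happens only for files that already look unchanged.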

@stevengj
Member Author

@vchuravy, I agree that if we want a cryptographically secure hash, we should just use SHA1 from libgit2. @pao, of course with a cryptographically secure hash, a collision is astronomically unlikely.

@tkelman
Contributor

tkelman commented Aug 11, 2015

The need for a hash is now significantly lower, I don't think it's immediately pressing.

I'm having trouble getting git_odb_hashfile or the equivalent git hash-object to give results that are consistent with sha1sum. And since part of this would need to be accessible from C, I'm not sure if requiring libjulia to be linked against libgit2 is something we absolutely want. Nettle.jl has SHA1 implementations that work well, but better to avoid bringing in a new dependency library of that size. There's a BSD-licensed C implementation of SHA1 here https://github.com/dottedmag/libsha1/blob/master/sha1.c or a public-domain one here https://github.com/minix3/minix/blob/master/common/lib/libc/hash/sha1/sha1.c that are both pretty short.

@StefanKarpinski
Member

Does git maybe hash the file including some implied header or something?
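For reference, Git's object format does prepend a header: a blob ID is the SHA-1 of the string "blob", a space, the content length in bytes, and a NUL byte, followed by the content. The sketch below reproduces the mismatch using the SHA standard library of current Julia; the function names are made up, and it assumes the file is hashed as a blob object.

```julia
using SHA  # standard-library SHA implementations

# Plain SHA-1 of the raw bytes, which is what sha1sum reports.
plain_sha1(data::Vector{UInt8}) = bytes2hex(sha1(data))

# Git blob ID: SHA-1 of "blob <length>\0" followed by the raw bytes.
function git_blob_sha1(data::Vector{UInt8})
    header = Vector{UInt8}("blob $(length(data))\0")
    return bytes2hex(sha1(vcat(header, data)))
end

empty_plain = plain_sha1(UInt8[])    # digest of zero bytes
empty_blob  = git_blob_sha1(UInt8[]) # digest of the 7-byte header "blob 0\0"
```

Even for an empty file the two digests differ, which would explain why git hash-object and sha1sum disagree.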

@ScottPJones
Contributor

@tkelman A CRC-32 in C is very fast, and not at all much code (Intel even has an assembly instruction for a CRC-32 step). It would be easy to add it to the julia core, and not link against anything else.
SHA1 is really over-engineering, and would just hurt the performance that the caching is trying to improve in the first place.

@tkelman
Contributor

tkelman commented Aug 11, 2015

We also already have murmurhash easily available from either C or Julia, worth benchmarking and thinking about hash size/collision likelihood tradeoffs there. Unless I find a way to hit any more trouble I don't think we need to worry about it for now. Will let you know if my programmatically generated .jl file ideas get anywhere.

@ScottPJones
Contributor

What are you doing with programmatically generating Julia code? Sounds interesting.
(very heavy use of code generation on a distributed system is why I've been a PITA about using a hash instead of / or with a timestamp).

@tkelman
Contributor

tkelman commented Aug 12, 2015

@ScottPJones
Contributor

Murmurhash3 would be good too. I hope Julia has more success than the Brain in taking over the world!

Successfully merging this pull request may close these issues.

static compile part 5 (automatic recompilation)