This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

adds lockfile based locking strategy #82

Open

wants to merge 5 commits into master
Conversation

akleber
Contributor

@akleber akleber commented May 23, 2016

This PR adds an additional lockfile-based locking mechanism. To enable it, "LockingStrategy" has to be set to "File" in the cache config file; a command line parameter to set the locking mechanism is not provided yet. In contrast to an earlier implementation that was removed some time ago, this implementation does not rely on a specific file being present or absent, but instead uses a win32 API to explicitly lock the file. Because the mechanism is based on a file handle of the actual lock file, the lock should be released even if the process dies unexpectedly.
This PR is still rough around the edges, but I wanted to provide a first look.
My reason for implementing this is that I want to share the cache among multiple build machines via an SMB share.
I ran some benchmarks (unfortunately not with the new test harness) on a 12-core machine with an SSD and a 1 Gbit network. All numbers are averages over 4 runs. The cache contains 1740 objects and is 1.5 GB in size.
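The handle-based locking described above could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the PR uses a win32 API, and this shim falls back to `fcntl` on POSIX only so the idea can be demonstrated portably. All class and function names here are hypothetical.

```python
import os

if os.name == "nt":
    import msvcrt

    def lock_handle(f):
        # Lock one byte at the start of the file; Windows releases the
        # lock when the handle is closed, even if the process crashes.
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)

    def unlock_handle(f):
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_handle(f):
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

    def unlock_handle(f):
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)

class CacheLock:
    """Lock whose lifetime is tied to an open file handle."""

    def __init__(self, path):
        self._path = path
        self._file = None

    def acquire(self):
        # 'a' creates the lock file if missing without truncating it.
        f = open(self._path, "a")
        try:
            lock_handle(f)
        except OSError:
            f.close()
            return False
        self._file = f
        return True

    def release(self):
        unlock_handle(self._file)
        self._file.close()
        self._file = None
```

The key property is that the lock is owned by the OS-level file handle, not by the file's existence, so an abnormal process exit cannot leave a stale lock behind.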

Lockfile, local cache, cold cache: 480s
Mutex, local cache, cold cache: 482s

Lockfile, local cache, warm cache: 270s
Mutex, local cache, warm cache: 272s

Lockfile, cache on smb, cold cache: 525s
Lockfile, cache on smb, warm cache: 361s

So it seems the performance of the lockfile compared to the mutex is not significantly different; I attribute the differences to measurement noise.
Placing the cache on an SMB share, however, carries a noticeable penalty.
I am going to test this implementation on a build cluster with a "real world" workload, see how it performs, and report back.

try:
    lockfile = open(self._lockfileName, 'a')
except OSError:
    lockfile = None
Owner

@frerich frerich May 23, 2016


I think this assignment is unneeded: _acquire is only called if self.lockfile == None holds, which is still True in case open raises an exception. I'd say you can just return here. Doing so would permit dropping the explicit else branch and thus lowering the indentation.
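The simplification suggested here could look like the following. This is a minimal hypothetical sketch built around the snippet's names (`_lockfileName`, `lockfile`); everything else about the class is assumed, not taken from the PR.

```python
class CacheLock:
    """Minimal sketch illustrating the early-return suggestion."""

    def __init__(self, lockfileName):
        self._lockfileName = lockfileName
        self.lockfile = None

    def _acquire(self):
        # _acquire is only ever called while self.lockfile is None, so
        # on failure we can simply return instead of assigning None.
        # That drops the explicit else branch and one indentation level.
        try:
            lockfile = open(self._lockfileName, 'a')
        except OSError:
            return
        self.lockfile = lockfile
```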

@frerich
Owner

frerich commented May 23, 2016

Great PR, it looks very good! The idea to not rely on the sheer existence of a file for the locking but rather to lock file regions (and then rely on those regions being released automatically when the file handle is closed) is really nice; that didn't occur to me. It nicely takes care of the main gripe we had with file-based locks (namely, not cleaning up in case of abnormal exits). I really only have a few stylistic concerns here.

I'd like to hear what you think about my comments, but in general I'm ready to merge this.

@webmaster128
Contributor

I guess this non-trivial change justifies an entry in the copyright paragraph, right? What do you think @frerich?

Does this file-based locking require any specific file system? Or can I do that on a FAT or FTP mount? Could you extend the README to tell people everything they need to know about file locking?

Regarding configuration: everything except the cache size is configured via environment variables, and my feeling is that it would be better to move the cache size there too. Since we're passing arguments through to cl, I think we should keep clcache's parameter parsing logic as minimal as possible. So for now I'd suggest configuring the locking method via an environment variable. Having everything in variables allows us to migrate later to a more general configuration approach as seen in ccache. What do you think, @frerich?

elif self._timeoutMs >= 0 and time.time() - start_time > self._timeoutMs/1000:
    raise ObjectCacheLockException("Timeout waiting for file lock")
else:
    poll_intervall = random.uniform(0.01, 0.1)
Contributor


camelCase; also, is this really an interval? I think pollDelay would be a better name, indicating that it stores the delay before the next poll – a single float number.
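Assembled into a complete loop, the timeout-and-poll logic under review might read as follows. This is a sketch, not the PR's code: `tryLock` is a stand-in callable, and the exception class is reduced to a bare definition.

```python
import random
import time

class ObjectCacheLockException(Exception):
    pass

def acquireWithTimeout(tryLock, timeoutMs):
    """Poll tryLock() until it succeeds or timeoutMs elapses.

    tryLock is a callable returning True on success. The randomized
    delay spreads out retries so that competing processes do not poll
    in lock-step. A negative timeoutMs means "wait forever".
    """
    startTime = time.time()
    while True:
        if tryLock():
            return
        if timeoutMs >= 0 and time.time() - startTime > timeoutMs / 1000:
            raise ObjectCacheLockException("Timeout waiting for file lock")
        # A single delay value before the next poll (cf. the naming
        # comment above: a delay, not an interval).
        time.sleep(random.uniform(0.01, 0.1))
```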

@frerich
Owner

frerich commented May 24, 2016

@webmaster128 Whether some configuration knob is done as part of the cache configuration or via environment variables is - to me - mostly a question of whether it's something you typically turn on and off for certain projects and/or builds (in which case I'd go for an environment variable), or whether it's a property of the cache itself, independent of any particular build (e.g. the maximum cache size), in which case I'd go for a cache configuration switch.

To me, @akleber 's decision seems perfectly sensible, but alas defective given the above-mentioned chicken-and-egg problem.

As for the copyright notice -- if @akleber doesn't mind getting some of the bl^H^Hfame, I'd be happy to add him to the list of contributors.

@webmaster128
Contributor

webmaster128 commented May 24, 2016

+1 for making this the default behavior (override via environment variable?) as soon as we have the file system requirements on the table.

@akleber
Contributor Author

akleber commented May 25, 2016

Thanks for the thorough review so far.

Regarding the config location: my thinking was that the type of locking is specific to the cache, so every process that wants to use a particular cache has to use the appropriate locking mechanism.
However, I absolutely see the chicken-and-egg problem.
I think changing the locking mechanism configuration to an environment variable is fine, insofar as whoever sets up a shared cache should also be able to properly configure the locking setting on the participating clcache processes.
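Selecting the locking strategy from an environment variable could be sketched as below. Note that the variable name CLCACHE_LOCKING_STRATEGY and the returned strategy labels are assumptions for illustration, not the PR's actual identifiers.

```python
import os

def chooseLockingStrategy(environ=os.environ):
    # Hypothetical sketch: read the locking strategy from the
    # environment, defaulting to the existing mutex-based locking.
    # The variable name and labels are assumed, not the PR's code.
    strategy = environ.get("CLCACHE_LOCKING_STRATEGY", "Mutex")
    if strategy == "File":
        return "FileLock"
    return "MutexLock"
```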

@frerich I am fine with being added to the list of contributors.

I am also adding the relevant information about the lockfile requirements and usage to the README.

* simplifies code
* locking strategy as env variable
* documentation
@webmaster128
Contributor

I was especially thinking of the copyright header in the clcache.py file. I think this should be extended with every non-trivial change to that file.

@akleber
Contributor Author

akleber commented Jun 3, 2016

I updated this PR with changes addressing the reviews. However, I also observed a race condition when using a network share for the cache, so please do NOT merge this PR yet. I am still investigating...

@@ -186,7 +186,7 @@ def acquire(self):
 elif time.time() - startTime > self._timeoutMs/1000:
     raise ObjectCacheLockException("Timeout waiting for file lock")
 else:
-    pollDelay = random.uniform(0.01, 0.1)
+    pollDelay = random.uniform(0.1, 1.0)
Owner


I'm curious: what does this fix? SMB connections don't like polling too often for some reason?

@frerich
Copy link
Owner

frerich commented Jun 17, 2016

@akleber ping? Any new insights regarding that race condition when using a network share? I saw that you bumped the retry delay when polling SMB shares; did that help with anything? Do you understand why?

@akleber
Contributor Author

akleber commented Jul 5, 2016

I currently do not have much time to work on this, but hopefully I will soon. Unfortunately I have not yet been able to fix the race condition when using the lockfile on an SMB share, and I do not yet understand what the problem might be. Using the lockfile on a local filesystem works fine, however, as far as I can see.

@frerich
Owner

frerich commented Jul 5, 2016

Thanks for the update, @akleber ! It's a pity that the cause of the race condition is still unknown, because non-local file systems were the prime reason for using a file-based locking mechanism in the first place. I guess it doesn't hurt to keep this PR around; I suspect it's "almost there".
