Crash getting global temp dir when restarting jobtree #27

Open
adamnovak opened this issue Feb 5, 2015 · 2 comments

@adamnovak
Contributor

When I restart one of my jobTree scripts that uses the global and local temp directories, after some of its jobs have run, I get a crash from the internal jobTree code complaining about some directories not existing.

Can the code be made to handle those directories not existing?

log.txt:    ---JOBTREE SLAVE OUTPUT LOG---
log.txt:    Traceback (most recent call last):
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt:        defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 153, in execute
log.txt:        self.target.run()
log.txt:      File "/cluster/home/anovak/hive/sgdev/mhc/targets.py", line 244, in run
log.txt:        index_dir = sonLib.bioio.getTempFile(rootDir=self.getGlobalTempDir())
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/target.py", line 103, in getGlobalTempDir
log.txt:        self.globalTempDir = self.stack.getGlobalTempDir()
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 129, in getGlobalTempDir
log.txt:        return getTempDirectory(rootDir=self.globalTempDir)
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/sonLib-1.0-py2.7.egg/sonLib/bioio.py", line 457, in getTempDirectory
log.txt:        os.mkdir(rootDir)
log.txt:    OSError: [Errno 20] Not a directory: '/cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/gTD0/tmp_Zss3uyl5X6/tmp_45OevDhWor/tmp_vxiVIbzGSw'
log.txt:    Exiting the slave because of a failed job on host ku-1-21.local
log.txt:    Due to failure we are reducing the remaining retry count of job /cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/job to 0
@joelarmstrong
Collaborator

Is this from a run that crashed over the weekend? Something went majorly wrong with the cluster, and the jobTree might not be recoverable. The directory that it's trying to make a subdirectory of is a 0-length file, which is totally screwed up. I don't see any code path that could lead to that being a file.

In my jobTree, I also had 0-length pickle files show up when it was (according to the POSIX spec) impossible for them to, so I think this isn't a jobTree error, but a fluke caused by our cluster trouble.
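
One way to confirm that kind of diagnosis is to walk up the failing path and report the first component that exists but is not a directory, along with its size. The following is only a sketch: `first_non_directory` is a hypothetical helper, not part of jobTree or sonLib.

```python
import os


def first_non_directory(path):
    """Return (component, size_in_bytes) for the shallowest existing
    component of `path` that is not a directory, or None if there is none.

    Hypothetical diagnostic helper; not part of jobTree or sonLib.
    """
    current = os.path.abspath(path)
    components = []
    # Collect the path and all of its ancestors, deepest first.
    while True:
        components.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent
    # Check from the filesystem root downwards.
    for candidate in reversed(components):
        if os.path.lexists(candidate) and not os.path.isdir(candidate):
            return candidate, os.lstat(candidate).st_size
    return None
```

Called on the path from the `OSError` above, a result with size 0 would match the "0-length file where a directory should be" observation.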

@adamnovak
Contributor Author

No, I wiped the tree and started this yesterday. It could be due to the
fact that one of the cluster nodes has a hung filesystem mount of some sort
and has been taking jobs and doing who knows what with them.

I noticed that when sonLib's getTempDirectory function tries a temp
directory name and that name is already taken, it tries to make a
subdirectory under that name instead of picking a new name in the root.
I've changed that in my copy, and jobTree can now successfully run my
target and start up my C++ code.
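
The changed behavior described above (pick a new name in the root when a name is taken) might look roughly like the sketch below. This is not the actual sonLib patch: `get_temp_directory`, `make_name`, and `max_attempts` are hypothetical names used only for illustration.

```python
import os
import random
import string


def _random_name(length=10):
    """Build a random name such as 'tmp_Zss3uyl5X6'."""
    alphabet = string.ascii_letters + string.digits
    return "tmp_" + "".join(random.choice(alphabet) for _ in range(length))


def get_temp_directory(root_dir, make_name=_random_name, max_attempts=100):
    """Create and return a new, uniquely named directory directly under root_dir.

    Hypothetical sketch, not sonLib's getTempDirectory. If a candidate name
    is already taken (by a directory or a file), try another name in the
    same root rather than descending into the existing entry.
    """
    os.makedirs(root_dir, exist_ok=True)
    for _ in range(max_attempts):
        candidate = os.path.join(root_dir, make_name())
        try:
            # mkdir fails if the name exists, so creation doubles as the check.
            os.mkdir(candidate)
        except FileExistsError:
            continue  # Name taken: pick a fresh name in the root.
        return candidate
    raise RuntimeError("could not create a temp directory in %s" % root_dir)
```

In the standard library, `tempfile.mkdtemp(prefix="tmp_", dir=root_dir)` already retries with a new name on collision, so it may be a simpler choice where it fits.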
