
SRA test creating a huge local file/cache #442

Closed
cmnbroad opened this issue Jan 20, 2016 · 12 comments

Comments

@cmnbroad
Collaborator

Running code pulled from master (up to commit 96a5571), this SRA test fills up the hard drive on my MacBook running Yosemite. The same code/build/test (ish) running on my Ubuntu VM on a Windows host leaves only a 16M file. In both cases, the tests all pass. I'm fairly sure it's not downloading that much data, since I'm behind a relatively slow T1 line and the test runs much too quickly. If I delete the ncbi folder and rerun the test, the same thing reappears:

~$ du -h ncbi
193G ncbi/public/sra
193G ncbi/public
193G ncbi

~$ ls -l ncbi/public/sra
total 403865600
-rw-r--r-- 1 cmn staff 206779187200 Jan 20 14:47 SRR822962.sra.cache

@cmnbroad
Collaborator Author

BTW, it's the first test in the data provider that causes the problem. If I run just the second test (or even all the tests in the SRA package except that first one), the cache is there but much smaller.

@droazen
Contributor

droazen commented Jan 21, 2016

@a-nikitiuk Could you have a look at this?

@a-nikitiuk
Contributor

Thanks for the information provided. We are working on a fix.

@kwrodarmer

The actual fix for this will be to supply a sparse-file implementation for HFS, which resides in the C library that will either need to be automatically downloaded or require a manual download step on the part of the user. We'll let you know when that's available.

We can quickly disable the first test, if that helps in the meantime.

@lbergelson
Member

In general, tests should not leave any files lying around. Please have tests delete any files they create. A lot of people would be unpleasantly surprised to have ~/ncbi show up in their home directory without explanation, even if it only takes up 16MB.

Is there a way for users to specify where the SRA cache goes?

@kwrodarmer

I've been considering how to address this one... It seems to me that we might want to be clearer about the requirements, especially those that are in conflict with one another.

  1. You want adequate tests. Presumably this means some real-world tests. If not, we can check in some canned data that will solve most of the problem and have the tests access them directly. Unfortunately, this will not exercise any real-world usage of the SRA.
  2. You don’t want users to be aware that SRA support is available. The tenor of much of the feedback has been that of annoyance at undesired side-effects. Presumably there are clients for whom the side-effect is desired.
  3. You don’t want us to use the network unless the user explicitly turns it on. This is in direct conflict with apparent requirement 2 above. If the user has to manually enable this feature, but is unaware that it exists, then we know it won't be enabled. Further, without it being enabled, we can't run enough real-world tests.
  4. You don’t want us to cache files that are downloaded unless the user tells us where to put the files. This falls into the same category as number 3. Generally our users will configure their environment via the SRA Toolkit tool vdb-config which gives them control over all of these things.
  5. You want us to clean up after our tests, meaning to differentiate between files cached during test or during some user operation. While this concept is obvious to anyone, the VDB cache area is shared with the SRA Toolkit, so it is not strictly clear whether modifications to this area were from a test on real-world data or some other parallel operation.
  6. Cache differentiation between tests and actual usage usually means setting up a temporary cache area to be removed later. This is a clear way to perform tests that can be easily cleaned up, but it may mean duplication of data on their disk as well as duplicate network access. Furthermore, we can't designate any testing area for other than trivial downloads without the user's involvement.
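To make point 4 concrete: the vdb-config tool can relocate the public cache to a volume the user chooses. This is a sketch only; the configuration key path below is how recent SRA Toolkit versions name it, but it may differ by version, so verify it with the interactive mode first.

```shell
# Relocate the public SRA cache (the exact key path may vary between
# SRA Toolkit versions; run `vdb-config --interactive` to confirm it).
vdb-config -s /repository/user/main/public/root=/data/sra-cache

# Print the value back to verify the setting took effect.
vdb-config /repository/user/main/public/root
```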

I haven't yet seen any evidence of a lack of robustness of SRA tests, but as I mentioned earlier we will disable tests that access real-world data if the issue is disk-space usage. As far as simultaneously providing a robust test suite that does not utilize the network or disk store, all while behaving as if we didn't exist as far as the user is concerned - these appear to be contradictory requirements.

@lbergelson
Member

@kwrodarmer I didn't mean to attack the test suite in general. We're trying to make it so that everyone can live in harmony and happiness, and fixing problems as we find them is a necessary part of that.

  1. We do indeed want adequate tests. Extensive testing is good, and this will by necessity include tests that use network access. We would like to limit network access though because it's a common source of false positive test failures due to network issues, server downtime, etc. Ideally most tests can run on canned data with only a few that have to reach out to touch a remote server. It's necessary and fine to have those tests, but it would be best if they were a) labelled so we know they're reaching out to the network, and b) network failures clearly present themselves as such.

  2. We would like users to be aware that there is SRA support. We do not want to hide this capability. It's more important to us, though, that all the other users are unaffected by the inclusion of SRA support. I don't believe there are any users who expect or want a test suite to leave 40GB files lying around on the hard drive.

  3. We don't want to be downloading and executing native code unless the users explicitly request it, and we don't want to be downloading large files unless the users explicitly request it. We don't mind fetching small files and making web requests as part of the test suite in cases where that is required to test the functionality of network components. I don't believe these things are incompatible.

  4. I don't care where SRA files are cached when a user is reading SRA as part of running a tool. The standard location for the SRA cache is the ideal place, since it conforms to users' expectations. However, running the test suite should ideally not leave any trace on the user's machine. I think the distinction between these is important: a user asking to read an SRA file expects large files to be cached on the hard drive; a user running the tests does not. I was curious about how to configure the standard caching location because, on our systems, the home directory is very limited in size and is not a viable place for caching files.

  5/6) It's totally fine to set up a temporary cache area for the tests. It's also fine to download necessary files from the internet in order to test that functionality; we have mechanisms for creating temp files/directories for just this sort of case. It's problematic if the files being downloaded are large, though. If we're having to stream 40GB files from somewhere, then we should change the tests so they use artificially small data files. It sounded like the actual download is much smaller, though, and the giant cache files are the result of a deficiency in the file-handling code?
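For what it's worth, the temp-directory pattern looks roughly like this shell sketch. The SRA_TEST_CACHE variable is hypothetical, standing in for whatever mechanism actually points the cache at a directory; the mktemp/trap pattern itself is standard.

```shell
#!/bin/sh
# Create a throwaway cache directory for the test run and guarantee it is
# removed when the script exits, even if the tests fail partway through.
TEST_CACHE=$(mktemp -d)
trap 'rm -rf "$TEST_CACHE"' EXIT

# Hypothetical variable for illustration only: the real toolkit is
# configured via vdb-config, not an environment variable of this name.
SRA_TEST_CACHE="$TEST_CACHE"
export SRA_TEST_CACHE

# ... run the test suite here ...
```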

@kwrodarmer

@lbergelson:

I think we're in agreement with having adequate tests. What about the possibility of dividing the tests into +SRA and -SRA, where the former would imply the user's agreement that they want to use SRA facilities, probably including network access, downloads of native code, establishing cache areas, etc., and the latter would exclude SRA from the testing? Or perhaps divide them into network-enabled and network-disabled?

One of the things that you seem to be struggling with is that the NGS Java code is nothing more than a language binding. Most of the real work happens in C within the shared library. To enable SRA without having previously installed this library, without network access, and/or without the ability to download and refresh SRA code might not make much sense. Incidentally, we are going to try to provide better stack traces from the C code across the Java boundary, because otherwise it becomes difficult to look into what really happens.

So taking your points in order:

  1. We can create network-on and network-off tests. The network-on will fail if we cannot locate an adequate shared library with native code, since the Java code cannot do anything useful without it. We agree that involving the network for tests can give false positives, so the network-off tests will work on small test files that are committed to the source repository.
  2. All accessions used for the network-on tests will be limited to the smallest size possible. This will limit the number of accessions that can be used, of course, but there you have it. There is also another possibility, which is to disable caching for network tests on Macintosh, because they are the only platform that does not support sparse files. Our tests generally access only a slice of data, but until we fix the problem with HFS and sparse files, any access creates an image with the entire size of the original file. This is the problem you had, but it does not occur with Linux or Windows.
  3. Downloading native code is fun and useful, since the code that actually accesses SRA is in there. Therefore, anyone wanting to test SRA at all should go through the exercise of installing VDB and the SRA Toolkit beforehand, running vdb-config to set up where they want files cached, network access, proxies, etc. Then, our Java code will still have to be allowed to search in standard locations for this shared library. The loading process will attempt to load the library in a separate process (because the JVM cannot unload libraries once loaded) in order to verify that the library is in fact recent enough to support the ABI. This is where the "Error: Could not find or load main class" message was coming from.
  4. The user will care whether caching is turned on/off, and where files are stored. This is normally configured external to the tool using NGS, but we intend to extend the API to allow tools to manage configuration directly. Again, the large file appearing on a drive comes from HFS, since normal file systems support sparse files. We will address this in a future shared library. We don't actually consume very much real space unless an entire run is accessed.
  5. We will establish a separate cache area for the network-on tests (or simply disable caching entirely). The cache is not required for operation - it is just useful for cases where there is repetitive and often random access. But we can run entirely over network if necessary.
  6. Sorry - should've put this in the last note - we only access a minimal amount of data from NCBI, in 128K chunks which are stored in sparse files. These are the files named SRA123456.sra.cache; they have a bitmap stored at the end. On file systems that support sparse files, this takes up only the sectors required to hold the 128K chunks plus a bitmap covering the size of the whole file, so the actual files are usually much smaller - the giant cache file is a result of a deficiency in HFS. We will find a workaround.
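The apparent-vs-actual size distinction above can be demonstrated with a small shell sketch. It assumes a filesystem with sparse-file support (e.g. ext4 or APFS); on HFS+, the two sizes match because every hole is backed by real blocks, which is exactly the problem reported in this issue.

```shell
# Write a single byte at the 200 MiB mark; dd's seek= leaves a hole
# before it, which sparse-capable filesystems do not back with disk blocks.
dd if=/dev/zero of=sparse.demo bs=1 count=1 seek=$((200 * 1024 * 1024)) 2>/dev/null

ls -lh sparse.demo   # apparent size: ~200M (like the .sra.cache files)
du -h  sparse.demo   # actual disk usage: a few KB on a sparse-capable filesystem
rm sparse.demo
```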

@cmnbroad
Collaborator Author

I'm able to run these tests now - I think this was fixed in #638 and can be closed?

@cmnbroad
Collaborator Author

@a-nikitiuk Is there anything else to be done here?

@a-nikitiuk
Contributor

Nope, it can be closed.

@cmnbroad
Collaborator Author

Closed via #638.
