diff --git a/Documentation/Data.md b/Documentation/Data.md index 2027cfc9418..681ad9fa95a 100644 --- a/Documentation/Data.md +++ b/Documentation/Data.md @@ -2,8 +2,13 @@ ITK Data ======== This page documents how to add test data while developing ITK. See our -[CONTRIBUTING](../CONTRIBUTING.md) and -[UploadBinaryData](UploadBinaryData.md) guides for more information. +[CONTRIBUTING](../CONTRIBUTING.md) and [UploadBinaryData] guides for more +information. + +While these instructions assume that the required data will be contained in +binary files, the procedure (except that related to the content link file +generation) also applies to any other data contained in a text file that a +test may require, if any. Setup ----- @@ -50,36 +55,19 @@ to the test directory: * Files in `Testing/Data` may be referenced as `DATA{${ITK_DATA_ROOT}/Input/MyInput.png}`. * If the data file references other data files, e.g. `.mhd -> .raw`, follow the - link to the ExternalData module on the right and read the documentation on + link to the `ExternalData` module on the right and read the documentation on "associated" files. * Multiple baseline images and other series are handled automatically when the reference ends in the ",:" option; follow the link to the `ExternalData` module on the right for details. -### Run CMake - -CMake will move the original file. Keep your own copy if necessary. - -Run cmake on the build tree: - -```sh - $ cd ../ITK-build - $ cmake . -``` -(*Or just run `make` to do a full configuration and build.*) - -```sh - $ cd ../ITK -``` +### Create content link -During configuration CMake will display a message such as: +For the reasons stated in the [Discussion](#discussion) section, rather than +the binary files themselves, ITK and related projects use content link files +associated with these files. -```sh - Linked Modules/.../test/Baseline/MyTest.png.md5 to ExternalData MD5/... 
-``` - -This means that CMake converted the file into a data object referenced by a -"content link". +To generate the content link file, use the procedure in [Upload Binary Data][UploadBinaryData]. ### Commit @@ -88,47 +76,11 @@ Continue to create the topic and edit other files as necessary. Add the content link and commit it along with the other changes: ```sh - $ git add Modules/.../test/Baseline/MyTest.png.md5 + $ git add Modules/.../test/Baseline/MyTest.png.sha512 $ git add Modules/.../test/CMakeLists.txt $ git commit ``` -The local `pre-commit` hook will display a message such as: - -```sh - Modules/.../test/Baseline/MyTest.png.md5: Added content to Git at refs/data/MD5/... - Modules/.../test/Baseline/MyTest.png.md5: Added content to local store at .ExternalData/MD5/... - Content link Modules/.../test/Baseline/MyTest.png.md5 -> .ExternalData/MD5/... -``` - -This means that the pre-commit hook recognized that the content link references -a new data object and [prepared it for upload](#pre-commit). - -### Push - -Follow the instructions to share the topic. When you push it to Gerrit for review using - -```sh - $ git gerrit-push -``` - -Part of the output will be of the form - -```sh - * ...:refs/data/commits/... [new branch] - * HEAD:refs/for/master/my-topic [new branch] - Pushed refs/data and removed local copy: - MD5/... -``` - -This means that the `git gerrit-push` script pushed the topic and -[uploaded the data](#git-gerrit-push) it references. - -Options for `gerrit-push`: - - * `--dry-run`: Report push that would occur without actually doing it - * `--no-topic`: Push the data referenced by the topic but not the topic itself - Building -------- @@ -140,9 +92,9 @@ directly, e.g. `make ITKData`, to obtain the data without a complete build. The output will be something like ```sh - -- Fetching ".../ExternalData/MD5/..." + -- Fetching ".../ExternalData/SHA512/..." -- [download 100% complete] - -- Downloaded object: "ITK-build/ExternalData/Objects/MD5/..." 
+ -- Downloaded object: "ITK-build/ExternalData/Objects/SHA512/..." ``` The downloaded files appear in `ITK-build/ExternalData` by default. @@ -158,7 +110,7 @@ build trees, e.g. "`/home/user/.ExternalData`": $ cmake -DExternalData_OBJECT_STORES=/home/user/.ExternalData ../ITK ``` -The ExternalData module will store downloaded objects in the local store +The `ExternalData` module will store downloaded objects in the local store instead of the build tree. Once an object has been downloaded by one build it will persist in the local store for re-use by other builds without downloading again. @@ -173,92 +125,10 @@ data object by a hash of its content. At build time the the module fetches data needed by enabled tests. This allows arbitrarily large data to be added and removed without bloating the version control history. -The above [#workflow] allows developers to add a new data file almost as if -committing it to the source tree. The following subsections discuss details of -the workflow implementation. - -### ExternalData - -While [#run-cmake] runs the `ExternalData` module evaluates `DATA{}` -references. ITK -[sets](https://github.com/InsightSoftwareConsortium/ITK/blob/master/CMake/ExternalData.cmake) -the `ExternalData_LINK_CONTENT` option to `MD5` to enable -automatic conversion of raw data files into content links. When the module -detects a real data file in the source tree it performs the following -transformation as specified in the module documentation: - - * Compute the MD5 hash of the file - * Store the `${hash}` in a file with the original name plus `.md5` - * Rename the original file to `.ExternalData_MD5_${hash}` - -The real data now sit in a file that we -[tell Git to ignore](https://github.com/InsightSoftwareConsortium/ITK/blob/master/.gitignore). 
-For example: - -```sh - $ cat Modules/.../test/Baseline/.ExternalData_MD5_477e602800c18624d9bc7a32fa706b97 |md5sum - 477e602800c18624d9bc7a32fa706b97 - - $ cat Modules/.../test/Baseline/MyTest.png.md5 - 477e602800c18624d9bc7a32fa706b97 -``` - -#### Recover Data File - -To recover the original file after running CMake but before committing, undo -the operation: - -```sh - $ cd Modules/.../test/Baseline - $ mv .ExternalData_MD5_$(cat MyTest.png.md5) MyTest.png -``` - -### pre-commit - -While [committing](#commit) a new or modified content link the -[`pre-commit`](https://github.com/InsightSoftwareConsortium/ITK/blob/master/Utilities/Hooks/pre-commit) -hook moves the real data object from the `.ExternalData_MD5_${hash}` file left -by the `ExternalData` module to a local object repository stored in a -`.ExternalData` directory at the top of the source tree. - -The hook also uses Git plumbing commands to store the data object as a blob in -the local Git repository. The blob is not referenced by the new commit but -instead by `refs/data/MD5/${hash}`. This keeps the blob alive in the local -repository but does not add it to the project history. For example: - -```sh - $ git for-each-ref --format="%(refname)" refs/data - refs/data/MD5/477e602800c18624d9bc7a32fa706b97 - $ git cat-file blob refs/data/MD5/477e602800c18624d9bc7a32fa706b97 | md5sum - 477e602800c18624d9bc7a32fa706b97 - -``` - -### git gerrit-push - -The `git gerrit-push` command is actually an alias for the -`Utilities/Git/git-gerrit-push` script. In addition to pushing the topic branch -to [Gerrit] the script also detects content links added or modified by the -commits in the topic. It reads the data object hashes from the content links -and looks for matching `refs/data/` entries in the local Git repository. - -The script pushes the matching data objects to Gerrit inside a temporary commit object disjoint from the rest of history. 
For example: - -```sh - $ git gerrit-push --dry-run --no-topic - * f59717cfb68a7093010d18b84e8a9a90b6b42c11:refs/data/commits/f59717cfb68a7093010d18b84e8a9a90b6b42c11 [new branch] - Pushed refs/data and removed local copy: - MD5/477e602800c18624d9bc7a32fa706b97 - $ git ls-tree -r --name-only f59717cf - MD5/477e602800c18624d9bc7a32fa706b97 - $ git log --oneline f59717cf - f59717c data -``` - -A robot runs every few minutes to fetch the objects from individual -[data.kitware.com] accounts and uploads them to the ITK collection on -[data.kitware.com]. We tell `ExternalData`to search this locations and other -redundant data stores at build time. - +For more information, see +[CMake ExternalData: Using Large Files with Distributed Version Control](https://blog.kitware.com/cmake-externaldata-using-large-files-with-distributed-version-control/). [data.kitware.com]: https://data.kitware.com/ -[Gerrit]: http://review.source.kitware.com/p/ITK +[Github]: https://github.com/InsightSoftwareConsortium/ITK +[UploadBinaryData]: UploadBinaryData.md diff --git a/Documentation/UploadBinaryData.md b/Documentation/UploadBinaryData.md index 83639dd61b1..3439bc7b2bc 100644 --- a/Documentation/UploadBinaryData.md +++ b/Documentation/UploadBinaryData.md @@ -13,8 +13,8 @@ download the files at build time with [CMake]. A "content link" file contains an identifying [SHA512 hash]. The content link is stored in the [Git] repository at the path where the file would exist, -but with a `.sha512` extension appended to the file name. CMake will find these -content link files at **build** time, download them from a list of server +but with a `.sha512` extension appended to the file name. [CMake] will find +these content link files at **build** time, download them from a list of server resources, and create symlinks or copies of the original files at the corresponding location in the **build tree**. 
@@ -52,9 +52,10 @@ Using Girder to get the SHA512 hash ### Prerequisites The [data.kitware.com] server is an ITK community resource where any community -member can upload binary data files. There are two methods available to upload -data files: +member can upload binary data files. There are three methods available to +upload data files: + 1. The [UploadBinaryData.sh] shell script. 1. The [Girder web interface]. 2. The `girder-cli` command line executable that comes with the [girder-client] Python package. @@ -63,9 +64,37 @@ Before uploading data, please visit [data.kitware.com] and register for an account. Once files have been uploaded to your account, they will be publicly available -and accessible since data is content addressed. At release time, the release -manager will upload and archive repository data references in the -[ITK collection] and other redundant storage locations. +and accessible since data is content addressed. Specifically, the +[hashsum_download] plugin in Girder looks through all public (or private if +authenticated) data for files with the given hash. Thus, as long as the file +is publicly available somewhere on [data.kitware.com], ITK will be able to +retrieve the corresponding file. + +At release time, the release manager will upload and archive repository data +references in the [ITK collection] and other redundant storage locations. + + +### Upload via the shell script + +The script will authenticate to [data.kitware.com], upload the file to your +user account's *Public* folder, and create a `*.sha512` [CMake] `ExternalData` +content link file. After the content link has been created, you will need to +add the `*.sha512` file to your commit. + +To use the script: + + 1. Sign up for an account at [data.kitware.com]. + 2. Place the binary file at the desired location in the Git repository. + 3. Run the script, passing the binary file(s) as arguments. + 4. 
In the corresponding `test/CMakeLists.txt` file, use the + `itk_add_test` macro and reference the file path with `DATA` and braces, + e.g. `DATA{}`. + 5. Re-build ITK, and the testing data will be downloaded into the build tree. + + +If a `GIRDER_API_KEY` environment variable is not set, a prompt will appear +for your username and password. The API key can be created from the +[data.kitware.com] user account web browser interface. ### Upload via the web interface @@ -135,11 +164,13 @@ actual file is desired in the build tree. Stage the new file to your commit: [girder-client]: https://girder.readthedocs.io/en/latest/python-client.html#the-command-line-interface [Girder web interface]: https://girder.readthedocs.io/en/latest/user-guide.html [Git]: https://git-scm.com/ +[hashsum_download]: https://girder.readthedocs.io/en/latest/plugins.html#hashsum-download [ITK collection]: https://data.kitware.com/#collection/57b5c9e58d777f126827f5a1 [ITK community]: https://discourse.itk.org/ [ITK Examples]: https://itk.org/ITKExamples/index.html [ITK Software Guide]: https://itk.org/ItkSoftwareGuide.pdf [solution to this problem]: https://blog.kitware.com/cmake-externaldata-using-large-files-with-distributed-version-control/ +[UploadTestingData.sh]: ../Utilities/UploadTestingData.sh [Analyze format]: http://www.grahamwideman.com/gw/brain/analyze/formatdoc.htm [MD5 hash]: https://en.wikipedia.org/wiki/MD5
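
As a rough illustration of what a content link holds, the sketch below computes a SHA512 hash locally and writes it to a file named after the data file with `.sha512` appended, matching the description in the documentation being changed. It assumes GNU coreutils' `sha512sum` is available, and `MyTest.png` is a stand-in file created only for this example. It uploads nothing, so it does not replace the upload procedure.

```sh
# Create a stand-in data file (hypothetical name and content, for illustration).
printf 'example data\n' > MyTest.png

# Keep only the hex digest from the sha512sum output and store it next to
# the file, with ".sha512" appended to the original file name.
sha512sum MyTest.png | cut -d ' ' -f 1 > MyTest.png.sha512

# Show the resulting content link (128 hexadecimal characters).
cat MyTest.png.sha512
```

The data still has to be uploaded by one of the documented methods before it can be fetched at build time; this only shows the shape of the file that gets committed.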