Merged
172 changes: 21 additions & 151 deletions Documentation/Data.md
@@ -2,8 +2,13 @@ ITK Data
========

This page documents how to add test data while developing ITK. See our
[CONTRIBUTING](../CONTRIBUTING.md) and
[UploadBinaryData](UploadBinaryData.md) guides for more information.
[CONTRIBUTING](../CONTRIBUTING.md) and [UploadBinaryData] guides for more
information.

These instructions assume that the required data is contained in binary files.
The same procedure, except for the steps related to content link file
generation, also applies to data contained in text files that a test may
require.

Setup
-----
@@ -50,36 +55,19 @@ to the test directory:
* Files in `Testing/Data` may be referenced as
`DATA{${ITK_DATA_ROOT}/Input/MyInput.png}`.
* If the data file references other data files, e.g. `.mhd -> .raw`, follow the
link to the ExternalData module on the right and read the documentation on
link to the `ExternalData` module on the right and read the documentation on
"associated" files.
* Multiple baseline images and other series are handled automatically when the
reference ends in the ",:" option; follow the link to the `ExternalData`
module on the right for details.

### Run CMake

CMake will move the original file. Keep your own copy if necessary.

Run cmake on the build tree:

```sh
$ cd ../ITK-build
$ cmake .
```
(*Or just run `make` to do a full configuration and build.*)

```sh
$ cd ../ITK
```
### Create content link

During configuration CMake will display a message such as:
For the reasons given in the [Discussion](#discussion) section, ITK and related
projects use content link files associated with the binary files rather than
the binary files themselves.

```sh
Linked Modules/.../test/Baseline/MyTest.png.md5 to ExternalData MD5/...
```

This means that CMake converted the file into a data object referenced by a
"content link".
To generate the content link file, use the procedure in [UploadBinaryData].


### Commit
@@ -88,47 +76,11 @@ Continue to create the topic and edit other files as necessary. Add the content
link and commit it along with the other changes:

```sh
$ git add Modules/.../test/Baseline/MyTest.png.md5
$ git add Modules/.../test/Baseline/MyTest.png.sha512
$ git add Modules/.../test/CMakeLists.txt
$ git commit
```

The local `pre-commit` hook will display a message such as:

```sh
Modules/.../test/Baseline/MyTest.png.md5: Added content to Git at refs/data/MD5/...
Modules/.../test/Baseline/MyTest.png.md5: Added content to local store at .ExternalData/MD5/...
Content link Modules/.../test/Baseline/MyTest.png.md5 -> .ExternalData/MD5/...
```

This means that the pre-commit hook recognized that the content link references
a new data object and [prepared it for upload](#pre-commit).

### Push

Follow the instructions to share the topic. When you push it to Gerrit for review using

```sh
$ git gerrit-push
```

Part of the output will be of the form

```sh
* ...:refs/data/commits/... [new branch]
* HEAD:refs/for/master/my-topic [new branch]
Pushed refs/data and removed local copy:
MD5/...
```

This means that the `git gerrit-push` script pushed the topic and
[uploaded the data](#git-gerrit-push) it references.

Options for `gerrit-push`:

* `--dry-run`: Report push that would occur without actually doing it
* `--no-topic`: Push the data referenced by the topic but not the topic itself

Building
--------

@@ -140,9 +92,9 @@ directly, e.g. `make ITKData`, to obtain the data without a complete build.
The output will be something like

```sh
-- Fetching ".../ExternalData/MD5/..."
-- Fetching ".../ExternalData/SHA512/..."
-- [download 100% complete]
-- Downloaded object: "ITK-build/ExternalData/Objects/MD5/..."
-- Downloaded object: "ITK-build/ExternalData/Objects/SHA512/..."
```

The downloaded files appear in `ITK-build/ExternalData` by default.
@@ -158,7 +110,7 @@ build trees, e.g. "`/home/user/.ExternalData`":
$ cmake -DExternalData_OBJECT_STORES=/home/user/.ExternalData ../ITK
```

The ExternalData module will store downloaded objects in the local store
The `ExternalData` module will store downloaded objects in the local store
instead of the build tree. Once an object has been downloaded by one build it
will persist in the local store for re-use by other builds without downloading
again.
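
As a rough illustration of why a shared local store is safe to reuse, the sketch below checks that each stored object still matches the hash it is named after. It assumes the store is laid out as `<store>/SHA512/<hash>`, mirroring the `Objects/SHA512/...` path in the download output above. That layout and the throwaway store built here are assumptions for illustration, not a description of the module's internals.

```sh
# Build a throwaway store so the sketch is self-contained. With a real build,
# point "store" at your ExternalData_OBJECT_STORES directory instead.
store=$(mktemp -d)
mkdir -p "$store/SHA512"
printf 'example data\n' > "$store/payload"
hash=$(sha512sum "$store/payload" | cut -d ' ' -f 1)
mv "$store/payload" "$store/SHA512/$hash"

# Assumed layout: <store>/SHA512/<hash>.
# The content hash of each object should equal its file name.
status=ok
for object in "$store"/SHA512/*; do
  [ -f "$object" ] || continue
  actual=$(sha512sum "$object" | cut -d ' ' -f 1)
  if [ "$actual" != "$(basename "$object")" ]; then
    echo "MISMATCH: $object"
    status=corrupt
  fi
done
echo "store status: $status"
```

Because objects are addressed by content, a mismatch reported by such a check would point to a damaged file rather than to a stale version.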
@@ -173,92 +125,10 @@ data object by a hash of its content. At build time the the
module fetches data needed by enabled tests. This allows arbitrarily large data
to be added and removed without bloating the version control history.

The above [#workflow] allows developers to add a new data file almost as if
committing it to the source tree. The following subsections discuss details of
the workflow implementation.

### ExternalData

While [#run-cmake] runs the `ExternalData` module evaluates `DATA{}`
references. ITK
[sets](https://github.com/InsightSoftwareConsortium/ITK/blob/master/CMake/ExternalData.cmake)
the `ExternalData_LINK_CONTENT` option to `MD5` to enable
automatic conversion of raw data files into content links. When the module
detects a real data file in the source tree it performs the following
transformation as specified in the module documentation:

* Compute the MD5 hash of the file
* Store the `${hash}` in a file with the original name plus `.md5`
* Rename the original file to `.ExternalData_MD5_${hash}`

The real data now sit in a file that we
[tell Git to ignore](https://github.com/InsightSoftwareConsortium/ITK/blob/master/.gitignore).
For example:

```sh
$ cat Modules/.../test/Baseline/.ExternalData_MD5_477e602800c18624d9bc7a32fa706b97 |md5sum
477e602800c18624d9bc7a32fa706b97 -
$ cat Modules/.../test/Baseline/MyTest.png.md5
477e602800c18624d9bc7a32fa706b97
```

#### Recover Data File

To recover the original file after running CMake but before committing, undo
the operation:

```sh
$ cd Modules/.../test/Baseline
$ mv .ExternalData_MD5_$(cat MyTest.png.md5) MyTest.png
```

### pre-commit

While [committing](#commit) a new or modified content link the
[`pre-commit`](https://github.com/InsightSoftwareConsortium/ITK/blob/master/Utilities/Hooks/pre-commit)
hook moves the real data object from the `.ExternalData_MD5_${hash}` file left
by the `ExternalData` module to a local object repository stored in a
`.ExternalData` directory at the top of the source tree.

The hook also uses Git plumbing commands to store the data object as a blob in
the local Git repository. The blob is not referenced by the new commit but
instead by `refs/data/MD5/${hash}`. This keeps the blob alive in the local
repository but does not add it to the project history. For example:

```sh
$ git for-each-ref --format="%(refname)" refs/data
refs/data/MD5/477e602800c18624d9bc7a32fa706b97
$ git cat-file blob refs/data/MD5/477e602800c18624d9bc7a32fa706b97 | md5sum
477e602800c18624d9bc7a32fa706b97 -
```

### git gerrit-push

The `git gerrit-push` command is actually an alias for the
`Utilities/Git/git-gerrit-push` script. In addition to pushing the topic branch
to [Gerrit] the script also detects content links added or modified by the
commits in the topic. It reads the data object hashes from the content links
and looks for matching `refs/data/` entries in the local Git repository.

The script pushes the matching data objects to Gerrit inside a temporary commit object disjoint from the rest of history. For example:

```sh
$ git gerrit-push --dry-run --no-topic
* f59717cfb68a7093010d18b84e8a9a90b6b42c11:refs/data/commits/f59717cfb68a7093010d18b84e8a9a90b6b42c11 [new branch]
Pushed refs/data and removed local copy:
MD5/477e602800c18624d9bc7a32fa706b97
$ git ls-tree -r --name-only f59717cf
MD5/477e602800c18624d9bc7a32fa706b97
$ git log --oneline f59717cf
f59717c data
```

A robot runs every few minutes to fetch the objects from individual
[data.kitware.com] accounts and uploads them to the ITK collection on
[data.kitware.com]. We tell `ExternalData`to search this locations and other
redundant data stores at build time.

For more information, see
[CMake ExternalData: Using Large Files with Distributed Version Control](https://blog.kitware.com/cmake-externaldata-using-large-files-with-distributed-version-control/).


[data.kitware.com]: https://data.kitware.com/
[Gerrit]: http://review.source.kitware.com/p/ITK
[Github]: https://github.com/InsightSoftwareConsortium/ITK
[UploadBinaryData]: UploadBinaryData.md
45 changes: 38 additions & 7 deletions Documentation/UploadBinaryData.md
@@ -13,8 +13,8 @@ download the files at build time with [CMake].

A "content link" file contains an identifying [SHA512 hash]. The content
link is stored in the [Git] repository at the path where the file would exist,
but with a `.sha512` extension appended to the file name. CMake will find these
content link files at **build** time, download them from a list of server
but with a `.sha512` extension appended to the file name. [CMake] will find
these content link files at **build** time, download them from a list of server
resources, and create symlinks or copies of the original files at the
corresponding location in the **build tree**.
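
To make the idea of a content link concrete, here is a minimal sketch that computes a file's SHA512 hash by hand and writes it next to the file with `.sha512` appended. The file name `MyTest.png`, the temporary directory, and the placeholder file content are hypothetical. In practice the content link is produced by the upload procedure this guide describes, so treat this only as an illustration of what the file holds.

```sh
# Work in a scratch directory so nothing in a real source tree is touched.
workdir=$(mktemp -d)
data_file="$workdir/MyTest.png"          # hypothetical file name
printf 'example data\n' > "$data_file"   # stand-in for a real binary file

# The content link holds the hex digest, which is the first field of the
# sha512sum output.
sha512sum "$data_file" | cut -d ' ' -f 1 > "$data_file.sha512"

cat "$data_file.sha512"
```

The resulting `MyTest.png.sha512` is a small text file, which is what makes it practical to keep in Git in place of the data itself.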

@@ -52,9 +52,10 @@ Using Girder to get the SHA512 hash
### Prerequisites

The [data.kitware.com] server is an ITK community resource where any community
member can upload binary data files. There are two methods available to upload
data files:
member can upload binary data files. There are three methods available to
upload data files:

1. The [UploadBinaryData.sh] shell script.
2. The [Girder web interface].
3. The `girder-cli` command line executable that comes with the
[girder-client] Python package.
@@ -63,9 +64,37 @@ Before uploading data, please visit [data.kitware.com] and register for an
account.

Once files have been uploaded to your account, they will be publicly available
and accessible since data is content addressed. At release time, the release
manager will upload and archive repository data references in the
[ITK collection] and other redundant storage locations.
and accessible since data is content addressed. Specifically, the
[hashsum_download] plugin in Girder looks through all public (or private if
authenticated) data for files with the given hash. Thus, as long as the file
is publicly available somewhere on [data.kitware.com], ITK will be able to
retrieve the corresponding file.

At release time, the release manager will upload and archive repository data
references in the [ITK collection] and other redundant storage locations.


### Upload via the shell script

The script will authenticate to [data.kitware.com], upload the file to your
user account's *Public* folder, and create a `*.sha512` [CMake] `ExternalData`
content link file. After the content link has been created, you will need to
add the `*.sha512` file to your commit.

To use the script:

1. Sign up for an account at [data.kitware.com].
2. Place the binary file at the desired location in the Git repository.
3. Run the script, passing the binary file(s) as arguments.
4. In the corresponding `test/CMakeLists.txt` file, use the
`itk_add_test` macro and reference the file path with `DATA` and braces,
e.g. `DATA{<Relative/Path/To/Source/Tree/File>}`.
5. Re-build ITK, and the testing data will be downloaded into the build tree.


If the `GIRDER_API_KEY` environment variable is not set, a prompt will appear
for your username and password. The API key can be created from the
[data.kitware.com] user account web browser interface.


### Upload via the web interface
@@ -135,11 +164,13 @@ actual file is desired in the build tree. Stage the new file to your commit:
[girder-client]: https://girder.readthedocs.io/en/latest/python-client.html#the-command-line-interface
[Girder web interface]: https://girder.readthedocs.io/en/latest/user-guide.html
[Git]: https://git-scm.com/
[hashsum_download]: https://girder.readthedocs.io/en/latest/plugins.html#hashsum-download
[ITK collection]: https://data.kitware.com/#collection/57b5c9e58d777f126827f5a1
[ITK community]: https://discourse.itk.org/
[ITK Examples]: https://itk.org/ITKExamples/index.html
[ITK Software Guide]: https://itk.org/ItkSoftwareGuide.pdf
[solution to this problem]: https://blog.kitware.com/cmake-externaldata-using-large-files-with-distributed-version-control/
[UploadBinaryData.sh]: ../Utilities/UploadBinaryData.sh

[Analyze format]: http://www.grahamwideman.com/gw/brain/analyze/formatdoc.htm
[MD5 hash]: https://en.wikipedia.org/wiki/MD5