dvc: optional cache file encryption #1317

Closed
efiop opened this issue Nov 12, 2018 · 17 comments
Labels
enhancement Enhances DVC feature request Requesting a new feature p2-medium Medium priority, should be done, but less important

Comments

@efiop
Contributor

efiop commented Nov 12, 2018

E.g. we would push encrypted files to remote for improved security.

@efiop efiop added enhancement Enhances DVC feature request Requesting a new feature labels Nov 12, 2018
@ghost

ghost commented Nov 12, 2018

From the Discord channel:

The basic idea would be to be able to store data and models on an untrusted cloud provider 😇 while still being able to share them with others or work on them locally, using a decryption key shared among members.

@aurelien-clu

Description

dvc enables us to share data and models on a remote server / data storage service.
Such a service may not always be trusted with our (or our clients') data and models.

The feature would work as follows:

  • pushing automatically encrypts data
  • fetching decrypts data (at the cost of using twice the space, temporarily at best or permanently at worst)
  • local data is always unencrypted (so it is usable)

The encryption key would have to be shared among project members.

Configuration example

dvc remote add [...] --encryption-method AES-256 --encryption-key XYZ
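
For future readers, the push/fetch flow described above can be sketched in a few lines of Python. This is only an illustration of the round trip with a shared key, not DVC code: the function names `encrypt_for_push` and `decrypt_after_fetch` are hypothetical, and the cipher is a stdlib-only stand-in (a SHA-256 counter keystream plus an HMAC tag) used so the example runs without extra dependencies. It is not AES-256 and should not be used to protect real data; a real implementation would use a vetted library.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stand-in keystream: SHA-256 over key + nonce + counter (NOT AES)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_for_push(key: bytes, plaintext: bytes) -> bytes:
    """Hypothetical push hook: returns nonce + tag + encrypted body."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    # The tag lets the fetching side detect a wrong key or corrupted data.
    tag = hmac.new(key, nonce + body, hashlib.sha256).digest()
    return nonce + tag + body


def decrypt_after_fetch(key: bytes, blob: bytes) -> bytes:
    """Hypothetical fetch hook: verifies the tag, then restores the plaintext."""
    nonce, tag, body = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted data")
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))
```

Every project member holding the same key can decrypt what any other member pushed, while the remote only ever stores the encrypted blob.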

@aurelien-clu

aurelien-clu commented Nov 12, 2018

Example of a dvc file (source):

cmd: python cmd.py input.data output.data metrics.json
deps:
- md5: da2259ee7c12ace6db43644aef2b754c
  path: cmd.py
  md5_encrypted: [...]
- md5: e309de87b02312e746ec5a500844ce77
  path: input.data
  md5_encrypted: [...]
md5: 521ac615cfc7323604059d81d052ce00 # <= this hash is about this file?
outs:
- cache: true
  md5: 70f3c9157e3b92a6d2c93eb51439f822
  md5_encrypted: [...]
  metric: false
  path: output.data
- cache: false
  md5: d7a82c3cdfd45c4ace13484a931fc526
  md5_encrypted: [...]
  metric:
    type: json
    xpath: AUC
  path: metrics.json
locked: True
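
To make the proposed `md5_encrypted` field a bit more concrete, here is a rough sketch of how both checksums of one entry could be computed. The `md5_encrypted` key is only the proposal from this thread, not an existing DVC field, and `file_md5` / `checksum_entry` are hypothetical helper names. The only real API used is `hashlib.md5`, fed in chunks so large files do not have to fit in memory.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_entry(plain_path: str, encrypted_path: str) -> dict:
    """Build one deps/outs entry with both checksums (hypothetical layout)."""
    return {
        "path": plain_path,
        "md5": file_md5(plain_path),                # checksum of the local file
        "md5_encrypted": file_md5(encrypted_path),  # checksum of the pushed blob
    }
```

Keeping both values would let dvc verify the local file and the encrypted remote copy independently.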

@efiop
Contributor Author

efiop commented Nov 12, 2018

@aurelien-clu

That md5 is a checksum for that dvc file. See https://dvc.org/doc/user-guide/dvc-file-format .

EDIT: sorry, didn't notice your link right away.

@efiop
Contributor Author

efiop commented Nov 12, 2018

Also, remote configuration is done with the dvc remote modify command, so it would be something like dvc remote modify myremote encryption_method rsa.

@aurelien-clu

aurelien-clu commented Nov 13, 2018

Alright thanks.
(sorry did not have much time to look further in the documentation)

I am looking into how to handle file encryption and decryption in Python, and then where to hook this into the dvc code.

I feel like dvc.remote.base is a good candidate, but I am not certain.
And there are other parts to update as well:

  • the cli,
  • the handling of .dvc file

Imho.

EDIT:
File encryption and decryption using AES (still updated though not many stars)

@ghost ghost assigned ghost and unassigned ghost Nov 20, 2018
@swanandgore

FWIW, it would be a very useful feature (at least for my workplace!) for dvc to first encrypt and then store into cloud storage. It would allow more layers of defence and control when storing sensitive artefacts. Google Cloud's KMS would be easy to add to the _upload and _download methods of the RemoteGS class (remote/gs.py). This could sit as another remote or as additional config on the GCS remote.

@shcheklein
Member

@swanandgore hi! That makes total sense to me, and we will probably raise the priority for this. Could you elaborate a bit, though, on why cloud-side encryption is not enough? Something like the sse option for the S3 remote here - https://dvc.org/doc/commands-reference/remote/modify

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p4-not-important labels Sep 2, 2019
@dr-duplo

dr-duplo commented Feb 4, 2020

Why not use existing encryption solutions like git-crypt or the like for transparent encryption?
These tools already provide an AES key and PGP user "management", which is very handy.
This would also reduce the implementation complexity a lot.

It would be especially convenient in our use case. We encrypt some non-DVC files with git-crypt and also want to encrypt those managed by dvc, so that only encrypted data is stored in the storage (and cache).

@efiop
Contributor Author

efiop commented Feb 5, 2020

@dr-duplo Great idea! Maybe you could elaborate on what the git-crypt-based implementation could look like?

@dr-duplo

dr-duplo commented Feb 8, 2020

Yes. Sure. I will try, but I'm not deeply familiar with the inner workings of dvc yet:

I assume for now that git-crypt is used. With it you can generate multiple AES keys used to encrypt files. Besides the default key, you can also create a named one. After creating a dvc-specific key, you can permit a GPG user to use it. Other users will not be able to get the key. It is individually encrypted with the user's GPG public key. Only already permitted users can allow other users to use the key. The per-user encrypted key gets added to the repo.

To reveal the key for a user, "git-crypt unlock" has to be used. The key can then be found in the ".git" directory. DVC can use the key to encrypt/decrypt the data to/from remote storage or the cache.
This should be completely transparent to the user.

DVC could generate individual keys for every remote or on the repo level.
The encryption as git-crypt uses it produces stable ciphertexts, which means that if the content of a file
doesn't change, its encrypted version is also fixed (read more in the git-crypt documentation). The question here is whether one could reuse its encryption/decryption routines.
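
As a note on the "stable ciphertexts" point: as far as I understand the git-crypt documentation, this comes from deriving the IV from an HMAC-SHA1 of the file contents instead of picking it at random. The sketch below shows only that derivation, not the AES-CTR encryption itself. The function name `synthetic_iv` and the 12-byte truncation are assumptions made for illustration, not git-crypt's actual code.

```python
import hashlib
import hmac


def synthetic_iv(hmac_key: bytes, plaintext: bytes, size: int = 12) -> bytes:
    """Derive a deterministic IV from the file contents (illustrative only).

    Same key + same contents -> same IV, so a cipher keyed the same way
    would produce the same ciphertext on every run.
    """
    return hmac.new(hmac_key, plaintext, hashlib.sha1).digest()[:size]
```

The trade-off is that anyone looking at the remote can tell when two encrypted files have identical contents, which is also what would let a content-addressed cache keep deduplicating them.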

Possible DVC command extensions:

  • dvc remote add ...
    • -e, --encrypt [other remote | keyfile] encrypt remote storage
    • initializes key or reuses existing key
    • key is named by remote name per default
    • used key name is stored in the remotes config
  • dvc remote modify ...
    • -p, --enc-permit-user GPG_USER_ID
    • adds a gpg user to the key of the remote

This is a rough draft w/o checking anything. A POC would be nice.

@jackwellsxyz
Contributor

Want to circle back on this issue - SSE-KMS for AWS S3 back end is still important, as many shops require this for policy reasons.

@shcheklein
Member

@jackwellsxyz does dvc remote modify myremote sse <AES256, aws:kms, etc> work for you?

please check it here https://dvc.org/doc/command-reference/remote/modify in the list of AWS S3 specific options.

@jackwellsxyz
Contributor

jackwellsxyz commented May 7, 2020

Thanks @shcheklein, I've been diving into the code a bit. That is half of the solution: I need to set "sse = aws:kms" and also pass the SSEKMSKeyId argument to self.s3.upload_file via ExtraArgs. Since there are quite a few extra arguments you can pass to the boto S3 upload_file() function, I wonder if it makes sense to enable extra keys in the S3 config schema so you can pass some of these arguments directly.

I'd be happy to do a PR if that would help - in either case, it seems like a relatively straightforward fix (don't they all...)
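
For reference, a minimal sketch of what passing those two settings through could look like. `ServerSideEncryption` and `SSEKMSKeyId` are real `ExtraArgs` keys accepted by boto3's `upload_file`; the helper names `build_sse_extra_args` and `upload_with_sse`, and the way the values arrive from config, are assumptions and not DVC's actual code.

```python
from typing import Dict, Optional


def build_sse_extra_args(
    sse: Optional[str], kms_key_id: Optional[str] = None
) -> Dict[str, str]:
    """Translate remote config values into boto3 ExtraArgs (hypothetical helper)."""
    extra: Dict[str, str] = {}
    if sse:
        extra["ServerSideEncryption"] = sse  # e.g. "AES256" or "aws:kms"
    if kms_key_id:
        if sse != "aws:kms":
            raise ValueError("a KMS key id only makes sense with sse = aws:kms")
        extra["SSEKMSKeyId"] = kms_key_id
    return extra


def upload_with_sse(s3_client, path, bucket, key, sse, kms_key_id=None):
    """Upload a file, asking S3 to encrypt it server-side.

    s3_client is expected to be a boto3 S3 client (boto3.client("s3")).
    """
    s3_client.upload_file(
        path, bucket, key, ExtraArgs=build_sse_extra_args(sse, kms_key_id)
    )
```

Keeping the dict construction separate from the upload call makes it easy to extend with further `ExtraArgs` keys if more options are added to the schema.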

@shcheklein
Member

@jackwellsxyz it looks like we are already down the path of adding all the options to the schema (I hope it's a finite number after all). It would definitely be great to have a PR for that. I would start with git grep <name_of_other_option> to see where it should be added. Thanks 🙏

@efiop
Contributor Author

efiop commented Oct 8, 2021

Closing as stale. We have remote-specific configuration for server-side encryption (e.g. sse for s3). DVC itself is not a security tool, and encrypting files ourselves is out of scope for now. Users can embed encrypting stages into their pipeline on their own though.

@efiop efiop closed this as completed Oct 8, 2021
@0x2b3bfa0
Member

Users can embed encrypting stages into their pipeline on their own though.

Sounds good! Leaving age as a message–in–a–bottle recommendation for future readers.
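
In case it helps, a small sketch of what such an encrypting stage could run, assuming the age CLI is installed and on PATH. The flags used (`-r`, `-o`, `--decrypt`, `-i`) are age's documented options; the helper names, file names and recipient string are placeholders, and wiring this in as a pipeline stage is left to the reader.

```python
import subprocess
from typing import List


def age_encrypt_cmd(recipient: str, src: str, dst: str) -> List[str]:
    """Build the argv that encrypts src to dst for one age recipient."""
    return ["age", "-r", recipient, "-o", dst, src]


def age_decrypt_cmd(identity_file: str, src: str, dst: str) -> List[str]:
    """Build the argv that decrypts src to dst with an age identity file."""
    return ["age", "--decrypt", "-i", identity_file, "-o", dst, src]


def run(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError if age fails."""
    subprocess.run(cmd, check=True)
```

The encrypted output would then be the file tracked and pushed by dvc, with the plaintext kept out of the remote.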


7 participants