dvc: optional cache file encryption #1317
Comments
From the Discord channel:
|
Description
This would be as follows:
The encryption key would have to be shared among project members. Configuration example: dvc remote add [...] --encryption-method AES-256 --encryption-key XYZ |
Example of cmd: python cmd.py input.data output.data metrics.json
deps:
- md5: da2259ee7c12ace6db43644aef2b754c
  path: cmd.py
  md5_encrypted: [...]
- md5: e309de87b02312e746ec5a500844ce77
  path: input.data
  md5_encrypted: [...]
md5: 521ac615cfc7323604059d81d052ce00 # <= this hash is about this file?
outs:
- cache: true
  md5: 70f3c9157e3b92a6d2c93eb51439f822
  md5_encrypted: [...]
  metric: false
  path: output.data
- cache: false
  md5: d7a82c3cdfd45c4ace13484a931fc526
  md5_encrypted: [...]
  metric:
    type: json
    xpath: AUC
  path: metrics.json
locked: True |
That md5 is a checksum for that dvc file. See https://dvc.org/doc/user-guide/dvc-file-format . EDIT: sorry, didn't notice your link right away. |
Also, remote configuration is done with |
Alright thanks. I am looking into how to handle file encryption and decryption in python and then where to hook this into I feel like
Imho. EDIT: |
FWIW, it will be a very useful feature (at least for my workplace!) for dvc to first encrypt and then store into cloud storage. It will allow more layers of defence and control while storing sensitive artefacts. Google cloud's KMS would be easy to add into |
@swanandgore hi! Makes total sense to me, and we will probably raise the priority of this. Could you elaborate a bit on why cloud-side encryption is not enough, though? Something like: |
Why not use existing encryption solutions like git-crypt or the like for transparent encryption? It would be especially convenient in our use case. We encrypt some non-DVC files with git-crypt and also want to encrypt those managed by dvc, so that only encrypted data is stored in the storage (and cache). |
@dr-duplo Great idea! Maybe you could elaborate on what a git-crypt-based implementation could look like? |
Yes, sure. I will try, but I'm not deeply familiar with the inner workings of dvc yet. I assume for now that git-crypt is used. With it you can generate multiple AES keys used to encrypt files. Besides the default key you can also create a named one. After a dvc-specific key has been created, you can permit a GPG user to use the key. Other users will not be able to get the key: it is individually encrypted with each user's GPG public key. Only already permitted users can allow other users to use the key. The per-user encrypted key gets added to the repo. To reveal the key for a user, "git-crypt unlock" has to be used. The key can then be found in the ".git" directory. DVC can use the key to encrypt/decrypt the data to/from remote storage or the cache. DVC could generate individual keys for every remote or at the repo level. Possible DVC command extensions:
This is a rough draft w/o checking anything. A POC would be nice. |
Want to circle back on this issue: SSE-KMS for the AWS S3 backend is still important, as many shops require this for policy reasons. |
@jackwellsxyz please check it here: https://dvc.org/doc/command-reference/remote/modify , in the list of AWS S3 specific options. |
Thanks @shcheklein, I've been diving into the code a bit. That is half of the solution: I need to set "sse = aws:kms" and also pass the 'SSEKMSKeyId' value to self.s3.upload_file in ExtraArgs. Since there are quite a few extra arguments you can pass to the boto S3 upload_file() function, I wonder if it makes sense to enable extra keys in the S3 config schema so you can pass some of these arguments directly. I'd be happy to do a PR if that would help. In either case, it seems like a relatively straightforward fix (don't they all...) |
@jackwellsxyz it looks like we are already going down the path of adding all the options to the schema (I hope it's a finite number after all). It would definitely be great to have a PR for that. I would start with |
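For future readers, a minimal sketch of the idea described above: mapping remote config options onto the ExtraArgs that boto3's upload_file() accepts. The option name "sse_kms_key_id", the helper names, and the mapping table are assumptions made for illustration, not DVC's actual schema or code; "ServerSideEncryption" and "SSEKMSKeyId" are the ExtraArgs keys boto3 uses for SSE-KMS.

```python
def build_extra_args(remote_config):
    """Translate remote config options into a boto3 ExtraArgs dict."""
    # Hypothetical mapping: config option name -> boto3 ExtraArgs key.
    option_to_arg = {
        "sse": "ServerSideEncryption",
        "sse_kms_key_id": "SSEKMSKeyId",
    }
    # Only forward options that are actually set in the config.
    return {
        arg: remote_config[opt]
        for opt, arg in option_to_arg.items()
        if remote_config.get(opt)
    }


def upload(path, bucket, key, remote_config):
    """Upload a local file to S3, passing encryption options via ExtraArgs."""
    # Imported lazily so build_extra_args() is usable without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key, ExtraArgs=build_extra_args(remote_config))
```

With a config of {"sse": "aws:kms", "sse_kms_key_id": "<key id>"}, both arguments would be forwarded to upload_file(); extending the table is how further schema options could be passed through. |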
Closing as stale. We have remote-specific configuration for server-side encryption (e.g. sse for s3). DVC itself is not a security tool, and encrypting files ourselves is out of scope for now. Users can embed encrypting stages into their pipelines on their own, though. |
Sounds good! Leaving age as a message-in-a-bottle recommendation for future readers. |
E.g. we would push encrypted files to remote for improved security.