-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Alterative hashings for data (#1676) #1920
Conversation
@vyloy Please rebase. |
@vyloy Your tests are failing. |
dvc/utils/__init__.py
Outdated
filtered = dict_filter(d, exclude) | ||
byts = json.dumps(filtered, sort_keys=True).encode("utf-8") | ||
return bytes_md5(byts) | ||
hasher = CHECKSUM_MAP[checksum_type]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not return bytes_checksum(byts, checksum_type)
then?
dvc/stage.py
Outdated
self.locked = locked | ||
self.tag = tag | ||
self._state = state or {} | ||
|
||
if repo and repo.config and repo.config.config: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code matches the one in remote local. How about we just use repo.cache.local.hash
instead, since it is pretty much the same thing and local cache is guaranteed to exist?
dvc/remote/local.py
Outdated
@@ -202,8 +216,11 @@ def isdir(self, path_info): | |||
def walk(self, path_info): | |||
return os.walk(path_info["path"]) | |||
|
|||
def get_hash_list(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the point of this wrapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to reuse the hash list inside output.base
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So why not just set self.hash = [self.PARAM_CHECKSUM]
in the RemoteBase and use it instead of get_hash_list
and get_prefer_hash_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my practice (in other languages), functions or methods are more flexible/optimizable/better than class fields in many cases.
-
The first thought was to prevent
hash
get changed from outside. -
Like
get_prefer_hash_type
will be more clear and easy to refactor than self.hash[0] in the code(11 callers) in the future, and the callers don't have to figure out the differences between different remote implements, on the other hand, other remote implements are free to use other different strategies. -
set self.hash = [self.PARAM_CHECKSUM] in the RemoteBase
will need long-term memory in heap for gc process to check, instead of short-term or stack-alloc(if compiler is smart enough) here for other implements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great points! getters/setters are nice when you have some complex logic, but in our case properties are more than enough.
Thinking about it a little more, it actually feels like much of that checksum logic should be contained in the dvc/remote/local
. That way you won't even have to access those hash types from the outside. For example, Stage.changed_checksum() could then look like:
return self.checksum != self._compute_checksum()
and Stage.compute_checksum() would
...
return self.cache.local.dict_checksum(d, exclude=...)
that way almost all of the internals are hidden where they matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually pretty similar to how we handle RemoteLOCAL.cache_types
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactoring so much lines for showing properties are more than enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vyloy Sorry, what do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring this two places to avoid use property self.hash = [self.PARAM_CHECKSUM]
outside remote is fine, but how about other places, they need to be consistent, right?
for example stage.py is_cached:
for out in outs:
out.pop(self.repo.cache.local.get_prefer_hash_type(), None)
out.pop(RemoteS3.PARAM_CHECKSUM, None)
dvc/remote/local.py
Outdated
self.hash = [utils.CHECKSUM_MD5] | ||
else: | ||
self.hash = [utils.CHECKSUM_MD5] | ||
RemoteLOCAL.PARAM_CHECKSUM = self.hash[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overwriting constants is not nice.
dvc/remote/local.py
Outdated
if isinstance(core_hash, str): | ||
core_hash = [h.strip() for h in core_hash.split(",")] | ||
self.hash = core_hash | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just set this once before if
instead of having two else
s
dvc/output/base.py
Outdated
@@ -154,10 +158,14 @@ def exists(self): | |||
return self.remote.exists(self.path_info) | |||
|
|||
def changed_checksum(self): | |||
cs = self.checksum | |||
if cs is None: | |||
self.remote.update_state_checksum(self.path_info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? state is updated in get_file_checksum
and get_dir_checksum
already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fixes a bug while we use unit test and terminal in the same time.
if we initial the environment by test(e.g. add foo), and then change the config to sha256
in terminal, then it shows that it will uses the hash from state in the dvc file directly without rehash(e.g. in the dvc file, out: sha256: xxxxxx(this is a md5)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a proper way to solve this. We should have a way to distinguish between different checksums in the state file itself. Seems like introducing a column for each checksum type would be a good way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a quick hack.
if cs is None, means config hash.local
has been changed and no record fits, same to state record, so this update to state is needed anyway.
for example, changing config hash.local
from md5
to sha256
will go into this condition, while from md5
to sha256, md5
don't.
Introducing a column for each checksum type, many places need to be changed, and will lead to worse performance because additional checks that do not necessary for usual use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example in the unit test:
def test_incompatibility(self):
ret = main(["add", self.FOO])
self.assertEqual(0, ret)
ret = main(["config", "hash.local", "sha256"])
self.assertEqual(0, ret)
ret = main(["status"])
self.assertEqual(0, ret)
ret = main(["status", "-q"])
self.assertEqual(1, ret)
ret = main(["add", self.FOO]) // update_state_checksum is called here.
self.assertEqual(0, ret)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get your point...
Why self.remote.save_info
can expose while self.remote.update_state_checksum
can not?
For example?
The fact that this particular output doesn't have some checksum doesn't mean that it is not used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why self.remote.save_info can expose while self.remote.update_state_checksum can not?
Because save_info
is not exposing state notion, while update_state_checksum
does. Your update_state_checksum
is almost a full copy of get_checksum
and it is called right before it, which is just not right. What we could do instead is leave checksum type notion inside of the Remote, so it is the one dealing with it without you having to call dirty hacks from Output. For example, you could get rid of update_state_checksum
if you simply make Remote.get_checksum
do smth like checksum = self.state.get(path_info, checksum_type=self.checksum)
, so if it say doesn't have sha256, then it would return None. This would help us avoid a lot of code duplication.
For example?
For example, you have two branches: one is using sha256 and the other old one is still using md5s. We could also decide to allow per-output/dependency checksum type configuration in the future, so the same file could be used with different checksums. So it makes sense to make this generic enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the name is exposing state notion, how about change the name to refresh_info
?
It will return True then, why do you say so? And it is only called while the output checksum is not match with config. That is not even the usual use case.
it is called right before it, which is just not right.
If we leave checksum type notion inside of the Remote, how would Remote know what checksum type is output using for querying from state to compare?
Actually, I have tried this before used this most simple way, this way leads to a lot of code refactoring. 🙂 Why do people always like to ignore the simplest and straightforward way?
For example, you could get rid of update_state_checksum if you simply make Remote.get_checksum do smth like checksum = self.state.get(path_info, checksum_type=self.checksum), so if it say doesn't have sha256, then it would return None.
IMHO, There will be a lot of advantages with this little code duplication here, e.g. better performance, less cognitive complexity etc. 🙂
That make sense from the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try some refactoring to implement this.
That is a great point! Indeed, creating columns for each additional checksum type is a bad solution. On the other hand, we still might want to keep a few checksum for each file. So currently we store md5 in md5 column, how about we replace it with checksum column instead, that would store a dict(e.g. serilized as json) instead that looks smth like {"md5": "79054025255fb1a26e4bc422aef54eb4", "sha256": "b8423365e1e2245851689f675dab11574532c71bfe2e3e53aa10fa0158e28e2b"}? It is the same way we treat checksum dicts in other places in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great points!
If the name is exposing state notion, how about change the name to refresh_info?
The fact still stays the same: you are exposing things that should be handled by save_info
inside the remote to the outside.
It will return True then, why do you say so? And it is only called while the output checksum is not match with config. That is not even the usual use case.
Deciding whether or not state contains something valid and needs or doesn't need a refresh based on the config is pretty fragile and will only work while everyone is accessing it through changed_checksum
, which might not always be the case. It is the remote itself who should decide to refresh anything if it needs to, otherwise, we are breaking the non-formal API of the Remote.
If we leave checksum type notion inside of the Remote, how would Remote know what checksum type is output using for querying from state to compare?
By using return self.remote.changed(self.path_info, self.info)
in output.changed_checksum()
instead of comparing self.info with self.remote.save_info by-hand 🙂
Actually, I have tried this before used this most simple way, this way leads to a lot of code refactoring. slightly_smiling_face Why do people always like to ignore the simplest and straightforward way?
Because "the simplest and straightforward" doesn't always mean "the best" 🙂 We are playing a long game here, so we need the design to make sense not only now but also in the future.
I will try some refactoring to implement this.
👍 Thanks! Looking forward to that! 🙂
dvc/remote/local.py
Outdated
@@ -202,8 +216,11 @@ def isdir(self, path_info): | |||
def walk(self, path_info): | |||
return os.walk(path_info["path"]) | |||
|
|||
def get_hash_list(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So why not just set self.hash = [self.PARAM_CHECKSUM]
in the RemoteBase and use it instead of get_hash_list
and get_prefer_hash_type
?
dvc/output/__init__.py
Outdated
@@ -41,7 +41,8 @@ | |||
# so when a few types of outputs share the same name, we only need | |||
# specify it once. | |||
CHECKSUM_SCHEMA = { | |||
schema.Optional(RemoteLOCAL.PARAM_CHECKSUM): schema.Or(str, None), | |||
schema.Optional(utils.CHECKSUM_MD5): schema.Or(str, None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move hashing related stuff to a separate module? so, that it looks like hash.CHECKSUM_MD5
. utils
should contain really generic stuff, everything else should be moving into packages with some meaningful names
dvc/remote/local.py
Outdated
if hash_local: | ||
if isinstance(hash_local, str): | ||
hash_local = [h.strip() for h in hash_local.split(",")] | ||
self.hash = hash_local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we check for supported hashes here? It looks like a code duplicate from the config class that reads a list. Should be abstracted and reused most likely.
dvc/stage.py
Outdated
@@ -120,6 +120,7 @@ class Stage(object): | |||
STAGE_FILE_SUFFIX = ".dvc" | |||
|
|||
PARAM_MD5 = "md5" | |||
PARAM_SHA256 = "sha256" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how will we support additional checksums that were asked in the ticket? Adding them one by one like this is not scalable, it will be hard to read and extend it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vyloy good stuff! I left a few questions in the PR. To summarize:
- Could you clarify why do we need to support a list of checksums per remote?
- What will happen in the future if we have to checksum implementations that returns the same length keys? Should we append distinguish checksums by their name? (keep only md5 as a defautl without md5)?
In general it seems that it would be hard to support other checksums requested in the ticket (I'm even not sure why it's limited to sha256
only now). We need to abstract it in a way that it's easy to add a few implementations without touching 20 files manually.
What are your opinions @efiop @shcheklein about the new format for hash of dvc file?
if we use hash algorithm as key name, then we have to modify several files every time we add one more, e.g. schemes files. |
Great idea! But I'm not so sure about this format.
With
which doesn't look nice IMHO.
Not really. If we would to define SCHEMA for checksums in one place, it should be pretty easy and compact to extend. Also, as @shcheklein noted, you are currently passing |
Regarding
format: In most cases we will only have one checksum, so it would only introdue additional level of indentation
which also makes it harder to read or to merge if needed. |
Got it, thanks for your helps. Especially @efiop 's great advice. |
I think we shouldn't complicate .dvc file format even if that helps us with implementation. Code is volatile, API will stay. So I am in favor of simply adding keys for new hash funcs at the same level. This is another argument to augment @efiop points. |
The backward compatibility will be break if just modifying state table column Does some additional code to keep old state data working sound good? |
Sure, but if it is too much hustle, you could simply adjust table structure as needed and then increase state version in State, so that it will automatically drop old state db and will reformat it with the new layout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vyloy ! Please see a few more comments down below. 🙂
) | ||
|
||
def changed_cache(self): | ||
if not self.use_cache or not self.checksum: | ||
return True | ||
|
||
return self.cache.changed_cache(self.checksum) | ||
return self.cache.changed_cache(self.checksum, self.checksum_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, just pass checksum_info and let the remote deal with types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vyloy ?
dvc/remote/base.py
Outdated
@@ -153,6 +153,12 @@ def group(self, name): | |||
def cache(self): | |||
return getattr(self.repo.cache, self.scheme) | |||
|
|||
def get_checksum_type_list(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's define something like
DEFAULT_CHECKSUMS = ["md5"]
for each remote instead of PARAM_CHECKSUM and then use it if checksums
are not specified in the configs. Then make something like
@property
def checksums(self):
return self._checksums
@property
def checksum(self):
return self._checksums[0]
and use that, so you don't have to define your getters for each remote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to define DEFAULT_CHECKSUM_TYPES but some tests like TestCacheExists are setting PARAM_CHECKSUM after init
Which is against from what you said, don't know how it got accepted in the code base 🙂
Overwriting constants is not nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yeah, you'll have to adjust those to do things things properly.
It is acceptable for tests and unacceptable to overwrite constants in the core code itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is acceptable for tests and unacceptable to overwrite constants in the core code itself.
IMHO, it is a bad decision.
Unacceptable things should be keeping consistent for both the core code and tests. They are under the same project, right? Find excuse to write unacceptable code in the project is not nice. 🙂
Otherwise, means the tests are using the core code in the unacceptable ways.
@@ -180,7 +186,7 @@ def _collect_dir(self, path_info): | |||
dir_info.append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just make get_file_checksum
return a dict {"checksum": "mychecksum"}
and then use .update({self.PARAM_RELPATH: relpath})
. Otherwise, you are splitting the knowtion of what checksum type get_file_checksum
has returned and then assembling it all together, which is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep packing and unpacking checksum and checksum type as a dict is not efficient and is unnecessary.
Because generating new checksum will always use the preferred checksum type, which is always known here.
@property
def checksums(self):
return self._checksums
@property
def checksum(self):
return self._checksums[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because generating new checksum will always use the preferred checksum type, which is always known here.
It is not true though. Check out remote.changed_checksum
, which needs to get_checksum of a particular type. Also, in "Remote.save_info" you are packing it into dict anyway. I'm just suggesting to keep it all together and not dispursing checksum notion everywhere. Check out your save_info:
return {
checksum_type
or self.checksum_type(): self.get_checksum(
path_info, checksum_type
)
}
with changes that I suggest it would be
return self.get_checksum(path_info)
🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not true though. Check out remote.changed_checksum, which needs to get_checksum of a particular type. Also, in "Remote.save_info" you are packing it into dict anyway. I'm just suggesting to keep it all together and not dispursing checksum notion everywhere. Check out your save_info:
That is why I said generating.
When you are passing a particular type to get_checksum
, you knows what type of checksum it is returning, isn't it?
🙂 It is a legacy code, not my packing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do know if you specify that explicitly. But you are getting your checksum type from Remote while you are in the Output class, and then you are passing it back to Remote. Doesn't seem very optimal to me. You could return {"checksum": checksum}
and let your remote decide by itself which default checksum type it wants to use.
Legacy code didn't deal with multiple checksums, and it is the point of this task to set it up so that it is able to do that properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do know if you specify that explicitly. But you are getting your checksum type from Remote while you are in the Output class, and then you are passing it back to Remote. Doesn't seem very optimal to me. You could return {"checksum": checksum} and let your remote decide by itself which default checksum type it wants to use.
About But you are getting your checksum type from Remote while you are in the Output class, and then you are passing it back to Remote.
Not sure about where you are talking, but
Lines 143 to 148 in 00cfb04
@property | |
def checksum(self): | |
return self.info.get(self.remote.PARAM_CHECKSUM) | |
@property |
shows the code was doing that since the beginning 🙂
And legacy code did deal with multiple checksums, because remotes don't use same checksums (types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here https://github.com/iterative/dvc/pull/1920/files#diff-9487de78a43eff53118660a866090b89R144 you are getting checksum type from remote and using it later.
The legacy code was mostly built around the assumption of 1 checksum per remote, hence PARAM_CHECKSUM. The point of the task that you are doing is to generalize it and do it properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code was doing that since the beginning .🙂
Getting the checksum type from remote. self.info.get(self.remote.PARAM_CHECKSUM)
So, that does not change the fact that the legacy code still did deal with multiple checksums.
dvc/remote/base.py
Outdated
@@ -195,8 +201,8 @@ def get_dir_checksum(self, path_info): | |||
if self.cache.changed_cache_file(checksum): | |||
self.cache.move(from_info, to_info) | |||
|
|||
self.state.save(path_info, checksum) | |||
self.state.save(to_info, checksum) | |||
self.state.save(path_info, checksum, self.get_prefer_checksum_type()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does state really need to know checksum_type explicitly? Maybe we could just pass it a dict that looks something like {"md5": "mymd5"}
? Relevant to the comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.state.save(path_info, checksum, self.get_prefer_checksum_type())
compare to
self.state.save(path_info, {self.get_prefer_checksum_type(): checksum})
Which one do you like? Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep packing and unpacking checksum and checksum type as a dict is not efficient and is unnecessary.
And doing this means many refactoring with the legacy code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you are packing it into dict in state.save() anyway 🙂 And then you are manually updating particular keys. So why not just .update()
already existing record with this new one? 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Packing them into dict later is better, because it will save you an unpacking when you need to in the future. 🙂
Or in your design, you will pass multiple checksum types and checksums here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, yes. The point is that you are not spreading checksum type notion across the code. State.save() doesn't need to know anything about checksum types, it just needs to save it.
rename get_checksum_type_list to checksum_types
dvc/remote/base.py
Outdated
|
||
return checksum | ||
|
||
def save_info(self, path_info): | ||
def save_info(self, path_info, checksum_type=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think save_info needs checksum_type parameter any longer, since it is not used anywhere ;)
@vyloy Looks like this #1920 (comment) comment got lost somewhere in your feed. Please take a look. |
Oops, accidentally closed. |
@@ -20,6 +21,8 @@ class RemoteHDFS(RemoteBase): | |||
REGEX = r"^hdfs://((?P<user>.*)@)?.*$" | |||
PARAM_CHECKSUM = "checksum" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you have SUPPORTED_CHECKSUM_TYPES down below, what do you need PARAM_CHECKSUM for ?
Closing in favor of #2022 |
Closing as stale. It is still linked to the original issue, so it won't be lost. |
Based on branch 1654 from @efiop 's fork
Fixes #1676