Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Alterative hashings for data (#1676) #1920

Closed
wants to merge 21 commits into from
Closed

Conversation

vyloy
Copy link
Contributor

@vyloy vyloy commented Apr 24, 2019

Based on branch 1654 from @efiop 's fork

Fixes #1676

@efiop
Copy link
Contributor

efiop commented Apr 24, 2019

@vyloy Please rebase.

@efiop
Copy link
Contributor

efiop commented Apr 25, 2019

@vyloy Your tests are failing.

dvc/stage.py Outdated Show resolved Hide resolved
filtered = dict_filter(d, exclude)
byts = json.dumps(filtered, sort_keys=True).encode("utf-8")
return bytes_md5(byts)
hasher = CHECKSUM_MAP[checksum_type]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not return bytes_checksum(byts, checksum_type) then?

dvc/stage.py Outdated
self.locked = locked
self.tag = tag
self._state = state or {}

if repo and repo.config and repo.config.config:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code matches the one in remote local. How about we just use repo.cache.local.hash instead, since it is pretty much the same thing and local cache is guaranteed to exist?

@@ -202,8 +216,11 @@ def isdir(self, path_info):
def walk(self, path_info):
return os.walk(path_info["path"])

def get_hash_list(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the point of this wrapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to reuse the hash list inside output.base

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why not just set self.hash = [self.PARAM_CHECKSUM] in the RemoteBase and use it instead of get_hash_list and get_prefer_hash_type?

Copy link
Contributor Author

@vyloy vyloy Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my practice (in other languages), functions or methods are more flexible/optimizable/better than class fields in many cases.

  1. The first thought was to prevent hash get changed from outside.

  2. Like get_prefer_hash_type will be more clear and easy to refactor than self.hash[0] in the code(11 callers) in the future, and the callers don't have to figure out the differences between different remote implements, on the other hand, other remote implements are free to use other different strategies.

  3. set self.hash = [self.PARAM_CHECKSUM] in the RemoteBase will need long-term memory in heap for gc process to check, instead of short-term or stack-alloc(if compiler is smart enough) here for other implements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points! getters/setters are nice when you have some complex logic, but in our case properties are more than enough.

Thinking about it a little more, it actually feels like much of that checksum logic should be contained in the dvc/remote/local. That way you won't even have to access those hash types from the outside. For example, Stage.changed_checksum() could then look like:

return self.checksum != self._compute_checksum()

and Stage.compute_checksum() would

...
return self.cache.local.dict_checksum(d, exclude=...)

that way almost all of the internals are hidden where they matter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually pretty similar to how we handle RemoteLOCAL.cache_types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactoring so much lines for showing properties are more than enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyloy Sorry, what do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring this two places to avoid use property self.hash = [self.PARAM_CHECKSUM] outside remote is fine, but how about other places, they need to be consistent, right?

for example stage.py is_cached:

        for out in outs:
            out.pop(self.repo.cache.local.get_prefer_hash_type(), None)
            out.pop(RemoteS3.PARAM_CHECKSUM, None)

self.hash = [utils.CHECKSUM_MD5]
else:
self.hash = [utils.CHECKSUM_MD5]
RemoteLOCAL.PARAM_CHECKSUM = self.hash[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overwriting constants is not nice.

if isinstance(core_hash, str):
core_hash = [h.strip() for h in core_hash.split(",")]
self.hash = core_hash
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just set this once before if instead of having two elses

@vyloy vyloy changed the title [WIP] config: Alterative hashings for data (#1676) config: Alterative hashings for data (#1676) Apr 26, 2019
@@ -154,10 +158,14 @@ def exists(self):
return self.remote.exists(self.path_info)

def changed_checksum(self):
cs = self.checksum
if cs is None:
self.remote.update_state_checksum(self.path_info)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? state is updated in get_file_checksum and get_dir_checksum already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes a bug while we use unit test and terminal in the same time.
if we initial the environment by test(e.g. add foo), and then change the config to sha256 in terminal, then it shows that it will uses the hash from state in the dvc file directly without rehash(e.g. in the dvc file, out: sha256: xxxxxx(this is a md5)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a proper way to solve this. We should have a way to distinguish between different checksums in the state file itself. Seems like introducing a column for each checksum type would be a good way to go.

Copy link
Contributor Author

@vyloy vyloy Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a quick hack.
if cs is None, means config hash.local has been changed and no record fits, same to state record, so this update to state is needed anyway.

for example, changing config hash.local from md5 to sha256 will go into this condition, while from md5 to sha256, md5 don't.

Introducing a column for each checksum type, many places need to be changed, and will lead to worse performance because additional checks that do not necessary for usual use cases.

Copy link
Contributor Author

@vyloy vyloy Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example in the unit test:

    def test_incompatibility(self):
        ret = main(["add", self.FOO])
        self.assertEqual(0, ret)

        ret = main(["config", "hash.local", "sha256"])
        self.assertEqual(0, ret)

        ret = main(["status"])
        self.assertEqual(0, ret)

        ret = main(["status", "-q"])
        self.assertEqual(1, ret)

        ret = main(["add", self.FOO]) // update_state_checksum is called here. 
        self.assertEqual(0, ret)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't get your point...
Why self.remote.save_info can expose while self.remote.update_state_checksum can not?

For example?

The fact that this particular output doesn't have some checksum doesn't mean that it is not used anywhere else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why self.remote.save_info can expose while self.remote.update_state_checksum can not?

Because save_info is not exposing state notion, while update_state_checksum does. Your update_state_checksum is almost a full copy of get_checksum and it is called right before it, which is just not right. What we could do instead is leave checksum type notion inside of the Remote, so it is the one dealing with it without you having to call dirty hacks from Output. For example, you could get rid of update_state_checksum if you simply make Remote.get_checksum do smth like checksum = self.state.get(path_info, checksum_type=self.checksum), so if it say doesn't have sha256, then it would return None. This would help us avoid a lot of code duplication.

For example?

For example, you have two branches: one is using sha256 and the other old one is still using md5s. We could also decide to allow per-output/dependency checksum type configuration in the future, so the same file could be used with different checksums. So it makes sense to make this generic enough.

Copy link
Contributor Author

@vyloy vyloy May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the name is exposing state notion, how about change the name to refresh_info?

It will return True then, why do you say so? And it is only called while the output checksum is not match with config. That is not even the usual use case.

it is called right before it, which is just not right.

If we leave checksum type notion inside of the Remote, how would Remote know what checksum type is output using for querying from state to compare?

Actually, I have tried this before used this most simple way, this way leads to a lot of code refactoring. 🙂 Why do people always like to ignore the simplest and straightforward way?

For example, you could get rid of update_state_checksum if you simply make Remote.get_checksum do smth like checksum = self.state.get(path_info, checksum_type=self.checksum), so if it say doesn't have sha256, then it would return None.

IMHO, There will be a lot of advantages with this little code duplication here, e.g. better performance, less cognitive complexity etc. 🙂

That make sense from the example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try some refactoring to implement this.

That is a great point! Indeed, creating columns for each additional checksum type is a bad solution. On the other hand, we still might want to keep a few checksum for each file. So currently we store md5 in md5 column, how about we replace it with checksum column instead, that would store a dict(e.g. serilized as json) instead that looks smth like {"md5": "79054025255fb1a26e4bc422aef54eb4", "sha256": "b8423365e1e2245851689f675dab11574532c71bfe2e3e53aa10fa0158e28e2b"}? It is the same way we treat checksum dicts in other places in the code.

Copy link
Contributor

@efiop efiop May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyloy

Great points!

If the name is exposing state notion, how about change the name to refresh_info?

The fact still stays the same: you are exposing things that should be handled by save_info inside the remote to the outside.

It will return True then, why do you say so? And it is only called while the output checksum is not match with config. That is not even the usual use case.

Deciding whether or not state contains something valid and needs or doesn't need a refresh based on the config is pretty fragile and will only work while everyone is accessing it through changed_checksum, which might not always be the case. It is the remote itself who should decide to refresh anything if it needs to, otherwise, we are breaking the non-formal API of the Remote.

If we leave checksum type notion inside of the Remote, how would Remote know what checksum type is output using for querying from state to compare?

By using return self.remote.changed(self.path_info, self.info) in output.changed_checksum() instead of comparing self.info with self.remote.save_info by-hand 🙂

Actually, I have tried this before used this most simple way, this way leads to a lot of code refactoring. slightly_smiling_face Why do people always like to ignore the simplest and straightforward way?

Because "the simplest and straightforward" doesn't always mean "the best" 🙂 We are playing a long game here, so we need the design to make sense not only now but also in the future.

I will try some refactoring to implement this.

👍 Thanks! Looking forward to that! 🙂

@@ -202,8 +216,11 @@ def isdir(self, path_info):
def walk(self, path_info):
return os.walk(path_info["path"])

def get_hash_list(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why not just set self.hash = [self.PARAM_CHECKSUM] in the RemoteBase and use it instead of get_hash_list and get_prefer_hash_type?

dvc/config.py Outdated Show resolved Hide resolved
@@ -41,7 +41,8 @@
# so when a few types of outputs share the same name, we only need
# specify it once.
CHECKSUM_SCHEMA = {
schema.Optional(RemoteLOCAL.PARAM_CHECKSUM): schema.Or(str, None),
schema.Optional(utils.CHECKSUM_MD5): schema.Or(str, None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move hashing related stuff to a separate module? so, that it looks like hash.CHECKSUM_MD5. utils should contain really generic stuff, everything else should be moving into packages with some meaningful names

dvc/remote/local.py Outdated Show resolved Hide resolved
if hash_local:
if isinstance(hash_local, str):
hash_local = [h.strip() for h in hash_local.split(",")]
self.hash = hash_local
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we check for supported hashes here? It looks like a code duplicate from the config class that reads a list. Should be abstracted and reused most likely.

dvc/stage.py Outdated Show resolved Hide resolved
dvc/stage.py Outdated
@@ -120,6 +120,7 @@ class Stage(object):
STAGE_FILE_SUFFIX = ".dvc"

PARAM_MD5 = "md5"
PARAM_SHA256 = "sha256"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will we support additional checksums that were asked in the ticket? Adding them one by one like this is not scalable, it will be hard to read and extend it.

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyloy good stuff! I left a few questions in the PR. To summarize:

  1. Could you clarify why do we need to support a list of checksums per remote?
  2. What will happen in the future if we have to checksum implementations that returns the same length keys? Should we append distinguish checksums by their name? (keep only md5 as a defautl without md5)?

In general it seems that it would be hard to support other checksums requested in the ticket (I'm even not sure why it's limited to sha256 only now). We need to abstract it in a way that it's easy to add a few implementations without touching 20 files manually.

@vyloy
Copy link
Contributor Author

vyloy commented Apr 29, 2019

What are your opinions @efiop @shcheklein about the new format for hash of dvc file?

hash: 25ce158390f1477e1e418f54e445cb7f
hash-type: md5
cmd: python code.py foo file1
wdir: .
deps:
- hash: acbd18db4cc2f85cedef654fccc4a4d8
  hash-type: md5
  path: foo
- hash: d05447644b89960913c7eee5fd776adb
  hash-type: md5
  path: code.py
outs:
- hash: acbd18db4cc2f85cedef654fccc4a4d8
  hash-type: md5
  path: file1
  cache: true
  metric: false
  persist: false

if we use hash algorithm as key name, then we have to modify several files every time we add one more, e.g. schemes files.

@efiop
Copy link
Contributor

efiop commented Apr 29, 2019

@vyloy

- hash: acbd18db4cc2f85cedef654fccc4a4d8
  hash-type: md5

Great idea! But I'm not so sure about this format.

  1. It is not backward compatible;
  2. It consists of 2 lines, so merging these files would be even more of a nightmare;
  3. I'm not sure if we would ever need it, but the current explicit format allows for having multiple checksums like so:
    md5: acbd18db4cc2f85cedef654fccc4a4d8
    sha256: cf47b5de0e349888e655572b8f6e01969661d029ec69972ef51fd0fb88f46c6a

With hash-type format, it would have to potentially look something like:

- hashes:
   - hash: acbd18db4cc2f85cedef654fccc4a4d8
     hash-type: md5
   - hash: cf47b5de0e349888e655572b8f6e01969661d029ec69972ef51fd0fb88f46c6a
     hash-type: sha256

which doesn't look nice IMHO.

if we use hash algorithm as key name, then we have to modify several files every time we add one more, e.g. schemes files.

Not really. If we would to define SCHEMA for checksums in one place, it should be pretty easy and compact to extend. Also, as @shcheklein noted, you are currently passing md5=..., sha256=... to Stage.init, which is not scalable. Maybe we could pass something like a dict hash={"md5":..., "sha256":...} instead? Or maybe there is a better way to solve this.

@efiop
Copy link
Contributor

efiop commented Apr 29, 2019

Regarding

hash:
    md5: acbd18db4cc2f85cedef654fccc4a4d8
    sha256: cf47b5de0e349888e655572b8f6e01969661d029ec69972ef51fd0fb88f46c6a

format: In most cases we will only have one checksum, so it would only introdue additional level of indentation

deps:
- hash:
    md5: acbd18db4cc2f85cedef654fccc4a4d8
  path: foo

which also makes it harder to read or to merge if needed.

@shcheklein
Copy link
Member

@efiop @vyloy 👍on Ruslan's points and please forward on, let's keep all discussions open - either here or #dev-general on Discord.

@vyloy
Copy link
Contributor Author

vyloy commented Apr 29, 2019

Got it, thanks for your helps. Especially @efiop 's great advice.

@Suor
Copy link
Contributor

Suor commented Apr 30, 2019

I think we shouldn't complicate .dvc file format even if that helps us with implementation. Code is volatile, API will stay. So I am in favor of simply adding keys for new hash funcs at the same level.

This is another argument to augment @efiop points.

dvc/stage.py Outdated Show resolved Hide resolved
dvc/utils/hash/__init__.py Outdated Show resolved Hide resolved
dvc/stage.py Outdated Show resolved Hide resolved
dvc/config.py Outdated Show resolved Hide resolved
@vyloy
Copy link
Contributor Author

vyloy commented May 2, 2019

The backward compatibility will be break if just modifying state table column md5 to checksum and using json string for multiple checksum types.

Does some additional code to keep old state data working sound good?

@efiop
Copy link
Contributor

efiop commented May 2, 2019

Does some additional code to keep old state data working sound good?

Sure, but if it is too much hustle, you could simply adjust table structure as needed and then increase state version in State, so that it will automatically drop old state db and will reformat it with the new layout.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vyloy ! Please see a few more comments down below. 🙂

dvc/output/base.py Outdated Show resolved Hide resolved
)

def changed_cache(self):
if not self.use_cache or not self.checksum:
return True

return self.cache.changed_cache(self.checksum)
return self.cache.changed_cache(self.checksum, self.checksum_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, just pass checksum_info and let the remote deal with types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyloy ?

@@ -153,6 +153,12 @@ def group(self, name):
def cache(self):
return getattr(self.repo.cache, self.scheme)

def get_checksum_type_list(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's define something like
DEFAULT_CHECKSUMS = ["md5"]
for each remote instead of PARAM_CHECKSUM and then use it if checksums are not specified in the configs. Then make something like

@property
def checksums(self):
    return self._checksums

@property
def checksum(self):
    return self._checksums[0]

and use that, so you don't have to define your getters for each remote.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to define DEFAULT_CHECKSUM_TYPES but some tests like TestCacheExists are setting PARAM_CHECKSUM after init

Which is against from what you said, don't know how it got accepted in the code base 🙂

Overwriting constants is not nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yeah, you'll have to adjust those to do things things properly.

It is acceptable for tests and unacceptable to overwrite constants in the core code itself.

Copy link
Contributor Author

@vyloy vyloy May 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is acceptable for tests and unacceptable to overwrite constants in the core code itself.

IMHO, it is a bad decision.
Unacceptable things should be keeping consistent for both the core code and tests. They are under the same project, right? Find excuse to write unacceptable code in the project is not nice. 🙂
Otherwise, means the tests are using the core code in the unacceptable ways.

@@ -180,7 +186,7 @@ def _collect_dir(self, path_info):
dir_info.append(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make get_file_checksum return a dict {"checksum": "mychecksum"} and then use .update({self.PARAM_RELPATH: relpath}). Otherwise, you are splitting the knowtion of what checksum type get_file_checksum has returned and then assembling it all together, which is confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep packing and unpacking checksum and checksum type as a dict is not efficient and is unnecessary.
Because generating new checksum will always use the preferred checksum type, which is always known here.

@property
def checksums(self):
    return self._checksums

@property
def checksum(self):
    return self._checksums[0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because generating new checksum will always use the preferred checksum type, which is always known here.

It is not true though. Check out remote.changed_checksum, which needs to get_checksum of a particular type. Also, in "Remote.save_info" you are packing it into dict anyway. I'm just suggesting to keep it all together and not dispursing checksum notion everywhere. Check out your save_info:

        return {
            checksum_type
            or self.checksum_type(): self.get_checksum(
                path_info, checksum_type
            )
        }

with changes that I suggest it would be

return self.get_checksum(path_info)

🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not true though. Check out remote.changed_checksum, which needs to get_checksum of a particular type. Also, in "Remote.save_info" you are packing it into dict anyway. I'm just suggesting to keep it all together and not dispursing checksum notion everywhere. Check out your save_info:

That is why I said generating.
When you are passing a particular type to get_checksum, you knows what type of checksum it is returning, isn't it?
🙂 It is a legacy code, not my packing.

Copy link
Contributor

@efiop efiop May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do know if you specify that explicitly. But you are getting your checksum type from Remote while you are in the Output class, and then you are passing it back to Remote. Doesn't seem very optimal to me. You could return {"checksum": checksum} and let your remote decide by itself which default checksum type it wants to use.

Legacy code didn't deal with multiple checksums, and it is the point of this task to set it up so that it is able to do that properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do know if you specify that explicitly. But you are getting your checksum type from Remote while you are in the Output class, and then you are passing it back to Remote. Doesn't seem very optimal to me. You could return {"checksum": checksum} and let your remote decide by itself which default checksum type it wants to use.

About But you are getting your checksum type from Remote while you are in the Output class, and then you are passing it back to Remote.
Not sure about where you are talking, but

dvc/dvc/output/base.py

Lines 143 to 148 in 00cfb04

@property
def checksum(self):
return self.info.get(self.remote.PARAM_CHECKSUM)
@property

shows the code was doing that since the beginning 🙂
And legacy code did deal with multiple checksums, because remotes don't use same checksums (types)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here https://github.com/iterative/dvc/pull/1920/files#diff-9487de78a43eff53118660a866090b89R144 you are getting checksum type from remote and using it later.

The legacy code was mostly built around the assumption of 1 checksum per remote, hence PARAM_CHECKSUM. The point of the task that you are doing is to generalize it and do it properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code was doing that since the beginning .🙂
Getting the checksum type from remote. self.info.get(self.remote.PARAM_CHECKSUM)
So, that does not change the fact that the legacy code still did deal with multiple checksums.

@@ -195,8 +201,8 @@ def get_dir_checksum(self, path_info):
if self.cache.changed_cache_file(checksum):
self.cache.move(from_info, to_info)

self.state.save(path_info, checksum)
self.state.save(to_info, checksum)
self.state.save(path_info, checksum, self.get_prefer_checksum_type())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does state really need to know checksum_type explicitly? Maybe we could just pass it a dict that looks something like {"md5": "mymd5"}? Relevant to the comment above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.state.save(path_info, checksum, self.get_prefer_checksum_type())
compare to
self.state.save(path_info, {self.get_prefer_checksum_type(): checksum})

Which one do you like? Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep packing and unpacking checksum and checksum type as a dict is not efficient and is unnecessary.
And doing this means many refactoring with the legacy code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you are packing it into dict in state.save() anyway 🙂 And then you are manually updating particular keys. So why not just .update() already existing record with this new one? 🙂

Copy link
Contributor Author

@vyloy vyloy May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packing them into dict later is better, because it will save you an unpacking when you need to in the future. 🙂
Or in your design, you will pass multiple checksum types and checksums here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, yes. The point is that you are not spreading checksum type notion across the code. State.save() doesn't need to know anything about checksum types, it just needs to save it.

dvc/remote/base.py Outdated Show resolved Hide resolved
vyloy added 2 commits May 7, 2019 14:13

return checksum

def save_info(self, path_info):
def save_info(self, path_info, checksum_type=None):
Copy link
Contributor

@efiop efiop May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think save_info needs checksum_type parameter any longer, since it is not used anywhere ;)

@efiop
Copy link
Contributor

efiop commented May 9, 2019

@vyloy Looks like this #1920 (comment) comment got lost somewhere in your feed. Please take a look.

@efiop efiop closed this May 9, 2019
@efiop
Copy link
Contributor

efiop commented May 9, 2019

Oops, accidentally closed.

@efiop efiop reopened this May 9, 2019
@@ -20,6 +21,8 @@ class RemoteHDFS(RemoteBase):
REGEX = r"^hdfs://((?P<user>.*)@)?.*$"
PARAM_CHECKSUM = "checksum"
Copy link
Contributor

@efiop efiop May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you have SUPPORTED_CHECKSUM_TYPES down below, what do you need PARAM_CHECKSUM for ?

@shcheklein
Copy link
Member

Closing in favor of #2022

@shcheklein shcheklein closed this May 19, 2019
@shcheklein shcheklein reopened this May 28, 2019
@shcheklein shcheklein changed the title config: Alterative hashings for data (#1676) [WIP] Alterative hashings for data (#1676) May 28, 2019
@efiop
Copy link
Contributor

efiop commented Oct 16, 2019

Closing as stale. It is still linked to the original issue, so it won't be lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alterative hashings for data
4 participants