Skip to content
This repository was archived by the owner on May 21, 2024. It is now read-only.

Conversation

@smalyala
Copy link
Contributor

@smalyala smalyala commented Nov 12, 2021

Switches the proto module the mixins in the experimental module use from google protobuf to betterproto-generated code. The main focus for review is in client.py. This is where the conversion between google protobuf and betterproto objects occurs, sandwiching the gRPC call (which happens in _Client._req()).

Additionally:

  • Adds a separate folder of tests for the experimental module in tests/
  • Updates linting to ignore new proto object attributes
  • Renames python_pachyderm.ExperimentalClient() to python_pachyderm.experimental.Client()
  • Replaces pdoc3 references with sphinx references (contributing.md, __init__.py)
  • Adds pfs.py, util.py, service.py to experimental module
  • Generates docs for experimental module

@smalyala smalyala force-pushed the betterproto branch 8 times, most recently from 7150380 to d993de8 Compare November 15, 2021 23:43
@smalyala smalyala changed the title [WIP] Betterproto Add betterproto to experimental module Nov 18, 2021
@smalyala smalyala marked this pull request as ready for review November 18, 2021 20:50
@smalyala
Copy link
Contributor Author

In tox tests, one of the spouts test is failing due to the bug @PFedak discovered. Since this PR doesn't depend on spouts, I'm okay with merging this after review.

[
n
for n in attrs(request_cls)
if n not in PROTO_OBJECT_BUILTINS and "FIELD_NUMBER" not in n
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@msteffen msteffen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes to self/initial thoughts:

  • Is there some way we could preview the generated docs? Since that's a big part of what we hope to get out of this change?
  • Did we ever figure out why betterproto-generated objects sometimes use ints for enums and other times use python enums?
  • I'm only now realizing that this will double the amount of garbage generated per request. That seems mostly fine, but could be a problem for PutFile/GetFile. We should have wrappers for those (ModifyFileClient et. al.) but is there some way this could become a problem? Also: confirm that the wrappers use the Google protobuf objects for efficiency.

@smalyala
Copy link
Contributor Author

smalyala commented Dec 1, 2021

  1. There's a new section for the Experimental Module- https://python-pachyderm.readthedocs.io/en/betterproto/index.html. The way Sphinx reads the new protos isn't perfect so I included a message to encourage users to click the "source code" button in the docs for better readability on how to use the betterproto dataclass objects.
  2. The deserialization step lacked a case for enums. There's an open PR that addresses this- Map enum int's into Enums redux danielgtaylor/python-betterproto#293.
  3. Could you please clarify this more?

@msteffen
Copy link
Contributor

msteffen commented Dec 8, 2021

Could you please clarify this more?

Basically, I just meant that the wire-level protocol for PutFile previously looked something like:

python-pach ---PutFileReq{"/foo.txt", "co.."    }---> pachd
python-pach ---PutFileReq{"",         "..nte.." }---> pachd
python-pach ---PutFileReq{"",         "..nt"    }---> pachd

But because, with this change, we typically create two objects for each RPC, it means that putting a large file might now look more like:

python-pach ---BetterProtoPutFileReq{"/foo.txt", "co.."    } ---> GooglePutFileReq{"/foo.txt", "co.."    }---> pachd
python-pach ---BetterProtoPutFileReq{"",         "..nte.." } ---> GooglePutFileReq{"",         "..nte.." }---> pachd
python-pach ---BetterProtoPutFileReq{"",         "..nt"    } ---> GooglePutFileReq{"",         "..nt"    }---> pachd

Basically, with PutFile in particular, one logical request might consist of a many smaller stream RPC messages. Because, with this change, we now create two protobuf objects for each request, a BetterProto object and a Google protobuf object, we might be generating twice as much memory garbage in the course of getting a file into pachyderm.

It's probably fine, but I wanted to check that e.g. ModifyFileClient, which just takes a path and a byte source, doesn't bother generating BetterProto objects behind its API and converting them to Google protobuf objects, and instead just sends Google protobuf objects directly, that's all.

@msteffen
Copy link
Contributor

msteffen commented Dec 8, 2021

In tox tests, one of the spouts test is failing due to the bug @PFedak discovered. Since this PR doesn't depend on spouts, I'm okay with merging this after review.

For posterity/my personal sanity, the issue is this one:
Original question in #eng

sahith Nov 16th at 1:42 PM
Reposting here. I see two meta commits that remain after DeleteAll which always happens after I create 2 spout pipelines

Further discussion in #pach-core (and here)

Peter Fedak Nov 16th at 4:13 PM
oh wow, just found a subtle bug with our postgres usage that I think merits some discussion tomorrow. most of our transaction functions are written to error if they reach some state which would be inconsistent (e.g. creating a branch on a repo that doesn't exist), but they don't clean up after themselves, assuming the next layer up will fail the transaction so that it's irrelevant
Peter Fedak 4:14 PM
but the caller often permits certain expected errors, with the understanding that, in those cases, no meaningful work has taken place that needs to be cleaned up
4:15
but nothing enforces this contract, so in particular stop pipeline on a spout ends up creating a commit and branch for a meta repo that doesn't exist

TODO: Link to core pach PR (if any)

Copy link
Contributor

@msteffen msteffen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry again that this took so long, and thanks for being so patient! Now that I've finally gotten through the review, I see how much tedious rewiring went into making this work. I'm really glad we got this experiment done; I still really think this is something we need to try. Really awesome work.

LGTM

- PFS: ``mount()``/``unmount()``- Mounting/unmounting Pachyderm repos locally

**Note:** The experimental module WILL NOT follow semver. Breaking changes can
be introduced in the next minor version of :mod:`python_pachyderm`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


bp_obj = bp_class().parse(pb_obj.SerializeToString())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

from .service import BP_MODULES


class ProtoIterator:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

path=path, datum=datum, raw=wrappers_pb2.BytesValue(value=chunk)
path=path,
datum=datum,
raw=chunk,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are some nice ergonomic improvements with BetterProto, honestly

return repo_name


def create_test_pipeline(client: python_pachyderm.Client, test_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know where this gets used? Is this just carried over from the non-experimental client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 references to create_test_pipeline() in tests/experimental/test_pps.py.

from python_pachyderm.proto.v2 import pfs
from python_pachyderm.service import pfs_proto, Service
from python_pachyderm.experimental.pfs import commit_from, Commit, uuid_re
from python_pachyderm.service import Service # , pfs_proto
Copy link
Contributor

@msteffen msteffen Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarifying my above comment and pinning this for posterity: we could, as a later optimization, import python_pachyderm.service.pfs_proto and then change the parts of ModifyFileClient's API that take a URL, byte iterator, or anything else that isn't a protobuf so that they use the non-experimental client's GRPC API. Thus instead of:

        >>> with client.modify_file_client(c) as mfc:
        >>>     mfc.put_file_bytes(
        ...         "/new_file.txt",
        ...         data)

becoming send(bp_to_pb(experimental.proto.PutFileRequest("/new_file.txt", data))), it could be send(proto.PutFileRequest("/new_file.txt", data)) and avoid the bp_to_pb round trip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add this change to make the ModifyFile rpc route not use any betterproto objects.

Copy link
Contributor

@msteffen msteffen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new ModifyFileClient LGTM! Nice job keeping the change small!

@smalyala smalyala merged commit f9fa1e2 into master Jan 7, 2022
@smalyala smalyala deleted the betterproto branch January 7, 2022 19:25
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants