V2 roadmap proposal #230 (new file: `v2-roadmap.md`)
# V2 implementation proposal

## Why this proposal

The current implementation uses a different `CloudEvent` class for each protocol
binding. This adds complexity to the integration of new technologies
like `pydantic`.

The separation of responsibilities in the package is not clear. We have multiple JSON
implementations, even though the format specification is the same for all bindings in the
[`CloudEvent` spec](https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/formats/json-format.md):

* the Pydantic implementation calls the HTTP implementation for this, adding unnecessary overhead
* the Kafka binding uses its own implementation, creating code duplication and unnecessary maintenance effort

We should have a single `CloudEvent` class used for all bindings/formats, since the event
is the Python representation of the event data, which is the same no matter what binding
is used (we could, for example, receive an HTTP event and forward it as-is over Kafka).

The specific bindings and formats implementations should be separate from the `CloudEvent`
class.

This proposal contains some base rules for the package architecture and how we could implement
the `CloudEvent` spec in this package.

## Requirements

To transform a Python `CloudEvent` to/from a specific
protocol binding, this SDK has the following responsibilities:

* Identifying which attributes have to be marshalled/unmarshalled (identifying the CloudEvents extensions)
* Validating required fields and generating defaults
* Marshalling/unmarshalling the CloudEvents core and extension fields
* Marshalling/unmarshalling the `data` field
* Encoding/decoding the marshalled fields to/from the specific binding format
> **Review comment (Member):** I'd arguably say this is the responsibility of the specific binding format extension rather than the core library.
>
> **Reply (Contributor Author):** Yes, this is a list of all the operations needed to transform a Python class into a specific event format implementation.

## Modules

We should separate the implementation of the requirements so that they can be tested independently and
each component can be used for composition.

Ideally we should have these modules (naming TBC):

* `cloudevents.event` - This would contain the `CloudEvent` base class. This class is the same for all bindings/formats.
* `cloudevents.formats` - This would contain the structured formats implementations (JSON, AVRO, Protobuf)
* `cloudevents.bindings` - This would contain the specific binding implementations (kafka, http)
> **Review comment (Member, on lines +45 to +46):** I like the idea of moving forward and scoping modules like they are scoped in the spec. Looks like a good go-to solution to me.

This matches the naming and concepts we have in the `CloudEvent` spec.
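For illustration, the package layout could look like the following (file names are assumptions, naming TBC):

```
cloudevents/
├── event.py          # CloudEvent base class, shared by all bindings/formats
├── formats/
│   ├── json.py       # JSON structured format
│   ├── avro.py       # AVRO structured format
│   └── protobuf.py   # Protobuf structured format
└── bindings/
    ├── http.py       # HTTP binding
    └── kafka.py      # Kafka binding
```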

## Matching requirements and modules

### The `cloudevents.event` module and `CloudEvent` class

The `CloudEvent` class would satisfy the following requirements:

* Validating required fields and generating default values
* Marshalling/unmarshalling the `data` field
* Identifying which attributes have to be marshalled/unmarshalled

The `data` field depends on the specific event and on the `datacontenttype` attribute, and
its marshalling/unmarshalling logic should be the same for any binding/format. Taking care
of this seems to be a responsibility of the `CloudEvent` class. (We should implement this
logic in overridable methods.)
> **Review comment (Member, on lines +60 to +63):** So from one perspective I do agree with the statement, but from the other, it sometimes is beneficial to have the marshalling/unmarshalling logic moved elsewhere. IMO we can achieve that by using mixin classes. Again, this is more of an implementation detail.

Ideally the end user will be able to extend the `CloudEvent` class for their needs, e.g.:
> **Review comment (Member):** Extend or create a separate one from scratch, and just the behavior should be the same. I guess we may benefit from having a `Protocol` with a behavior description.


* adding new extension fields
> **Review comment (Member):** It's the responsibility of the library's CloudEvent implementation to handle all the possible extension fields as far as I can tell. There may be special extension-specific mixins that provide additional behavior beneficial for this or that implementation, but overall we should be able to digest any extension fields with just the base implementation.
>
> **Reply (Contributor Author):** Yes, I used the word "extend" in a generic meaning. It could also be a mixin with some custom implementation. The idea is that it should be possible (and easy) to create a `class MyBaseEvent(MyMixin, CloudEvent)` at application level to extend the SDK functionalities.

* adding fields/methods not meant to be marshalled (internal fields)
* implementing new defaults (i.e. creating a `MyEvent` class with a `type="myevent"` default; see the sketch after this list)
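As a rough sketch of what such extensions could look like, assuming the proposed `cloudevents.event` module and hypothetical class names (`MyMixin`, `MyEvent`):

```python
from typing import Optional

from cloudevents.event import CloudEvent  # proposed v2 module, naming TBC


class MyMixin:
    """Hypothetical application-level mixin adding custom behavior."""

    def is_from_service(self, service: str) -> bool:
        # `source` is a required CloudEvents core attribute
        return self.source.startswith(service)


class MyEvent(MyMixin, CloudEvent):
    """Application base event with a custom default and an internal field."""

    # new default for a core attribute (exact declaration syntax TBD)
    type: str = "myevent"

    # internal field, not meant to be marshalled
    _received_at: Optional[float] = None
```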

We also want a DTO that can be accepted by the most common libraries (e.g. JSON libraries) to keep the
`CloudEvent` class decoupled from the formats/bindings. Using a dictionary seems like a good idea.
> **Review comment (Member, on lines +71 to +72):** So what we want is a canonical representation. That's what `Event` and `BaseEvent` are in the current implementation. I don't like the way it is right now. But I am also not sure if we want to always keep a canonical representation as a dict. Maybe it's better to have a dataclass instead, or expose some immutable data structure over dict.
>
> **Reply (Contributor Author):** I would expect the end user to never deal with this DTO. They should always rely on either the `CloudEvent` class or the specific format output, which could be a class (e.g. `KafkaEvent`). This DTO structure was thought to be used only internally, to move data between the base class, the formats, and the bindings. I thought about dictionaries because they are fast, and libraries we might want to use for formats and bindings will probably support them out of the box (the classic example would be the JSON library that maps JSON objects to dicts).
>
> **Member:** Maybe we'll then use the Protobuf definition as the base DTO. It already is immutable, well-defined in the spec, and has `to_json` and `to_text` implementations out of the box. The drawback here is that we're dragging Protobuf inside the core. I'm kinda just thinking out loud here as I don't have a strong opinion on this at the moment. Also, depending on whether the canonical representation is gonna be part of the `CloudEvent` instance or can be just created on the fly, we might need to use different approaches.
>
> **Contributor Author:** Ideally I think it would be more maintainable to keep the formats implementation separate from the base class, so that they can be implemented and tested independently. The more formats we implement, the bigger the class will become, and the created objects as a consequence. I am thinking about performance in systems where we want to handle a high number of events/sec, and allocating memory for larger objects could become impactful.
>
> **Contributor Author (Dec 4, 2023):** We could also keep the existing implementation interface, where we have `get_attributes` and `get_data`, and use them in bindings and formats. If we find at some point that having a dict in the middle of the pipeline might help, we can introduce it.


We should implement `to_dict`/`from_dict` functionality in the `CloudEvent` class; a sketch is given below.

QUESTION: Can we define a data type for the `data` field after it has been marshalled?
(i.e. can we say it will be either `str` or binary data?)
> **Review comment (Member, on lines +76 to +77):** I believe the type should be known in advance.

### The `cloudevents.formats` module

Formats solve a single responsibility:

* Marshalling/unmarshalling the CloudEvents core and extension fields

Their implementation has to be used when other bindings implement their
structured input/output, to make sure the implementations are compatible.

Each format accepts the dictionary from the `CloudEvent` as a parameter
and produces an output based on the format itself, i.e.:

```python
def to_json(event: dict) -> str: ...
def from_json(event: str) -> dict: ...
```

TODO: Review typing, as JSON produces a string but other formats might differ.
> **Review comment (Member, on lines +79 to +96):** This part is also a little bit tricky. I'd rather avoid doing a CloudEvent -> dict -> json conversion if we can do CloudEvent -> json without the dict part (also not sure if that's possible; most probably it'd be done somewhere under the hood anyway).
>
> The next part is the API. As we've faced in the current implementation, it should be structured in a way that it's super easy to swap the underlying implementation (e.g. replace native json with something else) or at least propagate some configuration options. I'm also not sure if we want to have that capability as part of the API, or if we should just provide an easy way to extend the API instead, e.g. have our own different versions of `to_json` where we use different libraries and have them as optional dependencies (or maybe just as a separate library like `cloudevents-ext`).
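A possible stdlib-based sketch for `cloudevents.formats.json` (the `data_base64` handling follows the JSON format spec; everything else here is an assumption):

```python
import base64
import json


def to_json(event: dict) -> str:
    """Marshal the internal event dict into a JSON-format string."""
    out = dict(event)
    # per the JSON format spec, binary `data` is carried in `data_base64`
    if isinstance(out.get("data"), (bytes, bytearray)):
        out["data_base64"] = base64.b64encode(out.pop("data")).decode("ascii")
    return json.dumps(out)


def from_json(event: str) -> dict:
    """Unmarshal a JSON-format string into the internal event dict."""
    out = json.loads(event)
    if "data_base64" in out:
        out["data"] = base64.b64decode(out.pop("data_base64"))
    return out
```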


### The `cloudevents.bindings` module

Bindings solve these responsibilities:

* Marshalling/unmarshalling the CloudEvents core and extension fields **(if binary format)**
* Encoding/decoding the marshalled fields to/from the specific binding format

> **Review comment (Member):** Bindings sometimes support structured formats as well.
>
> **Reply (Contributor Author, Dec 4, 2023):** The idea is for structured formats to use the formats implementation to marshall the specific fields. (The assumption here is that formats are the same for all the binding implementations.) I.e., if I need to marshall a CloudEvent in a JSON structured HTTP event, I expect the SDK to:
>
> * use the JSON formatter (which would marshall the single fields as specified by the format)
> * compose headers and data fields (but not marshall the single fields)
>
> **Member:** I believe that assumption is not true, taking into account the spec. I'm not quite sure I fully understand the example, but in an HTTP event, attributes are converted to headers and data is the HTTP request payload AFAIK. So I'm not sure how the JSON formatter is relevant here. Can you maybe expand your example a little bit?
>
> **Contributor Author (Dec 4, 2023):** In the HTTP structured mode the whole event (attributes and data) is marshalled into the body. Headers MAY include the attributes, but attributes MUST be in the body. For JSON conversion I'd expect something like this:
>
> ```python
> def to_http_json(cloudevent: CloudEvent):
>     dict_event = cloudevent.to_dict()
>     body = formats.to_json(dict_event)
>     headers = []  # compose headers according to spec, eventually grabbing data from dict_event
>     return headers, body
> ```

They also have the responsibility of coordinating the other modules.

This is a **concept** example for the `cloudevents.bindings.http` module:

```python
from dataclasses import dataclass
from typing import Any

# aliased imports avoid shadowing by this module's own to_json/from_json
from cloudevents.formats.json import to_json as format_to_json, from_json as format_from_json
from cloudevents.event import CloudEvent


# Shared dataclass to help with typing, might be a dictionary or a named tuple (not important)
@dataclass
class HTTPCloudEvent:
    headers: list
    body: Any  # unsure about typing here


def to_binary(event: CloudEvent) -> HTTPCloudEvent:
    e = event.to_dict()
    # logic to marshall into http binary output
    return HTTPCloudEvent(headers=[], body=[])


def to_json(event: CloudEvent) -> HTTPCloudEvent:
    e = format_to_json(event.to_dict())
    # logic to compose the JSON response
    return HTTPCloudEvent(headers=[], body=[])


def from_binary(headers: list, body: dict) -> CloudEvent:
    e = {}
    # specific logic to unmarshall raw headers into a dictionary
    return CloudEvent.from_dict(e)


def from_json(body: str) -> CloudEvent:
    e = format_from_json(body)
    return CloudEvent.from_dict(e)


def from_http(headers: list, body: dict) -> CloudEvent:
    # identify if structured or binary based on headers / body
    # and use the proper `from_` function
    # We should also handle batches here
    ...
```

Other bindings would behave similarly; the end user would only need
to use the relevant module in the `cloudevents.bindings` namespace.
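For example, end-user code receiving an event over HTTP and forwarding it over Kafka might look like this (a sketch assuming the module layout above; the wire data and the `kafka.to_binary` return shape are assumptions):

```python
from cloudevents.bindings import http, kafka

# placeholder wire data standing in for an incoming HTTP request
request_headers = [("content-type", "application/cloudevents+json")]
request_body = '{"specversion": "1.0", "id": "1", "source": "/svc", "type": "myevent"}'

# unmarshal the incoming request (structured vs binary mode detected internally)
event = http.from_http(headers=request_headers, body=request_body)

# forward the same event as-is over Kafka
kafka_message = kafka.to_binary(event)
```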

## CloudEvent subclasses

We'll want SDK users to be able to extend the `CloudEvent` class, so they will
be able to manage marshalling/unmarshalling of the `data` field for different
event types and their Python typing.

This could easily be done by introducing a mapping between the `type` field and
`CloudEvent` subclasses as a parameter in the `from_` functions, i.e.:

```python
from typing import Optional


def from_binary(
    headers: list, body: dict, event_types: Optional[dict[str, type]] = None
) -> CloudEvent:
    event_as_dict = {}
    # specific logic to unmarshall raw headers into a dictionary
    if event_types:
        # use the subclass mapped to the event `type`, falling back to CloudEvent
        return event_types.get(event_as_dict["type"], CloudEvent).from_dict(event_as_dict)
    else:
        return CloudEvent.from_dict(event_as_dict)
```
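For illustration, usage with a hypothetical `OrderCreated` subclass could look like:

```python
class OrderCreated(CloudEvent):
    """Hypothetical subclass with typed handling of its `data` payload."""


# `raw_headers` / `raw_body` stand in for the wire-level input
event = from_binary(
    headers=raw_headers,
    body=raw_body,
    event_types={"com.example.order.created": OrderCreated},
)
# `event` is an OrderCreated instance when the `type` attribute matches,
# otherwise a plain CloudEvent
```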

## Pydantic support

Judging by its conversion module and tests, the current Pydantic implementation
was written as a possible substitute for the HTTP event class, but it is used
by calling the functions in the conversion module.

Pydantic offers good support and performance for data validation, defaults, and
JSON serialization (some of the requirements we have identified).

We need to make a decision:

* We use pydantic as a base for the `CloudEvent` class:
  * PROs
    * No need to implement data validation
    * Performance (pydantic V2 is written in Rust)
    * We can lay down some base logic on `data` field serialization (i.e. nested pydantic models)
    * Enables users to use pydantic for automatic documentation functionalities
  * CONs
    * We introduce a new package dependency, and depend on its implementation
    * _minor_ If we use pydantic JSON serialization, it is implemented in the model class (slightly different from what is proposed, but we can use wrapper functions in the JSON format module)
* We remove pydantic support:
  * PROs
    * We have more control over validation and defaults implementation
    * We don't depend on other packages
  * CONs
    * Performance on validation and defaults (we can still use libraries like `orjson` for JSON performance)
    * Additional effort in writing and maintaining data validation and required/default fields

...
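For reference, a pydantic-based `CloudEvent` could be sketched roughly like this (a minimal sketch assuming pydantic v2; the field set and default factories are illustrative, not the proposed API):

```python
from datetime import datetime, timezone
from typing import Any, Optional
from uuid import uuid4

from pydantic import BaseModel, Field


class CloudEvent(BaseModel):
    # required core attributes, with generated defaults where the spec allows them
    id: str = Field(default_factory=lambda: str(uuid4()))
    source: str
    type: str
    specversion: str = "1.0"

    # optional core attributes
    time: Optional[datetime] = Field(default_factory=lambda: datetime.now(timezone.utc))
    datacontenttype: Optional[str] = None
    data: Any = None

    # validation, defaults, and model_dump() / model_dump_json() come for free
```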
> **Review comment (Member, on lines +174 to +201):** Even though I love Pydantic, I don't think we should have it as a core dependency, and I'd rather have a Pydantic-specific extension/implementation of CloudEvents. I do agree that it kinda solves a lot of questions out of the box, but it still is a dependency we can't drag in, especially while there are multiple projects that are still using Pydantic v1 while v2 is kinda the mainstream already.
>
> **Reply (Contributor Author):** I don't have a strong opinion here. I just thought it might be easier to maintain a single approach for more consistency.
>
> **Member:** It definitely might be, but as I mentioned, some libraries currently depend on CloudEvents and they may not be able to move forward if CloudEvents brings Pydantic as a required dependency.