
Add a Chronon service module to add support for serving features #873

Open · wants to merge 19 commits into main
Conversation

@piyush-zlai (Contributor) commented Nov 7, 2024

Summary

This PR adds a module to the project that spins up a service wrapping the Fetcher code so we can serve features. The module leverages the Vert.x project. Vert.x is a high-performance service toolkit for building HTTP / gRPC services; as the serving layer can be on the hot path for some use cases, the high-perf nature of Vert.x should hopefully be a good starting point. Vert.x core is Java based, and given that we have 3 Scala versions to support in the Chronon project, I've chosen to write the service in Java rather than deal with the interop and multi-version compatibility story for now.

The service adds a couple of HTTP endpoints that allow us to serve GroupBys and Joins (each returning a JSON payload). We can extend this in the future with an equivalent set of gRPC endpoints if needed. The endpoints look like:

/ping -> Health check
/config -> Returns the svc config
/v1/features/join/<joinName> -> Bulk get request for a given join
/v1/features/groupby/<groupByName> -> Bulk get request for a given group by
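For illustration, the endpoint wiring in Vert.x looks roughly like the sketch below (assuming Vert.x 4.x). The class name, handler bodies, port, and the choice of POST for the bulk-get route are assumptions for the sketch, not necessarily what this PR does:

```java
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.web.Router;
import io.vertx.ext.web.handler.BodyHandler;

// Hypothetical sketch of the route setup, not the PR's actual code.
public class FeatureServiceSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    Router router = Router.router(vertx);
    router.route().handler(BodyHandler.create()); // make request bodies readable

    // Health check
    router.get("/ping").handler(ctx -> ctx.response().end("pong"));

    // Bulk get for a join; the body would carry the list of key maps
    router.post("/v1/features/join/:joinName").handler(ctx -> {
      String joinName = ctx.pathParam("joinName");
      // ... hand the parsed body off to the Fetcher, then respond with JSON:
      ctx.json(new JsonObject().put("results", new JsonArray()));
    });

    vertx.createHttpServer().requestHandler(router).listen(8080);
  }
}
```

The `:joinName` path parameter maps to the `<joinName>` placeholder in the endpoint list above.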

Why / Goal

Currently, users getting started on the project have to take on the additional work of writing a feature-serving layer of their own (or rely on a set of disparate clients calling the Fetcher libs directly). This makes setup painful, and the burden is exacerbated for users who don't natively work in JVM languages. Hooking up the fetcher code in other languages is non-trivial, as it requires rewriting a lot of the core fetcher logic and maintaining parity as things change over time. Including a service module within the project aims to lower that barrier to entry: users can package the service and have their non-JVM services hit it to get feature data over the wire.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

(Readme in the service module directory has some details / instructions on how you can spin up the project and test against the quickstart mongo api)

Checklist

  • Documentation update

Reviewers

@caiocamatta-stripe @mickjermsurawong-openai

@piyush-zlai piyush-zlai changed the title Add a Chronon service module to add support for serving features [wip] Add a Chronon service module to add support for serving features Nov 7, 2024
@piyush-zlai piyush-zlai changed the title [wip] Add a Chronon service module to add support for serving features Add a Chronon service module to add support for serving features Nov 12, 2024
@caiocamatta-stripe (Collaborator)

Hey @piyush-zlai - could you comment on why Vert.x vs alternatives?

@piyush-zlai (Contributor, Author)

@caiocamatta-stripe - called out in the PR description but I'll expand more:

  • Vert.x is a fairly high-performance web service framework. A few benchmarks have found it to be better than alternatives such as Play - Link, Link.
  • Play is Scala native, but the challenge is that Chronon uses a bunch of older Scala versions - 2.10, 2.11 and 2.12 - so we'd need to deal with this licensing issue - Licensing. Support for older Scala versions requires us to use the Akka fork of Play (Play 2.9.x). If Chronon were on Scala 2.13 we could go with Play 3.x (which is on Pekko).
  • Vert.x being a Java-based framework also makes life easier, as we don't need to deal with building / cross-compiling across 3 Scala versions.
  • Vert.x has a fairly rich set of docs and pretty good support for tracing, testing, gRPC, etc., so the rails from a web service framework are fairly extensible.

@caiocamatta-stripe (Collaborator) left a comment
@piyush-zlai thanks for the additional commentary on why Vertx.

Overall this looks good to me! This being a fairly large (albeit safe) change that adds new dependencies, you may want to get reviews from at least a couple of other people.

One final thing that still wasn't clear to me - how does the service get the schemas for GroupBys it's serving? Or is that purely the Fetcher's responsibility?

Comment on lines +408 to +409
"ch.qos.logback" % "logback-classic" % "1.2.11",
"org.slf4j" % "slf4j-api" % "1.7.36",
Collaborator:

For my own learning/curiosity, any comment on why logback instead of log4j2?

Contributor (Author):

I think I might have hit some dependency issues on log4j2. Could give that a shot again if you'd prefer log4j2. It seems to be the newer framework (though there seem to have been a few sec issues found from time to time). The way we've configured logging currently though is to write async so perf wise we should see pretty low logging overhead.
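For reference, asynchronous logging in logback is typically wired up with an `AsyncAppender` wrapping a concrete appender. A minimal sketch of that pattern (not necessarily the exact config in this PR) looks like:

```xml
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>service.log</file>
    <encoder>
      <pattern>%d %level %logger - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Wrap the file appender so log events are queued and written off the caller's thread -->
  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <appender-ref ref="FILE"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>
```

With this setup the calling thread only enqueues the event, which keeps logging overhead low on the request path.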

@@ -0,0 +1,76 @@
# Chronon Feature Service
Collaborator:

Should we explicitly document the endpoints available here?

Contributor (Author):

I thought about this but I expect the docs to get outdated soon. A better approach might be something like wiring up OpenApi / Swagger annotations. Could tackle this in a follow up.

Comment on lines +14 to +21
/**
* Responsible for loading the relevant concrete Chronon Api implementation and providing that
* for use in the Web service code. We follow similar semantics as the Driver to configure this:
* online.jar - Jar that contains the implementation of the Api
* online.class - Name of the Api class
* online.api.props - Structure that contains fields that are loaded and passed to the Api implementation
* during instantiation to configure it (e.g. connection params)
*/
Collaborator:

I really appreciate the comments before the classes!

@@ -0,0 +1,76 @@
# Chronon Feature Service
Collaborator:

Do you think it's worth documenting the existence of this service in the chronon website?

Contributor (Author):

Yeah def worth it. I was thinking of letting it bake / clearing out some nits etc and putting up a doc update PR in a follow up.

Collaborator:

Plus one to this.

}

@Override
public void handle(RoutingContext ctx) {
Collaborator:

Also just for my own knowledge, will vertx use one thread per request it receives? how does it handle concurrent requests?

Contributor (Author):

Vert.x's model is event driven. Triggers like incoming HTTP requests result in events that are queued. A pool of event-loop threads dequeues these events and dispatches them to handlers. Blocking code in a handler will jam up one of your n event threads (n is configurable and defaults to the number of CPU cores), so you typically kick off work like DB lookups on a different thread in your handler to avoid blocking the precious event-loop threads.
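As a sketch of that pattern (assuming Vert.x 4.x; the route, the `lookupFeatures` helper, and the response shape are hypothetical), blocking work is typically offloaded via `executeBlocking` so the event loop stays free:

```java
// Hypothetical handler sketch; `vertx`, `router`, and `lookupFeatures` are assumed context.
router.get("/v1/features/groupby/:name").handler(ctx -> {
  vertx.executeBlocking(promise -> {
    // Runs on a worker thread, so it's safe to block here (e.g. a synchronous KV read).
    promise.complete(lookupFeatures(ctx.pathParam("name")));
  }, res -> {
    // Back on the event loop once the blocking work completes.
    if (res.succeeded()) {
      ctx.json(res.result());
    } else {
      ctx.fail(500, res.cause());
    }
  });
});
```

Purely non-blocking work (like a callback-based KV client) can stay on the event loop directly; `executeBlocking` is only needed when the call would otherwise stall the thread.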


maybeFeatureResponses.onSuccess(resultList -> {
// as this is a bulkGet request, we might have some successful and some failed responses
// we return the responses in the same order as they come in and mark them as successful / failed based
Collaborator:

Is order the only way of knowing what response maps to what request in the bulkget call? (still figuring out what the requests and responses look like)

Contributor (Author):

Yeah, the order matches exactly. It's a similar pattern to other bulk-get APIs out there (e.g. the FB Graph bulk get).
This looks like:
Request - /v1/features/join/quickstart/training_set.v2

[{"user_id": "5"}, {"user_id": "6"}, {"user_id": "7"}]

Response:

{
  "results": [
      {"status": "Success", "features": { ... }},
      {"status": "Success", "features": { ... }},
      {"status": "Failure", "error": "Some exception"}
  ]
}

Contributor (Author):

(I'd imagine calls would most often be for a single element. I've wired it up as a bulk get primarily because the fetcher API is a bulk API, and rather than sticking to a list of one it was easier to set things up this way.)

@piyush-zlai (Contributor, Author)

@caiocamatta-stripe thanks for taking a look. I'll tag some folks in the coming days (feel free to loop in others at Stripe as well if you'd like more eyes on this).

One final thing that still wasn't clear to me - how does the service get the schemas for GroupBys it's serving? Or is that purely the Fetcher's responsibility?

Yeah, this is done as part of the GroupByUpload jobs and the MetadataUpload (join schemas) - https://chronon.ai/getting_started/Tutorial.html#uploading-data (these need to upload to the same KV store that you're using in your fetcher).

If you have those two sets of jobs, then your KV store is primed with the GroupByUploads (which contain one row holding your GroupByServingInfo data) and the join schema.
