Introduce runners #230

Open
fracek opened this issue Aug 22, 2023 · 6 comments
Labels
enhancement New feature or request needs-rfc Need to draft an RFC before working on it no stale

Comments

@fracek
Contributor

fracek commented Aug 22, 2023

Is your feature request related to a problem? Please describe.

As seen in #226 and #227, we need a way to dynamically spin up indexers. This system needs to be robust enough for production usage and must support different deployment models (Kubernetes, Docker, bare metal).

Describe the solution you'd like

The idea is to introduce schedulers. A scheduler is a server (gRPC for now, + REST later) that exposes a CRUD interface for indexers, and schedules them to run in the background. The scheduler is also used to show indexing status (see #229) and logs.

Notice that the CreateIndexer operation is idempotent: if an indexer with the same indexer definition.id already exists, it will return the existing indexer without creating a new one.

See the message below for a first draft of the implementation.

The idea is to include at least two implementations of the scheduler:

  • one with no external dependencies, used for development. It will be less robust, but that is good enough for development.
  • one based on Kubernetes that, combined with the operator, gives us a production-ready setup.

After we have this, implementing #226 and #227 becomes easy:

Additional context

The idea to use an API to schedule factory indexers comes from a telegram chat with @bigherc18

@fracek fracek added the enhancement New feature or request label Aug 22, 2023
@bigherc18
Collaborator

Overall it looks good to me, I have some remarks though

  • I wouldn't call it scheduler, I'd prefer something like manager. A scheduler, as its name indicates, is expected to schedule tasks/processes to run in the future, like cron jobs.
  • Why go with RPC first? This is a basic CRUD server, so I'd say it's better to use REST in this case. It'd be easier for us and other people to write plugins, tests ...

@fracek fracek changed the title Introduce schedulers Introduce runners Aug 23, 2023
@fracek
Contributor Author

fracek commented Aug 24, 2023

Agree, I think "runner" is a better name for this component. Let's use that.

I like gRPC because it's self documenting, it forces us to version the API following best practices, and we can generate clients automatically.

@fracek
Contributor Author

fracek commented Aug 24, 2023

I'm going to sketch out the runner gRPC Service so that we can start working on an implementation.
We follow Google's AIP guidelines when possible since they're well thought out, but we're not too strict.

Indexer resource

Terminology:

  • Resource Type: Indexer.
  • Collection Identifier: indexers.
  • Resource Id: the identifier specified by the user when creating the resource, e.g. my-indexer.
  • Resource Name: the combination of the collection identifier and resource id, e.g. indexers/my-indexer.
message Indexer {
    message Status {
    }

    // Resource name, e.g. `indexers/my-indexer`.
    string name = 1;

    // Additional labels attached to the indexer.
    // Useful to attach application-specific metadata.
    map<string, string> labels = 2;
}
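
To make the terminology above concrete, here is a minimal Python sketch of converting between a resource id and a resource name. The helper names `format_name` and `parse_name` are hypothetical and not part of the proposed API, and the rule that an id cannot contain `/` is an assumption.

```python
COLLECTION = "indexers"  # collection identifier from the terminology above


def format_name(indexer_id: str) -> str:
    """Build a resource name such as `indexers/my-indexer` from a resource id."""
    # Assumption: ids are non-empty and never contain a slash.
    if not indexer_id or "/" in indexer_id:
        raise ValueError(f"invalid indexer id: {indexer_id!r}")
    return f"{COLLECTION}/{indexer_id}"


def parse_name(name: str) -> str:
    """Extract the resource id from a resource name of the `indexers` collection."""
    collection, sep, indexer_id = name.partition("/")
    if collection != COLLECTION or not sep or not indexer_id or "/" in indexer_id:
        raise ValueError(f"invalid indexer name: {name!r}")
    return indexer_id
```

For example, `format_name("my-indexer")` gives `indexers/my-indexer`, and `parse_name` reverses it.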

Open Questions: will edit later.

  • How to specify the indexer script? In a cloud environment, the indexer must be downloaded in the container before running.
  • How to model parent-child indexers? When we delete an indexer, we need to delete all of its children as well.

Operations

Create

This method creates a new indexer if one with the same indexer_id doesn't
exist. If it already exists, it simply returns the existing indexer.

Notice that if the client provides a value for fields that are set server-side
(like name or status), they are simply ignored.

service IndexerRunner {
    rpc CreateIndexer(CreateIndexerRequest) returns (Indexer);
}

message CreateIndexerRequest {
    // Indexer id, e.g. `my-indexer`.
    string indexer_id = 1;
    Indexer indexer = 2;
}
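
The create-or-return-existing behaviour can be sketched with an in-memory store. This is only an illustration in Python, not the runner implementation: `InMemoryRunner` and its dictionary storage are hypothetical, and the `Indexer` dataclass mirrors only the `name` and `labels` fields from the draft message.

```python
from dataclasses import dataclass, field


@dataclass
class Indexer:
    name: str = ""  # set server-side; any client-provided value is ignored
    labels: dict[str, str] = field(default_factory=dict)


class InMemoryRunner:
    """Hypothetical runner that keeps indexers in a dictionary keyed by name."""

    def __init__(self) -> None:
        self._indexers: dict[str, Indexer] = {}

    def create_indexer(self, indexer_id: str, indexer: Indexer) -> Indexer:
        name = f"indexers/{indexer_id}"
        existing = self._indexers.get(name)
        if existing is not None:
            # Idempotent: return the existing indexer without creating a new one.
            return existing
        # Overwrite the server-side field instead of trusting the client.
        created = Indexer(name=name, labels=dict(indexer.labels))
        self._indexers[name] = created
        return created
```

Calling `create_indexer` twice with the same `indexer_id` returns the same object, even if the second request carries different labels.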

Delete

This method deletes the indexer. If persistence for the indexers is configured,
this method must also clear the indexer state from it.

service IndexerRunner {
    rpc DeleteIndexer(DeleteIndexerRequest) returns (google.protobuf.Empty);
}

message DeleteIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

Get

This method simply gets an indexer by its name.

service IndexerRunner {
    rpc GetIndexer(GetIndexerRequest) returns (Indexer);
}

message GetIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

List

List all indexers according to some criteria. Returns a paginated list of indexers.

We implement filtering based on AIP 160. The idea
is to use a string as filter to allow us to change filtering easily without
breaking changes.

Since filtering is complex, we skip it at first.

service IndexerRunner {
    rpc ListIndexers(ListIndexersRequest) returns (ListIndexersResponse);
}

message ListIndexersRequest {
    // Number of indexers per page.
    int32 page_size = 1;
    // Continuation token.
    string page_token = 2;
    // Filter indexers.
    string filter = 3;
}

message ListIndexersResponse {
    repeated Indexer indexers = 1;
    string next_page_token = 2;
}
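
As a rough illustration of how `page_size`, `page_token` and `next_page_token` fit together, here is a Python sketch over a plain list of names. It assumes the token is simply the offset of the next item encoded as a string and that a default page size of 50 applies when `page_size` is 0. Both are assumptions; a real implementation would likely use an opaque token.

```python
DEFAULT_PAGE_SIZE = 50  # assumption: used when the client leaves page_size at 0


def list_indexers(
    names: list[str], page_size: int = 0, page_token: str = ""
) -> tuple[list[str], str]:
    """Return one page of names (sorted) and the token for the next page."""
    ordered = sorted(names)
    size = page_size if page_size > 0 else DEFAULT_PAGE_SIZE
    # Assumption: the token is the start offset of the page, as a decimal string.
    start = int(page_token) if page_token else 0
    if start < 0:
        raise ValueError("invalid page token")
    page = ordered[start:start + size]
    next_start = start + len(page)
    # An empty token tells the client there are no more pages.
    next_token = str(next_start) if next_start < len(ordered) else ""
    return page, next_token
```

A client keeps calling with the returned token until it receives an empty `next_page_token`.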

Stream Logs

This method returns a stream of logs for the indexer. It is an infinite stream
of data since we expect the indexer to keep producing logs.

service IndexerRunner {
    rpc StreamLogs(StreamLogsRequest) returns (stream StreamLogsResponse);
}

enum LogLevel {
    LOG_LEVEL_UNKNOWN = 0;
    LOG_LEVEL_TRACE = 1;
    LOG_LEVEL_DEBUG = 2;
    LOG_LEVEL_INFO = 3;
    LOG_LEVEL_WARNING = 4;
    LOG_LEVEL_ERROR = 5;
}

message StreamLogsRequest {
    // The name of the indexer, e.g. `indexers/my-indexer`.
    string parent = 1;
    LogLevel level = 2;
}

message StreamLogsResponse {
    LogLevel level = 1;
    string content = 2;
}
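
The draft does not say how the `level` field of `StreamLogsRequest` is interpreted. The Python sketch below assumes it is a minimum severity, so the stream only yields entries at that level or above. The generator `stream_logs` and the tuple representation of a log entry are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, Iterator, Tuple


class LogLevel(IntEnum):
    # Same numbering as the LogLevel enum in the draft.
    UNKNOWN = 0
    TRACE = 1
    DEBUG = 2
    INFO = 3
    WARNING = 4
    ERROR = 5


LogEntry = Tuple[LogLevel, str]  # (level, content), mirroring StreamLogsResponse


def stream_logs(entries: Iterable[LogEntry], min_level: LogLevel) -> Iterator[LogEntry]:
    """Lazily yield entries whose level is at least `min_level`."""
    for level, content in entries:
        # Assumption: `level` in the request acts as a lower bound.
        if level >= min_level:
            yield level, content
```

Because it is a generator, it also works when `entries` never ends, which matches the infinite stream described above.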

Indexers persistence

The runner is responsible for setting up the indexers persistence. This is for several reasons:

  • developers want to configure this once and forget about it.
  • delete operations must clear the indexer state, so the runner must know about persistence anyway.

Runner persistence

In some cases, the runner needs to keep track of the indexers it created. I
believe it would be easier if it can work with the same persistence as the
sinks.

  • --persist-to-fs: stores data in the same folder as indexers. To keep it
    simple, it dumps the Indexer object sent by CreateIndexer as json to a
    file named <indexer-id>.indexer.
  • --persist-to-etcd: stores the content of the Indexer object sent by
    CreateIndexer to the database. The key should be something like
    indexers:<indexer-id> so that ListIndexers simply scans through this key prefix.

Some runners (like the one based on Kubernetes) can use other persistence mechanisms.
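
The --persist-to-fs option described above could look roughly like the following Python sketch. The function names and the use of a plain dictionary for the Indexer object are assumptions made for illustration. Only the `<indexer-id>.indexer` file naming comes from the description.

```python
import json
from pathlib import Path


def save_indexer(state_dir: Path, indexer_id: str, indexer: dict) -> Path:
    """Dump the Indexer object as JSON to `<indexer-id>.indexer`."""
    state_dir.mkdir(parents=True, exist_ok=True)
    path = state_dir / f"{indexer_id}.indexer"
    path.write_text(json.dumps(indexer, sort_keys=True))
    return path


def list_indexers(state_dir: Path) -> list[dict]:
    """Load every persisted indexer, sorted by file name."""
    return [json.loads(p.read_text()) for p in sorted(state_dir.glob("*.indexer"))]


def delete_indexer(state_dir: Path, indexer_id: str) -> None:
    """Remove the persisted indexer, doing nothing if it does not exist."""
    (state_dir / f"{indexer_id}.indexer").unlink(missing_ok=True)
```

With this layout, listing indexers is a directory scan, which mirrors the key scan described for the etcd option.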

Other considerations

This service does not deal with authentication or authorization. Deployments
that want to deal with it must create a facade service that adds
authentication/authorization to this service.

@fracek
Contributor Author

fracek commented Aug 24, 2023

Re: how to specify indexer script.

We add two new properties to the indexer:

  • project_source: this is the location of the project. Can be a directory (file:///path/to/dir) or a github url (github:fracek/my-indexer).
  • project_dir: the subfolder (if any) that contains the indexer script.

The indexer path is then computed as ${project_dir}/${script}, relative to the root of the project source.
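
A small Python sketch of that path computation, assuming POSIX-style relative paths. The function name `indexer_path` is hypothetical, and rejecting absolute paths and `..` segments is an extra safety assumption that is not stated above.

```python
from pathlib import PurePosixPath


def indexer_path(project_dir: str, script: str) -> str:
    """Join project_dir and script into a path relative to the project source root."""
    # An empty project_dir contributes nothing to the result.
    path = PurePosixPath(project_dir) / script
    # Assumption: the result must stay inside the project source.
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"path escapes the project source: {path}")
    return str(path)
```

With an empty `project_dir` the result is just the script path, which covers the default case where the project source root contains the script directly.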

In practice

apibara up

Creates new indexers as defined in the configuration. project_source is set to the path of the folder containing the config file, and project_dir is empty.

indexer factory

By default, project_source is the current directory and project_dir is empty.

in both cases

When they call the API to create an indexer, they forward the current project_source/project_dir. Ideally, we let users override source and dir for any indexer (so that they can deploy from a third-party repository).

@fracek
Contributor Author

fracek commented Aug 24, 2023

re: delete an indexer and its children

The easiest solution is to add a spawned_by property to the indexer, with the name/id of the indexer that spawned the current indexer.

On delete, the runner goes through all indexers where spawned_by is the current indexer and deletes them (recursively).
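
The recursive delete can be sketched in Python as follows. Here the runner state is reduced to a hypothetical dictionary mapping each indexer name to the name in its spawned_by property (or `None` for top-level indexers). Clearing persisted state is left out.

```python
from typing import Dict, List, Optional


def delete_indexer(indexers: Dict[str, Optional[str]], name: str) -> List[str]:
    """Delete `name` and every indexer it spawned. Returns names in deletion order."""
    if name not in indexers:
        return []
    # Remove the indexer first so a malformed spawned_by cycle cannot recurse forever.
    del indexers[name]
    deleted = [name]
    children = [child for child, parent in indexers.items() if parent == name]
    for child in children:
        deleted.extend(delete_indexer(indexers, child))
    return deleted
```

This scans all indexers at every level, which is fine for small deployments. A runner with many indexers might prefer an index from parent to children.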

Note that we cannot use the name parent because, according to AIP, it means something different.


This issue has been automatically marked as stale because it has not had activity in the six months. It will be closed in 2 weeks if no further activity occurs. Please feel free to leave a comment if you believe the issue is still relevant.

@github-actions github-actions bot added the stale label Feb 26, 2024
@fracek fracek added no stale and removed stale labels Feb 26, 2024
@fracek fracek added the needs-rfc Need to draft an RFC before working on it label Mar 18, 2024