Introduce runners #230

fracek · 2023-08-22T16:27:44Z

Is your feature request related to a problem? Please describe.

As seen in #226 and #227, we need a way to dynamically spin up indexers. This system needs to be robust enough for production usage, and support different deployment models (Kubernetes, Docker, bare metal).

Describe the solution you'd like

The idea is to introduce schedulers. A scheduler is a server (gRPC for now, + REST later) that exposes a CRUD interface for indexers, and schedules them to run in the background. The scheduler is also used to show indexing status (see #229) and logs.

Notice that the CreateIndexer operation is idempotent: if an indexer with the same indexer definition.id already exists, it will return the existing indexer without creating a new one.

See the message below for a first draft of the implementation.

The idea is to include at least two implementations of the scheduler:

One with no external dependencies, used for development. It will be less robust, but good enough for development.
One based on Kubernetes that, combined with the operator, gives us a production-ready setup.

After we have this, implementing #226 and #227 becomes easy:

apibara up command to run multiple indexers at the same time #226: for each indexer in the config, call CreateIndexer. No need to check if the indexer exists since operations are idempotent. apibara down will call DeleteIndexer.
Indexers factories #227: the factory simply calls CreateIndexer with the returned value. This becomes a sink like any other.

Additional context

The idea to use an API to schedule factory indexers comes from a Telegram chat with @bigherc18.


bigherc18 · 2023-08-22T23:24:37Z

Overall it looks good to me. I have some remarks, though:

I wouldn't call it a scheduler; I'd prefer something like "manager". A scheduler, as its name indicates, is expected to schedule tasks/processes to run in the future, like cron jobs, on a schedule.
Why go with RPC first? This is a basic CRUD server, so I'd say REST is a better fit in this case: it would be easier for us and for other people to write plugins, tests, etc.

fracek · 2023-08-24T09:19:50Z

Agree, I think "runner" is a better name for this component. Let's use that.

I like gRPC because it's self-documenting, it forces us to version the API following best practices, and we can generate clients automatically.

fracek · 2023-08-24T10:14:06Z

I'm going to sketch out the runner gRPC Service so that we can start working on an implementation.
We follow Google's AIP guidelines when possible since they're well thought out, but we're not too strict.

Indexer resource

Terminology:

Resource Type: Indexer.
Collection Identifier: indexers.
Resource Id: the identifier specified by the user when creating the resource, e.g. my-indexer.
Resource Name: the combination of the collection identifier and resource id, e.g. indexers/my-indexer.
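The relation between resource id and resource name can be sketched with two small helpers. This is only an illustration: the function names and the validation rules (non-empty id, no `/` inside the id) are assumptions, not part of the proposal.

```python
COLLECTION = "indexers"  # collection identifier from the terminology above


def resource_name(indexer_id: str) -> str:
    """Build a resource name such as 'indexers/my-indexer' from a resource id."""
    # Assumption: ids are non-empty and never contain a slash.
    if not indexer_id or "/" in indexer_id:
        raise ValueError(f"invalid indexer id: {indexer_id!r}")
    return f"{COLLECTION}/{indexer_id}"


def resource_id(name: str) -> str:
    """Extract the resource id from a resource name."""
    collection, sep, indexer_id = name.partition("/")
    if collection != COLLECTION or not sep or not indexer_id or "/" in indexer_id:
        raise ValueError(f"invalid indexer name: {name!r}")
    return indexer_id
```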

message Indexer {
    message Status {
    }

    // Resource name, e.g. `indexers/my-indexer`.
    string name = 1;

    // Additional labels attached to the indexer.
    // Useful to attach application-specific metadata.
    map<string, string> labels = 2;
}

Open Questions: will edit later.

How to specify the indexer script? In a cloud environment, the indexer must be downloaded in the container before running.
How to model parent-child indexers? When we delete an indexer, we need to delete all of its children as well.

Operations

Create

This method creates a new indexer if one with the same indexer_id doesn't
exist. If it already exists, it simply returns the existing indexer.

Notice that if the client provides a value for fields that are set server-side
(like name or status), they are simply ignored.

service IndexerRunner {
    rpc CreateIndexer(CreateIndexerRequest) returns (Indexer);
}

message CreateIndexerRequest {
    // Indexer id, e.g. `my-indexer`.
    string indexer_id = 1;
    Indexer indexer = 2;
}
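The idempotent create semantics described above could look roughly like the following in-memory sketch. `InMemoryRunner` and the Python `Indexer` class are hypothetical stand-ins for a real runner and the protobuf message; only the behavior (return the existing indexer, ignore client-provided server-side fields) comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Indexer:
    # Mirrors the `name` and `labels` fields of the Indexer message.
    name: str = ""
    labels: Dict[str, str] = field(default_factory=dict)


class InMemoryRunner:
    """Hypothetical runner that keeps indexers in a dict keyed by indexer id."""

    def __init__(self) -> None:
        self._indexers: Dict[str, Indexer] = {}

    def create_indexer(self, indexer_id: str, indexer: Indexer) -> Indexer:
        existing = self._indexers.get(indexer_id)
        if existing is not None:
            # Idempotent: return the existing indexer without creating a new one.
            return existing
        # `name` is set server-side, so any value sent by the client is ignored.
        created = Indexer(name=f"indexers/{indexer_id}", labels=dict(indexer.labels))
        self._indexers[indexer_id] = created
        return created
```

Calling `create_indexer` twice with the same id returns the same object, which is what lets `apibara up` skip the "does it already exist?" check.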

Delete

This method deletes the indexer. If persistence for the indexers is configured,
this method must also clear the indexer state from it.

// Requires `import "google/protobuf/empty.proto";` at the top of the file.
service IndexerRunner {
    rpc DeleteIndexer(DeleteIndexerRequest) returns (google.protobuf.Empty);
}

message DeleteIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

Get

This method simply gets an indexer by its name.

service IndexerRunner {
    rpc GetIndexer(GetIndexerRequest) returns (Indexer);
}

message GetIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

List

List all indexers according to some criteria. Returns a paginated list of indexers.

We implement filtering based on AIP 160. The idea
is to use a string as filter to allow us to change filtering easily without
breaking changes.

Since filtering is complex, we skip it at first.

service IndexerRunner {
    rpc ListIndexers(ListIndexersRequest) returns (ListIndexersResponse);
}

message ListIndexersRequest {
    // Number of indexers per page.
    int32 page_size = 1;
    // Continuation token.
    string page_token = 2;
    // Filter indexers.
    string filter = 3;
}

message ListIndexersResponse {
    repeated Indexer indexers = 1;
    string next_page_token = 2;
}
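A minimal sketch of how `page_size`, `page_token`, and `next_page_token` could interact. The token encoding (a base64-encoded offset), the default page size of 50, and sorting by name are all assumptions made for the example; the proposal only says the token is a continuation token.

```python
import base64
from typing import Iterable, List, Tuple

DEFAULT_PAGE_SIZE = 50  # assumption: used when page_size is 0 (unset)


def list_indexers(
    names: Iterable[str], page_size: int = 0, page_token: str = ""
) -> Tuple[List[str], str]:
    """Return one page of indexer names and the token for the next page."""
    ordered = sorted(names)
    # The token is opaque to clients; here it simply encodes the next offset.
    start = int(base64.urlsafe_b64decode(page_token).decode()) if page_token else 0
    size = page_size if page_size > 0 else DEFAULT_PAGE_SIZE
    page = ordered[start:start + size]
    end = start + len(page)
    # An empty next_page_token signals that there are no more pages.
    if end < len(ordered):
        next_token = base64.urlsafe_b64encode(str(end).encode()).decode()
    else:
        next_token = ""
    return page, next_token
```

A client keeps calling with the returned token until it comes back empty.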

Stream Logs

This method returns a stream of logs for the indexer. It is an infinite stream
of data since we expect the indexer to keep producing logs.

service IndexerRunner {
    rpc StreamLogs(StreamLogsRequest) returns (stream StreamLogsResponse);
}

enum LogLevel {
    LOG_LEVEL_UNKNOWN = 0;
    LOG_LEVEL_TRACE = 1;
    LOG_LEVEL_DEBUG = 2;
    LOG_LEVEL_INFO = 3;
    LOG_LEVEL_WARNING = 4;
    LOG_LEVEL_ERROR = 5;
}

message StreamLogsRequest {
    // The name of the indexer, e.g. `indexers/my-indexer`.
    string parent = 1;
    LogLevel level = 2;
}

message StreamLogsResponse {
    LogLevel level = 1;
    string content = 2;
}
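The sketch below shows one possible reading of the `level` field in `StreamLogsRequest`: as a minimum severity, with `LOG_LEVEL_UNKNOWN` meaning "no filtering". That interpretation is an assumption, since the draft does not say how `level` is applied. `LogEntry` and `stream_logs` are hypothetical names, and a real runner would read from an unbounded log source rather than a list.

```python
from enum import IntEnum
from typing import Iterable, Iterator, NamedTuple


class LogLevel(IntEnum):
    # Same numbering as the LogLevel enum in the draft.
    UNKNOWN = 0
    TRACE = 1
    DEBUG = 2
    INFO = 3
    WARNING = 4
    ERROR = 5


class LogEntry(NamedTuple):
    level: LogLevel
    content: str


def stream_logs(
    entries: Iterable[LogEntry], level: LogLevel = LogLevel.UNKNOWN
) -> Iterator[LogEntry]:
    """Lazily yield entries whose level is at least `level`."""
    for entry in entries:
        # Assumption: the requested level acts as a minimum severity.
        if entry.level >= level:
            yield entry
```

Because it is a generator, the function works unchanged on an infinite source: entries are filtered one at a time as they arrive.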

Indexers persistence

The runner is responsible for setting up the indexers persistence. This is for several reasons:

developers want to configure this once and forget about it.
delete operations must clear the indexer state, so the runner must know about persistence anyway.

Runner persistence

In some cases, the runner needs to keep track of the indexers it created. I
believe it would be easier if it can work with the same persistence as the
sinks.

--persist-to-fs: stores data in the same folder as indexers. To keep it
simple, it dumps the Indexer object sent by CreateIndexer as json to a
file named <indexer-id>.indexer.
--persist-to-etcd: stores the content of the Indexer object sent by
CreateIndexer to the database. The key should be something like
indexers:<indexer-id> so that ListIndexers simply scans all keys with the indexers: prefix.

Some runners (like the one based on Kubernetes) can use other persistence mechanisms.
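The --persist-to-fs behavior could be sketched as follows. The helper names are hypothetical, and the indexer is represented as a plain dict rather than the protobuf message; the file naming (`<indexer-id>.indexer`) and the JSON format follow the description above.

```python
import json
from pathlib import Path
from typing import List


def save_indexer(root: Path, indexer_id: str, indexer: dict) -> Path:
    """Dump the indexer as JSON to <root>/<indexer-id>.indexer."""
    path = root / f"{indexer_id}.indexer"
    path.write_text(json.dumps(indexer, sort_keys=True))
    return path


def load_indexers(root: Path) -> List[dict]:
    """List persisted indexers by scanning the folder for *.indexer files."""
    return [json.loads(p.read_text()) for p in sorted(root.glob("*.indexer"))]


def remove_indexer(root: Path, indexer_id: str) -> None:
    """Remove the persisted indexer; removing a missing one is a no-op."""
    (root / f"{indexer_id}.indexer").unlink(missing_ok=True)
```

An etcd-backed runner would presumably follow the same shape, with the file name replaced by the `indexers:<indexer-id>` key.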

Other considerations

This service does not deal with authentication or authorization. Deployments
that want to deal with it must create a facade service that adds
authentication/authorization to this service.

fracek · 2023-08-24T11:08:21Z

Re: how to specify indexer script.

We add two new properties to the indexer:

project_source: this is the location of the project. Can be a directory (file:///path/to/dir) or a github url (github:fracek/my-indexer).
project_dir: the subfolder (if any) that contains the indexer script.

The indexer path is then computed as ${project_dir}/${script}, relative to the root of the project source.
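As a rough sketch, the path computation could look like this. The function name is hypothetical, and rejecting paths that escape the project root is an extra safety check assumed for the example, not something the proposal requires.

```python
import posixpath


def indexer_script_path(project_dir: str, script: str) -> str:
    """Compute ${project_dir}/${script}, relative to the project source root."""
    # An empty project_dir means the script path is used as-is.
    joined = posixpath.normpath(posixpath.join(project_dir, script))
    # Assumption: the script must stay inside the project source.
    if posixpath.isabs(joined) or joined == ".." or joined.startswith("../"):
        raise ValueError(f"script path escapes the project source: {joined!r}")
    return joined
```

For instance, with a `project_dir` of `packages/indexer` and a script of `src/main.ts`, the indexer path is `packages/indexer/src/main.ts`.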

In practice

apibara up

Creates new indexers as defined in the configuration. project_source is set to the path of the folder containing the config file, and project_dir is empty.

indexer factory

By default, project_source is the current directory and project_dir is empty.

in both cases

When they call the API to create an indexer, they forward the current project_source/project_dir. Ideally, we let users override source and dir for any indexer (so that they can deploy from a third-party repository).

fracek · 2023-08-24T11:18:21Z

re: delete an indexer and its children

The easiest solution is to add a spawned_by property to the indexer, with the name/id of the indexer that spawned the current indexer.

On delete, the runner goes through all indexers where spawned_by is the current indexer and deletes them (recursively).

Note that we cannot use the name parent because, according to AIP, it means something different.
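The recursive delete could be sketched like this. The store is reduced to a dict mapping each indexer name to its `spawned_by` value (or `None`), which is a simplification made for the example; the function name and the return value are also assumptions.

```python
from typing import Dict, List, Optional


def delete_indexer(spawned_by: Dict[str, Optional[str]], name: str) -> List[str]:
    """Delete `name` and, recursively, every indexer it spawned.

    Returns the names of all deleted indexers, children before parents.
    """
    if name not in spawned_by:
        return []
    # Remove the indexer first so that a malformed spawn cycle cannot loop forever.
    del spawned_by[name]
    children = [child for child, parent in spawned_by.items() if parent == name]
    deleted: List[str] = []
    for child in children:
        deleted.extend(delete_indexer(spawned_by, child))
    deleted.append(name)
    return deleted
```

Each level scans the whole store for children, which is fine for a handful of indexers; a runner with many indexers might prefer to index them by `spawned_by`.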

github-actions · 2024-02-26T02:12:47Z

This issue has been automatically marked as stale because it has not had activity in the last six months. It will be closed in 2 weeks if no further activity occurs. Please feel free to leave a comment if you believe the issue is still relevant.

fracek added the enhancement (New feature or request) label Aug 22, 2023

fracek changed the title from "Introduce schedulers" to "Introduce runners" Aug 23, 2023

github-actions bot added the stale label Feb 26, 2024

fracek added the no stale label and removed the stale label Feb 26, 2024

fracek added the needs-rfc (Need to draft an RFC before working on it) label Mar 18, 2024
