Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[New Scheduler] Add DataManagementService #5063

Merged
merged 9 commits into from
Mar 18, 2021

Conversation

style95
Copy link
Member

@style95 style95 commented Feb 10, 2021

Description

This component is in charge of storing data to ETCD.
It is based on eventual consistency.
If it fails to store data for some reason, it keeps retrying until data is stored.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

case object GrantLease

// TBD
class LeaseKeepAliveService {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized this pr is also dependent on these modules.
It would be better to merge this PR after this kind of module is introduced.

// normally these messages will be sent when queues are created.
case request: ElectLeader =>
if (inProgressKeys.contains(request.key)) {
logging.info(this, s"save request $request into a buffer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these really need to be info level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, it stores the request into a buffer because there is already precedent request processing. If any issue happens it would let us know if the request has processed or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm still learning what this is doing, but what does each request involve? Is it every activation or some sort of metadata setup? If it's every activation it would seem spammy to me otherwise I think it's fine

.map { res =>
parent ! FinishWork(request.key)
if (res.getSucceeded) {
logging.debug(this, s"data is stored.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what data is stored?

@style95 style95 changed the title [New Scheduler] Add DataManagementService [WIP][New Scheduler] Add DataManagementService Feb 11, 2021
@style95
Copy link
Member Author

style95 commented Feb 11, 2021

I think we need to merge a PR for CI tests for scheduler components before merging this PR.
Since some downstream is running their CI tests based on the upstream core repo, we need to differentiate scheduler tests from the others. Need to see if we can setup another dedicated pipeline for scheduler components in Travis.
And when all components contribution is over, we can merge the pipeline into existing unit/system pipelines.

* In the event any issue occurs while storing data, the actor keeps trying until the data is stored guaranteeing delivery to ETCD.
* So it guarantees the data is eventually stored.
*/
class DataManagementService(watcherService: ActorRef, workerFactory: ActorRefFactory => ActorRef)(
Copy link
Member Author

@style95 style95 Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will include test cases into this PR after setting up the CI pipeline for scheduler components.

}

// normally these messages will be sent when queues are created.
case request: ElectLeader =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leader election happens when a queue is created.
This is to guarantee only one scheduler creates a certain queue.
So it happens relatively fewer times.


// normally these messages will be sent when queues are created.
case request: ElectLeader =>
if (inProgressKeys.contains(request.key)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the retry nature of this component, if there is a precedent request(being retried), it would store the new request to a buffer.

logging.info(this, s"save a request $request into a buffer")
operations.getOrElseUpdate(request.key, Queue.empty[Any]).enqueue(request)
} else {
worker ! request
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actual works would be delegated to ETCDWorker.

inProgressKeys = inProgressKeys + request.key
}

case request: RegisterInitialData =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actions under the same namespace share some data such as namespace throttling data.
So it is required to store the data if there is no data yet but not overwrite an existing one.
This case is for the case.


case request: RegisterInitialData =>
// send WatchEndpoint first as the put operation will be retried until success if failed
if (request.failoverEnabled)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the failover is enabled, it would watch the key and if the key is deleted for some reason, it would try to restore it.

inProgressKeys = inProgressKeys + request.key
}

case request: RegisterData =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will overwrite the existing data in ETCD.
Generally, this is used for data that is not shared among actions.

case WatchEndpointRemoved(_, key, value, true) =>
logging.error(this, s"unexpected data received: ${WatchEndpoint(key, value, isPrefix = true, watcherName)}")

case msg: UpdateDataOnChange =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the loads against ETCD, it does not store data if there is no change in the value.

actorSystem: ActorSystem)
extends Actor {
private implicit val ec = context.dispatcher

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is used by both schedulers and invokers to store data to ETCD.
The following kinds of data are stored to ETCD.

  1. Throttling data(Action / Namespace)
  2. Queue endpoint(where a queue is running)
  3. Scheduler endpoint.
  4. Container data(running container, warmed container, data to describe how many containers are being created)

Dependent modules are Queue, ContainerProxy, CreationJobManager, etc.

@style95
Copy link
Member Author

style95 commented Feb 11, 2021

Just comes up in my mind is it would be great to write down some documents for each component in Wiki.

@style95
Copy link
Member Author

style95 commented Mar 8, 2021

Waiting for this PR(#5067) to be merged.

@style95
Copy link
Member Author

style95 commented Mar 8, 2021

I wrote a document about this module: https://cwiki.apache.org/confluence/display/OPENWHISK/DataManagementService

@style95 style95 changed the title [WIP][New Scheduler] Add DataManagementService [New Scheduler] Add DataManagementService Mar 11, 2021
@codecov-io
Copy link

codecov-io commented Mar 11, 2021

Codecov Report

Merging #5063 (ec5f93f) into master (e05aa44) will decrease coverage by 6.59%.
The diff coverage is 44.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5063      +/-   ##
==========================================
- Coverage   81.63%   75.03%   -6.60%     
==========================================
  Files         205      214       +9     
  Lines       10013    10448     +435     
  Branches      442      470      +28     
==========================================
- Hits         8174     7840     -334     
- Misses       1839     2608     +769     
Impacted Files Coverage Δ
...la/org/apache/openwhisk/core/etcd/EtcdClient.scala 26.22% <ø> (ø)
...apache/openwhisk/core/service/WatcherService.scala 91.66% <ø> (+2.08%) ⬆️
...openwhisk/core/service/DataManagementService.scala 44.27% <44.27%> (ø)
.../scala/org/apache/openwhisk/core/WhiskConfig.scala 95.48% <100.00%> (+0.02%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0.00% <0.00%> (-93.90%) ⬇️
...tabase/cosmosdb/cache/CacheInvalidatorConfig.scala 0.00% <0.00%> (-92.31%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e05aa44...ec5f93f. Read the comment docs.

@style95
Copy link
Member Author

style95 commented Mar 15, 2021

It's ready to merge.

@style95
Copy link
Member Author

style95 commented Mar 18, 2021

@bdoyle0182 Do you have any other comments on this PR?

@bdoyle0182
Copy link
Contributor

Just one comment. LGTM

@style95 style95 merged commit ecb1509 into apache:master Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants