Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUMULUS-260 update pdr stats #235

Merged
merged 22 commits into from
Mar 15, 2018
Merged

CUMULUS-260 update pdr stats #235

merged 22 commits into from
Mar 15, 2018

Conversation

jennyhliu
Copy link
Contributor

@jennyhliu jennyhliu commented Mar 6, 2018

Summary: Summary of changes

Addresses CUMULUS-260: PDR page on dashboard only shows zero

Currently the pdr status stays in 'running' even if the granules are processed and completed.

Changes

  • Update lambda pdr-status-check to take pdr as additional input and output playload, so that pdr stats (granule stats in the pdr) can be updated by the cumulus api when the workflow is complete.
  • Add lambda function sf-sns-report which is copied from @cumulus/api/lambdas/sf-sns-broadcase with modification, sf-sns-report can be used to report step function status anywhere inside a step function. So add step sf-sns-report after each pdr-status-check, we will get the PDR status progress at real time.
  • Update pdr-status-check to handle ExecutionDoesNotExist error.
  • Fixed @cumulus/common/aws setGranuleStatus

Test Plan

Things that should succeed before merging.

  • Unit tests - updated
  • Adhoc testing - tested in aws

Reviewers: @scisco @laurenfrederick

@jennyhliu
Copy link
Contributor Author

Actually, we can call indexer.partialRecordUpdate to update the pdr.stats only from pdr-status-check. I will try that

Copy link
Contributor

@scisco scisco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add changelog here?

CHANGELOG.md Outdated
- **CUMULUS-260: "PDR page on dashboard only shows zeros."** The PDR stats in LPDAAC are all 0s, even if the dashboard has been fixed to retrieve the correct fields. The current version of pdr-status-check has a few issues.
- pdr is not included in the input/output schema. It's available from the input event. So the pdr status and stats are not updated when the ParsePdr workflow is complete. Adding the pdr to the input/output of the task will fix this.
- pdr-status-check doesn't update pdr stats which prevent the real time pdr progress from showing up in the dashboard. To solve this, added lambda function sf-sns-report which is copied from @cumulus/api/lambdas/sf-sns-broadcase with modification, sf-sns-report can be used to report step function status anywhere inside a step function. So add step sf-sns-report after each pdr-status-check, we will get the PDR status progress at real time.
- It's possible a execution is still in the queue and doesn't exist in sfn yet. Added code to handle 'ExecutionDoesNotExist' error when checking the execution status.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor grammar fix: It's possible an execution is still in the queue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jennyhliu jennyhliu changed the title WIP CUMULUS-260 update pdr stats CUMULUS-260 update pdr stats Mar 13, 2018
try {
const execution = await aws.sfn().describeExecution({ executionArn }).promise();
executions.push(execution);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should } catch (e) { should be on one line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eslint doesn't like the one-line style

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because of this override: https://github.com/cumulus-nasa/cumulus/blob/master/.eslintrc.json#L86.

Our eslintrc is requiring the "stroustrup" style for braces: https://eslint.org/docs/rules/brace-style

// it's ok if a execution is still in the queue and has not be executed
if (e.code === 'ExecutionDoesNotExist') {
executions.push({ executionArn: executionArn, status: 'RUNNING' });
}
Copy link
Contributor

@abarciauskas-bgse abarciauskas-bgse Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here

} else {
  throw e;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, our eslint doesn't like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as the other instance:

That's because of this override: https://github.com/cumulus-nasa/cumulus/blob/master/.eslintrc.json#L86.

Our eslintrc is requiring the "stroustrup" style for braces: https://eslint.org/docs/rules/brace-style

}
// Error and error keys are not part of the cumulus message
// and if they appear in the message something is seriously wrong
else if (event.Error || event.error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know an instance of this occurring? A better option might be to whitelist the top-level keys we expect from a cumulus message, otherwise this seems unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't. I don't know if removing this logic will break something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was reported in an earlier bug report and this is the fix for that bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok - do you recall if that bug was before or after cumulus message adapter integration?

Copy link
Contributor

@marchuffnagle marchuffnagle Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abarciauskas-bgse brings up a very good point. When you're using the message adapter, the only keys that the simplified event is going to have are "config" and "input". This is never going to trigger.

* @returns {boolean} true if there was an exception, false otherwise
*/
function eventFailed(event) {
if (event.exception) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe flatten these nested conditionals?

if (event.exception && typeof(event.exception) === 'object' && Object.keys(event.exception.length) > 0) {
  return true
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. this is neat


/**
* if the cumulus message shows that a previous step failed,
* this function extract the error message from the cumulus message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function extracts the error message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/**
* if the cumulus message shows that a previous step failed,
* this function extract the error message from the cumulus message
* and fail the function with that information. This ensures that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and fails the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @returns {Promise} updated event object
*/
function handler(event, context, callback) {
return StepFunction.pullEvent(event).then((message) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wrap in curly brace:

return StepFunction.pullEvent(event).then((message) => {
  cumulusMessageAdapter.runCumulusTask(publishSnsMessage, message, context, callback));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

});


test('send report when sfn is finished and granule succeed', async (t) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send report when sfn is finished and granule has succeeded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@abarciauskas-bgse
Copy link
Contributor

@jennyhliu you mentioned in the PR description Add lambda function sf-sns-report which is copied from /lambdas/sf-sns-broadcase with modification should some of the shared functionality be pulled into a separate package which becomes a dependency of both to reduce duplication of code?

Copy link
Contributor

@scisco scisco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asked for a small change

@@ -40,6 +40,7 @@
]
},
"dependencies": {
"@cumulus/api": "^1.1.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used any where. Please remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -0,0 +1,55 @@
# @cumulus/queue-pdrs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is incorrect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@scisco
Copy link
Contributor

scisco commented Mar 13, 2018

@abarciauskas-bgse I think we should retire that other lambda function in the api package. But retiring it will break existing deployments. so we should come up with a plan for that.

@jennyhliu
Copy link
Contributor Author

@abarciauskas-bgse @scisco updated based on comments.

await exports.s3().putObject(bucket, key, '', null, { executionArn, status }).promise();
const params = { Bucket: bucket, Key: key };
params.Metadata = { executionArn, status };
await exports.s3().putObject(params).promise();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed a ticket to take a look at this functionality: https://bugs.earthdata.nasa.gov/browse/CUMULUS-420

CHANGELOG.md Outdated
@@ -17,6 +17,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- `@cumulus/deployment` deploys DynamoDB streams for the Collections, Providers and Rules tables as well as a new lambda function called `dbIndexer`. The `dbIndexer` lambda has an event source mapping which listens to each of the DynamoDB streams. The dbIndexer lambda receives events referencing operations on the DynamoDB table and updates the elasticsearch cluster accordingly.
- The `@cumulus/api` endpoints for collections, providers and rules _only_ query DynamoDB, with the exception of LIST endpoints and the collections' GET endpoint.

- **CUMULUS-260: "PDR page on dashboard only shows zeros."** The PDR stats in LPDAAC are all 0s, even if the dashboard has been fixed to retrieve the correct fields. The current version of pdr-status-check has a few issues.
- pdr is not included in the input/output schema. It's available from the input event. So the pdr status and stats are not updated when the ParsePdr workflow is complete. Adding the pdr to the input/output of the task will fix this.
- pdr-status-check doesn't update pdr stats which prevent the real time pdr progress from showing up in the dashboard. To solve this, added lambda function sf-sns-report which is copied from @cumulus/api/lambdas/sf-sns-broadcase with modification, sf-sns-report can be used to report step function status anywhere inside a step function. So add step sf-sns-report after each pdr-status-check, we will get the PDR status progress at real time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"broadcast", not "broadcase"

function eventFailed(event) {
// event has exception
// and it is needed to avoid flagging cases like "exception: {}" or "exception: 'none'"
if (event.exception && (typeof event.exception === 'object') &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first check for event.exception is not necessary. If it isn't set, then (typeof event.exception === 'object') will be false. Can just be:

function eventFailed(event) {
  // event has exception
  // and it is needed to avoid flagging cases like "exception: {}" or "exception: 'none'"
  if ((typeof event.exception === 'object') &&
      (Object.keys(event.exception).length > 0)) return true;

  // Error and error keys are not part of the cumulus message
  // and if they appear in the message something is seriously wrong
  if (event.Error || event.error) return true;

  return false;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is also formatted so it's a little easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually ``(event.exception)check is need to take care of the caseevent.exception=null`, otherwise `Object.keys(event.exception)` will throw exception.

* @param {Object} event - aws event object
* @returns {undefined} throws an error and does not return anything
*/
function makeLambdaFunctionFail(event) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little hard to follow what this function is doing. It might be easier with something like:

function buildError(type, cause) {
  let ErrorClass;

  if (Object.keys(errors).includes(type)) ErrorClass = errors[type];
  else if (type === 'TypeError') ErrorClass = TypeError;
  else ErrorClass = Error;

  return new ErrorClass(cause);
}

function makeLambdaFunctionFail(event) {
  const error = event.exception || event.error;

  if (!error) throw new Error('Step Function failed for an unknown reason.');
  
  throw buildError(error.Error, error.Cause);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This makes the code much clear. Sorry I didn't try to understand the code.

* @returns {Promise} updated event object
*/
function handler(event, context, callback) {
return StepFunction.pullEvent(event).then((message) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling the event is handled by the message adapter and should not be required here.

"description": "Describes the config used by the sf-sns-report task",
"type": "object",
"additionalProperties": false,
"properties": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that all of these properties are required, so they should be marked as required here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the code, they are only needed when the task is the last step of the step function and a granule state is set.

* schema that is passed to the next task in the workflow
*/
async function publishSnsMessage(event) {
const config = get(event, 'config', []);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config is an Object, not an Array. That being said, message adapter will ensure that config is always set, so there's no need for a default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

*/
async function publishSnsMessage(event) {
const config = get(event, 'config', []);
const message = get(event, 'input', []);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input is an Object, not an Array. That being said, message adapter will ensure that config is always set, so there's no need for a default value.

if (granuleId) {
await setGranuleStatus(
granuleId,
get(config, 'stack', null),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of these values are not set, you're passing null in their place. What is setGranuleStatus going to do with a null bucket, or any of the other parameters for that matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code and schema, there is no good way to prevent the user from misconfiguring the parameters.

@@ -0,0 +1,87 @@
'use strict';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these tests are verifying that the correct message was sent to SNS. They are only looking at the return value from the function. Localstack has SNS support, please update the tests to verify that the correct messages are being sent to SNS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the function to return the sns response. If the MessageId is returned, then the message is sent to SNS. There is no easy way to retrieve the message sent to SNS.

Copy link
Contributor

@marchuffnagle marchuffnagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've included a number of comments that need to be addressed before this can be merged.

makeLambdaFunctionFail(message);
}

return message;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jennyhliu I think we should NOT return the whole message here. If the user forgets to include the outputs key in the cumulus config, the message ends up getting larger and larger.

Could you instead just return the output of the sns aws call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code to return the output from sns response, and the README.md

@jennyhliu
Copy link
Contributor Author

@scisco @yjpa7145 made updates based on comments. please review

if (granuleId) {
await setGranuleStatus(
granuleId,
get(config, 'stack'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you aren't looking deep into an object, there is no need for get() here. Simply saying config.stack accomplishes the same thing. The same goes for the next few lines as well.

Copy link
Contributor

@scisco scisco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👏

},
"oneOf": [
{
"required": ["stack", "bucket", "stateMachine", "executionTime"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stack, bucket, and stateMachine should always be required. Should executionTime actually be executionName? I don't see executionTime anywhere in the code, but I do see executionName.

Copy link
Contributor

@marchuffnagle marchuffnagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sf-sns-report task needs to have an input.json schema file defined.

@scisco
Copy link
Contributor

scisco commented Mar 15, 2018

@jennyhliu the input.json should accept any object. Basically whatever you pass to the task it will publish to sns.

@marchuffnagle marchuffnagle dismissed their stale review March 15, 2018 14:19

The concensus is that this all looks good.

@jennyhliu jennyhliu merged commit 449c0fb into master Mar 15, 2018
@scisco scisco deleted the pdrstats branch March 15, 2018 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants