EDSC-4265: Develop API endpoint to dynamically create jupyter notebook #1834

Merged
merged 11 commits into from
Nov 26, 2024

Conversation

dmistry1
Contributor

@dmistry1 dmistry1 commented Nov 20, 2024

Overview

What is the feature?

Create a Lambda that dynamically generates a Jupyter Notebook from the parameters passed in.

What is the Solution?

Created a Node Lambda that dynamically generates a notebook and saves it to an S3 bucket.

Workflow
1. Created a POST `/generateNotebook` endpoint
2. Takes `granuleId`, `boundingBox`, `variableId`, and `referrerUrl` as parameters
3. Calls CMR GraphQL to retrieve the granule information
4. Generates the Jupyter Notebook using the JavaScript Handlebars library
5. Saves the notebook to an S3 bucket on AWS
6. Returns a signed URL for the stored notebook in the response so it can be downloaded
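
As a rough illustration of step 2, the request body could be parsed along these lines. This is a minimal sketch, not the handler from this PR: `parseGenerateNotebookRequest` is a hypothetical name, treating `granuleId` as required is an assumption, and the west, south, east, north ordering of `boundingBox` is inferred from the sample payload in the Testing section.

```javascript
// Hypothetical helper: parses the JSON body sent to the notebook endpoint.
// The field names come from the PR description; the validation rules are assumptions.
function parseGenerateNotebookRequest(body) {
  const {
    granuleId,
    boundingBox,
    variableId,
    referrerUrl
  } = JSON.parse(body)

  // Assumption: a granule is always needed to generate a notebook
  if (!granuleId) {
    throw new Error('granuleId is required')
  }

  let parsedBoundingBox = null

  if (boundingBox) {
    // Assumption: values are ordered west, south, east, north, as in the sample payload
    const values = boundingBox.split(',').map((value) => Number(value.trim()))

    if (values.length !== 4 || values.some(Number.isNaN)) {
      throw new Error('boundingBox must contain four comma-separated numbers')
    }

    const [minLon, minLat, maxLon, maxLat] = values
    parsedBoundingBox = {
      minLon,
      minLat,
      maxLon,
      maxLat
    }
  }

  return {
    granuleId,
    variableId,
    referrerUrl,
    boundingBox: parsedBoundingBox
  }
}
```

Splitting the bounding box into named fields up front keeps the template rendering step free of string handling.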

Changes:
- Implemented new Lambda function `generateNotebook`
- Added GraphQL query to fetch granule information
- Integrated Handlebars for notebook template rendering
- Set up S3 bucket for notebook storage
- Implemented signed URL generation for secure notebook access

Testing

Endpoint: http://localhost:3001/dev/generateNotebook

{
    "granuleId": "G3269187397-POCLOUD",
    "boundingBox": "-86.44922, 24.58316, -81.03516, 30.49084",
    "variableId": "V2028632042-POCLOUD",
    "referrerUrl": "https://search.earthdata.nasa.gov/search/granules?p=C1996881146-POCLOUD&pg[0][v]=f&pg[0][gsk]=-start_date&q=GHRSST%20Level%204%20MUR%20Global%20Foundation%20Sea%20Surface%20Temperature%20Analysis%20(v4.1)&sb[0]=-90.5625%2C22.7481%2C-81.49219%2C30.85594&qt=2024-10-09T00%3A00%3A00.000Z%2C2024-10-10T23%3A59%3A59.999Z&tl=1729527755!3!!&lat=26.28792637835305&long=-92.8916015625&zoom=5"
}

Returns a 307 with the URL to download the file in the response header.
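
For reference, a redirect response of that kind could be shaped roughly as follows. This is a sketch under assumptions: `buildRedirectResponse` is a hypothetical helper, and it assumes the URL is carried in the standard `Location` header, which the description above does not name explicitly.

```javascript
// Hypothetical helper: shapes a Lambda proxy response that redirects to the signed URL.
function buildRedirectResponse(signedUrl) {
  return {
    statusCode: 307,
    headers: {
      // Assumption: the download URL travels in the standard Location header
      Location: signedUrl
    }
  }
}

// Placeholder URL for illustration only
console.log(buildRedirectResponse('https://example.com/notebook.ipynb'))
```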

Also deployed my branch to SIT and verified that the functionality works as expected.

Checklist

  • I have added automated tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings


codecov bot commented Nov 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.52%. Comparing base (c20fbfa) to head (6e5cd37).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1834      +/-   ##
==========================================
+ Coverage   93.50%   93.52%   +0.02%     
==========================================
  Files         772      774       +2     
  Lines       18650    18709      +59     
  Branches     4807     4806       -1     
==========================================
+ Hits        17438    17497      +59     
  Misses       1131     1131              
  Partials       81       81              


@eudoroolivares2016 eudoroolivares2016 changed the title from "Edsc 4265" to "EDSC-4265: Develop API endpoint to dynamically create jupyter notebook" Nov 20, 2024
serverless.yml Outdated
@dmistry1 dmistry1 marked this pull request as ready for review November 21, 2024 19:45
@dmistry1 dmistry1 force-pushed the EDSC-4265 branch 3 times, most recently from 827182f to cdcd1cf Compare November 21, 2024 20:43
@eudoroolivares2016
Contributor

eudoroolivares2016 commented Nov 21, 2024

Writing here so I don't forget. We'll want to update the Deployment section in the README with the VPC values that NGAP has given us for bucket policies.

Something like:
`This application requires known VPC values from NASA Internet Services to properly set up S3 bucket policies

  • Internet_Services_East_VPC
  • Internet_Services_West_VPC`

serverless-configs/aws-resources.yml Outdated
serverless/src/generateNotebook/handler.js
serverless/src/generateNotebook/handler.js Outdated

if (process.env.IS_OFFLINE) {
  config.endpoint = 'http://localhost:4569'
  config.forcePathStyle = true
Contributor

On MMT we set this value for both the offline and deployed versions. Why is it only set for offline mode here?

Contributor Author

This was one of the things that NGAP pointed out. Setting forcePathStyle = true makes the AWS SDK generate path-style S3 URLs: https://s3.<region>.amazonaws.com/<bucket-name>/<key>. But in the deployed app, CloudFront expects the S3 origin to use virtual-hosted-style URLs: https://<bucket-name>.s3.<region>.amazonaws.com/<key>.

According to the AWS documentation, using forcePathStyle creates a conflict with CloudFront because CloudFront routes traffic based on the hostname, which must include the bucket name. Path-style URLs are incompatible with CloudFront's expected configuration.

Locally, where CloudFront isn't in use, forcePathStyle is needed to work with the local S3.
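
To make the difference concrete, here is a minimal sketch of the offline-only setting and the two URL shapes it switches between. `buildS3ClientConfig` and `buildS3ObjectUrl` are hypothetical helpers written for illustration; the real URLs are produced by the AWS SDK, not by code like this, and the bucket and key names below are placeholders.

```javascript
// Hypothetical helper: builds S3 client options the way the diff above does,
// only enabling path-style addressing when running offline.
function buildS3ClientConfig(env) {
  const config = {}

  if (env.IS_OFFLINE) {
    // Local S3 endpoint taken from the diff under review
    config.endpoint = 'http://localhost:4569'
    config.forcePathStyle = true
  }

  return config
}

// Hypothetical helper: illustrates the two URL shapes described in the comment above.
function buildS3ObjectUrl({
  bucket,
  region,
  key,
  forcePathStyle
}) {
  if (forcePathStyle) {
    // Path-style: the bucket name is part of the path
    return `https://s3.${region}.amazonaws.com/${bucket}/${key}`
  }

  // Virtual-hosted-style: the bucket name is part of the hostname
  return `https://${bucket}.s3.${region}.amazonaws.com/${key}`
}

console.log(buildS3ObjectUrl({
  bucket: 'example-bucket',
  region: 'us-east-1',
  key: 'notebook/example.ipynb',
  forcePathStyle: false
}))
```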

serverless/src/generateNotebook/handler.js
package.json Outdated
bin/deploy-bamboo.sh Outdated
bin/deploy-bamboo.sh Outdated
serverless/src/generateNotebook/handler.js
README.md Outdated
- http:
    method: post
    cors: ${file(./serverless-configs/${self:provider.name}-cors-configuration.yml)}
    path: generateNotebook
Collaborator

We like to stick with kebab case for paths, so generate-notebook would be preferred here.

Contributor

Our lambdas actually use underscores instead of hyphens (not sure why). But let's do generate_notebook to be consistent with the other lambda paths.

const parsedNotebook = JSON.parse(renderedNotebookString)

// Generates notebook key
const key = `notebook/rendered_notebook_${granuleId}.ipynb`
Collaborator

@trevorlang trevorlang Nov 26, 2024

I think we should tweak this file name. Users don't really know what a CMR concept id is. A timestamp would prevent collisions in the case we get downloads for the same granule with different parameters.

Something like {granule name}-sample-notebook_{timestamp} might work well.
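
A rough sketch of how such a key could be built is shown below. `buildNotebookKey` is a hypothetical helper; the `notebook/` prefix and `.ipynb` extension are taken from the diff above, while the character replacement rule and the compact UTC timestamp format are assumptions.

```javascript
// Hypothetical helper: builds an S3 key from the granule name and a timestamp.
function buildNotebookKey(granuleName, now = new Date()) {
  // Assumption: anything outside letters, digits, dot, underscore, and hyphen is replaced
  const safeName = granuleName.replace(/[^A-Za-z0-9._-]/g, '_')

  // Compact UTC timestamp, e.g. 2024-11-26T19:44:00.000Z becomes 20241126T194400Z
  const timestamp = now.toISOString()
    .replace(/[-:]/g, '')
    .replace(/\.\d{3}Z$/, 'Z')

  return `notebook/${safeName}-sample-notebook_${timestamp}.ipynb`
}

console.log(buildNotebookKey('example granule.nc'))
```

Second-level timestamps still allow a collision when two requests for the same granule arrive within the same second, so a random suffix may be worth considering if that matters.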

"source": [
"# Define the bounding area\n",
"\n",
"# Select the data within the bounding box applied in Earthdata Search at the time of generation.\n",
Collaborator

@trevorlang trevorlang Nov 26, 2024

This section isn't quite right. I think we can make it a little more useful for the user.

When the user has a bounding box applied, the section should read:

# Select the data within the bounding box that was applied in Earthdata Search.
min_lon = -92.17969
min_lat = 22.19104
max_lon = -80.89453
max_lat = 31.12491

# To select data for the granule encompassing the entire globe, remove the variables above and uncomment the following variable declarations for the coordinate points.
# min_lon = -180
# min_lat = -90
# max_lon =  180
# max_lat =  90

When a custom bounding box is not applied, the section should read:

# Select the data by setting variable declarations for the coordinate points to encompass the entire globe. These values can be updated to subset the data to a different area of interest. They can be changed manually here, or set by applying a bounding box in Earthdata Search before generating a notebook.
min_lon = -180
min_lat = -90
max_lon =  180
max_lat =  90
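
One way the two variants could be produced on the Lambda side is sketched below. This is only an illustration: `buildBoundingBoxSource` is a hypothetical helper (the PR renders the notebook from a Handlebars template instead), the comment text is shortened, and the global defaults assume longitude spans -180 to 180 and latitude spans -90 to 90.

```javascript
// Hypothetical helper: returns the notebook cell "source" lines for the bounding area.
function buildBoundingBoxSource(boundingBox) {
  let lines

  if (boundingBox) {
    const {
      minLon,
      minLat,
      maxLon,
      maxLat
    } = boundingBox

    lines = [
      '# Select the data within the bounding box that was applied in Earthdata Search.',
      `min_lon = ${minLon}`,
      `min_lat = ${minLat}`,
      `max_lon = ${maxLon}`,
      `max_lat = ${maxLat}`
    ]
  } else {
    // Assumption: global extent is longitude -180..180 and latitude -90..90
    lines = [
      '# Select the data for the entire globe. Update these values to subset a different area.',
      'min_lon = -180',
      'min_lat = -90',
      'max_lon = 180',
      'max_lat = 90'
    ]
  }

  // Notebook source arrays end every line except the last with a newline
  return lines.map((line, index) => (index < lines.length - 1 ? `${line}\n` : line))
}

console.log(buildBoundingBoxSource(null).join(''))
```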

"cell_type": "markdown",
"id": "508dcd76-0e18-4f37-ba4f-dd0466ddc7cb",
"metadata": {},
"source": [
Collaborator

I want to add a small section to this bit of text reading "If a bounding box is applied in Earthdata Search when generating this notebook, the bounding box coordinates will be used below."

The updated code should look something like this:

   "source": [
    "## Select a subset of the data using `xarray.DataTree.sel()`\n",
    "\n",
    "The `xarray.DataTree.sel()` function can be used to return a new dataset which has been indexed to a specific bounding area. For large datasets, this can result in improved performance when doing analysis and plotting. \n",
    "\n",
    "If a bounding box is applied in Earthdata Search when generating this notebook, the bounding box coordinates will be used below. \n",
    "\n",
    "Find more information about `xarray.DataTree.sel()` and its parameters in the [xarray.DataTree.sel documentation](https://docs.xarray.dev/en/latest/generated/xarray.DataTree.sel.html)."
   ]

@dmistry1 dmistry1 requested a review from trevorlang November 26, 2024 16:44
EDSC-4265: Fixes quotes

EDSC-4265: Adds CLOUDFRONT_OAI_ID as an env variable

EDSC-4265: Testing S3 env and only us-east for bucket policy

EDSC-4265: Use VPC values from NGAP bucket policies

EDSC-4265: Adds missing sourceVPC

EDSC-4265: Adds support for generateNotebook Lambda
})
})

describe('when bounding field are is provided', () => {
Collaborator

Should probably change this to something like "when a bounding box is provided"

@dmistry1 dmistry1 merged commit 2d59b50 into main Nov 26, 2024
11 checks passed
@dmistry1 dmistry1 deleted the EDSC-4265 branch November 26, 2024 19:44
4 participants