Feature: As a devops engineer, I want an aissemble-managed helm chart for the Hive metastore service that uses a newer version of Hive, so I have access to the latest security fixes. #127

Closed · peter-mcclonski opened this issue Jun 5, 2024 · 2 comments
Assignee: peter-mcclonski · Label: enhancement (New feature or request) · Milestone: 1.8.0

@peter-mcclonski (Contributor) commented Jun 5, 2024

Description

To improve usability and maintainability, we will migrate to a v2 chart for the Hive metastore service, keeping a usage pattern similar to the one seen in #103. This ticket also encompasses #116, updating the underlying Hive metastore version.

Definition of Done

  • Update hive-metastore-service docker image to use Hive 4.0.0
  • Validate that the current v2 hive-metastore-service helm chart functions as expected
    • If not, make necessary updates to ensure functionality.
    • Refactor chart to live under extensions-helm-spark-infrastructure
  • Update generated values/Chart file in downstream projects using the v2 profile with sensible defaults

Test Strategy/Script

  1. Generate a new project using the following command:
mvn archetype:generate -B -DarchetypeGroupId=com.boozallen.aissemble \
                          -DarchetypeArtifactId=foundation-archetype \
                          -DarchetypeVersion=1.8.0-SNAPSHOT \
                          -DartifactId=test-project \
                          -DgroupId=org.test \
                          -DprojectName='Test' \
                          -DprojectGitUrl=test.org/test-project \
&& cd test-project
  2. Add the following pipeline to test-project-pipeline-models/src/main/resources/pipelines/
{
  "name": "PysparkPersist",
  "package": "com.boozallen",
  "type": {
    "name": "data-flow",
    "implementation": "data-delivery-pyspark"
  },
  "steps": [
    {
      "name": "PersistData",
      "type": "synchronous",
      "persist": {
        "type": "hive"
      }
    }
  ]
}
  3. Add the following record to test-project-pipeline-models/src/main/resources/records/
{
  "name": "CustomRecord",
  "package": "com.boozallen.aiops.mda.pattern.record",
  "description": "Example custom record for Pyspark Data Delivery Patterns",
  "fields": [
    {
      "name": "customField",
      "type": {
        "name": "customType",
        "package": "com.boozallen.aiops.mda.pattern.dictionary"
      }
    }
  ]
}
  4. Add the following dictionary to test-project-pipeline-models/src/main/resources/dictionaries/
{
  "name": "PysparkDataDeliveryDictionary",
  "package": "com.boozallen.aiops.mda.pattern.dictionary",
  "dictionaryTypes": [
    {
      "name": "customType",
      "simpleType": "string"
    }
  ]
}
  5. Execute mvn clean install -Dmaven.build.cache.skipCache=true repeatedly, resolving all presented manual actions until none remain.
  6. Within test-project-deploy/pom.xml, replace aissemble-spark-infrastructure-deploy with aissemble-spark-infrastructure-deploy-v2
  7. Delete the directory test-project-deploy/src/main/resources/apps/spark-infrastructure
  8. Delete all references to hive-metastore-service from your Tiltfile
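If it helps, steps 7 and 8 can be partially scripted. A minimal sketch, assuming the Tiltfile sits at the project root (the Tiltfile cleanup itself remains a manual edit):
# Step 7: remove the v1 spark-infrastructure chart resources from the deploy module
rm -rf test-project-deploy/src/main/resources/apps/spark-infrastructure
# Step 8: list any remaining hive-metastore-service references to remove by hand
grep -n 'hive-metastore-service' Tiltfile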
  9. Within test-project-pipelines/test-project-data-access/src/main/resources/application.properties, set quarkus.datasource.jdbc.url to jdbc:hive2://spark-infrastructure-sts-service:10001/default;transportMode=http;httpPath=cliservice
  10. Within test-project-pipelines/pyspark-persist/src/pyspark_persist/step/persist_data.py, define the implementation for execute_step_impl as follows:
    def execute_step_impl(self) -> None:
        from ..record.custom_record import CustomRecord
        from ..schema.custom_record_schema import CustomRecordSchema

        # Build a two-record DataFrame from the generated record and schema classes
        custom_record = CustomRecord.from_dict({"customField": "foo"})
        record2 = CustomRecord.from_dict({"customField": "bar"})
        df = self.spark.createDataFrame(
            [
                custom_record,
                record2
            ],
            CustomRecordSchema().struct_type
        )

        # Persist the DataFrame to the Hive-backed table configured for this step
        self.save_dataset(df, "my_new_table")
  11. Replace the contents of test-project-pipelines/pyspark-persist/src/pyspark_persist/resources/apps/pyspark-persist-dev-values.yaml with the following:
sparkApp:
    spec:
      image: "test-project-spark-worker-docker:latest"
      sparkConf:
        spark.eventLog.enabled: "false"
        spark.sql.catalogImplementation: "hive"
        spark.eventLog.dir: "s3a://spark-infrastructure/spark-events"
        spark.hadoop.fs.s3a.endpoint: "http://s3-local:4566"
        spark.hadoop.fs.s3a.access.key: "123"
        spark.hadoop.fs.s3a.secret.key: "456"
        spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
        spark.hive.server2.thrift.port: "10000"
        spark.hive.server2.thrift.http.port: "10001"
        spark.hive.server2.transport.mode: "http"
        spark.hive.metastore.warehouse.dir: "s3a://spark-infrastructure/warehouse"
        spark.hadoop.fs.s3a.path.style.access: "true"
        spark.hive.server2.thrift.http.path: "cliservice"
        spark.hive.metastore.schema.verification: "false"
        spark.hive.metastore.uris: "thrift://hive-metastore-service:9083/default"
      driver:
        cores: 1
        memory: "2048m"
      executor:
        cores: 1
        memory: "2048m"
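Once the stack is up (step 15), a quick sanity check of the metastore wiring above; this assumes the v2 chart keeps the hive-metastore-service service name referenced in spark.hive.metastore.uris:
# Confirm the metastore service referenced by spark.hive.metastore.uris has been created
kubectl get svc hive-metastore-service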
  12. Execute mvn clean install -Dmaven.build.cache.skipCache=true once.
  13. Use kubectl apply -f to apply the following yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-config
data: {}
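For example, after saving the manifest above to a file (the name spark-config.yaml is arbitrary):
# Create the empty spark-config ConfigMap from the manifest above
kubectl apply -f spark-config.yaml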
  14. To avoid an unrelated bug, open your Tiltfile and remove the entry for pipeline-invocation-service.
  15. Execute tilt up
  16. Once all resources are ready, trigger the pyspark-persist pipeline
  17. Use kubectl get pods | grep data-access to get the name of the data access pod.
  18. Use kubectl exec -it <DATA_ACCESS_POD_NAME> -- bash to enter the data access pod
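If preferred, steps 17 and 18 can be combined. A sketch that assumes the pod name appears in the first column of kubectl get pods (the default output):
# Resolve the data-access pod name, then open an interactive shell inside it
DATA_ACCESS_POD=$(kubectl get pods | grep data-access | awk '{print $1}')
kubectl exec -it "$DATA_ACCESS_POD" -- bash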
  19. Execute curl -X POST localhost:8080/graphql -H "Content-Type: application/json" -d '{ "query": "{ CustomRecord(table: \"my_new_table\") { customField } }" }' and ensure that data including two records is returned, e.g.: {"data":{"CustomRecord":[{"customField":null},{"customField":null}]}}
  Note on step 19: if you don't get any values back, run kubectl get svc | grep sts in a separate terminal; it can take a minute or two for the service to be provisioned.
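A rough sketch of that check, run from the host while the query from step 19 is retried inside the data access pod:
# Poll until the Spark Thrift Server (sts) service from step 9 has been provisioned
until kubectl get svc | grep -q sts; do
  echo "Waiting for spark-infrastructure-sts-service..."
  sleep 10
done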

References/Additional Context

@peter-mcclonski peter-mcclonski added the enhancement New feature or request label Jun 5, 2024
@peter-mcclonski peter-mcclonski self-assigned this Jun 5, 2024
peter-mcclonski added a commit to peter-mcclonski/aissemble that referenced this issue Jun 7, 2024
@Cho-William (Contributor) commented:

OTS completed

peter-mcclonski added several commits to peter-mcclonski/aissemble that referenced this issue Jun 11, 2024
@ewilkins-csi ewilkins-csi added this to the 1.8.0 milestone Jun 12, 2024
peter-mcclonski added several commits to peter-mcclonski/aissemble that referenced this issue Jun 12–13, 2024
peter-mcclonski added a commit that referenced this issue Jun 13, 2024: #127 #116 Hive Metastore Service v2 chart and Hive upgrade
peter-mcclonski added further commits that referenced this issue Jun 13, 2024
@csun-cpointe (Contributor) commented:

final test passed!!
