Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate array item types from SQL to Avro schema #931

Merged
merged 7 commits into from
Dec 6, 2024

Conversation

shnapz
Copy link
Contributor

@shnapz shnapz commented Nov 26, 2024

Before the change it generated this:

  }, {
    "name" : "arr1",
    "type" : [ "null", {
      "type" : "array"
    } ],
    "doc" : "From sqlType 2003 ARRAY (java.sql.Array)",
    "default" : null,
    "typeName" : "ARRAY",
    "columnClassName" : "java.sql.Array",
    "sqlCode" : "2003",
    "columnName" : "arr1"
  } ],

after the change it generetes:

  }, {
    "name" : "arr1",
    "type" : [ "null", {
      "type" : "array",
      "items" : "string"
    } ],
    "doc" : "From sqlType 2003 ARRAY (java.sql.Array)",
    "default" : null,
    "typeName" : "ARRAY",
    "columnClassName" : "java.sql.Array",
    "sqlCode" : "2003",
    "columnName" : "arr1"
  } ],
  • Executed e2e (Postgres) and manually inspected output
  • Modified existing unit tests to include typed arrays and assert the correct behavior on in-memory H2 DB

Checklist for PR author(s)

  • Changes are covered by unit tests (no major decrease in code coverage %) and/or integration tests.
  • Ensure code formating (use mvn com.coveo:fmt-maven-plugin:format org.codehaus.mojo:license-maven-plugin:update-file-header)
  • Document any relevant additions/changes in the appropriate spot in javadocs/docs/README.

@shnapz shnapz requested a review from labianchin November 26, 2024 22:03
@@ -74,6 +74,8 @@ public static Schema createSchemaByReadingOneRow(
try (Statement statement = connection.createStatement()) {
final ResultSet resultSet = statement.executeQuery(queryBuilderArgs.sqlQueryWithLimitOne());

resultSet.next();
Copy link
Contributor Author

@shnapz shnapz Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is needed to call resultSet.getArray(i). Also it corresponds to the method name

@@ -225,10 +236,12 @@ private static SchemaBuilder.FieldAssembler<Schema> createAvroFields(
} else {
return field.bytesType();
}
case ARRAY:
return setAvroColumnType(arrayItemType, null, precision, columnClassName,
useLogicalTypes, field.array().items());
Copy link
Contributor Author

@shnapz shnapz Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you run e2e tests this change will result in:

  }, {
    "name" : "arr1",
    "type" : [ "null", {
      "type" : "array",
      "items" : "string"
    } ],
    "doc" : "From sqlType 2003 ARRAY (java.sql.Array)",
    "default" : null,
    "typeName" : "ARRAY",
    "columnClassName" : "java.sql.Array",
    "sqlCode" : "2003",
    "columnName" : "arr1"
  } ],

Copy link

codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.

Project coverage is 91.65%. Comparing base (9c72eac) to head (e913ef4).
Report is 1 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #931      +/-   ##
============================================
+ Coverage     91.47%   91.65%   +0.17%     
- Complexity      243      252       +9     
============================================
  Files            26       26              
  Lines           927      947      +20     
  Branches         67       71       +4     
============================================
+ Hits            848      868      +20     
  Misses           52       52              
  Partials         27       27              

e2e/e2e.sh Outdated
time docker run --interactive --rm \
--net="$DOCKER_NETWORK" \
--mount="type=bind,source=$PROJECT_PATH/dbeam-core/target,target=/dbeam" \
$OUTPUT_MOUNT_EXP \
Copy link
Contributor Author

@shnapz shnapz Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed to get test output from container to the local folder. This is how I verified the change

Copy link
Collaborator

@labianchin labianchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

  1. Note this might be a schema breaking change for exports from ARRAY. I guess that is not too frequent, so it might be fine.
  2. Consider adding/improving tests to JdbcAvroSchemaTest, JdbcAvroRecordTest. And perhaps adding array type to dbeam-core/src/test/java/com/spotify/dbeam/Coffee.java fixture.

e2e/e2e.sh Outdated Show resolved Hide resolved
@RustedBones
Copy link
Contributor

Looking at the beam jdbc schema conversion here, it relies on the getColumnTypeName to infer the element type.

See in the tests:

  • getColumnType -> ARRAY
  • getColumnTypeName -> INTEGER

@shnapz
Copy link
Contributor Author

shnapz commented Nov 27, 2024

Looking at the beam jdbc schema conversion here, it relies on the getColumnTypeName to infer the element type.

See in the tests:

  • getColumnType -> ARRAY
  • getColumnTypeName -> INTEGER

I am not in favor of Beam's approach, because getColumnTypeName does:
Retrieves the designated column's database-specific type name.
that in fact corresponds to Postgres's native types https://www.postgresql.org/docs/current/datatype.html like text, point, inet
and when Beam calls:
JDBCType.valueOf(md.getColumnTypeName(index));
it coincidentally works for all types that are common between JDBCType and database specific like integer, bigint, but won't work for anything else.
So there should be a logic like:

if (db == "postgres") {
  // match Postgres types type
} else if (db == "mysql") {
  //
}

but the solution I made doesn't need this at all, because resultSet.getArray(i).getBaseType() already returns the kind of type that our code already works with

Copy link
Contributor

@RustedBones RustedBones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the doc here ?

e2e/e2e.sh Outdated Show resolved Hide resolved
e2e/e2e.sh Outdated Show resolved Hide resolved
shnapz and others added 4 commits December 6, 2024 11:31
@shnapz shnapz merged commit addacc3 into master Dec 6, 2024
8 checks passed
@shnapz shnapz deleted the akabas/ArrayItemTypes branch December 6, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants