Skip to content

Conversation

@gaborgsomogyi
Copy link
Contributor

@gaborgsomogyi gaborgsomogyi commented Apr 14, 2020

What changes were proposed in this pull request?

When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues.

In this PR I've added DB2 support (other supported databases will come in later PRs).

What this PR contains:

  • Added DB2ConnectionProvider
  • Added DB2ConnectionProviderSuite
  • Added DB2KrbIntegrationSuite docker integration test
  • Changed DB2 JDBC driver to use the latest (test scope only)
  • Changed test table data type to a type which is supported by all the databases
  • Removed double connection creation on test side
  • Increased connection timeout in docker tests because DB2 docker takes quite a time to start

Why are the changes needed?

Missing JDBC kerberos support.

Does this PR introduce any user-facing change?

Yes, now user is able to connect to DB2 using kerberos.

How was this patch tested?

  • Additional + existing unit tests
  • Additional + existing integration tests
  • Test on cluster manually



@DockerTest
@Ignore // AMPLab Jenkins needs to be updated before shared memory works on docker
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant change. Since docker tests are not integrated into jenkins we can turn this on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing that other test suites for other DBMS don't have this, so good to remove to make it consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really following the discussion; are you guys saying this line should be removed? Because there's nothing changing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as other tests don't have this. It's not a kind of "should be", but "can be".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it's removed in code diff as of now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it removed. It's still there.

Copy link
Contributor

@HeartSaVioR HeartSaVioR Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion, I should be more clearer - the change is removed, in other words, rolled back. No change.

/**
* Parameter whether the container should run privileged.
*/
val privileged: Boolean = false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB2 docker requires privileged run.

val conn = java.sql.DriverManager.getConnection(jdbcUrl)
conn.close()
var conn: Connection = null
eventually(timeout(2.minutes), interval(1.second)) {
Copy link
Contributor Author

@gaborgsomogyi gaborgsomogyi Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single connection simplification + timeout increase.


override def dataPreparation(conn: Connection): Unit = {
conn.prepareStatement("CREATE TABLE bar (c0 text)").executeUpdate()
conn.prepareStatement("CREATE TABLE bar (c0 VARCHAR(8))").executeUpdate()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB2 doesn't support text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if this change affects others, but it may be worth to test others manually and mention the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When doing modifications I'm always re-executing all of them. This has happened here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And not to avoid the question all passed :)

USERPROFILE=/database/config/db2inst1/sqllib/userprofile
echo "export DB2_KRB5_PRINCIPAL=db2/__IP_ADDRESS_REPLACE_ME__@EXAMPLE.COM" >> $USERPROFILE
echo "export KRB5_KTNAME=/var/custom/db2.keytab" >> $USERPROFILE
su - db2inst1 -c "db2set DB2ENVLIST=KRB5_KTNAME"
Copy link
Contributor Author

@gaborgsomogyi gaborgsomogyi Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trick is needed because DB2 forwards environment variables automatically only if it's starting with DB2 (KRB5_KTNAME doesn't fit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add this as "comment" to reduce the hops to finally reach this comment on understanding this trick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, added.

@gaborgsomogyi
Copy link
Contributor Author

cc @HeartSaVioR

@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121279 has finished for PR 28215 at commit 59d34fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Copy link
Contributor Author

Looks unrelated.

@gaborgsomogyi
Copy link
Contributor Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121287 has finished for PR 28215 at commit 59d34fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks great in overall. As I commented in earlier PRs, still not be able to run the tests (and it's tricky to modify my local env. only for running tests), so just assumed you've run the tests manually.

USERPROFILE=/database/config/db2inst1/sqllib/userprofile
echo "export DB2_KRB5_PRINCIPAL=db2/__IP_ADDRESS_REPLACE_ME__@EXAMPLE.COM" >> $USERPROFILE
echo "export KRB5_KTNAME=/var/custom/db2.keytab" >> $USERPROFILE
su - db2inst1 -c "db2set DB2ENVLIST=KRB5_KTNAME"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add this as "comment" to reduce the hops to finally reach this comment on understanding this trick.


override def dataPreparation(conn: Connection): Unit = {
conn.prepareStatement("CREATE TABLE bar (c0 text)").executeUpdate()
conn.prepareStatement("CREATE TABLE bar (c0 VARCHAR(8))").executeUpdate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if this change affects others, but it may be worth to test others manually and mention the result.

@gaborgsomogyi
Copy link
Contributor Author

Re-executed the docker tests and passed.

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121369 has finished for PR 28215 at commit eb4878c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121374 has finished for PR 28215 at commit eb4878c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming manual tests were passed

@gaborgsomogyi
Copy link
Contributor Author

cc @dongjoon-hyun @vanzin since you know this area well

@dongjoon-hyun
Copy link
Member

Hi, @gaborgsomogyi . It would be great if you proceed Updating and enabling DB2IntegrationSuite in a separate PR first.

@Ignore // AMPLab Jenkins needs to be updated before shared memory works on docker
class DB2IntegrationSuite extends DockerJDBCIntegrationSuite {
override val db = new DatabaseOnDocker {
override val imageName = "lresende/db2express-c:10.5.0.5-3.10.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this image to the official one like ibmcom/db2:11.5.0.0a. It would be great if we use the same DB2 version in both DB2IntegrationSuite and DB2KrbIntegrationSuite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above comment is for the new PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worth a separate PR because it's a non-trivial change.

@SparkQA
Copy link

SparkQA commented Apr 20, 2020

Test build #121526 has finished for PR 28215 at commit e56c7bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Apr 22, 2020

The check failure is valid, it's failing to compile in that environment.

@HeartSaVioR
Copy link
Contributor

Oh, nice catch. That compilation error seems to only come with JDK11.

That said, Spark PR builder may need to run at least two different environments; that's lucky we found the issue from Github Action, but it has been the source of false alarm hence not considered seriously in many case. E.g. we may not able to find this if we see failure in Linter build side on Github Action, as building Spark will be cancelled if Linter build fails.

@gaborgsomogyi
Copy link
Contributor Author

Ah gosh! Fixed it.

@SparkQA
Copy link

SparkQA commented Apr 22, 2020

Test build #121616 has finished for PR 28215 at commit 1ecacec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Apr 23, 2020

Merging to master.

@dongjoon-hyun
Copy link
Member

Is there additional required for this? All the other test (including new DB2IntegreationSuite in #28325) passed, but this one fail.

...
DB2KrbIntegrationSuite:
org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite - The code passed to eventually never returned normally. Attempted 128 times over 2.016876443166667 minutes. Last failure message: Login failure for db2/10.0.0.6@EXAMPLE.COM from keytab /Users/dongjoon/PRS/SPARK-PR-28325/external/docker-integration-tests/target/tmp/spark-35512019-33ef-4cad-a54a-2bec69a3d4c2/db2.keytab. (DockerJDBCIntegrationSuite.scala:158)
...
Run completed in 4 minutes, 27 seconds.
Total number of tests run: 42
Suites: completed 8, aborted 1
Tests: succeeded 42, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***

dongjoon-hyun pushed a commit that referenced this pull request Apr 25, 2020
… the DB2 docker inside

### What changes were proposed in this pull request?
This is a followup PR discussed [here](#28215 (comment)).

### Why are the changes needed?
It would be good to re-enable `DB2IntegrationSuite` and upgrade the docker image inside to use the latest.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing docker integration tests.

Closes #28325 from gaborgsomogyi/SPARK-31533.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Copy link
Member

Hi, Guys.
Currently, all the other suites including DB2IntegrationSuite passed . Only DB2KrbIntegrationSuite fails. This PR seems to need more verification. Although this was verified manually, the newly added integration test should be valid.

How can we handle this commit?

@gaborgsomogyi
Copy link
Contributor Author

Is it fine to check it on Monday as a start?

@gaborgsomogyi
Copy link
Contributor Author

Logs would be helpful from test side and from docker instance side to see what have gone wrong.

@dongjoon-hyun
Copy link
Member

Sure, no rush for this because this is an integration test we will not revert this urgently. I just wondered this IT was tested or not when this PR was merged.

@gaborgsomogyi
Copy link
Contributor Author

I'm re-executing them after each change because it's only couple of minutes. I've just tried it out and works on my machine. On Monday I'll put it into a loop but logs would speed up the process because such case we dont have to spend time on reproduction.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Apr 26, 2020

Got it. Could you share your work environment? Then, I can try your way if possible. I tried master branch in my two separate machines, but there was no luck. I'm using Docker Desktop community (2.3.0.0) on both MacBook Pro and MacPro.

For the record, in two machines, this test suite consecutively fails several times (I also increased the resources up to 10 cores and 22GB memory) and never succeeds in this environment.

@dongjoon-hyun
Copy link
Member

Hi, @gatorsmile and @maropu .
Could you verify this new JDBC integration test suite additionally, please?
I'm asking your help since it might be a dev environment issue and I failed so far.

override protected def setAuthentication(keytabFile: String, principal: String): Unit = {
val config = new SecureConnectionProvider.JDBCConfiguration(
Configuration.getConfiguration, "JaasClient", keytabFile, principal)
Configuration.setConfiguration(config)
Copy link
Member

@maropu maropu Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe when scanning tables in different secure databases ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I'll create a separate jira to handle config synchronisation globally...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark can scan different JDBC relations concurrently though, could we synchronized them easily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution shouldn't be complicated but it effects all other providers which change the configuration (not just DB2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed https://issues.apache.org/jira/browse/SPARK-31575 to handle the issue.

@gaborgsomogyi
Copy link
Contributor Author

@dongjoon-hyun here is my environment (just checked the latest master and still works):

  • MacBook Pro
  • macOS Catalina 10.15.3 (19D76)
  • openjdk version "1.8.0_242"
  • Docker desktop community 2.1.0.5 (40693)

@gaborgsomogyi
Copy link
Contributor Author

That said maybe my environment is just a lucky one and would be good to take a look at the logs from test side and from container side as well.

@maropu
Copy link
Member

maropu commented Apr 27, 2020

I run it by myself and I checked it passed on my MacOS (Sierra, Docker Desktop community v2.3.0.0).

@dongjoon-hyun
Copy link
Member

Thank you, @maropu !

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Apr 27, 2020

Thank you, @gaborgsomogyi .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants