Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type #353

Merged
merged 2 commits into from
Feb 1, 2024

Conversation

acezen
Copy link
Contributor

@acezen acezen commented Feb 1, 2024

Proposed changes

related issue #347

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc...

@acezen acezen requested a review from lixueclaire February 1, 2024 03:34
val person_df = spark.read
.format("org.neo4j.spark.DataSource")
.option("query", "MATCH (n:Person) RETURN n.name AS name, n.born as born")
.option("schema.flatten.limit", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

Copy link
Contributor Author

@acezen acezen Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism
of neo4j, this is the best solution we can provide for example.(except use the APOC)

What case you suggest to add a test case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)

What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

Copy link
Contributor Author

@acezen acezen Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

The schema.strategy would make neo4j sample schema.flatten.limit records to infer type. This change is just minish the schema.flatten.limit(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

The schema.strategy would make neo4j sample schema.flatten.limit records to infer type. This change is just minish the schema.flatten.limit(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.

I understand Neo4j's current approach to type inference, and that correct records are in majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

The schema.strategy would make neo4j sample schema.flatten.limit records to infer type. This change is just minish the schema.flatten.limit(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.

I understand Neo4j's current approach to type inference, and that correct records are in majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.

As I know, Neo4j does not provide such approach to infer data type except using APOC(but apoc feature is ononly available in enterprise edition).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

The schema.strategy would make neo4j sample schema.flatten.limit records to infer type. This change is just minish the schema.flatten.limit(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.

I understand Neo4j's current approach to type inference, and that correct records are in majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.

As I know, Neo4j does not provide such approach to infer data type except using APOC(but apoc feature is ononly available in enterprise edition).

I think a possible way is to apply the cast method (of spark) with DataType on the column of the DataFrame. If implementing this solution is deemed too complex for our current scope, we can record this issue for future reference and potential enhancement.

Copy link
Contributor Author

@acezen acezen Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

The schema.strategy would make neo4j sample schema.flatten.limit records to infer type. This change is just minish the schema.flatten.limit(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.

I understand Neo4j's current approach to type inference, and that correct records are in majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.

As I know, Neo4j does not provide such approach to infer data type except using APOC(but apoc feature is ononly available in enterprise edition).

I think a possible way is to apply the cast method (of spark) with DataType on the column of the DataFrame. If implementing this solution is deemed too complex for our current scope, we can record this issue for future reference and potential enhancement.

In my opinion, it's Neo4j's duty to ensure the schema's correct( and they put the feature to enterprise edition, is reasonable, to make user to pay money), we should not put these part to GraphAr and it's not GraphAr's scope. GraphAr just need to make sure that the schema is consistent from DataFrame to the graphar files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

The schema.strategy would make neo4j sample schema.flatten.limit records to infer type. This change is just minish the schema.flatten.limit(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.

I understand Neo4j's current approach to type inference, and that correct records are in majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.

As I know, Neo4j does not provide such approach to infer data type except using APOC(but apoc feature is ononly available in enterprise edition).

I think a possible way is to apply the cast method (of spark) with DataType on the column of the DataFrame. If implementing this solution is deemed too complex for our current scope, we can record this issue for future reference and potential enhancement.

In my opinion, it's Neo4j's duty to ensure the schema's correct( and they put the feature to enterprise edition, is reasonable, to make user to pay money), we should not put these part to GraphAr and it's not GraphAr's scope. GraphAr just need to make sure that the schema is consistent from DataFrame to the graphar files.

Understand you perspective. With respect to the schema.flatten.limit setting, although adjusting it might help in this particular scenario, and I agree with modifying it in this example, I think it does not signify a bug within GraphAr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, I would retitle the PR to improvement.

@acezen acezen changed the title [BugFix][Spark] Fix neo4j generate DataFrame got wrong data type bug [BugFix][Spark] Try to fix neo4j generate DataFrame got wrong data type bug Feb 1, 2024
@acezen acezen changed the title [BugFix][Spark] Try to fix neo4j generate DataFrame got wrong data type bug [Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type Feb 1, 2024
@acezen acezen merged commit 4889596 into apache:main Feb 1, 2024
4 checks passed
@acezen acezen deleted the 347-neo4j2graph-fix branch February 1, 2024 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants