-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type #353
Merged
Merged
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can not avoid such case since Neo4j use result to deduce the schema and we can not change the mechanism
of neo4j, this is the best solution we can provide for example.(except use the APOC)
What case you suggest to add a test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema.strategy would make neo4j sample
schema.flatten.limit
records to infer type. This change is just minish theschema.flatten.limit
(default is 10) to let neo4j sample the right record as far as possible( we assume that correct records are in majority). Sadly we can not guarantee that the test would always sample a Long record if we add.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand Neo4j's current approach to type inference, and that correct records are in majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I know, Neo4j does not provide such approach to infer data type except using APOC(but apoc feature is ononly available in enterprise edition).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a possible way is to apply the cast method (of spark) with DataType on the column of the DataFrame. If implementing this solution is deemed too complex for our current scope, we can record this issue for future reference and potential enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, it's Neo4j's duty to ensure the schema's correct( and they put the feature to enterprise edition, is reasonable, to make user to pay money), we should not put these part to GraphAr and it's not GraphAr's scope. GraphAr just need to make sure that the schema is consistent from DataFrame to the graphar files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understand you perspective. With respect to the schema.flatten.limit setting, although adjusting it might help in this particular scenario, and I agree with modifying it in this example, I think it does not signify a bug within GraphAr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, I would retitle the PR to
improvement
.