[Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type #353

Merged
merged 2 commits into from
Feb 1, 2024
@@ -72,9 +72,12 @@ object Neo4j2GraphAr {
spark: SparkSession
): Unit = {
// read vertices with label "Person" from Neo4j as a DataFrame
// Note: set "schema.flatten.limit" to 1 so that, as far as possible, no null record is sampled
// and the type is not inferred as string; if you need exact type inference, consider using APOC.
val person_df = spark.read
.format("org.neo4j.spark.DataSource")
.option("query", "MATCH (n:Person) RETURN n.name AS name, n.born as born")
.option("schema.flatten.limit", 1)
Contributor

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

Contributor Author
@acezen, Feb 1, 2024

We cannot avoid this case: Neo4j deduces the schema from the query results, and we cannot change that mechanism. For this example, this is the best solution we can provide (other than using APOC).

Which case do you suggest adding a test for?

Contributor

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.

Contributor Author
@acezen, Feb 1, 2024

The schema.strategy option makes Neo4j sample schema.flatten.limit records to infer the type. This change just reduces schema.flatten.limit (the default is 10) so that Neo4j samples a correct record as far as possible (we assume that correct records are in the majority). Unfortunately, even if we add a test, we cannot guarantee that it would always sample a Long record.
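
The trade-off discussed here can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the connector's actual code: it assumes the first `limit` records are the ones sampled, and that a column is typed Long only when every sampled value is a non-null Long, falling back to String otherwise. The names `SampleInferenceSketch`, `inferColumn`, `InferredLong`, and `InferredString` are hypothetical.

```scala
object SampleInferenceSketch {
  sealed trait InferredType
  case object InferredLong extends InferredType
  case object InferredString extends InferredType

  // ASSUMPTION (toy model, not the connector's implementation): the first
  // `limit` records are sampled, and the column is typed Long only if every
  // sampled value is non-null; otherwise the type falls back to String.
  def inferColumn(records: Seq[Option[Long]], limit: Int): InferredType = {
    val sampled = records.take(limit)
    if (sampled.nonEmpty && sampled.forall(_.isDefined)) InferredLong
    else InferredString
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical "born" values; None stands for a null property.
    val nullLater: Seq[Option[Long]] = Seq(Some(1964L), Some(1967L), None, Some(1961L))
    val nullFirst: Seq[Option[Long]] = Seq(None, Some(1964L), Some(1967L))

    // A larger sample is more likely to include a null record.
    println(inferColumn(nullLater, limit = 10)) // prints "InferredString"
    // A sample of one avoids the null when it is not the first record.
    println(inferColumn(nullLater, limit = 1)) // prints "InferredLong"
    // The reviewer's concern: a null first record still yields String.
    println(inferColumn(nullFirst, limit = 1)) // prints "InferredString"
  }
}
```

Under this model a smaller limit lowers the chance of sampling a null but cannot remove it, which matches both positions in the thread.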

Contributor

I understand Neo4j's current approach to type inference, and that correct records are in the majority. Is it possible to use the first non-null value for type inference? This could make the inferred schema more robust.

Contributor Author

As far as I know, Neo4j does not provide such an approach to inferring data types other than APOC (but that APOC feature is only available in the enterprise edition).

Contributor

I think one possible way is to apply Spark's cast method with a DataType to the relevant column of the DataFrame. If implementing this solution is deemed too complex for our current scope, we can record this issue for future reference and potential enhancement.
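
A minimal sketch of the cast idea, assuming the caller already knows the expected type of each column. The helper `castColumns` and the object name are hypothetical and not part of GraphAr; the in-memory DataFrame stands in for a Neo4j result whose "born" column was inferred as string.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, LongType}

object CastInferredColumns {
  // Hypothetical helper: cast each listed column to the type the caller expects.
  def castColumns(df: DataFrame, expected: Map[String, DataType]): DataFrame =
    expected.foldLeft(df) { case (acc, (name, dataType)) =>
      acc.withColumn(name, col(name).cast(dataType))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for a DataFrame where "born" came back as a string column.
    val personDf = Seq[(String, String)](
      ("Keanu Reeves", "1964"),
      ("Unknown", null)
    ).toDF("name", "born")

    // After the cast, "born" is a long column; null values stay null.
    val fixed = castColumns(personDf, Map("born" -> LongType))
    fixed.printSchema()

    spark.stop()
  }
}
```

Note that a cast only helps when the expected types are known up front; values that cannot be parsed as the target type become null under Spark's default (non-ANSI) cast behavior.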

Contributor Author
@acezen, Feb 1, 2024

In my opinion, it is Neo4j's duty to ensure that the schema is correct (and putting that feature in the enterprise edition is reasonable, so that users pay for it). We should not put this part into GraphAr; it is outside GraphAr's scope. GraphAr just needs to make sure that the schema stays consistent from the DataFrame to the GraphAr files.

Contributor

I understand your perspective. Adjusting the schema.flatten.limit setting might help in this particular scenario, and I agree with modifying it in this example, but I don't think it signifies a bug within GraphAr.

Contributor Author

Agreed, I will retitle the PR as an improvement.

.load()
// put into writer, vertex label is "Person"
writer.PutVertexData("Person", person_df)
@@ -86,6 +89,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (n:Movie) RETURN n.title AS title, n.tagline as tagline"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, vertex label is "Movie"
writer.PutVertexData("Movie", movie_df)
@@ -97,6 +101,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (a:Person)-[r:PRODUCED]->(b:Movie) return a.name as src, b.title as dst"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, source vertex label is "Person", edge label is "PRODUCED"
// target vertex label is "Movie"
@@ -109,6 +114,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (a:Person)-[r:ACTED_IN]->(b:Movie) return a.name as src, b.title as dst"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, source vertex label is "Person", edge label is "ACTED_IN"
// target vertex label is "Movie"
@@ -121,6 +127,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (a:Person)-[r:DIRECTED]->(b:Movie) return a.name as src, b.title as dst"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, source vertex label is "Person", edge label is "DIRECTED"
// target vertex label is "Movie"
@@ -133,6 +140,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (a:Person)-[r:FOLLOWS]->(b:Person) return a.name as src, b.name as dst"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, source vertex label is "Person", edge label is "FOLLOWS"
// target vertex label is "Person"
@@ -145,6 +153,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (a:Person)-[r:REVIEWED]->(b:Movie) return a.name as src, b.title as dst, r.rating as rating, r.summary as summary"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, source vertex label is "Person", edge label is "REVIEWED"
// target vertex label is "Movie"
@@ -157,6 +166,7 @@ object Neo4j2GraphAr {
"query",
"MATCH (a:Person)-[r:WROTE]->(b:Movie) return a.name as src, b.title as dst"
)
.option("schema.flatten.limit", 1)
.load()
// put into writer, source vertex label is "Person", edge label is "WROTE"
// target vertex label is "Movie"