[SPARK-27712][PySpark][SQL] Returns correct schema even under different column order when creating dataframe #24614
viirya wants to merge 1 commit into apache:master
Conversation
Test build #105414 has finished for PR 24614 at commit
This is more interesting, as we allow something like:

```python
data = [Row(key=i, value=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")
```

So, the question is which behavior should currently be considered the correct one.
It is inconsistent in two cases, obviously. The difference is also seen in the following case: field names can't differ if the data comes from a local list of `Row`s.

```python
>>> spark.createDataFrame([Row(A="1", B="2")], "B string, a string").first()
Traceback (most recent call last):
  File "/Users/viirya/repos/spark-1/python/pyspark/sql/types.py", line 1527, in __getitem__
    idx = self.__fields__.index(item)
ValueError: 'a' is not in list
```
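The traceback above comes down to a plain case-sensitive list lookup. A minimal sketch of that lookup, assuming `Row(A="1", B="2")` keeps the field names `['A', 'B']` (the `fields` list here is a stand-in for the real `Row.__fields__`):

```python
# Hypothetical reproduction of the name lookup inside Row.__getitem__:
# the schema "B string, a string" asks for lowercase 'a', but the Row
# only knows uppercase 'A', and list.index is case-sensitive.
fields = ['A', 'B']          # stand-in for Row.__fields__ (assumption)
try:
    idx = fields.index('a')  # mirrors self.__fields__.index(item)
except ValueError as err:
    print(err)               # -> 'a' is not in list
```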
Yea, I think I actually discussed this with @BryanCutler somewhere before. I forget what we ended up with. Bryan, do you remember which one we considered as the correct case? I remember we considered
Ah yes, there are all kinds of inconsistencies in the PySpark `Row` class. I think this is a duplicate of SPARK-22232, and we discussed it in the PR here #20280. The fix there was to also pickle the
The problem is when using a positional schema while constructing the DataFrame. This works:

```python
data = [Row(k=i, v=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")
```

This fails:

```python
data = [Row(z=i, y=str(i)) for i in range(100)]
rdd = spark.sparkContext.parallelize(data, 5)
# field names can differ.
df = rdd.toDF(" a: int, b: string ")
```

where the only difference is the field names used.
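Why the field names matter even with a positional schema: when a `Row` is built from keyword arguments, pre-3.0 PySpark sorts the field names alphabetically, so the values' positions are determined by the names. A minimal sketch (assuming that sorting behavior; `MiniRow` is a hypothetical stand-in, not the real `pyspark.sql.Row`):

```python
# MiniRow mimics only the kwarg-sorting aspect of pyspark.sql.Row:
# field names are sorted alphabetically, and values are stored in
# that order.
class MiniRow(tuple):
    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # alphabetical field order
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        return row

# Row(k=..., v=...): sorted order (k, v) puts the int first, so it
# lines up with the positional schema " a: int, b: string ".
print(tuple(MiniRow(k=1, v="1")))   # -> (1, '1')

# Row(z=..., y=...): sorted order (y, z) puts the string first, so
# the int column 'a' receives a string and toDF fails.
print(tuple(MiniRow(z=1, y="1")))   # -> ('1', 1)
```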
This is a duplicate issue and there was discussion before, so I'll close this.
What changes were proposed in this pull request?
In PySpark, a `Row`'s `__from_dict__` is lost after pickling, but we rely on `__from_dict__` when converting `Row`s to their internal representation by calling `toInternal`. This causes weird behavior. This patch tries to fix the issue.
How was this patch tested?
Added test.