linhr commented Apr 3, 2025

After this change, the `error decoding response body` error no longer occurs when running benchmarks involving large datasets (q18 and sometimes q21 for the derived TPC-H benchmark at SF=1000).

It's still unclear to me whether this change introduces performance issues. The benchmark execution time has high variance, so more analysis is needed to confirm there is no performance degradation in other parts of the system.

}

pub fn next(&mut self) -> Option<WorkerId> {
    self.slots.shuffle(&mut rng());

I tried randomizing task assignment and it didn't seem to help. I guess future work is needed for more sophisticated task assignment logic.
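
For readers without the diff at hand, here is a minimal sketch of what shuffle-based assignment amounts to. It is not the code in this PR: `WorkerSlots`, the `(worker, free slots)` layout, and the tiny `XorShift64` generator (a dependency-free stand-in for the `rand` crate) are all assumptions made for illustration.

```rust
/// Hypothetical worker identifier; the real `WorkerId` type is not shown here.
pub type WorkerId = u64;

/// Tiny xorshift64 generator, used only so the sketch needs no external crate.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        Self(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical slot table: each entry is (worker, number of free task slots).
pub struct WorkerSlots {
    slots: Vec<(WorkerId, usize)>,
    rng: XorShift64,
}

impl WorkerSlots {
    pub fn new(capacity: &[(WorkerId, usize)], seed: u64) -> Self {
        Self {
            slots: capacity.to_vec(),
            rng: XorShift64::new(seed),
        }
    }

    /// Shuffles the table, then takes one slot from the first worker that has capacity.
    /// Returns `None` when every worker is full.
    pub fn next(&mut self) -> Option<WorkerId> {
        // Fisher-Yates shuffle so no worker is systematically preferred.
        for i in (1..self.slots.len()).rev() {
            let j = (self.rng.next_u64() % (i as u64 + 1)) as usize;
            self.slots.swap(i, j);
        }
        let slot = self.slots.iter_mut().find(|slot| slot.1 > 0)?;
        slot.1 -= 1;
        Some(slot.0)
    }
}
```

Randomizing only changes which worker with free capacity is chosen first. It does not account for data locality or current load, which is presumably where more sophisticated logic would have to come in.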

github-actions bot commented Apr 5, 2025

Spark Test Report

Commit Information

Commit  Revision  Branch
After   8a486e5   refs/pull/432/merge
Before  cb555bc   refs/heads/main

Test Summary

Suite              Commit  Failed  Passed  Skipped  Warnings  Time (s)
doctest-catalog    After       13      12                  3      5.84
                   Before      13      12                  3      5.58
doctest-column     After        1      32                  3      6.26
                   Before       1      32                  3      6.10
doctest-dataframe  After       29      77        1         4      8.74
                   Before      29      77        1         4      8.71
doctest-functions  After      139     263        7         8     13.80
                   Before     139     263        7         8     13.20
test-connect       After      235     801      135       282    138.70
                   Before     235     801      135       282    135.83

Test Details

Error Counts
          417 Total
          218 Total Unique
-------- ---- ----------------------------------------------------------------------------------------------------------
           25 DocTestFailure
           15 UnsupportedOperationException: streaming query manager command
           14 UnsupportedOperationException: lambda function
           13 AssertionError: AnalysisException not raised
           10 PySparkAssertionError: [DIFFERENT_PANDAS_DATAFRAME] DataFrames are not almost equal:
           10 handle add artifacts
            8 AssertionError: False is not true
            8 UnsupportedOperationException: hint
            7 UnsupportedOperationException: unsupported data source format: "text"
            6 UnsupportedOperationException: function: window
            6 UnsupportedOperationException: write stream operation start
            5 AnalysisException: 'Utf8("INTERVAL '0 00:00:00.000123' DAY TO SECOND") = CAST(#1 AS Utf8)' is not tr...
            5 AnalysisException: Cannot cast to Decimal128(14, 7). Overflowing on NaN
            5 UnsupportedOperationException: function: monotonically_increasing_id
            5 UnsupportedOperationException: sample
            4 AssertionError: "TABLE_OR_VIEW_NOT_FOUND" does not match "No table named 'v'"
            4 PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd() is not implemented.
            4 UnsupportedOperationException: data reader option: linesep
            4 UnsupportedOperationException: sample by
            4 UnsupportedOperationException: unknown aggregate function: hll_sketch_agg
            4 UnsupportedOperationException: unpivot
            3 AnalysisException: Unable to find factory for TEXT
            3 ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type day_time_interval ...
            3 IllegalArgumentException: invalid argument: empty data source paths
            3 UnsupportedOperationException: PlanNode::CacheTable
            3 UnsupportedOperationException: data reader option: multiline
            3 UnsupportedOperationException: function: input_file_name
            3 UnsupportedOperationException: function: ~
            3 UnsupportedOperationException: handle analyze input files
            3 ValueError: Converting to Python dictionary is not supported when duplicate field names are present
            2 AnalysisException: Could not find config namespace "spark"
            2 AnalysisException: map requires all value types to be the same
            2 AnalysisException: two values expected: [Column(Column { relation: None, name: "#2" }), Column(Colum...
            2 AssertionError
            2 AssertionError: AnalysisException not raised by <lambda>
            2 AssertionError: Lists differ: [Row([22 chars](key=1, value='1'), Row(key=10, value='10'), R[2402 cha...
            2 IllegalArgumentException: expected value at line 1 column 1
            2 IllegalArgumentException: invalid argument: found FUNCTION at 5:13 expected 'DATABASE', 'SCHEMA', 'T...
            2 KeyError: 22
            2 PythonException:  ZeroDivisionError: division by zero
            2 SparkRuntimeException: start_from index out of bounds
            2 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
            2 UnsupportedOperationException: approx quantile
            2 UnsupportedOperationException: collect metrics
            2 UnsupportedOperationException: freq items
            2 UnsupportedOperationException: function: bitmap_bit_position
            2 UnsupportedOperationException: function: crc32
            2 UnsupportedOperationException: function: format_number
            2 UnsupportedOperationException: function: from_csv
            2 UnsupportedOperationException: function: from_json
            2 UnsupportedOperationException: function: inline
            2 UnsupportedOperationException: function: map_entries
            2 UnsupportedOperationException: function: sec
            2 UnsupportedOperationException: function: shiftrightunsigned
            2 UnsupportedOperationException: handle analyze is local
            2 UnsupportedOperationException: handle analyze same semantics
            2 UnsupportedOperationException: list functions
            2 UnsupportedOperationException: pivot
            2 UnsupportedOperationException: position with 3 arguments is not supported yet
            2 UnsupportedOperationException: rebalance partitioning by expression
            2 UnsupportedOperationException: unknown aggregate function: collect_set
            2 UnsupportedOperationException: unresolved regex
            2 UnsupportedOperationException: unsupported data source format: "orc"
            2 UnsupportedOperationException: user defined data type should only exist in a field
            2 handle artifact statuses
            2 received metadata size exceeds hard limit (19714 vs. 16384);  :status:42B content-type:60B grpc-stat...
            1 AnalysisException: 'Utf8("1970-01-01 00:00:00") = CAST(#1 AS Utf8)' is not true!
            1 AnalysisException: 'Utf8("2012-02-02 02:02:02") = CAST(#1 AS Utf8)' is not true!
            1 AnalysisException: Cannot cast string 'abc' to value of Float64 type
            1 AnalysisException: Cannot cast value 'abc' to value of Boolean type
            1 AnalysisException: Cannot infer common argument type for comparison operation Boolean = Float64
            1 AnalysisException: Error parsing timestamp from '2023-01-01' using format '%d-%m-%Y': input contains...
            1 AnalysisException: Failed to coerce arguments to satisfy a call to 'nth_value' function: coercion fr...
            1 AnalysisException: Failed to parse placeholder id: cannot parse integer from empty string
            1 AnalysisException: Inconsistent data type across values list at row 1 column 1. Was Map(Field { name...
            1 AnalysisException: Table 'tbl1' already exists
            1 AnalysisException: UNION queries have different number of columns: left has 3 columns whereas right ...
            1 AnalysisException: view not found: tab2
            1 AssertionError: "2000000" does not match "raise_error expects a single UTF-8 string argument"
            1 AssertionError: "CSV header does not conform to the schema" does not match "data reader option: enfo...
(+1)        1 AssertionError: "Database 'memory:8bacda52-68fd-406b-acec-bf947cd179cd' dropped." does not match "in...
(+1)        1 AssertionError: "Database 'memory:f4171bdb-8f91-485c-b677-d6d13b9abd6f' dropped." does not match "in...
            1 AssertionError: "TABLE_OR_VIEW_NOT_FOUND" does not match "The table test_table already exists"
            1 AssertionError: "attribute.*missing" does not match "cannot resolve attribute: ObjectName([Identifie...
            1 AssertionError: "foobar" does not match "raise_error expects a single UTF-8 string argument"
            1 AssertionError: "timestamp values are not equal (timestamp='1968-12-31 17:01:01': data[0][1]='1969-0...
            1 AssertionError: '+---[17 chars]-----+\n|                        x|\n+--------[132 chars]-+\n' != '+-...
            1 AssertionError: ArrayIndexOutOfBoundsException not raised
            1 AssertionError: Exception not raised
            1 AssertionError: Lists differ: [Row([23 chars](2019, 1, 1, 8, 0), aware=datetime.datetime(2019, 1, 1,...
            1 AssertionError: Lists differ: [Row(id=90, name='90'), Row(id=91, name='91'), Ro[176 chars]99')] != [...
            1 AssertionError: Lists differ: [Row(key='0'), Row(key='1'), Row(key='10'), Row(ke[1435 chars]99')] !=...
            1 AssertionError: Lists differ: [Row(ln(id)=0.0, ln(id)=0.0, struct(id, name)=Row(id=[1232 chars]0'))]...
            1 AssertionError: Row(point=ExamplePoint([,1), pypoint=ExamplePoint([,3)) != Row(point='(1.0, 2.0)', p...
            1 AssertionError: StorageLevel(False, True, True, False, 1) != StorageLevel(False, False, False, False...
            1 AssertionError: Struc[31 chars]stampNTZType(), True), StructField('val', Inte[13 chars]ue)]) != Stru...
            1 AssertionError: Struc[32 chars]e(), False), StructField('b', DoubleType(), Fa[158 chars]ue)]) != Str...
            1 AssertionError: Struc[40 chars]ue), StructField('val', ArrayType(DoubleType(), False), True)]) != St...
            1 AssertionError: Struc[64 chars]Type(), True), StructField('i', StringType(), True)]), False)]) != St...
            1 AssertionError: Struc[69 chars]e(), True), StructField('name', StringType(), True)]), True)]) != Str...
            1 AssertionError: YearMonthIntervalType(0, 1) != YearMonthIntervalType(0, 0)
            1 AssertionError: [1.0, 2.0] != ExamplePoint(1.0,2.0)
            1 AssertionError: datetime.datetime(1970, 1, 1, 0, 0) != datetime.datetime(1970, 1, 1, 8, 0)
            1 AssertionError: {} != {'max_age': 5}
            1 AttributeError: 'DataFrame' object has no attribute '_ipython_key_completions_'
            1 AttributeError: 'DataFrame' object has no attribute '_joinAsOf'
(+1)        1 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp5rm9s9n8'
            1 IllegalArgumentException: 83140 is too large to store in a Decimal128 of precision 4. Max is 9999
            1 IllegalArgumentException: column types must match schema types, expected Int64 but found List(Field ...
            1 IllegalArgumentException: column types must match schema types, expected LargeUtf8 but found Utf8 at...
            1 IllegalArgumentException: invalid argument: found FUNCTION at 7:15 expected 'DATABASE', 'SCHEMA', 'O...
            1 IllegalArgumentException: invalid argument: invalid digit found in string
            1 KeyError: 'max'
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] foreach() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] foreachPartition() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] localCheckpoint() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] sparkContext() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] toJSON() is not implemented.
            1 PythonException:  AttributeError: 'NoneType' object has no attribute 'partitionId'
            1 PythonException:  AttributeError: 'list' object has no attribute 'x'
            1 PythonException:  AttributeError: 'list' object has no attribute 'y'
            1 SparkRuntimeException: Failed due to a difference in schemas, original schema: DFSchema { inner: Sch...
            1 UnknownTimeZoneError: 'PST'
            1 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
            1 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
            1 UnsupportedOperationException: COUNT DISTINCT with multiple arguments
            1 UnsupportedOperationException: CSV writer line seperator
            1 UnsupportedOperationException: Insert into not implemented for this table
            1 UnsupportedOperationException: PlanNode::ClearCache
            1 UnsupportedOperationException: PlanNode::IsCached
            1 UnsupportedOperationException: SHOW FUNCTIONS
            1 UnsupportedOperationException: bucketing
            1 UnsupportedOperationException: data reader option: primitivesasstring
            1 UnsupportedOperationException: data writer option: ignoreleadingwhitespace
            1 UnsupportedOperationException: deduplicate within watermark
            1 UnsupportedOperationException: function exists
            1 UnsupportedOperationException: function: array_insert
            1 UnsupportedOperationException: function: array_sort
            1 UnsupportedOperationException: function: arrays_zip
            1 UnsupportedOperationException: function: bit_count
            1 UnsupportedOperationException: function: bit_get
            1 UnsupportedOperationException: function: bitmap_bucket_number
            1 UnsupportedOperationException: function: bitmap_count
            1 UnsupportedOperationException: function: bround
            1 UnsupportedOperationException: function: conv
            1 UnsupportedOperationException: function: convert_timezone
            1 UnsupportedOperationException: function: csc
            1 UnsupportedOperationException: function: elt
            1 UnsupportedOperationException: function: format_string
            1 UnsupportedOperationException: function: getbit
            1 UnsupportedOperationException: function: inline_outer
            1 UnsupportedOperationException: function: java_method
            1 UnsupportedOperationException: function: json_object_keys
            1 UnsupportedOperationException: function: json_tuple
            1 UnsupportedOperationException: function: make_dt_interval
            1 UnsupportedOperationException: function: make_interval
            1 UnsupportedOperationException: function: make_timestamp_ltz
            1 UnsupportedOperationException: function: map_concat
            1 UnsupportedOperationException: function: map_from_entries
            1 UnsupportedOperationException: function: months_between
            1 UnsupportedOperationException: function: parse_url
            1 UnsupportedOperationException: function: printf
            1 UnsupportedOperationException: function: reflect
            1 UnsupportedOperationException: function: regexp_extract
            1 UnsupportedOperationException: function: regexp_extract_all
            1 UnsupportedOperationException: function: regexp_instr
            1 UnsupportedOperationException: function: regexp_substr
            1 UnsupportedOperationException: function: schema_of_csv
            1 UnsupportedOperationException: function: schema_of_json
            1 UnsupportedOperationException: function: sentences
            1 UnsupportedOperationException: function: session_window
            1 UnsupportedOperationException: function: sha
            1 UnsupportedOperationException: function: sha1
            1 UnsupportedOperationException: function: soundex
            1 UnsupportedOperationException: function: spark_partition_id
            1 UnsupportedOperationException: function: split
            1 UnsupportedOperationException: function: stack
            1 UnsupportedOperationException: function: str_to_map
            1 UnsupportedOperationException: function: to_char
            1 UnsupportedOperationException: function: to_csv
            1 UnsupportedOperationException: function: to_json
            1 UnsupportedOperationException: function: to_number
            1 UnsupportedOperationException: function: to_unix_timestamp
            1 UnsupportedOperationException: function: to_utc_timestamp
            1 UnsupportedOperationException: function: to_varchar
            1 UnsupportedOperationException: function: try_add
            1 UnsupportedOperationException: function: try_divide
            1 UnsupportedOperationException: function: try_multiply
            1 UnsupportedOperationException: function: try_subtract
            1 UnsupportedOperationException: function: try_to_number
            1 UnsupportedOperationException: function: url_decode
            1 UnsupportedOperationException: function: url_encode
            1 UnsupportedOperationException: function: width_bucket
            1 UnsupportedOperationException: function: xpath
            1 UnsupportedOperationException: function: xpath_boolean
            1 UnsupportedOperationException: function: xpath_double
            1 UnsupportedOperationException: function: xpath_float
            1 UnsupportedOperationException: function: xpath_int
            1 UnsupportedOperationException: function: xpath_long
            1 UnsupportedOperationException: function: xpath_number
            1 UnsupportedOperationException: function: xpath_short
            1 UnsupportedOperationException: function: xpath_string
            1 UnsupportedOperationException: handle analyze semantic hash
            1 UnsupportedOperationException: make_timestamp with timezone is not yet implemented
            1 UnsupportedOperationException: partitioning columns
            1 UnsupportedOperationException: unknown aggregate function: bitmap_or_agg
            1 UnsupportedOperationException: unknown aggregate function: count_if
            1 UnsupportedOperationException: unknown aggregate function: count_min_sketch
            1 UnsupportedOperationException: unknown aggregate function: grouping_id
            1 UnsupportedOperationException: unknown aggregate function: histogram_numeric
            1 UnsupportedOperationException: unknown aggregate function: percentile
            1 UnsupportedOperationException: unknown aggregate function: try_avg
            1 UnsupportedOperationException: unknown aggregate function: try_sum
            1 UnsupportedOperationException: unknown function: distributed_sequence_id
            1 UnsupportedOperationException: unknown function: product
            1 ValueError: Code in Status proto (StatusCode.INTERNAL) doesn't match status code (StatusCode.RESOURC...
            1 ValueError: The column label 'id' is not unique.
            1 ValueError: The column label 'struct' is not unique.
(-1)        0 AssertionError: "Database 'memory:3c5b9998-8816-4fb6-8458-8910d3e2aea6' dropped." does not match "in...
(-1)        0 AssertionError: "Database 'memory:a7ce7367-8636-4dc9-a399-1bad9cdf61d4' dropped." does not match "in...
(-1)        0 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpn9pg7mvg'

Passed Tests Diff

(empty)

linhr commented Apr 14, 2025

I decided to put the feature behind a configuration option so that we can enable the separate Tokio runtime explicitly and continue experimenting with it.
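
To illustrate the gating idea only, here is a rough sketch that uses a plain `std` thread as a stand-in for the separate Tokio runtime. `RuntimeConfig`, `enable_secondary`, `RuntimeManager`, and `run_io` are hypothetical names and do not reflect the actual configuration key or types in this PR.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical configuration flag; the real option name is not shown in this thread.
#[derive(Debug, Clone, Copy, Default)]
pub struct RuntimeConfig {
    pub enable_secondary: bool,
}

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Stand-in for a dedicated runtime: one thread draining a job queue.
struct Secondary {
    sender: Sender<Job>,
    handle: JoinHandle<()>,
}

/// Created once at startup; owns the secondary executor only when it is enabled.
pub struct RuntimeManager {
    secondary: Option<Secondary>,
}

impl RuntimeManager {
    pub fn new(config: RuntimeConfig) -> Self {
        let secondary = config.enable_secondary.then(|| {
            let (sender, receiver) = mpsc::channel::<Job>();
            // The loop ends once every sender has been dropped.
            let handle = thread::spawn(move || {
                for job in receiver {
                    job();
                }
            });
            Secondary { sender, handle }
        });
        Self { secondary }
    }

    /// Runs `f` on the secondary executor when enabled, otherwise inline on the caller.
    pub fn run_io<T, F>(&self, f: F) -> T
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        match &self.secondary {
            Some(secondary) => {
                let (tx, rx) = mpsc::channel();
                let job: Job = Box::new(move || {
                    let _ = tx.send(f());
                });
                secondary.sender.send(job).expect("secondary executor stopped");
                rx.recv().expect("secondary executor dropped the job")
            }
            None => f(),
        }
    }

    /// Stops the secondary executor, if any, and waits for it to finish.
    pub fn shutdown(self) {
        if let Some(Secondary { sender, handle }) = self.secondary {
            drop(sender);
            handle.join().expect("secondary executor panicked");
        }
    }
}
```

With the flag off, `run_io` degrades to a direct call, so the default behavior is unchanged and the separate executor can be compared against it in benchmarks.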

This PR now also includes the following changes:

  1. Consolidate Tokio runtime creation logic for various entrypoints (Python library, PySpark shell, CLI, etc.).
  2. Avoid creating a SessionContext for every task in the worker.
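
The second item can be pictured roughly as below. This is a sketch under assumptions, not the worker code from this PR: `TaskContext` stands in for an expensive-to-build session context, and `Worker`, `run_tasks`, and `contexts_created` are hypothetical names. The counter exists only to make the number of constructions observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Counts how many contexts were built; for illustration only.
static CONTEXTS_CREATED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for an expensive-to-build session context.
pub struct TaskContext {
    pub worker_name: String,
}

impl TaskContext {
    pub fn new(worker_name: &str) -> Self {
        CONTEXTS_CREATED.fetch_add(1, Ordering::SeqCst);
        Self {
            worker_name: worker_name.to_string(),
        }
    }
}

/// Hypothetical worker that builds its context once and shares it with every task.
pub struct Worker {
    context: Arc<TaskContext>,
}

impl Worker {
    pub fn new(name: &str) -> Self {
        Self {
            context: Arc::new(TaskContext::new(name)),
        }
    }

    /// Runs each task on its own thread; tasks clone the `Arc`, not the context itself.
    pub fn run_tasks(&self, task_ids: Vec<u64>) -> Vec<String> {
        let handles: Vec<_> = task_ids
            .into_iter()
            .map(|id| {
                let context = Arc::clone(&self.context);
                thread::spawn(move || format!("{}:task-{}", context.worker_name, id))
            })
            .collect();
        // Joining in spawn order keeps the results in task order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("task panicked"))
            .collect()
    }
}

/// Number of contexts constructed so far in this process.
pub fn contexts_created() -> usize {
    CONTEXTS_CREATED.load(Ordering::SeqCst)
}
```

Cloning an `Arc` only bumps a reference count, so the per-task cost no longer depends on how expensive the context is to build.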

@linhr linhr requested a review from shehabgamin April 14, 2025 10:18
@linhr linhr marked this pull request as ready for review April 14, 2025 10:18

@shehabgamin shehabgamin left a comment

Well done!!

@linhr linhr merged commit 8769120 into main Apr 16, 2025
7 checks passed
@linhr linhr deleted the object-store-runtime branch April 16, 2025 02:06