Assertion failed on taxi and plasticc benchmarks #458

gshimansky · 2023-05-04T00:52:11Z

When running taxi (taxi modified to work with header csv input) and plasticc benchmarks on HDK debug build I encounter failed assertion QueryFragmentDescriptor.cpp:297 Check failed: table_info. This happens in the same place both Linux and Windows so it is not a windows specific problem.

Code to reproduce:

import sys
import time
import json
from collections import OrderedDict
import modin.pandas as pd


def read(filename):
    column_types = {
        "trip_id": "int64",
        "vendor_id": "string",
        "pickup_datetime": "timestamp",
        "dropoff_datetime": "timestamp",
        "store_and_fwd_flag": "string",
        "rate_code_id": "int64",
        "pickup_longitude": "float64",
        "pickup_latitude": "float64",
        "dropoff_longitude": "float64",
        "dropoff_latitude": "float64",
        "passenger_count": "int64",
        "trip_distance": "float64",
        "fare_amount": "float64",
        "extra": "float64",
        "mta_tax": "float64",
        "tip_amount": "float64",
        "tolls_amount": "float64",
        "ehail_fee": "float64",
        "improvement_surcharge": "float64",
        "total_amount": "float64",
        "payment_type": "string",
        "trip_type": "float64",
        "pickup": "string",
        "dropoff": "string",
        "cab_type": "string",
        "precipitation": "float64",
        "snow_depth": "int64",
        "snowfall": "float64",
        "max_temperature": "int64",
        "min_temperature": "int64",
        "average_wind_speed": "float64",
        "pickup_nyct2010_gid": "float64",
        "pickup_ctlabel": "float64",
        "pickup_borocode": "float64",
        "pickup_boroname": "string",
        "pickup_ct2010": "float64",
        "pickup_boroct2010": "float64",
        "pickup_cdeligibil": "string",
        "pickup_ntacode": "string",
        "pickup_ntaname": "string",
        "pickup_puma": "float64",
        "dropoff_nyct2010_gid": "float64",
        "dropoff_ctlabel": "float64",
        "dropoff_borocode": "float64",
        "dropoff_boroname": "string",
        "dropoff_ct2010": "float64",
        "dropoff_boroct2010": "float64",
        "dropoff_cdeligibil": "string",
        "dropoff_ntacode": "string",
        "dropoff_ntaname": "string",
        "dropoff_puma": "float64",
    }

    all_but_dates = {
        col: valtype
        for (col, valtype) in column_types.items()
        if valtype not in ["timestamp"]
    }
    dates_only = [
        col for (col, valtype) in column_types.items() if valtype in ["timestamp"]
    ]

    df = pd.read_csv(
        filename,
        header=0,
        dtype=all_but_dates,
        parse_dates=dates_only,
    )

    df.shape  # to trigger real execution on omnisci
    return df


def q1_omnisci(df):
    q1_pandas_output = df.groupby("cab_type").size()
    q1_pandas_output.shape  # to trigger real execution on omnisci
    return q1_pandas_output


def q2_omnisci(df):
    q2_pandas_output = df.groupby("passenger_count").agg({"total_amount": "mean"})
    q2_pandas_output.shape  # to trigger real execution on omnisci
    return q2_pandas_output


def q3_omnisci(df):
    df["pickup_datetime"] = df["pickup_datetime"].dt.year
    q3_pandas_output = df.groupby(["passenger_count", "pickup_datetime"]).size()
    q3_pandas_output.shape  # to trigger real execution on omnisci
    return q3_pandas_output


def q4_omnisci(df):
    df["pickup_datetime"] = df["pickup_datetime"].dt.year
    df["trip_distance"] = df["trip_distance"].astype("int64")
    q4_pandas_output = (
        df.groupby(["passenger_count", "pickup_datetime", "trip_distance"], sort=False)
        .size()
        .reset_index()
        .sort_values(
            by=["pickup_datetime", 0], ignore_index=True, ascending=[True, False]
        )
    )
    q4_pandas_output.shape  # to trigger real execution on omnisci
    return q4_pandas_output


def measure(func, *args, **kw):
    t0 = time.time()
    res = func(*args, **kw)
    t1 = time.time()
    return res, t1 - t0


def run(input_file):
    res = OrderedDict()
    df, res["Reading"] = measure(read, input_file)
    _, res["Q1"] = measure(q1_omnisci, df)
    _, res["Q2"] = measure(q2_omnisci, df)
    _, res["Q3"] = measure(q3_omnisci, df.copy())
    _, res["Q4"] = measure(q4_omnisci, df.copy())
    return res


def main():
    if len(sys.argv) != 2:
        print(
            f"USAGE: python taxi.py <data file name>"
        )
        return
    result = run(sys.argv[1])
    json.dump(result, sys.stdout, indent=4)


if __name__ == "__main__":
    main()

Datafile that contains a header and just one line:

trip_id,pickup_datetime,dropoff_datetime,rate_code_id,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_type,cab_type,precipitation,snow_depth,snowfall,max_temperature,min_temperature,average_wind_speed,pickup_nyct2010_gid,pickup_ctlabel,pickup_borocode,pickup_ct2010,pickup_boroct2010,pickup_puma,dropoff_nyct2010_gid,dropoff_ctlabel,dropoff_borocode,dropoff_ct2010,dropoff_boroct2010,dropoff_puma
728589253,2013-03-23 19:00:03,2015-02-09 23:53:19,227,-3384.6504885319628,-1768.4547705633008,899.3313637135943,-3567.838610520537,0,247.59167806792436,283313.3109043807,11958.290443405911,37.04835519254301,285575.9549971816,5595.173567982231,34.00429082233286,611256.349785416,1.2010861125030665,green,2.737725308004807,8,26.25521086711126,27,15,7.959928315627986,1540.786264683745,1.926857212032628,3.528783788433975,73450.27117442139,4394007.527430953,4035.594680159191,429.3012304384143,5400.04327998597,2.3620759492837418,146869.91997436076,1023130.9389877942,3812.462431689403

The text was updated successfully, but these errors were encountered:

gshimansky · 2023-05-04T00:54:52Z

Modin version used is git revision 8817093cb2f16be3709513b1517d8a4bc679900c.

Garra1980 · 2023-05-04T01:01:33Z

Regression or happens always with a debug build?

gshimansky · 2023-05-04T01:16:07Z

I didn't try Release, but I could imagine that if it works in the same way as Debug, failed assertion means NULL pointer, so on the next line code using this pointer should crash https://github.com/intel-ai/hdk/blob/main/omniscidb/QueryEngine/Descriptors/QueryFragmentDescriptor.cpp#L297-L299
I am not sure that this is a regression because we don't run taxi and plasticc in HDK CI (which we probably should).

Garra1980 · 2023-05-04T01:21:56Z

We run them in the internal performance benchmarking system

gshimansky · 2023-05-04T01:25:47Z

Benchmarks with reduced size datasets could be used for additional smoke testing in CI. They should pass very quickly. My dataset generator script allows generating datasets for our benchmarks of any required size.

ienkovich · 2023-05-04T03:24:04Z

This is a known issue that is going to be fixed by #442

Note, that this bug only appears when the first query you run in your process has multiple steps. We don't see it in our Modin tests and benchmarks (and also HDK CI) because the first test query is usually simple and benchmarks use a warm-up query having a single step.

Here is the part of the patch that actually fixes the problem:

diff --git a/omniscidb/ResultSetRegistry/ResultSetRegistry.h b/omniscidb/ResultSetRegistry/ResultSetRegistry.h
index d3c3c7c02..ab52008ad 100644
--- a/omniscidb/ResultSetRegistry/ResultSetRegistry.h
+++ b/omniscidb/ResultSetRegistry/ResultSetRegistry.h
@@ -51,6 +51,8 @@ class ResultSetRegistry : public SimpleSchemaProvider,

   const DictDescriptor* getDictMetadata(int dict_id, bool load_dict = true) override;

+  std::vector<int> listDatabases() const override { return {{DB_ID}}; }
+
  private:
   bool useColumnarResults(const ResultSet& rs) const;

But you will have to add a warmup query anyway to avoid unexpected execution time for Taxi Q1.

Also, it's surprising that Taxi Q1 triggers the problem. As I said, the problem only appears for queries having more than one step, which means Taxi Q1 on Modin has more than one step. Looking into Modin plans for these queries I see, that for Q1, Q3, and Q4 there is an additional projection after aggregation. This projection replaces NULL values with 0 values for COUNT aggregates. I don't understand why we need this transformation, I don't think COUNT aggregate can ever return NULL. It might seem that the aggregation result is small and additional projection is not a big deal, but whatever the table size is, we would pay the full price of dynamic module compilation which used to be ~60ms for simple modules IIRC.

UPD: the fix has been merged

ienkovich · 2023-05-04T04:03:18Z

Resolved by #442

AndreyPavlenko · 2023-05-08T12:18:45Z

Looking into Modin plans for these queries I see, that for Q1, Q3, and Q4 there is an additional projection after aggregation.

Probably, the projection is added by this fillna() - https://github.com/modin-project/modin/blob/master/modin/pandas/groupby.py#L899 . If I remove fillna() here, the additional projection disappears.

dchigarev · 2023-05-15T15:58:40Z

Probably, the projection is added by this fillna() - https://github.com/modin-project/modin/blob/master/modin/pandas/groupby.py#L899 . If I remove fillna() here, the additional projection disappears.

I've investigated why we initially added this .fillna() in modin and found out that we don't need it anymore: modin-project/modin#6126

So I'll soon create a PR to remove it from modin

ienkovich closed this as completed May 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assertion failed on taxi and plasticc benchmarks #458

Assertion failed on taxi and plasticc benchmarks #458

gshimansky commented May 4, 2023

gshimansky commented May 4, 2023

Garra1980 commented May 4, 2023

gshimansky commented May 4, 2023

Garra1980 commented May 4, 2023

gshimansky commented May 4, 2023

ienkovich commented May 4, 2023 •

edited

Loading

ienkovich commented May 4, 2023

AndreyPavlenko commented May 8, 2023 •

edited

Loading

dchigarev commented May 15, 2023

Assertion failed on taxi and plasticc benchmarks #458

Assertion failed on taxi and plasticc benchmarks #458

Comments

gshimansky commented May 4, 2023

gshimansky commented May 4, 2023

Garra1980 commented May 4, 2023

gshimansky commented May 4, 2023

Garra1980 commented May 4, 2023

gshimansky commented May 4, 2023

ienkovich commented May 4, 2023 • edited Loading

ienkovich commented May 4, 2023

AndreyPavlenko commented May 8, 2023 • edited Loading

dchigarev commented May 15, 2023

ienkovich commented May 4, 2023 •

edited

Loading

AndreyPavlenko commented May 8, 2023 •

edited

Loading