chore(ibis): refactor the type mapping table #1302
Conversation
Walkthrough

Centralizes data type mappings across multiple metadata modules by introducing module-level mapping dictionaries, adding type hints to transformer methods, normalizing inputs, and logging warnings for unknown types. Some transformers now return union types when complex types are passed through unchanged.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Metadata as MetadataModule
    participant Mapping as TYPE_MAPPING (module-level)
    participant Logger as loguru.logger
    Metadata->>Mapping: normalize(data_type) → key
    alt key is complex (array/struct) or passthrough
        Mapping-->>Metadata: return original string (complex)
    else mapped
        Mapping-->>Metadata: return RustWrenEngineColumnType
    else unknown
        Mapping-->>Metadata: return UNKNOWN
        Mapping->>Logger: warning("unknown type: ...")
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
ibis-server/app/model/metadata/snowflake.py (1)
163-184: Strip size/precision and whitespace when normalizing types

Snowflake emits types with params (e.g., NUMBER(38,0), VARCHAR(16777216)). Without stripping, these resolve to UNKNOWN.
```diff
- # Convert to lowercase for comparison
- normalized_type = data_type.lower()
+ # Remove parameter specs and normalize (e.g., "NUMBER(38,0)" -> "number")
+ normalized_type = re.sub(r"\(.*\)", "", data_type).strip().lower()
```

ibis-server/app/model/metadata/redshift.py (1)
176-199: Strip parameterization to catch e.g., character varying(50), numeric(10,2)

Without this, many types end up UNKNOWN.
```diff
- # Convert to lowercase for comparison
- normalized_type = data_type.lower()
+ # Normalize by removing parameter specs and trimming spaces
+ normalized_type = re.sub(r"\(.*\)", "", data_type).strip().lower()
```

ibis-server/app/model/metadata/object_storage.py (2)
83-97: Fix column type extraction from DuckDB relation (current code will fail)
`conn.read_*` returns a DuckDBPyRelation; `df[col].dtypes[0]` is a pandas-ism and will error. Use relation.columns/types (or DESCRIBE). Referencing your prior learning about DuckDB DESCRIBE.

```diff
- columns = []
- try:
-     for col in df.columns:
-         duckdb_type = df[col].dtypes[0]
-         columns.append(
-             Column(
-                 name=col,
-                 type=self._to_column_type(duckdb_type.__str__()),
-                 notNull=False,
-             )
-         )
- except Exception as e:
-     logger.debug(f"Failed to read column types: {e}")
-     continue
+ # Extract schema from DuckDB relation without materializing data
+ try:
+     columns = []
+     col_names = list(df.columns)
+     col_types = [str(t) for t in df.types]
+     for name, typ in zip(col_names, col_types):
+         columns.append(
+             Column(
+                 name=name,
+                 type=self._to_column_type(typ),
+                 notNull=False,
+             )
+         )
+ except Exception as e:
+     logger.debug(f"Failed to read column types: {e}")
+     continue
```
151-182: Make DECIMAL/STRUCT/array guards case-insensitive and reuse normalized value

Avoid case fragility; ensures consistent UNKNOWN handling for STRUCT/arrays.
```diff
- if col_type.startswith("DECIMAL"):
+ lower = col_type.lower()
+ if lower.startswith("decimal"):
      return RustWrenEngineColumnType.DECIMAL
@@
- if col_type.startswith("STRUCT"):
+ if lower.startswith("struct"):
      return RustWrenEngineColumnType.UNKNOWN
@@
- if col_type.endswith("[]"):
+ if lower.endswith("[]"):
      return RustWrenEngineColumnType.UNKNOWN
- # Convert to lowercase for comparison
- normalized_type = col_type.lower()
+ # Use the lower-cased value for mapping
+ normalized_type = lower
```

ibis-server/app/model/metadata/clickhouse.py (2)
91-101: Set notNull based on Nullable(...) wrapper (currently always False)

ClickHouse encodes nullability in the type. This is a correctness issue.
```diff
- unique_tables[schema_table].columns.append(
-     Column(
-         name=row["column_name"],
-         type=self._transform_column_type(row["data_type"]),
-         notNull=False,
-         description=row["column_comment"],
-         properties=None,
-     )
- )
+ dtype = row["data_type"]
+ unique_tables[schema_table].columns.append(
+     Column(
+         name=row["column_name"],
+         type=self._transform_column_type(dtype),
+         notNull=not dtype.lower().startswith("nullable("),
+         description=row["column_comment"],
+         properties=None,
+     )
+ )
```
112-133: Handle Nullable/LowCardinality/Array wrappers and strip params (DateTime64, Decimal, FixedString)

Without unwrapping, most real-world types map to UNKNOWN.
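A standalone sketch of the wrapper unwrapping described here (the helper name is hypothetical; it mirrors the suggested change rather than code that exists in the repo):

```python
import re

# Wrappers that carry a nested type: Nullable(T), LowCardinality(T),
# SimpleAggregateFunction(func, T)
WRAPPERS = re.compile(
    r"^(Nullable|LowCardinality|SimpleAggregateFunction)\((.+)\)$", re.IGNORECASE
)

def unwrap_clickhouse_type(data_type: str) -> str:
    t = data_type.strip()
    while (m := WRAPPERS.match(t)):
        inner = m.group(2)
        # SimpleAggregateFunction(sum, UInt64): keep the trailing type argument
        if m.group(1).lower() == "simpleaggregatefunction" and "," in inner:
            inner = inner.split(",")[-1].strip()
        t = inner
    # Array(T) -> T (array-ness itself is dropped in this sketch)
    m = re.match(r"^Array\((.+)\)$", t, re.IGNORECASE)
    if m:
        t = m.group(1)
    # Drop parameter lists: Decimal(10,2) -> decimal, DateTime64(3) -> datetime64
    return re.sub(r"\([^()]*\)", "", t).lower()
```

With this, `Nullable(DateTime64(3))` normalizes to `datetime64` and `LowCardinality(Nullable(String))` to `string`, which the plain mapping table can then resolve.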
```diff
- # Convert to lowercase for comparison
- normalized_type = data_type.lower()
-
- # Use the module-level mapping table
- mapped_type = CLICKHOUSE_TYPE_MAPPING.get(
-     normalized_type, RustWrenEngineColumnType.UNKNOWN
- )
+ # Unwrap wrappers and strip parameters:
+ # - Nullable(T), LowCardinality(T), SimpleAggregateFunction(..., T)
+ # - Array(T) -> T (array element type; array-ness currently ignored)
+ # - Decimal(p,s), FixedString(n), DateTime64(s)
+ t = data_type.strip()
+ # unwrap known wrappers
+ wrap_pat = re.compile(r'^(Nullable|LowCardinality|SimpleAggregateFunction)\((.+)\)$', re.IGNORECASE)
+ while True:
+     m = wrap_pat.match(t)
+     if not m:
+         break
+     # for SimpleAggregateFunction(func, T) keep T
+     inner = m.group(2)
+     # if inner still contains a comma, keep the last part as the base type
+     if m.group(1).lower() == "simpleaggregatefunction" and "," in inner:
+         inner = inner.split(",")[-1].strip()
+     t = inner
+ # unwrap Array(T) -> T
+ m = re.match(r'^Array\((.+)\)$', t, re.IGNORECASE)
+ if m:
+     t = m.group(1)
+ # drop parameter lists: Decimal(10,2) -> Decimal, FixedString(16) -> FixedString, DateTime64(3) -> DateTime64
+ t = re.sub(r'\([^()]*\)', '', t).lower()
+
+ mapped_type = CLICKHOUSE_TYPE_MAPPING.get(t, RustWrenEngineColumnType.UNKNOWN)
```

ibis-server/app/model/metadata/postgres.py (2)
121-131: Bug: PostGIS handler updates only the first column and forces string type.
- It ignores the target column name and breaks after the first column.
- It coerces enum to str, diverging from get_table_list which sets the enum.
Apply:
```diff
- table = tables[table_name]
- for column in table.columns:
-     column.type = str(
-         self._transform_postgres_column_type(row["column_type"])
-     )
-     break
+ table = tables[table_name]
+ target = next((c for c in table.columns if c.name == row["column_name"]), None)
+ if target is None:
+     logger.warning(f"PostGIS column not found: {table_name}.{row['column_name']}")
+     continue
+ target.type = self._transform_postgres_column_type(row["column_type"])
```
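The `next(..., None)` lookup pattern used in the fix, shown standalone with a minimal stand-in `Column` (not the app's actual model):

```python
from dataclasses import dataclass

@dataclass
class Column:  # minimal stand-in for the app's Column model
    name: str
    type: str

columns = [Column("id", "int4"), Column("geom", "varchar")]

# next() with a default avoids StopIteration when the column is missing,
# and targets the named column instead of blindly taking the first one
target = next((c for c in columns if c.name == "geom"), None)
if target is not None:
    target.type = "geometry"
```

This is why the fix can update `geom` without touching `id`, and can log-and-continue when the PostGIS column is absent.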
69-101: Correct type hints for tables and extensions.

augment/postgis_handler operate on a dict[str, Table]; get_extensions returns a list of dicts. Fix annotations to match usage and avoid confusion for static analysis.
```diff
- def augment(self, tables: list[Table]) -> list[Table]:
+ def augment(self, tables: dict[str, Table]) -> dict[str, Table]:
  ...
- def get_extensions(self) -> list[str]:
+ def get_extensions(self) -> list[dict[str, str]]:
  ...
- def postgis_handler(self, tables: list[Table], schema_name: str) -> list[Table]:
+ def postgis_handler(self, tables: dict[str, Table], schema_name: str) -> dict[str, Table]:
```

ibis-server/app/model/metadata/bigquery.py (1)
77-86: Wire the transformer into get_table_list.

Currently Column.type is the raw BigQuery string, so the new mapping/logging is unused.
```diff
- return Column(
+ return Column(  # field_path supports both column & nested column
      name=row["field_path"],
-     type=row["data_type"],
+     type=self._transform_column_type(row["data_type"]),
      notNull=row["is_nullable"].lower() == "no",
      description=row["column_description"],
      properties={},
      nestedColumns=[] if has_nested_columns(row) else None,
  )
```
🧹 Nitpick comments (11)
ibis-server/app/model/metadata/canner.py (2)
240-246: Add type hints and docstring for consistency with other adapters

Align this method signature/docstring with athena/mysql/trino for readability and static checks.
Example outside-range change:
```python
def _transform_column_type(self, data_type: str) -> RustWrenEngineColumnType:
    """Transform Canner data type to RustWrenEngineColumnType."""
    ...
```
41-44: Verify FLOAT mapping semantics ("float" -> FLOAT8)

MySQL-style "float" is typically 32-bit; here it maps to 64-bit. If Canner surfaces MySQL-like types, consider mapping to FLOAT4/REAL for accuracy, or document the intentional widening.
```diff
- "float": RustWrenEngineColumnType.FLOAT8,
+ "float": RustWrenEngineColumnType.FLOAT4,
```

Do you want me to audit float/double mappings across all adapters for consistency?
ibis-server/app/model/metadata/mysql.py (2)
178-180: Align comment with implementation

We don't strip params here; DATA_TYPE already excludes them in MySQL. Update the comment to avoid confusion.
```diff
- # Remove parameter specifications like VARCHAR(255) -> VARCHAR
- normalized_type = data_type.strip().lower()
+ # MySQL information_schema.COLUMNS.DATA_TYPE is already base type (no length/precision)
+ normalized_type = data_type.strip().lower()
```
46-48: Double-check TIMESTAMP -> TIMESTAMPTZ

MySQL TIMESTAMP stores UTC but has no timezone field; many systems map it to TIMESTAMP (no tz). Please confirm intended semantics across adapters. I can batch-update mappings/tests if needed.
ibis-server/app/model/metadata/snowflake.py (1)
17-51: Expand Snowflake type coverage (LTZ/TIME/BINARY/VARIANT/OBJECT/GEOGRAPHY/GEOMETRY)

Common Snowflake types are missing, leading to unnecessary UNKNOWNs and noisy logs.
```diff
  SNOWFLAKE_TYPE_MAPPING = {
  @@
      # Date and Time Types
      "date": RustWrenEngineColumnType.DATE,
-     "datetime": RustWrenEngineColumnType.TIMESTAMP,
+     "datetime": RustWrenEngineColumnType.TIMESTAMP,
      "timestamp": RustWrenEngineColumnType.TIMESTAMP,
      "timestamp_ntz": RustWrenEngineColumnType.TIMESTAMP,
      "timestamp_tz": RustWrenEngineColumnType.TIMESTAMPTZ,
+     "timestamp_ltz": RustWrenEngineColumnType.TIMESTAMPTZ,
+     "time": RustWrenEngineColumnType.TIME,
+     # Semi-structured / binary
+     "variant": RustWrenEngineColumnType.JSON,
+     "object": RustWrenEngineColumnType.JSON,
+     "binary": RustWrenEngineColumnType.BYTES,
+     # Spatial
+     "geography": RustWrenEngineColumnType.GEOGRAPHY,
+     "geometry": RustWrenEngineColumnType.GEOMETRY,
  }
```

ibis-server/app/model/metadata/redshift.py (1)
15-51: Add Redshift SUPER type mapping

SUPER is common; mapping to JSON avoids UNKNOWN noise.
```diff
  REDSHIFT_TYPE_MAPPING = {
  @@
      "inet": RustWrenEngineColumnType.INET,
      "oid": RustWrenEngineColumnType.OID,
+     "super": RustWrenEngineColumnType.JSON,
  }
```

ibis-server/app/model/metadata/object_storage.py (1)
25-51: Align DuckDB scalar mappings with existing conventions

Use consistent integer/float targets across modules; avoid lossy/int-size mismatches.
```diff
  DUCKDB_TYPE_MAPPING = {
-     "bigint": RustWrenEngineColumnType.INT64,
-     "bit": RustWrenEngineColumnType.INT2,
+     "bigint": RustWrenEngineColumnType.BIGINT,
      "blob": RustWrenEngineColumnType.BYTES,
      "boolean": RustWrenEngineColumnType.BOOL,
      "date": RustWrenEngineColumnType.DATE,
      "double": RustWrenEngineColumnType.DOUBLE,
-     "float": RustWrenEngineColumnType.FLOAT,
-     "integer": RustWrenEngineColumnType.INT,
+     "float": RustWrenEngineColumnType.FLOAT4,
+     "real": RustWrenEngineColumnType.REAL,
+     "integer": RustWrenEngineColumnType.INTEGER,
  @@
      "interval": RustWrenEngineColumnType.INTERVAL,
      "json": RustWrenEngineColumnType.JSON,
-     "smallint": RustWrenEngineColumnType.INT2,
+     "smallint": RustWrenEngineColumnType.SMALLINT,
      "time": RustWrenEngineColumnType.TIME,
      "timestamp": RustWrenEngineColumnType.TIMESTAMP,
      "timestamp with time zone": RustWrenEngineColumnType.TIMESTAMPTZ,
-     "tinyint": RustWrenEngineColumnType.INT2,
+     "tinyint": RustWrenEngineColumnType.TINYINT,
  @@
      "uuid": RustWrenEngineColumnType.UUID,
-     "varchar": RustWrenEngineColumnType.STRING,
+     "varchar": RustWrenEngineColumnType.VARCHAR,
  }
```

ibis-server/app/model/metadata/postgres.py (3)
25-36: Normalize integer/float aliases to a single canonical target (consistency).

Today, some aliases resolve to INTEGER/DOUBLE while enum exposes INT/INT4/INT8, FLOAT/FLOAT4/FLOAT8. Pick one canonical target per family to avoid mixed semantics across engines. If you keep INTEGER/DOUBLE here, mirror that everywhere; otherwise switch these to INT/DOUBLE or INT4/FLOAT8 consistently.
145-147: Trim input before lowercasing.

Be defensive against stray whitespace from drivers.
```diff
- normalized_type = data_type.lower()
+ normalized_type = data_type.strip().lower()
```

Also applies to: 290-292
298-301: Good: unknown-type warning added. Consider richer context later.

If/when you can pass table/column into the transformer, include them in the warning to speed up triage.
ibis-server/app/model/metadata/bigquery.py (1)
204-206: Trim input before lowercasing.

```diff
- normalized_type = data_type.lower()
+ normalized_type = data_type.strip().lower()
```
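Across these comments the suggested normalization is the same: drop parameter specs, trim, lowercase. A minimal sketch (the helper name is hypothetical):

```python
import re

def normalize_type(data_type: str) -> str:
    # Drop parameter specs like "(38,0)" or "(16777216)", then trim and lowercase
    return re.sub(r"\(.*\)", "", data_type).strip().lower()
```

Keys in the module-level mapping tables can then stay bare base-type names.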
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- ibis-server/app/model/metadata/athena.py (3 hunks)
- ibis-server/app/model/metadata/bigquery.py (3 hunks)
- ibis-server/app/model/metadata/canner.py (3 hunks)
- ibis-server/app/model/metadata/clickhouse.py (3 hunks)
- ibis-server/app/model/metadata/mssql.py (3 hunks)
- ibis-server/app/model/metadata/mysql.py (3 hunks)
- ibis-server/app/model/metadata/object_storage.py (3 hunks)
- ibis-server/app/model/metadata/oracle.py (3 hunks)
- ibis-server/app/model/metadata/postgres.py (4 hunks)
- ibis-server/app/model/metadata/redshift.py (3 hunks)
- ibis-server/app/model/metadata/snowflake.py (3 hunks)
- ibis-server/app/model/metadata/trino.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-01-07T03:56:21.741Z
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
Applied to files:
ibis-server/app/model/metadata/object_storage.py
🧬 Code graph analysis (12)
ibis-server/app/model/metadata/trino.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)

ibis-server/app/model/metadata/oracle.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)

ibis-server/app/model/metadata/athena.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/trino.py (1): _transform_column_type (138-158)

ibis-server/app/model/metadata/canner.py (1)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)

ibis-server/app/model/metadata/snowflake.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)

ibis-server/app/model/metadata/redshift.py (1)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)

ibis-server/app/model/metadata/clickhouse.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)

ibis-server/app/model/metadata/postgres.py (1)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)

ibis-server/app/model/metadata/mssql.py (3)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)
- ibis-server/app/model/metadata/mysql.py (1): _transform_column_type (169-189)

ibis-server/app/model/metadata/bigquery.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)

ibis-server/app/model/metadata/mysql.py (3)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/athena.py (1): _transform_column_type (140-160)
- ibis-server/app/model/metadata/trino.py (1): _transform_column_type (138-158)

ibis-server/app/model/metadata/object_storage.py (1)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci
🔇 Additional comments (5)
ibis-server/app/model/metadata/athena.py (2)
18-51: Mapping table looks consistent with the approach across modules

No blockers here.
140-160: Confirm Athena TIMESTAMP semantics (tz vs no-tz)

Athena/Trino historically differentiate TIMESTAMP vs TIMESTAMP WITH TIME ZONE. Mapping plain "timestamp" to TIMESTAMPTZ may be inaccurate depending on engine settings. Please verify and align.
Do you want me to switch "timestamp" -> TIMESTAMP and add "timestamp with time zone" -> TIMESTAMPTZ if that matches your environment?
ibis-server/app/model/metadata/postgres.py (1)
54-58: LGTM: Extension mapping table is clear.

ibis-server/app/model/metadata/bigquery.py (2)
215-219: Verify TIMESTAMP mapping consistency across engines.

BigQuery TIMESTAMP (UTC point-in-time) is mapped to TIMESTAMPTZ here. Ensure this aligns with other connectors (e.g., Postgres TIMESTAMPTZ) and downstream expectations.
Would you like me to scan the repo for other connectors’ TIMESTAMP mappings and summarize the current consistency matrix?
195-203: Union-type annotation supported – no changes needed.
The project requires Python ≥3.11, so `str | RustWrenEngineColumnType` is valid syntax.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
ibis-server/app/model/metadata/bigquery.py (3)
79-88: Type mapping isn't applied; unknown-type logging won't trigger.

get_column sets Column.type from raw data_type, bypassing _transform_column_type. Wire it in so the mapping and warnings are effective.
```diff
  def get_column(row) -> Column:
      return Column(
          # field_path supports both column & nested column
          name=row["field_path"],
-         type=row["data_type"],
+         type=self._transform_column_type(row["data_type"]),
          notNull=row["is_nullable"].lower() == "no",
          description=row["column_description"],
          properties={},
          nestedColumns=[] if has_nested_columns(row) else None,
      )
```

Also applies to: 197-221
67-72: Quote dataset identifier to support hyphenated project IDs.

Without backticks, projects like my-proj will break. Quote {dataset_id} in INFORMATION_SCHEMA refs.
```diff
- FROM {dataset_id}.INFORMATION_SCHEMA.COLUMNS c
- JOIN {dataset_id}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS cf
+ FROM `{dataset_id}`.INFORMATION_SCHEMA.COLUMNS c
+ JOIN `{dataset_id}`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS cf
      ON cf.table_name = c.table_name AND cf.column_name = c.column_name
- LEFT JOIN {dataset_id}.INFORMATION_SCHEMA.TABLE_OPTIONS table_options
+ LEFT JOIN `{dataset_id}`.INFORMATION_SCHEMA.TABLE_OPTIONS table_options
      ON c.table_name = table_options.table_name AND table_options.OPTION_NAME = 'description'
```
167-171: Syntax error: trailing comma before FROM.

SELECT list ends with a comma, which invalidates the query.
```diff
- kcu.table_name as constraintedTable, kcu.column_name as constraintedColumn,
+ kcu.table_name as constraintedTable, kcu.column_name as constraintedColumn
```
♻️ Duplicate comments (1)
ibis-server/app/model/metadata/bigquery.py (1)
15-32: Boolean mapping added — resolves the prior review.

Adding "bool"/"boolean" → BOOL fixes the unknown-type noise for booleans.
🧹 Nitpick comments (3)
ibis-server/app/model/metadata/bigquery.py (3)
15-32: Consider a few common BigQuery aliases.

Optionally cover INTEGER → INT64 and DECIMAL → NUMERIC to be defensive against variant sources that may emit synonyms.
```diff
  BIGQUERY_TYPE_MAPPING = {
      # GEOGRAPHY and RANGE columns were filtered out
      "bool": RustWrenEngineColumnType.BOOL,
      "boolean": RustWrenEngineColumnType.BOOL,
+     "integer": RustWrenEngineColumnType.INT64,
+     "decimal": RustWrenEngineColumnType.NUMERIC,
      "bytes": RustWrenEngineColumnType.BYTES,
```
106-108: Make nested-type detection case-insensitive.

Defensive against any casing variance.
```diff
- def has_nested_columns(row) -> bool:
-     return "STRUCT" in row["data_type"]
+ def has_nested_columns(row) -> bool:
+     return "struct" in row["data_type"].lower()
```
218-219: Prefer loguru's lazy formatting.

Avoid f-string evaluation when the log is filtered.
```diff
- if mapped_type == RustWrenEngineColumnType.UNKNOWN:
-     logger.warning(f"Unknown BigQuery data type: {data_type}")
+ if mapped_type == RustWrenEngineColumnType.UNKNOWN:
+     logger.warning("Unknown BigQuery data type: {}", data_type)
```
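The deferral being suggested can be seen with stdlib logging, which uses the same lazy mechanism via %-style arguments (loguru's brace-style arguments behave analogously):

```python
import logging

logger = logging.getLogger("type_mapping_demo")
logger.setLevel(logging.ERROR)  # WARNING records are filtered out here

class Expensive:
    """Counts how often the value is actually formatted."""
    renders = 0
    def __str__(self) -> str:
        Expensive.renders += 1
        return "weird_type(42)"

dtype = Expensive()

# f-string: the argument is formatted even though the record is filtered
logger.warning(f"Unknown data type: {dtype}")
assert Expensive.renders == 1

# deferred style: formatting is skipped entirely for filtered records
logger.warning("Unknown data type: %s", dtype)
assert Expensive.renders == 1  # unchanged: __str__ never ran
```

The cost difference is negligible for simple strings, but it adds up for hot paths or expensive `__str__`/`repr` implementations.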
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- ibis-server/app/model/metadata/bigquery.py (3 hunks)
- ibis-server/app/model/metadata/canner.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ibis-server/app/model/metadata/canner.py
🧰 Additional context used
🧬 Code graph analysis (1)
ibis-server/app/model/metadata/bigquery.py (2)
- ibis-server/app/model/metadata/dto.py (1): RustWrenEngineColumnType (13-57)
- ibis-server/app/model/metadata/clickhouse.py (1): _transform_column_type (112-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci
🔇 Additional comments (2)
ibis-server/app/model/metadata/bigquery.py (2)
1-1: Logger import is fine.

Adopting loguru here is consistent with other backends.
197-197: No action needed: Python ≥3.11 supports PEP 604 unions.

The project's `requires-python = ">=3.11"` (pyproject.toml:6) ensures `str | RustWrenEngineColumnType` is valid.
BigQuery has been tested locally.
Description
Summary by CodeRabbit
- New Features
- Refactor
- Bug Fixes
- Chores
Chores