
Lightning on x86_64 will crash when importing an empty table via Parquet #52518

Closed
kennytm opened this issue Apr 12, 2024 · 5 comments · Fixed by #52519
Labels
affects-7.1 affects-7.5 affects-8.1 component/lightning This issue is related to Lightning of TiDB. report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.

Comments

@kennytm
Contributor

kennytm commented Apr 12, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. On Amazon Aurora, create an empty table.

    create database account;
    create table audits_log (id int primary key);
  2. Create an AWS Aurora Snapshot (or directly unzip the attached file → account.audits_log.parquet.zip)

  3. Import the snapshot using https://docs.pingcap.com/tidb/stable/migrate-aurora-to-tidb

2. What did you expect to see? (Required)

Import success

3. What did you see instead (Required)

TiDB Lightning panicked with

panic: counter cannot decrease in value

goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*counter).Add(0xc000af4a98?, 0xc000cdc5c0?)
	/go/pkg/mod/github.com/prometheus/client_golang@…/prometheus/counter.go:121 +0xf4
github.com/pingcap/tidb/br/pkg/lightning/importer.(*Controller).estimateChunkCountIntoMetrics(0xc000d26e00, {0x5a35368, 0xc0000e4eb0})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/importer/import.go:1032 +0x1c2
github.com/pingcap/tidb/br/pkg/lightning/importer.(*Controller).initCheckpoint(0xc000d26e00, {0x5a35368, 0xc0000e4eb0})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/importer/import.go:878 +0x165
github.com/pingcap/tidb/br/pkg/lightning/importer.(*Controller).Run(0xc000d26e00, {0x5a35368, 0xc0000e4eb0})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/importer/import.go:511 +0x250
github.com/pingcap/tidb/br/pkg/lightning.(*Lightning).run(0xc0006d6b60, {0x5a353a0, 0xc000080098}, 0xc000abe000, 0xc000abc070)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/lightning.go:595 +0x12ab
github.com/pingcap/tidb/br/pkg/lightning.(*Lightning).RunOnceWithOptions(0xc0006d6b60, {0x5a353a0, 0xc000080098}, 0xc000abe000, {0x0, 0x0, 0xc000d98880?})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/lightning.go:388 +0x2b7
main.main.func2(0xc000d93680, 0x51c7c66?)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/tidb-lightning/main.go:94 +0x74
main.main()
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/tidb-lightning/main.go:95 +0x6b3

4. What is your TiDB version? (Required)

Release Version: v7.1.4
Git Commit Hash: a8b650f8372ede73b49dcc99289828c676075fb6
Git Branch: heads/refs/tags/v7.1.4
Go Version: go1.20.12
UTC Build Time: 2024-03-07 08:16:32
Race Enabled: false

CPU: x86_64, OS: Linux

@kennytm kennytm added type/bug The issue is confirmed as a bug. component/lightning This issue is related to Lightning of TiDB. affects-7.1 labels Apr 12, 2024
@kennytm
Contributor Author

kennytm commented Apr 12, 2024

The panic is caused by an incorrect file size calculation in

size := int64(float64(totalRowCount) / float64(rowCount) * float64(rowSize))

When a Parquet file contains 0 rows, this line is reached with totalRowCount == rowCount == rowSize == 0, so Lightning is effectively computing

size := int64(0.0 / 0.0 * 0.0)

The result of converting NaN to int64 in Go is architecture-dependent: on arm64 the result is 0, but on x86_64 it is -9223372036854775808 (the minimum int64 value).
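
For illustration, here is a minimal standalone Go program (not Lightning code) that reproduces the conversion in isolation:

```go
package main

import "fmt"

func main() {
	// All three values are 0 for an empty Parquet file.
	var totalRowCount, rowCount, rowSize int64

	ratio := float64(totalRowCount) / float64(rowCount) // 0.0 / 0.0 == NaN
	size := int64(ratio * float64(rowSize))             // int64(NaN) is implementation-defined

	// Prints 0 on arm64, -9223372036854775808 on x86_64.
	fmt.Println(size)
}
```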

This later causes the estimatedEngineCnt below to be loaded with a very large negative number,

estimatedEngineCnt += ((tableMeta.TotalSize + int64(batchSize) - 1) / int64(batchSize)) + 1

which is finally fed into Prometheus and causes the panic:

m.ProcessedEngineCounter.WithLabelValues(metric.ChunkStateEstimated, metric.TableResultSuccess).
	Add(float64(estimatedEngineCnt))
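
For reference, a minimal sketch of how a negative delta trips the Prometheus client; the metric name and labels below are placeholders for this demo, not the real ProcessedEngineCounter definition:

```go
package main

import (
	"math"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Placeholder counter standing in for m.ProcessedEngineCounter.
	c := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "processed_engines_total", Help: "demo"},
		[]string{"state", "result"},
	)

	estimatedEngineCnt := int64(math.MinInt64) // what int64(NaN) propagates to on x86_64

	// Counter.Add panics on a negative delta with
	//   panic: counter cannot decrease in value
	c.WithLabelValues("estimated", "success").Add(float64(estimatedEngineCnt))
}
```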

@kennytm
Contributor Author

kennytm commented Apr 12, 2024

BTW, I think the current implementation of SampleParquetDataSize is too resource-intensive for estimating the Parquet file size. Look at what it does:

  1. Get the totalRowCount through ReadParquetFileRowCountByFile(), which opens the file and reads .Footer.NumRows.
  2. Open the file again, and read the first 8 KiB or 500 rows (whichever is smaller) to get the size/row ratio.

This means we need to open the file twice just to guess the size. For external storage like S3, Lightning will need to download every Parquet file twice during the "load data source" process.

For the customer that triggered this bug, this took 1h42m for a process that should have been just a very quick directory walk (SampleFileCompressRatio has a similar issue). IMO this is unacceptable. The "load data source" process should be O(number of files) rather than O(total file size). It would be better to just over-estimate the total TableSize.
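
For context, a compilable sketch of the two-pass flow described above; the helper functions are hypothetical stand-ins for ReadParquetFileRowCountByFile() and the 8 KiB / 500-row sampling loop, not the actual Lightning code:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readFooterRowCount stands in for reading .Footer.NumRows; the real code
// parses the Thrift-encoded footer (elided here).
func readFooterRowCount(r io.ReadSeeker) (int64, error) {
	return 0, nil
}

// sampleRows stands in for decoding up to maxBytes or maxRows rows to
// measure the on-disk bytes-per-row ratio (elided here).
func sampleRows(r io.Reader, maxBytes, maxRows int64) (rowCount, rowSize int64, err error) {
	return 0, 0, nil
}

func estimateParquetSize(path string) (int64, error) {
	// Open #1: read only the footer to get the total row count.
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	totalRowCount, err := readFooterRowCount(f)
	f.Close()
	if err != nil {
		return 0, err
	}

	// Open #2: sample the first 8 KiB or 500 rows, whichever is hit first.
	f, err = os.Open(path)
	if err != nil {
		return 0, err
	}
	rowCount, rowSize, err := sampleRows(f, 8*1024, 500)
	f.Close()
	if err != nil {
		return 0, err
	}

	if rowCount == 0 {
		// Empty table: bail out instead of computing int64(NaN).
		return 0, nil
	}
	return int64(float64(totalRowCount) / float64(rowCount) * float64(rowSize)), nil
}

func main() {
	size, err := estimateParquetSize("account.audits_log.parquet")
	fmt.Println(size, err)
}
```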

@lance6716
Contributor

lance6716 commented Apr 12, 2024

> The "load data source" process should be O(number of files) rather than O(total file size)

To get the totalRowCount, Lightning opens each file, seeks to the footer, and reads it. To get the sample row count and row size, Lightning reads a constant amount (the first 8 KiB). That is not proportional to the total file size, is it?

BTW, I think the uncompressed size in the Parquet header should be usable, but the PR author said no in #46984 (comment). A bit suspicious.

@kennytm
Contributor Author

kennytm commented Apr 12, 2024

@lance6716 the first open uses (*S3Storage).Open(), which does not provide a byte range, so it will try to fetch the entire file from S3. Later, .ReadFooter() will Seek() to the footer offset; if the current buffer does not cover the new offset, it reopens the file with bytes=xxxx-.

In any case, perhaps even O(number of files) is still too heavy compared to the later sampledIndexRatio procedure, which is O(number of tables). It should not waste 2 hours before the pre-check is able to run.
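
To illustrate the byte-range point: a ranged GET can fetch just the Parquet trailer (the 4-byte footer length plus the "PAR1" magic) without downloading the whole object. The sketch below uses the plain AWS SDK rather than Lightning's storage layer, and the bucket and key are placeholders:

```go
package main

import (
	"context"
	"encoding/binary"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// footerLength reads only the last 8 bytes of a Parquet object via a ranged
// GET: the 4-byte little-endian footer length followed by the "PAR1" magic.
func footerLength(ctx context.Context, svc *s3.S3, bucket, key string) (uint32, error) {
	out, err := svc.GetObjectWithContext(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Range:  aws.String("bytes=-8"), // suffix range: fetch only the last 8 bytes
	})
	if err != nil {
		return 0, err
	}
	defer out.Body.Close()

	buf := make([]byte, 8)
	if _, err := io.ReadFull(out.Body, buf); err != nil {
		return 0, err
	}
	if string(buf[4:]) != "PAR1" {
		return 0, fmt.Errorf("%s is not a parquet file", key)
	}
	return binary.LittleEndian.Uint32(buf[:4]), nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	n, err := footerLength(context.Background(), svc, "my-bucket", "account/audits_log/part-00000.parquet")
	// A second ranged GET of "bytes=-(n+8)" would then fetch the footer itself.
	fmt.Println(n, err)
}
```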

@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 19, 2024