[Bug] FE is not starting #37177

Open
3 tasks done
cuttingedge1109 opened this issue Jul 2, 2024 · 1 comment

cuttingedge1109 commented Jul 2, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

2.1.2

What's Wrong?

FE pods are all crashing with the following error.

2024-07-02 13:13:25,802 WARN (replayer|87) [Backend.handleHbResponse():731] Backend [id=10057, host=test-be-2.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 07:30:41, process epoch=1719905441277, tags: {location=default}] is dead,
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10176, BackendId=10057, version=638, dataSize=30327, rowCount=54, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10200, BackendId=10057, version=638, dataSize=33344, rowCount=112, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10202, BackendId=10057, version=638, dataSize=19125, rowCount=58, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10203, BackendId=10057, version=638, dataSize=32002, rowCount=80, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10177, BackendId=10057, version=638, dataSize=35468, rowCount=69, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10207, BackendId=10057, version=638, dataSize=34593, rowCount=56, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [Replica.updateReplicaInfo():491] change replica last failed version from '< 0' to '> 0', replica [replicaId=10178, BackendId=10057, version=638, dataSize=30804, rowCount=65, lastFailedVersion=639, lastSuccessVersion=638, lastFailedTimestamp=1719926005803, schemaHash=-1, state=NORMAL], old last failed version -1
2024-07-02 13:13:25,803 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 1831133, label: label_f42ab2bb810b4755_aeb962138405d8eb, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: COMMITTED, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911001529, commit time: 1719911002419, finish time: -1, reason: 
2024-07-02 13:13:25,803 INFO (replayer|87) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: column_statistics, visibleVersion, 639, visibleVersionTime: 1719911003747
2024-07-02 13:13:25,803 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 1831133, label: label_f42ab2bb810b4755_aeb962138405d8eb, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: VISIBLE, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911001529, commit time: 1719911002419, finish time: 1719911003747, reason: 
2024-07-02 13:13:25,803 INFO (replayer|87) [LoadManager.replayCreateLoadJob():191] LOAD_JOB=3576204, msg={replay create load job}
2024-07-02 13:13:25,804 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 1831134, label: label_6ca665869b794ec7_995bc126973a6e4d, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: COMMITTED, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911005551, commit time: 1719911005819, finish time: -1, reason: 
2024-07-02 13:13:25,804 INFO (replayer|87) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: column_statistics, visibleVersion, 640, visibleVersionTime: 1719911006142
2024-07-02 13:13:25,804 INFO (replayer|87) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 1831134, label: label_6ca665869b794ec7_995bc126973a6e4d, db id: 10002, table id list: 10114, callback id: -1, coordinator: FE: test-fe-1.test-fe-internal.doris.svc.cluster.local, transaction status: VISIBLE, error replicas num: 7, replica ids: 10176,10177,10178,10200,10202, prepare time: 1719911005551, commit time: 1719911005819, finish time: 1719911006142, reason: 
2024-07-02 13:13:25,804 INFO (replayer|87) [LoadManager.replayCreateLoadJob():191] LOAD_JOB=3576224, msg={replay create load job}
2024-07-02 13:13:25,804 INFO (replayer|87) [Env.setMaster():4125] setMaster MasterInfo:MasterInfo: host=test-fe-1.test-fe-internal.doris.svc.cluster.local httpPort=8030 rpcPort=9020
2024-07-02 13:13:25,805 INFO (replayer|87) [Backend.handleHbResponse():705] Backend [id=10057, host=test-be-2.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 07:30:41, process epoch=1719905441277, tags: {location=default}] is back to alive, update start time from 2024-07-02 07:30:41 to 2024-07-02 09:04:31, update be epoch from 1719905441277 to 1719911071949.
2024-07-02 13:13:25,805 WARN (replayer|87) [Backend.handleHbResponse():731] Backend [id=10169, host=test-be-1.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 05:13:41, process epoch=1719897221605, tags: {location=default}] is dead,
2024-07-02 13:13:25,805 INFO (replayer|87) [Backend.handleHbResponse():705] Backend [id=10169, host=test-be-1.test-be-internal.doris.svc.cluster.local, heartbeatPort=9050, alive=false, lastStartTime=2024-07-02 05:13:41, process epoch=1719897221605, tags: {location=default}] is back to alive, update start time from 2024-07-02 05:13:41 to 2024-07-02 09:06:16, update be epoch from 1719897221605 to 1719911176797.
2024-07-02 13:13:25,806 INFO (replayer|87) [Env.setMaster():4125] setMaster MasterInfo:MasterInfo: host=test-fe-1.test-fe-internal.doris.svc.cluster.local httpPort=8030 rpcPort=9020
2024-07-02 13:13:25,806 ERROR (replayer|87) [CatalogRecycleBin.replayErasePartition():572] replayErasePartition: partitionInfo is null for partitionId[13762]
2024-07-02 13:13:25,806 ERROR (replayer|87) [EditLog.loadJournal():1231] Operation Type 16
java.lang.NullPointerException: null
	at org.apache.doris.catalog.CatalogRecycleBin.replayErasePartition(CatalogRecycleBin.java:575) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.datasource.InternalCatalog.replayErasePartition(InternalCatalog.java:1813) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env.replayErasePartition(Env.java:3090) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:289) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env.replayJournal(Env.java:2759) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env$4.runOneCycle(Env.java:2533) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
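
To pull the failing replay entries out of a long FE log, something like the following could help. It is a minimal sketch that assumes the line layout seen in the excerpt above (timestamp, level, thread in parentheses, call site in brackets); `find_replay_errors` is a hypothetical helper name, not part of Doris.

```python
import re

# Assumed layout, taken from the excerpt above, e.g.:
# 2024-07-02 13:13:25,806 ERROR (replayer|87) [EditLog.loadJournal():1231] Operation Type 16
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) "
    r"\[(?P<site>[^\]]*)\] (?P<msg>.*)$"
)


def find_replay_errors(lines):
    """Return ERROR entries logged by replayer threads.

    Lines that do not start a new log entry (exception message,
    stack frames) are attached to the preceding ERROR entry.
    """
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = LOG_LINE.match(line)
        if m:
            # A new log entry always ends the previous entry's trace.
            current = None
            if m["level"] == "ERROR" and m["thread"].startswith("replayer"):
                current = {**m.groupdict(), "trace": []}
                entries.append(current)
        elif current is not None and line.strip():
            current["trace"].append(line.strip())
    return entries
```

Passing an open FE log file (or a list of its lines) to `find_replay_errors` yields one dictionary per ERROR entry, with the call site in `site` and any stack frames in `trace`, which makes it easier to see which journal operation the replayer stopped on.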

What You Expected?

FE pods should be running.

How to Reproduce?

We have been operating a Doris cluster (2 FE and 3 BE) without any issues for a few weeks. We deployed it using selectdb/doris-operator.
We previously had some issues with BE pods because volumes filled up and compute resources ran short. Whenever those alerts fired, we increased the resource spec, and the pods recovered.

This time the resources all look sufficient, I think, but the FE pods are not coming up.

Anything Else?

Resource spec:

  be:
      limits:
        cpu: 4
        memory: 16Gi
      requests:
        cpu: 4
        memory: 16Gi
  fe:
      requests:
        cpu: 2
        memory: 8Gi
      limits:
        cpu: 2
        memory: 8Gi

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@josedev-union

Same issue for me.
