
Working deployment stopped working #32

Open
philip-kuhn opened this issue Jan 17, 2024 · 2 comments

Comments

@philip-kuhn

Hi guys. We've been running a stock-standard Azure Container Apps deployment successfully since December 4th, 2023. It had been running fine, with successful data querying, as of COB last Friday. Since Monday morning the container has been crashing and can't start up. As far as I'm aware, nothing ran or was done on our side, by either an automated or a human process, that touched the resource. This is the second deployment this has happened to (it ran well for a few weeks, then the container suddenly started crashing), and I'm struggling to understand why. The log stream shows:

2024-01-17T06:32:52.25782 Connecting to the container 'qdrantapicontainerapp'...
2024-01-17T06:32:52.27576 Successfully Connected to container: 'qdrantapicontainerapp' [Revision: 'sygniasynapseqdranthttp--0tfisge-567f7bd697-5hr52', Replica: 'sygniasynapseqdranthttp--0tfisge']
2024-01-17T06:32:37.835011814Z 2: std::panicking::rust_panic_with_hook
2024-01-17T06:32:37.835016242Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:735:13
2024-01-17T06:32:37.835020700Z 3: std::panicking::begin_panic_handler::{{closure}}
2024-01-17T06:32:37.835024728Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:609:13
2024-01-17T06:32:37.835028695Z 4: std::sys_common::backtrace::__rust_end_short_backtrace
2024-01-17T06:32:37.835032312Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
2024-01-17T06:32:37.835037161Z 5: rust_begin_unwind
2024-01-17T06:32:37.835041559Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
2024-01-17T06:32:37.835046238Z 6: core::panicking::panic_fmt
2024-01-17T06:32:37.835049925Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
2024-01-17T06:32:37.835055635Z 7: collection::shards::shard_holder::ShardHolder::load_shards::{{closure}}.110038
2024-01-17T06:32:37.835059663Z 8: storage::content_manager::toc::TableOfContent::new
2024-01-17T06:32:37.835063911Z 9: qdrant::main
2024-01-17T06:32:37.835067928Z 10: std::sys_common::backtrace::__rust_begin_short_backtrace
2024-01-17T06:32:37.835072066Z 11: main
2024-01-17T06:32:37.835075943Z 12:
2024-01-17T06:32:37.835079880Z 13: __libc_start_main
2024-01-17T06:32:37.835083507Z 14: _start
2024-01-17T06:32:37.835086994Z
2024-01-17T06:32:37.835092183Z 2024-01-17T06:32:37.834886Z ERROR qdrant::startup: Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 246: Failed to load local shard "./storage/collections/[redacted]/0": Service internal error: RocksDB open error: IO error: No such file or directory: while unlink() file: ./storage/collections/[redacted]/0/segments/23a17757-59d1-4649-acbb-7b5b183af4bb/LOG.old.1705084754144915: No such file or directory

If I browse to the file it's looking for in the Azure portal, it's reported as being marked for deletion by an SMB client. As far as I know, no human action did this, and all other files are accessible. This is also the only LOG.old file that has contents; all the others are 0 bytes. We can't delete the file because it's already marked for deletion, and I can't upload any sort of replacement, so short of redeploying everything I'm not sure where to go from here. I set the soft-delete period to the minimum (1 day) in the hope that once the file was deleted things would sort themselves out, but the file hasn't been deleted and is still present but inaccessible. I'm really hoping I don't have to do a complete redeploy to fix this, so any assistance in understanding why this happened would be highly appreciated.
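For anyone auditing a similar state: one way to check the storage tree before restarting the container is to walk the Qdrant storage directory and list every rotated RocksDB log, flagging any that are non-empty or that can't even be stat'ed (as a delete-pending file on an SMB mount may be). This is a minimal sketch, not part of Qdrant or Azure tooling; the `find_stale_rocksdb_logs` helper name and the `./storage` layout are assumptions inferred from the panic message above.

```python
from pathlib import Path


def find_stale_rocksdb_logs(storage_root: str):
    """Scan a storage tree for rotated RocksDB logs (LOG.old.<timestamp>).

    Returns a list of (path, size) pairs. A size of None means the file
    could not be stat'ed -- e.g. a delete-pending file on an SMB share.
    """
    results = []
    for path in Path(storage_root).rglob("LOG.old.*"):
        try:
            results.append((str(path), path.stat().st_size))
        except OSError:
            # stat() can fail for a file stuck in a delete-pending state.
            results.append((str(path), None))
    return results


if __name__ == "__main__":
    for path, size in find_stale_rocksdb_logs("./storage"):
        marker = "UNREADABLE" if size is None else f"{size} bytes"
        print(f"{marker:>12}  {path}")
```

Run against the mounted file share, this would surface any non-empty or unreadable `LOG.old.*` file, such as the one named in the panic, so it can be investigated before Qdrant tries to unlink it at startup.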

Thanks so much

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

No idea. It was working fine, and then something marked the LOG.old file for deletion.

Any log messages given by the failure

See the log stream excerpt above.
Expected/desired behavior

That the working deployment continues to work

OS and Version?

Azure Container Apps, so probably Linux

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

@tawalke
Contributor

tawalke commented Mar 11, 2024

@philip-kuhn Is this still an issue? My team and I also added Qdrant as an add-on to ACA so you can just use that as well: https://learn.microsoft.com/en-us/azure/container-apps/add-ons-qdrant

If it is still not working on ACA as a current deployment, please file a support case.

@philip-kuhn
Author

Hi Tara

When I last looked at it, it was. We needed to get our chat service up and running again, and I wasn't prepared to recreate the resource and restore the data for a second time, only for this to happen again, a third time, in a month. We moved to Azure AI Search.

Thanks for your efforts
