
Working deployment stopped working #32

Open
philip-kuhn opened this issue Jan 17, 2024 · 2 comments

Comments

@philip-kuhn

Hi guys. We've been running a stock-standard Azure Container Apps deployment successfully since December 4th, 2023. It had been running fine, with successful data querying, as of COB last Friday. Since Monday morning the container has been crashing and can't start up. As far as I'm aware, nothing ran or was done on our side, by either an automated or a human process, that touched the resource. This is the second deployment this has happened to (it ran well for a few weeks, then the container suddenly started crashing), and I'm struggling to understand why. The log stream shows:

2024-01-17T06:32:52.25782 Connecting to the container 'qdrantapicontainerapp'...
2024-01-17T06:32:52.27576 Successfully Connected to container: 'qdrantapicontainerapp' [Revision: 'sygniasynapseqdranthttp--0tfisge-567f7bd697-5hr52', Replica: 'sygniasynapseqdranthttp--0tfisge']
2024-01-17T06:32:37.835011814Z 2: std::panicking::rust_panic_with_hook
2024-01-17T06:32:37.835016242Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:735:13
2024-01-17T06:32:37.835020700Z 3: std::panicking::begin_panic_handler::{{closure}}
2024-01-17T06:32:37.835024728Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:609:13
2024-01-17T06:32:37.835028695Z 4: std::sys_common::backtrace::__rust_end_short_backtrace
2024-01-17T06:32:37.835032312Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
2024-01-17T06:32:37.835037161Z 5: rust_begin_unwind
2024-01-17T06:32:37.835041559Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
2024-01-17T06:32:37.835046238Z 6: core::panicking::panic_fmt
2024-01-17T06:32:37.835049925Z at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
2024-01-17T06:32:37.835055635Z 7: collection::shards::shard_holder::ShardHolder::load_shards::{{closure}}.110038
2024-01-17T06:32:37.835059663Z 8: storage::content_manager::toc::TableOfContent::new
2024-01-17T06:32:37.835063911Z 9: qdrant::main
2024-01-17T06:32:37.835067928Z 10: std::sys_common::backtrace::__rust_begin_short_backtrace
2024-01-17T06:32:37.835072066Z 11: main
2024-01-17T06:32:37.835075943Z 12:
2024-01-17T06:32:37.835079880Z 13: __libc_start_main
2024-01-17T06:32:37.835083507Z 14: _start
2024-01-17T06:32:37.835086994Z
2024-01-17T06:32:37.835092183Z 2024-01-17T06:32:37.834886Z ERROR qdrant::startup: Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 246: Failed to load local shard "./storage/collections/[redacted]/0": Service internal error: RocksDB open error: IO error: No such file or directory: while unlink() file: ./storage/collections/[redacted]/0/segments/23a17757-59d1-4649-acbb-7b5b183af4bb/LOG.old.1705084754144915: No such file or directory

If I browse to the file it's looking for in the Azure portal, it's reported as being marked for deletion by an SMB client. As far as I know, no human action did this, and all other files are accessible. This is also the only LOG.old file that has contents; all the others are 0 bytes. We can't delete the file because it's already marked for deletion, and I can't upload any sort of replacement, so short of redeploying everything I'm not sure where to go from here. I set the soft-delete period to the minimum (1 day) in the hope that once the file was deleted things would sort themselves out, but the file hasn't been deleted and is still present but inaccessible. I'm really hoping I don't have to do a complete redeploy to fix this, so any assistance in understanding why this happened would be highly appreciated.
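For anyone auditing a similar state: one way to check the storage tree before restarting the container is to walk the Qdrant storage directory and list every rotated RocksDB log, flagging any that are non-empty or that can't even be stat'ed (as a delete-pending file on an SMB mount may be). This is a minimal sketch, not part of Qdrant or Azure tooling; the `find_stale_rocksdb_logs` helper name and the `./storage` layout are assumptions inferred from the panic message above.

```python
from pathlib import Path


def find_stale_rocksdb_logs(storage_root: str):
    """Scan a storage tree for rotated RocksDB logs (LOG.old.<timestamp>).

    Returns a list of (path, size) pairs. A size of None means the file
    could not be stat'ed -- e.g. a delete-pending file on an SMB share.
    """
    results = []
    for path in Path(storage_root).rglob("LOG.old.*"):
        try:
            results.append((str(path), path.stat().st_size))
        except OSError:
            # stat() can fail for a file stuck in a delete-pending state.
            results.append((str(path), None))
    return results


if __name__ == "__main__":
    for path, size in find_stale_rocksdb_logs("./storage"):
        marker = "UNREADABLE" if size is None else f"{size} bytes"
        print(f"{marker:>12}  {path}")
```

Run against the mounted file share, this would surface any non-empty or unreadable `LOG.old.*` file, such as the one named in the panic, so it can be investigated before Qdrant tries to unlink it at startup.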

Thanks so much

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

No idea. It was working fine, and then something marked the LOG.old file for deletion.

Any log messages given by the failure

See the log stream excerpt above.
Expected/desired behavior

That the working deployment continues to work

OS and Version?

Azure Container Apps, so probably Linux

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

@tawalke
Contributor

tawalke commented Mar 11, 2024

@philip-kuhn Is this still an issue? My team and I also added Qdrant as an add-on to ACA so you can just use that as well: https://learn.microsoft.com/en-us/azure/container-apps/add-ons-qdrant

If it is still not working on ACA as a current deployment, please file a support case.

@philip-kuhn
Author

Hi Tara

When I last looked at it, it was. We needed to get our chat service up and running again, and I wasn't prepared to recreate the resource and restore the data for a second time, only for this to happen again, a third time, in a month. We moved to Azure AI Search.

Thanks for your efforts
