improved warning for ulimit -n 4096 needed #1758

Open · jsha opened this issue Jul 12, 2022 · 10 comments

@jsha (Contributor) commented Jul 12, 2022

The README says:

All tests are failing or timing out

Our test setup needs a certain amount of file descriptors.

At least 4096 should be enough, you can set it via:

$ ulimit -n 4096

But I've forgotten and relearned this a number of times, and it's cost me a lot of frustration each time, since the errors (and especially the hangs) are opaque. Is there a way we can improve on this? Maybe each test can check for available FDs and fast-fail if they're not available?
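
A minimal sketch of that kind of check (illustrative only, not code that exists in the repo; it assumes calling libc::getrlimit directly, and the 4096 threshold just mirrors the README):

// Hypothetical helper: fail fast when the soft RLIMIT_NOFILE is below what
// the test suite needs, instead of letting tests hang or fail opaquely.
fn assert_enough_file_descriptors() {
    let mut lim = libc::rlimit {
        rlim_cur: 0,
        rlim_max: 0,
    };
    // SAFETY: getrlimit only writes into the struct we pass in.
    let rc = unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) };
    assert_eq!(rc, 0, "getrlimit(RLIMIT_NOFILE) failed");
    assert!(
        lim.rlim_cur >= 4096,
        "the test suite needs at least 4096 file descriptors, but the soft limit is {}; run `ulimit -n 4096` and retry",
        lim.rlim_cur
    );
}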

@jyn514 (Member) commented Jul 12, 2022

tbh if we're going to spend time on this I think we should just require fewer FDs in tests

I think most of them are because we write files to disk just to immediately read them back into the database? That should be pretty easy to mock.
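
As a rough sketch of the kind of mock meant here (hypothetical names, not the actual docs.rs storage types): keep the blobs in memory so the write-then-read path never opens a real file descriptor.

use std::collections::HashMap;

// Illustrative only; the real storage layer has a different API.
trait BlobStore {
    fn put(&mut self, path: &str, content: Vec<u8>);
    fn get(&self, path: &str) -> Option<&[u8]>;
}

#[derive(Default)]
struct InMemoryStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl BlobStore for InMemoryStore {
    fn put(&mut self, path: &str, content: Vec<u8>) {
        self.blobs.insert(path.to_owned(), content);
    }

    fn get(&self, path: &str) -> Option<&[u8]> {
        self.blobs.get(path).map(Vec::as_slice)
    }
}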

@jsha (Contributor, Author) commented Jul 12, 2022

I'm not sure it's about files. TCP connections also consume an FD. This is very unscientific, but I ran cargo test and, while it was running, repeatedly ran netstat -peanut in another tab. It started out small and then at some point blossomed into 939 connections; I'm guessing it might get bigger at a later point.

Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp        0      0 127.0.0.1:39446         127.0.0.1:9000          ESTABLISHED 1000       1206618    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:41046         127.0.0.1:9000          ESTABLISHED 1000       1207059    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:40650         127.0.0.1:9000          ESTABLISHED 1000       1195934    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39568         127.0.0.1:9000          ESTABLISHED 1000       1211485    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39318         127.0.0.1:9000          ESTABLISHED 1000       1205172    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39768         127.0.0.1:9000          ESTABLISHED 1000       1211658    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39766         127.0.0.1:9000          ESTABLISHED 1000       1211657    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:40664         127.0.0.1:9000          ESTABLISHED 1000       1195939    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39222         127.0.0.1:9000          ESTABLISHED 1000       1206478    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39266         127.0.0.1:9000          ESTABLISHED 1000       1209753    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39890         127.0.0.1:9000          ESTABLISHED 1000       1211748    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39864         127.0.0.1:9000          ESTABLISHED 1000       1211725    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:56042         127.0.0.1:15432         ESTABLISHED 1000       1194883    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:39998         127.0.0.1:9000          ESTABLISHED 1000       1211802    276569/docs_rs-3f1c
tcp        0      0 127.0.0.1:40554         127.0.0.1:9000          ESTABLISHED 1000       1212960    276569/docs_rs-3f1c

I killed the tests after seeing that, so I think one or more of the tests late in this list were responsible:

running 246 tests
test build_queue::tests::test_queued_crates ... ok
test db::add_package::test::new_owners ... ok
test build_queue::tests::test_add_duplicate_doesnt_fail_last_priority_wins ... ok
test db::add_package::test::add_new_owners_and_delete_old ... ok
test db::add_package::test::update_owner_detais ... ok
test build_queue::tests::test_add_duplicate_resets_attempts_and_priority ... ok
test build_queue::tests::test_prioritized_count ... ok
test build_queue::tests::test_add_and_process_crates ... ok
test build_queue::tests::test_pending_count ... ok
test build_queue::tests::test_failed_count ... ok
test db::add_package::test::keyword_conflict_when_rebuilding_release ... ok
test db::add_package::test::new_keywords ... ok
test docbuilder::rustwide_builder::tests::test_build_binary_crate ... ignored
test docbuilder::rustwide_builder::tests::test_build_crate ... ignored
test docbuilder::rustwide_builder::tests::test_cross_compile_non_host_default ... ignored
test docbuilder::rustwide_builder::tests::test_locked_fails_unlocked_needs_new_deps ... ignored
test docbuilder::rustwide_builder::tests::test_proc_macro ... ignored
test repositories::github::tests::get_repository_info ... ok
test repositories::github::tests::not_found ... ok
test repositories::github::tests::test_rate_limit_fail ... ok
test repositories::github::tests::test_rate_limit_manual ... ok
test repositories::gitlab::tests::get_repository_info ... ok
test repositories::gitlab::tests::not_found ... ok
test repositories::gitlab::tests::test_rate_limit ... ok
test db::blacklist::tests::test_add_to_and_remove_from_blacklist ... ok
test db::blacklist::tests::test_add_twice_to_blacklist ... ok
test storage::archive_index::tests::index_create_save_load ... ok
test db::blacklist::tests::test_remove_non_existing_crate ... ok
test db::blacklist::tests::test_list_blacklist ... ok
test repositories::updater::test::test_repository_name ... ok
test docbuilder::limits::test::retrieve_limits ... ok
test db::add_package::test::updated_keywords ... ok
test docbuilder::limits::test::targets_default_to_one_with_timeout ... ok
test db::delete::tests::test_delete_crate::false_expects ... ok
test db::delete::tests::test_delete_crate::true_expects ... ok
test storage::backend_tests::database::test_delete_percent ... ok
test storage::backend_tests::database::test_delete_prefix_without_matches ... ok
test storage::backend_tests::database::test_delete_prefix ... ok
test storage::backend_tests::database::test_exists ... ok
test storage::backend_tests::database::test_exists_without_remote_archive ... ok
test storage::backend_tests::database::test_get_object ... ok
test storage::backend_tests::database::test_get_range ... ok
test storage::backend_tests::database::test_get_too_big ... ok
test db::delete::tests::test_delete_version::false_expects ... ok
test repositories::updater::test::test_icon_name ... ok
test db::delete::tests::test_delete_version::true_expects ... ok
test storage::backend_tests::database::test_store_all ... ok
test storage::backend_tests::database::test_store_blobs ... ok
test storage::backend_tests::database::test_store_all_in_archive ... ok
test storage::backend_tests::s3::test_delete_percent ... ok
test storage::backend_tests::s3::test_delete_prefix ... ok
test storage::compression::tests::test_compression ... ok
test storage::compression::tests::test_decompression_too_big ... ok
test storage::backend_tests::s3::test_delete_prefix_without_matches ... ok
test storage::test::test_mime_types ... ok
test utils::copy::test::test_copy_doc_dir ... ok
test storage::test::test_get_file_list ... ok
test utils::queue::tests::get_priority ... FAILED
test utils::queue::tests::remove_priority ... FAILED
test utils::queue::tests::set_priority ... FAILED
test utils::queue_builder::tests::remove_existing_tempdirs ... FAILED
test utils::rustc_version::test_get_correct_docsrs_style_file ... ok
test storage::backend_tests::s3::test_exists ... ok
test utils::sized_buffer::tests::test_sized_buffer ... ok
test web::build_details::tests::both_build_logs ... FAILED
test web::build_details::tests::db_build_logs ... FAILED
test web::build_details::tests::non_existing_build::_42_expects ... FAILED
test web::build_details::tests::non_existing_build::_nan_expects ... FAILED
test web::build_details::tests::s3_build_logs ... FAILED
test web::builds::tests::build_list ... FAILED
test web::builds::tests::build_list_json ... FAILED
test web::builds::tests::crate_version_not_found ... FAILED
test web::builds::tests::invalid_semver ... FAILED
test storage::backend_tests::s3::test_get_object ... ok
test storage::backend_tests::s3::test_exists_without_remote_archive ... ok
test utils::rustc_version::test_parse_rustc_version ... ok
test storage::backend_tests::s3::test_get_range ... ok
test storage::backend_tests::s3::test_store_all ... ok
test storage::backend_tests::s3::test_store_all_in_archive ... ok
test storage::backend_tests::s3::test_get_too_big ... ok

@jyn514 (Member) commented Jul 12, 2022

@jsha a more reliable way to see the file descriptors used is to run strace and track open, connect, and bind syscalls.
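
For example, something along these lines (the exact syscall list is a guess; on current systems most file opens go through openat):

$ strace -f -e trace=open,openat,connect,bind -o /tmp/fd-trace.log cargo test
$ grep -c connect /tmp/fd-trace.log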

@jsha (Contributor, Author) commented Jul 12, 2022

This seems to be one contributor, but not the sole cause:

const MAX_CONCURRENT_UPLOADS: usize = 1000;

And there's a test case that intentionally tests that max:

docs.rs/src/storage/mod.rs

Lines 866 to 879 in 846add3

fn test_batched_uploads(storage: &Storage) -> Result<()> {
    let now = Utc::now();
    let uploads: Vec<_> = (0..=MAX_CONCURRENT_UPLOADS + 1)
        .map(|i| {
            let content = format!("const IDX: usize = {};", i).as_bytes().to_vec();
            Blob {
                mime: "text/rust".into(),
                content,
                path: format!("{}.rs", i),
                date_updated: now,
                compression: None,
            }
        })
        .collect();

But removing that one test case, or reducing MAX_CONCURRENT_UPLOADS to 10, doesn't solve the problem.

@Nemo157 (Member) commented Jul 12, 2022

I think part of the problem is that the storage is never dropped: the strong count on the Arc<Storage> after the test finishes is commonly still 5 (in ~17% of tests it's only 1, so it gets cleaned up). That probably keeps a tokio runtime inside the database pool alive, which is why I see about 1k eventfds slowly accumulating during the test run.

I'm not sure why we're keeping it alive; I didn't think there were any loops in the Arcs. That will take some more investigation.
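
A minimal way to observe that count at the end of a test (an illustrative sketch, not the actual check used) is just Arc::strong_count:

use std::sync::Arc;

// Hypothetical debugging hook: log when a handle is still shared after teardown.
fn report_leak<T>(label: &str, handle: &Arc<T>) {
    let count = Arc::strong_count(handle);
    if count > 1 {
        eprintln!("{label}: {count} strong references remain, so it won't be dropped");
    }
}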

@Nemo157 (Member) commented Jul 12, 2022

Oh, there's a comment on why:

docs.rs/src/storage/mod.rs

Lines 494 to 496 in 846add3

// We're using `&self` instead of consuming `self` or creating a Drop impl because during tests
// we leak the web server, and Drop isn't executed in that case (since the leaked web server
// still holds a reference to the storage).

We do shut down the database underneath the storage, so that shouldn't be it.

@Nemo157 (Member) commented Jul 12, 2022

I added some logging: during the complete test suite we create 2347 tokio runtimes and drop 2066 of them, so 281 are leaked from somewhere... (Also, that's roughly 10 runtimes per test; why are we creating so many? 😰 I would expect at most 4: one for the database, one for S3, one for reqwest, and maybe one somewhere else.)
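
A sketch of that kind of counting (illustrative only, not the actual instrumentation): wrap runtime creation so creations and drops bump counters that can be compared at the end of the run.

use std::sync::atomic::{AtomicUsize, Ordering};

static RUNTIMES_CREATED: AtomicUsize = AtomicUsize::new(0);
static RUNTIMES_DROPPED: AtomicUsize = AtomicUsize::new(0);

// Hypothetical wrapper; the real code creates tokio runtimes in several places.
struct CountedRuntime(tokio::runtime::Runtime);

impl CountedRuntime {
    fn new() -> std::io::Result<Self> {
        RUNTIMES_CREATED.fetch_add(1, Ordering::Relaxed);
        Ok(Self(tokio::runtime::Runtime::new()?))
    }
}

impl Drop for CountedRuntime {
    fn drop(&mut self) {
        RUNTIMES_DROPPED.fetch_add(1, Ordering::Relaxed);
    }
}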

@Nemo157 (Member) commented Jul 12, 2022

Ok, I tracked down the leaked runtimes: they're from the RepositoryStatsUpdater's reqwest clients. Those are referenced via the MainHandler from the server, so I hacked in a way to completely remove the main handler after the test completes, before we leak the iron server. That lets the total open files hover around 200-400 instead of going over 1000 (except when the 1000-uploads test runs). It also lets everything else referenced from the web server, like the storage and database, fully clean up, so doing something like this might let us remove some of the other cleanup hacks.

@jsha (Contributor, Author) commented Oct 19, 2022

It took me a minute to figure out why we leak the webserver, so I'm copying the comment here (from web/mod.rs):

    /// Iron is bugged, and it never closes the server even when the listener is dropped. To
    /// avoid never-ending tests this method forgets about the server, leaking it and allowing the
    /// program to end.
    ///
    /// The OS will then close all the dangling servers once the process exits.
    ///
    /// https://docs.rs/iron/0.5/iron/struct.Listening.html#method.close
    #[cfg(test)]
    pub(crate) fn leak(self) {
        std::mem::forget(self.inner);
    }

That's a bummer! It would be so nice to fix that underlying problem, but that would necessitate getting off of iron, which I realize is a huge task.

Given that, your proposed solution - make the MainHandler droppable - makes sense. It's a bit of a bummer to have two really different paths gated by cfg(test), though. Of course we don't want to have a Mutex on MainHandler in prod because that would reduce us to single-threaded performance. What about changing that to an Rc? And then we would only call get_mut and drop it in tests.

Another approach: what about stubbing out the RepositoryStatsUpdater in tests? By my reading it looks like it's making live internet requests during unittests, which is a problem anyhow.

@syphar (Member) commented Oct 19, 2022

I'm fairly close to a first PR that starts the axum migration (and a branch where all handlers work with axum + sqlx), so IMO we shouldn't invest too much time into fixing iron-specific things.

> Another approach: what about stubbing out the RepositoryStatsUpdater in tests? By my reading it looks like it's making live internet requests during unittests, which is a problem anyhow.

I don't think it's doing live requests; I remember @GuillaumeGomez mocked the requests via mockito.

We could stub out the updater if it becomes an issue; otherwise, it would probably be fixed once we only have one runtime and most things are async.
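
For reference, mocking a request with mockito (the 0.x API that was current at the time) looks roughly like this; the endpoint and body below are made up for illustration, not taken from the actual tests:

// Hypothetical example of a mockito 0.x mock; the real tests register their
// own endpoints and payloads.
let _mock = mockito::mock("GET", "/repos/rust-lang/docs.rs")
    .with_status(200)
    .with_header("content-type", "application/json")
    .with_body(r#"{"stargazers_count": 0}"#)
    .create();

// Point the updater's client at mockito::server_url() instead of the real
// API host, so no request leaves localhost.
let base_url = mockito::server_url();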
