Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSVC: Spurious PDB error on the bots #33145

Closed
alexcrichton opened this issue Apr 22, 2016 · 23 comments
Closed

MSVC: Spurious PDB error on the bots #33145

alexcrichton opened this issue Apr 22, 2016 · 23 comments

Comments

@alexcrichton
Copy link
Member

Bots are very frequently nowadays failing with:

note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006BA)'

Normally on some run-pass test. The test that fails is almost always different.

No idea what's going on, and the internet seems to think that this is either one of:

  • Something running in parallel
  • OOM
  • Stale cache files
  • maybe more?

I'm pretty sure the bots aren't hitting OOM as these are tiny tests, I'm also pretty sure that nothing is stale b/c we clean everything (although I could be wrong), and I find it hard to believe that you can't run link.exe in parallel.

No idea what's going on :(

@aturon
Copy link
Member

aturon commented Apr 22, 2016

cc @vadimcn @retep998

@alexcrichton
Copy link
Member Author

Theory: We have two concurrent builds running on the same machine. They both hit the run-pass test suite at the same time. They then try to compile the same test case at the same time. Finally, this chokes mspdbsrv.

That seems to align with the vague descriptions on the internet about parallel builds, explains why we started seeing this recently (an uptick in the amount of things we test), explains the nondeterminism, and explains the different test cases.

Testing locally seems to confirm this as well. A bunch of parallel compilers compiling the same fn main() {} in main.rs will fault eventually on 32-bit MSVC. They do not, however, fault on 64-bit MSVC. Confirming now we've never retried 64-bit MSVC because of this bug and then I'll just limit at most one 32-bit MSVC build per buildslave.

@alexcrichton
Copy link
Member Author

To update, looks like the 32-bit linker on MSVC will simply choke if it is ever invoked in parallel. We only have one 32-bit MSVC bot which confused me for a bit, but this'd explain why we're failing in run-pass, we're invoking a lot of things in parallel...

alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 22, 2016
Looks like `link.exe` can't be called in parallel when it is targeting 32-bit.
There's some weird complications with `mspdbsrv.exe` which can't seem to be
worked around. Instead, let's just globally lock all invocations of the linker
to ensure there's only one linker running at a time.

For more detail, see the comments in the commit itself.

Closes rust-lang#33145
@alexcrichton
Copy link
Member Author

For posterity, this was my test program. This fails on 32-bit MSVC today (the 1.8 release) and will hopefully pass once #33155 lands.

use std::process::*;
use std::io::*;
use std::path::*;
use std::fs::File;
use std::fs;
use std::thread;

const N: usize = 2;
const M: usize = 100;

fn main() {
    let dirs = (0..N).map(|n| {
        PathBuf::from(format!("dir{}", n))
    }).collect::<Vec<_>>();

    for dir in dirs.iter() {
        fs::create_dir_all(&dir).unwrap();
        File::create(dir.join("main.rs")).unwrap()
            .write_all(b"fn main() {}").unwrap();
    }

    let threads = dirs.iter().cloned().map(|m| {
        thread::spawn(move || {
            for _ in 0..M {
                let mut cmd = Command::new("rustc");
                cmd.current_dir(&m).arg("--crate-name").arg(&m).arg("main.rs");
                println!("{:?}", cmd);
                let status = cmd.status().unwrap();
                assert!(status.success());
            }
        })
    }).collect::<Vec<_>>();

    for thread in threads {
        thread.join().unwrap();
    }
}

@vadimcn
Copy link
Contributor

vadimcn commented Apr 22, 2016

From what I could glean on the internet, it seems that the culprit is mspdbsrv running out of memory. Why only on 32-bit builder? Because it probably uses 32-bit tools, so mspdbsrv is limited to 2G of memory. Care to check? (I don't have a 32-bit Windows handy).

@alexcrichton
Copy link
Member Author

From what I could glean on the internet, it seems that the culprit is mspdbsrv running out of memory.

Yeah unfortunately I found the internet pretty unreliably in diagnosis here. I could definitely repro locally and didn't see OOM anywhere (maybe I misssed it?).

Why only on 32-bit builder?

No idea! But actually, I'm not sure :(. The program above will succeed for x86_64-pc-windows-msvc and fail for i686-pc-windows-msvc, though.

Care to check?

Unfortunately I don't have access to a 32-bit only machine either :(

@retep998
Copy link
Member

Did you know, the point of mspdbsrv.exe is actually to serialize requests to PDB files to avoid issues that arising when invoking the compiler in parallel?

@alexcrichton
Copy link
Member Author

Ok, after some more investigation, I think I see what's going on now. While the observations here have been, true, they've basically all been red herrings unfortunately. I can safely say that link.exe can indeed be run in parallel, it just apparently handles crashes and exits of mspdbsrv.exe really badly.

So as a diagnosis, here's what I think is happening:

  1. When 32-bit link.exe is run, it wants to talk to mspdbsrv.exe for some reason (presumably to generate a pdb file).
  2. If mspdbsrv.exe isn't running, link.exe appears to spawn it. Not only that, it appears to do so in a synchronized fashion. That is if two concurrent link.exe invocations are run in parallel on a system only one mspdbsrv.exe seems to get spawned. You can notice this in that the first link.exe command will take a lot longer than the next.
  3. If mspdbsrv.exe is killed, all "connected" link.exe instances will die horribly. I've seen lots of the RPC errors mentioned here and even segfaults.

Given that information, why did I think that concurrent link.exe invocations are bad? Well it turns out that I was always using rustc via rustup.rs (even when I was testing on the bots), which uses job objects. This means that the entire process tree was killed whenever rustup's rustc wrapper exited. So what I was witnessing was:

  1. I ran rustc in parallel.
  2. They all spawned link.exe.
  3. One of the link.exe instances decided to spawn mspdbsrv.exe, and all the others patiently waited while that happened.
  4. The first link.exe finished, causing the first rustc to finish, which in turn caused the rustup rustc wrapper to finish, when then tore down the spawned mspdbsrv.exe.
  5. Other linkers continued, but hit RPC errors as they apparently don't handle mspdbsrv.exe exiting very well.

Ok, so that explains why I thought you can't run rustc in parallel. It also explains why sometimes I could reproduce the bug locally and others I could not (sometimes I had mspdbsrv.exe running around and sometimes I didn't). So how do we explain the buildbots? And how do we explain why this has all been happening recently? Let's take a look!

  • First up, our bots are running an older version of Windows that only supports one job object. We use up that job object for the entire build. That is, the entire make process and following process tree are all in one job object to make sure that processes are not persistent between builds.
  • Next, the rustbuild build system (and cargo) use some job objects, but they're not used because Windows only supports one job object.
  • Consequently, mspdbsrv.exe, if spawned during our build, will get killed when a build finishes. We currently have only 2 builds that I know of that spawn link.exe in 32-bit mode auto-win-msvc-32-opt and auto-win-msvc-32-cross-opt. I think the cargotest builder would, but it's primarily 64-bit and disables cross-compiled tests. Looking at some logs, a curious trend emerges. Both the opt and cross-opt builders are always running on the same machine when opt fails. Even more interestingly, they always fail at the same time!

Presumably what's happening is that the cross-opt builder process spawned mspdbsrv.exe, and then opt started using it. When cross-opt exits it kills mspdbsrv.exe and then opt dies along with it as it was trying to use link.exe. The failure doesn't happen if opt spawns mspdbsrv.exe (it takes longer) or if cross-opt exits between tests on opt.

Note that this also explains why we just started seeing these failures recently. The cross-opt builder is relatively new.


Funny story, first hit on google for "isolate mspdbsrv.exe" is this article which is hilariously exactly what's happening to us. Unfortunately there aren't many replies...

As for a solution... well... I guess we can run one 32-bit build per machine? Have a whole slave dedicated to the cross-opt builder? Drawing some blanks...

@strega-nil
Copy link
Contributor

Could we just leave mspdbsrv.exe running? It seems to me like we are killing it, so... (note, I know nothing)

@steveklabnik
Copy link
Member

@alexcrichton well done. wow that's tricky.

@vadimcn
Copy link
Contributor

vadimcn commented Apr 26, 2016

Apparently, there is an undocumented environment variable _MSPDBSRV_ENDPOINT_, which allows to start multiple instances of mspdbsrv. I did a quick test, and it appears to work.
(found here https://codereview.chromium.org/83803003)

@retep998
Copy link
Member

retep998 commented Apr 26, 2016

Well it turns out that I was always using rustc via rustup.rs (even when I was testing on the bots), which uses job objects. This means that the entire process tree was killed whenever rustup's rustc wrapper exited.

A couple days ago on #rust-internals...

(13:58:16) WindowsBunnyEngineer: if mspdbsrv is shared between concurrent builds
(13:58:35) WindowsBunnyEngineer: since mspdbsrv is added to the job object of the process that started it...
(13:58:57) WindowsBunnyEngineer: when that build terminates, the mspdbsrv it started will get killed
(13:59:03) WindowsBunnyEngineer: while another build could be relying on it still

@alexcrichton
Copy link
Member Author

Thanks for the link @vadimcn! For now I've opted to modify our job object management to just "leak" mspdbsrv.exe (basically what @ubsan mentioned), but it's good to have a solution like that in our back pocket.

This has been deployed to the bots, and the current build passed, so let's just pray we never see this again.

@lilith
Copy link

lilith commented Mar 11, 2017

I'm seeing this when running cargo test (link fails when compiling hyper).

cargo 0.18.0-nightly (5db6d64 2017-03-03)
rustc 1.17.0-nightly (b1e31766d 2017-03-03)

cargo test --target i686-pc-windows-msvc

error: linking with `link.exe` failed: exit code: 1318
  |
  = note: "C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\link.
exe" "/NOLOGO" "/NXCOMPAT" "/LARGEADDRESSAWARE" "/SAFESEH" "/LIBPATH:C:\\Users\\
n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i
686-pc-windows-msvc\\lib" "C:\\Users\\n\\Documents\\imageflow\\target\\debug\\bu
ild\\hyper-d6d1e087e41dda9f\\build_script_build-d6d1e087e41dda9f.build_script_bu
ild.o" "/OUT:C:\\Users\\n\\Documents\\imageflow\\target\\debug\\build\\hyper-d6d
1e087e41dda9f\\build_script_build-d6d1e087e41dda9f.exe" "/OPT:REF,NOICF" "/DEBUG
" "/LIBPATH:C:\\Users\\n\\Documents\\imageflow\\target\\debug\\deps" "/LIBPATH:C
:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\
rustlib\\i686-pc-windows-msvc\\lib" "C:\\Users\\n\\Documents\\imageflow\\target\
\debug\\deps\\librustc_version-1324ff5c3ace67a8.rlib" "C:\\Users\\n\\Documents\\
imageflow\\target\\debug\\deps\\libsemver-f260b12f97c06fb2.rlib" "C:\\Users\\n\\
.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686
-pc-windows-msvc\\lib\\libstd-13ee61d82f393677.rlib" "C:\\Users\\n\\.rustup\\too
lchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-
msvc\\lib\\libpanic_unwind-48ea9b793aac0c7c.rlib" "C:\\Users\\n\\.rustup\\toolch
ains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-msv
c\\lib\\libunwind-d47ebac29f9bd14f.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nig
htly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\l
iblibc-c9fd286480cea820.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-0
3-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\librand-83ea
7d0fdec3fdde.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-p
c-windows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\libcollections-6abea59f
9a3a2e81.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-wi
ndows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\liballoc-18b5b7bd96a82034.r
lib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc
\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\liballoc_system-a1b08813244b60c5.rlib
" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\l
ib\\rustlib\\i686-pc-windows-msvc\\lib\\libstd_unicode-1caf05b3f5a5e05c.rlib" "C
:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\
rustlib\\i686-pc-windows-msvc\\lib\\libcore-046c133680ee59d2.rlib" "C:\\Users\\n
\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i6
86-pc-windows-msvc\\lib\\libcompiler_builtins-056daeb08ac9c69e.rlib" "advapi32.l
ib" "ws2_32.lib" "userenv.lib" "shell32.lib" "msvcrt.lib"
  = note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006B
A)'


error: aborting due to previous error

error: Could not compile `hyper`.
Build failed, waiting for other jobs to finish...
error: build failed

@lilith
Copy link

lilith commented Mar 11, 2017

This appears to be part of a larger pattern of issues when --target is specified, but has the same value as the default target displayed by rustup show.

Likely a separate bug, but is has the same manifestation.

@alexcrichton
Copy link
Member Author

@nathanaeljones are you using rustup? If so you may be running into rust-lang/rustup#746

@lilith
Copy link

lilith commented Mar 11, 2017 via email

@lilith
Copy link

lilith commented Mar 13, 2017

FWIW, this goes away if I drop --target.

@danielniccoli
Copy link

danielniccoli commented Mar 17, 2017

I see the same errors with rustc 1.16.0 (30cf806 2017-03-10) and cargo-0.17.0-nightly (f9e5481 2017-03-03) when doing a simple cargo build, cargo run, or cargo publish. Does not happen always and on the second or third try it mostly works.

error: linking with link.exe failed: exit code: 1318
= note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006BA)
= note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x00000005) (sometimes)

@Geobert
Copy link

Geobert commented May 15, 2018

We're having this issue with Rust 1.24, 1.25 and 1.26 at least, on Cobalt and Gutenberg

@Keats
Copy link

Keats commented May 15, 2018

@alexcrichton

More details: https://ci.appveyor.com/project/Keats/gutenberg/build/1.0.540

I only had this problem recently, after 1.26 was released and it was happening to 100% of my Windows builds.
From the matrix above, it seems that it does work for Gutenberg for 1.24 and 1.25 at least (I'm re-running that PR now to ensure it's repeatable).
The appveyor config is https://github.com/Keats/gutenberg/blob/6b99fce795c373575493455a83e5abcfa5bbe653/appveyor.yml and it does use the --target, I'll try removing it once the first re-run is done

Edit: second run fails only on 1.26 as well https://ci.appveyor.com/project/Keats/gutenberg/build/1.0.541

@alexcrichton
Copy link
Member Author

Sorry I don't have a lot of time to dig into this right now, can you open a new dedicated issue for this with updated information? Any way to reproduce it locally and/or minimization on AppVeyor would also be greatly appreciated!

@Keats
Copy link

Keats commented May 15, 2018

Done as #50778

I'll try a bit more tomorrow to figure it out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants