Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2023q1 release #995

Merged
merged 9 commits into from
Apr 8, 2023
Merged

2023q1 release #995

merged 9 commits into from
Apr 8, 2023

Conversation

rleh
Copy link
Member

@rleh rleh commented Apr 8, 2023

@rleh rleh requested review from salkinium and chris-durand April 8, 2023 13:01
Copy link
Member

@salkinium salkinium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll do some hardware testing now.

docs/release/2023q1.md Outdated Show resolved Hide resolved
@rleh
Copy link
Member Author

rleh commented Apr 8, 2023

❌ Nucleo-F091RC (Assertion 'nvic.undef' @ 0xFFFFFFF3 (4294967283) failed!␊Abandoning...␊, both A and B test parts)

The test runs into HardFault handler here

testName = name;

before executing the first test.

p name
$11 = {address = 0x800afbf <(anonymous namespace)::ad7280aTestName> "ad7280a_test"}
{"output":"","token":1079,"outOfBandRecord":[],"resultRecords":{"resultClass":"done","results":[]}}
p testName
$12 = {address = 0x8000f61 <Undefined_Handler> "\265\357\363\005\201\0209J\262\v\006$\324\034H\223\b\233"}

What...?

@rleh rleh added this to the 2023q1 milestone Apr 8, 2023
@rleh rleh force-pushed the release/2023q1 branch from cf64683 to cf210f3 Compare April 8, 2023 15:18
@salkinium
Copy link
Member

salkinium commented Apr 8, 2023

I can reproduce this. It's not a stack overflow, the issue seems that the this pointer is zero, which then tries to write the name pointer to an address in flash, which bus faults. Is this related to the linkerscript issue?

(gdb) b unittest::Reporter::nextTestSuite
(gdb) c
(gdb) bt
#0  unittest::Reporter::nextTestSuite (this=0x0, name=...) at modm/src/unittest/reporter.cpp:42
#1  0x080024f6 in unittest::Controller::nextTestSuite (this=this@entry=0x2000113c <unittest::Controller::instance()::controller>, name=...) a
t modm/src/unittest/controller.cpp:37
#2  0x08000798 in unittest::Controller::run (reporter=...) at unittest_runner.cpp:152
#3  0x08003c7a in main () at main.cpp:69
(gdb) p this
$1 = (unittest::Reporter * const) 0x0
(gdb)

@salkinium
Copy link
Member

WAT https://github.com/modm-io/modm/blob/develop/src/unittest/controller.cpp#L18

@chris-durand
Copy link
Member

Something is fundamentally broken here. unittest::Controller::reporter can't be null. It has been set to an address on the stack the line before:

int unittest::Controller::run(unittest::Reporter reporter)
{
    using namespace modm::accessor;
    instance().setReporter(reporter);


    instance().nextTestSuite(asFlash(ad7280aTestName));

with setReporter():

void
unittest::Controller::setReporter(Reporter& reporter)
{
	this->reporter = &reporter;
}

setReporter() takes the address of the value function argument of run which is on the stack. That can't be null.

@salkinium
Copy link
Member

Single stepping through the setReporter function does copy the address, but the instruction doesn't actually perform the write. I'm currently thinking it might be an issue with the static member variables initialization guard.
There's also the "issue" that the reporter was constructed temporarily on the stack in the Controller::run function, which is ugly (but wasn't the problem).

I have an alternative solution that just removes the controller class and uses the reporter directly. But I'm still investigating the original problem.

@rleh rleh force-pushed the release/2023q1 branch from cf210f3 to 13d7d2d Compare April 8, 2023 16:33
@salkinium
Copy link
Member

It is very suspicious that this is happening only on the Cortex-M0 device, which uses a special std::atomic for the static initialization guard.

@chris-durand
Copy link
Member

It is very suspicious that this is happening only on the Cortex-M0 device, which uses a special std::atomic for the static initialization guard.

Could you try running the test with the safe_statics option set to false?

@salkinium
Copy link
Member

Could you try running the test with the safe_statics option set to false?

Well… that works fine! So it has something to do with __cxa_guard_*!

@rleh
Copy link
Member Author

rleh commented Apr 8, 2023

Could you try running the test with the safe_statics option set to false?

Well… that works fine! So it has something to do with __cxa_guard_*!

You were a minute faster, but I can confirm the observation 😆

@chris-durand
Copy link
Member

chris-durand commented Apr 8, 2023

The implementation of the static initialization guards is undefined behaviour.

extern "C" int __cxa_guard_acquire(int *guard)
{
    std::atomic_int *atomic_guard = reinterpret_cast<std::atomic_int *>(guard);

It's not allowed to cast an int* to std::atomic_int* and then use it (if the original object it points to is of type int). The standard library has std::atomic_ref for that use case.

The correct version would be

extern "C" int __cxa_guard_acquire(int *guard)
{
    auto atomic_guard = std::atomic_ref(*guard);

I don't think it will fix the original issue but we should change it anyway.

@salkinium
Copy link
Member

The standard library has std::atomic_ref for that use case.

Oh, I fixed that. It doesn't fix the behavior sadly.

@salkinium
Copy link
Member

Oh no… The static guard initialization somehow initializes on every call… 😱

@chris-durand
Copy link
Member

chris-durand commented Apr 8, 2023

Oh no… The static guard initialization somehow initializes on every call… scream

extern "C" int __cxa_guard_acquire ( __int64_t *guard_object );

    Returns 1 if the initialization is not yet complete; 0 otherwise. This function is called before initialization takes place. If this function returns 1, either __cxa_guard_release or __cxa_guard_abort must be called with the same argument.

https://itanium-cxx-abi.github.io/cxx-abi/abi.html

EDIT: you were faster 😆

@salkinium
Copy link
Member

I think I thought that the compiler would check the variable first, and only call acquire if it was zero. Guess it was broken for years and nobody cared?

@chris-durand
Copy link
Member

I think I thought that the compiler would check the variable first, and only call acquire if it was zero. Guess it was broken for years and nobody cared?

Seems like it.

Have you verified that the change actually fixes the original issue if we keep the function local static variable? Not that we have another issue somewhere.

@salkinium
Copy link
Member

Have you verified that the change actually fixes the original issue if we keep the function local static variable? Not that we have another issue somewhere.

Yes! I quickly backported the fix to the commit before my refactoring and the constructor now is only called once and doesn't overwrite the reporter to zero anymore. We should probably still keep the refactoring.

@salkinium
Copy link
Member

Actually, let me add a unit test for this code!

Copy link
Member

@chris-durand chris-durand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@salkinium
Copy link
Member

Once the CI passes, I'll rebase and merge. Thanks for the great release notes, @rleh!

@salkinium salkinium merged commit 47bff54 into develop Apr 8, 2023
@salkinium salkinium deleted the release/2023q1 branch April 8, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants