Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with running NodeJS compiled for RISC-V #160

Open
emcifuntik opened this issue Jun 17, 2024 · 12 comments
Open

Issue with running NodeJS compiled for RISC-V #160

emcifuntik opened this issue Jun 17, 2024 · 12 comments

Comments

@emcifuntik
Copy link

Hi guys. I really love your project and it's awesome.
Currently I'm trying to run NodeJS in libriscv emulator.
Here is my fork with changed configs https://github.com/emcifuntik/unofficial-builds/tree/riscv-toolchain-update
It took me some time to compile NodeJS 20 with riscv toolchain, but now after disabling crypto, i18n and using --fully-static - I'm able to run node --version and I can see correct output, but when I try to simply get --help message it exit with exit code 0. From my investigation it can be related to pthread implementation because exit from program happens on this line https://github.com/nodejs/node/blob/af614c5795851faa5f983f64a6592e270e907ebc/deps/v8/src/libsampler/sampler.cc#L589

To reach this state I needed to implement clock_getres, capget and sigprocmask syscalls.
image
image
I'm not sure that it's right, but I simply proxy calls to my host system right now. (Not the best way for emulator) so it looks like this:

template <int W>
static void syscall_clock_getres(Machine<W>& machine)
{
    const auto clkid = machine.template sysarg<int>(0);
    const auto buffer = machine.sysarg(1);
    SYSPRINT("SYSCALL clock_getres, clkid: %x buffer: 0x%lX\n",
        clkid, (long)buffer);

    struct timespec ts;
    const int res = clock_getres(clkid, &ts);
    if (res >= 0) {
        if constexpr (W == 4) {
            int32_t ts32[2] = {(int) ts.tv_sec, (int) ts.tv_nsec};
            machine.copy_to_guest(buffer, &ts32, sizeof(ts32));
        } else {
            machine.copy_to_guest(buffer, &ts, sizeof(ts));
        }
    }
    machine.set_result_or_error(res);
}

template<int W>
static void syscall_capget(Machine<W>& machine) {
    const auto header_ptr = machine.sysarg(0);
    const auto data_ptr = machine.sysarg(1);

    SYSPRINT("SYSCALL capget, header: 0x%lX, data: 0x%lX\n", 
             (long)header_ptr, (long)data_ptr);

    struct __user_cap_header_struct {
        uint32_t version;
        int pid;
    };

    struct __user_cap_data_struct {
        uint32_t effective;
        uint32_t permitted;
        uint32_t inheritable;
    };

    __user_cap_header_struct header;
    __user_cap_data_struct data;

    // Copy the header from guest to host
    machine.copy_from_guest(&header, header_ptr, sizeof(header));

    // Initialize the header structure
    if (header.version != 0x20080522) {
        // Unsupported version, set error
        machine.set_result_or_error(-EINVAL);
        return;
    }

    // Here you would typically interact with the capability subsystem of the
    // emulated environment to get the actual capabilities.
    // For simplicity, let's assume no capabilities:
    data.effective = 0;
    data.permitted = 0;
    data.inheritable = 0;

    // Copy the data back to the guest
    machine.copy_to_guest(data_ptr, &data, sizeof(data));

    // Set result to 0 indicating success
    machine.set_result_or_error(0);
}

template <int W>
static void syscall_sigprocmask(Machine<W>& machine)
{
    const auto how = machine.template sysarg<int>(0);
    const auto set_ptr = machine.sysarg(1);
    const auto oldset_ptr = machine.sysarg(2);
    SYSPRINT("SYSCALL sigprocmask, how: %d set_ptr: 0x%lX oldset_ptr: 0x%lX\n",
        how, (long)set_ptr, (long)oldset_ptr);

    sigset_t set, oldset;

    // If set_ptr is not null, copy the signal set from the guest to the host
    if (set_ptr != 0) {
        machine.copy_from_guest(&set, set_ptr, sizeof(set));
    }

    // Perform the sigprocmask syscall
    int res = sigprocmask(how, (set_ptr != 0) ? &set : nullptr, &oldset);

    // If oldset_ptr is not null, copy the old signal set from host to guest
    if (oldset_ptr != 0) {
        machine.copy_to_guest(oldset_ptr, &oldset, sizeof(oldset));
    }

    machine.set_result_or_error(res);
}

Can you suggest maybe some methods to properly debug it or fix?

@fwsGonzo
Copy link
Owner

fwsGonzo commented Jun 17, 2024

About the existing unofficial-builds node binary:
I looked for the dynamically linked executable and I wasn't able to load it. The dynamic linker seems to not understand how to do load it (not to mention the type seems to be ET_EXEC instead of ET_DYN). So I don't know how to proceed with that.

It could be that the binary needs some kind of relocation that I'm not doing.

Regarding the new one you built statically:
Can you share the statically linked executable with me? I think I need to have a look in order to understand what's happening.
Sometimes it's a small issue, other times it's a big mystery. But I will try to see if it can be uncovered.

Just looking at the code in NodeJS it seems to be using signals to do profiling, which is unfortunate as my signal implementation is very very basic. It can remember handlers, and jump to a signal handler, but I'm not sure if it can return properly yet. So, we will see.

@emcifuntik
Copy link
Author

About the existing unofficial-builds node binary: I looked for the dynamically linked executable and I wasn't able to load it. The dynamic linker seems to not understand how to do load it (not to mention the type seems to be ET_EXEC instead of ET_DYN). So I don't know how to proceed with that.

It could be that the binary needs some kind of relocation that I'm not doing.

Regarding the new one you built statically: Can you share the statically linked executable with me? I think I need to have a look in order to understand what's happening. Sometimes it's a small issue, other times it's a big mystery. But I will try to see if it can be uncovered.

Just looking at the code in NodeJS it seems to be using signals to do profiling, which is unfortunate as my signal implementation is very very basic. It can remember handlers, and jump to a signal handler, but I'm not sure if it can return properly yet. So, we will see.

Sure, here is statically linked node
https://mega.nz/file/c0NEHAoB#7Qx7kHiaejkxmSJASo1kEG6SiXa0SY7UtPiN_vObQKM

@fwsGonzo
Copy link
Owner

I have to go to sleep now, but remote GDB has this backtrace:

#0  0x00000000011bd970 in __pthread_kill_implementation.constprop.0 ()
#1  0x000000000119fc94 in raise ()
#2  0x0000000000031eda in abort ()
#3  0x0000000000a34ef0 in uv__signal_global_reinit () at ../deps/uv/src/unix/signal.c:107
#4  uv__signal_global_init () at ../deps/uv/src/unix/signal.c:76
#5  0x00000000011bf504 in __pthread_once_slow ()
#6  0x0000000000a388de in uv_once (guard=guard@entry=0x1a19050 <uv.signal_global_init_guard>, 
    callback=callback@entry=0xa34e16 <uv__signal_global_init>) at ../deps/uv/src/unix/thread.c:433
#7  0x0000000000a359e8 in uv__signal_global_once_init () at ../deps/uv/src/unix/signal.c:112
#8  0x0000000000a32692 in uv_loop_init (loop=0x1a7dc28) at ../deps/uv/src/unix/loop.c:84
#9  0x000000000020514e in node::tracing::Agent::Agent() ()
#10 0x00000000000d370e in node::InitializeOncePerProcessInternal(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, node::ProcessInitializationFlags::Flags) ()
#11 0x00000000000d48ee in node::Start(int, char**) ()

Signal 6 is SIGABRT. So it's not just a regular exit. Maybe the emulator should have special handling on SIGABRT.

@emcifuntik
Copy link
Author

Maybe. I will dig deeper tomorrow after work too.

@fwsGonzo
Copy link
Owner

One thing I noticed was the prlimit64 system call is absolute bonkers, as written by me. It will return 0 despite resource not being 3 (rlimit stack). So, I tried checking resource, and as you can guess it's not 3. So, that could be a point of progress.

I have committed the changes you made to a PR: #161

It would have been nice to get your work in there, so feel free to remake it with your name on it, if you want.

@fwsGonzo
Copy link
Owner

fwsGonzo commented Jun 18, 2024

https://github.com/nodejs/node/blob/main/deps/uv/src/unix/signal.c#L106-L107

This is the failing spot right now

static int uv__signal_unlock(void) {
  int r;
  char data = 42;

  do {
    r = write(uv__signal_lock_pipefd[1], &data, sizeof data);
  } while (r < 0 && errno == EINTR);

  return (r < 0) ? -1 : 0;
}

@fwsGonzo
Copy link
Owner

fwsGonzo commented Jun 18, 2024

It jumps to 0x0 after initializing a 4GB mmap, which takes "forever", making it hard to investigate what's happening.

EDIT: This turned out to be the 32-bit fast-path thing that avoids bounds-checking for programs fitting in 4GB or less.

However, NodeJS seems to at least require mapping 12GB virtual. There's two ways of configuring the emulator:

  1. With mmap-backed memory, where we don't really control virtual memory anymore. This is the fastest, and works just fine. But if NodeJS asks for 12GB we must be willing to provide that R/W.
  2. With just paging. It's only slow when a certain sequence of mmap/mprotect is called that turns lazily allocated zero pages into real pages with data. Unfortunately, NodeJS hits this slow-path badly, and it never completes in human time.

So we are left with allowing NodeJS 12GB, which is OK. It now runs all the way to a place where the emulator really sucks: futex emulation.

>>> futex(0x1A5D478, op=393, val=0)
FUTEX: Waiting (blocked)... uaddr=0x1A5D478 val=0
Returning to tid=1 tls=0x102D37B00 stack=0x102D316E0
SYSCALL clock_gettime, clkid: 1 buffer: 0x102D317D0
Returning to tid=0 tls=0x1A357E0 stack=0x2B345E0
SYSCALL epoll_pwait yielded...
SYSCALL writev, iov: 0x2B344F0  cnt: 1
The futex facility returned an unexpected error code.
SYSCALL writev, vfd: 2 real_fd: 2 -> 54
>>> mmap(addr 0x0, len 4096, prot 0x3, flags 0X22, vfd=-1 voff=0)
<<< mmap(addr 0x0, len 4096, ...) = 0x10353C000
SYSCALL sigprocmask, how: 1 set_ptr: 0x2B343B0 oldset_ptr: 0x0
>>> gettid() = 0
SYSCALL stubbed (zero): 172
>>> tgkill on tid=0 signal=6
Clearing thread value for tid=0 at 0x1A350D0
>>> Program exited, exit code = 0 (0x0)

But at least it has created 4 worker threads now. So it's nearing something usable.

@fwsGonzo
Copy link
Owner

I've gotten v8 with JIT and WebAssembly to work in the emulator now. I guess I can resume trying the NodeJS CLI and see if I can get further. See PR: #161. I will rewrite it soon.

@fwsGonzo
Copy link
Owner

fwsGonzo commented Jul 10, 2024

#2  0x0000000000031eda in abort ()
#3  0x0000000000a38808 in uv_rwlock_rdlock (rwlock=<optimized out>) at ../deps/uv/src/unix/thread.c:382
#4  0x00000000000f0bba in node::builtins::BuiltinLoader::LoadBuiltinSource(v8::Isolate*, char const*) const ()

It's gotten further now. Failing on a pthread_mutex_trylock:
https://github.com/libuv/libuv/blob/v1.x/src/unix/thread.c#L352-L355

  err = pthread_mutex_trylock(mutex);
  if (err) {
    if (err != EBUSY && err != EAGAIN)
      abort();

The return value is 35 (EDEADLK), which is mysterious.

[EDEADLK]
    A deadlock condition was detected or the current thread already owns the read-write lock for writing or reading. 

Now less mysterious. So it's a special error for rwlocks where one is trying to re-lock the same lock. I think that could indicate that threads are woken up too early and there is no spurious wakeup handling.

      /* Make sure we are not holding the rwlock as a writer.  This is
	 a deadlock situation we recognize and report.  */
      if (__builtin_expect (rwlock->__data.__writer
			    == THREAD_GETMEM (THREAD_SELF, tid), 0))
	{
	  result = EDEADLK;
	  break;
	}

This should be the source of the errno.
image
image
So, apparently it has already locked it (tid=0)

@fwsGonzo
Copy link
Owner

fwsGonzo commented Jul 11, 2024

After much ado: https://gist.github.com/fwsGonzo/620d36067afc0e3b74028861a086d3bd

I had a feeling that it's only the rwlock issue left, so I made a hack to see.

const auto rw_rdlock = machine.address_of("pthread_rwlock_rdlock");
const auto rw_wrlock = machine.address_of("pthread_rwlock_wrlock");
const auto rw_unlock = machine.address_of("pthread_rwlock_unlock");

machine.set_max_instructions(cli_args.fuel);
do {
	// Whenever we hit rwlock, step twice and check if it's a deadlock
	if (machine.cpu.pc() == rw_rdlock || machine.cpu.pc() == rw_wrlock) {
		machine.cpu.step_one();
		machine.cpu.step_one();
		if(machine.cpu.reg(14) == machine.cpu.reg(15)) {
			// Deadlock detected, avoid branch (beq a4, a5) and reset the lock
			machine.cpu.reg(14) = 0xFF;
			machine.memory.template write<uint32_t>(machine.cpu.reg(10), 0);
		}
	}
	machine.cpu.step_one();
} while (!machine.stopped());

I'm avoiding the deadlock and just unlocking it. After that I got the --help.

$ ./rvlinux -P ../binaries/node -- $PWD/test.js
Unhandled system call: 114
Unhandled system call: 24
Unhandled system call: 24
Node.js says fibonacci(10) = 55
>>> Program exited, exit code = 0 (0x0)
Runtime: 1983.044ms   (Use --accurate for instruction counting)
Pages in use: 9317 (37268 kB virtual memory, total 56127 kB)

@fwsGonzo
Copy link
Owner

fwsGonzo commented Jul 11, 2024

There is a PR here with the work required to get the basic command-line running: #202
I'm sure not all features will be working, but it's enough to run simple Node.js programs.

The program opens correctly every time, and closes correctly too. The only thing missing is to figure out where a whole 2 seconds is spent. With or without binary translation it takes exactly ~2 seconds to do anything at all. Where is that time spent?

@fwsGonzo
Copy link
Owner

One more step forward with #203, now allowing the main decoder to execute normally while only JIT-segments are executed instruction by instruction. With --jitless a simple fibonacci program executes in 900ms.

$ ./rvlinux -P ../binaries/node -- --jitless test.js
Warning: disabling flag --expose_wasm due to conflicting flags
Node.js says fibonacci(10) = 55
>>> Program exited, exit code = 0 (0x0)
Runtime: 893.151ms   (Use --accurate for instruction counting)
Pages in use: 9317 (37268 kB virtual memory, total 56127 kB)

$ ./rvlinux -aP ../binaries/node -- --jitless test.js
Warning: disabling flag --expose_wasm due to conflicting flags
Node.js says fibonacci(10) = 55
>>> Program exited, exit code = 0 (0x0)
Instructions executed: 366417440  Runtime: 909.675ms  Insn/s: 403mi/s
Pages in use: 9317 (37268 kB virtual memory, total 56127 kB)

There are many paths forward, but I doubt it's worth pursuing faster execution of JIT-segments. What would be nice is to fix whatever issue is causing me to have to use a workaround for rwlocks. And maybe there are some low-hanging fruit still to speed up the processing. Another avenue is figuring out why NodeJS HTTP server is stalling. Looks to be in connect().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants