Unable to build UDFs #5761

Closed
jjsimps opened this issue Nov 13, 2023 · 8 comments
Labels
affects/none PR/issue: this bug affects none version. process/fixed Process of bug severity/none Severity of bug type/bug Type: something is unexpected

Comments

@jjsimps

jjsimps commented Nov 13, 2023

Describe the bug

When I build the UDFs for centos7 (using the nebula-dev image), the build succeeds. However, when I try to use the UDF in nebula, I get the error:

/lib64/libstdc++.so.6: version 'CXXABI_1.3.9' not found

It seems the UDF is built against CXXABI_1.3.9, but the libstdc++ in the centos7 image for graphd does not provide that version:

[root@nebula-graphd-2 nebula]#  nm -D /lib64/libstdc++.so.6 | grep CXXABI
0000000000000000 A CXXABI_1.3
0000000000000000 A CXXABI_1.3.1
0000000000000000 A CXXABI_1.3.2
0000000000000000 A CXXABI_1.3.3
0000000000000000 A CXXABI_1.3.4
0000000000000000 A CXXABI_1.3.5
0000000000000000 A CXXABI_1.3.6
0000000000000000 A CXXABI_1.3.7
0000000000000000 A CXXABI_TM_1

I tried statically linking libstdc++, but that just causes memory leaks in graphd (see this issue).

So if I can't statically link it, then I must match the ABI version. But this doesn't seem possible without downgrading the build tools.

Your Environments (required)

  • Nebula graph 3.6.0 (and the pre-built images)
  • Running on GKE 1.26
  • Using nebula operator

How To Reproduce(required)

Steps to reproduce the behavior:

  1. Start up the nebula-dev image
  2. Build with make UDF=standard_deviation
  3. Load the UDF into nebula 3.6 using the pre-built images
  4. See the error in the logs

Expected behavior

No UDF error, no build error, no memory leaks.

I also noticed that it calls dlopen() every time the function is called, which seems like it would cause a huge performance overhead. Why not cache the function pointer?

@jjsimps jjsimps added the type/bug Type: something is unexpected label Nov 13, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Nov 13, 2023
@wey-gu
Contributor

wey-gu commented Nov 15, 2023


  • I also noticed that It calls dlopen() every time the function is called, which seems like it would cause a huge performance overhead. Why not cache the function pointer?

    @yixinglu could we avoid calling dlopen() on every UDF call after the first one by adding some cache mechanism?

@yixinglu
Contributor

@jjsimps Have you tried pulling the vesoft/nebula-dev:centos7 image again? Locally I get the following output from your above command:

$ docker run --rm -ti vesoft/nebula-dev:centos7 bash
[root@d0f41c1786bc ~]# nm -D /lib64/libstdc++.so.6 | grep CXXABI
0000000000000000 A CXXABI_1.3
0000000000000000 A CXXABI_1.3.1
0000000000000000 A CXXABI_1.3.10
0000000000000000 A CXXABI_1.3.11
0000000000000000 A CXXABI_1.3.12
0000000000000000 A CXXABI_1.3.2
0000000000000000 A CXXABI_1.3.3
0000000000000000 A CXXABI_1.3.4
0000000000000000 A CXXABI_1.3.5
0000000000000000 A CXXABI_1.3.6
0000000000000000 A CXXABI_1.3.7
0000000000000000 A CXXABI_1.3.8
0000000000000000 A CXXABI_1.3.9
0000000000000000 A CXXABI_FLOAT128
0000000000000000 A CXXABI_TM_1
[root@d0f41c1786bc ~]#

@wey-gu
Contributor

wey-gu commented Nov 16, 2023

I guess @jjsimps plans to run graphd in production with the UDF folder attached, and is therefore using our graphd Docker image rather than the dev one.

@yixinglu
Contributor

I checked the implementation, and it does not call dlopen() every time. All UDFs are loaded only once at the beginning and then saved in the function manager, so the problem mentioned above should not exist.

@dutor
Contributor

dutor commented Nov 16, 2023

This is exactly how UDFs in C++ work: the ABI of the UDF and the ABI of the host program must be compatible. The ABI can break between different GCC versions, especially between versions before and after GCC 5.

There are indeed techniques to address such issues, such as symbol versioning, but for the old pre-C++11 ABI they are too tedious, difficult and unreliable to maintain. So we will not support such cases.

As for compatibility across the newer versions, a serious UDF implementation probably should try its best to handle this, but it is still tough work. Keeping the build environments exactly the same is the easy and reliable way.
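One way to make "exactly the same build environment" checkable is to compile a small fingerprint into both the UDF and the host and compare the two strings. The sketch below is only an illustration under assumptions: `toolchainFingerprint` is a hypothetical helper, not part of Nebula, and it relies on the compiler and libstdc++ macros `__GNUC__`, `__GLIBCXX__` and `_GLIBCXX_USE_CXX11_ABI`. It reports how a translation unit was compiled; it does not tell you which CXXABI_* versions the libstdc++.so.6 found at runtime provides.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not a Nebula API): describes the toolchain that
// compiled this translation unit, so two builds can be compared.
std::string toolchainFingerprint() {
    std::ostringstream out;
#if defined(__clang__)
    out << "clang " << __clang_major__ << "." << __clang_minor__;
#elif defined(__GNUC__)
    out << "gcc " << __GNUC__ << "." << __GNUC_MINOR__;
#else
    out << "unknown-compiler";
#endif
#if defined(__GLIBCXX__)
    // Release date stamp of the libstdc++ headers used at compile time.
    out << " libstdc++=" << __GLIBCXX__;
#endif
#if defined(_GLIBCXX_USE_CXX11_ABI)
    // 1 means the post-GCC-5 std::string/std::list layout, 0 the old one.
    out << " cxx11_abi=" << _GLIBCXX_USE_CXX11_ABI;
#endif
    return out.str();
}
```

If the string produced inside the UDF differs from the one produced by the host build, the two were not built in the same environment, and an ABI mismatch like the one in this issue becomes more likely.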

@jjsimps
Author

jjsimps commented Nov 16, 2023

Thanks for the input everyone!

@wey-gu is exactly correct: the UDFs are built using the nebula-dev image, which has the build tools and CXXABI_1.3.9. But the prod image is the regular centos7 image, which does not have it.

I also tried your suggestion of building using ubuntu2004 image and that seems to work. But that does require me to build my own graphd image using ubuntu2004 (not the worst, but not ideal of course). I expect the same would be fine if I build the graphd image using the nebula-dev:centos7 as the base image instead of just centos7.

As for calling dlopen, please see here. As you can see, the lambda, when called, will call dlopen(), create the UDF, call it, destroy it, and then call dlclose().

I do realize that this feature was kinda put in as a stealth feature (not fully documented, but referenced in the release notes). I was looking forward to it because it lets me run somewhat more complex logic that would otherwise either manifest as a very complex query (with branches, bitwise operators, etc.) or have to be moved back to the application level.

What would you say is the current outlook/roadmap for serious UDF support? Is there anywhere I can find this information? In the meantime, I think the current implementation is 85% of the way there for most of my use cases. I would like to know the proper solution for dealing with the ABI (hopefully by changing the prod base image to something that has it), and I would also like a fix for the performance issue when calling the UDF (I noticed query times jump from tens of ms to hundreds of ms when calling a custom UDF, which seems absurd).

Thank you :)

@dutor
Contributor

dutor commented Nov 16, 2023

@jjsimps

The basic idea is that you build the UDF itself with the same toolchain version as graphd. Besides, since graphd links libstdc++ statically (which is not appropriate for the UDF scenario) but the UDF's dependency on libstdc++ cannot be eliminated, you have to ship the UDF library together with libstdc++.so (maybe with an RPATH set so the loader can locate it).

Personally, I have never reviewed this UDF framework implementation, but it smells bad to me after a quick look. It does invoke dlopen on each user function call. Maybe you could try refactoring the code to do the symbol lookup in an init function and release the function reference in a separate destroy function.
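A minimal sketch of that init/destroy idea, assuming a POSIX system with `dlopen`/`dlsym`/`dlclose`: the library is opened and the symbol resolved once in the constructor, the cached pointer is reused for every call, and the handle is released in the destructor. `LoadedSymbol` and its `as<Fn>()` accessor are hypothetical names for illustration; this is not the actual Nebula UDF interface.

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Hypothetical wrapper (not Nebula code): resolves one symbol from one
// shared library at construction time and keeps it until destruction.
class LoadedSymbol {
public:
    LoadedSymbol(const std::string& path, const std::string& symbol) {
        handle_ = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
        if (handle_ == nullptr) {
            const char* err = dlerror();
            throw std::runtime_error(err ? err : "dlopen failed");
        }
        dlerror();  // clear any stale error before calling dlsym
        addr_ = dlsym(handle_, symbol.c_str());
        if (const char* err = dlerror()) {
            std::string msg = err;  // copy before the next dl* call
            dlclose(handle_);
            throw std::runtime_error(msg);
        }
    }

    ~LoadedSymbol() { dlclose(handle_); }

    // The handle is owned exclusively, so copying is disabled.
    LoadedSymbol(const LoadedSymbol&) = delete;
    LoadedSymbol& operator=(const LoadedSymbol&) = delete;

    // Returns the cached address as a function pointer of type Fn.
    // No dlopen()/dlsym() happens here, so it is cheap to call per query.
    template <typename Fn>
    Fn as() const {
        return reinterpret_cast<Fn>(addr_);
    }

private:
    void* handle_ = nullptr;
    void* addr_ = nullptr;
};
```

For example, `LoadedSymbol cosSym("libm.so.6", "cos");` followed by `cosSym.as<double (*)(double)>()(0.0)` resolves `cos` once and can then be called repeatedly without touching the dynamic loader again. On older glibc versions the program may need to be linked with `-ldl`.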

As for the serious one, we don't have a plan for it so far. We have actually been working on a huge refactoring that includes user-defined functions and procedures, but the release roadmap has not been determined yet.

@jjsimps
Author

jjsimps commented Nov 17, 2023

Awesome, thank you.

I tried the RPATH method plus shipping libstdc++.so.6, and that seems to work.

I'll see about refactoring some of the code to bring it up to a better level.


5 participants