Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SurfaceTool's AppendFrom is extremely slow on separate thread. #51311

Closed
Delpire opened this issue Aug 6, 2021 · 7 comments
Closed

SurfaceTool's AppendFrom is extremely slow on separate thread. #51311

Delpire opened this issue Aug 6, 2021 · 7 comments

Comments

@Delpire
Copy link

Delpire commented Aug 6, 2021

Godot version

3.3.stable.mono.official

System information

Windows 10, RTX 2080, GLES3

Issue description

There may be a good reason for this that I am unaware of, however it seems like the SurfaceTool's AppendFrom method is unusually slow if you execute it from a separate thread.

For my minimal reproduction project, I have a fairly small mesh, 507 verts, and I am looping 125 times to create 125 instances inside of my surface tool. I have written the code such that the only thing that should be running on a separate thread is

  1. Creating the SurfaceTool.
  2. Calling Begin with Mesh.PrimitiveType.Triangles.
  3. Calling AppendFrom

Anything else, like generating normals, committing, setting the material, etc., should all be done back on the main thread.

If I execute the code synchronously (on the main thread), it takes ~1100ms.
If I execute the code asynchronously it takes ~11,000ms.

This is 10x slower.

Steps to reproduce

  1. Add some mesh to your project and create a scene with the mesh.
  2. Get the Mesh from the MeshInstance.
  3. Create the surface tool and call AppendFrom.
public SurfaceTool Generate()
{
    SurfaceTool surfaceTool = new SurfaceTool();
    surfaceTool.Begin(Mesh.PrimitiveType.Triangles);

    for (int x = 0; x < _length; x++)
    {
        for (int y = 0; y < _height; y++)
        {
            for (int z = 0; z < _width; z++)
            {
                surfaceTool.AppendFrom(_monkeyMesh, 0, Transform.Identity.Translated(new Vector3(x * 3, y * 3, z * 3)));
            }
        }
    }

    return surfaceTool;
}
  1. Call this asynchronously.
Task<SurfaceTool> task = Task.Run(Generate);

Minimal reproduction project

SurfaceToolAppendFromRepo.zip

@Calinou
Copy link
Member

Calinou commented Aug 6, 2021

@Delpire Can you reproduce this in a GDScript project to rule out C#/Mono being the issue here?

@Delpire
Copy link
Author

Delpire commented Aug 6, 2021

@Calinou - Sorry it took me a little bit to recreate it in gdscript as I am not very familiar with gdscript. I am sure the code can be a lot better. I went ahead and downloaded Godot 3.3.2 (the regular build, no mono support).

SurfaceToolAppendFromGDScript.zip

I also realized my benchmark isn't exactly correct. I was timing both the AppendFrom as well as the rest of the surface tool stuff like generating normals, committing, etc. If I just benchmark the AppendFrom the performance is even worse.

~140ms synchronously
~10500ms synchronously
So AppendFrom is like ~75x slower.

@clayjohn
Copy link
Member

clayjohn commented Aug 6, 2021

The AppendFrom function internally calls VisualServer.mesh_surface_get_arrays() which retrieves the mesh data from the GPU instance of the mesh. E.g. it makes an OpenGL API call to read back data from the GPU.

OpenGL has to run from the main thread only. Not only that, but when you request data back from the GPU, the main thread has to stall while it waits for the data to be returned. That is likely what is happening in your case. You are asking for GPU data 125 times and the main thread stalls while the GPU transfers 125 copies of your mesh to the CPU. (See https://docs.godotengine.org/en/stable/tutorials/optimization/using_servers.html#getting-data-from-the-servers)

In your case, you will be much better off by requesting a single copy of the mesh, converting it to an ArrayMesh once and then adding that ArrayMesh's arrays manually. You can even just copy the logic from the AppendFrom function (with the key difference being that you dont request a copy of the mesh each time you call the function)

void SurfaceTool::append_from(const Ref<Mesh> &p_existing, int p_surface, const Transform &p_xform) {
ERR_FAIL_COND_MSG(p_existing.is_null(), "First argument in SurfaceTool::append_from() must be a valid object of type Mesh");
if (vertex_array.size() == 0) {
primitive = p_existing->surface_get_primitive_type(p_surface);
format = 0;
}
int nformat;
List<Vertex> nvertices;
List<int> nindices;
_create_list(p_existing, p_surface, &nvertices, &nindices, nformat);
format |= nformat;
int vfrom = vertex_array.size();
for (List<Vertex>::Element *E = nvertices.front(); E; E = E->next()) {
Vertex v = E->get();
v.vertex = p_xform.xform(v.vertex);
if (nformat & VS::ARRAY_FORMAT_NORMAL) {
v.normal = p_xform.basis.xform(v.normal);
}
if (nformat & VS::ARRAY_FORMAT_TANGENT) {
v.tangent = p_xform.basis.xform(v.tangent);
v.binormal = p_xform.basis.xform(v.binormal);
}
vertex_array.push_back(v);
}
for (List<int>::Element *E = nindices.front(); E; E = E->next()) {
int dst_index = E->get() + vfrom;
index_array.push_back(dst_index);
}
if (index_array.size() % 3) {
WARN_PRINT("SurfaceTool: Index array not a multiple of 3.");
}
}

The fact that append_from requests data from the GPU and will stall the main thread needs to be documented.

@clayjohn clayjohn added documentation and removed bug labels Aug 6, 2021
@Delpire
Copy link
Author

Delpire commented Aug 6, 2021

@clayjohn - Thank you so much for the explanation. I wanted to use the ArrayMesh originally, but didn't want to have to apply the transforms myself. I should be able to do this myself, although it would be nice if the SurfaceTool provided a way of doing this since its possible there could be some performance gain doing this in the C++ layer. If anyone feels strongly about it, I could write a proposal in the proposal repo.

Regardless, I think this sufficiently resolves my issue, and I think it would be valuable to document APIs that are good candidates for multithreading and APIs that are not.

Thank you!

@clayjohn
Copy link
Member

clayjohn commented Aug 6, 2021

@Delpire I agree, if there is enough interest a append_from_arrays function could be really beneficial. Having regular API calls silently stall the main thread to await data from the GPU is not good.

@akien-mga
Copy link
Member

Closed by #54982.

@Scony
Copy link
Contributor

Scony commented Jan 29, 2022

The AppendFrom function internally calls VisualServer.mesh_surface_get_arrays() which retrieves the mesh data from the GPU instance of the mesh. E.g. it makes an OpenGL API call to read back data from the GPU.

OpenGL has to run from the main thread only. Not only that, but when you request data back from the GPU, the main thread has to stall while it waits for the data to be returned. That is likely what is happening in your case. You are asking for GPU data 125 times and the main thread stalls while the GPU transfers 125 copies of your mesh to the CPU. (See https://docs.godotengine.org/en/stable/tutorials/optimization/using_servers.html#getting-data-from-the-servers)

In your case, you will be much better off by requesting a single copy of the mesh, converting it to an ArrayMesh once and then adding that ArrayMesh's arrays manually. You can even just copy the logic from the AppendFrom function (with the key difference being that you dont request a copy of the mesh each time you call the function)

void SurfaceTool::append_from(const Ref<Mesh> &p_existing, int p_surface, const Transform &p_xform) {
ERR_FAIL_COND_MSG(p_existing.is_null(), "First argument in SurfaceTool::append_from() must be a valid object of type Mesh");
if (vertex_array.size() == 0) {
primitive = p_existing->surface_get_primitive_type(p_surface);
format = 0;
}
int nformat;
List<Vertex> nvertices;
List<int> nindices;
_create_list(p_existing, p_surface, &nvertices, &nindices, nformat);
format |= nformat;
int vfrom = vertex_array.size();
for (List<Vertex>::Element *E = nvertices.front(); E; E = E->next()) {
Vertex v = E->get();
v.vertex = p_xform.xform(v.vertex);
if (nformat & VS::ARRAY_FORMAT_NORMAL) {
v.normal = p_xform.basis.xform(v.normal);
}
if (nformat & VS::ARRAY_FORMAT_TANGENT) {
v.tangent = p_xform.basis.xform(v.tangent);
v.binormal = p_xform.basis.xform(v.binormal);
}
vertex_array.push_back(v);
}
for (List<int>::Element *E = nindices.front(); E; E = E->next()) {
int dst_index = E->get() + vfrom;
index_array.push_back(dst_index);
}
if (index_array.size() % 3) {
WARN_PRINT("SurfaceTool: Index array not a multiple of 3.");
}
}

The fact that append_from requests data from the GPU and will stall the main thread needs to be documented.

just to supplement @clayjohn comment - since OpenGL has to run from the main thread, VisualServer does a lot behind synchronization mechanisms. As a matter of fact, VisualServer locks semaphores heavily so if an extra thread needs to operate on VisualServer - it costs a lot.

See 3 random stack traces from Thread 7 (thread running async SurfaceTool processing from MRP here):

Thread 7 (Thread 0x7fe1b89fd640 (LWP 42374) "generate"):
#0  0x00007fe1e428d8ca in __futex_abstimed_wait_common64 () from /usr/lib/libpthread.so.0
#1  0x00007fe1e4287270 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#2  0x000000000355d671 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#3  0x0000000002d6ecbb in Semaphore::wait (this=0x69b63e8) at ./core/os/semaphore.h:58
#4  CommandQueueMT::push_and_ret<VisualServer, int (VisualServer::*)(RID, int) const, RID, int, int> (this=this@entry=0x69b63d0, p_instance=0x6817bb0, p_method=<optimized out>, p1=..., p1@entry=..., p2=p2@entry=0, r_ret=0x7fe1b89fc054) at ./core/command_queue_mt.h:469
#5  0x0000000002d6ee6a in VisualServerWrapMT::mesh_surface_get_array_len (this=0x69b62b0, p1=..., p2=0) at servers/visual/visual_server_wrap_mt.h:167
#6  0x0000000002bb7ce8 in VisualServer::mesh_surface_get_arrays (this=0x69b62b0, p_mesh=..., p_surface=0) at servers/visual_server.cpp:1793
#7  0x000000000285f82b in ArrayMesh::surface_get_arrays (this=0x16b7c480, p_surface=0) at scene/resources/mesh.cpp:825
#8  0x0000000002969687 in SurfaceTool::_create_list (this=0x7fe1b4000c00, p_existing=..., p_surface=<optimized out>, r_vertex=0x7fe1b89fc1d0, r_index=0x7fe1b89fc1d8, lformat=@0x7fe1b89fc1cc: 279) at scene/resources/surface_tool.cpp:506
#9  0x0000000002969813 in SurfaceTool::append_from (this=0x7fe1b4000c00, p_existing=..., p_surface=0, p_xform=...) at scene/resources/surface_tool.cpp:786
#10 0x0000000002979491 in MethodBind3<Ref<Mesh> const&, int, Transform const&>::call (this=0x9282270, p_object=<optimized out>, p_args=0x7fe1b89fc2c8, p_arg_count=-1197489440, r_error=...) at ./core/method_bind.gen.inc:2303
#11 0x00000000030e534a in Object::call (this=0x7fe1b4000c00, p_method=..., p_args=0x7fe1b89fc6b8, p_argcount=3, r_error=...) at core/object.cpp:918
#12 0x00000000031891d1 in Variant::call_ptr (this=this@entry=0x7fe1b89fc538, p_method=..., p_args=p_args@entry=0x7fe1b89fc6b8, p_argcount=p_argcount@entry=3, r_ret=0x0, r_error=...) at core/variant_call.cpp:1185
#13 0x00000000008e1fcb in GDScriptFunction::call (this=0x16b3c510, p_instance=0x16b17510, p_args=<optimized out>, p_argcount=<optimized out>, r_err=..., p_state=<optimized out>) at modules/gdscript/gdscript_function.cpp:1046
#14 0x0000000000878138 in GDScriptInstance::call (this=<optimized out>, p_method=..., p_args=<optimized out>, p_argcount=<optimized out>, r_error=...) at modules/gdscript/gdscript.cpp:1184
#15 0x00000000030e54aa in Object::call (this=0x8dfa630, p_method=..., p_args=0x7fe1b89fcd10, p_argcount=1, r_error=...) at core/object.cpp:899
#16 0x0000000003398684 in _Thread::_start_func (ud=<optimized out>) at core/bind/core_bind.cpp:2685
#17 0x000000000323d655 in Thread::callback (p_self=<optimized out>, p_settings=..., p_callback=0x3398560 <_Thread::_start_func(void*)>, p_userdata=0x17399ed0) at core/os/thread.cpp:77
#18 0x00000000035d2c04 in execute_native_thread_routine ()
#19 0x00007fe1e4281259 in start_thread () from /usr/lib/libpthread.so.0
#20 0x00007fe1e405f5e3 in clone () from /usr/lib/libc.so.6


Thread 7 (Thread 0x7fe1b89fd640 (LWP 42374) "generate"):
#0  0x00007fe1e428d8ca in __futex_abstimed_wait_common64 () from /usr/lib/libpthread.so.0
#1  0x00007fe1e4287270 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#2  0x000000000355d671 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#3  0x0000000002d6ecbb in Semaphore::wait (this=0x69b63e8) at ./core/os/semaphore.h:58
#4  CommandQueueMT::push_and_ret<VisualServer, int (VisualServer::*)(RID, int) const, RID, int, int> (this=this@entry=0x69b63d0, p_instance=0x6817bb0, p_method=<optimized out>, p1=..., p1@entry=..., p2=p2@entry=0, r_ret=0x7fe1b89fc054) at ./core/command_queue_mt.h:469
#5  0x0000000002d6edba in VisualServerWrapMT::mesh_surface_get_array_index_len (this=0x69b62b0, p1=..., p2=0) at servers/visual/visual_server_wrap_mt.h:168
#6  0x0000000002bb7d1b in VisualServer::mesh_surface_get_arrays (this=0x69b62b0, p_mesh=..., p_surface=0) at servers/visual_server.cpp:1796
#7  0x000000000285f82b in ArrayMesh::surface_get_arrays (this=0x16b7c480, p_surface=0) at scene/resources/mesh.cpp:825
#8  0x0000000002969687 in SurfaceTool::_create_list (this=0x7fe1b4000c00, p_existing=..., p_surface=<optimized out>, r_vertex=0x7fe1b89fc1d0, r_index=0x7fe1b89fc1d8, lformat=@0x7fe1b89fc1cc: 279) at scene/resources/surface_tool.cpp:506
#9  0x0000000002969813 in SurfaceTool::append_from (this=0x7fe1b4000c00, p_existing=..., p_surface=0, p_xform=...) at scene/resources/surface_tool.cpp:786
#10 0x0000000002979491 in MethodBind3<Ref<Mesh> const&, int, Transform const&>::call (this=0x9282270, p_object=<optimized out>, p_args=0x7fe1b89fc2c8, p_arg_count=-1197489440, r_error=...) at ./core/method_bind.gen.inc:2303
#11 0x00000000030e534a in Object::call (this=0x7fe1b4000c00, p_method=..., p_args=0x7fe1b89fc6b8, p_argcount=3, r_error=...) at core/object.cpp:918
#12 0x00000000031891d1 in Variant::call_ptr (this=this@entry=0x7fe1b89fc538, p_method=..., p_args=p_args@entry=0x7fe1b89fc6b8, p_argcount=p_argcount@entry=3, r_ret=0x0, r_error=...) at core/variant_call.cpp:1185
#13 0x00000000008e1fcb in GDScriptFunction::call (this=0x16b3c510, p_instance=0x16b17510, p_args=<optimized out>, p_argcount=<optimized out>, r_err=..., p_state=<optimized out>) at modules/gdscript/gdscript_function.cpp:1046
#14 0x0000000000878138 in GDScriptInstance::call (this=<optimized out>, p_method=..., p_args=<optimized out>, p_argcount=<optimized out>, r_error=...) at modules/gdscript/gdscript.cpp:1184
#15 0x00000000030e54aa in Object::call (this=0x8dfa630, p_method=..., p_args=0x7fe1b89fcd10, p_argcount=1, r_error=...) at core/object.cpp:899
#16 0x0000000003398684 in _Thread::_start_func (ud=<optimized out>) at core/bind/core_bind.cpp:2685
#17 0x000000000323d655 in Thread::callback (p_self=<optimized out>, p_settings=..., p_callback=0x3398560 <_Thread::_start_func(void*)>, p_userdata=0x17399ed0) at core/os/thread.cpp:77
#18 0x00000000035d2c04 in execute_native_thread_routine ()
#19 0x00007fe1e4281259 in start_thread () from /usr/lib/libpthread.so.0
#20 0x00007fe1e405f5e3 in clone () from /usr/lib/libc.so.6


Thread 7 (Thread 0x7fe1b89fd640 (LWP 42374) "generate"):
#0  0x00007fe1e428d8ca in __futex_abstimed_wait_common64 () from /usr/lib/libpthread.so.0
#1  0x00007fe1e4287270 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#2  0x000000000355d671 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#3  0x0000000002d6ecbb in Semaphore::wait (this=0x69b63e8) at ./core/os/semaphore.h:58
#4  CommandQueueMT::push_and_ret<VisualServer, int (VisualServer::*)(RID, int) const, RID, int, int> (this=this@entry=0x69b63d0, p_instance=0x6817bb0, p_method=<optimized out>, p1=..., p1@entry=..., p2=p2@entry=0, r_ret=0x7fe1b89fc054) at ./core/command_queue_mt.h:469
#5  0x0000000002d6ee6a in VisualServerWrapMT::mesh_surface_get_array_len (this=0x69b62b0, p1=..., p2=0) at servers/visual/visual_server_wrap_mt.h:167
#6  0x0000000002bb7ce8 in VisualServer::mesh_surface_get_arrays (this=0x69b62b0, p_mesh=..., p_surface=0) at servers/visual_server.cpp:1793
#7  0x000000000285f82b in ArrayMesh::surface_get_arrays (this=0x16b7c480, p_surface=0) at scene/resources/mesh.cpp:825
#8  0x0000000002969687 in SurfaceTool::_create_list (this=0x7fe1b4000c00, p_existing=..., p_surface=<optimized out>, r_vertex=0x7fe1b89fc1d0, r_index=0x7fe1b89fc1d8, lformat=@0x7fe1b89fc1cc: 279) at scene/resources/surface_tool.cpp:506
#9  0x0000000002969813 in SurfaceTool::append_from (this=0x7fe1b4000c00, p_existing=..., p_surface=0, p_xform=...) at scene/resources/surface_tool.cpp:786
#10 0x0000000002979491 in MethodBind3<Ref<Mesh> const&, int, Transform const&>::call (this=0x9282270, p_object=<optimized out>, p_args=0x7fe1b89fc2c8, p_arg_count=-1197489440, r_error=...) at ./core/method_bind.gen.inc:2303
#11 0x00000000030e534a in Object::call (this=0x7fe1b4000c00, p_method=..., p_args=0x7fe1b89fc6b8, p_argcount=3, r_error=...) at core/object.cpp:918
#12 0x00000000031891d1 in Variant::call_ptr (this=this@entry=0x7fe1b89fc538, p_method=..., p_args=p_args@entry=0x7fe1b89fc6b8, p_argcount=p_argcount@entry=3, r_ret=0x0, r_error=...) at core/variant_call.cpp:1185
#13 0x00000000008e1fcb in GDScriptFunction::call (this=0x16b3c510, p_instance=0x16b17510, p_args=<optimized out>, p_argcount=<optimized out>, r_err=..., p_state=<optimized out>) at modules/gdscript/gdscript_function.cpp:1046
#14 0x0000000000878138 in GDScriptInstance::call (this=<optimized out>, p_method=..., p_args=<optimized out>, p_argcount=<optimized out>, r_error=...) at modules/gdscript/gdscript.cpp:1184
#15 0x00000000030e54aa in Object::call (this=0x8dfa630, p_method=..., p_args=0x7fe1b89fcd10, p_argcount=1, r_error=...) at core/object.cpp:899
#16 0x0000000003398684 in _Thread::_start_func (ud=<optimized out>) at core/bind/core_bind.cpp:2685
#17 0x000000000323d655 in Thread::callback (p_self=<optimized out>, p_settings=..., p_callback=0x3398560 <_Thread::_start_func(void*)>, p_userdata=0x17399ed0) at core/os/thread.cpp:77
#18 0x00000000035d2c04 in execute_native_thread_routine ()
#19 0x00007fe1e4281259 in start_thread () from /usr/lib/libpthread.so.0
#20 0x00007fe1e405f5e3 in clone () from /usr/lib/libc.so.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants