How much memory can GDR pin? Is there a difference between Tesla and Quadro? #301

Open
Notherthing opened this issue Jul 27, 2024 · 10 comments

@Notherthing

Notherthing commented Jul 27, 2024

When I use a V100, I can gdr_pin nearly all of the device memory (about 32GB).
But when I use an A4000, I can only pin about 220MB (the device has about 16GB of memory).
Is there a difference between Tesla and Quadro?

@Notherthing
Author

Both devices use the same driver: NVIDIA-SMI 530.30.02, Driver Version: 530.30.02, CUDA Version: 12.1.

@Notherthing
Author

I have tried disabling the CPU PA 46-bit limitation in the BIOS, but I still cannot pin more than about 220MB of GPU memory.

@Notherthing
Author

Here is the error log. Does errno 12 mean out of memory? Is there any information about this?

GPU id:0; name: NVIDIA RTX A4000; Bus id: 0000:51:00
selecting device 0
testing size: 231735296
rounded size: 231735296
gpu alloc fn: cuMemAlloc
device ptr: 7f26b2000000
DBG: sse4_1=1 avx=1 sse=1 sse2=1
ERR: ioctl error (errno=12)
pin ret: 12
ERR: mh is mapped already
Assertion "(gdr_map(g, mh, &map_d_ptr, size)) == (0)" failed at copybw.cpp:81
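
For reference, the errno value in the log above can be decoded with Python's standard library. On Linux, errno 12 is ENOMEM, and the `testing size` in the log works out to exactly 221 MiB. This is only a small sketch for reading the log, not part of gdrcopy; the helper names are made up for illustration.

```python
import errno
import os

# Values copied from the log above.
PIN_ERRNO = 12
TEST_SIZE_BYTES = 231735296


def describe_errno(code: int) -> str:
    """Return the symbolic name of an errno value, or 'UNKNOWN'."""
    return errno.errorcode.get(code, "UNKNOWN")


def bytes_to_mib(n: int) -> float:
    """Convert a byte count to MiB."""
    return n / (1024 * 1024)


if __name__ == "__main__":
    # os.strerror gives the platform's human-readable message for the code.
    print(describe_errno(PIN_ERRNO), "-", os.strerror(PIN_ERRNO))
    print(f"{bytes_to_mib(TEST_SIZE_BYTES):.1f} MiB requested")
```

ENOMEM from the pin ioctl does not necessarily mean that host RAM is exhausted; it only says that the driver could not obtain the resources needed for the request.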

@Notherthing
Author

Notherthing commented Jul 28, 2024

When I try to gdr_pin 221MB, it fails. Here is the output from dmesg.
Does this give more information about the problem?

[64513.944169] gdrdrv:gdrdrv_open:minor=0 filep=0xff2796ccd17ff600
[64513.944176] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[64513.944183] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[64513.944295] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f67b2000000 len=231735296 p2p_tok=0 va_tok=0 callback=ffffffffc06b2160)
[64513.944297] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=7f67b2000000 len=231735296 p2p_token=0 va_space=0 callback=ffffffffc06b2160) failed [ret = -12]
[64513.944298] gdrdrv:gdr_free_mr_unlocked:invoking unpin_buffer while callback has already been fired
[64513.959911] gdrdrv:gdrdrv_release:closing

@Notherthing
Author

I notice that the V100's BAR is about 32GB, but the A4000's is only 256MB. Can the A4000 get a 16GB BAR in compute mode? How can I switch the mode?
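
One way to check BAR sizes on Linux is to read the PCI device's `resource` file in sysfs, where each line holds the start address, end address, and flags of one region. The sketch below parses that format. The device path built from the bus id in the log (with function `.0` appended) is an assumption, and which region corresponds to BAR1 should be confirmed with `lspci -v` on the actual system. `nvidia-smi -q -d MEMORY` also reports the BAR1 total.

```python
from pathlib import Path


def parse_pci_resource(text: str) -> list[int]:
    """Return the size in bytes of each region in a sysfs PCI `resource` file.

    Each line has three hex fields: start, end, flags.
    Unused regions are reported as all zeros and get size 0 here.
    """
    sizes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not a region line
        start, end = int(fields[0], 16), int(fields[1], 16)
        sizes.append(0 if start == 0 and end == 0 else end - start + 1)
    return sizes


def read_bar_sizes(bdf: str) -> list[int]:
    """Read region sizes for a PCI device such as '0000:51:00.0'."""
    path = Path("/sys/bus/pci/devices") / bdf / "resource"
    return parse_pci_resource(path.read_text())


if __name__ == "__main__":
    # Assumption: function 0 of the bus id shown in the log.
    for index, size in enumerate(read_bar_sizes("0000:51:00.0")):
        if size:
            print(f"region {index}: {size / 2**20:.0f} MiB")
```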

@pakmarkthub
Collaborator

Hi @Notherthing ,

As you have already figured out, the limitation is your GPU BAR size. This is a GPU HW characteristic, and there is not much we can do here. You cannot map the entire GPU memory at once because of the small GPU BAR. But you can use a sliding window technique to map the region you want to use. When you need to access a different region, you free the current mapping first and then map the new region.
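
The sliding-window idea can be sketched as a planner that splits a large GPU allocation into windows small enough to fit in the BAR. The 64 KiB alignment follows gdrcopy's GPU page size; the 128 MiB window and the name `plan_windows` are illustrative assumptions. In a real program each window would be pinned and mapped with `gdr_pin_buffer` and `gdr_map`, used, and then released with `gdr_unmap` and `gdr_unpin_buffer` before moving to the next one.

```python
GPU_PAGE_SIZE = 64 * 1024  # gdrcopy works at 64 KiB GPU-page granularity


def plan_windows(total_bytes: int, window_bytes: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into consecutive (offset, length) windows.

    The window size is rounded down to a multiple of GPU_PAGE_SIZE so that
    every window starts on a GPU page boundary (assuming an aligned base
    pointer). The last window may be shorter; real code may need to round
    its pinned length up to a page multiple.
    """
    window = (window_bytes // GPU_PAGE_SIZE) * GPU_PAGE_SIZE
    if window <= 0:
        raise ValueError("window must be at least one GPU page")
    windows = []
    offset = 0
    while offset < total_bytes:
        length = min(window, total_bytes - offset)
        windows.append((offset, length))
        offset += length
    return windows


if __name__ == "__main__":
    total = 16 * 2**30    # roughly the A4000's device memory
    window = 128 * 2**20  # hypothetical size chosen to fit a 256 MiB BAR
    plan = plan_windows(total, window)
    # Real code would, for each (offset, length): pin + map, copy, unmap + unpin.
    print(len(plan), "windows")
```

Only one window is mapped at a time, so BAR usage stays bounded by the window size regardless of how much device memory is allocated.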

@pakmarkthub pakmarkthub self-assigned this Jul 29, 2024
@Notherthing
Author

Notherthing commented Jul 29, 2024

Thank you, my friend. It's a pity to learn that the cheaper Quadro devices have a small BAR size. I noticed that this displaymode is used to switch the GPU mode to get a larger BAR size, but it doesn't mention the A4000 (only the A5000 and higher-end devices). Will it work for the A4000? And sincerely, thank you for your valuable advice. If we cannot enlarge the BAR size, I think special designs are necessary when using GDR.

@pakmarkthub
Collaborator

I am not sure what that script does. Because A4000 is not in the support list, I would not advise you to try it.

Generally, small-BAR GPUs remain small-BAR. You may be able to disable the graphics mode using nvidia-smi, depending on your card. However, that would not change your total GPU BAR size; it might just remove the BAR space reserved for graphics. This is something you can experiment with to squeeze out a few more MB.

@drossetti
Member

@Notherthing depending on your motherboard, you might also be able to get a larger BAR1 by taking advantage of the PCIe "Resizable BAR" feature.
In practice the SBIOS would read a range of supported GPU BAR sizes (through a config register placed in a PCIe extension) and pick a reasonably large size.
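
As an illustration of the "range of supported sizes": the PCIe Resizable BAR capability register advertises each supported size as one bit, starting at bit 4 for 1 MB and doubling with each higher bit. The sketch below decodes such a register value. The sample value is made up for illustration and was not read from an A4000, and the function name is hypothetical.

```python
def rebar_supported_sizes(cap_register: int) -> list[int]:
    """Decode a Resizable BAR capability register into sizes in bytes.

    Bit 4 means 1 MiB, bit 5 means 2 MiB, and so on up to bit 31.
    Bits 0-3 are not size bits and are ignored.
    """
    sizes = []
    for bit in range(4, 32):
        if cap_register & (1 << bit):
            sizes.append(1 << (bit - 4 + 20))
    return sizes


if __name__ == "__main__":
    # Made-up example: a device advertising 256 MiB and 16 GiB.
    sample = (1 << 12) | (1 << 18)
    for size in rebar_supported_sizes(sample):
        print(f"{size // 2**20} MiB")
```

On Linux, `lspci -vv` run as root should list this capability, with the current and supported sizes, for devices that expose it.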

@Notherthing
Author

Thanks. I am going to try.
