I'm fascinated by how someone with this level of domain knowledge finds a problem like this and solves it.
Is there some correlation between WoW players and wine/graphics programmers?
Is this a game dev who happens to play WoW?
Externally, the phenomenon of a gamer (i.e. a user) having this depth of domain knowledge is intriguing.
(I'm partially motivated by the fact that, as a generalist programmer, I don't think I'd get anywhere near the level of understanding needed to produce something like this.)
This guy had a problem, slow FPS in a game, knew the generalized view of how Wine worked, and used the tools he knew of to try to fix the problem.
I've never written a line of DX, GL, etc., but I know what command buffers, driver synchronization, and AZDO are, like this article mentions.
I also play WoW on Linux, and am kind of embarrassed I didn't think to try perf monitoring the game for easy-to-fix huge slowdowns like this. I kind of assumed that since WoW is one of the most popular Wine games, and generally pushes the DX API support to make sure it always works, the main Wine devs would have optimized it more.
That being said, buffer_storage is a GL 4.4 extension and Wine has this awful habit of trying to strictly support OSX, which will never see OpenGL beyond 4.1, and I'm not sure if buffer_storage is available there. That alone might mean these patches are never merged mainline, which would be... inconvenient.
"Fundamentally, it’s a function that maps a slice of GPU memory into the host’s address space, typically for streaming geometry data or texture uploads"
Can someone clarify this for me? Are OpenGL/D3D buffers that get stuff memcpy'd into them by the CPU actually "slices of GPU memory," or are they more often reserved driver memory that eventually get DMA'd to the GPU? (I realize both probably happen at different times, but I'm curious which is more typical for modern systems)
It seems like spending CPU cycles writing every byte over the bus would perform much worse than a fast write to sysmem followed by a DMA transfer.
EDIT:
I looked into it, and it seems like the typical implementation is that map returns a pointer to some pinned driver sysmem and unmap kicks off an async DMA to GPU memory.
> Can someone clarify this for me? Are OpenGL/D3D buffers that get stuff memcpy'd into them by the CPU actually "slices of GPU memory," or are they more often reserved driver memory that eventually get DMA'd to the GPU?
The answer is, as with all things OpenGL, it depends. You might get back a pointer to GPU memory that you can directly write to, or you'll get back some chunk of system memory the driver has.
The ARB_buffer_storage extension improves matters as you can almost guarantee that you'll get GPU memory, and you can keep it mapped for the entire lifetime of your application (the old buffer APIs wouldn't let you keep it mapped during a draw call). The downside is that you're now responsible for synchronising access to that data.
But as for "is it quicker?", maybe. DMA transfers aren't free; they take time to set up, and usually they need to operate from a limited pool of source memory. If the driver has to take a local copy of your data before transferring it (which it will do for every glBufferData/SubData call), then you might as well do the copy yourself; GPUs aren't hurting for PCIe bandwidth these days. In addition, you can use a separate thread/CPU core to do the copy, since, unlike every other OpenGL call, memcpy'ing into mapped memory doesn't require an OpenGL context.
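To make the persistent-mapping idea from the parent comments concrete: with ARB_buffer_storage you map the buffer once and sub-allocate from that mapping yourself on the CPU. Below is a minimal sketch of the CPU-side bookkeeping for such a streaming ring; all names are invented for illustration, and the stubbed wait stands in for a real glClientWaitSync on a fence recorded when the region was last submitted to the GPU.

```c
#include <stddef.h>

/* Hypothetical bookkeeping behind a persistently mapped "streaming ring":
 * glBufferStorage(..., GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT) yields
 * one long-lived pointer, so each upload is just pointer arithmetic plus a
 * memcpy -- no glMapBufferRange call per frame. */

typedef struct {
    size_t size;   /* total size of the mapped buffer */
    size_t head;   /* next free byte */
    size_t align;  /* allocation alignment (power of two) */
} ring_t;

static size_t align_up(size_t v, size_t a) { return (v + a - 1) & ~(a - 1); }

/* In a real renderer this would glClientWaitSync() on the fence that was
 * placed when this region was last used, guaranteeing the GPU is done. */
static void wait_for_gpu_region(size_t offset, size_t len) {
    (void)offset; (void)len;
}

/* Returns the byte offset to write into, or (size_t)-1 if the request can
 * never fit. Wraps to the start of the buffer when the tail is reached. */
size_t ring_alloc(ring_t *r, size_t len) {
    len = align_up(len, r->align);
    if (len > r->size) return (size_t)-1;
    if (r->head + len > r->size)
        r->head = 0;                   /* wrap; oldest data lives here */
    size_t off = r->head;
    wait_for_gpu_region(off, len);     /* ensure the GPU no longer reads it */
    r->head += len;
    return off;
}
```

A D3D-style Lock() then reduces to `mapped_base + ring_alloc(&r, bytes)`, which is the gist of why this path avoids per-map driver calls.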
It's HTML and some simple CSS. It's enough to make a website look good, and it's fast (as webpages should be) on any moderately modern computer. Why don't we all do this again?
But I was left wondering what the actual problem was. Why is glMapBuffer slow? Is it just the impedance mismatch between D3D and GL, which don't have the same synchronization guarantees for that specific call? Why does Wine have its own command handling thread when, very likely, the underlying OpenGL driver has one too?
glMapBuffer doesn't have any ability to declare that you won't overwrite data. All you can say is whether you want read/write access to the buffer, so the driver has to assume that the client might overwrite in-flight data, which means synchronization is required.
As for the command stream handling, it makes decent sense to do translation up-front and batch your drawing commands into a command stream, so a separate thread can just hammer through it as fast as possible, rather than doing GL calls in-line with the translation. Partly so the game can return to doing its thing as fast as possible, and partly to fix issues with GL's threading model being horrible (see e.g. https://bugs.winehq.org/show_bug.cgi?id=24684 ).
> glMapBuffer doesn't have any ability to declare that you won't overwrite data.
Well, yes and no. glMapBufferRange, which is basically a drop-in replacement for glMapBuffer, does have such a flag: GL_MAP_UNSYNCHRONIZED_BIT.
glMapBufferRange requires OpenGL 3.0, whereas glMapBuffer exists all the way back to OpenGL 1.5, but this looks more like an oversight in Wine than anything else:
https://github.com/wine-mirror/wine/blob/538263d0efe725124df...
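For readers following along, the D3D-lock-flag to GL-map-flag translation being discussed looks roughly like this. The numeric constants are the real values from the D3D9 and GL headers, but the function itself is a simplified sketch of what a D3D-on-GL layer does, not Wine's actual code:

```c
/* Constants reproduced locally so the sketch is self-contained. */
#define D3DLOCK_NOOVERWRITE          0x00001000
#define D3DLOCK_DISCARD              0x00002000
#define GL_MAP_WRITE_BIT             0x0002
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008
#define GL_MAP_UNSYNCHRONIZED_BIT    0x0020

/* Translate D3D9 Lock() flags into glMapBufferRange() access bits. */
unsigned d3d_lock_flags_to_gl(unsigned d3d_flags) {
    unsigned gl = GL_MAP_WRITE_BIT;
    if (d3d_flags & D3DLOCK_DISCARD)
        gl |= GL_MAP_INVALIDATE_BUFFER_BIT; /* orphan: request fresh storage */
    if (d3d_flags & D3DLOCK_NOOVERWRITE)
        gl |= GL_MAP_UNSYNCHRONIZED_BIT;    /* promise: no in-flight data touched */
    return gl;
}
```

With neither D3D flag set, the driver gets plain GL_MAP_WRITE_BIT and must synchronize, which is exactly the slow case under discussion.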
Wine uses glMapBufferRange when available, and has logic to translate D3DLOCK_NOOVERWRITE to GL_MAP_UNSYNCHRONIZED_BIT. I don't understand what the current implementation is doing that is causing wined3d_resource_map to block.
This is an excellent question. The current wine implementation using OpenGL should mirror the behaviour of D3D, since it is using glMapBufferRange and passing GL_MAP_UNSYNCHRONIZED_BIT when possible. I am also wondering what is actually going on that is hurting performance.
EDIT:
After further research and more thought, I suspect that the "pipeline stall" doesn't involve waiting for the GPU to complete work using the buffer, just waiting for the driver. The map/unmap with overwrite or discard is working as intended, but the persistent buffer heap he implemented outperforms it because it reduces the number of calls into the driver required.
I initially had the impression that the existing Wine implementation was somehow deficient, but really what the author did was find a way to use the new persistent buffers feature to optimize D3D code that uses the older per-frame map/unmap method.
This is in fact what the post essentially said (after a re-read), I just misunderstood and thought the pipeline stall was actually waiting for the GPU. The note in the post about the "GPU" line really being the driver is important.
There are two main parts to the stall, which aren't well illustrated by the diagram (I'll get on updating it):
1. Waiting for the resource to exit the command stream (wined3d_resource_wait_idle).
2. Waiting for the CS thread to finish after the map (occurs in wined3d_cs_map).
It's a pipeline stall because the D3D thread has to wait for the CS thread to do things, and is thus unable to dispatch more commands to the CS (and thus the GPU) during this time. I don't consider the actual glMapBufferRange call to be part of the stall.
Edit: saw your edit :)
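A toy, single-threaded model of the handoff just described may help. All names here are invented; the real command stream runs on its own thread, but the ordering constraint is the same: a synchronized map cannot return until every queued command that might touch the buffer has been executed.

```c
#include <stddef.h>

/* Toy model of the D3D-thread / CS-thread handoff: submitting a command is
 * cheap (just an enqueue), but a synchronized map forces a full drain of the
 * queue before it can return a pointer -- that drain is the stall. */

enum { QUEUE_MAX = 64 };
static int queue[QUEUE_MAX];
static size_t q_len = 0;     /* commands queued but not yet executed */
static size_t executed = 0;  /* commands the "CS thread" has executed */

static void cs_submit(int cmd) { queue[q_len++] = cmd; }  /* never blocks */

/* Stand-in for waiting on wined3d_resource_wait_idle / the CS thread. */
static void cs_drain(void) {
    executed += q_len;
    q_len = 0;
}

/* A synchronized map must drain first; an UNSYNCHRONIZED-style map need not. */
static void map_buffer(int unsynchronized) {
    if (!unsynchronized)
        cs_drain();          /* the stall: the app thread blocks here */
}
```

The point of the model: with `unsynchronized` set, the app thread keeps dispatching and the queue keeps filling, which is the behavior the persistent-buffer patch recovers.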
I'm glad you saw the edit. I initially wrote it based on my first interpretation of what you wrote, but after re-reading I realized that your description was totally accurate, I just misunderstood.
Another option is Gallium Nine, which uses a D3D9 state tracker directly in the driver, skipping the GL layer: https://wiki.ixit.cz/d3d9 (though on NVIDIA, nouveau will probably be slower than the proprietary GL driver).
Excellent writeup. I would be curious as to the specific considerations given to a heap allocator on the GPU. Related: I'm not too familiar with Wine patches - what is the easiest way to view the final source code of this patch?
Very nice article and project! Have you talked with the Wine people to see if you could eventually merge your patch into the official codebase? Seems like it would help a lot of us Linux gamers :)
A pleasing font family, no JavaScript, some basic CSS, stick to basic HTML tags and use them properly.
https://comminos.com/css/default.css
Not to knock an otherwise nice site (Ctrl-+ improves readability for me personally, though), but:
AFAIK, even for sites that want to feed the monster, that's unnecessary: http://chrisltd.com/blog/2015/04/social-share-like-buttons-w...
https://sharingbuttons.io/
Twitter seems to work (brings up form).
Facebook redirects to an error.
Linkedin seems to work (brings up form when logged in).
Pinterest seems to work (brought up create board dialog when logged in)
Google+ seems to work (brings up share form).
I do have to say that for me that these would actually work and the JS likely wouldn't in some cases, since I make heavy use of Firefox's containers now to sandbox a lot of online identities, and just have new windows for certain URLs automatically load in the correct container.
AFAICS (which I didn't notice at first) sharingbuttons has a slightly different FB URL, which may not give an error:
https://facebook.com/sharer/sharer.php?u=YOUR-URL
Compared to chrisltd's:
https://facebook.com/sharer.php?u=YOUR-URL
https://developer.mozilla.org/en-US/docs/Mozilla/Mobile/View...
Great work, and thanks for writing it up.
https://github.com/disks86/VK9 https://github.com/doitsujin/dxvk https://source.winehq.org/git/vkd3d.git/