Unified Nvidia TNT/GeForce driver for Haiku



Measurements done during acceleration code rewrite for DMA use in driver 0.41.


While developing DMA acceleration for the nVidia driver, I had a lot of fun benchmarking the driver over and over again. Because of that, I thought it might be a good idea to share my findings with you so you can have a look for yourself. Maybe the info here can even help you select a future graphics card for Haiku. Who knows... ;-)

OK, when I first developed DMA acceleration, I 'literally' reused the existing PIO setup for command execution inside the driver. It was nice to see that the acceleration functions already became about 2-3 times faster on fast CPUs, but I was a bit disturbed to see no speed gain at all on slow CPUs. I realized then that code overhead must have been playing an important role in the setup I had done, so I started to tweak the driver for more speed by minimizing the code that needed to be executed.
The things I did were:
The grand total of this is a speed gain of about 110% on top of the 2-3 times I had already gained, and this time not only on fast CPUs, but also on slow CPUs! So in fact the driver can be up to 6 times faster than before. I mentioned 10 times as the maximum speedup because the rest of the gain can be found on the NV36 and other non-native PCIe GPUs when they are used on PCIe: see the explanation below ('Interpreting results').

I am convinced now that the driver's acceleration is working at peak performance (as far as is currently possible that is).

Let's talk about the real speed improvement DMA gives over PIO mode on nVidia hardware. In other words, what can be done to speed up PIO mode so that it works at peak performance too? For that we have to look again at the list I just mentioned. Only items 1, 2 and 5 can be used for PIO mode as well: item 3 constitutes no real overhead for PIO mode (it's way simpler there), and item 4 doesn't have to be done at all, as is already the case: commands are auto-started in PIO mode. When we sum the remaining items up, we see that we can speed up PIO mode by about 30%. So the real improvement DMA brings us here is 600%/130% = about 4.6 times. Nice to know.
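The arithmetic behind these figures can be checked with a few lines of Python. This is only a worked calculation using the percentages quoted in the text; the variable and function names are made up for the illustration and do not come from the driver.

```python
# Figures quoted in the text; all names here are illustrative only.
dma_first_pass = (2.0, 3.0)  # initial DMA gain over PIO ("about 2-3 times")
dma_tweak_gain = 1.10        # further "about 110%" gain from trimming code overhead
pio_tweak_gain = 0.30        # "about 30%" gain the same tweaks would give PIO mode


def combined_speedup(base_factor, extra_gain):
    """Stack a fractional gain on top of an existing speedup factor."""
    return base_factor * (1.0 + extra_gain)


low, high = (combined_speedup(f, dma_tweak_gain) for f in dma_first_pass)
print(f"DMA vs. original PIO: {low:.1f}x to {high:.1f}x")

# The text rounds the upper bound to 6x (600%) and compares it
# against a tuned PIO mode running at 130% of its original speed.
net_dma_gain = 6.0 / (1.0 + pio_tweak_gain)
print(f"DMA vs. tuned PIO: about {net_dma_gain:.1f}x")
```

The first line of output gives the 4.2x to 6.3x range that the 'up to 6 times' claim rounds down from; the second reproduces the 'about 4.6 times' figure.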


Well, let's have a look at the table with benchmark results below. Shall we? When you look at the differences between PIO and DMA mode tables for cards, you will notice that:
are non-accelerated drawing functions. All other functions get accelerated by the graphics driver.


Interpreting results.

Interpreting the results with regard to the graphics driver alone is more difficult than it seems beforehand. As BeRoMeter tests the speed of the graphics subsystem as a whole via the normal API, the app_server's speed and way of doing things influence the results, for one thing. But as some API functions consist partly of accelerated pieces and partly of unaccelerated pieces, the raw bus speed to the graphics card also influences some results.

Let's look at the example that shows the most from the results below. When you compare the GeForce PCX5750 (almost at the bottom of the table) with the TNT1 (at the top of the table), you'll notice something strange:
The item Graphics Polygons Unflushed Filled in PIO mode is way slower on the NV36 than it is on the NV04, while the DMA speed for the NV36 is way faster than it is on the NV04: even if you add up the results for both CPUs for the TNT1! In fact, there's not a single card slower than the NV36 in PIO mode...

OK, here's the explanation:
The NV36 is sitting in a PCI-express slot. Well, it should be fast then, no? Hmm, it could be. The thing is that this is no PCIe card: it's an AGP card with a PCIe to AGP bridge sitting in front of it, making us believe it's a native PCIe card. Well, AGP means we have to enable AGP on it or it will be using old-fashioned PCI mode (without the trailing 'express'). OK, you guessed it by now: it's in PCI mode, which has a very low bus speed compared to AGP or PCIe. The AGP interface is buried on the card, so no AGP bus manager will see it: we need support within the graphics driver to program it, using specs given out by nVidia themselves (eh: probably not going to happen any time soon, I fear).
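To put 'very low bus speed' into numbers, here is a small sketch that computes the theoretical peak bandwidth of each bus from its published clock and width. These are specification figures, not measurements made with the driver, and real-world throughput will be lower; the helper names are invented for this example.

```python
# Theoretical peak bandwidths in MB/s, derived from the bus specifications.
# These are not measurements; the helper names are illustrative only.

def parallel_bus_mb_s(clock_mhz, width_bytes, transfers_per_clock=1):
    """Peak rate of a parallel bus such as PCI or AGP."""
    return clock_mhz * width_bytes * transfers_per_clock


def pcie1_mb_s(lanes):
    """PCIe 1.x: 2.5 GT/s per lane with 8b/10b encoding, per direction."""
    return lanes * 2500 * 8 / 10 / 8


buses = {
    "PCI (32-bit, 33 MHz)": parallel_bus_mb_s(33.33, 4),
    "AGP 1x": parallel_bus_mb_s(66.66, 4, 1),
    "AGP 2x": parallel_bus_mb_s(66.66, 4, 2),
    "AGP 4x": parallel_bus_mb_s(66.66, 4, 4),
    "PCIe 1.x x16": pcie1_mb_s(16),
}

pci = buses["PCI (32-bit, 33 MHz)"]
for name, rate in buses.items():
    print(f"{name:22s} {rate:7.0f} MB/s  ({rate / pci:4.1f}x PCI)")
```

On paper, a card stuck in PCI mode has roughly an eighth of the AGP 4x bandwidth available, which fits the picture of the NV36 falling behind on anything that moves data over the bus.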

Wait! We are not there yet... Why don't we see the same behaviour in DMA mode? Well, there's a simple explanation for that as well: the (dano) app_server is quite intelligent! It (apparently) benchmarks the graphics driver itself and then decides which parts of API functions get accelerated, and which will be drawn 'by hand'. To make the picture complete for this example, you need to understand two more things:

Note:
There's more proof for the (dano) app_server benchmarking the driver. BeRoMeter uses the API to draw various non-rectangular shapes, both small and big. You can actually witness (by tweaking and logging) that the app_server issues 'short' accelerated drawing commands if the driver executes them slowly, while the commands get 10-20 times longer when the driver executes them a lot faster. In both cases the items are drawn accelerated though. Here's the reason for the app_server splitting up the commands on slower engines: system responsiveness. The app_server (or any other 'client' for that matter) may not hold the engine locked down for too long: other (otherwise independent) clients may be waiting to do some accelerated drawing as well! (3D comes to mind ;-)
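The idea of sizing commands to the engine's speed can be sketched as below. This is purely a guess at the principle, not the app_server's actual algorithm (which is not documented here): the function name, the time budget and the limits are all hypothetical.

```python
# Hypothetical sketch: choose how much work to put into one accelerated
# command so the engine is never locked longer than a fixed time budget.
# None of these names or numbers come from the app_server itself.

def items_per_command(engine_items_per_sec, max_lock_us=1000,
                      min_items=1, max_items=64):
    """Return how many drawing items fit into one engine lock.

    engine_items_per_sec: measured throughput of the driver's engine.
    max_lock_us: assumed upper limit for holding the engine, in microseconds.
    """
    budget = engine_items_per_sec * max_lock_us // 1_000_000
    return max(min_items, min(max_items, budget))


slow = items_per_command(2_000)    # slow engine: short commands
fast = items_per_command(40_000)   # fast engine: much longer commands
print(slow, fast, fast // slow)
```

With these made-up throughputs the fast engine gets commands 20 times longer than the slow one, while both hold the engine for about the same time, so other clients wait equally long in either case.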

Note also:
The NV36 is the only one in the table suffering from the 'PCI mode problem'; all other results for this card, and the results for the other cards, can be compared reasonably well.

And note:
Apparently the completely unaccelerated functions do not have their speed bottleneck in the bus speed to the graphics card, as you can't detect differences like the one just described (for Graphics Polygons Unflushed Filled) for them.


Varying results.

BeRoMeter 1.2.6 has a habit of showing variations in measured speeds with my drivers. While these variations mostly stay within a few percent of the outcome, there are exceptions to that 'rule': Graphics Rectangles Unflushed and Graphics Rectangles Unflushed Filled. Both of them sometimes fluctuate by up to about 25% of their outcome. Most of the time, however, if one goes down, the other goes up by about the same amount (and vice versa).

All measurements were therefore done at least 3 times, to make sure the results I'd use would be representative of the system's speed. For you this means that while you interpret the results below, you should keep in mind that a minor slowdown or speedup indicated is in fact 'no change'. And the results for the mentioned Graphics Rectangles Unflushed functions should be taken loosely, while bearing in mind that both should be looked at simultaneously.
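One way to apply this advice when comparing your own runs is sketched below. Taking the median of the repeated runs and using a 5% threshold for 'no change' are my assumptions for the example, not something BeRoMeter or the driver prescribes, and the numbers are invented placeholders rather than real measurements.

```python
from statistics import median


def representative(runs):
    """Pick one value from repeated runs (median is an assumed choice)."""
    if len(runs) < 3:
        raise ValueError("measure at least 3 times")
    return median(runs)


def verdict(before, after, tolerance=0.05):
    """Classify a change; anything within the tolerance is 'no change'."""
    change = (after - before) / before
    if abs(change) <= tolerance:
        return "no change"
    return "speedup" if change > 0 else "slowdown"


# Invented placeholder scores, NOT measurements from the table.
pio_runs = [100.0, 103.0, 98.0]
dma_runs = [440.0, 455.0, 450.0]
print(verdict(representative(pio_runs), representative(dma_runs)))

# The two Rectangles Unflushed items tend to trade places, so judge their sum.
rect_before = 80.0 + 100.0   # unfilled + filled, first run
rect_after = 100.0 + 80.0    # unfilled + filled, second run
print(verdict(rect_before, rect_after))
```

Looking at the sum of the two rectangle items keeps their mirrored fluctuations from being mistaken for a real change in either one.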

All in all I think the benchmarking results give you a nice overall indication of the speed differences you can expect on certain systems between PIO and DMA acceleration mode. You can also see how cards with different architectures compare to each other: it's become clear that it matters after all, even for 2D only, which card you use.


Rudolf.


Notes for the table below:

Table: All measurements as they were done with Haiku nVidia driver 0.41.

PIO mode acceleration:

DMA mode acceleration:

Asus P2B-D mainboard (Intel 82440BX chipset) with dual Pentium 3 @ 500MHz (FSB 100MHz):

(AGP V1.0, AGP2x max, no FW support)
TNT1 (NV04, with 128-bit gfxRAM buswidth??):
GeForce 2 MX400 (NV11):
no measurement taken.

GeForce FX5200 (NV34):

Dell Inspiron 8600C laptop (Intel 82855PM chipset) with Pentium-M 725 @ 1.6GHz (FSB 400MHz):

(AGP V2.0, AGP4x max, FW supported)
GeForce FX5200 Go (NV34):

Asus P4B533 mainboard (Intel 82845E chipset) with Pentium 4 @ 2.8GHz (FSB 533MHz):

(AGP V2.0, AGP4x max, FW supported)
TNT2-M64 (NV05 with 64-bit gfxRAM buswidth):
TNT2 (original NV05 with 128-bit gfxRAM buswidth):
GeForce 2 MX400 (NV11):
GeForce 4 MX440 (NV18):
GeForce 4 Ti4200 (NV28):
GeForce FX 5200 (NV34):
no measurement taken.

Abit AA8XE mainboard (Intel 82925 XE chipset) with Pentium 4 @ 3.2GHz (FSB 800MHz):

(PCIe V1.0a)
GeForce PCX5750 (NV36, PCI-express card):
GeForce PCX6600 GT (NV43, PCI-express card):
PIO mode does not work.

(Page last updated on February 24, 2005)