by sinbad » Sun Jan 08, 2006 1:22 pm
Old cards were transform-limited, or possibly didn't even have hardware transform at all (the Voodoo1/2 cards didn't, for example). It was therefore important to send as few triangles down the pipeline as you could. This also applied to software renderers, which often performed per-triangle culling.
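As a rough illustration of the per-triangle work a software renderer might do, here is a minimal sketch of screen-space backface culling. It assumes counter-clockwise front faces in a y-up screen space; the `Vec2`, `signedArea2` and `cullBackfaces` names are hypothetical and not taken from any particular engine, and real renderers usually combine this with frustum and clipping tests.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Screen-space vertex after transform and projection (hypothetical type).
struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, c). Positive means the vertices
// wind counter-clockwise on screen (assuming y points up).
inline float signedArea2(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Per-triangle backface culling: keep only triangles whose screen-space
// winding is counter-clockwise (assumed to be the front-facing order).
// Zero-area (degenerate) triangles are culled as well.
// Returns the indices of the triangles that survive.
std::vector<std::size_t> cullBackfaces(const std::vector<std::array<Vec2, 3>>& tris) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < tris.size(); ++i) {
        const auto& t = tris[i];
        if (signedArea2(t[0], t[1], t[2]) > 0.0f)
            visible.push_back(i);
    }
    return visible;
}
```

Every rejected triangle here is one the (possibly software) transform and rasterisation stages never have to touch, which is exactly what mattered on transform-limited hardware.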
The Q3A BSP structure follows this approach: it culls in very small patches of triangles, dynamically determining every frame which of these small groups are visible, so that it passes only the smallest set it needs to the renderer.
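To make the per-frame selection concrete, here is a minimal sketch of a PVS (potentially visible set) lookup in the style of the Q3A BSP visdata lump, where each cluster stores a bit vector of the clusters it can see. The struct layouts and function names are hypothetical simplifications: the real engine also frustum-culls BSP nodes, handles a camera outside the world differently (this sketch just treats negative clusters as invisible), and uses its own data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified subset of the BSP data relevant to visibility (hypothetical layout).
struct VisData {
    int numClusters = 0;
    int bytesPerCluster = 0;           // size of one cluster's bit vector
    std::vector<std::uint8_t> bits;    // numClusters * bytesPerCluster bytes
};

struct Leaf {
    int cluster;                       // negative means "outside the map"
    std::vector<int> faces;            // indices into the level's face list
};

// True if testCluster's bit is set in fromCluster's PVS bit vector.
bool clusterVisible(const VisData& vis, int fromCluster, int testCluster) {
    if (fromCluster < 0 || testCluster < 0) return false;
    assert(fromCluster < vis.numClusters && testCluster < vis.numClusters);
    std::uint8_t byte = vis.bits[fromCluster * vis.bytesPerCluster + (testCluster >> 3)];
    return (byte & (1u << (testCluster & 7))) != 0;
}

// Per-frame gather: collect each visible face exactly once, even if several
// visible leaves reference it. faceStamp (one entry per face, initialised to 0)
// is reused across frames instead of being cleared; frame numbers start at 1.
std::vector<int> gatherVisibleFaces(const VisData& vis, const std::vector<Leaf>& leaves,
                                    int cameraCluster, std::vector<unsigned>& faceStamp,
                                    unsigned frame) {
    std::vector<int> out;
    for (const Leaf& leaf : leaves) {
        if (!clusterVisible(vis, cameraCluster, leaf.cluster)) continue;
        for (int f : leaf.faces) {
            if (faceStamp[f] == frame) continue;   // already added this frame
            faceStamp[f] = frame;
            out.push_back(f);
        }
    }
    return out;
}
```

The result is a fresh, small list of faces every frame; how that list then reaches the card is the real question.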
There are basically two ways to handle this: either keep all the data on the card and make lots of small draw calls to pull in all the fragments you need to render this frame, or build a combined buffer of the visible fragments every frame (adjusting offsets etc. on the fly on the CPU) and upload that to the card as often as you need to for rendering. Q3A uses the latter approach; our BSP renderer uses the former. (It used to work the same way as Q3A, but as an experiment we tried it the other way, and on cards with hardware vertex buffer support it is faster.)
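The two approaches can be sketched side by side. This is a hypothetical illustration, not OGRE's or Q3A's actual code: `Fragment` and `DrawCall` are made-up types, and `DrawCall` merely stands in for a real API call such as an indexed draw with a base-vertex offset. The combined-buffer path shows the "adjusting offsets on the CPU" step: each fragment's indices are rebased by the number of vertices already written.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z, u, v; };

// One BSP fragment (e.g. a face) living in the level's static buffers.
// Its indices are relative to firstVertex.
struct Fragment {
    std::uint32_t firstVertex, vertexCount;
    std::uint32_t firstIndex, indexCount;
};

// Approach 1: data stays on the card; issue one small draw per visible fragment.
struct DrawCall { std::uint32_t firstIndex, indexCount, baseVertex; };

std::vector<DrawCall> buildPerFragmentDraws(const std::vector<Fragment>& frags,
                                            const std::vector<int>& visible) {
    std::vector<DrawCall> calls;
    for (int f : visible) {
        const Fragment& fr = frags[f];
        calls.push_back({fr.firstIndex, fr.indexCount, fr.firstVertex});
    }
    return calls;  // many calls, almost nothing uploaded
}

// Approach 2: copy the visible fragments into one combined buffer on the CPU,
// rebasing each index by the vertices already written; upload, then draw once.
void buildCombinedBuffer(const std::vector<Vertex>& srcVerts,
                         const std::vector<std::uint32_t>& srcIndices,
                         const std::vector<Fragment>& frags,
                         const std::vector<int>& visible,
                         std::vector<Vertex>& outVerts,
                         std::vector<std::uint32_t>& outIndices) {
    outVerts.clear();
    outIndices.clear();
    for (int f : visible) {
        const Fragment& fr = frags[f];
        assert(fr.firstVertex + fr.vertexCount <= srcVerts.size());
        const auto base = static_cast<std::uint32_t>(outVerts.size());
        outVerts.insert(outVerts.end(), srcVerts.begin() + fr.firstVertex,
                        srcVerts.begin() + fr.firstVertex + fr.vertexCount);
        for (std::uint32_t i = 0; i < fr.indexCount; ++i)
            outIndices.push_back(srcIndices[fr.firstIndex + i] + base);
    }
    // One draw call, but outVerts/outIndices must cross the bus every frame.
}
```

The trade-off is visible in the return values: the first path produces one call per fragment, the second produces one buffer whose size grows with everything visible.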
Modern GPUs hate both approaches: the first spends far too much time in per-call overhead, and the second spends too much time transferring data over the bus. Which one is faster depends on the card, bus and CPU you have, but neither is optimal.
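Why the winner depends on the hardware can be seen with a deliberately crude cost model. Everything here is an assumption for illustration: the parameter names are hypothetical, the costs are treated as linear and additive, GPU-side work is ignored, and any real numbers would have to be measured on the target card, bus and CPU.

```cpp
#include <cassert>

// Hypothetical per-frame cost model for the two submission strategies.
// All parameters are inputs you would have to measure; none are real figures.
struct CostParams {
    double perCallOverheadUs;   // fixed CPU/driver cost per draw call
    double busBytesPerUs;       // effective upload bandwidth
    double cpuCopyBytesPerUs;   // CPU throughput when gathering/rebasing data
};

// Many small draws: cost grows with the number of visible fragments.
double manySmallDrawsCost(const CostParams& p, int visibleFragments) {
    return visibleFragments * p.perCallOverheadUs;
}

// Combined buffer: a single draw, but every visible byte is copied on the
// CPU and then pushed across the bus.
double combinedUploadCost(const CostParams& p, double bytesPerFrame) {
    assert(p.busBytesPerUs > 0 && p.cpuCopyBytesPerUs > 0);
    return p.perCallOverheadUs
         + bytesPerFrame / p.cpuCopyBytesPerUs
         + bytesPerFrame / p.busBytesPerUs;
}
```

With made-up inputs of 100 visible fragments and 1,000,000 bytes per frame, a slow bus (1000 bytes/us, CPU copy 2000 bytes/us, 10 us per call) favours many small draws (1000 us vs 1510 us), while a bus and CPU four and two times faster flip the result (510 us for the combined upload). Neither term ever goes away, which is the point: both strategies pay a per-frame cost that scales with the number of small fragments.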
Modern level structures are designed to use much bigger chunks of data in one go, and are optimised for submitting large chunks of geometry at once rather than picking out small fragments. So, even though the Q3A format is very popular, it's very outdated and generally unsuitable for modern projects.
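For contrast, here is a minimal sketch of the coarse-grained style: geometry is pre-grouped into large static chunks that stay in GPU buffers, and each chunk is accepted or rejected as a whole with a bounding-box frustum test. This is not any specific level format; the `Chunk` type and its `drawCallId` handle are hypothetical, and frustum-vs-AABB culling is just one common way to do the coarse test.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };     // inside where dot(n, p) + d >= 0
struct AABB { Vec3 min, max; };

// A large, static chunk of level geometry drawn with a single call whenever
// any part of it might be visible. drawCallId is a hypothetical handle.
struct Chunk { AABB bounds; int drawCallId; };

// Conservative test: returns false only if the box lies entirely outside
// at least one frustum plane.
bool aabbIntersectsFrustum(const AABB& b, const std::array<Plane, 6>& frustum) {
    for (const Plane& pl : frustum) {
        // "Positive vertex": the box corner furthest along the plane normal.
        Vec3 p{pl.n.x >= 0 ? b.max.x : b.min.x,
               pl.n.y >= 0 ? b.max.y : b.min.y,
               pl.n.z >= 0 ? b.max.z : b.min.z};
        if (pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d < 0) return false;
    }
    return true;
}

// Coarse per-chunk culling: a handful of box tests per frame, and each
// surviving chunk is submitted whole, with no per-frame gathering or upload.
std::vector<int> visibleChunkDraws(const std::vector<Chunk>& chunks,
                                   const std::array<Plane, 6>& frustum) {
    std::vector<int> draws;
    for (const Chunk& c : chunks)
        if (aabbIntersectsFrustum(c.bounds, frustum)) draws.push_back(c.drawCallId);
    return draws;
}
```

The design choice is to accept some overdraw (a partially visible chunk is drawn in full) in exchange for few draw calls and no per-frame bus traffic, which is the trade modern GPUs reward.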