OGPN13

Newsletter Archives

August 26th 2006 Open Graphics Project Newsletter

Over 206 Posts covered in this Newsletter

Progress and developments

OGD1 Boards

We're coming close to having prototype hardware for OGD1 boards, so we're working to get pieces of RTL together for board testing. We need modules that are just functional enough to test to make sure all of the signals on the board are there and working correctly. As of now, we have the following:

Some SPI controllers that we need to be sure are finished:

We should use the simplest one, which I think is Petter's, IIRC.

A video controller that Patrick has nearly finished up:

And a couple of memory controllers:

The memory controller is the part I'm working on right now, and it's the least tested of the lot, so I would like to get some assistance with that.


With the boards arriving soon, the mailing list looked toward the aspects involved in writing the drivers.

Hamish wondered aloud whether commands could be queued with the graphics card.

Timothy agreed:
To an extent, we'll be doing that. A typical approach is to have a central circular buffer ("ring buffer") that is controlled by a central driver. Whenever an application wants to send commands, it writes to its own "indirect" buffer, and when it wants to submit those commands, a command is put into the ring buffer that points at the indirect buffer. (Indirect buffers are linear, not circular.) The ring buffer is the queue. The problem with directly enqueueing into the GPU is that if DMA is already going on, we contend for the bus. The bus is already in use, so we end up wasting thousands of CPU cycles waiting on the bus to clear just to do a handful of PIO writes. It's better to use a DMA buffer in host memory where we have some shared variables between the CPU and GPU, and we just use those to indicate where the queue head and tail are.
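The ring-buffer scheme Timothy describes can be sketched on the host side. This is an illustrative model only; the entry layout, names, and sizes are assumptions, not the OGD1 format:

```c
#include <stdint.h>

/* Sketch of the ring-buffer scheme described above. Applications fill
 * their own linear "indirect" buffers; the central driver submits one
 * by enqueueing a pointer-and-length entry into the shared ring. */

#define RING_ENTRIES 256u  /* must be a power of two for cheap wrap */

struct ring_entry {
    uint32_t ib_addr;  /* graphics-space address of the indirect buffer */
    uint32_t ib_len;   /* length of the indirect buffer, in words */
};

struct ring {
    struct ring_entry entries[RING_ENTRIES];
    uint32_t head;  /* advanced by the GPU as it consumes entries */
    uint32_t tail;  /* advanced by the driver as it submits entries */
};

/* Returns 0 on success, -1 if the ring is full. One slot is kept
 * empty so that "full" can be told apart from "empty". */
static int ring_submit(struct ring *r, uint32_t ib_addr, uint32_t ib_len)
{
    uint32_t next = (r->tail + 1u) & (RING_ENTRIES - 1u);
    if (next == r->head)
        return -1;
    r->entries[r->tail].ib_addr = ib_addr;
    r->entries[r->tail].ib_len  = ib_len;
    r->tail = next;  /* single 32-bit store: atomic on most hosts */
    return 0;
}
```

The tail update is the last step so the GPU never observes a partially written entry.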

Hamish speculated that with queuing it would be possible to prioritise blocks:
Have n+1 priorities of command block. One is reserved as the highest. This could be used to execute something NOW! Without waiting. (Or perhaps execute as the next block if the logic to return is too great.) You could then optionally have multiple levels of priority for command blocks... The way I see it, any interrupt saved is a bonus. Having the card able to do as much as possible for scheduling itself has got to save system CPU.

Timothy felt this shouldn't be done in hardware:
Never have the hardware do something the driver can do perfectly well. This is one of those high-level things that can be handled in software with trivial overhead, so we should do it that way. ...Worst case, we'll have the driver time how long each app is tying up the GPU and if someone's generating too many commands (or commands that take a long time to render), then we'll drop its priority. ... If two applications are trying to draw at the same time, the driver could give them virtual slices and priorities. It won't be perfect (numbers of commands rather than amounts of time), but there's only so much that is reasonable. Aside from video interrupt, I think we need two engine interrupts. One is "dma idle", and the other is "engine idle".

What we do is this: In a page of DMA buffer, we have some shared variables between GPU and CPU. As the GPU consumes ring buffer entries, it'll periodically update a shared variable that indicates the "head" of the queue, where words are extracted from the circular buffer. Similarly, as the driver fills commands into the ring buffer, it'll update the "tail" pointer to indicate to the GPU where the end of the queue is. Whenever the GPU runs to the end of what it thinks locally is the tail pointer (no more DMA reads), it'll reread the pointer in the host. If what it reads is different from the old value, it keeps going. If what it reads is the same, it stops (nothing more to do) and raises an interrupt. As long as the updates to the tail pointer can be done atomically (which, as a 32-bit word, they would be on most architectures), then we can keep the GPU going continuously without ever issuing an expensive PIO. If the interrupt arrives, it means that the GPU has stopped DMA and won't be trying again automatically, so we'll have to issue a PIO to get it going again when there's more to do. Any other engine-related interrupts will be ones inserted as commands into the queue, as suggested by Jon Smirl. ... If not done right, there could be a race condition between writing the tail pointer and getting the interrupt, but this is surely a solved problem. ...

We have two possible race orderings: the interrupt arriving before the tail-pointer write, or the tail-pointer write landing before the interrupt.
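The "lost wakeup" handling described above can be modelled on the host. This is a sketch under assumed names; the doorbell here is a counter standing in for the expensive PIO write:

```c
#include <stdint.h>

/* Shared state lives in a DMA-visible page, as described above. */
struct shared {
    volatile uint32_t head;  /* written by the GPU */
    volatile uint32_t tail;  /* written by the driver */
};

/* In real hardware this would be a PIO doorbell register write; here
 * it is a counter so the logic can be exercised on the host. */
static unsigned doorbell_writes;
static void doorbell(void) { doorbell_writes++; }

/* Driver-side submit: publish the new tail first, then kick the GPU
 * only if the "engine idle" interrupt has told us it stopped. Order
 * matters: the GPU rereads the tail before stopping, so a tail
 * written before the idle interrupt is never lost. */
static void submit(struct shared *s, int *gpu_idle, uint32_t new_tail)
{
    s->tail = new_tail;  /* single 32-bit store, atomic */
    if (*gpu_idle) {
        *gpu_idle = 0;
        doorbell();      /* expensive PIO, issued only when needed */
    }
}
```

While the GPU is still running, submits cost only a memory write and no bus contention, which is the point of the scheme.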

Jon pointed out that network cards use several techniques for DMA interrupt mitigation. Timothy also noted there were some differences to be aware of:
For a NIC, the command data is small compared to packet data. For a GPU, the command data is all you have, and there's huge amounts of it.

Suggestions for Starting the OGD1 Drivers

(For some graphics cards) You can do every command externally by turning off the command stream and writing the registers via PIO. That is a good way to test the chip. This model is why Radeons have 2,000 registers.

Timothy agreed:
That's how we'll get it up and running on the earliest drivers. Throwing in DMA too soon adds too many variables and too many things to break. We discourage direct PIO writes to the register set, but they can be done.

The state of Linux graphics

Richard posted a self-confessed rant covering many aspects of Linux and its graphics drivers. Here are some snippets:
Windows wanted video drivers, but I don't think they demanded that anyone create them ... Instead, Windows just made things easy. DOS developers switched to writing Windows games because Windows supplied easy memory access and easy graphics access. Office application developers switched to Windows because Windows supplied easy printer access and easy network access. Linux on the other hand doesn't want to make anything easy... Create an easy, standardized, and well documented kernel video driver interface that allows card manufacturers to easily create a video driver for Linux. The manufacturers don't have to create the drivers now or even ever, but the interface needs to exist. This would be a simple interface that allows someone to copy a file from their video card's CD-ROM, tell Linux to load it, and then it's their video driver. It would not require the driver to be open source. I've actually done graphics in Linux, and so when I say it's (bad), if you haven't created your own complete graphical system which runs under Linux, don't even begin to tell me I'm wrong, because you simply do not know what you are talking about.

Don't misjudge the OGP. We don't think we're going to solve all problems. As with any open source project, we're here to scratch an itch. (...) part of the problem (with a fixed interface) is that we're taking away from Linux one of the things that makes it great: open development. With (a) pcode solution, we could solve a lot of the crash and general quality problems of closed-source drivers by making the pcode virtually run in a sandbox. But there's no room for Free Software developers to improve the drivers, taking away part of the biggest benefit of Free Software. (...) This defeats the whole idea behind Free Software and giving one full control over one's computer. I honestly don't think this has anything to do with closed or open drivers. ... And I agree that APIs on both sides of things need to be created that simplify driver development, app development, and contention for the graphics resources among multiple applications (among other things). Basically, graphics needs to be CENTRALIZED (virtually if not physically), and everything that wants to do graphics can use that centralized facility. Give that facility the right level of abstraction, simplicity, and power, and you'll seldom have to look back. Just the way we already do with audio and networking, but better.

The Challenges of using a standard API
Jack speculated about the concept of a standard API vs. the Linux method of development:
The only way I can think of to resolve this paradox is to allow the API to evolve only in discrete releases. Each released version could be implemented in a kernel module with a different name. That way, an API once released would be frozen for all time, and a driver writer could write for that API with confidence that it would never be forced into obsolescence or require updating because of kernel or X changes. A driver writer who needs features not included in that API would write for a later API. The kernel wouldn't become bloated, because Linux normally loads only the modules it needs. Drivers are able to command specific modules to load.

He later added:
There's something else to consider. Graphics is an option. Some platforms that run Linux don't have it at all, and yet they must provide a console at boot time.

Lourens pointed to one big concern in designing an API:
Do you know what a graphics card will look like in five years? That's about 8 generations of hardware. I doubt you'll be able to design an interface now that will remain perfectly adequate over that timeframe...

Jon suggested:
Check out the exokernel model. Exokernel is what DRM/DRI does today. Each of the DRM kernel drivers has a completely different IOCTL interface. The kernel DRM driver for the ATI R200 looks nothing like the Intel i925 driver. Mesa is a complete software implementation of OpenGL. But Mesa is designed so the software functions can be overlaid with functions that implement hardware acceleration. Each DRM driver has a corresponding userspace DRI library. These libraries overlay Mesa functions with hardware accelerated functions. For example the intel_dri module contain hardware texture functions that make calls into the intel DRM kernel module. If the intel_dri module isn't loaded you will fallback to software implementations of the same functions. The result of this is that there is a single public user space API, OpenGL. But all of the kernel modules for each video card are different.
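The function-overlay idea Jon describes can be sketched as a dispatch table that starts out pointing at software fallbacks and is overridden when a hardware module loads. The names and return codes here are illustrative, not the actual Mesa/DRI interfaces:

```c
#include <stddef.h>

typedef int (*tex_upload_fn)(const void *pixels, int w, int h);

/* Software fallback: returns 1 to mark "rendered on the CPU". */
static int sw_tex_upload(const void *pixels, int w, int h)
{
    (void)pixels; (void)w; (void)h;
    return 1;
}

/* Accelerated path: in a real DRI module this would call into the
 * kernel DRM driver; returns 0 to mark "handed to hardware". */
static int hw_tex_upload(const void *pixels, int w, int h)
{
    (void)pixels; (void)w; (void)h;
    return 0;
}

struct dispatch {
    tex_upload_fn tex_upload;
    /* ...one slot per overridable OpenGL-level operation... */
};

/* Default table: everything runs in software. */
static struct dispatch dtable = { sw_tex_upload };

/* Called when a hardware driver module is loaded: it overrides only
 * the entries it can accelerate. */
static void load_hw_driver(void)
{
    dtable.tex_upload = hw_tex_upload;
}
```

The application-facing API never changes; only the table entries behind it do, which is why one public OpenGL interface can sit over many incompatible kernel drivers.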

Why Does Linux Sometimes Break Binary Drivers
Roger L., replying to a poster who wondered why Linux appears to keep changing the interfaces and breaking binary drivers:
Binary-only modules are certainly _not_ considered first class citizens. (...) they (the kernel developers) certainly don't increase their own workload just to spite the maintainers of binary-only drivers. But since they care little or nothing about binary-only modules, the cost of changing internal interfaces is limited to changing the drivers that are in the tree. I've done it myself, it's no big deal and pretty useful sometimes. No compatibility (that we care about) is broken, since all in-tree code is fixed. It's one possible approach, and there are pros and cons. (...) There are undeniable advantages to stable in-kernel interfaces and portable drivers, and there are downsides, too. Linux developers looking at this trade-off tend to come to a different conclusion than you do.

OGP and Linux Graphics
Rene brought the discussion back on track:
Let's postpone fixing the software side of Linux graphics until after this hardware is a reality, after which it can quite possibly serve as a great testing ground for ideas emanating from here, from X.org, from the framebuffer crowd and from random geniuses.

OGP is about producing hardware that works regardless of the software situation.

OGD Architecture

A query was raised:
Why does a "3D" window frame need to have special hardware? You create a window and it sits there for minutes, hours, days.(...) Let the CPU do that, and give hardware assist to video, which has to decode the input and change the display 60 times a second.

Without acceleration, you can visibly see the drawing that occurs when something is painted in the first place. More bothersome, lots of people like opaque window moves, so bitblt must be accelerated. Scrolling is also common, so more bitblt. Window backgrounds are often painted to a solid color before other things are drawn, so solid fill is important. Those are really the most critical 2D effects. With 3D, your screen (or window) is completely redrawn for every event, making acceleration critical. Another reason for hardware-assist is to reduce CPU overhead for these things, making your whole system more responsive. The difference between 3D graphics and video is that the video image has to come from the host all of the time. With 3D rendering, a lot of textures and stuff are loaded into graphics memory and reused a lot, with the major bus traffic there being rendering commands.
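As a software reference for the operations named above, a solid rectangle fill is just this loop; the point of acceleration is that the engine does it instead of the CPU. The framebuffer layout here (32-bit pixels, pitch in words) is an illustrative assumption:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a w-by-h rectangle at (x, y) with one color. fb is a linear
 * framebuffer of 32-bit pixels; pitch is the row stride in pixels. */
static void solid_fill(uint32_t *fb, size_t pitch,
                       size_t x, size_t y, size_t w, size_t h,
                       uint32_t color)
{
    for (size_t row = y; row < y + h; row++)
        for (size_t col = x; col < x + w; col++)
            fb[row * pitch + col] = color;
}
```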

Possible hardware support for vblank

vblank-oriented things ARE something I might want hardware support for. If you can be sure you're staying just behind the vertical retrace, you can draw directly to the screen without tearing and have an entire frame to do it. However, many times, we will be drawing to a back buffer and just swap front and back buffers in a vertical blank interrupt, making that not so useful.


At the moment, we have a video controller that can't scale. That doesn't mean I have rejected the idea. It just means that's what's in our code library right now. There's a while yet before we finish designing all of TRV10, so there's time yet to implement a new video controller.

He then let slip some news we all were hoping for...
In the short term, that's not a priority for me, because OGD1 prototypes will be arriving soon, and we need the minimum necessary to debug the board. (...) With a modern GUI, the video image isn't sent directly to the screen anyhow. It's decompressed to an off-screen buffer and then composited with the screen by the drawing engine... where you can scale it all you want.

AGP, PCI, and PCI Express

A discussion arose about which bus to support: Which bus interface is the card going to have? PCI is hopelessly slow for graphics.
No, it's not. Yes, it's (...) slow, but we've already done the calculations, and it's just barely fast enough.

The poll was done many months ago. For development and for the first run, PCI is more than enough.


MANY commands take less time to submit than they do to execute, especially if PCI is the bus. With 3D, we'll get lots of tiny triangles that are also faster than the bus. Packing is vital for throughput. Plus, why waste the bandwidth, even if you have it to burn? (...) I'm planning to use an onboard microcontroller (that I've been working on) to manage DMA. It's a simple matter of writing some RISC code (once) to process it all. Here's the approach I have in mind: Each packet has a 32-bit header. Some bits are reserved for the packet type, some are reserved for the packet length (not everything submitted--just this one command), and the rest depend on the packet type. For a drawing command packet, those remaining bits will usually be used as flags to indicate which of the many possible attributes are also contained in subsequent words of the packet header.

Consider a 2D solid rectangle fill. Upper left corner (32 bits) and width/height (32 bits) are mandatory. But there are attributes, like foreground color, that are optional because it's common to draw a bunch of rects all of the same color. Furthermore, the packet length is used to indicate how many rectangles (all with the same set of attributes) are in the packet.

You might define another packet type that lets you specify multiple rectangles, but all with their own attributes (which ones being specified again by the flags in the header). We'll define a set that makes sense to us. Some of the packet types will exist just to set some of the more obscure attributes in an efficient way. (...) I'm going to try to make it easy for software to keep track of this stuff so that software is in charge of changing the context when appropriate. I don't want the GPU to have to track multiple contexts. And we can do this without ever reading from the GPU. Have shadows of the GPU state registers, global and local, and the driver will insert appropriate packets into the ring buffer just before the indirect command for the client. However, this is not yet fixed in stone; we will carefully test and use what works in the real world, as Timothy explained: There is much room for experimentation to see what is most efficient in terms of bus bandwidth, what we can fit into the microcontroller's program file, and what works well for host software.
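One possible encoding of the 32-bit header described above can be sketched as follows. The field widths (8-bit type, 8-bit length, 16 flag bits) and the example type and flag values are illustrative assumptions; as Timothy says, the real split was still open to experimentation:

```c
#include <stdint.h>

/* Illustrative split of the 32-bit packet header: type in the top
 * byte, length in the next byte, attribute flags in the low half. */
#define PKT_TYPE_SHIFT  24u
#define PKT_LEN_SHIFT   16u
#define PKT_LEN_MASK    0xFFu
#define PKT_FLAGS_MASK  0xFFFFu

/* Example values, purely hypothetical. */
#define PKT_TYPE_RECT_FILL  0x01u
#define PKT_FLAG_FG_COLOR   (1u << 0)  /* a foreground-color word follows */

static uint32_t pkt_header(uint32_t type, uint32_t len, uint32_t flags)
{
    return (type << PKT_TYPE_SHIFT)
         | ((len & PKT_LEN_MASK) << PKT_LEN_SHIFT)
         | (flags & PKT_FLAGS_MASK);
}

static uint32_t pkt_type(uint32_t h)  { return h >> PKT_TYPE_SHIFT; }
static uint32_t pkt_len(uint32_t h)   { return (h >> PKT_LEN_SHIFT) & PKT_LEN_MASK; }
static uint32_t pkt_flags(uint32_t h) { return h & PKT_FLAGS_MASK; }
```

For the rectangle-fill example in the text, a packet carrying one shared color and several rectangles would set `PKT_FLAG_FG_COLOR` and use the length field to say how many rectangles follow.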

OGH Address Space

A question was raised: What does the OGH address space look like?
The way we've spec'd it is this: The drawing engine doesn't have access to host memory, but our DMA engine does. We'll decide on something for the size of the host-accessible memory aperture (PCI BAR). It will be smaller than the size of the graphics memory space. We discourage direct framebuffer access, but there will be cases where we MUST do software rendering. We will provide an offset register that software can use to set an appropriate window into the graphics memory.
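The windowed aperture described above can be modelled like this. The 16 MiB size, the alignment policy, and the names are illustrative assumptions, not the spec'd values:

```c
#include <stdint.h>

/* The PCI BAR is smaller than graphics memory; an offset register
 * slides the window over the larger space. */
#define APERTURE_SIZE (16u << 20)  /* illustrative: 16 MiB BAR */

static uint32_t window_offset;  /* stands in for the offset register */

/* Return the BAR-relative offset for a graphics-memory address,
 * repositioning the window (an expensive register write on real
 * hardware) only when the address falls outside it. */
static uint32_t aperture_map(uint32_t gfx_addr)
{
    if (gfx_addr < window_offset ||
        gfx_addr >= window_offset + APERTURE_SIZE)
        window_offset = gfx_addr & ~(APERTURE_SIZE - 1u);
    return gfx_addr - window_offset;
}
```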

If you want to move data between host memory and graphics memory, we strongly encourage you to use the DMA engine. The DMA engine has access to the full host address space (to the extent that the IOMMU gives it) and to the full graphics memory space. In that case, we're dealing with two pointers (host and graphics) and a number of words to move. (In addition to simple memory moves, there will be other DMA commands to move data with more complex operations.) Software is responsible for memory management, i.e. where off-screen surfaces are located in graphics memory. If you want to use a drawing surface as a texture or bitblt source or a drawing target, it must be located in graphics memory. In this generation, we will not be providing a way for the GPU to use textures directly out of host memory. Every objection will be met with "we can add more graphics memory."
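A DMA move command as described above boils down to two pointers and a word count. This struct is a hypothetical layout for illustration, not the actual command format:

```c
#include <stdint.h>

/* Hypothetical DMA move descriptor: host pointer, graphics pointer,
 * and a count of 32-bit words. The flags word is a placeholder for
 * direction and the more complex operations mentioned above. */
struct dma_move {
    uint64_t host_addr;   /* bus address, as seen through the IOMMU */
    uint32_t gfx_addr;    /* offset into graphics memory */
    uint32_t word_count;  /* number of 32-bit words to move */
    uint32_t flags;       /* direction, operation variant, etc. */
};

/* Bytes a descriptor will move across the bus. */
static uint64_t dma_bytes(const struct dma_move *d)
{
    return (uint64_t)d->word_count * 4u;
}
```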

Jon asked:
Can the drawing engine copy memory around in the full VRAM space? The bitblt engine should be able to do this.

Yes. And I hope I don't forget this, but I want two kinds of copy. One is a rectangular bitblt. The other is more of a linear memory move. Don't let me forget to add the latter. :-)
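The two copy flavours just mentioned differ as follows, shown here as the host would model them (a linear move versus a rectangular blit through a row pitch; the 32-bit-pixel layout is an assumption):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Linear memory move: one contiguous run of words. */
static void linear_move(uint32_t *dst, const uint32_t *src, size_t words)
{
    memmove(dst, src, words * sizeof(uint32_t));
}

/* Rectangular bitblt: copy a w-by-h block row by row, where each
 * surface has its own pitch (row stride, in pixels). */
static void rect_blit(uint32_t *dst, size_t dst_pitch,
                      const uint32_t *src, size_t src_pitch,
                      size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch,
               w * sizeof(uint32_t));
}
```

A blit with `w == pitch` on both sides degenerates into the linear case, which is why hardware that only has the rectangular form can emulate the linear one, just less conveniently.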

GART is used to deal with the fact that you usually can't allocate linear system memory. Instead you allocate a bunch of random pages and the GART hardware makes them look like linear memory so that the graphics DMA hardware will work. The GART mapping appears in the PCI address space. When you DMA you use addresses in the GART range not true system addresses. I don't believe any special hardware is required for a PCI only card to make use of GART remapping. (...) GART has been generalized on PCIe to become the PCIe IOMMU. (...) The GART hardware is almost always used since the CPU paging hardware has scrambled any images in system RAM. Resolving this scrambling is why normal DMA is seldom used. (...) GART memory is easy to use, just ask the kernel for a chunk of it. It will appear to the app as contiguous memory since the kernel GART driver will set up the page tables in the app.

Jon explained further about GART:
The kernel graphics driver has to turn the GART region address into internal GPU address space. But the drivers are smart and mapped the GART region to the same address in internal space as it is in GART space. So you just copy the address into the GPU command. Worst case you add a fixed offset to it. (...) Nobody but the kernel GART driver had to deal with the pages being scattered everywhere. (...) You have to ask the kernel GART driver to allocate the RAM for you since it needs to be marked non-pagable and non-cachable. Normal app memory is both pagable and cachable. (...) Access to this memory is not that bad since it has a MTRR set for write combining. You mostly write to it and not read from it.

There will be some reads, now and then (for the shared pointers, for instance), but they're rare. In this case, we waste only hundreds of cycles instead of thousands.
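The GART remapping Jon describes above can be sketched as a per-page translation table: physically scattered pages are presented to the device as one linear region. Sizes and names here are illustrative:

```c
#include <stdint.h>

#define GART_PAGE_SIZE 4096u
#define GART_PAGES     8u   /* illustrative table size */

/* Each entry holds the physical (bus) address of one page; the pages
 * themselves can be scattered anywhere in system RAM. */
static uint64_t gart_table[GART_PAGES];

/* Translate a linear GART offset into the scattered physical address,
 * as the GART hardware does for device DMA. */
static uint64_t gart_translate(uint32_t gart_off)
{
    uint32_t page = gart_off / GART_PAGE_SIZE;
    return gart_table[page] + (gart_off % GART_PAGE_SIZE);
}
```

This is why, as Jon says, only the kernel GART driver ever deals with the scattering: everyone else addresses the linear aperture.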

RGB conversion

I want to provide, among other things, YUV/RGB and YUYV/RGB conversion support. The latter requires half the bandwidth.
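The conversions mentioned above can be sketched with a common integer BT.601 formulation; this is one standard way to do it, not necessarily the arithmetic OGD1 would use:

```c
#include <stdint.h>

static uint8_t clamp8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Integer BT.601 YUV -> RGB (a common fixed-point formulation). */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = y - 16, d = u - 128, e = v - 128;
    *r = clamp8((298 * c + 409 * e + 128) >> 8);
    *g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8);
    *b = clamp8((298 * c + 516 * d + 128) >> 8);
}

/* A YUYV macropixel is 4 bytes (Y0 U Y1 V) covering two pixels: the
 * chroma is shared between the pair, so it is 16 bits per pixel --
 * half the bandwidth of 32-bit RGB. */
static void yuyv_to_rgb(const uint8_t yuyv[4], uint8_t rgb[6])
{
    yuv_to_rgb(yuyv[0], yuyv[1], yuyv[3], &rgb[0], &rgb[1], &rgb[2]);
    yuv_to_rgb(yuyv[2], yuyv[1], yuyv[3], &rgb[3], &rgb[4], &rgb[5]);
}
```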

Hardware Selection and Revision

Hardware Clarifications

GDDR vs DDR Memory

A poster noticed that some graphics cards are now using GDDR RAM and wondered if OGD1 would benefit from changing from DDR to GDDR.
There is no functional difference, except that GDDR RAM can run at higher clock frequencies. With an FPGA, the maximum clock frequency we can hope to achieve is 200MHz (400 Mbits/s), so there is no need to fit parts that clock higher than that, which is within the limits of regular DDR SDRAM. For the ASIC version of the card, we will use GDDR.

Mini FAQ for the Open Graphics Mailing list.

Where is the list?
How can I get it?
Using your favourite newsreader, point it toward news.gmane.org and subscribe to comp.graphics.opengraphics.
Which newsreader can I use?
There are many. Specialist software such as Pan (Linux & Windows) or Gravity (Windows) are well known, but there are many others, and often your favourite email software can mostly function as a newsreader.
Can anyone post?
Yes, though the list is moderated.

Suggestions for this newsletter are welcome and are made through the mailing lists.

Created by josephblack. Last Modification: Friday 03 of September, 2010 18:29:36 UTC by josephblack.