April 29th 2006 Open Graphics Project Newsletter
Table of contents
45 Posts covered in this Newsletter
OGP Alternative Usages: the wild, the crazy and the interesting
- OpenGL-enhanced low-fat computer as a graphics board. This would make an interesting project for a university.
- Minimal Instruction Set Computer (MISC).
Progress and developments
OGP Logo
Further discussion continued over the OGP logo. You can see the original suggestions
After considering the suggestions and the resulting discussion, Apostolos B. posted a revision.
P. Brett noted a factor involved in the selection of a logo design: (The latest design...) would be quite suitable for printing onto the physical device.
Have a look and add your suggestions and comments to the list.
Thanks to Attila for hosting these pictures.
Shaders
H. Fisher explained a little about the recent technical discussion on the list, comparing how nVidia and ATI GPUs are designed with regard to shaders.
Each of those 48 shader units has a four-way vector ALU. The management problem is dividing up each vertex of a triangle, or each fragment within a scan line, to be assigned to a different shader unit. Each of the shader units is executing its own program. It's the same kind of multiprocessing problem as dividing up, say, SETI or similar parallel problems across many CPUs.
What we are discussing is how to execute the instructions WITHIN a single shader unit.
The argument for vectors is that, with today's GPUs, the majority of the instructions are four-way vector ops to begin with. It's not like regular C/C++/Java code where the compiler has to try and identify vector-like sequences. In shader code, every vector value and vector operation is explicitly declared/written in the source code, so it's easy for the assembler/compiler to generate vector ops. This also makes it easy to translate shaders into SSE or AltiVec code sequences on conventional CPUs, which is how various pure software pixel shader implementations get their performance.
Nico: Most of the proposed ASM instructions are vector ones. But ... you see a lot of MOV, a lot of partial vector use, scalar mul, etc.
Maybe we should read more shader code, but the (...example...) shown here is pretty full of scalar ops (so this code does not fit well inside a vector FPU).
H. Fisher: A GPU has to execute exactly the same shader program for every pixel in a given triangle/primitive. There is a small amount of data that varies for each primitive: the coords/normal/tex coords at vertex level and color/tex coords for fragments, which is about a dozen 4x32-bit registers at most. There's a K or two of OpenGL state that the shader can read but not write to as well, plus a K (?) or so of app state with the same restriction.
Now that shaders have branches it's not guaranteed that they all execute in lockstep, but there is a very high probability that all the execution units will need to read from the same memory location at the same time. Brute force replication might work better than dynamic scheduling.
T. Miller: You have a point. Aside from a small possibility for variation in instruction sequence, if one pixel's shader needs the vector multiplier, then they all do, at the same time. But what I was thinking was that if they all needed the vmul unit on one cycle but not on the next, then two of the threads' instructions could be scheduled on one cycle and two on the next. What are the chances that we'll get a long stream of vmuls all in a row with no breaks? In that case, it would definitely be better to have four completely independent functional units.
H. Fisher: Four vmuls (actually dot products) in a row is very common for matrix multiplies. The sample shaders I've got, from the OpenGL Shading Language book and GPU Gems, are all very math intensive. I doubt you're going to be able to share ALUs between threads. On the other hand, condition/branch logic probably could be. But on the gripping hand any statistics from generation 1 and 2 shaders are going to be biased in favour of math ops because that was before branches became widespread. So it is possible that shader code will have an instruction mix more like generic C/C++ over the next few years. I'd bet on heavy floating point staying though.
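To make the "four dot products in a row" pattern concrete, here is a minimal sketch of a 4x4 matrix times a 4-vector written as four four-way dot products (the ARB vertex program instruction set has a DP4 instruction for exactly this shape of operation). The function names and the example translation matrix are illustrative assumptions, not code from the list.

```python
def dot4(a, b):
    """Four-way dot product: multiply lane by lane, then sum the four lanes."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def transform(matrix_rows, v):
    """Multiply a 4x4 matrix (given as four rows) by a 4-vector.

    Each output component is one dot4, so the vector multiplier is
    needed on four consecutive instructions.
    """
    return [dot4(row, v) for row in matrix_rows]


# Hypothetical example: translate a point by (10, 20, 30).
translate = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
position = [1.0, 2.0, 3.0, 1.0]
transformed = transform(translate, position)
```

With a stream like this there is no idle cycle in which another thread could borrow the multiplier, which is the situation H. Fisher describes.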
T. Miller felt that there are definitely some things we would want to do about multiple threads accessing the same (or nearby) memory locations.
H. Fisher: You'll probably get some sequential access patterns across threads rather than within them. If a horizontal span of fragments is being done in parallel by 2/4/N threads, it's quite likely (especially for a 2D GUI) that thread #0 will need texel P+0, thread #1 P+1, ...
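The access pattern described here can be sketched as follows. This is only an illustration: the idea that the memory system fetches texels in aligned bursts, and the `burst_width` parameter, are assumptions made for the example and were not specified in the discussion.

```python
def texel_addresses(base, num_threads):
    """Thread #t reads texel base + t, as in a horizontal span of fragments."""
    return [base + t for t in range(num_threads)]


def can_share_one_fetch(addresses, burst_width):
    """True if every address falls inside the same aligned burst.

    burst_width is a hypothetical number of texels returned by one
    memory fetch; aligned bursts start at multiples of burst_width.
    """
    first_burst = min(addresses) // burst_width
    last_burst = max(addresses) // burst_width
    return first_burst == last_burst


aligned = texel_addresses(64, 4)      # texels 64..67
unaligned = texel_addresses(66, 4)    # texels 66..69
```

In this sketch the aligned span is served by a single fetch, while the unaligned span straddles two bursts, which is one reason sequential access across threads is worth designing for.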
MISC designs vs RISC
The group also considered the possibility of a MISC (Minimal Instruction Set Computer) design as opposed to a RISC (Reduced Instruction Set Computer) design.
R. Medland started a rather technical discussion: Is a MISC stack processor an option for us? That is to say, are the stack-paradigm drawbacks greater than its advantages? Is an FPU-based stack processor feasible? In the works I've been looking through, the FPU had always been removed for the sake of simplicity, but it doesn't look like this is an option for us. To give more precise arguments, I would like to point to this MISC vs. RISC vs. CISC discussion and attached proposal: http://ultratechnology.com/mpu21.htm
For those who want an idea of why one would challenge regular designs based on C, tons of registers and caches, I'd like to point to this essay: http://ultratechnology.com/lowfat.htm
A whole bunch of processors were designed based on these ideas, including opencore Yoda. See:
He then added: This was for the shader block.
T. Miller: In a MISC design, if you want to add, you specify the source operands (probably general purpose registers) to copy into "input registers" for your adder. That's two moves (which you can do in one instruction). Later, you can pop out the result and move it back to a GPR. (Another move.) So here's your code:
mov rA0 <- r1, rA1 <- r2
...mov r3 <- rA2
Let's say you have a register space of 256 registers, so each instruction takes 16 bits. The two together require 32 bits. Now, let's consider a RISC design. In this case, you don't need so many registers, just the GPRs:
add r3 <- r1,r2
If the add opcode is 4 bits and the three operands are each 4 bits, then you need 16 bits to encode this. The point to take is that you need twice as many bits to encode the same "instruction". With MISC, all you've done is move the upper nybble of the add ports (0xA) from the three operands into the one opcode. It may very well be worth it to use the extra bits (something I've hinted at earlier), but keep in mind where your redundancies are and make sure they're a net gain. Oh, one other thing: Compilers have a hard time with special-purpose registers.
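The bit counts in this comparison can be checked with a few lines of arithmetic. The sketch below takes the quoted figures as given (16 bits per MISC instruction, two instructions for the add); how the dual move fits into 16 bits is not spelled out in the post. The concrete port addresses 0xA0 to 0xA2 are an assumption for illustration: the post only states that the add ports share the upper nybble 0xA.

```python
NUM_REGISTERS = 256
# Bits needed to name one register out of 256.
REG_BITS = (NUM_REGISTERS - 1).bit_length()

# A single source/destination register pair.
MOVE_PAIR_BITS = 2 * REG_BITS

# MISC, using the post's accounting: two instructions of 16 bits each.
MISC_INSTRUCTION_BITS = 16
misc_total_bits = 2 * MISC_INSTRUCTION_BITS

# RISC: "add r3 <- r1,r2" with a 4-bit opcode and three 4-bit operands.
OPCODE_BITS = 4
OPERAND_BITS = 4
risc_total_bits = OPCODE_BITS + 3 * OPERAND_BITS

ratio = misc_total_bits // risc_total_bits

# Hypothetical addresses for the adder ports rA0, rA1, rA2.
adder_ports = {"rA0": 0xA0, "rA1": 0xA1, "rA2": 0xA2}
# The upper nybble of every adder port says "this is the adder".
upper_nybbles = {addr >> 4 for addr in adder_ports.values()}

print(f"MISC: {misc_total_bits} bits, RISC: {risc_total_bits} bits, ratio {ratio}x")
```

The shared upper nybble is the redundancy T. Miller points at: in the RISC encoding the same information appears once, as the opcode, instead of once per operand.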
The technical discussion continued
Floating Point vs Fixed Point
A poster asked: is floating point a "must", or would fixed point be possible? Fixed point would simplify the logic a good deal.
T. Miller: We had some discussions early on in the project, and people made some solid arguments for cases where fixed just would not do. Perspective correction is one of them. OGA is virtually float all the way through, but there are points in the pipeline where we no longer need the precision and convert to fixed. The head of the pipeline, however, is all float (except target screen coordinates).
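As a rough illustration of why perspective correction is awkward in fixed point, the sketch below interpolates a texture coordinate u by interpolating u/w and 1/w linearly and dividing. The post does not give the reasoning, so this is one plausible reading: the reciprocals become very small for distant vertices. The 8-bit fractional format is a hypothetical one, chosen deliberately narrow so the effect is visible; it is not a format proposed for OGA.

```python
FRAC_BITS = 8  # hypothetical fixed-point format: 8 fractional bits


def to_fixed(x):
    """Quantize x to the nearest multiple of 2**-FRAC_BITS (as an integer)."""
    return round(x * (1 << FRAC_BITS))


def from_fixed(q):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << FRAC_BITS)


def perspective_u(u0, w0, u1, w1, t, quantize=False):
    """Perspective-correct interpolation of u between two vertices.

    u/w and 1/w vary linearly in screen space, so both are interpolated
    linearly and then divided. With quantize=True the per-vertex terms
    are first rounded to the hypothetical fixed-point format.
    """
    a0, a1 = u0 / w0, u1 / w1
    b0, b1 = 1.0 / w0, 1.0 / w1
    if quantize:
        a0, a1, b0, b1 = (from_fixed(to_fixed(v)) for v in (a0, a1, b0, b1))
    numerator = a0 + (a1 - a0) * t
    denominator = b0 + (b1 - b0) * t
    return numerator / denominator


# Near vertex at w = 1, far vertex at w = 1000, sampled close to the far end.
u_float = perspective_u(0.0, 1.0, 1.0, 1000.0, 0.999)
u_fixed = perspective_u(0.0, 1.0, 1.0, 1000.0, 0.999, quantize=True)
```

In this example the far vertex's u/w and 1/w both round to zero in the narrow fixed-point format, so the interpolated coordinate collapses to zero while the floating-point result is close to one half.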
Features of a modern programmable GPU (Graphics Processing Unit)
H. Fisher: Anybody else who wants to know what feature set is required for a programmable GPU needs to read
ARB_vertex_program specification http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_program.txt
ARB_fragment_program specification http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt
You can skip the early sections labelled "Issues" and "Additions to Chapter X of the OpenGL Specification" on a first reading, but you should read them afterwards because a lot of design choices are explained there. These date from 2003 so only represent the absolute minimum functionality for a 1st generation GPU. But that's enough to drive a lot of design choices: in MISC vs RISC, or fixed vs float, if it can't implement the ARB vertex and fragment shaders, it's a no go. (Note that these describe virtual machine instructions and that shaders are assembled at runtime on the target GPU, so it's OK for one ARB instruction to be translated into more than one actual GPU operation.)
He then suggested: After designing a GPU that can implement the ARB low level shaders, buy a copy of
- OpenGL Shading Language, 2nd edition, Randi J. Rost, Addison-Wesley