Knebel Posted December 4, 2006 (edited)
EbonySeraphim's "life's work" over on PS3forums is probably worth a look when you're after this kind of thing: http://www.ps3forums.com/showthread.php?t=22858 I'm not entirely sure which chart is the best fit, but you'll probably find something either on the first or the last page of the thread.
Edited December 4, 2006 by Fhj
skyggefreaken Posted December 4, 2006
Over on AVSforum's Xbox forum they are trying to compare the processors in the PS3 and the X360. Recommended reading for the somewhat more advanced among you. Link

This document is a long one, aimed at the semi-technical reader. It goes through basic memory and processor theory so that those readers can grasp the more complex aspects of branch prediction. If you are highly technical, you may wish to skip further down in the document.

Different articles offer conflicting opinions about branch prediction in the SPEs, such as the following. SPEs cannot assist with AI code since an SPE does not have branch prediction capabilities. Compiler options exist that will insert branch prediction instructions into the application to assist with branch prediction.

Because of the second statement I have always defended the SPE design's lack of branch prediction hardware. Even the first statement is not true: despite the lack of branch prediction capabilities, AI code could still be offloaded to the SPEs as long as the SPEs were executing in parallel with the PPE core. Even if an SPE ran at half the speed of the PPE or less, it would still produce beneficial results as long as it was executing in parallel. However, it would not make sense to execute AI code on an SPE if the PPE was idle.

Another question is how IBM created an SPE core using less than half the number of transistors of the PPE core. Removing only the branch prediction capabilities would not accomplish that. Was there some association between the branch prediction capabilities and the vastly reduced transistor count? The only way to determine that is to look at the design of the SPE core and compare it to the design of the PPE core. However, the design specifications available to the public are limited in detail, so there is no absolute way a person can be 100% sure.
With enough experience in hardware design, though, a person can make assumptions about how the timings are produced, and those assumptions will more than likely be true. After a thorough investigation of the design it becomes obvious that many of the transistors that were removed have a major impact on branch prediction, and indeed on any sort of memory access performed by the SPE.

Basic Memory Concepts for Processor Access

Memory access times are the only reason branch prediction is performed. If memory access were instantaneous or nearly instantaneous, branch prediction would not be necessary. Therefore a basic understanding of memory concepts is needed to understand the performance impact of going without branch prediction.

For processors, memories are layered in the system in such a way as to give the illusion of high speed even for the slowest memories. The layers are arranged from slowest to fastest so that the processor usually communicates only with the fastest memory, which is fed by a slower memory, which is fed by an even slower memory, until the chain reaches the slowest memory. Each memory gets smaller in size as the chain goes from slowest to fastest. With a proper hardware design a processor may seldom stall (become idle) waiting on memory. With a poor hardware design, with missing or incorrectly sized memory in the chain, or with too slow a memory in the chain, the processor may stall on a regular basis. In the worst designs (fortunately none of that type are produced today) processor throughput can slow to a crawl (99% of the power could be lost), while in the best designs the processor can execute at near 100% efficiency.

Memory Types

The slower the memory, the less power it uses and the less heat it generates. There are only two basic types of memory (ignoring specialty memories, which are not used for processor performance) normally used in today's desktop computers and game consoles.
The following are those basic types.

DRAM (Dynamic Random Access Memory) - This type of memory is primarily used as main system memory. Almost all main memories use this basic technology but enhance it to make the memory appear faster. In reality, all of those memories are built on this very slow 60-nanosecond technology. DRAM also requires refresh cycles that reduce overall performance even further. This type of memory is always external to the processor chip and resides on the bus.

SRAM (Static Random Access Memory) - All other memories in the chain between main memory and the processor usually use this technology. SRAM is much faster than DRAM, anywhere from 10 to 500 times faster. Most of the time this type of memory is an integral part of the processor chip, but it can be external in some cases.

The following are some of the possible uses of memory in the memory chain and some of the types that are available.

Main Memory - This is very slow memory (usually 30 nanoseconds, or about 100 clocks at 3.2 GHz, before the first piece of data arrives at the destination). However, once the first access has completed, and depending on the design of the memory, this type of memory can stream data at very high rates (usually 1 clock cycle per memory location, which is usually 64 bits or 8 bytes). SDRAM is the basic DRAM memory with access times of 60 nanoseconds. DDR2 is double data rate technology; the core is still 60 nanoseconds, but tricks are played to make it transfer at twice that speed. GDDR3 memory uses DDR2 technology as the core memory, but other techniques such as bank phasing and channels allow high speed streaming to occur. These memories are primarily used as graphics memory. XDR memory uses techniques such as bank phasing and channels to allow high speed streaming; these memories are usually used as main memory.
L2 Cache - This is an SRAM type of memory of moderate speed (usually about 2-5 nanoseconds, or about 6-16 clock cycles at 3.2 GHz). The size is usually 512 KB, 1 MB, or 2 MB. The memory is organized in blocks that reflect different parts of main memory.

L1 Cache - This is an SRAM type of memory of high speed (usually .3 nanoseconds, i.e. 300 picoseconds, or faster at a 3.2 GHz clock; an access usually causes no more than a 1-clock stall). The size is usually about 32 KB of instruction cache and 32 KB of data cache, but other sizes exist. The memory is organized in blocks that reflect different parts of main memory and usually also different parts of the L2 cache.

The L1 and L2 caches are organized in blocks. As an example, a block could contain 32 sequentially addressed instructions or 32 64-bit data words, but it could also be organized to contain more or fewer sequential instructions or data words.

Putting the Memory Chain Together

The following is an example on a general purpose processor. Where a transfer operation is described, a copy of the data is sent and the source still contains the same data. Normally the processor will request instructions before it needs them; if an instruction does not arrive back at the processor in time, a stall occurs.

During each clock cycle the processor checks whether it wants more instructions placed in the instruction pipeline (see instruction pre-fetch and branch prediction later in this document). If instructions are desired, the processor issues a request to the L1 instruction cache for that instruction block. If the block is in the L1 instruction cache, it is transferred to the instruction pipeline, replacing the least recently used block of instructions there, and the processor does not stall.
If the instruction block cannot be found in the L1 cache, the L1 cache controller requests the instruction from the L2 cache. If the block of instructions is found in the L2 cache, it is sent to the L1 cache (replacing the least recently used block there), which then sends the requested block of instructions on to the instruction pipeline (again replacing the least recently used block). In this case the processor can stall for up to 6-16 clocks, depending on the speed of the L2 cache and on whether the processor had enough instructions in the pipeline to continue executing.

Finally, if the instruction cannot be found in the L2 cache, a request is made over the memory bus for a block of data from main memory. Main memory triggers a cycle to retrieve that block and returns it to the L2 cache, which replaces its least recently used block, sends the block to the L1 instruction cache, which replaces its least recently used block, and then sends the requested block of instructions to the instruction pipeline, which replaces its least recently used block. In this case a stall may occur for up to 300 clock cycles, due to main memory speed and to any bus conflict caused by the GPU, DMA, or other devices accessing memory at the same time. Even in this case the processor may not stall, since enough instructions may have been available to keep executing.

If the memory chain were all that was designed and nothing else, there would still be a large number of stalls. So most processors have instruction pre-fetch, branch prediction, and out-of-order execution capabilities to resolve the stall problems.

Instruction Pre-Fetch

Most if not all modern day processors have instruction pre-fetch capabilities.
This capability is designed into the processor to retrieve sequentially addressed instructions that may be needed in the future. While executing, the currently translated instruction in the instruction pipeline checks whether the sequentially next block of instructions is in the pipeline. If that block is not in the pipeline, it is requested from the L1 cache. If the L1 cache has the block, it transfers the block to the instruction pipeline, replacing the least recently used block; otherwise the block is eventually received from the memory chain and then transferred to the pipeline. This mechanism ensures that the instruction pipeline will have the next sequential instruction available whenever the processor requests it. As long as the processor executes sequentially addressed instructions, it will never stall waiting for an instruction.

Branch Prediction

Although code usually executes sequentially, a large number of branches (redirections to code that is not in the sequential path) usually occur. Branches can occur on a regular basis: function calls, loops, compares, and returns. If the processor did not support branch prediction, every branch could cause it to stall. Branch prediction effectiveness depends on the length of the instruction pipeline (the number of instructions it holds) and on parallel sensing for branch instructions that are sequentially addressed relative to the current instruction being executed. If the pipeline is very large and the complete pipeline is sensed for branch instructions, branch prediction will drastically reduce the possibility of stalls.
Like instruction pre-fetch, branch prediction asks the L1 instruction cache to retrieve a block of instructions, which will eventually replace the least recently used block in the instruction pipeline. With a well designed processor with excellent branch prediction capabilities, stalls should seldom occur when branches happen. Both the PPE core of the Cell processor and the PPE cores in the Xenon processor have branch prediction capabilities, but they were scaled down from the PowerPC processor to save on cost. I suspect that the size of the instruction pipeline was reduced, that the number of instructions in a block may have been reduced, and that the number of instructions being sensed simultaneously was reduced. The SPEs have no branch prediction capabilities at all.

Out-Of-Order Execution

Out-of-order execution is primarily designed to increase performance by executing instructions in parallel. If the processor contained 10 execution units (out-of-order execution) instead of 1 execution unit (in-order execution), performance could in principle increase by as much as 10 fold; in reality the increase is much lower. Out-of-order execution lets the processor execute several instructions from the instruction pipeline in parallel. However, if the source data for an instruction depends on the result of another instruction that has not yet completed, the dependent instruction must wait until the producing instruction finishes.

An in-order execution processor can stall when an instruction needs data that is not available in the L1 data cache; the data must then be retrieved from the L2 cache or main memory. Data, just like instructions, is requested through the memory chain.
An out-of-order execution processor can reduce the length and number of stalls caused by data missing from the L1 data cache by executing other instructions while the waiting instruction's data is retrieved from one of the memories. The out-of-order execution capabilities were stripped out of the Xenon and Cell processors to save cost.

Xenon Processor
Cores: 3 general-purpose cores (PPE)
L1 Cache: 32 KB of instruction cache and 32 KB of data cache per core
L2 Cache: 1 MB of cache shared between all 3 cores
Main Memory: 512 MB of GDDR3 memory shared between processor, GPU, and DMA. Memory bandwidth: 21.6 GB/s.
Hardware Threads: 2 per core (6 total)
Out-of-order execution: No
Branch Prediction: Yes, but stripped down from PowerPC
Instruction Pre-Fetch: Yes
Instruction Size: 64 bits

The following are some performance issues for the 360. Only 1 MB of L2 cache is shared between all three cores, while the minimum L2 cache desired per core is usually 512 KB. With only 1 or 2 cores in use, 1 MB or 512 KB respectively is available per core. Also, since this is designed as a game console and not for a multitasking operating system, 1 MB should be sufficient: all cores could be executing the same game code and using the same game data. The reduction of branch prediction capabilities probably will not cause significant performance degradation. Not having out-of-order execution will probably reduce performance by 50%.

Cell Processor
PPE Cores: 1 general purpose core (PPE)
L1 Cache: 32 KB of instruction cache and 32 KB of data cache
L2 Cache: 512 KB of cache
Main Memory: 256 MB of XDR memory and 256 MB of GDDR3 memory shared between processor, GPU, and DMA. Memory bandwidth: 25.6 GB/s.
Hardware Threads: 2 total
Instruction Pre-Fetch: Yes
Instruction Size: 64 bits

SPE Cores: 7 usable specialized cores (SPE)
L1 Cache: None
L2 Cache: None
SPE Memory: 256 KB SRAM per SPE (7 total). Memory bandwidth: 51.2 GB/s.
Hardware Threads: 1 per SPE (7 total)
Instruction Pre-Fetch: Yes
Instruction Size: 128 bits

The Cell PPE core should execute similarly to the Xenon PPE core, with the exception that the Xenon has 1 MB of L2 cache available in single-core operation. Since the SPE has no L1 cache (L2 cache is not important here), it is important that the SPE local memory is fast. The design uses SRAM, which makes the memory pretty fast; it appears to be 5-nanosecond memory (about the same as L2 cache). So it appears that if an instruction is not in the processor's instruction pipeline, a stall of 16 clocks (a maximum of 32 clocks if DMA is active to the SPE memory) could occur, instead of the maximum of over 300 clocks on the PPE. However, without branch prediction, stalls could occur significantly more frequently than on the PPE.

Since the compiler can project branches and insert a branch prediction instruction (called a hint, according to IBM) into the code, this would at first appear to be a non-issue. However, without an L1 cache and with a probably rather small pipeline, the compiler could insert too many instruction block requests from memory, causing more stalls than would have occurred without the compiler option. On the PPE core, the worst that could happen under the same conditions would be a one-cycle stall waiting for the instruction to be retrieved from the L1 instruction cache.

Conclusions

It is now clear how IBM produced each SPE core using less than half the number of transistors: by eliminating the L2 cache, the L1 cache, branch prediction, and one hardware thread, by greatly reducing the complexity, and in turn by adding 256 KB of SRAM. I suspect that if code is written for an SPE, the developer should try to minimize the use of branches wherever possible. The number of possible branches can be reduced by writing inline code rather than calling functions, classes, or libraries.
Also, where possible, the developer should use as few while, for, if/else, do, and switch statements as possible, since these produce branch code. It is difficult to determine how much the performance of an SPE is affected by the lack of branch prediction when normal general purpose code is executed, even with the compiler's ability to insert branch prediction instructions. Is it just a few percent, or is it as much as 50%?
MistaPi Posted December 4, 2006
KongRudi: The RSX does have GDDR3 memory. As mentioned, the 16MB/s refers to the read bandwidth Cell has from the GDDR3 memory. I would think that limits what Cell can do in terms of graphics rendering and post-processing effects. But it is probably possible to push parts of the framebuffer over to the XDR memory.