The XBox 360’s major features are its integrated Xbox Live service, which allows players to compete online and download arcade games, game demos, trailers, TV shows, music, and movies, and its Windows Media Center multimedia capabilities. It also offers region-specific access to third-party media-streaming services such as Netflix and ESPN in the USA or Sky Go in the UK.
This article discusses the architecture of the XBox 360 and some of its important aspects.
The XBox 360 features integrated 802.11 b/g/n Wi-Fi, TOSLINK S/PDIF optical audio output, five USB 2.0 ports, and a special AUX port. The original release features a 250 GB HDD, while a later, less expensive SKU features 4 GB of internal storage.
The XBox 360 system has three identical CPU cores that share an 8-way set-associative, 1 MB L2 cache and run at 3.2 GHz. Each core contains a complement of four-way single-instruction, multiple-data (SIMD) vector units. The CPU L2 cache, cores, and vector units are customized for XBox 360 games and 3D graphics workloads. The front-side bus (FSB) runs at 5.4 Gbit/pin/s, with 16 logical pins in each direction, giving 10.8 GB/s read and 10.8 GB/s write bandwidth. The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache.
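As a quick sanity check, the stated FSB bandwidth can be reproduced from the per-pin rate and pin count (a worked example; only the numbers come from the text):

```python
# Worked check of the stated FSB bandwidth: 5.4 Gbit/s per pin,
# 16 logical pins in each direction, 8 bits per byte.
GBIT_PER_PIN = 5.4
PINS_PER_DIRECTION = 16

bandwidth_gb_s = GBIT_PER_PIN * PINS_PER_DIRECTION / 8
# 10.8 GB/s in each direction, matching the stated read and write figures
```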
The I/O chip supports abundant I/O components. XBox Media Audio (XMA) decoder, custom-designed by Microsoft, provides on-the-fly decoding of a large number of compressed audio streams in hardware. Other custom I/O features include the NAND flash controller and the System Management Controller (SMC).
The 3D GPU has 48 parallel, unified shaders. It also includes 10 MB of embedded DRAM (EDRAM), which runs at 256 GB/s to provide reliable frame-buffer and z-buffer bandwidth. The GPU includes interfaces between the CPU, the I/O chip, and the GPU internals.
The 512 MB unified main memory, controlled by the GPU, is 700 MHz graphics double data rate 3 (GDDR3) memory, which operates at 1.4 Gbit/pin/s and provides a total main-memory bandwidth of 22.4 GB/s.
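The main-memory figure can be checked the same way; note that the 128-pin bus width below is inferred from the stated numbers, not given in the text:

```python
# 700 MHz GDDR3 transfers on both clock edges, giving the stated
# 1.4 Gbit/pin/s. The bus width is inferred from the figures
# (an assumption of this worked example, not stated in the text).
per_pin_gbit = 0.7 * 2                # 1.4 Gbit/pin/s
bus_pins = 22.4 * 8 / per_pin_gbit    # ~128 pins, i.e. a 128-bit interface
```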
The Central Processing Unit
The XBox 360 has three CPU cores with the PowerPC instruction set architecture and a VMX SIMD vector instruction set (VMX128) customized for graphics workloads.
The shared L2 allows fine-grained, dynamic allocation of cache lines between the six hardware threads. Game workloads commonly vary significantly in working-set size.
The CPU core issues two instructions per cycle, in order. A separate vector/scalar issue queue (VIQ) decouples instruction issuance between integer and vector instructions for nondependent work. Each core has two symmetric multithreading (SMT), fine-grained hardware threads. The L1 caches include a two-way set-associative, 32 KB L1 instruction cache and a four-way set-associative, 32 KB L1 data cache. The write-through data cache does not allocate cache lines on writes.
The integer execution pipelines include branch, integer, and load/store units. In addition, each core contains an IEEE-754-compliant scalar floating-point unit (FPU), which includes single- and double-precision support at full hardware throughput of one operation per cycle for most operations. Each core also includes the four-way SIMD VMX128 units: floating-point (FP), permute, and simple. As the name implies, the VMX128 includes 128 registers, of 128 bits each, per hardware thread to maximize throughput.
The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction has far lower latency than a sequence of discrete instructions.
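A scalar sketch of what the fused instruction computes versus the discrete-instruction sequence (semantics only; Python stands in for VMX128 here, and the latency advantage is a hardware property this sketch cannot show):

```python
def vmx_dot4(a, b):
    # What the fused dot-product instruction computes: four multiplies
    # reduced by adds in a single operation (rounding details omitted).
    return sum(x * y for x, y in zip(a, b))

def discrete_dot4(a, b):
    # The same result built from discrete instructions: a vector
    # multiply followed by separate adds, each with its own latency.
    p = [x * y for x, y in zip(a, b)]
    return ((p[0] + p[1]) + p[2]) + p[3]
```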
Another addition made to the VMX128 was support for Direct3D (D3D) compressed data formats, the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint.
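To illustrate the roughly 50 percent savings, here is a sketch that packs a four-float vector into 16-bit half floats. This is a representative compressed format only; the exact XBox 360 D3D formats may differ, and `compress_vec4` is a hypothetical helper, not an API from the console.

```python
import struct

def compress_vec4(v):
    # Hypothetical packing helper: four 32-bit floats -> four 16-bit
    # half floats (struct format 'e'), halving the footprint.
    return struct.pack('<4e', *v)

full = struct.pack('<4f', 1.0, 0.5, -0.25, 0.0)   # 16 bytes uncompressed
half = compress_vec4((1.0, 0.5, -0.25, 0.0))      # 8 bytes: 50% smaller
```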
CPU Data Streaming
In the XBox 360, Microsoft paid particular attention to data-streaming workloads, which are not typical PC or server workloads. Features were added to let a given CPU core execute a high-bandwidth workload (both read and write, but particularly write) while avoiding thrashing its own L1 caches and the shared L2.
First, some features shared among the CPU cores help data streaming. One of these is the 128-byte cache line size in all the CPU L1 and L2 caches. Larger cache lines increase FSB and memory efficiency. The L2 also includes cache-set-locking functionality, common in embedded systems but not in PCs.
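Given the 128-byte line size stated here and the cache sizes and associativities stated earlier, the cache geometries fall out of simple arithmetic (a worked example, not additional specification):

```python
def cache_sets(size_bytes, line_bytes, ways):
    # number of sets = capacity / (line size * associativity)
    return size_bytes // (line_bytes * ways)

l2_sets = cache_sets(1024 * 1024, 128, 8)   # shared 1 MB, 8-way L2
l1d_sets = cache_sets(32 * 1024, 128, 4)    # 32 KB, 4-way L1 data
l1i_sets = cache_sets(32 * 1024, 128, 2)    # 32 KB, 2-way L1 instruction
```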
Specific features that improve streaming bandwidth for writes and reduce thrashing include the write-through L1 data caches. Also, there is no write allocation of L1 data cache lines when writes miss in the L1 data cache. This is important for write streaming because it keeps the L1 data cache from being thrashed by high bandwidth transient write-only data streams.
The shared L2 has an uncached unit for each CPU core. Each uncached unit has four noncached write-gathering buffers that allow multiple streams to concurrently gather and dump their gathered payloads to the FSB, yet maintain very high uncached write-streaming bandwidth.
The cacheable write streams are gathered by eight nonsequential gathering buffers per CPU core. This allows programming flexibility in the write patterns of cacheable, very high bandwidth write streams into the L2. The write streams can randomly write within a window of a few cache lines without the writes backing up and causing stalls. The cacheable write-gathering buffers effectively act as a bandwidth compression scheme for writes, because the L2 data arrays see a much lower bandwidth than the raw bandwidth required by a program’s store pattern, which would otherwise make poor use of the L2 cache arrays. Data transformation workloads commonly don’t generate data in a way that allows sequential write behavior. If the write-gathering buffers were not present, software would have to gather write data in the register set before storing. This would put a large amount of pressure on the number of registers and increase the latency (and thus reduce the throughput) of inner loops of computation kernels.
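A toy model of the gathering idea (an illustrative sketch, not the real hardware; the eviction policy and bookkeeping are simplified): stores that land in the same 128-byte line are merged in a buffer, so the L2 arrays see line-sized writes instead of many small ones.

```python
LINE = 128  # cache line size in bytes

def gather_writes(stores, num_buffers=8):
    # stores: list of (address, size) byte writes, possibly nonsequential.
    # Returns how many line-sized writes the L2 would see after gathering.
    buffers = {}   # line base address -> set of dirty byte offsets
    flushes = 0
    for addr, size in stores:
        line = addr - addr % LINE
        if line not in buffers:
            if len(buffers) == num_buffers:
                buffers.pop(next(iter(buffers)))  # flush the oldest buffer
                flushes += 1
            buffers[line] = set()
        buffers[line].update(range(addr % LINE, addr % LINE + size))
    return flushes + len(buffers)
```

In this model, thirty-two scattered 4-byte stores within one line reach the L2 as a single line write; without gathering, each store would be its own transaction.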
The XBox 360 also applies similar customization to read streaming. For each CPU core, there can be eight outstanding loads/prefetches. A custom prefetch instruction, extended data cache block touch (xDCBT), prefetches data but delivers it to the requesting CPU core’s L1 data cache, never putting data in the L2 cache as regular prefetch instructions do. This modification seems minor, but it is very important because it allows high-bandwidth read-streaming workloads to run on as many threads as desired without thrashing the L2 cache. Another option considered for read streaming was to lock a set of the L2 per thread. In that case, if a user wanted to run four threads concurrently, half the L2 cache would be locked down, hurting workloads that require a large L2 working-set size. Instead, read streaming occurs through the L1 data cache of the CPU core on which the given thread is operating, effectively giving each thread a private read-streaming first-in, first-out (FIFO) area.
A system feature planned early in the Xbox 360 project was to allow the GPU to directly read data produced by the CPU, with the data never going through the CPU cache’s backing store of main memory. In a specific case of this data streaming, called Xbox procedural synthesis (XPS), the CPU is effectively a data decompressor, procedurally generating geometry on-the-fly for consumption by the GPU 3D core. For 3D games, XPS allows a far greater amount of differentiated geometry than simple traditional instancing allows, which is very important for filling large HD screen worlds with highly detailed geometry.
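A minimal sketch of the procedural-synthesis idea (a hypothetical example; real XPS emits D3D geometry and is far more sophisticated): the CPU generates vertices on demand from compact parameters instead of storing a full vertex buffer.

```python
import math

def ring_vertices(n, radius):
    # Generate n vertices of a circle on the fly; only the parameters
    # (n, radius) live in memory, not an n-entry vertex buffer.
    for i in range(n):
        a = 2 * math.pi * i / n
        yield (radius * math.cos(a), radius * math.sin(a), 0.0)
```

The consumer (the GPU 3D core, in the real design) drains the stream as it is produced, which is the sense in which the CPU acts as a data decompressor.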