Hypothetical Microprocessor Architecture
Copyright © Nicholas Blachford 14 - 17 October 1997
The day after the first draft of this document Intel and HP announced their IA64 architecture and I was surprised at just how similar my design was to theirs. It would appear that when you decide to move complexity from hardware to the compiler it will lead down certain paths just as CPU design in general has produced a number of different chips with very similar features. Either that or great minds think alike :-)
This later draft includes a number of ideas not in the original, so the design is not as close as it first seemed, although there are still commonalities. Consequently, if this design were turned into a real chip and the compiler could be made to work as planned, I’m quite certain it would outperform even an IA64 running optimised code.
And yes, I did think this lot up in 4 days.
Design of the B2
This is a newly designed CPU based on some of the ideas included in the B1c, but designed this time to reduce the complexity and consequent cost of producing such a device. The primary way of doing this is by removing the dedicated loop processor, removing some of the additional function units and, as with the IA64, allowing the compiler to handle instruction scheduling. The role of the compiler in this design is much greater than before, as the CPU will expect the compiler to structure the code beforehand to maximise efficiency.
Unlike the B1c, "real" registers are used. In addition, each of these registers has an associated address tag register. A command such as RETURN R16 would move the contents of register 16 to the location specified in its address tag. To reduce the need for register renaming, 1024 registers are used. The compiler uses different registers over different subroutines, so a program can switch between many subroutines without placing any data on the stack.
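As a rough illustration of the address tag idea, here is a minimal Python sketch. The class name, the `load` method and the use of a dictionary for memory are all assumptions made for the example; only the behaviour of RETURN (write the register back to the location in its tag) comes from the description above.

```python
class TaggedRegisterFile:
    """Toy model: 1024 registers, each paired with an address tag."""

    def __init__(self, size=1024):
        self.values = [0] * size
        self.tags = [None] * size  # memory address associated with each register

    def load(self, reg, memory, address):
        """Hypothetical load: copy memory[address] into a register and tag it."""
        self.values[reg] = memory.get(address, 0)
        self.tags[reg] = address

    def ret(self, reg, memory):
        """RETURN Rn: write the register to the location held in its address tag."""
        if self.tags[reg] is None:
            raise ValueError(f"R{reg} has no address tag")
        memory[self.tags[reg]] = self.values[reg]


memory = {0x2000: 5}          # memory modelled as address -> value
regs = TaggedRegisterFile()
regs.load(16, memory, 0x2000)
regs.values[16] += 37         # some work done in R16
regs.ret(16, memory)          # RETURN R16
```

The point of the tag is that the write-back instruction does not need to name an address; the register already knows where its contents belong.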
There are no dedicated integer or floating point registers; rather, all registers can handle either data type. Reading or writing a 16-bit value to a register just means that the other bits are ignored. If an algorithm involves a very large number of double precision floating point values it can, if it wishes, treat all 1024 registers as double precision floating point registers. On the other hand, the program can also treat all registers as 128-bit integers. While other chips are 64 bits, 32 bits or less, the B2 is a variable bit CPU with a maximum width of 128 bits.
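The variable width access can be pictured as simple masking. This is only a sketch: the function names are invented, and keeping the untouched upper bits on a narrow write is an assumption, since the text only says the other bits are ignored.

```python
REGISTER_BITS = 128
FULL_MASK = (1 << REGISTER_BITS) - 1


def read_bits(register_value, width):
    """Read the low `width` bits of a register; the other bits are ignored."""
    if not 1 <= width <= REGISTER_BITS:
        raise ValueError("width must be between 1 and 128")
    return register_value & ((1 << width) - 1)


def write_bits(register_value, new_value, width):
    """Write the low `width` bits; upper bits are left unchanged (assumption)."""
    if not 1 <= width <= REGISTER_BITS:
        raise ValueError("width must be between 1 and 128")
    mask = (1 << width) - 1
    return (register_value & ~mask & FULL_MASK) | (new_value & mask)
```

For example, `read_bits(0x12345678, 16)` looks only at the bottom 16 bits of the register and returns `0x5678`.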
Instruction Stream Scheduling
The Compiler sorts instructions into different streams, which can then be executed in parallel. The CPU hardware does however have to do some scheduling. The compiler does not know the number of Multi Function Units (MFUs) so the hardware has to assign instruction streams to MFUs at run time. This allows the number of MFUs to be increased or decreased for different variants or generations of the B2. MFUs can perform any operation: integer, logical or floating point so any instruction stream can be assigned to any MFU. Neither the compiler nor the hardware has to take any account of the type of instruction to be executed as all MFUs can handle all instructions.
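A small simulation shows why leaving the stream to MFU assignment to the hardware lets the same compiled code run on variants with different numbers of MFUs. The model is an assumption made for illustration: each stream is reduced to its length, an MFU executes one instruction per cycle, and a stream stays on its MFU until it finishes.

```python
from collections import deque


def cycles_to_run(stream_lengths, num_mfus):
    """Assign streams to idle MFUs each cycle and count cycles until all finish."""
    if num_mfus < 1:
        raise ValueError("at least one MFU is required")
    pending = deque(stream_lengths)   # streams waiting for an MFU
    remaining = [0] * num_mfus        # instructions left on each MFU
    cycles = 0
    while pending or any(remaining):
        # Any MFU can take any stream, so idle units simply take the next one.
        for i in range(num_mfus):
            if remaining[i] == 0 and pending:
                remaining[i] = pending.popleft()
        # Every busy MFU retires one instruction this cycle.
        remaining = [max(r - 1, 0) for r in remaining]
        cycles += 1
    return cycles


streams = [4, 2, 2, 3]   # four streams produced by the compiler
```

With these four streams the same program takes 11 cycles on one MFU, 7 on two and 4 on four, without the compiler knowing which variant it is running on.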
Unlike the B1c the B2 does not separate instruction and loop processing streams. Instead the B2 would have a main processing block which could act like the Loop processor in the B1c. This means loops can be assigned across a number of MFUs and data read and fed into them. Data would be fed into a pipeline of MFUs, which pass data onto one another allowing an iteration of a loop to be completed on every cycle.
Performing the counter arithmetic and doing multiple iterations simultaneously would also be permitted but would be set up in the compiler using specific loop instructions. Unlike the B1c, loops would be found by the compiler; there would be no “Loop Catch” Unit.
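The pipelined loop can be sketched as follows. This is a toy model under stated assumptions: the loop body is split into one stage per MFU, each stage takes one cycle, and the function and variable names are invented for the example.

```python
EMPTY = object()  # marks a pipeline stage that holds no data yet


def run_pipelined_loop(inputs, stages):
    """Feed inputs through a chain of stages, one stage per MFU."""
    pending = list(inputs)
    latches = [EMPTY] * len(stages)   # output of each MFU from the last cycle
    position = 0
    results = []
    cycles = 0
    while len(results) < len(pending):
        new_latches = [EMPTY] * len(stages)
        # The first MFU takes the next loop iteration, if any are left.
        if position < len(pending):
            new_latches[0] = stages[0](pending[position])
            position += 1
        # Every other MFU works on what its neighbour produced last cycle.
        for i in range(1, len(stages)):
            if latches[i - 1] is not EMPTY:
                new_latches[i] = stages[i](latches[i - 1])
        latches = new_latches
        if latches[-1] is not EMPTY:
            results.append(latches[-1])   # one iteration completes
        cycles += 1
    return results, cycles


# Loop body (2 * x + 1) ** 2 split across three MFUs.
body = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
results, cycles = run_pipelined_loop([1, 2, 3, 4, 5], body)
```

Five iterations of a three stage body finish in 7 cycles: two cycles to fill the pipeline, then one completed iteration per cycle.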
The B2 will internally only understand its native SLR (Short Length RISC) instruction set. There is a CLIW (Compact Long Instruction Word) decoder which decodes previously compacted CLIW words into SLR instructions. CLIW is a method of reading more instructions into the CPU in fewer cycles, freeing up memory bandwidth for data. The additional instructions that the B2 will require, however, will probably make the CLIW for the B2 somewhat less efficient than that of the B1c.
Emulation of other instruction sets must be performed in software. There would be no hardware dedicated to this purpose. A binary translator would be the preferable method of doing this, as it would be able to add the additional instructions for loops and be able to handle the large register set.
The Stack Cache
Using a number of registers as large as 1024 produces problems when programs switch between subroutines. Various methods are used to get around this. Firstly, because of the large number of registers, a change of subroutine would not need to place any data on the stack; instead the new subroutine would use higher or lower register numbers. If, however, a subroutine switch involves registers clashing, one set of registers can be re-tagged using a process a little like virtual memory.
The contents of a number of registers and their address tags can be moved to other registers and reached using their tags (all instructions are internally tracked using tags). If the tag indicates the register has been moved it is accessed by a conversion register which retrieves the data to or from the new register. The compiler would indicate which registers are used in any given subroutine so the re-tagging would be a relatively simple operation.
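One way to picture re-tagging is as a lookup from address tag to whichever register currently holds the data. The sketch below is an interpretation, not a specification: the class, its method names and the dictionary standing in for the conversion register are all assumptions.

```python
class RetaggableRegisterFile:
    """Toy model of moving a register and finding it again by its tag."""

    def __init__(self, size=1024):
        self.values = [0] * size
        self.tags = [None] * size
        self.location = {}  # address tag -> register now holding that data

    def bind(self, reg, tag, value):
        """Place a value in a register and record its address tag."""
        self.values[reg] = value
        self.tags[reg] = tag
        self.location[tag] = reg

    def move(self, reg, new_reg):
        """Re-tag: move contents and tag to another register, freeing the old one."""
        tag = self.tags[reg]
        if tag is None:
            raise ValueError(f"R{reg} has no address tag")
        self.bind(new_reg, tag, self.values[reg])
        self.tags[reg] = None

    def read_by_tag(self, tag):
        """Reach the data through its tag, wherever it now lives."""
        return self.values[self.location[tag]]


rf = RetaggableRegisterFile()
rf.bind(8, 0x100, 7)     # caller keeps a variable in R8
rf.move(8, 900)          # clash: the callee also wants R8, so re-tag
rf.bind(8, 0x200, 99)    # callee now uses R8 freely
```

After the move both values remain in registers and neither has touched the stack; the caller's data is still reachable through tag `0x100`.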
The second way subroutine switch problems can be removed is to buffer the stack. There will be times when some or even all of the 1024 registers will have to be placed onto the stack, and this could be a very slow process. To speed this up the registers can all be simultaneously shifted onto the internal stack cache. This process should take no more than a single clock cycle even at a very high clock speed.
The biggest speed problems will however happen when the stack cache overflows and has to be written into main memory. To avoid this there is a large internal level 1 stack cache in addition to the normal data and instruction caches. When this eventually overflows, part of the external level 2/3 cache is used (in the diagram there are 3 separate external caches). Stack data is only moved to main memory as a last resort; otherwise switching subroutines would seriously impede the performance of the CPU. For the most part, register re-tagging or swapping only some registers onto the stack would be the preferred options.
Another problem is when there is a program switch. This will mean that all registers have to be shifted onto the stack. This in itself is not a problem, but it becomes one if the CPU is constantly switching between programs, as any modern multitasking OS does. To get around this the Stack Cache is organised as a Multi FILO design. There are in effect multiple stacks, with one being used for each program. When a program switches, all registers are pushed onto one part of the stack cache while simultaneously all registers are read from another part of the stack cache. As well as this, the CPU will be reading the program code from the instruction caches, so switching programs will not be as slow a process as a large register file might indicate.
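The Multi FILO arrangement can be modelled as one stack per program. This is a minimal sketch with invented names; a real design would do the push and the pop in parallel in hardware, whereas the code simply does one after the other.

```python
class MultiFiloStackCache:
    """Toy model: a separate first-in, last-out stack for each program."""

    def __init__(self):
        self.stacks = {}  # program id -> list of saved register files

    def switch(self, registers, old_program, new_program):
        """Push the outgoing program's registers, pop the incoming program's."""
        self.stacks.setdefault(old_program, []).append(list(registers))
        saved = self.stacks.get(new_program)
        if saved:
            return saved.pop()
        return [0] * len(registers)   # first run: start with cleared registers


cache = MultiFiloStackCache()
regs_a = [1, 2, 3, 4]                          # a tiny 4-register file for brevity
regs_b = cache.switch(regs_a, "A", "B")        # B has nothing saved yet
regs_b[0] = 9
restored_a = cache.switch(regs_b, "B", "A")    # A gets its registers back
restored_b = cache.switch(restored_a, "A", "B")
```

Because each program has its own stack, saving one program's registers never disturbs another's saved state.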
To prevent the CPU waiting around for data to be accessed, prefetching instructions would be used by the compiler to move data into the cache before it is required. When results are written back to the cache the compiler would indicate whether this data is used in subsequent calculations, and if not it would be written to memory without being stored in the cache. If the bus is busy the data could however be stored temporarily in the cache prior to being written to main memory.
To do these cache operations the compiler would need to add instructions, but the compiler will not know the size of the cache. Instructions would indicate a priority in a subroutine's tag fields and the CPU would organise the cache according to this and the time since that routine was last used. Alternatively, similar functions could be performed without using additional instructions. The usage of the data could be implied by where it is used in a program, with more often used data being placed in certain parts of a program.
There are two Output caches which deal with the data (or instructions) as it is retired from the MFUs. If data is used immediately after one operation it is fed back to the next MFU that uses it but if it is used several cycles later the data goes into one of the Output caches. The Output subcache is used where there is a short delay and the bigger main Output cache is used where the delay is longer. If there is no indication that data is to be used the data bypasses the Output caches altogether and is written into the main level 1 cache. Using the compiler the CPU can manage data much more efficiently caching closely only that data which is due to be used.
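The routing decision for retired data can be written as a small function. The cycle thresholds here are pure assumptions chosen for the example, as is the idea that the compiler's hint arrives as a number of cycles until the next use.

```python
def route_result(cycles_until_reuse, subcache_limit=4):
    """Decide where a result goes when it is retired from an MFU.

    cycles_until_reuse is the compiler's hint, or None if there is no
    indication that the data will be used again.
    """
    if cycles_until_reuse is None:
        return "level 1 cache"        # bypass the Output caches altogether
    if cycles_until_reuse <= 1:
        return "next MFU"             # used immediately: feed it straight back
    if cycles_until_reuse <= subcache_limit:
        return "output subcache"      # short delay
    return "main output cache"        # longer delay
```

For instance, `route_result(3)` selects the Output subcache, while `route_result(None)` sends the data straight to the level 1 cache.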
Block Data Reading
Data would always be read in blocks. On many occasions the next piece of data is in the next address so this may outweigh any delay imposed if only a single address is required. The compiler would organise variables in memory in set areas so they can be read into the register file or cache in the minimum possible time using blocks.
Dual Branch Reading
In addition to this where possible all outcomes of a branch would be read into the CPU. These would normally use different registers from each other and from previous and subsequent subroutines. Variables from both outcomes of a branch would then be moved into the register file so there is no penalty irrespective of whether the branch is taken or not.
If part or all of these branches can be calculated before the branch is taken the CPU would execute these instructions but not use the results. When the branch is taken one set of data is then used and the other discarded.
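In software terms this amounts to computing both outcomes first and selecting one afterwards. The sketch below uses invented function names and only makes sense for work without side effects, since the discarded path must leave no trace.

```python
def dual_branch(condition, if_taken, if_not_taken, value):
    """Execute both sides of a branch early, then keep only one result."""
    taken_result = if_taken(value)            # computed before the branch resolves
    not_taken_result = if_not_taken(value)    # computed before the branch resolves
    # When the branch is finally decided, one result is used, the other discarded.
    return taken_result if condition(value) else not_taken_result


def is_negative(x):
    return x < 0


def negate(x):
    return -x


def double(x):
    return x * 2
```

Here `dual_branch(is_negative, negate, double, -5)` gives 5 and `dual_branch(is_negative, negate, double, 4)` gives 8; in both cases the work for the losing side has already been done and is simply thrown away.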
Additional Function Units
There are two groups of additional function units to speed up specific operations. The first group deals with 3D Graphics, Video De/Compression and includes a Programmable Logic Unit.
The Programmable Logic Unit is present for large complex functions. This is not as fast as dedicated hardware but is a great deal faster than software. This could also be used for large subroutines for which insufficient MFUs may be present, or alternatively as a means of boosting the speed of a specific subroutine in a program. The Compiler would handle the programming for the Logic.
Digital Signal Processor
As with the B1c, a DSP is included on the same die as the main CPU. The DSP would use a subset of the B2 instructions, as DSPs are generally for specific algorithms and do not branch so much. Programs would have sections specified as allowed to use the DSP; these would be compiled so as not to use any non-DSP instructions but could run on either the main CPU block or the DSP. The CPU would move permitted subroutines onto the DSP if it is not in use too much.
Some subroutines could be specified to run only on the DSP and would not use the general CPU block. These would however run on the main CPU block if used on a variant of the B2 which did not have a DSP.
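The rules for placing a subroutine on the DSP or the main CPU block can be summarised in a few lines. The policy names and the single "busy" flag are assumptions made to keep the sketch short.

```python
def choose_unit(dsp_policy, dsp_present, dsp_busy):
    """Pick where a subroutine runs.

    dsp_policy is "none" (ordinary code), "allowed" (may use the DSP)
    or "only" (specified to run on the DSP).
    """
    if dsp_policy not in ("none", "allowed", "only"):
        raise ValueError("unknown DSP policy")
    if dsp_policy == "none" or not dsp_present:
        return "cpu"                  # variants without a DSP fall back to the CPU
    if dsp_policy == "only":
        return "dsp"
    return "cpu" if dsp_busy else "dsp"   # "allowed": use the DSP when it is free
```

Because DSP sections are compiled using only the DSP subset of instructions, the fallback to the main CPU block needs no recompilation.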
The DSP would be used for subroutines which run concurrently with other programs. It could be used as an Audio generator, a Modem or just for making subroutines in games and other programs run faster.
Ideas for the B2b (26th Oct 1997)
Explicit use of the Multi FILO Stack Cache to allow programs to switch all or part of the register file out, so as to allow individual programs to access very large numbers of variables (>1024) without having to access the data / instruction caches or external memory. To prevent delays in switching between register files, the switch should be able to occur before all results are returned to their registers; the results, when completed, should be redirected to the Stack Cache.
The basic idea here is to make the Stack Cache partially addressable. In effect this means the B2 could use multiple register files. An advantage of this is that an area of the Stack Cache could be set aside for the use of the Operating System and never switched out to main memory, giving a speed boost to multitasking.