Multi-threaded Fragment Processor
Copyright © Nicholas Blachford - 13th September 1998
Although below I say "implementation" I don't mean this in the usual sense. The B3 is a hypothetical architecture and as such is really only a vehicle for my ideas on microprocessor architecture. This design is not set in stone; rather, it could be considered as freshly poured wet cement. If someone were to actually build one it would probably end up looking very different from what I describe here, with more ideas being added and some being removed. For example, asynchronous operation of the Fragment Processors is probably necessary for high speed, but the split clock they use may actually hinder things.
The diagram is only meant to show the basic layout and ideas; it is not a complete diagram.
I had the idea of splitting a program into Fragments some time ago, but it took me a long time to fully understand exactly how they could be used for any useful purpose. This B3 design is my first attempt at figuring out how a Fragment-based CPU would operate.
A Fragment processor operates in a significantly different manner from conventional CPUs. All program code is split up at compilation time into a series of Fragments, each consisting of one or more processor instructions. When executed, the Fragment and the data it requires are sent to a processing unit and executed. Once complete, the data is moved out and another Fragment is moved in; the new Fragment may or may not be related to the previous one. The processing unit neither knows nor cares: all it has to do is complete its own processing and exit.
This is the idea behind Fragment processing. It doesn't sound very spectacular but as I shall explain below it allows you to do a lot of things which a conventional processor would not allow.
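The execution model described above can be sketched in a few lines of code. This is only an illustration of the idea, not the B3 itself; the names `Fragment` and `FragmentProcessor` and the dictionary-of-registers model are my own assumptions.

```python
# Illustrative sketch of the Fragment execution model: a processing
# unit receives a Fragment plus its data, runs it, and hands the
# results back, knowing nothing about overall program flow.

class Fragment:
    def __init__(self, instructions, inputs):
        self.instructions = instructions  # list of callables: regs -> None
        self.inputs = inputs              # the data this Fragment works on

class FragmentProcessor:
    """A unit that neither knows nor cares what came before:
    data in, execute, data out."""
    def run(self, fragment):
        regs = dict(fragment.inputs)      # move the Fragment's data in
        for instr in fragment.instructions:
            instr(regs)                   # execute each instruction
        return regs                       # move the results out

# Two unrelated Fragments run on the same unit, one after the other.
f1 = Fragment([lambda r: r.update(x=r["x"] + 1)], {"x": 41})
f2 = Fragment([lambda r: r.update(y=r["y"] * 2)], {"y": 21})
fp = FragmentProcessor()
print(fp.run(f1)["x"], fp.run(f2)["y"])  # 42 42
```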
The hypothetical B3 implementation
In the B3 I have taken ideas from the previous designs and with somewhat more knowledge about how CPUs operate I have designed the B3 around the idea of Fragment processing. Although it contains some of the same ideas the B3 is a completely different design from the B1c or the B2.
The process starts with data being read into the CLIW decoder. CLIW stands for Compact Long Instruction Word and is a means by which instructions can be packed into a single long data word. The purpose of this is to reduce the average length of instructions and the bus bandwidth required to load them, freeing up the bus for moving more data. A single CLIW word may contain many instructions, and these may make up a single Fragment or span multiple Fragments.
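A minimal sketch of CLIW-style packing follows. The 128-bit word size and the one-byte length-prefix encoding are assumptions of mine purely for illustration; the text does not specify the actual format.

```python
# Hypothetical CLIW packing: several variable-length instructions are
# packed into one fixed-size long word to cut bus bandwidth.  The word
# size and length-prefix format here are assumed, not the B3's own.

WORD_BYTES = 16  # a 128-bit CLIW word

def pack(instructions):
    """Pack as many instructions (byte strings) as fit into one word."""
    word, packed = b"", 0
    for ins in instructions:
        entry = bytes([len(ins)]) + ins   # 1-byte length prefix
        if len(word) + len(entry) > WORD_BYTES:
            break                          # word is full; stop packing
        word += entry
        packed += 1
    return word.ljust(WORD_BYTES, b"\x00"), packed

def unpack(word):
    """Decode the packed instructions back to their full length."""
    out, i = [], 0
    while i < len(word) and word[i] != 0:
        n = word[i]
        out.append(word[i + 1:i + 1 + n])
        i += 1 + n
    return out

word, n = pack([b"ADD", b"LOAD8", b"MUL"])
assert unpack(word) == [b"ADD", b"LOAD8", b"MUL"] and n == 3
```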
Once the Fragments are decoded to their full length they are written into a buffer. Each Fragment requires data, and it is at this point that the addresses are generated and the data is prefetched.
The Fragments then move into the Fragment Splitter where they are divided into categories and sent to buffers to await execution. There are 6 types of Fragment in the B3:
1) Memory Fragments
These are instruction sequences which operate directly on memory. These can consist of block data moves and simple assigns. A dedicated Memory Processor is used to generate the addresses and / or data required by these instructions.
2) Free Fragments
These are Fragments which can be executed out of sequence. If a branch has two possible outcomes and these do not depend on data processed in the previous Fragments, both can be executed simultaneously by different Fragment Processors. Some simple loops can be issued to the Free Fragments unit, where they will be sent for execution on a single Fragment Processor.
3) Loop Fragments
These are Fragments which contain loops, or are loops which are spread over a number of Fragments. The latter will most likely happen in loops which contain other loops. The Loop Fragments unit can issue a single iteration of a loop as a Fragment, providing there are no dependencies between iterations. This means that a single loop can use all 4 Fragment Processors without having to be specifically modified to do so.
Another trick the Loop Fragments unit can do is to take over all 4 Fragment Processors to allow the handling of large multi-Fragment loops. In order to do this the loop must be split into four parts, one executed on each Fragment Processor in turn. The loop in effect has to be pipelined.
4) Depend Fragments
These are Fragments which specifically depend on data in a previous Fragment. These cannot be issued for execution until the previous Fragment is completed.
5) Special Fragments
Most modern-day RISC processors now come with some form of acceleration for multimedia and other instructions. The B3 would be no different and would have a processor for this purpose. In addition, a Programmable Logic Unit would be present where custom circuits can be configured and used on the fly; Special Fragments would be issued to these units. Complex arithmetic anyone?
6) Register Fragments
These are Fragments which operate on the Address Tag/Register Files. These can be loads, saves and moves. Moves will be used a great deal in a Fragment processor as different Fragments will require data spread across the Register File and the Fragment Processors will not have access to them all.
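The Fragment Splitter's job of sorting decoded Fragments into the six buffers above can be sketched as follows. How Fragments are actually tagged with their category is not specified in the text, so the `(tag, payload)` representation is an assumption.

```python
# Minimal sketch of the Fragment Splitter: each decoded Fragment is
# classified into one of the six categories and queued in the matching
# buffer to await execution.  The tagging scheme is assumed.

from collections import defaultdict

CATEGORIES = ("memory", "free", "loop", "depend", "special", "register")

def split(fragments):
    """fragments: list of (category_tag, payload) pairs."""
    buffers = defaultdict(list)
    for tag, payload in fragments:
        if tag not in CATEGORIES:
            raise ValueError(f"unknown fragment type: {tag}")
        buffers[tag].append(payload)   # queue in the matching buffer
    return buffers

bufs = split([("free", "frag0"), ("loop", "frag1"), ("free", "frag2")])
assert bufs["free"] == ["frag0", "frag2"] and bufs["loop"] == ["frag1"]
```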
B3 Registers and Address Tags
The above paragraph may seem a little strange, as processors can usually access the full register set. One exception to this is register windows, but even there, once the window is changed all registers can be accessed. This is definitely NOT the case in the B3.
The B3 has a number of Address Tag / Register Files. Alongside each 128-bit register, the address of its data is stored in an Address Tag. This allows data to be written back to memory or moved between locations without having to specify memory addresses, as the Address Tag already stores the address.
These Register Files are organised so that only a subset of the file is available to Fragments. When a Fragment is issued for execution to a Fragment Processor, a set of 16 registers in that Processor is filled with the data that Fragment works on. While this may seem like a limitation, it is not, given that a Fragment is only a small snippet of code and will not contain a large number of instructions. Each Fragment Processor in the B3 is to run at a high clock rate, and a small register set will ensure register lookups are fast.
The Register Files in the B3 will contain 256 entries but since 16 registers will be read out at once for transfer it will not be possible to address individual registers. When transfers are required between register sets the Register Fragments will be read out and transferred in the Register Control unit then written back to one of the Register Files.
There are not just 256 Registers however. There are 8 separate Register Files, each of which is assigned to a program or thread. One is permanently assigned to the Operating System kernel. When the kernel is about to swap threads one of the Register Files is switched out to the Stack Cache then filled with data from the next thread. This will all happen in the background while other threads are still running. Once this is complete one of the running threads is suspended while the new thread is started. This should allow threads to be started and stopped with an effective context switch latency of zero.
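The background thread swap described above can be sketched as two separate steps: a slow background step that spills and refills a spare Register File, and a fast foreground step that merely switches which file a thread points at. The class and method names, and the use of 8 files with file 0 reserved for the kernel, follow the text; everything else is an illustrative assumption.

```python
# Sketch of the zero-latency thread swap: while threads run, a spare
# Register File is drained to the Stack Cache and refilled for the
# incoming thread in the background; only then is a running thread
# suspended.  No data moves in the foreground step, so the visible
# context-switch cost is effectively zero.

NUM_FILES = 8  # one file permanently assigned to the OS kernel

class RegisterFiles:
    def __init__(self):
        self.files = {i: None for i in range(NUM_FILES)}  # file -> thread
        self.files[0] = "kernel"                          # fixed assignment
        self.stack_cache = {}

    def prepare_swap(self, spare, next_thread):
        """Background step: spill the spare file, refill for next thread."""
        if self.files[spare] is not None:
            self.stack_cache[self.files[spare]] = spare   # spill old state
        self.files[spare] = next_thread                   # prefill new state

    def switch(self, running, spare):
        """Foreground step: suspend one thread, start the prefilled one."""
        return self.files[running], self.files[spare]

rf = RegisterFiles()
rf.files[1] = "thread_A"                          # a currently running thread
rf.prepare_swap(spare=7, next_thread="thread_B")  # happens in the background
old, new = rf.switch(running=1, spare=7)          # instant hand-over
assert old == "thread_A" and new == "thread_B"
```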
The Register Files are 128 bits wide, with a further 128-bit Address Tag per register. These registers can be accessed 8, 16, 32, 64 or 128 bits at a time, and fixed or floating point values can be stored. There is no need for address registers, as all registers have Address Tags and it would be possible to manipulate these.
The Fragment Processors will run asynchronously from the rest of the B3 using a high speed split clock. The split clock will allow only the fastest operations to operate in a single clock cycle. A logical operation may operate in a single cycle but an integer multiply (and possibly add) will not.
This means that logical operations will execute very quickly, but many other operations will also speed up as a result of this clocking. If an operation takes, say, 0.6 of a cycle to complete normally and the split clock were, for example, 10 times faster, it would now complete in 6 split-clock cycles as opposed to the 10 it would effectively cost at the lower clock (it is highly unlikely the split clock would be this fast!).
A disadvantage of this system is that complex operations which normally take a large number of cycles will now take even more. An option here would be to split such operations into very small Fragments and allow these to be processed alongside a separate Fragment. This would complicate things, but not too much if the long operation was not duplicated in the second Fragment.
Another disadvantage is that many operations will now take several cycles to complete, meaning they would have to be pipelined, and this could seriously complicate their circuitry. Consequently, the exact speed of the split clock would have to be chosen so that it does not overcomplicate things. It may be possible to increase the clock speed by a small amount and need to pipeline only a few operations.
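The split-clock arithmetic above can be made explicit. An operation's true latency is some fraction of a base clock cycle; with a split clock `ratio` times faster, it completes as soon as enough split cycles have elapsed, rather than always costing a whole base cycle.

```python
# Worked version of the split-clock example: an op taking 0.6 of a
# base cycle finishes in 6 split cycles (at a 10x split clock) instead
# of the 10 split cycles a full base cycle would cost.

from fractions import Fraction
import math

def split_cycles(base_fraction, ratio):
    """Split-clock cycles needed for an op taking base_fraction of a
    base cycle, when the split clock runs ratio times faster."""
    return math.ceil(base_fraction * ratio)

RATIO = 10  # example only: the text notes a real ratio would be lower
assert split_cycles(Fraction(6, 10), RATIO) == 6   # 0.6 of a base cycle
assert split_cycles(Fraction(1), RATIO) == 10      # a full base cycle
```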
The Fragment Processors
Fragment Processors do all the data processing in the B3. They perform the integer and floating point instructions, contain their own 16 entry register sets, and have facilities for multiple loop counters. The Fragment Processors are almost full RISC processors in their own right but they do not directly read or write data to or from memory and don't have any control over overall program flow.
Unlike many RISC processors there is no hardware for finding parallelism within programs; this will be handled by the compiler (à la Merced). The parallelism, however, is strictly limited to no more than 4 issues per cycle. This is so as not to overcomplicate the compiler and also to allow a higher clock rate. It also means that Fragments are kept to a more manageable size and inter-Fragment dependencies are not excessive.
The Fragment Processors will perform processing using MFUs, another idea borrowed from one of my previous designs. An MFU is a Multi-Function Unit: a single unit which can handle all operations, integer or floating point. The original idea behind them was to simplify out-of-order instruction logic; the Fragment Processors have no such logic, but the MFUs will make life easier for the compiler.
A conventional out-of-order CPU has to look at the type of an instruction and then figure out if the relevant unit is free; if not, the instruction is held. An MFU-based processor only looks at which instruction can be issued, sees if any MFU is free and, if so, issues it. The type of instruction is irrelevant as every MFU can take every instruction.
The only exception to this is where a long non-pipelined multi-cycle instruction has been issued and this "locks" an MFU. One possibility (especially for rarely used instructions) is to unlock the MFU if none of that type of instruction appear in subsequent Fragments. This would complicate the logic and affect the instruction (Fragment) opcodes, so it may not be such a good idea after all. An alternative would be to make such instructions available only as Special Fragments; this would have issues of its own but would not tie up or complicate the other MFUs.
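The type-blind MFU issue rule is simple enough to sketch directly. This is an illustration of the principle only; how the real scheduler tracks locked MFUs is not specified in the text.

```python
# Sketch of MFU issue logic: the scheduler never inspects an
# instruction's type, it only looks for any free MFU.  A "locked" MFU
# (running a long non-pipelined op) is simply skipped.

def issue(instructions, mfus):
    """mfus: list of booleans, True = free.  Returns instructions issued
    this cycle, claiming MFUs as it goes."""
    issued = []
    for ins in instructions:
        for i, free in enumerate(mfus):
            if free:                 # any free MFU will do, type ignored
                mfus[i] = False      # claim it for this instruction
                issued.append(ins)
                break
    return issued

mfus = [True, False, True, True]     # MFU 1 is locked by a long op
assert issue(["fadd", "imul", "and", "fdiv"], mfus) == ["fadd", "imul", "and"]
```

Note that the floating point, integer and logical instructions above are all treated identically; only availability matters.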
MFUs probably won't make much of a difference to the number of instructions issued per cycle, but the simpler logic should allow the clock speed to be higher. Having said that, the sheer size of multiple MFUs will affect the clock speed.
Once a Fragment is complete its results cannot always be written back to memory immediately. If the Fragment has been issued out of order it has to be retired in the correct order. If a pair of Fragments have been issued speculatively, the outcome of the choosing Fragment has to be checked to ensure the correct Fragment is chosen. If data from one Fragment is required by the next, it has to be sent back or held in the Fragment Processor while the next loads.
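The in-order retirement requirement can be sketched with a small reorder buffer. The sequence-number scheme and the `RetireQueue` name are illustrative assumptions; the text only requires that results reach memory in program order.

```python
# Sketch of in-order retirement: Fragments may finish out of order,
# but results are only released for write-back once every earlier
# Fragment has retired.

import heapq

class RetireQueue:
    def __init__(self):
        self.done = []        # min-heap of (sequence_number, result)
        self.next_seq = 0     # the next Fragment allowed to retire

    def complete(self, seq, result):
        """A Fragment finished (possibly out of order); retire what we can."""
        heapq.heappush(self.done, (seq, result))
        retired = []
        while self.done and self.done[0][0] == self.next_seq:
            retired.append(heapq.heappop(self.done)[1])
            self.next_seq += 1
        return retired        # results now safe to write to memory

rq = RetireQueue()
assert rq.complete(1, "b") == []            # Fragment 1 must wait for 0
assert rq.complete(0, "a") == ["a", "b"]    # both now retire, in order
```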
Writing to Memory
Once Fragments are retired their data can be written to memory, the full register set, or both. The Base Address Store generates the full 128-bit address. To save on address space, programs could operate with 64-, 32- or 16-bit addressable ranges. This saves space in the instructions, but also saves the CPU from sending gigantic addresses all over itself. All addresses will, however, need to be recalculated to their full 128 bits, and this is handled by the Base Address Store. The MMU then generates the final physical address.
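The Base Address Store idea can be illustrated with a simple base-plus-offset expansion. The widths and the additive scheme here are my own assumptions; the text only says short in-program addresses are recalculated to full 128-bit addresses at write-back time.

```python
# Hypothetical sketch of the Base Address Store: programs pass around
# short (16/32/64-bit) offsets, and the full 128-bit address is only
# rebuilt at write-back by adding a per-program base.

FULL_BITS = 128

def expand(base, offset, offset_bits):
    """Rebuild a full 128-bit address from a short in-program offset."""
    assert offset < (1 << offset_bits), "offset wider than declared range"
    return (base + offset) % (1 << FULL_BITS)   # wrap within 128 bits

base = 0x1234 << 64                 # per-program base held in the store
addr = expand(base, 0xBEEF, 32)     # the program only handled 32 bits
assert addr == (0x1234 << 64) + 0xBEEF
```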
The cache for a B3 will be a pretty complex affair, as it will probably need multiple caches to allow for multi-threading. Currently this would cause a CPU to bloat to ludicrous proportions, but if someone were to actually build a B3 it wouldn't be ready until the time 0.13 micron technology was beginning to appear. Consequently the size of multiple L1 caches would not by then be such a big deal, nor for that matter would multiple MFUs.
And Last but not least...
The B3 is a multi-threaded multi-issue out of order CPU. In essence it can run several programs at once and each of these programs can issue multiple instructions per cycle.
I have tried to keep things simple - there is no complex out-of-order circuitry in the Fragment Processors, and the use of MFUs simplifies the compiler's scheduling by allowing instructions of any type on any unit. In addition, the Fragment Processors will run asynchronously from the rest of the B3 and from each other. The explicit use of a small number of registers and an explicit limit on parallel instruction issue should also simplify the logic. In short, the Fragment Processors should be able to run at a very high clock rate.
Fast Fragment Processors will be no use if the CPU cannot issue Fragments at a high enough rate to keep them busy. Fragments can be issued out of order, but this will be a rudimentary capability: the CPU will not go looking for the absolute maximum number of Fragments it can issue, but if it can issue 3 at a time it will do so. Fragments require data to operate on, and this will be gathered from memory or the register file while the Fragments are held prior to execution.
Context switching and thread swapping can both be hidden in the background by reading Fragments and issuing other threads for execution. There is no reason for the CPU to suddenly stop when a cache miss occurs, as this too can be covered by executing a Fragment from a different thread.
Both the Fragment Processors and the CPU infrastructure should be capable of running at a high rate due to a lack of needless complexity.
So, providing the memory system can keep data supplied to the B3 core, it should execute programs at a very fast rate indeed. If four threads can each issue just one instruction per cycle, that is 4 instructions per cycle - and this low rate is equal to or greater than the peak rate of many RISC processors. Even if the B3 has a relatively moderate clock rate, in average operation it should be able to run rings around pretty much everything going.
The B3 should be especially good on Floating Point applications - how many other CPUs will be able to issue 16 Floating point instructions per cycle?
Given that the B3 has 4 Fragment Processors you would expect it to be as fast as a multiprocessor system, and perhaps even faster... A multiprocessor system requires communication between CPUs to keep data consistent; in addition, the CPUs will fight each other for data in memory.
The B3 is not a multiprocessor and does not require this communication overhead in either hardware or software; additionally, it cannot fight itself for memory access. Requests for data will be held until the bus is ready, a process which will take less time within a single processor than across four CPUs spread over a motherboard.
Any comments on this document are very welcome.
Diagram produced on Amiga, Documentation written on BeOS.
How's that for Alternative platforms!