In the last couple of posts I described how I have used a TI MSP430 Launchpad to emulate the 1949 Cambridge University EDSAC machine.
This opens up the possibility of using a similar scheme to emulate virtually any DIY cpu, possibly with a novel or experimental instruction set.
The MSP430 Launchpad was used only because I had it at hand, but the cpu model and the SIMPL communications framework, being written in C, could be ported to almost any microcontroller – especially if they are supported by the ever-growing Arduino IDE, now with ARM Cortex M0, M1 and M4 extensions.
A Look at an instruction set for a MISC Processor.
As I am a proponent of MISC (minimum instruction set computers), the task of simulating the cpu is simplified to just 32 instructions – represented by a 5-bit token, and I have chosen standard ascii characters to represent my instruction primitive. EDSAC used 18 capital letters to represent its instructions, with a mnemonic flavour, but any printable character is appropriate to represent an instruction primitive.
These primitives represent various ALU instructions, data moves between the ALU and memory or registers, and various input-output instructions.
I have attempted to pick a mix of instructions that are easy to implement, useful and offer flexibility. Whilst anyone can have aspirations about designing the ISA of a new processor, it is likely to have a minimum number of instructions to be useful.
This number was estimated to be around 25 in the 1990s – by Bill Meunch, Chen Hanson Ting Chuck Moore and others – working MISC and Forth cpu implementations. Previous cpus have successfully use 16 or even 8 instructions (see below).
From 25 primitive instructions, more complex instructions can b e synthesised by assembling groups of these primitives into macros.
ALU Group NOP, ADD, SUB, AND, OR, XOR, INV, SHL, SHR
Call/Jump Group CALL, JMP, JEQ, JGT, JLT, RET
Memory Group FETCH, STORE
Stack Operations DUP, DROP, SWAP, OVER, PUSH, POP
I/O Group IN, OUT, EMIT, KEY (emit is putchar and key is getchar)
Register Transfers TO-R, FROM-R
Program Flow FOR, NEXT
Thus we have 31 instructions that are tailored towards a stack based processor. For a more conventional register processor, or load-store architecture there would probably be more instructions for register transfers and richer addressing modes.
It could also be argued that all of the conditional jumps might not strictly be required, as a jump if zero can implement all program control structures. Additionally if the processor has memory mapped peripherals then IN and OUT are practically redundant.
Deciding which instructions to keep and which to lose is all part of the mental anguish process of creating a the ISA of a DIY cpu.
Having decide on a set of instruction primitives, and the character tokens to represent them, it’s a case of looking at the overall processor architecture.
One cycle per instruction would be good – always a design aim for a simple processor, and also whether the use of dual port RAM can allow a partial pipeline – such that the ALU is calculating the ALU result(s) whilst the instruction is still being decoded. This is the approach James Bowman has taken in his J1 Forth Processor.
Stack Processors generally use two stacks – the data stack and the return stack – but there are other classes of stack machines – see Koopman’s Book on Stack Processors
These stacks can be implemented in main memory, or as a closely coupled register file. The stacks will have their own stack pointers and this is why the TO-R and FROM-R instructions are present to allow the TOP of stack to communicate with R, the Return Stack Pointer. There also needs to be a hardware mechanism to put the current Program Counter onto the return stack, before executing a CALL.
In SIMPL, there are really only two main registers x and y to hold the operands for the ALU. There is also a loop counter k which is decremented each time around the loop until it zeros, and very occasionally there is need to access another variable.
It could be argued that 16 registers in a stack would be sufficient for most purposes, and at times four would be more than enough. As registers are cheap in FPGAs, then it may be worth considering a hybrid architecture, where the stack is a set of registers – any of which are addressable. This is a deviation from the pure TOP/NEXT architecture of a traditional stack processor – but it extends the processor functionality quite considerably, bringing it in line with many small commercial processors that have a rich set of registers. And in the virtual machine to execute SIMPL, we need an Instruction Pointer, Program Counter, Return Stack Pointer, Data Stack Pointer, Loop Counter, Top and Next of stack – so 7 registers already.
Having more registers gives greater flexibility, but at the cost of needing more instructions to address them. One compromise might be the idea of “zero page RAM” – where a limited area of memory has its address coded into the lower part of instruction – say the bottom 7 bits – and this allows the alu to operate quickly on this subset of memory without having to wait for a 2nd instruction that holds the address.
Dedicating the lower part of memory to an 8-bit payload, allows short literals to be encoded into the instruction – useful for supplying constants to the ALU, or for comparison operations – jump on match or similar.
Implementing SIMPL in hardware.
The end game of looking at cpu design is to attempt to find an architecture that is not only easy to implement, light in hardware and optimised to execute the SIMPL language efficiently.
As SIMPL is implemented in the form of a switch-case statement or hash-table, having an efficient “jump on match” instruction would be a useful asset – where an ascii character can be read from a buffer, multiplied up by 32 or 64 and stored into the PC causing program execution to jump to the code body defined by that ascii character.
In hardware, decoding the 5-bit token into a longer instruction word, with various bit-fields to control the alu etc, would normally be done with a microcode ROM. In the simulator, this decode function is done with a series of switch-case statements.
For example typing “&” generates a 16 bit word by calculation or from a look-up table, which it loads into the next RAM address and also prints the string AND to the terminal.
This process is repeated for all legitimate characters, and number literals, and so the RAM can be loaded with the machine code for the proposed processor. As the 32 instructions will be a common “lingua franca” to all processors, it’s just a case of changing the “look up” to suit the machine language of the target processor.
Hardware versus Software
My interest in Minimum Instruction Set Computers (MISC), lies in particular trying to understand where the border line is drawn between a successful instruction set, and one that is too compromised to work efficiently. Instructions can always be synthesised from primitives such as ADD and NAND, but this costs cpu cycles. As there is likely to be a hardware XOR function included as part of the full-adder in the cpu, is it not worth the additional instruction multiplex to make this available in the instruction set?
Historical Examples of Minimalisation
The PDP-5 and the more commercially successful PDP-8 were a miracle of minimal hardware design – as when introduced in 1965, a single transistor cost the equivalent of $20 in today’s money. Transistors were used very sparingly, only where necessary to provide signal inversion (logical NOT) and amplification (fan out). The Diode Transistor Logic (DTL) provided an easy mechanism to OR inputs and to AND outputs – so much was made of this feature to minimise the transistor count.
The PDP-8 ALU and Register slices (boards R210 and R211) contained 12 and 13 transistors respectively, and about 10 times that number of diodes. These were built at a time before ICs were available cheaply, so every last ounce of value had to be extracted from a minimum transistor count. As a result the PDP-8 had only 8 instructions, but employed a trick by where the instruction OPR, which is 7xxx (octal) allowed individual manipulation of the ALU control signals to provide clear and complement and shifting operations. A second set of 7xxx instructions allowed various conditional jumps to be performed. As these operations could be performed in parallel by ORing the various control bits, this gave much greater flexibility to a machine with apparently only 8 instructions – as follows
- 000 – AND – and operand with AC.
- 001 – TAD – add operand to <l,ac>(a 13 bit value).</l,ac>
- 010 – ISZ – increment operand and skip if result is zero.
- 011 – DCA – deposit AC in memory and clear AC.
- 100 – JMS – jump to subroutine.
- 101 – JMP – jump.
- 110 – IOT – input/output transfer.
- 111 – OPR – microcoded operations.
The PDP-8 was legendary for it’s minimalisation, and went on to sell more than 50,000 units over a period of 20 years, and in several hardware revisions – the last being a PDP-8 on a Harris 6120 VLSI chip based on gate arrays. The PDP-8 has also been recreated in more modern hardware – both in discrete TTL and as a FPGA in VHDL.
A Flexible ALU
The ALU featured in the “NAND to Tetris” course uses 6 control lines to ingeniously control the inputs and outputs to an otherwise simple ADD/AND circuit. However, only 18 of the 64 combinations of instruction are actually used, and one bitslice the ALU contains the equivalent of 35, 2-input NAND gates – which rapidly escalates to 560 gates for a 16-bit ALU.
Constraining the instruction set to just ADD or NAND halves the number of gates needed but equally shifts the problem over to firmware to synthesise the missing operations from the ADD and NAND primitives.
Before embarking on any hardware implementation it would be good to have the means to simulate and explore possible instruction set architectures (ISAs), so the technique used to simulate EDSAC can be more widely applied.
Simulating a DIY CPU using SIMPL.
As most MISC machines are likely to have less than 32 instructions, the cpu model is likely to be relatively simple. As seen with EDSAC, the processor model can be constructed in fewer than 50 lines of code, as a case statement that decodes each instruction from memory in turn.
Populating an area of RAM with the correct instructions for the chosen processor can be done with the help of a text interpreter and a look-up table. I used a modified version of SIMPL to act as an instruction loader/builder, where an instruction in the form nX was assembled into memory. n is the address/data field of the instruction, and X is the op-code. SIMPL strips off X as a 4 or 5 bit code, and shifts it up to occupy the most significant bits of the instruction word, and places n as the lower 12-bits.
This operation was fast and reliable allowing pseudo-assembly language to be loaded directly into memory using Teraterm.
Once the RAM contains the correct machine instructions, we use the processor model to fetch the instructions one at a time from RAM, decode them and execute the correct logical and arithmetical behaviour, updating registers, program counter and memory where necessary.
SIMPL was set up so that the “s” command could be used to single step through the instructions – giving a print out of the registers and key memory locations each step. This allowed for elementary debugging of the code.
Another SIMPL command “e” was used to preset the program counter to a given location – so that program execution could begin there.
Further commands and a more refined user interface are planned for a later date.
Great satisfaction was had having code actually run on the simulated EDSAC machine – using nothing more than a text editor, a terminal program and a MSP430 LaunchPad. Such low level computing really does not require sophisticated tools.