Background

One of the interesting problems from my Computer Architecture classes was a hypothetical "Subtract and Branch if Negative" datapath. It was unique in that it had only that one instruction: it would subtract one number from another, write the result into a register, and take a branch if the result was negative.

The mnemonics for this would be:

SBN R3, R2, R1, 4  <=== Subtract R1 from R2 and place the result in R3. Branch by +4 if the written value is negative. Take the next instruction otherwise.
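
The behaviour of a single SBN can be sketched in a few lines of Python. This is only an illustration, not the hardware: the function name sbn_step is hypothetical, the operand order (destination, minuend, subtrahend) follows the instruction-format comment in the Assembly section, and Python integers stand in for the fixed-width registers, so overflow is not modelled.

```python
def sbn_step(regs, pc, dest, minuend, subtrahend, offset):
    """Execute one SBN instruction and return the next program counter.

    regs is a list of integers indexed by register number, and the three
    operands are register numbers. Assumption: Python integers replace the
    hardware's fixed-width signed registers, so overflow is not modelled.
    """
    result = regs[minuend] - regs[subtrahend]
    if dest != 0:  # R0 is treated as always 0, so writes to it are dropped
        regs[dest] = result
    # Branch by the offset on a negative result, otherwise fall through.
    return pc + offset if result < 0 else pc + 1


regs = [0] * 32
regs[1], regs[2] = 7, 5
next_pc = sbn_step(regs, pc=0, dest=3, minuend=2, subtrahend=1, offset=4)
print(regs[3], next_pc)  # -2 4
```

Here R3 = R2 - R1 = 5 - 7 is negative, so the branch by +4 is taken.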

Most assemblers would reserve a General Purpose register as a 'scratchpad'. Some would force R0=0 as a known reference.
The obvious advantages are that such an implementation is very simple and takes little silicon area, allowing for easy development and implementation.
Among the much larger disadvantages:

There are no explicit memory system reads and writes (they could be embedded into register functionality).
While the assembler is easy, the compiler is more complicated, since every operation has to be built out of subtractions.
A simple multiply or divide could take hundreds of cycles if not more.

Other Instructions

Add

ADD could be implemented by subtracting one addend from 0 and then subtracting that difference from the other addend (A + B = A - (0 - B)).
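
As a quick check of that identity, here is a small Python sketch. The helper names sub and add are hypothetical; sub stands in for the one operation the datapath provides.

```python
def sub(minuend, subtrahend):
    """Stand-in for the only arithmetic the datapath offers."""
    return minuend - subtrahend


def add(a, b):
    """Compute a + b with two subtractions: a - (0 - b)."""
    negated_b = sub(0, b)       # first subtract: scratch = 0 - b
    return sub(a, negated_b)    # second subtract: result = a - scratch


print(add(4, 3))   # 7
print(add(-5, 2))  # -3
```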

Multiply

MULT R3 = R1*R2 could be implemented as:

Setting R3 to 0 (SUB R3,0,0)
SUB R1,0,R1,1 (R1=-R1, branch to the next instruction always)
SUB R3,R3,R1 (R3-=R1), "adding" the original R1 to R3 once per pass through the loop.
SUB R2,R2,1 (R2-=1), which counts the passes and leaves the loop once R2 goes negative.
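
To make the loop concrete, here is a minimal Python sketch of a tiny SBN interpreter running a multiply. Several things are assumptions made for illustration only: the name run_sbn, the tuple layout (dest, minuend, subtrahend, offset), stopping when the program counter leaves the program, and the use of unbounded Python integers. The sketch also decrements R2 before the add and closes the loop with an always-negative subtract into R0, details the list above leaves implicit. R3 is assumed to start at 0.

```python
def run_sbn(program, regs, max_steps=10_000):
    """Run (dest, minuend, subtrahend, offset) tuples until pc leaves the program.

    Operands written as "Rn" are registers; plain integers are constants.
    Assumptions: execution stops when pc falls outside the program, and
    Python integers replace the fixed-width registers.
    """
    def value(operand):
        return regs[int(operand[1:])] if isinstance(operand, str) else operand

    pc = steps = 0
    while 0 <= pc < len(program):
        if steps >= max_steps:
            raise RuntimeError("step limit reached")
        dest, minuend, subtrahend, offset = program[pc]
        result = value(minuend) - value(subtrahend)
        index = int(dest[1:])
        if index != 0:  # R0 stays 0
            regs[index] = result
        pc += offset if result < 0 else 1
        steps += 1
    return regs


multiply = [
    ("R1", 0, "R1", 1),     # R1 = -R1, next instruction either way
    ("R2", "R2", 1, 3),     # R2 -= 1, leave the loop once R2 goes negative
    ("R3", "R3", "R1", 1),  # R3 -= R1, i.e. add the original R1
    ("R0", 0, 1, -2),       # 0 - 1 is always negative: jump back to the decrement
]

regs = [0] * 32
regs[1], regs[2] = 4, 3
run_sbn(multiply, regs)
print(regs[1], regs[2], regs[3])  # -4 -1 12
```

The sketch only handles a non-negative multiplier in R2; a negative R2 would leave the loop immediately with R3 still 0.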

Divide

Divides (R3 = R1/R2) could be implemented by subtracting R2 from R1 until the result goes negative, while counting the subtractions that leave a non-negative result by subtracting -1 from a counter. The counter gives the quotient, and re-adding the divisor (R2) to the negative result gives the remainder.
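
A minimal Python sketch of that repeated-subtraction divide follows. The function name divide is hypothetical, and the sketch assumes a non-negative dividend and a positive divisor. It counts a subtraction only when the result stays non-negative, and it adds the divisor back to the final negative value to recover the remainder.

```python
def divide(dividend, divisor):
    """Return (quotient, remainder) using only subtraction.

    Assumption: dividend >= 0 and divisor > 0; other signs are not handled.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles dividend >= 0 and divisor > 0 only")
    quotient = 0
    remainder = dividend
    while True:
        remainder = remainder - divisor    # SBN: branch out when negative
        if remainder < 0:
            break
        quotient = quotient - (-1)         # "subtract -1" to count up
    remainder = remainder - (0 - divisor)  # add the divisor back
    return quotient, remainder


print(divide(13, 4))  # (3, 1)
print(divide(12, 4))  # (3, 0)
```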

High Level Languages

A simple if/else such as the one below can be translated into two instructions:

 
if (var < 2) 
  MyFunc(otherVar);
else 
  j=j+1;

// Assuming variable "var" is in R2 and j is in R3.

SUB R0,R2,2,#myFunc // R2-2 is negative when var < 2, so branch to myFunc
SUB R3,R3,-1,1      // Subtract -1 from R3, branch by 1 no matter what
..
.
.
#myFunc:

Relevance to prior work experience

On the TMS320C6x architecture (TMS320C6201, 02 and 03) I worked on the Program Fetch, Data Path, Host Port, DMA and EMIF, as well as JTAG integration/testing and full-chip integration.

Implementation

For demo purposes I've picked a simple non-pipelined, single-threaded implementation. An AXI-Lite slave interface on the AXI bus provides memory-mapped I/O for communicating with the datapath.
The registers are mapped as follows:

Register Offset	Value
0x00	Instruction 0
...
0x3C	Instruction 15
0x40	Done (bit 2, RO), Reset (bit 1, RW), Enable (bit 0, RW)
0x44	Datapath Register Write Enable (bit 8), Datapath Register File Address (bits 4-0)
0x48	Datapath Register File Write
0x4C	Datapath Register File Read
0x50-0x70	Reserved
0x74	debug
0x78	Instruction count
0x7C	Checksum (0xDEBAC7E). In real implementations this might be a peripheral ID to ease driver enablement.

For simplicity I allow for Register File reads and writes from the "control" registers.
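
The register map can be captured as a handful of constants and bit helpers. This is a sketch only: the constant and function names are hypothetical and are not taken from the project's driver code, and it covers just the offsets and bit positions listed in the table, not the actual AXI access sequence.

```python
# Byte offsets taken from the register map above.
INSTRUCTION_BASE = 0x00   # Instruction 0 .. Instruction 15, 4 bytes apart
CONTROL = 0x40
REGFILE_ADDR = 0x44
REGFILE_WRITE = 0x48
REGFILE_READ = 0x4C
INSTRUCTION_COUNT = 0x78
CHECKSUM = 0x7C


def instruction_offset(index):
    """Byte offset of instruction slot 0..15."""
    if not 0 <= index <= 15:
        raise ValueError("only 16 instruction slots are mapped")
    return INSTRUCTION_BASE + 4 * index


def decode_control(word):
    """Split the 0x40 control word into its three documented bits."""
    return {
        "enable": bool(word & 0x1),  # bit 0, RW
        "reset": bool(word & 0x2),   # bit 1, RW
        "done": bool(word & 0x4),    # bit 2, RO
    }


def regfile_address_word(register, write_enable):
    """Build the 0x44 word: write enable in bit 8, register number in bits 4-0."""
    if not 0 <= register <= 31:
        raise ValueError("register number must fit in bits 4-0")
    return (int(write_enable) << 8) | register


print(hex(instruction_offset(15)))         # 0x3c
print(decode_control(0x6))                 # {'enable': False, 'reset': True, 'done': True}
print(hex(regfile_address_word(3, True)))  # 0x103
```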

Pipelines and Delayed Branches

A pipeline could be implemented using a delayed branch mechanism, where a branch instruction and the instruction that follows it are both guaranteed to execute (on the C6x we had 5 delay slots).

Multithreading

Since this processor doesn't offer a call stack or return instruction, there is no way to implement the context switching needed for multithreading.

Assembly

//  Instruction format:
//     32 registers, 32 bits wide, R0 is always 0
//
//  All numbers are signed. 
//
//   Bits 7-0   subtrahend
//   Bits 15-8  minuend
//   Bits 23-16 destination (always a register number)
//   Bit 24     subtrahend is constant (1), register (0)
//   Bit 25     minuend is constant (1), register (0)
//   Bits 31-28 branch offset
//
//   Mnemonic is:   SBN R5,R6,R0,5     # R5=R6-R0, Branch by signed(5) if negative (1 otherwise)
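
The bit layout above can be packed with a short Python helper. This is a sketch, not the project's assembler: the name encode_sbn and its keyword arguments are hypothetical, the branch offset is assumed to be a 4-bit two's-complement field, constants are assumed to be 8-bit two's-complement values, and bits 27-26 are left at zero. Under those assumptions the helper reproduces the four non-zero instruction words shown in the run log in the Running section.

```python
def encode_sbn(dest, minuend, subtrahend, offset,
               minuend_const=False, subtrahend_const=False):
    """Pack one SBN instruction into a 32-bit word.

    Assumptions: offset is a signed 4-bit field, constants are signed
    8-bit values, and the unlisted bits 27-26 are zero.
    """
    if not 0 <= dest <= 31:
        raise ValueError("destination must be a register number 0..31")
    if not -8 <= offset <= 7:
        raise ValueError("branch offset must fit in 4 signed bits")
    for operand, is_const in ((minuend, minuend_const), (subtrahend, subtrahend_const)):
        low, high = (-128, 127) if is_const else (0, 31)
        if not low <= operand <= high:
            raise ValueError("operand out of range")
    word = subtrahend & 0xFF               # bits 7-0
    word |= (minuend & 0xFF) << 8          # bits 15-8
    word |= dest << 16                     # bits 23-16
    word |= int(subtrahend_const) << 24    # bit 24
    word |= int(minuend_const) << 25       # bit 25
    word |= (offset & 0xF) << 28           # bits 31-28
    return word


print(hex(encode_sbn(1, 0, 1, 1, minuend_const=True)))     # 0x12010001
print(hex(encode_sbn(2, 2, 1, 3, subtrahend_const=True)))  # 0x31020201
print(hex(encode_sbn(3, 3, 1, 0)))                         # 0x30301
print(hex(encode_sbn(0, 0, 1, -2, minuend_const=True, subtrahend_const=True)))  # 0xe3000001
```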

Github

All of my code is on my GitHub. This includes an assembler, source and test code. Obviously a more serious example would have far more test coverage. To run, provide a .sbn file (in /testcases/testSourceAssembly) to the assembler. This will generate a JSON file (in testcases/testAssembled), and the results can be compared with the files in testcases/testResults.

Running

The easiest way to run this is to examine the RunPushtoZYNQAndTest.py script. Be sure to also pull the files it indicates (also automatically generated). If you are really interested I can provide bit files.
You may also run example code, as in this example:

root@pynq:/home/xilinx/SubtractBranchNegative# ./RunSBNOverlay.py Multiply4And3.sbn.json Multiply4And3.result.json 
CONTROL Reads 0x6
Control Register 0 0x12010001
Control Register 1 0x31020201
Control Register 2 0x30301
Control Register 3 0xe3000001
Control Register 4 0x0
Control Register 5 0x0
Control Register 6 0x0
Control Register 7 0x0
Control Register 8 0x0
Control Register 9 0x0
Control Register 10 0x0
Control Register 11 0x0
Control Register 12 0x0
Control Register 13 0x0
Control Register 14 0x0
Control Register 15 0x0
Control Register 16 0x6
Control Register 17 0xe
Control Register 18 0x0
Control Register 19 0x0
Control Register 20 0x4
Control Register 21 0x0
Control Register 22 0x0
Control Register 23 0x0
Control Register 24 0x0
Control Register 25 0x0
Control Register 26 0x0
Control Register 27 0x0
Control Register 28 0x0
Control Register 29 0x0
Control Register 30 0xc
Control Register 31 0xdebac7e
Datapath Register 0 Actual  0x0 Expected 0x0
Datapath Register 1 Actual  0xfffc Expected 0xfffc
Datapath Register 2 Actual  0xffff Expected 0xffff
Datapath Register 3 Actual  0xc Expected 0xc
Datapath Register 4 Actual  0x0 Expected 0x0
Datapath Register 5 Actual  0x0 Expected 0x0
Datapath Register 6 Actual  0x0 Expected 0x0
Datapath Register 7 Actual  0x0 Expected 0x0
Datapath Register 8 Actual  0x0 Expected 0x0
Datapath Register 9 Actual  0x0 Expected 0x0
Datapath Register 10 Actual  0x0 Expected 0x0
Datapath Register 11 Actual  0x0 Expected 0x0
Datapath Register 12 Actual  0x0 Expected 0x0
Datapath Register 13 Actual  0x0 Expected 0x0
Datapath Register 14 Actual  0x0 Expected 0x0
PASS!