24 Comments

You forgot CRISP/Hobbit, another Dave Ditzel design for AT&T Microelectronics, which was almost picked up by Apple for the Newton before they switched to ARM.

author

Hi Andrew, I thought about including Hobbit but in the end decided it didn't really qualify as a RISC machine - really a separate class of 'stack cache' machine I think. Although Dave Ditzel worked on the project, I don't think that it's really a descendant of RISC-I or MIPS in the way most of the other designs are.

Happy to be proven wrong on this though!

May 5, 2023 · edited May 5, 2023 · Liked by Babbage

Yeah, not really a RISC at all. Its opcodes were frequency-encoded so you had some very common short instructions, and 5- or 7-byte rarer ones, though I think they all got decoded to a common VLIW internal format.

We used it in the EO Communicator, using the Penpoint OS. One of my tasks was to code Penpoint's method-dispatching inner loop in Hobbit assembler from the original C. I got about a 3% speedup, which at least shows what an optimal "C Machine" it was!

That project had the opposite arc to the Newton: we were supposed to use the ARM, but AT&T were funding EO, and imposed Hobbit on us.

author

Hi Peter, Thanks so much for this - really interesting on coding in Hobbit assembly. Do you have a view as to how good an architecture Hobbit was? Do you think it could have been competitive against ARM or was the approach fundamentally flawed?

May 7, 2023 · edited May 7, 2023 · Liked by Babbage

Interesting question! I think it had one big advantage over the ARM then, which was that it dealt natively with 16-bit quantities, whereas at that time ARM had no 16-bit load/store. This meant that a Hobbit could compile something like "a++" for a C local short as two 2-byte instructions:

addshort acc, [sp+#a], #1
move [sp+#a], acc

(using my own assembly syntax!) whereas ARM would need at least several 32-bit instructions:

ldr r0, [sp+#a]
add r0, r0, #1
strb r0, [sp+#a]
mov r0, r0, lsr #8
strb r0, [sp+#a+1]

The compiler could do much better if its register allocation was good enough and "a" had a dedicated register, but it's definitely a possible worst case, and the normal case for something like (*a)++ where you only had a pointer to "a".
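In C terms, the byte-at-a-time dance that a halfword store forced on the original ARM looks roughly like this (a sketch of my own, assuming a little-endian layout; the helper name is illustrative):

```c
#include <stdint.h>

/* Sketch: incrementing a 16-bit variable on a machine with no
   halfword load/store, so each byte must be moved separately
   (little-endian byte order assumed). */
static void inc_short(uint16_t *a) {
    uint8_t *p = (uint8_t *)a;
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8); /* two byte loads */
    v++;
    p[0] = (uint8_t)v;        /* store low byte  */
    p[1] = (uint8_t)(v >> 8); /* store high byte */
}
```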

Of course ARM later introduced Thumb and halfword load/stores, but they weren't available at the time, and the importance of C shorts to Penpoint really sounded the death knell for ARM on that project (political considerations aside!).

I'm not enough of an architecture expert to know whether CRISP could ultimately have held its own against the later ARM revisions, or even the internal improvements Intel and AMD made to the x86 line. It was certainly competitive against the 386 that Penpoint was developed on, both in terms of speed and (especially) power, but I think ultimately it might have been regarded as too specialized an instruction set, and maybe limited by its 2½-address model.

author

Hi Peter, Thanks, that's an interesting code comparison. I hadn't appreciated the original ARM's lack of halfword load/stores. It's interesting the extent to which ARM evolved over the years - and of course ARM64 looks nothing like the original ARM architecture. I'm sure CRISP / Hobbit would have evolved too if only AT&T hadn't dropped it!

I was at AT&T around that time (first at AT&T Computer Systems, later AT&T Microelectronics working on the DSP3210 that went into Apple's AV-MAC line and in some PC products). I recall Ditzel giving a talk in person to us in the '80s on the Hobbit architecture, which was fascinating. However, I also distantly remember it being later claimed that while the stack cache was great at modeling C code with minimal instructions, it created some hardware limitations on how fast the chip could be clocked and pipelined. If true, CRISP long term had no future. It's instructive that the next design Ditzel worked on, SPARC (which I also worked on SW for at AT&T; lots of stories on that one), had a register-window architecture which was similar in function to the CRISP stack cache but was simpler to implement in hardware. Many people think SPARC descended from MIPS, but in some ways SPARC descended from CRISP. I never personally worked with CRISP (although I spent years working on ARM at MSFT).

Obligatory code sample: My favorite processor I worked with was the Philips Trimedia, which was a VLIW design. This was a very cool chip that could execute (IIRC) five instructions in parallel, had great conditional testing/execution, a wonderful C compiler, but for which jumps were fairly expensive. I recall coming up with code in the ISR to find the lowest set bit in a 32-bit value (knowing at least one bit is set) to determine the highest-priority interrupt being signaled. It looked something like this (assuming mask is an unsigned 32-bit int):

mask &= -mask; // Clear all but the lowest set bit
bit = (((mask & 0xFFFF0000) != 0) << 4) |
      (((mask & 0xFF00FF00) != 0) << 3) |
      (((mask & 0xF0F0F0F0) != 0) << 2) |
      (((mask & 0xCCCCCCCC) != 0) << 1) | // binary 11001100...
       ((mask & 0xAAAAAAAA) != 0);        // binary 10101010...

This compiled to something like 3-4 lines of assembly (i.e. 3-4 cycles), with no jumps.
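The same bit trick is easy to sanity-check in portable C. A minimal sketch (the function name and the use of uint32_t are mine, not the original Trimedia code):

```c
#include <stdint.h>

/* Isolate the lowest set bit, then derive its index with five
   parallel mask tests -- no loops or branches needed. */
static int lowest_set_bit(uint32_t mask) {
    mask &= -mask; /* unsigned negate is well-defined: keeps lowest set bit */
    return (((mask & 0xFFFF0000u) != 0) << 4) |
           (((mask & 0xFF00FF00u) != 0) << 3) |
           (((mask & 0xF0F0F0F0u) != 0) << 2) |
           (((mask & 0xCCCCCCCCu) != 0) << 1) | /* binary ...11001100 */
            ((mask & 0xAAAAAAAAu) != 0);        /* binary ...10101010 */
}
```

For example, lowest_set_bit(0x80000000) returns 31 and lowest_set_bit(0x0C) returns 2.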

May 3, 2023 · Liked by Babbage

Internally, I think it was called P7 (not to be confused with a different P7 which was an x86 project).

I've seen documentation, but it was labelled INTEL confidential even long after it was cancelled.

Details will be posted someday, I'm sure.

author

It's always confusing when they re-use project names! Thanks again.

May 2, 2023 · Liked by Babbage

Why is the 6502 not considered RISC? It has a small instruction set, and most of the time, instructions finish in 2, 3 or 4 cycles.

author

Hi, That's a great question. My understanding is that the 6502 is almost seen as a separate category because it's so simple.

It's also not a 'load/store' architecture where only load and store instructions interact with memory - INC and ASL for example can change memory directly - which is usually one of the criteria for being classified as RISC.

In several retrospectives of the development of ARM, I've heard Sophie Wilson give a nod to Berkeley RISC.

author

Hi David, Yes I think that's right. They saw the MIPS papers too. Interesting, though, that they didn't follow the RISC-I ideas precisely - there are no register windows in ARM, for example.

I was lucky enough to get a demo of ARM BASIC from Sophie Wilson after work one day, first running under the ARM emulator she wrote for the NS16032 (about the only thing the 16032 ended up being good for!) and then on one of the very first ARM boards used as a second processor to the BBC Micro. Of course the speed of ARM BASIC running even on those early 4MHz ARMs was astonishing compared to the original 6502 version (which itself was regarded as a very nippy BASIC at the time).

Replying to myself because Babbage's comment below doesn't have a reply button! I was at Acorn, the company that developed the ARM. This was several years before ARM the company was formed.

No, the BASIC was for ARM, not the 16032, but Sophie wrote an instruction-level ARM emulator for the 16032 so ARM code could be tested before the silicon came back from VTI.

Yes, Sophie's 6502 BASIC was amazing, so you can imagine what she managed to do with an ARM instruction set that she herself designed! I remember one of the tricks in BASIC was to munge two-byte GOTO line numbers into a three-byte format that didn't contain any inconvenient values that would prevent the line being scanned as fast as possible.

author

Hi Peter, Sorry - yes of course it would have been Acorn - thanks for correcting me. I was in Cambridge at the time so should have remembered that! Interesting on the 16032 - I don't think I'd read anywhere that the 16032 second processor had a role in the ARM's development.

author

Wow, that's great. Were you at ARM at the time? I didn't know that there was a BBC BASIC for the 16032 - but thinking about it I might have actually used it in the Cambridge Computer Lab in 1986ish - my memory is a bit vague by now! I remember poking around the disassembly of the 6502 BBC Basic which Sophie Wilson wrote by hand and being blown away by some of the techniques to get better performance.

May 1, 2023 · Liked by Babbage

Prior to Itanium, Intel was also defining an Alpha-like 64-bit RISC processor.

It got cancelled when HP proposed the Itanium collaboration.

author

Hi Allen, Thanks so much for commenting. I vaguely remember this but was never able to track any details down. Did it ever get a name?

BTW Thanks so much for doing your Oral History at the CHM a few years ago. Terrific discussion covering so much ground - you were involved in so many fascinating projects, with so many interesting people, over the years. Wish it had been twice as long!

I got to use several Sun SPARC systems at Uni, and MIPS (R3000) systems from DEC and Sony, and later an R4400 from SGI. A lot of SGI machines used an i860 for the GPU, but I don't think that the cheap(er) one that I used did. I bought myself an ARM2 for home. The very first SPARC machines were quite brittle in their performance: they could be fast, but it wasn't hard to find code that was slower on them than the older 68030-based workstations. The later ones (guessing SPARCv7 or v8?) were properly well rounded and performed beautifully. The DEC and Sony MIPS boxes were really nice. The floating point units were excellent (for the time), and performance was reliable. Ultrix, SunOS, Sony NEWS were all Berkeley 4.2 or 4.3 based, so there was a great deal of source-code compatibility, even if the executables weren't. This was before the days of shared libraries, so they were simpler times. I used some lovely X terminals that were based on the i960, but just like the i860 in the SGI systems, they were effectively sealed units: there was no way to actually code for them.

Interesting to read you say that the Motorola 88000 was "the fastest" when released: was it ever used in any product? I can't think of any. I don't know whether that was because it didn't work, or was just overtaken by the PowerPC collaboration?

author

Hi Andrew, Thanks for a great comment. On the 88000, I think that Motorola claimed it was the fastest microprocessor when it was released - but that could have just been marketing fluff!

I'm not aware of it being used in any product outside of Motorola. There was an interesting piece recently from the Computer History Museum about Gary Davidian writing a 68000 emulator for the 88000. They even got as far as building a prototype Mac. But, of course, we know that eventually Apple and Motorola went down the PowerPC route and the 88000 was history.

https://computerhistory.org/blog/transplanting-the-macs-central-processor-gary-davidian-and-his-68000-emulator/

I was an engineer at Apollo Computer in the late '80s and early '90s. Here is how the DN10000’s unusual ISA came to be.

With little more than a 1970 high school education, I first learned programming in assembler on the PDP-11 (an early CISC ISA with 2-, 4- and 6-byte instruction lengths). I later joined one of DEC's compiler groups. There I built the recursive descent frontend, the code selector and the code emitter for the from-scratch VAX Pascal compiler. That, of course, immersed me in another CISC ISA.

In the early 1980s Apollo Computer was getting started. They had chosen Pascal as their system programming language. Apollo's User Environment group was familiar with Smalltalk and dreamt of adding object-oriented concepts to the Apollo Pascal compiler. The compiler group proper was not interested in working on such a project, so the User Environment group hired me. That Pascal language extension work turned out not to be a real project, so I found myself with a lot of free time to explore what else was happening in my building. Turned out, just down the hall was the team that had recently built Tern, a 3x faster 2900 bit-sliced knockoff of the M68K. With no immediate next assignment, they were bouncing around RISC ideas. Things changed when Sun stole a march on Apollo and introduced the SPARCstation-1. Apollo needed a response. A crash program was kicked off. Michael Sporer was appointed lead architect and Rick Bahr lead CPU designer.

One of Michael's first contributions was to identify a planned Honeywell CMOS "FPU on a chip" as the part to be used for the new machine's FP functionality. This was to be an awesome chip:

* Pipelined FP adder

* Pipelined FP multiplier with asynchronous divider

* 5 read port / 3 write port 32 DP FP register file (also addressable as 64 SP locations)

* full bypassing

Crucially, the control interface allowed simultaneous issue of

* An operation to the FP ALU pipeline

* An operation to the FP multiplier pipeline

* A write to the register file load port

* A read from the register file store port

With this promised chip in mind, Michael challenged the team to figure out an ISA that would allow us to keep it close to saturated.

Separately, the graphics team had been doing performance projections based on various ISA assumptions. This being the graphics team, their most important kernel was streaming points (3 SP components) past a register-file-resident transform matrix and delivering integerized results to the rasterizers. The performance projections came down to hand software-pipelining various kernels based on resource assumptions. A classic RISC design assumed that an FP operate instruction took an entire issue slot, just like every other instruction. By contrast, accepting Michael's vision led to a nearly 3x speed-up. That was enough to keep the dream alive even while all the hardware designers, including Rick Bahr, told Michael that it could not be done.

It was into this tug-o-war that I ventured. Not being a hardware designer and having only just started to read the RISC papers, but having a history of coding in assembler on CISC architectures, I started churning out ISA proposals with variable-length instructions. To help make my proposals more grounded I befriended Steve Ciavaglia, who owned the instruction decode and integer datapath. Steve spent generous amounts of time teaching me about registers and pipelining.

To say that my proposals were not warmly received would be to put it mildly. In fact, at one point Rick said to me "John, when I present this project at a computer architecture conference, there is no way that I will get up before my peers and announce that we responded to Sun's RISC attack by designing an ISA with variable length instructions".

Still, the performance benefits of parallel issue were so compelling that Rick had to, at least, look at each proposal I brought forward. Over time, with Steve's tutelage, combined with my perspective as a programmer and compiler writer, I came to see the real contribution of RISC as a focus on maximally simple pipelines. Clearly, one contribution to such simplicity is an insistence that all memory accesses be naturally aligned. On the data side this was already RISC dogma. On the instruction side, supporting a single 4-byte instruction format was one trivial way to satisfy natural alignment. But it was not the only way.

That insight led me to the idea of an instruction stream composed of both 4-byte and 8-byte instructions, with the requirement that every 8-byte instruction had to be naturally aligned on an 8-byte boundary. The 4-byte instruction format would include an "FP companion" bit which could only validly be set if the 4-byte instruction's address was 8-byte aligned. The 4-byte instruction with the FP companion bit would always be routed to the integer execution unit. (This meant that the entire integer instruction set needed to be encoded within 31 bits, a constraint that other RISC ISAs did not share.) The 4-byte companion FP instruction, when present, would be routed to the Honeywell FP chip. But even with a full 32 bits to play with, this was not enough to achieve Michael's vision of full independent control of both FP pipes. Per pipe I probably would have needed about 23 bits:

* 1 bit - SP versus DP datatype

* 4 bits - pipe-specific operation

* 6 bits - SP source register A address

* 6 bits - SP source register B address

* 6 bits - SP destination register address
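
The alignment rule can be sketched in C as a toy decode step (the bit position, field layout and struct names are my invention for illustration; the real DN10000 encoding differs):

```c
#include <stdbool.h>
#include <stdint.h>

#define FP_COMPANION_BIT 0x80000000u /* hypothetical bit position */

typedef struct {
    uint32_t integer_op; /* always routed to the integer unit */
    uint32_t fp_op;      /* routed to the FP chip when paired */
    bool     paired;     /* true when this slot dual-issues */
} issue_slot;

/* Decode one slot at byte address addr. The companion bit may only
   take effect when the instruction sits on an 8-byte boundary, so a
   paired 8-byte instruction is always naturally aligned. */
static issue_slot decode(const uint32_t *imem, uint32_t addr) {
    issue_slot s = { imem[addr / 4], 0, false };
    if ((addr % 8 == 0) && (s.integer_op & FP_COMPANION_BIT)) {
        s.fp_op  = imem[addr / 4 + 1];
        s.paired = true;
    }
    return s;
}
```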

One additional insight had emerged from the graphics team's performance investigations. The 3x speed up did not require the full generality of Michael's challenge. It was not necessary to be able to issue any one of the many FP ALU operations in parallel with any one of the many FP multiplier operations in parallel with an integer machine instruction (which now would include all loads and stores). The following limitations seemed to have no impact on the performance benefits of parallel issue:

* Support a very limited "5 operand" FP instruction format where

- Both the FP ALU and the FP multiplier had to operate in the same mode (SP or DP)

- The FP multiply operation could be fully general in the sense of having 3 6-bit register operands

- The FP ALU operation would have only 2 6-bit operands and had to be an add, a subtract or a truncate to integer

* Support one fully general operation on a single FP pipe, providing 3 6-bit register operands

Encoding this scheme in 32 bits took some creativity but proved doable.

When I showed this design to Rick he was less dismissive, but still did not believe that the interface between the I-cache and the instruction decode could be achieved except by introducing an additional pipeline stage. At that point I had learned enough from Steve to draw a picture that was effectively the top half of the first figure in US patent 5051885 (https://patents.google.com/patent/US5051885A/en?oq=US5051885A).

At that point Rick relented and Michael blessed my proposal. I migrated to the compiler group and led development of the new backend needed for our brand-new ISA.

What is the point of crediting the first RISC? It was just a revival of the simple load-store architectures of the 1950s and 1960s. Who gets credit for being the first to revive something? Usually - no one - but in this one case - Answer: professors at top ranked universities! Their job is to take credit and hype their shit even if it's not exactly innovative ...

author

Hi, The point, I think, is that it reversed the trend towards more and more complex ISAs like the VAX and S/360. The proof of its importance is that people stopped designing CISC architectures after RISC appeared on the scene.
