Stack Computers: 4.4 ARCHITECTURE OF THE NOVIX NC4016

Stack Computers: the new wave
© Copyright 1989,
Philip Koopman,
All Rights Reserved.

Chapter 4. Architecture of 16-bit Systems


4.4.1 Introduction

The Novix NC4016, formerly called the NC4000, is a 16-bit stack based
microprocessor designed to execute primitives of the Forth programming
language. It was the first single-chip Forth computer to be built, and
originated many of the features found on subsequent designs. Intended
applications are real time control and high speed execution of the Forth
language for general purpose programming.

The NC4016 uses dedicated off-chip stack memories for the Data Stack and
the Return Stack. Since three separate groups of pins connect the two stacks
and the RAM data bus to the NC4016, it can execute most instructions in a
single clock cycle.

4.4.2 Block diagram

Figure 4.6 shows the block diagram of the NC4016.

[Figure 4.6]
Figure 4.6 — NC4016 block diagram.

The ALU section contains a 2-element buffer for the top elements of the
data stack (T for Top data stack element, and N (Next) for the second-from-top
data stack element). It also contains a special MD register for support of
multiplication and division as well as an SR register for fast integer square
roots. The ALU may perform operations on the T register and any one of the N,
MD, or SR registers.

The Data Stack is an off-chip memory holding 256 elements. The data stack
pointer is on-chip and provides a stack address to the off-chip memory. A
separate 16-bit stack data bus allows the Data Stack to be read or written in
parallel with other operations. As noted previously, the top two Data Stack
elements are buffered by the T and N registers in the ALU.
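
As a rough illustration of this buffering, the following Python sketch models a stack whose top two elements sit in T and N while deeper elements spill to a 256-word memory. The class and method names, the pointer direction, and the order of the transfers are assumptions made for the sketch, not details taken from the NC4016 documentation.

```python
class DataStack:
    """Top two elements live in T and N; deeper ones spill to a 256-word memory."""

    DEPTH = 256

    def __init__(self):
        self.t = 0                     # top of stack (T register)
        self.n = 0                     # second element (N register)
        self.mem = [0] * self.DEPTH    # stands in for the off-chip stack RAM
        self.sp = 0                    # stack pointer (assumed 8 bits wide)

    def push(self, value):
        # N spills to stack memory, T moves down to N, the new value enters T.
        self.sp = (self.sp + 1) % self.DEPTH
        self.mem[self.sp] = self.n
        self.n = self.t
        self.t = value & 0xFFFF

    def pop(self):
        # T is returned, N moves up to T, and N is refilled from stack memory.
        value = self.t
        self.t = self.n
        self.n = self.mem[self.sp]
        self.sp = (self.sp - 1) % self.DEPTH
        return value
```

In this model the pointer simply wraps modulo 256, so pushing too many values silently overwrites older entries.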

The Return Stack is a separate memory that is very similar to the Data
Stack, with the exception that only the top return stack element is buffered
on-chip, in the Index register. Since Forth keeps loop counters as well as
subroutine return addresses on the return stack, the Index register can be
decremented to implement countdown loops efficiently.

The stacks do not have on-chip underflow or overflow protection. In a
multitasking environment, an off-chip stack page register can be controlled
using the I/O ports to give each task a separate piece of a larger than 256
word stack memory. This gives hardware protection to avoid one task
overwriting another task’s stack, and reduces context swapping overhead to a
minimum.
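
One way to picture the page register scheme is as simple address concatenation. The sketch below is a guess at the arrangement: the function name and the idea that the page value forms the high-order address bits above the 8-bit stack pointer are assumptions, since the external circuit is not described here.

```python
STACK_PAGE_WORDS = 256  # words reachable through the on-chip stack pointer


def stack_address(page, sp):
    """Form a stack memory address from a page register value and the stack pointer.

    The page supplies the high-order bits; only the low 8 bits of sp are used,
    so a task cannot reach outside its own 256-word page.
    """
    return (page * STACK_PAGE_WORDS) | (sp & 0xFF)
```

Under this assumption a context switch only needs to write a new page value through an I/O port, rather than copy stack contents.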

The Program Counter points to the location of the next instruction to be
fetched from external program memory. It is automatically altered by the jump,
loop, and subroutine call instructions. Program memory is arranged in 16 bit
words. Byte addressing is not directly supported.

The NC4016 also has two I/O buses leading off-chip on dedicated pins. The
B-port is a 16-bit I/O bus, and the X-port is a 5-bit I/O bus. The I/O ports
allow direct access to I/O devices for control applications without stealing
bandwidth from the memory bus. Some bits of the I/O ports can also be used to
extend the program memory address space by providing high order memory address
bits.

The NC4016 can use four separate 16-bit busses for data transfers on every
clock cycle for high performance (program memory, Data Stack, Return Stack, and
I/O busses).

4.4.3 Instruction set summary

The NC4016 pioneered the use of unencoded instruction formats for stack
machines. In the NC4016 the ALU instruction is formatted in independent fields
of bits that simultaneously control different parts of the machine, much like
horizontal microcode. The NC4016, and many of its Forth processor successors,
are the only 16-bit computers that use this technique. Using an unencoded
instruction format allows simple hardware decoding of instructions. Figure 4.7
shows the instruction formats for the NC4016.

[Figure 4.7a]
Figure 4.7(a) — NC4016 instruction formats — subroutine call.

Figure 4.7a shows the instruction format for subroutine calls. In this
format, the highest bit of the instruction is set to 0, and the remainder of
the instruction is used to hold a 15-bit subroutine address. This limits
programs to 32K words of memory.
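
The call format described above can be decoded with two masks. This is a minimal sketch; the function names are made up for illustration.

```python
def is_call(instr):
    """A 16-bit instruction word is a subroutine call when its highest bit is 0."""
    return (instr & 0x8000) == 0


def call_target(instr):
    """The remaining 15 bits hold the subroutine address (0 to 32767)."""
    return instr & 0x7FFF
```

The largest address that fits is 0x7FFF, which is where the 32K word limit comes from.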

[Figure 4.7b]
Figure 4.7(b) — NC4016 instruction formats — conditional branch.

Figure 4.7b shows the conditional branch instruction format. Bits 12 and
13 select either a branch if T is zero, an unconditional branch, or a decrement
and branch-if-zero using the index register for implementing loops. Bits 0-11
specify the lowest 12 bits of the target address, restricting the branch target
to be in the same 4K word block of memory as the branch instruction.
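
The block restriction follows from how the target address is formed. The sketch below assumes the upper four address bits come from the address of the branch instruction itself, as the text suggests; the hardware may actually use the address of the following instruction, which this section does not say.

```python
def branch_target(branch_addr, instr):
    """Combine the upper 4 bits of the branch's address with the low 12 instruction bits."""
    return (branch_addr & 0xF000) | (instr & 0x0FFF)
```

For example, a branch located at address 0x2FFF can reach 0x2000 through 0x2FFF but nothing in the block starting at 0x3000.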

[Figure 4.7c]
Figure 4.7(c) — NC4016 instruction formats — ALU operation.

Figure 4.7c shows the format of the ALU instruction. This instruction has
several bit fields that control various resources on the chip. Bits 0 and 1
control the operation of the shifter at the ALU output. Bit 2 specifies a
nonrestoring division cycle. Bit 3 enables shifting of the T and N registers
connected as a 32-bit shift register.

Bit 5 of the ALU instruction indicates a subroutine return operation. This
allows subroutine returns to be combined with preceding arithmetic operations
to obtain “free” subroutine returns in many cases.

Bit 6 specifies whether a stack push is to be accomplished. It, combined
with bit 4, controls pushing and popping stack elements.

Bits 7 and 8 control the input select for the ALU and can also specify
a step for iterative multiply or square root functions. Bits 9-11 specify the
ALU function to be performed.
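
The field layout just described can be summarized as a decoder. This sketch only splits a word at the bit positions given in the text; the field names are invented, the meaning of individual field values is not modeled, and bits 12-15 (which select the instruction format) are ignored.

```python
from typing import NamedTuple


class AluFields(NamedTuple):
    shift: int     # bits 0-1: shifter control at the ALU output
    divide: int    # bit 2: nonrestoring division cycle
    shift32: int   # bit 3: shift T and N as one 32-bit register
    stack: int     # bit 4: with bit 6, controls pushing and popping
    ret: int       # bit 5: subroutine return
    push: int      # bit 6: stack push
    alu_in: int    # bits 7-8: ALU input select, multiply or square root step
    alu_fn: int    # bits 9-11: ALU function


def decode_alu(instr: int) -> AluFields:
    """Split a 16-bit word into the independent control fields."""
    return AluFields(
        shift=instr & 0x3,
        divide=(instr >> 2) & 0x1,
        shift32=(instr >> 3) & 0x1,
        stack=(instr >> 4) & 0x1,
        ret=(instr >> 5) & 0x1,
        push=(instr >> 6) & 0x1,
        alu_in=(instr >> 7) & 0x3,
        alu_fn=(instr >> 9) & 0x7,
    )
```

Because every field is extracted independently, no lookup table is needed, which is the sense in which the format is unencoded.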

[Figure 4.7d]
Figure 4.7(d) — NC4016 instruction formats — memory reference.

Figure 4.7d shows the format of a memory reference instruction. These
instructions take two clock cycles: one cycle for the instruction fetch, and
one clock cycle for the actual reading or writing of the operand. The address
for the memory access is always taken from the T register. Bit 12 indicates
whether the operation is a memory read or write. Bits 0-4 specify a small
constant that can be added or subtracted to the T value to perform
autoincrement or autodecrement addressing functions. Bits 5-11 of this
instruction specify ALU and control functions almost identical to those used in
the ALU instruction format.
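
A small sketch of the memory reference fields and the address adjustment follows. The names are hypothetical, the polarity of bit 12 (which value means read) is left uninterpreted because the text does not give it, and the way add versus subtract is selected is represented by a plain flag.

```python
def decode_memref(instr: int) -> dict:
    """Extract the fields of a memory reference instruction by bit position."""
    return {
        "rw": (instr >> 12) & 0x1,        # bit 12: read versus write
        "alu_ctl": (instr >> 5) & 0x7F,   # bits 5-11: ALU and control functions
        "const": instr & 0x1F,            # bits 0-4: small constant (0 to 31)
    }


def step_address(t: int, const: int, subtract: bool = False) -> int:
    """Adjust the address in T by the small constant, wrapping at 16 bits."""
    delta = -const if subtract else const
    return (t + delta) & 0xFFFF
```

With a constant of 1, repeated accesses walk through consecutive words of an array without a separate increment instruction.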

[Figure 4.7e]
Figure 4.7(e) — NC4016 instruction formats — user space/register

Figure 4.7e shows the miscellaneous instruction format. This instruction
can be used to read or write a 32-word “user space” residing in the
first 32 words of program memory, saving the time taken to push a memory
address on the stack before performing the fetch or store. It can also be used
to transfer values between registers within the chip, or push either a 5-bit
literal (in a single clock cycle) or a 16-bit literal (in two clock cycles)
onto the stack. Bits 5-11 of this instruction specify ALU and control
functions very similar to those in the ALU instruction format.

The NC4016 is specifically designed to execute the Forth language. Because
of the unencoded format of many of the instructions, machine operations that
correspond to a sequence of Forth operations can be encoded in a single
instruction. Table 4.2 shows the Forth primitives and instruction sequences
supported by the NC4016.

:  (subroutine call)     AND                           
;  (subroutine exit)     BRANCH
!                        DROP
+                        DUP
-                        I
0                        LIT
0<                       NOP
0BRANCH                  OR
1+                       OVER
1-                       R>
2*                       R@
>R                       SWAP
@                        XOR

Table 4.2(a) NC4016 Instruction Set Summary — Forth Primitives.
(see Appendix B for descriptions)

nn                       @ +                           
nn !                     @ +c
nn +                     @ -
nn +c                    @ -c
nn -                     @ SWAP -
nn -c                    @ SWAP -c
nn @                     @ OR
nn @ +                   @ XOR
nn @ +c                  @ AND
nn @ -                   DROP DUP
nn @ -c                  DUP nn !
nn @ AND                 DUP nn ! +
nn @ SWAP -              DUP nn ! -
nn @ SWAP -c             DUP nn ! AND
nn @ OR                  DUP nn ! OR
nn @ XOR                 DUP nn ! SWAP -
nn AND                   DUP nn ! XOR
nn I@                    DUP nn I!
nn I@ +                  DUP nn I! +
nn I@ -                  DUP nn I! -
nn I@ AND                DUP nn I! AND
nn I@ OR                 DUP nn I! OR
nn I@ SWAP -             DUP nn I! SWAP -
nn I@ XOR                DUP nn I! XOR
nn I@!                   DUP @ SWAP nn +
nn I!                    DUP @ SWAP nn -
nn OR                    OVER +
nn SWAP -                OVER +c
nn SWAP -c               OVER -
nn XOR                   OVER -c
lit +                    OVER SWAP -
lit +c                   OVER SWAP -c
lit -                    R> DROP
lit -c                   R> SWAP >R
lit AND                  SWAP -
lit OR                   SWAP -c
lit SWAP -               SWAP DROP
lit SWAP -c              SWAP OVER !
lit XOR                  SWAP OVER ! nn +
                         SWAP OVER ! nn -

Notes: "nn" represents a 5 bit literal or user offset value.
       "lit" represents a 16 bit literal stored in the memory
       location after the instruction.

Table 4.2(b) NC4016 Instruction Set Summary — Compound Forth

nn I@                 -> N         ->
   Fetch the value from internal register nn (stored as a 5
     bit literal in the instruction).

nn I!               N ->           ->
   Store N into the internal register nn (stored as a 5 bit
     literal in the instruction).

+c              N1 N2 -> N3        ->
   Add with carry (using internal carry bit)

-c              N1 N2 -> N3        ->
   Subtract with borrow (using internal carry bit)

*'                 D1 -> D2        ->
   Unsigned Multiply step (takes two 16 bit numbers and
     produces a 32 bit product).

*-                 D1 -> D2        ->
   Signed Multiply step (takes two 16 bit numbers and produces
     a 32 bit product).

*F                 D1 -> D2        ->
   Fractional Multiply step (takes two 16 bit fractions and
     produces a 32 bit product).

*/'                D1 -> D2        ->
   Divide step (takes a 16 bit dividend and divisor and
     produces a 16 bit remainder and quotient).

*/''               D1 -> D2        ->
   Last Divide step (to perform non-restoring division fixup).

2/                 N1 -> N2        ->
   Arithmetic shift right (same as division by two for
   non-negative integers).

D2/                D1 -> D2        ->
   32 bit arithmetic shift right (same as division by two for
   non-negative integers).

S'                 D1 -> D2        ->
   Square Root step.

TIMES                 ->        N1 -> N2
   Count-down loop using top of return stack as a counter.

Table 4.2(c) NC4016 Instruction Set Summary — Special Purpose Words.
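
To show how a step-wise multiply can build a 32 bit product from two 16 bit operands, here is the textbook shift-and-add algorithm, applied once per bit. It is offered only as an illustration of the general technique; the register assignments and the exact behavior of the NC4016 multiply step words are not documented in this section, and the names below are assumptions.

```python
def mul_step(hi, lo, multiplicand):
    """One unsigned shift-and-add step on a 32-bit (hi, lo) register pair.

    lo initially holds the multiplier; its low bit decides whether the
    multiplicand is added into hi before the pair is shifted right by one.
    """
    if lo & 1:
        hi += multiplicand               # may carry into bit 16
    lo = ((hi & 1) << 15) | (lo >> 1)    # low bit of hi shifts into lo
    hi >>= 1                             # the carry bit shifts down into bit 15
    return hi, lo


def umul16(a, b):
    """Multiply two unsigned 16-bit values with 16 repeated steps."""
    hi, lo = 0, b & 0xFFFF
    for _ in range(16):
        hi, lo = mul_step(hi, lo, a & 0xFFFF)
    return (hi << 16) | lo
```

Sixteen steps consume all multiplier bits, leaving the upper half of the product in hi and the lower half in lo.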

4.4.4 Architectural features

The internal structure of the NC4016 is designed for single clock cycle
instruction execution. All primitive operations except memory fetch, memory
store, and long literal fetch execute in a single clock cycle. This requires
many more on-chip interconnection paths than are present on the Canonical Stack
Machine, but provides much better performance.

The NC4016 allows combining nonconflicting sequential operations into the
same instruction. For example, a value can be fetched from memory and added to
the top stack element using the sequence @ + in a Forth program. These
operations can be combined into a single instruction on the NC4016.
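
A compiler could perform this combining with a simple peephole pass. The sketch below is one possible approach, not the method used by any Novix compiler: it greedily merges adjacent word pairs, using a small subset of the pairs listed in Table 4.2(b).

```python
# A few two-word sequences that Table 4.2(b) lists as single instructions.
COMPOUNDS = {
    ("@", "+"), ("@", "-"), ("OVER", "+"), ("SWAP", "-"),
    ("SWAP", "DROP"), ("R>", "DROP"), ("DROP", "DUP"),
}


def combine(words):
    """Merge adjacent Forth words into one instruction where a compound exists."""
    out = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in COMPOUNDS:
            out.append(" ".join(pair))   # two source words, one instruction
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out
```

For instance, `combine(["DUP", "@", "+", "SWAP", "DROP"])` yields three instructions instead of five.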

The NC4016 subroutine return bit allows combining a subroutine return with
other instructions in a similar manner. This results in most subroutine exit
instructions executing “for free” in combination with other
instructions. An optimization that is performed by NC4016 compilers is
tail-end recursion elimination. Tail-end recursion elimination involves
replacing a subroutine call/subroutine exit instruction pair by an
unconditional branch to the subroutine that would have been called.
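
The tail-end recursion elimination described above can be sketched as a pass over a list of symbolic instructions. The tuple representation is invented for this example, and the sketch assumes that nothing else branches to the exit being removed and that the callee lies within reach of a branch instruction; a real compiler would have to check both.

```python
def eliminate_tail_calls(code):
    """Replace each ("call", target) followed by ("exit",) with ("branch", target)."""
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if op[0] == "call" and nxt == ("exit",):
            out.append(("branch", op[1]))   # callee's own exit returns to our caller
            i += 2
        else:
            out.append(op)
            i += 1
    return out
```

The rewritten routine saves one instruction and one return stack push per tail call.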

Another innovation of the NC4016 is the mechanism to access the first 32
locations of program memory as global “user” variables. This
mechanism can ease problems associated with implementing high level languages
by allowing key information for a task, such as the pointer to an auxiliary
stack in main memory, to be kept in a rapidly accessible variable. It also
allows reasonable performance using high level language compilers, which may
have originally been developed for register machines, by allowing the 32
fast-access variables to be used to simulate a register set.
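
As a loose illustration of treating the user space as a register file, the following sketch maps hypothetical register names onto the 32 user offsets. The names `REGS`, `user_fetch`, and `user_store` are made up for the example and say nothing about how an actual compiler allocates these locations.

```python
USER_WORDS = 32  # the first 32 words of program memory

# Hypothetical register names r0..r31 mapped onto user-space offsets.
REGS = {f"r{i}": i for i in range(USER_WORDS)}


def user_fetch(memory, nn):
    """Read user variable nn; the offset must fit in the 5-bit instruction field."""
    if not 0 <= nn < USER_WORDS:
        raise ValueError("user offset must fit in 5 bits")
    return memory[nn]


def user_store(memory, nn, value):
    """Write a 16-bit value to user variable nn."""
    if not 0 <= nn < USER_WORDS:
        raise ValueError("user offset must fit in 5 bits")
    memory[nn] = value & 0xFFFF
```

Since the offset is carried in the instruction, no address has to be pushed on the stack before the access.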

4.4.5 Implementation and featured application areas

The NC4016 is implemented using fewer than 4000 gates on a 3.0 micron HCMOS
gate array technology, packaged in a 121 pin Pin Grid Array (PGA). The NC4016
runs at up to 8 MHz.

When the NC4016 was designed, gate array technology did not permit placing
the stack memories on-chip. Therefore a minimum NC4016 system consists of
three 16-bit memories: one for programs and data, one for the data stack, and
one for the return stack.

Because the NC4016 executes most instructions, including conditional
branches and subroutine calls, in a single cycle, there is a significant amount
of time between the beginning of the clock cycle and the time that the memory
address is valid for fetching the next instruction. This time is approximately
half the clock cycle, meaning that program memory must respond in roughly
half a clock period, or about twice as fast as the clock rate alone would
suggest.
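
The timing constraint can be put into numbers with a short calculation. This sketch assumes the address becomes valid exactly half-way through the cycle and ignores setup, hold, and decoding delays, so the result is only a rough upper bound on the usable access time.

```python
def memory_access_budget_ns(clock_hz, address_delay_fraction=0.5):
    """Time left in a clock cycle for program memory after the address is valid."""
    period_ns = 1e9 / clock_hz
    return period_ns * (1.0 - address_delay_fraction)
```

At the 8 MHz maximum clock rate the period is 125 ns, leaving about 62.5 ns for the memory access under these assumptions.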

The NC4016 was originally designed as a proof-of-concept and prototype
machine. It therefore has some inconveniences that can be largely overcome by
software and external hardware. For example, the NC4016 was intended to handle
interrupts, but a bug in the gate array design causes improper interrupt
response. Novix has since published an application note showing how to use a
20-pin PAL to overcome this problem. A successor product will eliminate these
implementation difficulties and add additional capabilities.

The NC4016 is aimed at the embedded control market. It delivers very high
performance with a reasonably small system. Among the appropriate applications
for the NC4016 are: laser printer control, graphics CRT display control,
telecommunications control (T1 switches, facsimile controllers, etc.), local
area network controllers, and optical character recognition.

The information in this section is derived from Golden et al. (1985),
Miller (1987), Stephens & Watson (1985), and Novix’s Programmers’
Introduction to the NC4016 Microprocessor (Novix 1985).

