Getting Started with W65C832

Introduction

The W65C832 FPGA Core (by Michael Kohn) is based on the WDC 65C816 ISA, but offers additional 32-bit register modes. It was originally designed for the iceFUN iCE40 HX8K board, and includes peripheral modules for SPI, UART, and speaker tones.

A 32-bit capable upgrade to the 65C816 was planned, and this core seems largely in keeping with that idea. After writing and testing code for it during development (fixing a few bugs/quirks in the Verilog along the way), I think it's a good realization of what might have been.

Demo Program

I thought it would be fun to translate the 3D cube routine shown at the end of my Apple IIgs demo for java_grinder. It also does a lot of math and memory access, which made for a good test.

Things were slow going at first: some bugs in the core needed to be ironed out, and it can be a little tricky to program. The demo uses a 6 KB framebuffer (96x64), which is sent to the LCD every frame. Screen updates were sluggish at the 12 MHz speed of the iceFUN, so I added a PLL module to the core which sped it up to around 50 MHz. (The CPU module runs at half the clock speed, so 25 MHz effectively.)
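
To make the framebuffer numbers concrete, here is a small Python sketch (not part of the demo source). It assumes a row-major layout with one byte per pixel, which is only inferred from 96x64 fitting in 6 KB; the names WIDTH, HEIGHT, and pixel_offset are made up for illustration.

```python
# Assumption: row-major layout, one byte per pixel (inferred from the
# 6 KB size; the real demo may organize its buffer differently).
WIDTH, HEIGHT = 96, 64
BYTES_PER_PIXEL = 1

def framebuffer_size():
    """Total buffer size in bytes."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL

def pixel_offset(x, y):
    """Byte offset of pixel (x, y) from the start of the buffer."""
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        raise ValueError("pixel outside the 96x64 display")
    return (y * WIDTH + x) * BYTES_PER_PIXEL

print(framebuffer_size())     # 6144 bytes, i.e. 6 KB
print(pixel_offset(95, 63))   # 6143, the last byte of the buffer
```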

This greatly improved the screen updates, and also allowed the 3D cube movements to get fancier. There was also enough extra memory to add a "rotozoom" effect, which I translated from some Java game code I wrote years ago. It also reuses the sine tables from the cube part.

The code is not perfect and I didn't even try to optimize it that much. I don't really know the "best" way to program this thing, but hopefully the demo serves well enough as an example.

Source Code: roto_cube.asm

Tile Data: tile.bin

Register Modes

The current register mode is set with a combination of four flags, which are set to one by default:

E16 - 65C816 emulation flag
E8  - 65C02 emulation flag

M   - accumulator size
X   - index register size
  

There are four sections based on E16/E8, with the following accumulator/index register modes:

65C02 emulation (E16 = 1, E8 = 1):
  8/8 (M = 1, X = 1)

65C816 emulation (E16 = 1, E8 = 0):
  8/8 (M = 1, X = 1)
  8/16 (M = 1, X = 0)
  16/8 (M = 0, X = 1)
  16/16 (M = 0, X = 0)

65C832 native half (E16 = 0, E8 = 0):
  8/8 (M = 1, X = 1)
  8/32 (M = 1, X = 0)
  16/8 (M = 0, X = 1)
  16/32 (M = 0, X = 0)

65C832 native full (E16 = 0, E8 = 1):
  8/8 (M = 1, X = 1)
  8/32 (M = 1, X = 0)
  32/8 (M = 0, X = 1)
  32/32 (M = 0, X = 0)
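
The table above can be restated as a small lookup. This is only a Python sketch of the listing (the function name register_widths is hypothetical, and nothing like it exists in the core); it simply returns the accumulator/index widths for a given set of flags.

```python
def register_widths(e16, e8, m, x):
    """Return (accumulator_bits, index_bits) per the mode table above."""
    if e16 == 1 and e8 == 1:
        # 65C02 emulation: the table lists only 8/8.
        return (8, 8)
    if e16 == 1:
        wide_acc, wide_index = 16, 16   # 65C816 emulation
    elif e8 == 0:
        wide_acc, wide_index = 16, 32   # 65C832 native half
    else:
        wide_acc, wide_index = 32, 32   # 65C832 native full
    # M = 1 or X = 1 selects the 8-bit size for that register.
    return (8 if m else wide_acc, 8 if x else wide_index)

print(register_widths(0, 1, 0, 0))   # (32, 32)
print(register_widths(0, 0, 0, 0))   # (16, 32)
```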
  

M/X are status register bits which are set (or reset) with SEP/REP instructions. E16/E8 are internal, and accessed indirectly by swapping the carry and/or overflow flags. Going out of 65C02 emulation (and into 65C816 mode) is a simple matter of clearing the carry flag and exchanging it with E8:

clc        ; E8 = 0
xce        ; exchange carry with E8
  

At this point, XCE is treated as a different instruction: XFE. This exchanges carry with E8, and the overflow flag with E16. Through XFE, 32-bit capable modes are now available by switching off E16, and setting E8 to choose between "native half" or "native full".

For example, "native full" is enabled with:

clv        ; E16 = 0
sec        ; E8 = 1
xce        ; really XFE, exchange carry with E8 and overflow with E16
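
The two sequences above can be traced with a toy model. This Python sketch only mirrors the behavior described in the text: it assumes XCE swaps just carry/E8 while in 65C02 emulation and acts as XFE otherwise. The Flags class is made up for illustration and says nothing about how the Verilog is actually written.

```python
class Flags:
    """Toy model of carry, overflow, and the two emulation flags."""

    def __init__(self):
        self.c = 0
        self.v = 0
        self.e8 = 1    # both emulation flags are one by default
        self.e16 = 1

    def clc(self): self.c = 0
    def sec(self): self.c = 1
    def clv(self): self.v = 0

    def xce(self):
        # Assumption: plain XCE only applies in 65C02 emulation.
        in_65c02 = self.e16 == 1 and self.e8 == 1
        self.c, self.e8 = self.e8, self.c
        if not in_65c02:
            # Treated as XFE: overflow is exchanged with E16 as well.
            self.v, self.e16 = self.e16, self.v

f = Flags()
f.clc(); f.xce()             # leave 65C02 emulation
print(f.e16, f.e8)           # 1 0  -> 65C816 emulation
f.clv(); f.sec(); f.xce()    # really XFE
print(f.e16, f.e8)           # 0 1  -> 65C832 native full
```

Note that after each exchange the old E8/E16 values end up in carry/overflow.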
  

Then SEP/REP are used to choose one of four modes:

sep #0x30  ; M = 1, X = 1

sep #0x20  ; M = 1
rep #0x10  ; X = 0

rep #0x30  ; M = 0, X = 0
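
For reference, 0x20 and 0x10 are the M and X bits of the status register, as on the 65C816. A minimal Python sketch of what SEP (set bits) and REP (reset bits) do to that byte follows; the helper names are made up for illustration.

```python
M_FLAG = 0x20   # accumulator size bit
X_FLAG = 0x10   # index register size bit

def sep(p, mask):
    """SEP: set the status bits selected by mask."""
    return (p | mask) & 0xFF

def rep(p, mask):
    """REP: reset (clear) the status bits selected by mask."""
    return p & ~mask & 0xFF

def mode_bits(p):
    """Return (M, X) extracted from status byte p."""
    return ((p & M_FLAG) >> 5, (p & X_FLAG) >> 4)

p = 0x30                 # M = 1, X = 1
p = rep(p, 0x30)
print(mode_bits(p))      # (0, 0)
p = sep(p, 0x20)
p = rep(p, 0x10)
print(mode_bits(p))      # (1, 0)
```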
  

The processor status register retains M/X, but not E16/E8. Therefore PHP/PLP will only save/restore one of the four modes available within each section. (Since XFE swaps E8/E16 with carry/overflow, it's technically possible to save/restore any mode with extra effort.)

Assembler

I wrote the original 65816 assembler module for naken_asm, which was updated (by Mike) to support W65C832 (enabled by the .65832 directive).

Dot suffixes are used to force the storage size for immediate values:

lda.b #0x55          ; load 8-bit accumulator
lda.w #0x5555        ; load 16-bit accumulator
ldx.l #0x5555_5555   ; load 32-bit index register
  

It's important to pay close attention to these suffixes when loading immediates and match them to the current register mode. They are generally not required otherwise (except to force addressing modes in some cases).
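
To see why a mismatch matters, consider how many bytes each suffix puts after the opcode. The Python sketch below is not naken_asm code; it uses the standard 65xx immediate opcodes (0xA9 for LDA, 0xA2 for LDX) and assumes a 32-bit immediate is stored as four little-endian bytes.

```python
LDA_IMM = 0xA9   # LDA #imm
LDX_IMM = 0xA2   # LDX #imm
SUFFIX_BYTES = {"b": 1, "w": 2, "l": 4}   # assumption: .l means 4 bytes

def encode_immediate(opcode, value, suffix):
    """Return opcode followed by the immediate in little-endian order."""
    size = SUFFIX_BYTES[suffix]
    if not 0 <= value < (1 << (8 * size)):
        raise ValueError("value does not fit in the requested size")
    return bytes([opcode]) + value.to_bytes(size, "little")

print(encode_immediate(LDA_IMM, 0x55, "b").hex())         # a955
print(encode_immediate(LDA_IMM, 0x5555, "w").hex())       # a95555
print(encode_immediate(LDX_IMM, 0x55555555, "l").hex())   # a255555555
```

If the CPU is in an 8-bit accumulator mode when it reaches the second encoding, it would take only one byte as the operand and try to decode the leftover 0x55 as the next opcode.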

Some "traditional" syntax is supported (<, !, >) for byte extraction or forcing certain addressing modes, and the dollar-sign ($) for hexadecimal, but only what works for 65C816.

Variables do not require any particular alignment, although care should be taken to space 16- or 32-bit values properly. The assembler supports various directives for this purpose.

Bootloader

A USB serial-based XMODEM bootloader is included in the source which is covered separately here. It avoids having to reprogram the entire core just to change a program.

ROM and RAM Sizes

The iceFUN HX8K has 16 KB of block RAM (also used for ROM by setting initial values and disabling writes). Since the bootloader requires less than 256 bytes of ROM, I reduced the size accordingly:

// changes to rom.v:
module rom
(
  input [7:0] address,
  output reg [7:0] data_out,
  input clk
);

reg [7:0] memory [255:0];

...

// changes to memory_bus.v:
rom rom_0(
  .address   (address[7:0]),
  .data_out  (rom_data_out),
  .clk       (raw_clk)
);
  

This makes it possible to increase RAM in a similar fashion:

// changes to ram.v:
module ram
(
  input  [13:0] address,
  input  [7:0] data_in,
  output reg [7:0] data_out,
  input write_enable,
  input clk
);

reg [7:0] memory [11263:0];

...

// changes to memory_bus.v:
ram ram_0(
  .address      (address[13:0]),
  .data_in      (data_in),
  .data_out     (ram_data_out),
  .write_enable (ram_write_enable),
  .clk          (raw_clk)
);
  

(I was only able to push RAM to 11 KB, perhaps because the core itself is using some of the block RAM for other things, but I'm not sure.)
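
As a rough budget check, the sizes above can be converted to address bits and block RAM usage. This Python sketch assumes the HX8K's 32 block RAMs of 4 Kbit each (16 KB total) and perfect packing of byte-wide memories; how the tools really map the arrays may differ, and it does not determine what uses the remaining blocks.

```python
EBR_BLOCKS = 32        # block RAMs on the HX8K
EBR_BITS = 4096        # bits per block

def address_bits(depth):
    """Address width needed to index `depth` byte locations."""
    return max(1, (depth - 1).bit_length())

def blocks_needed(depth_bytes):
    """Block RAMs needed, assuming ideal packing (an assumption)."""
    return -(-(depth_bytes * 8) // EBR_BITS)   # ceiling division

rom, ram = 256, 11264
print(address_bits(rom), address_bits(ram))    # 8 14
print(blocks_needed(rom), blocks_needed(ram))  # 1 22
print(EBR_BLOCKS - blocks_needed(rom) - blocks_needed(ram))  # 9 left over
```

The address widths match the [7:0] and [13:0] ports used in rom.v and ram.v above.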

Increasing Speed

The iceFUN board defaults to 12 MHz (6 MHz for the CPU), but its PLL (phase-locked loop) circuitry allows it to run at 50 MHz or more (25 MHz for the CPU). This feature isn't included in the Verilog source at this time, but it isn't very hard to add.

In icefun.pcf, I changed "raw_clk" to "main_clk". This becomes the input to the PLL module, with the output named "raw_clk". Now anything in the core that uses "raw_clk" will be using the faster PLL clock instead. Some other changes are required:

In w65c832.v, add:

  wire raw_clk;
  wire pll_lock;
  

Replace this:

always @(posedge raw_clk) begin
  count <= count + 1;
  clock_div <= clock_div + 1;
end
  

with this:

always @(posedge raw_clk) begin
  if (pll_lock == 1) begin
    count <= count + 1;
    clock_div <= clock_div + 1;
  end else begin
    count <= 0;
    clock_div <= 0;
  end
end
  

Then before the memory_bus section near the end of the file, add:

pll pll_0
(
  .clock_in (main_clk),
  .clock_out (raw_clk),
  .locked (pll_lock)
);
  

Now it's time to generate a pll.v file. This is done with the icetime and icepll tools from Project IceStorm (commonly installed alongside yosys). The first thing to do is build the core, then run:

icetime -d hx8k w65c832.asc

It will report a timing estimate, which can vary somewhat with different builds. Assuming 50 MHz is within the reported range, review the output of icepll:

icepll -i 12 -o 50

Sometimes the "achieved" frequency is slightly different from the "requested" one. If everything looks good, run the command again to generate pll.v:

icepll -i 12 -o 50 -m -f pll.v
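
The reason the "achieved" frequency can differ is that the PLL output is built from integer dividers. The Python sketch below is not icepll itself; it brute-forces the simple-feedback formula F_out = F_in * (DIVF + 1) / ((DIVR + 1) * 2^DIVQ), using limits (10 to 133 MHz at the phase detector, 533 to 1066 MHz for the VCO, DIVQ from 1 to 6) that reflect my understanding of the iCE40 PLL and should be treated as assumptions.

```python
def best_pll_settings(f_in, f_target):
    """Return (divr, divf, divq, f_out) closest to f_target (MHz)."""
    best = None
    for divr in range(16):
        f_pfd = f_in / (divr + 1)
        if not 10 <= f_pfd <= 133:        # assumed phase detector range
            continue
        for divf in range(128):
            f_vco = f_pfd * (divf + 1)
            if not 533 <= f_vco <= 1066:  # assumed VCO range
                continue
            for divq in range(1, 7):
                f_out = f_vco / (1 << divq)
                err = abs(f_out - f_target)
                if best is None or err < best[0]:
                    best = (err, divr, divf, divq, f_out)
    if best is None:
        raise ValueError("no valid divider combination found")
    return best[1:]

print(best_pll_settings(12, 50))   # (0, 66, 4, 50.25)
```

With a 12 MHz input, the closest this search gets to 50 MHz is 12 * 67 / 16 = 50.25 MHz.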

Copy pll.v to the src directory. The last thing to do is change the UART to support the higher speed:

In uart.v, increase the sizes of the divisor registers:

reg [12:0] tx_divisor;
reg [11:0] rx_divisor;
  

Then find places in the code that have 1249 or 624, and increase those to 5235 and 2618 (these are based on the 50.250 MHz value reported by icepll).
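
As a sanity check on those numbers, here is a short Python sketch. It assumes the UART runs at 9600 baud (which is what 12 MHz divided by 1249 + 1 works out to) and that the new values are clocks-per-bit and clocks-per-half-bit rounded up; that reproduces the values above, though it may not be exactly how they were derived.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

OLD_CLOCK_HZ = 12_000_000
NEW_CLOCK_HZ = 50_250_000               # 50.250 MHz from icepll

BAUD = OLD_CLOCK_HZ // (1249 + 1)       # assumption: 9600 baud
full_bit = ceil_div(NEW_CLOCK_HZ, BAUD)       # clocks per bit
half_bit = ceil_div(NEW_CLOCK_HZ, 2 * BAUD)   # clocks per half bit

print(BAUD, full_bit, half_bit)                       # 9600 5235 2618
print(full_bit.bit_length(), half_bit.bit_length())   # 13 12
```

The bit lengths line up with the [12:0] and [11:0] register widths shown above.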

Now rebuilding the core should have it running at the new frequency.

Note: occasionally a "bad" build can result in strange behavior, including the XMODEM bootloader not working correctly. Rebuilding the core typically resolves the problem.