perilus, Part Seven
2026-05-03
TL;DR: I used Verilator to convert the Verilog
version of perilus into a C++ library. Then I linked that
against a Rust binary with ratatui to expose a simulation
frontend. Running the simulator revealed some logic bugs in the memory
module, so I fixed those. I also built the backend of an assembler.
Having built up enough of perilus to at least nominally
support RV32I, my next goal was to build an interactive simulation
program that would instantiate a perilus machine and let me
peek/poke its state. I imagined this as analogous to the large panels of
lights and switches on the earliest electronic computers. No shell, just
registers, memory, and the ability to step the clock.
The first step towards this goal was to use Verilator. The ChiselSim
test system already uses Verilator internally, but I needed to do this
in an explicit and standalone way. The Verilator docs are pretty good,
so I was able to take the Verilog output from my Chisel code and run it
through Verilator to produce the source code for a C++ library that
implements perilus.
I wanted to write the frontend in Rust, so I wrote a wrapper
in C++ that uses the Verilator API to export functions for the
operations I needed (instantiation, destruction, and read/write for
relevant parts of the processor state). I also had to write a corresponding
file in Rust that declares the same functions as extern
and exports its own API for more idiomatic control. Then I wrote a build.rs
that compiles the wrapper and the generated C++ library before linking
them to the Rust binary. After some fighting with C++ (or really my
ignorance of it), I had a mostly empty main.rs with the
ability to create a Perilus struct and peek/poke it like I
wanted.
I didn’t know anything about ratatui except that I
wanted to use it, so I learned just enough to display the registers,
memory, program counter, and control unit state. I also added basic
controls for editing values and pulsing the clock. The simplest input
method I could come up with was to listen for any hex character and
shift it into the least-significant nibble of the highlighted word. This
doesn’t require any widget state for sub-word cursor position.
Shift+c also clears the word. This was enough to
start programming.
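A minimal sketch of that input scheme in Rust, not the actual frontend code. The function name `edit_word` is hypothetical, and it assumes the key arrives as a plain `char` with Shift+c showing up as uppercase `C`. Accepting the other uppercase hex letters as digits is also an assumption.

```rust
/// Apply one keypress to the highlighted 32-bit word.
///
/// - 'C' (Shift+c) clears the word.
/// - Any other hex digit is shifted into the least-significant nibble,
///   pushing the old top nibble out.
/// - Every other key leaves the word unchanged.
fn edit_word(word: u32, key: char) -> u32 {
    if key == 'C' {
        return 0; // checked first, because 'C' is also a valid hex digit
    }
    match key.to_digit(16) {
        Some(digit) => (word << 4) | digit,
        None => word,
    }
}
```

Because each keypress is a pure function of the current word and the key, the widget only has to track which word is highlighted, not a cursor position inside it.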

I wanted to begin working towards an assembler, so I wrote out some routines for assembling R-Type and I-Type instructions on paper and assembled them by hand. Once I’d gotten these working, I was able to use them to assemble further routines for other instruction types. For any instructions that didn’t have an assembler routine yet, I still had to assemble them by hand. Building the assembler backend wasn’t a straightforward process because I didn’t have a clear design in my head or on the page, and I couldn’t make one until I understood what was sensible under these unfamiliar constraints. I had to keep reminding myself that I only need this to be a bootstrap assembler, so it doesn’t need to be beautiful or even complete. It just needs to be a useful tool for long enough to get me to a higher level of abstraction and speed.
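For reference, here is what R-Type and I-Type assembly amounts to, written as a Rust sketch rather than as the hand-assembled RV32I routines described above. The field positions follow the standard RV32I instruction formats. The function names `encode_r` and `encode_i` are hypothetical and are not part of perilus.

```rust
/// Pack an R-Type instruction.
/// Layout (MSB to LSB): funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
fn encode_r(funct7: u32, rs2: u32, rs1: u32, funct3: u32, rd: u32, opcode: u32) -> u32 {
    ((funct7 & 0x7f) << 25)
        | ((rs2 & 0x1f) << 20)
        | ((rs1 & 0x1f) << 15)
        | ((funct3 & 0x7) << 12)
        | ((rd & 0x1f) << 7)
        | (opcode & 0x7f)
}

/// Pack an I-Type instruction.
/// Layout (MSB to LSB): imm[31:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
/// The immediate is truncated to its low 12 bits (two's complement).
fn encode_i(imm: i32, rs1: u32, funct3: u32, rd: u32, opcode: u32) -> u32 {
    (((imm as u32) & 0xfff) << 20)
        | ((rs1 & 0x1f) << 15)
        | ((funct3 & 0x7) << 12)
        | ((rd & 0x1f) << 7)
        | (opcode & 0x7f)
}
```

For example, `encode_r(0, 2, 1, 0, 3, 0x33)` gives `0x002081b3` (add x3, x1, x2), and `encode_i(5, 0, 0, 1, 0x13)` gives `0x00500093` (addi x1, x0, 5).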
Eventually I settled on a design for an intermediate representation for instructions and their operands. Each instruction in this representation takes two words. The first word only uses the two least-significant bytes, which encode the instruction type and certain other fields that can be deduced from the instruction name alone. The other word encodes the operands, which are specific to each instruction type. I designed this IR to be easy to read and write in the hex memory view of the simulator. This results in a lot of wasted bits, but I was willing to accept the trade. The machine code below is a routine that takes the two IR words as arguments and returns the assembled instruction. It should work for any instruction type in RV32I and should even be able to assemble itself, but I haven’t tried that yet.

The loop snippet reads two words
from the pointer in s0, loads them into a0 and
a1, calls the routine at s1, and writes the
return value to the pointer in s2 before incrementing the
pointers and looping. I copied this by hand from the simulator because
it has no nonvolatile memory, and I enjoy pretending that I don’t have
access to any other computer for this part of the project.
By this point I’d filled several pages with assembly scribbles and had gone quite mad. But I also had a viable assembler backend, and I had constructed it under my vague constraints of bootstrapping as much as I cared to. I started to pick up some speed when I used the assembler to assemble a loop to read two words from a pointer, set them as arguments, call the assembler routine, and store the result at another pointer. This allowed me to write IR into a chunk of memory and quickly assemble it into executable code, which would sometimes become part of the assembler itself.
At some point during my assembly adventure, I found a bug in the
memory module. The bug was due to my own misunderstanding about how
sub-word access works. I thought that, for example, a
lb (load byte) instruction would always read from the
lowest byte in a word and write to the lowest byte in the target
register, but that’s not the case. The actual behavior is that it truly
reads from the specified target byte in memory and sets the entire
target register equal to that single byte’s value alone (with sign
extension as appropriate). In other words, if memory contains the bytes
0x12 0x34 0x56 0x78 starting at address 0x0 and I run lb x5, 0x1(x0),
then the contents of x5 should be 0x00000034
because the byte at address 0x1 is 0x34. In
the process, the byte is shifted to the least-significant position in
the target register. I had similar misunderstandings about how store
instructions work.
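The corrected load behavior can be modeled in a few lines of Rust. This is a software sketch of the semantics, not the Chisel implementation. The function name `load_byte` and the `signed` flag (true for lb, false for lbu) are hypothetical, and the word is assumed to be stored little-endian, as RISC-V normally is.

```rust
/// Model of a byte load from one 32-bit memory word.
///
/// `word` is the little-endian memory word containing the target byte,
/// `addr` is the byte address (only its low two bits select the lane),
/// and `signed` selects sign extension (lb) versus zero extension (lbu).
fn load_byte(word: u32, addr: u32, signed: bool) -> u32 {
    let shift = (addr & 0b11) * 8; // which byte lane to read
    let byte = (word >> shift) as u8; // move it to the least-significant position
    if signed {
        byte as i8 as i32 as u32 // sign-extend to 32 bits
    } else {
        byte as u32 // zero-extend to 32 bits
    }
}
```

With the bytes 0x12 0x34 0x56 0x78 at addresses 0 through 3, the word is `u32::from_le_bytes([0x12, 0x34, 0x56, 0x78])`, and `load_byte(word, 1, true)` yields `0x00000034`.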
This required some
substantial changes to the memory module to implement correct reads
and writes. It turns out that Chisel has support for masked
writes in its Mem abstraction, but I needed to change
each memory element into a vector of bytes instead of 32-bit integers to
make it work. The rest was just a lot of careful management of different
shifting and masking cases. I’m still not sure that I got it all
correct, but I added a lot more tests to try to cover everything.
Unaligned accesses just round down to the nearest boundary for now and
aren’t tested, but from what I’ve read it seems like a good response in
this situation is to trap and halt. Some implementations also emulate
the unaligned access at potentially significant time cost, but I don’t
think I’m going to get perilus to that level of
sophistication anytime soon.
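The shifting and masking cases for stores look roughly like this when modeled in Rust, with one memory word held as four byte lanes. This is an illustrative sketch, not the Chisel module. The names `Width`, `store_mask`, and `store` are hypothetical, and rounding an unaligned address down to the access’s natural boundary is my reading of the behavior described above.

```rust
/// Store width in bytes.
#[derive(Clone, Copy)]
enum Width {
    Byte = 1,
    Half = 2,
    Word = 4,
}

/// Compute the first byte lane written and the per-lane write mask.
/// Unaligned addresses are rounded down to the natural boundary (assumption).
fn store_mask(addr: u32, width: Width) -> (usize, [bool; 4]) {
    let size = width as u32;
    let offset = (addr & 0b11) & !(size - 1);
    let mut mask = [false; 4];
    for i in 0..size {
        mask[(offset + i) as usize] = true;
    }
    (offset as usize, mask)
}

/// Write the low `width` bytes of `value` into the selected lanes of `word`,
/// leaving the unmasked lanes untouched.
fn store(word: &mut [u8; 4], addr: u32, value: u32, width: Width) {
    let (offset, mask) = store_mask(addr, width);
    let bytes = value.to_le_bytes();
    for lane in 0..4 {
        if mask[lane] {
            word[lane] = bytes[lane - offset];
        }
    }
}
```

A byte store of `0xAABBCCDD` to address 1 touches only lane 1 and writes `0xDD` there. A halfword store to address 3 rounds down to lane 2 and writes `0xDD` and `0xCC` into lanes 2 and 3.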
I also learned that Chisel’s Mem generally
doesn’t get synthesized to actual RAM silicon in most FPGA targets
because its reads are combinational (i.e., data appears immediately when an
address is asserted). Instead, it gets synthesized to ordinary
registers. SyncReadMem
has a similar API but its reads are sequential, so it usually gets synthesized to
real RAM. Someday I want to switch to SyncReadMem, but this
will probably mean a redesign of the memory module’s interface, since
the current setup relies on combinational reads.
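The interface problem is easier to see in a toy software model. The Rust sketch below is not Chisel and not part of perilus, and the `Memory` type and its methods are hypothetical. It only contrasts a read that returns data in the same cycle with one that returns data after a clock edge.

```rust
/// Toy model of a memory with both read styles.
struct Memory {
    cells: Vec<u32>,
    read_reg: u32, // output register used by the synchronous read port
}

impl Memory {
    fn new(cells: Vec<u32>) -> Self {
        Self { cells, read_reg: 0 }
    }

    /// Combinational read: data is visible as soon as the address is applied.
    fn read_comb(&self, addr: usize) -> u32 {
        self.cells[addr]
    }

    /// Synchronous read, step 1: the clock edge latches the addressed cell.
    fn clock_edge(&mut self, addr: usize) {
        self.read_reg = self.cells[addr];
    }

    /// Synchronous read, step 2: the data is only visible after the edge.
    fn read_sync(&self) -> u32 {
        self.read_reg
    }
}
```

Any logic that currently uses the result of a read in the same cycle as the address would have to wait one cycle with the synchronous style, which is why the memory module’s interface would need to change.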
I still don’t have a clear idea when I’ll consider this project done.
I guess I don’t typically begin my long-term projects with clear end
states in mind beyond “it works”, but even that doesn’t give a solid
answer in this case. In some sense perilus has already
served its purpose of teaching me the basics of RISC-V, Chisel, and
processor/hardware design, but I could go in many new directions from
here. I guess the boundary between two projects isn’t all that important
as long as I’m achieving my personal goal of learning new stuff. It’s
more of an administrative distinction, and it’s all arbitrary anyway.
perilus feels like a project that I’m going to be playing
with for a long time, so I suppose at some point I’ll decide that it’s
mature enough to transition from “active project” to “personal
infrastructure” that can support other projects.
Regardless of all that, I want to continue working towards my long-term goal of adding a new root node to the Software Tree of Life (SToL), as I’ve described previously. The assembler I started this week is an important part of that. I also want to roughly follow Ken Thompson’s path of creating an assembler, text editor, and kernel that eventually became Unix. That will require some MMIO for a UART or similar interface to a terminal abstraction. Much later, maybe I can write a FORTH or even a C compiler. The most ambitious goal on this path would be to compile a mainstream compiler with my own toolchain, which would tie my root node into the larger Tree.
That path goes up the stack, but I could also have a lot of fun
staying at the processor level by making it faster or adding support for
more instruction extensions. Pipelining is the most important next step
for general performance. I could spend months just implementing hardware
floating point. This path is the most faithful to my personal
interpretation of what the perilus project is supposed to
be about. The SToL idea is separate enough that it might need to stand
on its own.
And of course I could try to build perilus for real,
either on an FPGA or from discrete parts. The latter is probably two or three
orders of magnitude harder than the former, so I probably won’t do it in
this project, but running on an FPGA devkit could be a fun “final tour”
before closing the project.
I think I’ve talked myself into making perilus mostly
about implementing a reasonably fast and complete RISC-V processor
rather than pursuing stuff that lives on top of perilus,
which can be in other projects. That probably means that I should use my
new knowledge about Verilator to integrate the simulator with a standard test
suite so I can increase my confidence about perilus’s
correctness. Combined with my own test suite, this should be enough to
avoid uncaught regressions and to make forward progress. I imagine there
are also standard benchmarks that I could run.