
perilus, Part Seven

Bradley Gannon

2026-05-03

TL;DR: I used Verilator to convert the Verilog version of perilus into a C++ library. Then I linked that against a Rust binary with ratatui to expose a simulation frontend. Running the simulator revealed some logic bugs in the memory module, so I fixed those. I also built the backend of an assembler.

Simulator

Having built up enough of perilus to at least nominally support RV32I, my next goal was to build an interactive simulation program that would instantiate a perilus machine and let me peek/poke its state. I imagined this as analogous to the large panels of lights and switches on the earliest electronic computers. No shell, just registers, memory, and the ability to step the clock.

The first step towards this goal was to use Verilator. The ChiselSim test system already uses Verilator internally, but I needed to do this in an explicit and standalone way. The Verilator docs are pretty good, so I was able to take the Verilog output from my Chisel code and run it through Verilator to produce the source code for a C++ library that implements perilus.

I wanted to write the frontend in Rust, so I wrote a wrapper in C++ that uses the Verilator API to export functions for the operations I needed (instantiation, destruction, and read/write for relevant parts of the processor state). I also had to write a corresponding file in Rust that declares the same functions as extern and exports its own API for more idiomatic control. Then I wrote a build.rs that compiles the wrapper and the generated C++ library before linking them to the Rust binary. After some fighting with C++ (or really my ignorance of it), I had a mostly empty main.rs with the ability to create a Perilus struct and peek/poke it like I wanted.
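As a rough sketch of the shape this kind of binding can take (not the actual perilus code), the Rust side wraps an opaque handle in a struct and frees it on drop. Every name here (`perilus_new`, `perilus_free`, `perilus_read_reg`, `perilus_write_reg`, `perilus_step`, `FakeCore`) is hypothetical. In a real binding the five functions would be declared in an `extern "C"` block and implemented by the C++ wrapper; here they are stand-ins written in Rust with the C ABI so the example compiles without the generated library.

```rust
use std::os::raw::c_void;

// Hypothetical stand-in for the state the C++ wrapper would own.
struct FakeCore {
    regs: [u32; 32],
    cycles: u64,
}

// Stand-ins for the functions the C++ wrapper would export.
unsafe extern "C" fn perilus_new() -> *mut c_void {
    Box::into_raw(Box::new(FakeCore { regs: [0; 32], cycles: 0 })) as *mut c_void
}

unsafe extern "C" fn perilus_free(p: *mut c_void) {
    drop(unsafe { Box::from_raw(p as *mut FakeCore) });
}

unsafe extern "C" fn perilus_read_reg(p: *mut c_void, idx: u32) -> u32 {
    unsafe { (*(p as *mut FakeCore)).regs[idx as usize] }
}

unsafe extern "C" fn perilus_write_reg(p: *mut c_void, idx: u32, value: u32) {
    unsafe { (*(p as *mut FakeCore)).regs[idx as usize] = value };
}

unsafe extern "C" fn perilus_step(p: *mut c_void) -> u64 {
    let core = unsafe { &mut *(p as *mut FakeCore) };
    core.cycles += 1;
    core.cycles
}

// Safe, idiomatic wrapper: owns the handle and releases it exactly once.
struct Perilus {
    raw: *mut c_void,
}

impl Perilus {
    fn new() -> Self {
        let raw = unsafe { perilus_new() };
        assert!(!raw.is_null(), "wrapper returned a null handle");
        Perilus { raw }
    }

    // Peek a register; the bounds check keeps bad indices out of the FFI call.
    fn reg(&self, idx: usize) -> u32 {
        assert!(idx < 32, "register index out of range");
        unsafe { perilus_read_reg(self.raw, idx as u32) }
    }

    // Poke a register.
    fn set_reg(&mut self, idx: usize, value: u32) {
        assert!(idx < 32, "register index out of range");
        unsafe { perilus_write_reg(self.raw, idx as u32, value) }
    }

    // Pulse the clock once and return the cycle count so far.
    fn step(&mut self) -> u64 {
        unsafe { perilus_step(self.raw) }
    }
}

impl Drop for Perilus {
    fn drop(&mut self) {
        unsafe { perilus_free(self.raw) }
    }
}
```

The point of the pattern is that all `unsafe` calls stay inside one small module, and the rest of the frontend only sees `Perilus::new()`, `reg`, `set_reg`, and `step`.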

I didn’t know anything about ratatui except that I wanted to use it, so I learned just enough to display the registers, memory, program counter, and control unit state. I also added basic controls for editing values and pulsing the clock. The simplest input method I could come up with was to listen for any hex character and shift it into the least-significant nibble of the highlighted word. This doesn’t require any widget state for sub-word cursor position. Shift+c also clears the word. This was enough to start programming.
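The nibble-shifting input scheme fits in a few lines. This is a minimal sketch with a hypothetical function name (`handle_key`), and it assumes that uppercase hex letters other than `C` are also accepted as digits, which the real frontend may not do.

```rust
/// Apply one keypress to the highlighted word.
/// `C` (Shift+c) clears the word, a hex digit is shifted into the
/// least-significant nibble, and any other key leaves the word unchanged.
fn handle_key(word: u32, key: char) -> u32 {
    match key {
        'C' => 0,
        _ => match key.to_digit(16) {
            // The top nibble falls off the end; the new digit enters at the bottom.
            Some(nibble) => (word << 4) | nibble,
            None => word,
        },
    }
}
```

Typing `deadbeef` into an empty word yields `0xdeadbeef`, and a ninth digit pushes the leading `d` out, so the only state needed is the word itself.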

Screenshot of the perilus simulator, in which I’ve entered “hello world!” starting at address 0x0
Registers are at the top and are labeled according to the ABI. (Pressing n toggles the labels to their indices.) The run state indicates whether the simulator is idle, running until it hits a given memory address, or free running. In the current setup, the processor runs at 60 Hz. The control unit state is reported below the run state. Memory is at the bottom, with the address lines labeled on the left, hex data in the middle, and the ASCII representation of that same data on the right. Memory is grouped into words, which requires some mental transposition because RISC-V is little endian. The highlighted word is the location of the cursor for editing memory. The word drawn in green is the location of the program counter.
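To make the transposition concrete, here is a small sketch (the helper name `pack_le_words` is hypothetical) that packs a byte string into little-endian 32-bit words, which is the value each word holds when loaded into a register.

```rust
/// Pack bytes into little-endian 32-bit words, zero-padding the last word.
fn pack_le_words(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks(4)
        .map(|chunk| {
            let mut buf = [0u8; 4];
            buf[..chunk.len()].copy_from_slice(chunk);
            // The byte at the lowest address becomes the least-significant byte.
            u32::from_le_bytes(buf)
        })
        .collect()
}
```

For `b"hello world!"` this gives `0x6c6c6568`, `0x6f77206f`, and `0x21646c72`. Read as hex digits left to right, the first word spells "lleh", which is the reversal that has to be done mentally.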

I wanted to begin working towards an assembler, so I wrote out some routines for assembling R-Type and I-Type instructions on paper and assembled them by hand. Once I’d gotten these working, I was able to use them to assemble further routines for other instruction types. For any instructions that didn’t have an assembler routine yet, I still had to assemble them by hand. Building the assembler backend wasn’t a straightforward process because I didn’t have a clear design in my head or on the page, and I couldn’t make one until I understood what was sensible under these unfamiliar constraints. I had to keep reminding myself that I only need this to be a bootstrap assembler, so it doesn’t need to be beautiful or even complete. It just needs to be a useful tool for long enough to get me to a higher level of abstraction and speed.
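For reference, the field packing that R-Type and I-Type routines have to perform looks like this when written in Rust. This is only a sketch of the RV32I encodings, not a translation of the hand-assembled routines, and the function names are hypothetical.

```rust
/// Encode an R-type instruction (register-register operations such as `add`).
fn encode_r(funct7: u32, rs2: u32, rs1: u32, funct3: u32, rd: u32, opcode: u32) -> u32 {
    ((funct7 & 0x7f) << 25)
        | ((rs2 & 0x1f) << 20)
        | ((rs1 & 0x1f) << 15)
        | ((funct3 & 0x7) << 12)
        | ((rd & 0x1f) << 7)
        | (opcode & 0x7f)
}

/// Encode an I-type instruction (register-immediate operations such as `addi`).
/// Only the low 12 bits of the immediate are kept.
fn encode_i(imm: i32, rs1: u32, funct3: u32, rd: u32, opcode: u32) -> u32 {
    (((imm as u32) & 0xfff) << 20)
        | ((rs1 & 0x1f) << 15)
        | ((funct3 & 0x7) << 12)
        | ((rd & 0x1f) << 7)
        | (opcode & 0x7f)
}
```

For example, `encode_r(0, 7, 6, 0, 5, 0x33)` is `add x5, x6, x7` and produces `0x007302b3`, while `encode_i(1, 0, 0, 5, 0x13)` is `addi x5, x0, 1` and produces `0x00100293`.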

Eventually I settled on a design for an intermediate representation for instructions and their operands. Each instruction in this representation takes two words. The first word only uses the two least-significant bytes, which encode the instruction type and certain other fields that can be deduced from the instruction name alone. The other word encodes the operands, which are specific to each instruction type. I designed this IR to be easy to read and write in the hex memory view of the simulator. This results in a lot of wasted bits, but I was willing to accept the trade. The machine code below is a routine that takes the two IR words as arguments and returns the assembled instruction. It should work for any instruction type in RV32I and should even be able to assemble itself, but I haven’t tried that yet.

Two scanned notebook pages with columns of handwritten four-byte words. Each column has a modular sum written at the bottom. On the second page, two extra snippets are labeled as “loop” and “checksum”
The assembler backend. The main routine is 166 words long, and the sums of each column modulo 2^32 are written at the bottom. The loop snippet reads two words from the pointer in s0, loads them into a0 and a1, calls the routine at s1, and writes the return value to the pointer in s2 before incrementing the pointers and looping. I copied this by hand from the simulator because it has no nonvolatile memory, and I enjoy pretending that I don’t have access to any other computer for this part of the project.
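A column sum modulo 2^32 is just a wrapping 32-bit addition over the words. A minimal sketch, with a hypothetical function name:

```rust
/// Sum a column of 32-bit words modulo 2^32.
fn column_checksum(words: &[u32]) -> u32 {
    // wrapping_add discards the carry out of bit 31, which is the mod 2^32.
    words.iter().fold(0u32, |acc, &w| acc.wrapping_add(w))
}
```

After retyping a column into the simulator, comparing this sum against the one on paper catches most single-word transcription mistakes, though errors that happen to cancel out would slip through.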

By this point I’d filled several pages with assembly scribbles and had gone quite mad. But I also had a viable assembler backend, and I had constructed it under my vague constraints of bootstrapping as much as I cared to. I started to pick up some speed when I used the assembler to assemble a loop to read two words from a pointer, set them as arguments, call the assembler routine, and store the result at another pointer. This allowed me to write IR into a chunk of memory and quickly assemble it into executable code, which would sometimes become part of the assembler itself.
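In a higher-level language the driver loop is only a few lines. The sketch below is a Rust model, not the machine code itself: `ir` plays the role of the pointer in s0, `assemble` the routine at s1, and `out` the pointer in s2. The names are hypothetical, and stopping at the end of the input slice is an assumption, since the original loop's exit condition isn't described here.

```rust
/// Feed consecutive pairs of IR words to `assemble` and collect the results.
fn assemble_all<F>(ir: &[u32], out: &mut Vec<u32>, assemble: F)
where
    F: Fn(u32, u32) -> u32,
{
    // Each instruction occupies two IR words; a trailing odd word is ignored.
    for pair in ir.chunks_exact(2) {
        out.push(assemble(pair[0], pair[1]));
    }
}
```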

Memory Module Problems

At some point during my assembly adventure, I found a bug in the memory module. The bug was due to my own misunderstanding about how sub-word access works. I thought that, for example, a lb (load byte) instruction would always read from the lowest byte in a word and write to the lowest byte in the target register, but that’s not the case. The instruction actually reads the byte at the specified address and sets the entire target register to that single byte’s value, sign-extended to 32 bits (lbu zero-extends instead). In other words, if the four bytes starting at address 0x0 are 0x12 0x34 0x56 0x78 and I run lb x5, 1(x0), then the contents of x5 should be 0x00000034 because the byte at address 0x1 is 0x34. In the process, the byte is shifted to the least-significant position in the target register. I had similar misunderstandings about how store instructions work.
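The corrected load semantics can be modeled in Rust over a byte-addressed slice. This is a sketch of the RV32I behavior rather than the Chisel implementation, and the function names simply mirror the instruction mnemonics.

```rust
/// Load byte, sign-extended to 32 bits.
fn lb(mem: &[u8], addr: usize) -> u32 {
    // u8 -> i8 reinterprets the bits, i8 -> i32 sign-extends.
    mem[addr] as i8 as i32 as u32
}

/// Load byte, zero-extended to 32 bits.
fn lbu(mem: &[u8], addr: usize) -> u32 {
    mem[addr] as u32
}
```

With memory holding `0x12 0x34 0x56 0x78`, `lb(&mem, 1)` returns `0x00000034`. A byte with its top bit set shows the difference between the two: for `0x80`, `lb` returns `0xffffff80` and `lbu` returns `0x00000080`.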

This required some substantial changes to the memory module to implement correct reads and writes. It turns out that Chisel has support for masked writes in its Mem abstraction, but I needed to change each memory element into a vector of bytes instead of 32-bit integers to make it work. The rest was just a lot of careful management of different shifting and masking cases. I’m still not sure that I got it all correct, but I added a lot more tests to try to cover everything. Unaligned accesses just round down to the nearest boundary for now and aren’t tested, but from what I’ve read it seems like a good response in this situation is to trap and halt. Some implementations also emulate the unaligned access at potentially significant time cost, but I don’t think I’m going to get perilus to that level of sophistication anytime soon.
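The shifting and masking for stores can be modeled in Rust with each memory element as four byte lanes, loosely mirroring the vector-of-bytes layout. This is a sketch under stated assumptions, not the Chisel code: the `store` function and its `width` parameter (1 for sb, 2 for sh, 4 for sw) are hypothetical, and unaligned addresses are rounded down to the access width as described above.

```rust
/// Store the low `width` bytes of `value` at byte address `addr`.
fn store(mem: &mut [[u8; 4]], addr: u32, value: u32, width: u32) {
    assert!(matches!(width, 1 | 2 | 4), "width must be 1, 2, or 4 bytes");
    // First byte lane of the access, rounded down to a multiple of the width.
    let lane = (addr % 4) & !(width - 1);
    // Shift the source data up so it lines up with the target lanes.
    let shifted = (value << (8 * lane)).to_le_bytes();
    let word = &mut mem[(addr / 4) as usize];
    for i in 0..4u32 {
        // One mask bit per lane: only lanes inside the access are written.
        let enabled = i >= lane && i < lane + width;
        if enabled {
            word[i as usize] = shifted[i as usize];
        }
    }
}
```

Storing a byte to address 0x1 of a word holding `0x12 0x34 0x56 0x78` replaces only the `0x34` and leaves the other three lanes untouched, which is the case the original module got wrong.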

I also learned that Chisel’s Mem generally doesn’t get synthesized to actual RAM silicon in most FPGA targets because its reads are combinational (i.e., data appears immediately when an address is asserted). Instead, it gets synthesized to ordinary registers. SyncReadMem has a similar API but its reads are sequential, so it usually gets synthesized to real RAM. Someday I want to switch to SyncReadMem, but this will probably mean a redesign of the memory module’s interface, since the current setup relies on combinational reads.

Project Status and Future

I still don’t have a clear idea when I’ll consider this project done. I guess I don’t typically begin my long-term projects with clear end states in mind beyond “it works”, but even that doesn’t give a solid answer in this case. In some sense perilus has already served its purpose of teaching me the basics of RISC-V, Chisel, and processor/hardware design, but I could go in many new directions from here. I guess the boundary between two projects isn’t all that important as long as I’m achieving my personal goal of learning new stuff. It’s more of an administrative distinction, and it’s all arbitrary anyway. perilus feels like a project that I’m going to be playing with for a long time, so I suppose at some point I’ll decide that it’s mature enough to transition from “active project” to “personal infrastructure” that can support other projects.

Regardless of all that, I want to continue working towards my long-term goal of adding a new root node to the Software Tree of Life (SToL), as I’ve described previously. The assembler I started this week is an important part of that. I also want to roughly follow Ken Thompson’s path of creating an assembler, text editor, and kernel that eventually became Unix. That will require some MMIO for a UART or similar interface to a terminal abstraction. Much later, maybe I can write a FORTH or even a C compiler. The most ambitious goal on this path would be to compile a mainstream compiler with my own toolchain, which would tie my root node into the larger Tree.

That path goes up the stack, but I could also have a lot of fun staying at the processor level by making it faster or adding support for more instruction extensions. Pipelining is the most important next step for general performance. I could spend months just implementing hardware floating point. This path is the most faithful to my personal interpretation of what the perilus project is supposed to be about. The SToL idea is separate enough that it might need to stand on its own.

And of course I could try to build perilus for real, either on an FPGA or discrete parts. The latter is probably two or three orders of magnitude harder than the former, so I probably won’t do it in this project, but running on an FPGA devkit could be a fun “final tour” before closing the project.

I think I’ve talked myself into making perilus mostly about implementing a reasonably fast and complete RISC-V processor rather than pursuing stuff that lives on top of perilus, which can be in other projects. That probably means that I should use my new knowledge about Verilator to integrate the simulator with an existing third-party test suite so I can increase my confidence about perilus’s correctness. Combined with my own test suite, this should be enough to avoid uncaught regressions and to make forward progress. I imagine there are also standard benchmarks that I could run.