ilang-python
2026-06-24
TL;DR: My friend Logan has been working on a project called 𝚒 for many months. 𝚒 is a domain-specific language for high-performance tensor computation. This week I made a small contribution to the project by improving the Python wrapper. I also learned more about the CUDA programming model.
For information about the 𝚒 project in general, I refer you to the repo and its
README.
Logan has put in a lot of effort to specify a multi-pass
source-to-source tensor compiler in Rust. The frontend of the compiler
uses a DSL of his own design where the user can write so-called 𝚒
expressions. For our purposes here we can treat these as opaque string
literals that encode a particular tensor operation, as if they were
immutable runes of parallel computation. The Rust portion of the project
takes 𝚒 expressions as input (more or less), lowers them to the target
backend source language, and then builds and runs them on user-provided
inputs. The ilang-python module simply loads the dynamic
library that actually contains the 𝚒 compiler and manages I/O between
the user in Python land and the compiler’s various functions.
All this is to say that the Python wrapper isn’t really part of the
high-minded CS theory at the core of the 𝚒 project. Its value is in the
ubiquity and familiarity of Python among machine learning programmers,
which may someday help 𝚒 gain adoption among the most performance-minded
model authors. When I decided to start hacking on
ilang-python last week, the entirety of the wrapper lived
in a single 500-line file which no human had ever seen directly. Its
design was pretty much what you’d expect: lots of ctypes
to load and define the core functions from the .so,
followed by some classes to handle 𝚒 expressions and the data they
operate on. The only clear signs of vibecoding on a first pass were some
dead code and questionable duplication of logic here and there.
No types, though. That’s the first major change I made. Qwen 3.6
added its guesses at types and function signatures, and I nodded
approvingly while paging through the diff. The existing minimal test
suite still passed, and ty was happy,[1] so
I moved on to splitting the single file into modules and restructuring
the package to be more Pythonic. This was also pretty easy because the
single file was already well-separated internally. I also added a Nix
flake configuration because I’d already written one for myself, and
Logan didn’t seem to mind.
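To make the "ctypes plus type hints" shape concrete, here is a minimal sketch of the pattern. It is not the actual ilang-python code: the real library's function names aren't shown in this post, so the sketch uses the C standard library's strlen as a stand-in for a function exported by the compiler's .so. The helper names load_libc and c_strlen are made up for illustration, and the loading logic assumes a POSIX system.

```python
import ctypes
import ctypes.util


def load_libc() -> ctypes.CDLL:
    """Load the C standard library (a stand-in for the compiler's .so)."""
    path = ctypes.util.find_library("c")
    # If no path is found, CDLL(None) exposes symbols already loaded
    # into this process, which includes libc on POSIX systems.
    return ctypes.CDLL(path) if path else ctypes.CDLL(None)


_lib = load_libc()

# Declare the C signature once so ctypes converts arguments and the
# return value correctly instead of guessing.
_lib.strlen.argtypes = [ctypes.c_char_p]
_lib.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Typed wrapper: encode in Python land, call into C, return an int."""
    return int(_lib.strlen(text.encode("utf-8")))
```

Calling c_strlen("hello") goes through the shared library and comes back as a plain Python int. The type hints on the wrapper are what a checker like ty can actually verify; the argtypes/restype declarations are the ctypes-level equivalent.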
On our weekly Discord call last week, Logan and I had a lot of fun
building most of a benchmarking capability for 𝚒 components (which are
compositions of expressions), complete with warmup runs and automatic
computation of mean and standard deviation. This will probably be an
important tool for optimizing individual algorithms, both manually and
by machine search in expression space. This pair programming session was
especially fun because we used tmate, a
terminal-sharing tool that works over SSH. In some sense this
arrangement is better than traditional pair programming in the same room
because we each get access to our own keyboards on the same machine—it’s
just that one of the keyboard cables is very long.
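For a rough idea of what such a benchmarking helper looks like, here is a minimal sketch with warmup runs and automatic mean and standard deviation. It is not the code we wrote on the call: the names benchmark and BenchResult and the default run counts are assumptions for illustration, and it times an arbitrary Python callable rather than an 𝚒 component.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchResult:
    samples: list[float]  # wall-clock seconds for each timed run
    mean: float
    stdev: float


def benchmark(fn: Callable[[], object], warmup: int = 3, runs: int = 10) -> BenchResult:
    """Time fn over several runs after a few untimed warmup calls."""
    if runs < 2:
        raise ValueError("need at least two timed runs for a standard deviation")
    # Warmup calls absorb one-time costs (caches, lazy initialization).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return BenchResult(samples, statistics.mean(samples), statistics.stdev(samples))
```

Something like benchmark(lambda: sum(range(100_000))) returns the per-run timings along with their mean and sample standard deviation. For GPU work, the callable would also need to wait for the device to finish before returning, since kernel launches are asynchronous and would otherwise make the timings look too good.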
I’ve played with CUDA before, but working on
ilang-python reminded me that I never made a formal attempt
at learning it from the beginning. The NVIDIA/CUDA ecosystem is large
and complicated, but it turns out that the introductory documentation is
quite approachable, though it’s pretty verbose. I slowly read and
re-read the first chapter of the programming
guide, which describes the programming model independently of any
language, and that has clarified a lot of what I’ve seen and heard up
until now. There appears to be a lot of useful documentation related to
optimizing kernels for various use cases, which I’m looking forward to
reading.
Logan and I have discussed the possibility of developing an execution cost model for backend code so some future system can optimize for speed in 𝚒 expression space without having to actually run anything (or at least do it much less often). As I was learning the basics of CUDA, I realized that while there’s a lot of freely available information about the “spatial” and architectural aspects of a given GPU,[2] there isn’t nearly as much official information about temporal aspects, like the number of cycles to load or store data in various parts of the memory hierarchy. These values would be vital for an effective cost model, and it turns out that some clever people have worked a lot of them out from experiment (see here and here, for example). I hope to become smart enough about CUDA to make use of these research results someday.
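As a toy illustration of why those cycle counts matter, here is about the simplest cost model I can imagine: multiply the number of accesses at each memory level by that level's latency and add everything up. This is purely a sketch, not anything that exists in 𝚒. The function name estimate_cycles and the level names are hypothetical, and the latencies are passed in by the caller because I don't have trustworthy numbers to hardcode.

```python
from typing import Mapping


def estimate_cycles(
    access_counts: Mapping[str, int],
    latency_cycles: Mapping[str, float],
) -> float:
    """Naive serial estimate: sum of (accesses x latency) per memory level."""
    missing = set(access_counts) - set(latency_cycles)
    if missing:
        raise KeyError(f"no latency given for: {sorted(missing)}")
    return float(
        sum(count * latency_cycles[level] for level, count in access_counts.items())
    )


# Placeholder latencies chosen for easy arithmetic; NOT measurements.
placeholder_latency = {"registers": 1, "shared": 10, "global": 100}
accesses = {"registers": 4, "shared": 2, "global": 1}
print(estimate_cycles(accesses, placeholder_latency))  # 124.0
```

A real model would have to be much smarter than this, since GPUs overlap memory accesses with computation and with each other, so a plain sum overstates the cost. Still, even this version shows where the experimentally measured latencies would plug in.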
When I picked up the CUDA programming guide I decided to also try using Anki to make flashcards. I’d already been doing this for my DSP book with some success, and so far it looks like it was a good move here too. Cloze deletion cards are the best in my opinion, and in many cases it’s possible to copy sentences directly out of the text and simply mark a few terms as clozes. Sometimes I realize later that the cards need editing, but the Anki app makes this easy. For some reason I didn’t think flashcards would work well for engineering and software concepts, but I was wrong. When paired with actually building something, I think flashcards are a viable learning and memorization tool for a lot of the stuff I’m interested in.
[1] When using LLMs for almost any kind of task, I’ve found it helpful (maybe vital) to give them some tool(s) that they can run on their own to get feedback on their progress toward the goal. Type checkers, linters, and compilers are excellent for this. It’s not that surprising in hindsight, since I need feedback too in order to make a computer do what I want.
[2] Such as the number of available streaming multiprocessors, maximum block/grid dimensions, number of registers, cache sizes, etc.