Tidying ilang-python

TL;DR: My friend Logan has been working on a project called 𝚒 for many months. 𝚒 is a domain-specific language for high-performance tensor computation. This week I made a small contribution to the project by improving the Python wrapper. I also learned more about the CUDA programming model.

ilang-python

For information about the 𝚒 project in general, I refer you to the repo and its README.

Logan has put in a lot of effort to specify a multi-pass source-to-source tensor compiler in Rust. The frontend of the compiler uses a DSL of his own design where the user can write so-called 𝚒 expressions. For our purposes here we can treat these as opaque string literals that encode a particular tensor operation, as if they were immutable runes of parallel computation. The Rust portion of the project takes 𝚒 expressions as input (more or less), lowers them to the target backend source language, and then builds and runs them on user-provided inputs. The ilang-python module simply loads the dynamic library that actually contains the 𝚒 compiler and manages I/O between the user in Python land and the compiler’s various functions.

All this is to say that the Python wrapper isn’t really part of the high-minded CS theory at the core of the 𝚒 project. Its value is in the ubiquity and familiarity of Python among machine learning programmers, which may someday help in its adoption among the most performance-minded model authors. When I decided to start hacking on ilang-python last week, the entirety of the wrapper lived in a single 500-line file which no human had ever seen directly. Its design was pretty much what you’d expect: lots of ctypes to load and define the core functions from the .so, followed by some classes to handle 𝚒 expressions and the data they operate on. The only clear signs of vibecoding on a first pass were some dead code and questionable duplication of logic here and there.
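For readers who haven’t used ctypes, the load-and-declare pattern looks roughly like the sketch below. It does not reproduce the real ilang-python bindings: as a stand-in that runs without the 𝚒 library, it binds two functions from the C API of the running interpreter through `ctypes.pythonapi`. The `bind` helper is a hypothetical name invented for this example.

```python
import ctypes


def bind(lib, name, argtypes, restype):
    """Look up `name` in `lib` and declare its C signature.

    `bind` is a hypothetical helper for this sketch, not part of ilang-python.
    """
    fn = getattr(lib, name)
    fn.argtypes = argtypes  # ctypes validates and converts arguments against these
    fn.restype = restype    # without this, ctypes assumes the function returns int
    return fn


# A real wrapper would load its own shared library with ctypes.CDLL(<path to .so>).
# Stand-in for this sketch: the C API of the running interpreter.
lib = ctypes.pythonapi

# C signature: const char *Py_GetVersion(void)
py_get_version = bind(lib, "Py_GetVersion", [], ctypes.c_char_p)
# C signature: int Py_IsInitialized(void)
py_is_initialized = bind(lib, "Py_IsInitialized", [], ctypes.c_int)

version = py_get_version().decode()  # c_char_p results arrive as bytes
initialized = py_is_initialized()
print(version, initialized)
```

Declaring `argtypes` and `restype` up front is also what makes the later step of adding Python type annotations fairly mechanical, since each C signature is already written down in one place.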

No types, though. That’s the first major change I made. Qwen 3.6 added its guesses at types and function signatures, and I nodded approvingly while paging through the diff. The existing minimal test suite still passed, and ty was happy,¹ so I moved on to splitting the single file into modules and restructuring the package to be more Pythonic. This was also pretty easy because the single file was already well-separated internally. I also added a Nix flake configuration because I’d already written one for myself, and Logan didn’t seem to mind.

On last week’s edition of our weekly Discord call, Logan and I had a lot of fun building most of a benchmarking capability for 𝚒 components (which are compositions of expressions), complete with warmup runs and automatic computation of mean and standard deviation. This will probably be an important tool for optimizing individual algorithms, both manually and by machine search in expression space. The session was especially enjoyable because we used tmate, a terminal-sharing tool that works over SSH. In some sense this arrangement is better than traditional pair programming in the same room because we each get our own keyboard on the same machine; it’s just that one of the keyboard cables is very long.
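The general shape of a benchmark with warmup runs can be sketched in a few lines of plain Python. This is not the code Logan and I wrote: `benchmark`, its parameters, and the returned dictionary keys are all assumptions made for illustration, and the workload is a stand-in for running an 𝚒 component.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, runs=10):
    """Time `fn()` over `runs` calls after `warmup` untimed calls.

    Hypothetical helper: names and defaults are assumptions for this sketch.
    """
    if runs < 2:
        raise ValueError("need at least 2 timed runs to compute a standard deviation")

    # Warmup calls are executed but not timed, so one-off costs
    # (lazy initialization, cold caches) don't skew the samples.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "runs": runs,
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),  # sample standard deviation
    }


# Usage with a stand-in CPU workload; `calls` records how often fn ran.
calls = []
result = benchmark(lambda: calls.append(sum(range(10_000))), warmup=2, runs=5)
print(result)
```

When the workload runs on a GPU, `fn` would also need to wait for the device to finish before returning, because kernel launches are asynchronous and would otherwise be timed as nearly free.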

Learning CUDA

I’ve played with CUDA before, but working on ilang-python reminded me that I never made a formal attempt at learning it from the beginning. The NVIDIA/CUDA ecosystem is large and complicated, but it turns out that the introductory documentation is quite approachable, though it’s pretty verbose. I slowly read and re-read the first chapter of the programming guide, which describes the programming model independently of any language, and that has clarified a lot of what I’ve seen and heard up until now. There appears to be a lot of useful documentation related to optimizing kernels for various use cases, which I’m looking forward to reading.

Logan and I have discussed the possibility of developing an execution cost model for backend code so some future system can optimize for speed in 𝚒 expression space without having to actually run anything (or at least do it much less often). As I was learning the basics of CUDA, I realized that while there’s a lot of freely available information about the “spatial” and architectural aspects of a given GPU,² there isn’t nearly as much official information about temporal aspects, like the number of cycles to load or store data in various parts of the memory hierarchy. These values would be vital for an effective cost model, and it turns out that some clever people have worked a lot of them out from experiment (see here and here, for example). I hope to become smart enough about CUDA to make use of these research results someday.
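To make the idea concrete, here is a deliberately naive sketch of what the innermost piece of such a cost model could look like: multiply the number of accesses at each memory level by a per-access latency and add them up. Everything here is an assumption for illustration. The function name, the access counts, and especially the latency numbers are placeholders, not measurements of any real GPU, and a usable model would also have to account for latency hiding and overlapping work.

```python
def estimate_cycles(access_counts, latency_cycles):
    """Sum (number of accesses * latency per access) over memory levels.

    Hypothetical and intentionally simplistic: it ignores latency hiding,
    cache behavior, and any overlap between memory traffic and compute.
    """
    missing = set(access_counts) - set(latency_cycles)
    if missing:
        raise KeyError(f"no latency given for: {sorted(missing)}")
    return sum(count * latency_cycles[level] for level, count in access_counts.items())


# Placeholder latencies in cycles. These are NOT measured values.
latency_cycles = {"shared": 10, "global": 100}

# Hypothetical access counts for two candidate lowerings of the same expression.
candidate_a = {"shared": 0, "global": 2048}
candidate_b = {"shared": 4096, "global": 64}

cost_a = estimate_cycles(candidate_a, latency_cycles)
cost_b = estimate_cycles(candidate_b, latency_cycles)
cheaper = "a" if cost_a < cost_b else "b"
print(cost_a, cost_b, cheaper)
```

The point of the sketch is only that candidates can be ranked without running them once per-level latencies are known, which is why the experimentally derived numbers matter.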

When I picked up the CUDA programming guide I decided to also try using Anki to make flashcards. I’d already been doing this for my DSP book with some success, and so far it looks like it was a good move here too. Cloze deletion cards are the best in my opinion, and in many cases it’s possible to copy sentences directly out of the text and simply mark a few terms as clozes. Sometimes I realize later that the cards need editing, but the Anki app makes this easy. For some reason I didn’t think flashcards would work well for engineering and software concepts, but I was wrong. When paired with actually building something, I think flashcards are a viable learning and memorization tool for a lot of the stuff I’m interested in.
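For anyone unfamiliar with cloze cards, Anki marks each hidden term as `{{c1::term}}`, `{{c2::term}}`, and so on. The small helper below sketches the “mark a few terms” step. `make_cloze` is a hypothetical function, and the example sentence is my own wording rather than a quotation from the programming guide.

```python
def make_cloze(sentence, terms):
    """Wrap the first occurrence of each term in Anki cloze markup.

    Hypothetical helper for illustration. Terms are numbered c1, c2, ...
    in the order given, and each must appear in the sentence.
    """
    out = sentence
    for i, term in enumerate(terms, start=1):
        if term not in out:
            raise ValueError(f"term not found in sentence: {term!r}")
        # Doubled braces in an f-string produce literal braces.
        out = out.replace(term, f"{{{{c{i}::{term}}}}}", 1)
    return out


card = make_cloze(
    "A kernel is executed by a grid of thread blocks.",
    ["kernel", "grid", "thread blocks"],
)
print(card)
```

The resulting text can be pasted into a note of Anki’s Cloze type, which generates one card per numbered deletion.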

¹ When using LLMs for almost any kind of task, I’ve found it helpful, maybe vital, to give the model some tool(s) that it can run on its own to get feedback on its progress toward the goal. Type checkers, linters, and compilers are excellent for this. It’s not that surprising in hindsight, since I need feedback too in order to make a computer do what I want.
² Such as the number of available streaming multiprocessors, maximum block/grid dimensions, number of registers, cache sizes, etc.
