bearfish
2025-11-13
TL;DR: I’ve been recording my weekly Discord calls
with my friend Logan for about a year and a half (with his consent), and
this week I finally got around to running the audio through a
transcription and diarization system. I used local models on our Nvidia
Jetson AGX Orin devkit, which performs this task at about 2x real
time. It was occasionally frustrating to get the software set up
properly on the Jetson, but now I have a minimal solution that I can
improve incrementally in the future. I call this project bearfish.
Logan and I, together with our friend Brady, bought the Jetson about a year and a half ago. In that time, I’ve had mixed success in getting it to do the things I want. The troubles I have with the Jetson seem to have two root causes: the processor architecture and the platform software. Nvidia’s target use case for the Jetson product line appears to be robotics, which explains its relatively low compute and power usage. We selected the devkit because it has a lot of GPU memory for its price (64 GB in our case1). However, the Jetson runs an ARM processor, which means that there aren’t nearly as many pre-built wheel files for various Python libraries as there are for x86. Jetson devices also seem to require special handling when interacting with the GPU, which complicates things further.2 All this means that, on average, doing anything with the Jetson is harder than it would be on a more typical Nvidia system.
Fortunately, Dustin
Franklin at Nvidia has created—among other marvels—the jetson-containers
repo, which is a large collection of Dockerfiles that set
up containers for lots of different ML programs and libraries, along
with some shell tools to orchestrate them. I was vaguely aware of this
repo when I started working with our Jetson, but I ignored it in favor
of working directly on the host system. This was a mistake, and I wish
I’d used this repo sooner. It has its own problems, which I describe
below, but it’s substantially better in almost every way than trying to
set up system libraries and virtual environments on the host. The best
way I can describe it is that jetson-containers is to
Python virtual environments as venvs are to the global
Python installation. You’ll just have a better time if you use them.
All of my messing with the host environment over the past year or
more made it really painful to try to upgrade the Jetson’s software
(collectively called JetPack) to the
latest supported version for our hardware. In fact, after wrestling with
apt for several hours and thinking I’d fixed it, I tried to
reboot the device and found that it wouldn’t connect to the network. I’d
essentially bricked it and would need to reflash. This brought more
pain. The preferred (only?) method for reflashing the Jetson is to use
Nvidia’s SDK Manager, which is bad enough on its own, but the
particularly annoying system
requirements specify that the host must use one of exactly three
versions of Ubuntu. No other Linux flavors are supported for
flashing JetPack 6.2. This astonished me. How can the SDK Manager
be so sensitive to its environment? I must be missing a lot of
information about the complexities in play here.
I felt a lot of extra frustration after trying and failing several
times to get the SDK Manager to do its job under a live boot of the
correct version of Ubuntu. Even with enough RAM to store the enormous
software packages, it still didn’t work right, so I had to install
Ubuntu to a spare disk and boot from there. Only then did it work
correctly, which is at least what the documentation promises. The moral
of the story is: when Nvidia says jump, you say
bra.uni L_how_high;.
So I was finally back to a working system, and most of our
data was still intact because it lived on an SSD rather than on the
main flash, which the reinstall had blown away. In fairness, I didn’t
have to do much to get jetson-containers working. Docker
was already installed, so I basically had to git clone and
run one other command to add the tool to my PATH. The rest
was mostly a matter of waiting for things to compile and figuring out
how to do more interesting stuff with the container tools. I like the
design choices here—refreshingly Unix-y.
The only gripe I have about jetson-containers is that
some of the Dockerfiles are broken, or at least they don’t
build on my system. Since this is essentially a fresh reflash, I’m
inclined to believe they’re broken for everyone. The one that bit me
hardest for this project was whisperx,
which was last edited during a grand
refactor and apparently never tested. I did try to figure out the
problem, but I’m not smart enough with uv and friends to
make it work. I’ve run into a few other build failures that I suspect
are due to bad Dockerfiles. But for the most part it’s been
fun to be able to spin things up relatively easily and without
disturbing any neighboring environments. Just make sure you have a
large, fast disk. jetson-containers does make clever use of
Docker’s layer caching, though, so some of the more common layers are
shared.
Having spent way more time than I intended on system administration,
I’d almost forgotten about actually running transcription. I’d failed to
build whisperx, so I moved over to faster-whisper
and wrote a small script to call into it. I stored the start and end
times for each detected word in a SQLite database, along with the word’s
probability/confidence value and an identifier for the audio file where
it came from. This code isn’t too interesting because it’s not that far
from the example
code in the repo, so I haven’t bothered to post it anywhere. I
probably should have enabled the VAD filter to avoid hallucinations
during periods of noise or silence, but I think the diarization system
may take care of that in the end. The Jetson takes about 48 hours to
process 222 hours of audio in this stage, for an effective rate of about
4x real time.
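For the curious, the shape of that script is roughly this; the model size, schema, and file names are illustrative rather than a faithful copy of what I wrote:

```python
# Rough shape of the transcription pass: faster-whisper with word timestamps,
# written straight into SQLite. Model size, schema, and file names here are
# illustrative, not exactly what I used.
import sqlite3
from faster_whisper import WhisperModel

db = sqlite3.connect("calls.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS words (
           audio_id TEXT,        -- which recording the word came from
           word TEXT,
           start_time REAL,      -- seconds from the start of the file
           end_time REAL,
           probability REAL      -- the model's confidence in the word
       )"""
)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_file(path: str, audio_id: str) -> None:
    # word_timestamps=True attaches per-word start/end times to each segment;
    # vad_filter=True is the option I mention above for skipping noise and silence.
    segments, _info = model.transcribe(path, word_timestamps=True, vad_filter=True)
    for segment in segments:
        for w in segment.words:
            db.execute(
                "INSERT INTO words VALUES (?, ?, ?, ?, ?)",
                (audio_id, w.word, w.start, w.end, w.probability),
            )
    db.commit()

transcribe_file("call-001.flac", "call-001")
```

Keeping per-word rows makes it easy to join against the diarization output later and attribute each word to a speaker.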
I haven’t finished diarization yet, but I’ve tested it on a few files
and will probably kick it off for the remaining ones soon. I’m using pyannote.audio,3 which doesn’t have a package in
jetson-containers but is easy to add on top of the
torchaudio container with a single call to pip.
The code is mildly more interesting because in addition to storing the
diarization information from the model’s output, I also added some logic
to compare the speaker embeddings for the current audio file to the
speakers already referenced in the database. For each individual audio
file, pyannote doesn’t know who “Logan” and “Bradley” are.
It just sees some vectors in a latent space and arbitrarily assigns them
"SPEAKER_00" and "SPEAKER_01". To avoid having
to manually map these vectors back to our real identities for every
file, I thought I’d use cosine similarity to do it automatically. I
still have to do it once when the database is empty, of course, but
that’s alright. On the rare occasion that another speaker is briefly on
the call, this should catch that too. It remains to be seen how well
this works in practice, but after some minimal testing it seems
promising. Diarization runs a little slower than transcription, and I
also have to convert the audio to a particular format first, which adds
a small amount of extra effort and time.
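The matching part is simple enough to sketch. This assumes the pyannote.audio 3.x speaker-diarization pipeline, which can hand back one centroid embedding per detected speaker via return_embeddings; the schema, similarity threshold, and token handling below are illustrative rather than exactly what I wrote:

```python
# Sketch of the diarization pass plus the speaker-matching idea. Assumes the
# pyannote.audio 3.x speaker-diarization pipeline. Schema, threshold, and paths
# are illustrative.
import json
import os
import sqlite3

import numpy as np
from pyannote.audio import Pipeline

db = sqlite3.connect("calls.db")
db.execute("CREATE TABLE IF NOT EXISTS speakers (name TEXT, embedding TEXT)")
db.execute(
    """CREATE TABLE IF NOT EXISTS turns (
           audio_id TEXT, speaker TEXT, start_time REAL, end_time REAL
       )"""
)

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # gated model; needs a Hugging Face token
)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding: np.ndarray, threshold: float = 0.7) -> str | None:
    """Return the closest known speaker, or None if nobody clears the threshold."""
    best_name, best_score = None, threshold
    for name, blob in db.execute("SELECT name, embedding FROM speakers"):
        score = cosine(embedding, np.array(json.loads(blob)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

def diarize_file(path: str, audio_id: str) -> None:
    # return_embeddings=True hands back one centroid vector per SPEAKER_xx label,
    # in the same order as diarization.labels().
    diarization, embeddings = pipeline(path, return_embeddings=True)
    names = {}
    for label, emb in zip(diarization.labels(), embeddings):
        # Map the anonymous label to a known person; fall back to the raw label so
        # unmatched speakers can be named by hand later (which is also how the
        # speakers table gets its first rows).
        names[label] = identify(np.asarray(emb)) or f"{audio_id}:{label}"
    for segment, _, label in diarization.itertracks(yield_label=True):
        db.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?)",
            (audio_id, names[label], segment.start, segment.end),
        )
    db.commit()
```

A flat cosine threshold is crude, but with only two or three regular speakers it doesn’t have to be clever.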
I don’t intend to make this an ongoing formal project for this blog, partly because I don’t think it’ll be that interesting and partly because any insights we gain from the actual transcription corpus will be personal and private. I do plan to try to push the corpus through some kind of language model to attempt summarization, queries with RAG, vector search, and maybe even automating the creation of a personalized wiki. As I think of stuff and have the time, I’ll poke at it, and in the meantime we’ll keep adding more data.
1. The GPU memory and main memory are integrated.
2. I freely admit to being ignorant of the details here. GPU programming is already complex, and the Nvidia ecosystem introduces its own complications. Do not look directly at the Nvidia ecosystem.
3. You may ask "why not skip diarization and simply record into separate channels?" The answer is: I do, but not very well. I don’t use headphones for most calls, so there’s some crosstalk from the remote to the local channel (speakers to microphone on my end). I might try blanking the local channel when the remote channel exceeds a certain sound power, but I haven’t bothered yet. My goal with this project is to get some kind of data set going so we can do stuff with it. It’s not a court transcript that needs to be perfectly accurate.
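If I ever try that channel-blanking idea from footnote 3, it should only take a few lines. A toy sketch, assuming both channels are same-length float arrays, with a made-up window size and threshold:

```python
# Toy version of the gating idea: mute the local (microphone) channel wherever
# the remote channel is loud, judged by RMS power over short windows.
import numpy as np

def gate_local(local: np.ndarray, remote: np.ndarray, rate: int = 48000,
               window_s: float = 0.05, threshold: float = 0.01) -> np.ndarray:
    out = local.copy()
    win = int(rate * window_s)
    for i in range(0, len(remote), win):
        rms = np.sqrt(np.mean(remote[i:i + win] ** 2))  # rough "sound power"
        if rms > threshold:
            out[i:i + win] = 0.0  # blank the local channel while the remote one is loud
    return out
```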