bearfish
2025-11-13
TL;DR: I’ve been recording my weekly Discord calls
with my friend Logan for about a year and a half (with his consent), and
this week I finally got around to running the audio through a
transcription and diarization system. I used local models on our Nvidia
Jetson AGX Orin devkit, which performs this task at about 2x real
time. It was occasionally frustrating to get the software set up
properly on the Jetson, but now I have a minimal solution that I can
improve incrementally in the future. I call this project bearfish.
Logan and I, together with our friend Brady, bought the Jetson about a year and a half ago. In that time, I’ve had mixed success in getting it to do the things I want. The troubles I have with the Jetson seem to have two root causes: the processor architecture and the platform software. Nvidia’s target use case for the Jetson product line appears to be robotics, which explains its relatively low compute and power usage. We selected the devkit because it has a lot of GPU memory for its price (64 GB in our case1). However, the Jetson runs an ARM processor, which means that there aren’t nearly as many pre-built wheel files for various Python libraries as there are for x86. Jetson devices also seem to require special handling when interacting with the GPU, which complicates things further.2 All this means that, on average, doing anything with the Jetson is harder than it would be on a more typical Nvidia system.
Fortunately, Dustin
Franklin at Nvidia has created—among other marvels—the jetson-containers
repo, which is a large collection of Dockerfiles that set
up containers for lots of different ML programs and libraries, along
with some shell tools to orchestrate them. I was vaguely aware of this
repo when I started working with our Jetson, but I ignored it in favor
of working directly on the host system. This was a mistake, and I wish
I’d used this repo sooner. It has its own problems, which I describe
below, but it’s substantially better in almost every way than trying to
set up system libraries and virtual environments on the host. The best
way I can describe it is that jetson-containers is to
Python virtual environments as venvs are to the global
Python installation. You’ll just have a better time if you use them.
All of my messing with the host environment over the past year or
more made it really painful to try to upgrade the Jetson’s software
(collectively called JetPack) to the
latest supported version for our hardware. In fact, after wrestling with
apt for several hours and thinking I’d fixed it, I tried to
reboot the device and found that it wouldn’t connect to the network. I’d
essentially bricked it and would need to reflash. This brought more
pain. The preferred (only?) method for reflashing the Jetson is to use
Nvidia’s SDK Manager, which is bad enough on its own, but the
particularly annoying system
requirements specify that the host must use one of exactly three
versions of Ubuntu. No other Linux flavors are supported for
flashing JetPack 6.2. This astonished me. How can the SDK Manager
be so sensitive to its environment? I must be missing a lot of
information about the complexities in play here.
I felt a lot of extra frustration after trying and failing several
times to get the SDK Manager to do its job under a live boot of the
correct version of Ubuntu. Even with enough RAM to store the enormous
software packages, it still didn’t work right, so I had to install
Ubuntu to a spare disk and boot from there. Only then did it work
correctly, which is at least what the documentation promises. The moral
of the story is: when Nvidia says jump, you say
bra.uni L_how_high;.
So I was finally back to a working system, and most of our
data was still intact because it lived on an SSD rather than on the
main flash, which the reinstall had blown away. In fairness, I didn’t
have to do much to get jetson-containers working. Docker
was already installed, so I basically had to git clone and
run one other command to add the tool to my PATH. The rest
was mostly a matter of waiting for things to compile and figuring out
how to do more interesting stuff with the container tools. I like the
design choices here—refreshingly Unix-y.
The only gripe I have about jetson-containers is that
some of the Dockerfiles are broken, or at least they don’t
build on my system. Since this is essentially a fresh reflash, I’m
inclined to believe they’re broken for everyone. The one that bit me
hardest for this project was whisperx,
which was last edited during a grand
refactor and apparently never tested. I did try to figure out the
problem, but I’m not smart enough with uv and friends to
make it work. I’ve run into a few other build failures that I suspect
are due to bad Dockerfiles. But for the most part it’s been
fun to be able to spin things up relatively easily and without
disturbing any neighboring environments. Just make sure you have a
large, fast disk. jetson-containers does make clever use of
Docker’s layer caching, though, so some of the more common layers are
shared.
Having spent way more time than I intended on system administration,
I’d almost forgotten about actually running transcription. I’d failed to
build whisperx, so I moved over to faster-whisper
and wrote a small script to call into it. I stored the start and end
times for each detected word in a SQLite database, along with the word’s
probability/confidence value and an identifier for the audio file where
it came from. This code isn’t too interesting because it’s not that far
from the example
code in the repo, so I haven’t bothered to post it anywhere. I
probably should have enabled the VAD filter to avoid hallucinations
during periods of noise or silence, but I think the diarization system
may take care of that in the end. The Jetson takes about 48 hours to
process 222 hours of audio in this stage, for an effective rate of about
4x real time.
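For the curious, the shape of that script is roughly this; the model size, schema, and file names are illustrative rather than a faithful copy of what I wrote:

```python
# Rough shape of the transcription pass: faster-whisper with word timestamps,
# written straight into SQLite. Model size, schema, and file names here are
# illustrative, not exactly what I used.
import sqlite3
from faster_whisper import WhisperModel

db = sqlite3.connect("calls.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS words (
           audio_id TEXT,        -- which recording the word came from
           word TEXT,
           start_time REAL,      -- seconds from the start of the file
           end_time REAL,
           probability REAL      -- the model's confidence in the word
       )"""
)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_file(path: str, audio_id: str) -> None:
    # word_timestamps=True attaches per-word start/end times to each segment;
    # vad_filter=True is the option I mention above for skipping noise and silence.
    segments, _info = model.transcribe(path, word_timestamps=True, vad_filter=True)
    for segment in segments:
        for w in segment.words:
            db.execute(
                "INSERT INTO words VALUES (?, ?, ?, ?, ?)",
                (audio_id, w.word, w.start, w.end, w.probability),
            )
    db.commit()

transcribe_file("call-001.flac", "call-001")
```

Keeping per-word rows makes it easy to join against the diarization output later and attribute each word to a speaker.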
I haven’t finished diarization yet, but I’ve tested it on a few files
and will probably kick it off for the remaining ones soon. I’m using pyannote.audio,3 which doesn’t have a package in
jetson-containers but is easy to add on top of the
torchaudio container with a single call to pip.
The code is mildly more interesting because in addition to storing the
diarization information from the model’s output, I also added some logic
to compare the speaker embeddings for the current audio file to the
speakers already referenced in the database. For each individual audio
file, pyannote doesn’t know who “Logan” and “Bradley” are.
It just sees some vectors in a latent space and arbitrarily assigns them
"SPEAKER_00" and "SPEAKER_01". To avoid having
to manually map these vectors back to our real identities for every
file, I thought I’d use cosine similarity to do it automatically. I
still have to do it once when the database is empty, of course, but
that’s alright. On the rare occasion that another speaker is briefly on
the call, this should catch that too. It remains to be seen how well
this works in practice, but after some minimal testing it seems
promising. Diarization runs a little slower than transcription, and I
also have to convert the audio to a particular format first, which adds
a small amount of extra effort and time.
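The matching part is simple enough to sketch. This assumes the pyannote.audio 3.x speaker-diarization pipeline, which can hand back one centroid embedding per detected speaker via return_embeddings; the schema, similarity threshold, and token handling below are illustrative rather than exactly what I wrote:

```python
# Sketch of the diarization pass plus the speaker-matching idea. Assumes the
# pyannote.audio 3.x speaker-diarization pipeline. Schema, threshold, and paths
# are illustrative.
import json
import os
import sqlite3

import numpy as np
from pyannote.audio import Pipeline

db = sqlite3.connect("calls.db")
db.execute("CREATE TABLE IF NOT EXISTS speakers (name TEXT, embedding TEXT)")
db.execute(
    """CREATE TABLE IF NOT EXISTS turns (
           audio_id TEXT, speaker TEXT, start_time REAL, end_time REAL
       )"""
)

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # gated model; needs a Hugging Face token
)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding: np.ndarray, threshold: float = 0.7) -> str | None:
    """Return the closest known speaker, or None if nobody clears the threshold."""
    best_name, best_score = None, threshold
    for name, blob in db.execute("SELECT name, embedding FROM speakers"):
        score = cosine(embedding, np.array(json.loads(blob)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

def diarize_file(path: str, audio_id: str) -> None:
    # return_embeddings=True hands back one centroid vector per SPEAKER_xx label,
    # in the same order as diarization.labels().
    diarization, embeddings = pipeline(path, return_embeddings=True)
    names = {}
    for label, emb in zip(diarization.labels(), embeddings):
        # Map the anonymous label to a known person; fall back to the raw label so
        # unmatched speakers can be named by hand later (which is also how the
        # speakers table gets its first rows).
        names[label] = identify(np.asarray(emb)) or f"{audio_id}:{label}"
    for segment, _, label in diarization.itertracks(yield_label=True):
        db.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?)",
            (audio_id, names[label], segment.start, segment.end),
        )
    db.commit()
```

A flat cosine threshold is crude, but with only two or three regular speakers it doesn’t have to be clever.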
I don’t intend to make this an ongoing formal project for this blog, partly because I don’t think it’ll be that interesting and partly because any insights we gain from the actual transcription corpus will be personal and private. I do plan to try to push the corpus through some kind of language model to attempt summarization, queries with RAG, vector search, and maybe even automating the creation of a personalized wiki. As I think of stuff and have the time, I’ll poke at it, and in the meantime we’ll keep adding more data.
1. The GPU memory and main memory are integrated.
2. I freely admit to being ignorant of the details here. GPU programming is already complex, and the Nvidia ecosystem introduces its own complications. Do not look directly at the Nvidia ecosystem.
3. You may ask "why not skip diarization and simply record into separate channels?" The answer is: I do, but not very well. I don’t use headphones for most calls, so there’s some crosstalk from the remote to the local channel (speakers to microphone on my end). I might try blanking the local channel when the remote channel exceeds a certain sound power, but I haven’t bothered yet. My goal with this project is to get some kind of data set going so we can do stuff with it. It’s not a court transcript that needs to be perfectly accurate.
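If I ever try that channel-blanking idea from footnote 3, it should only take a few lines. A toy sketch, assuming both channels are same-length float arrays, with a made-up window size and threshold:

```python
# Toy version of the gating idea: mute the local (microphone) channel wherever
# the remote channel is loud, judged by RMS power over short windows.
import numpy as np

def gate_local(local: np.ndarray, remote: np.ndarray, rate: int = 48000,
               window_s: float = 0.05, threshold: float = 0.01) -> np.ndarray:
    out = local.copy()
    win = int(rate * window_s)
    for i in range(0, len(remote), win):
        rms = np.sqrt(np.mean(remote[i:i + win] ** 2))  # rough "sound power"
        if rms > threshold:
            out[i:i + win] = 0.0  # blank the local channel while the remote one is loud
    return out
```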