Andrew Chan


Things I did in April 2024

More gaussian splatting, Metal vs. CUDA, and earthquake worms.

I was able to build some stuff in April, which was fun, but like in March, I spent lots of time learning in new domains!

Lighting with Gaussian Splats

The first thing I built was an implementation of something I saw on Twitter: a real-time lighting model for Gaussian splats. Looking back, it's pretty straightforward to get some lighting working if you know about traditional lighting techniques. For instance, all you need for shadow mapping is a way to render the depth of a scene from different viewpoints. And shading can also be done with a depth image, since it's not too difficult to extract surface normals from one. So this post was partly just me learning traditional lighting techniques.
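To make the depth-to-normals idea concrete, here's a rough sketch of my own (not code from the post), assuming a pinhole camera with intrinsics fx, fy, cx, cy and a row-major depth buffer: unproject a pixel and its +x/+y neighbors into view space, then take the cross product of the finite-difference tangents.

// Back-project pixel (px, py) to a view-space point using its depth.
__device__ float3 unproject(const float* depth, int px, int py, int w,
                            float fx, float fy, float cx, float cy) {
  float z = depth[py * w + px];
  return make_float3((px - cx) * z / fx, (py - cy) * z / fy, z);
}

__device__ float3 normal_from_depth(const float* depth, int x, int y, int w,
                                    float fx, float fy, float cx, float cy) {
  float3 p = unproject(depth, x, y, w, fx, fy, cx, cy);
  float3 q = unproject(depth, x + 1, y, w, fx, fy, cx, cy);
  float3 r = unproject(depth, x, y + 1, w, fx, fy, cx, cy);
  // Finite-difference tangents along +x and +y.
  float3 tx = make_float3(q.x - p.x, q.y - p.y, q.z - p.z);
  float3 ty = make_float3(r.x - p.x, r.y - p.y, r.z - p.z);
  // Cross product gives the surface normal (up to sign); normalize it.
  float3 n = make_float3(tx.y * ty.z - tx.z * ty.y,
                         tx.z * ty.x - tx.x * ty.z,
                         tx.x * ty.y - tx.y * ty.x);
  float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
  return make_float3(n.x / len, n.y / len, n.z / len);
}

In practice you'd also clamp the neighbor lookups at image borders, and, as the post discusses, filter the depth first, since raw splat depth is noisy.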

Gaussian splats with shadows

A splat scene with shadow mapping applied. Spooky!

That said, just bolting on these techniques isn't really enough as current splatting outputs are not designed to form coherent surface geometry. As I show in the post, we can somewhat fix this by filtering depth, but really we need to alter the reconstruction process so that proper geometry is encouraged.

One frustrating thing was that when exploring various papers as part of this project, I usually spent more time trying to get their code running than reading the actual paper or evaluating its results. For example, with the Relightable Gaussians paper, I spent more than an entire day trying to get the Python environment working with my machine.

Part of this is my own inexperience: for example, I didn't realize that while PyTorch normally ships with its own CUDA runtime which doesn't need to match the system CUDA, the system CUDA toolkit does need to match PyTorch's CUDA version when you compile C++ extensions against that PyTorch installation! I also didn't know that Anaconda provides a virtual environment for all kinds of dependencies, not just Python libraries, so you can have different CUDA versions in different Anaconda environments.
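For example (my own illustration, with the environment name and versions made up for the sake of the example), pinning a matching CUDA toolkit inside a conda environment looks something like:

# Create an isolated env and install a CUDA toolchain (nvcc included)
# that matches the CUDA version of the PyTorch wheels you install.
conda create -n relightable python=3.10
conda activate relightable
conda install -c nvidia/label/cuda-11.8.0 cuda-toolkit
pip install torch --index-url https://download.pytorch.org/whl/cu118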

Adding Apple Silicon Support to Gaussian Splatting


Banana courtesy of OpenSplat.

I also added the first Apple GPU support to any open-source Gaussian splatting program (that I know of, anyway! There is Scaniverse, which is very impressive and supports training splats on iPhone, but it's proprietary). Apple GPU support in the machine learning world (yeah, I know Gaussian splats are not really machine learning, but they use all the same tooling) generally feels pretty weird. On the one hand basically everyone I know in my tech bubble uses a Mac, so it seems like allowing training of small models locally would be useful; on the other hand anybody serious will just rent or buy a desktop with an NVIDIA GPU. Maybe there is only substantial demand for supporting inference on Apple GPUs?

In any case I learned a lot about Metal, CUDA, and parallel programming in this project, which involved translating the fused CUDA kernels used for Gaussian splatting to Metal. Some of my thoughts:

Thoughts on Metal (vs. CUDA)

The device API and shader language

The Metal Shading Language (MSL) has all the features that were used in the Gaussian splatting CUDA kernels, and overall seems fairly well-featured. For example, the various units of execution in the CUDA GPU model are also there in Metal, just with different names: a CUDA thread is a Metal thread, a warp is a SIMD-group, a thread block is a threadgroup, a grid is still a grid, and shared memory is threadgroup memory.

There are also atomics and warp-level primitives like shuffle instructions, so for the most part I was able to translate CUDA kernels to MPS kernels, parallel reductions and all, by just copy-pasting and looking up the right name for the primitives to use.
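Here's roughly what that looks like for a warp-level sum (a minimal sketch of my own, not one of the actual splatting kernels): CUDA's __shfl_down_sync becomes simd_shuffle_down in MSL, and since SIMD-groups on Apple GPUs are 32 lanes wide like CUDA warps, the loop structure carries over unchanged.

#include <metal_stdlib>
using namespace metal;

// Sum a value across a 32-wide SIMD-group; after the loop, lane 0 holds
// the total. The CUDA equivalent replaces simd_shuffle_down(v, offset)
// with __shfl_down_sync(0xffffffff, v, offset).
static float simdgroup_sum(float v) {
  for (uint offset = 16; offset > 0; offset >>= 1) {
    v += simd_shuffle_down(v, offset);
  }
  return v;
}

(If I recall correctly, MSL also provides a built-in simd_sum that collapses this to a single call.)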

There are some annoying differences though. One thing that stumped me for a whole afternoon was that Metal expects a particular alignment of vectorized data types. Specifically, Metal assumed float3 arrays used 16-byte-aligned elements, while my program was dispatching the kernels with packed (12-byte) float3 data, since the data came from densely packed PyTorch tensors. I had to rewrite all the kernels to use "raw" float arrays instead, with helper functions to extract float3 values.
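The helpers are just manual 12-byte-stride indexing; something like this (a reconstruction of the idea, not my exact code):

#include <metal_stdlib>
using namespace metal;

// Read element i of a densely packed (12-byte stride) float3 array.
static float3 load_float3(device const float* buf, uint i) {
  return float3(buf[3 * i + 0], buf[3 * i + 1], buf[3 * i + 2]);
}

// Write element i of a densely packed float3 array.
static void store_float3(device float* buf, uint i, float3 v) {
  buf[3 * i + 0] = v.x;
  buf[3 * i + 1] = v.y;
  buf[3 * i + 2] = v.z;
}

(I believe MSL also has a packed_float3 type with exactly this 12-byte layout, which might have been a cleaner fix.)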

Finally, while MSL and CUDA both look a lot like C++, CUDA is a lot closer to being a true superset of C++; MSL, for example, disallows recursion and dynamic memory allocation.

The host API

Metal is lower-level and more verbose than CUDA. Whereas a kernel dispatch in CUDA is as easy as writing a funny-looking function call, in Metal, you need to manually encode which kernel you're dispatching, what arguments you're passing to it, and more. Apparently, this is common for graphics programming APIs: WebGPU works the same way as Metal.

In CUDA, defining and dispatching a kernel that adds a constant to all elements of a float array allocated on the device might look like so:

          
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int N_THREADS = 256;

// device code
__global__ void add_constant_kernel(const int n, const float x, float* data) {
  unsigned idx = cg::this_grid().thread_rank();
  if (idx >= n) return;
  data[idx] += x;
}

// host code
void add_constant(float x, float* data, const int n) {
  int blocks = (n + N_THREADS - 1) / N_THREADS;
  add_constant_kernel<<<blocks, N_THREADS>>>(n, x, data);
}
          
      

In Metal it looks like this:

          
// host code
void add_constant(MetalContext* ctx, float x, id<MTLBuffer> data, const int n) {
  id<MTLCommandBuffer> command_buffer = [ctx->command_queue commandBuffer];
  id<MTLComputeCommandEncoder> encoder = [command_buffer computeCommandEncoder];

  // add_constant_cpso is a MTLComputePipelineState object created beforehand
  [encoder setComputePipelineState:ctx->add_constant_cpso];
  [encoder setBytes:&n length:sizeof(n) atIndex:0];
  [encoder setBytes:&x length:sizeof(x) atIndex:1];
  [encoder setBuffer:data offset:0 atIndex:2];
  MTLSize threads_per_group = MTLSizeMake(N_THREADS, 1, 1);
  MTLSize thread_groups = MTLSizeMake((n + N_THREADS - 1) / N_THREADS, 1, 1);
  [encoder dispatchThreadgroups:thread_groups threadsPerThreadgroup:threads_per_group];
  [encoder endEncoding];
  [command_buffer commit];
}

// device code (in a separate .metal file)
kernel void add_constant_kernel(
    constant int& n [[buffer(0)]],
    constant float& x [[buffer(1)]],
    device float* data [[buffer(2)]],
    uint idx [[thread_position_in_grid]]
) {
  if (idx >= uint(n)) return;
  data[idx] += x;
}
          
      

I'm not really sure why there is such a big difference in the developer experience. Dispatching a kernel seems like both a very common thing to do and something for which a compile-time-checkable syntax would be fairly easy to build. Twitter has not been helpful in answering my questions. That said, it's possible it's much harder than I think, because you need to handle your global context, command buffer, etc. somewhere, and thus have to think carefully about how that affects other parts of your API.
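For what it's worth, here's the kind of variadic wrapper I have in mind (purely hypothetical, and glossing over buffer arguments and error handling): it encodes each plain-value argument with setBytes at successive indices.

// Hypothetical Objective-C++ helper: encode N trivially-copyable arguments
// and dispatch. Real code would also need overloads for id<MTLBuffer>.
template <typename... Args>
void dispatch_kernel(id<MTLComputeCommandEncoder> encoder,
                     id<MTLComputePipelineState> cpso,
                     MTLSize thread_groups, MTLSize threads_per_group,
                     const Args&... args) {
  [encoder setComputePipelineState:cpso];
  NSUInteger index = 0;
  // C++17 fold expression: encode each argument at the next index.
  ([&] { [encoder setBytes:&args length:sizeof(args) atIndex:index++]; }(), ...);
  [encoder dispatchThreadgroups:thread_groups threadsPerThreadgroup:threads_per_group];
}

Of course, this only gets you terser call sites, not real compile-time checking: nothing verifies the argument types and indices against the kernel's actual signature, which is presumably where the hard part lies.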

Other

Xcode shader validation

Metal does have some things going for it. The shader debugging experience via Xcode is pretty nice; you can turn on shader validation (in Xcode's scheme editor, or, if I remember correctly, with the MTL_SHADER_VALIDATION=1 environment variable), which adds a bunch of runtime checks for things like invalid loads/stores. This can help catch incorrect argument encoding among other things (although a compile-time check for these kinds of errors would certainly be preferable).

Also, I personally prefer the C++-like syntax of Metal to the Rust-like syntax of WGSL. This also made translation from CUDA easier.

Claude was also extremely helpful for learning about CUDA vs. Metal. It got some surprising things right, like identifying that a memory stomp bug I was running into was due to alignment. However, it wasn't totally reliable, and kept insisting that I could use printf in my Metal shaders.

Overall, Metal feels like a much rougher developer experience than CUDA for me. I'm still very much a beginner to parallel programming (having only learned about things like parallel reductions in this project), so I don't feel I have a good understanding of how important the differences I highlighted are, or why they exist. However, at the very least, CUDA has been around since 2007, while Metal came out in 2014, so the former has clearly had much more time to mature.

The best references I found for both were:

Taking the fast.ai course

Fashion MNIST generated samples

Some samples from a diffusion model trained on the Fashion-MNIST dataset as part of the fast.ai course.

I also started taking the fast.ai course in April to get more hands-on experience with the models and tools behind the last few years of the AI wave. Part 1 was pretty basic and covered topics I wasn't interested in (mostly older and less general techniques like decision trees, which I had already learned about in my undergrad ML courses, plus too much focus on using the fastai library), so I skipped it and moved on to part 2, which "builds Stable Diffusion from scratch".

Some thoughts on fastai:

However:

I think that I'm not used to a lecture-only, theory-light course which doesn't test me on the material afterwards. Probably students more comfortable with self-assessment, coming up with their own questions and answers, will have an easier time learning from this course.

Other

Still sorta tech-related: I joined South Park Commons this month. It's great having a community of interesting people to explore with and keep me accountable!

My favorite non-technical readings this month were:

Here's a good quote from the joodaloop essay:

But what if you like doing things? What if your ethic is Protestant, your dedication is to the craft, and you just can't live without the need to be useful?

What if you're like the programmers who are not mad about the possibility of losing their jobs, as much as they are sad that they can't spend the rest of their career typing ASCII characters onto a screen while thinking hard? They don't care about “prompting” an opaque model into creating business value, that's not why they got into this.

You want problems. You want to feel smart about solving them, this is your jam. Some of you are actually sad that you cannot “grind” anymore, that things will be made too easy. (The grind is dead, long live the grind). I think this is as real a sadness as any other.

Suzume

I also started a thing with friends where we watch Asian movies. We watched Suzume and argued about how the movie could've been better for two hours. My take: the movie has good bones, but the theme of moving on from past trauma is undermined by the weird romance and the other theme of saving Japan from the earthquake worms.