Now that my game engine has most of the features that make it usable, I wanted to spend the next couple of months focusing on performance.
I decided to start with the Light Pass. For testing, I loaded a scene with around 214,000 vertices. Not a huge scene, but enough to get meaningful profiler data. After running the profiler, these were the numbers for the Light Pass:
- GPU Time: 2.53 ms
- ALU Limiter: 37.65%
- ALU Utilization: 35.77%
- Texture Read Limiter: 57.8%
- Texture Read Utilization: 26.83%
- MMU Limiter: 32.19%
The biggest limiter was Texture Read, at almost 58%. This means the GPU was spending a lot of time fetching data from textures, likely because they weren't being cached efficiently. In hindsight, I should have started by tackling the biggest limiter. Instead, I went after the MMU limiter first. Not the best choice, but that's how you learn.
A high MMU Limiter means GPU performance is constrained by memory address translation and data fetches, not arithmetic.
Optimization 1: Buffer Pre-Loading
Buffer pre-loading combines related data into a single buffer so the GPU can fetch it more efficiently, instead of bouncing between multiple buffers. In my original shader, I was sending light data through separate buffers:
```metal
// Two separate buffers: the light array and, separately, its count.
constant PointLightUniform *pointLights [[buffer(lightPassPointLightsIndex)]],
constant int *pointLightsCount [[buffer(lightPassPointLightsCountIndex)]]
```
I restructured this into a single struct that packages the light data together. This change reduced the MMU Limiter from 32.19% to 26.43%.
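Here's a minimal sketch of the restructuring. The `LightData` name, the fixed array bound, the simplified `PointLightUniform` fields, and the buffer index are illustrative, not my engine's exact types:

```metal
#include <metal_stdlib>
using namespace metal;

// Simplified stand-in for the engine's per-light uniform.
struct PointLightUniform {
    float4 position;
    float4 color;
};

// Count and light array packaged in one struct, so the light loop
// walks a single contiguous buffer instead of bouncing between two.
struct LightData {
    int pointLightCount;
    PointLightUniform pointLights[64];  // fixed upper bound for the sketch
};

fragment float4 lightPassFragment(float4 fragCoord [[position]],
                                  constant LightData &lights [[buffer(0)]])
{
    float3 radiance = float3(0.0);
    for (int i = 0; i < lights.pointLightCount; ++i) {
        // accumulate this light's contribution (shading elided)
        radiance += lights.pointLights[i].color.rgb;
    }
    return float4(radiance, 1.0);
}
```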
Optimization 2: Use .read() Instead of .sample()
During the Light Pass, I fetch data from multiple G-buffer textures such as Position, Normal, Albedo, and SSAO. Originally, I used .sample(), but this does more work than necessary—it applies filtering and mipmap logic, which adds both memory traffic and math. Switching to .read() (a direct texel fetch) gave a noticeable improvement:
- Texture Read Limiter: 57.8% → 46.79%
- Texture Read Utilization: 26.83% → 30.03%
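For reference, the change looked roughly like this (texture names and binding indices are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

fragment float4 lightPassFragment(float4 fragCoord [[position]],
                                  texture2d<float> normalTexture [[texture(0)]])
{
    // Before: the full sampling path, with filtering, mip selection,
    // and address-mode logic that a 1:1 G-buffer lookup never needs:
    //   constexpr sampler s(filter::nearest);
    //   float4 n = normalTexture.sample(s, uv);

    // After: a direct texel fetch at integer pixel coordinates.
    uint2 pixel = uint2(fragCoord.xy);
    return normalTexture.read(pixel);
}
```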
Optimization 3: Reduce G-Buffer Precision
Next, I reduced the G-Buffer textures from 32-bit float to 16-bit half precision. I expected this to lower bandwidth usage, but to my surprise it made things worse:
- Texture Read Limiter: 46.79% → 56.9%
- Texture Read Utilization: 30.03% → 32.31%
Sometimes optimizations backfire, and this was one of those cases.
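For completeness, here's roughly what the change looked like. The real switch happens host-side when allocating the G-Buffer (a 16-bit float pixel format such as MTLPixelFormatRGBA16Float instead of RGBA32Float); on the shader side it surfaces as half textures. Names, bindings, and the placeholder shading are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

// G-buffer attachments bound as half precision: .read() now returns
// half4, halving the per-texel payload compared to float4.
fragment half4 lightPassFragment(float4 fragCoord [[position]],
                                 texture2d<half> albedoTexture [[texture(0)]],
                                 texture2d<half> normalTexture [[texture(1)]])
{
    uint2 pixel = uint2(fragCoord.xy);
    half4 albedo = albedoTexture.read(pixel);
    half4 normal = normalTexture.read(pixel);
    half ndotup = max(normal.y, half(0.0));  // placeholder shading for the sketch
    return half4(albedo.rgb * ndotup, 1.0);
}
```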
Optimization 4: Half-Precision Math in Lighting
I then focused on ALU utilization by switching parts of my lighting calculations to half-precision (half) math; see the sketch after this list. Specifically:
- Diffuse contribution → half precision
- Specular contribution → full precision (float)
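Here's a sketch of the split, with a simplified Lambert/Blinn-Phong pair standing in for my actual BRDF (function names and parameters are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

// Diffuse term in half precision: all of this maps to F16 ALU ops.
static half3 diffuseTerm(half3 albedo, half3 n, half3 l, half3 lightColor)
{
    half ndotl = max(dot(n, l), half(0.0));
    return albedo * lightColor * ndotl;
}

// Specular term kept in full 32-bit float precision.
static float3 specularTerm(float3 n, float3 v, float3 l,
                           float3 lightColor, float shininess)
{
    float3 h = normalize(v + l);
    float ndoth = max(dot(n, h), 0.0);
    return lightColor * pow(ndoth, shininess);
}
```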
The results were underwhelming:
- ALU Limiter: 37.65% → 37.55%
- ALU Utilization: 35.77% → 35.83%
- F32 Utilization: 27.86% → 25.12%
- F16 Utilization: 0% → 1.11%
Half-precision math showed up in the counters, but didn’t change the overall bottleneck.
Final Results
Here’s the overall improvement from all the optimizations:
Before → After
- GPU Time: 2.53 ms → 2.31 ms (~8.7% faster)
- Texture Read Limiter: 57.81% → 50.93%
- MMU Limiter: 32.19% → 23.41%
- ALU Limiter: 37.65% → 37.55% (flat)
- F32 Utilization: 27.86% → 25.12%
- F16 Utilization: 0.00% → 1.11%
- Integer & Complex Limiter: 14.25% → 8.82%
- Texture Read Utilization: 26.83% → 31.62%
What the Numbers Say
- I shaved about 9% off the Light Pass—a real improvement, but not dramatic.
- The biggest wins came from reducing memory-side pressure (MMU ↓, Texture Read Limiter ↓).
- ALU stayed flat, which shows the pass is still memory/texture-bound, not math-bound.
- Half-precision math registered, but didn’t help much since math wasn’t the bottleneck.
- Removing unnecessary integer/complex math improved things locally, but again, the frame was dominated by texture fetch bandwidth.
Takeaway
Optimizations don’t always yield big wins, but each attempt brings clarity. In this case, the profiler clearly shows that the Light Pass is memory/texture-bound. My next steps will focus directly on reducing texture fetch cost, rather than trimming ALU math.
Thanks for reading.