From 29 FPS to 37 FPS: Fixing a CPU Bottleneck in My Game Engine

After adding several cool features to the engine (did I mention I added a console log?), it was time to shift gears and focus on performance.

When rendering around 214,000 vertices (500 models), the engine was only hitting 29.51 FPS. That’s rough for real-time rendering. Clearly, something needed fixing.

Current State of the Engine: FPS 29.51

Profiling the Problem

I fired up Xcode’s GPU tools and the results were clear: the engine was CPU-bound.

  • CPU Frame Time: ~33.9 ms
  • GPU Frame Time: ~8.1 ms

So while the GPU was waiting around, the CPU was overloaded preparing work.
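To make the link between those frame times and the FPS counter explicit, here is a minimal C++ sketch. It assumes CPU and GPU work overlap, so the slower side sets the frame rate; `FrameTimes`, `isCpuBound`, and `estimatedFps` are illustrative names, not part of the engine.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the two timings reported by the profiler.
struct FrameTimes {
    double cpuMs;
    double gpuMs;
};

// The frame is CPU-bound when the CPU takes longer than the GPU.
inline bool isCpuBound(const FrameTimes& t) { return t.cpuMs > t.gpuMs; }

// Estimated FPS, assuming the frame lasts as long as the slower processor.
inline double estimatedFps(const FrameTimes& t) {
    const double frameMs = std::max(t.cpuMs, t.gpuMs);
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}
```

With the numbers above, 1000 / 33.9 comes out to roughly 29.5 FPS, which lines up with the measured 29.51 FPS. A GPU time of 8.1 ms on its own would allow well over 100 FPS.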

Untold Engine is CPU-Bound

Looking deeper with Instruments, I found the major culprit: Metal Encoder Setup Time. The CPU was spending ~31 ms every frame just encoding commands for the GPU.

Metal Encoder Preparation

Why So Slow?

The bottleneck came from the Shadow and Geometry passes. Each frame, the CPU had to prepare encoders and push all material data for every model—base color, roughness, metallic textures, etc. With hundreds of models, this ballooned into a huge overhead.

First Fix: GPU Frustum Culling

The engine didn’t have any form of culling, so I decided to implement Frustum Culling. To avoid piling more work on the CPU, I pushed this logic onto the GPU.

The approach:

  • Construct the camera frustum.
  • Compute each entity’s world-space AABB.
  • Send bounding boxes to the GPU.
  • GPU checks if each AABB is inside the frustum.
  • If visible, the entity ID is written into an array via an atomic add.

The key here is that once the GPU returned the list of visible entities, the CPU only needed to encode draw calls for those entities—cutting encoder overhead. It’s a brute-force implementation, but it worked.
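The steps above can be sketched on the CPU in C++, which is close to what a Metal compute kernel would look like. This is not the engine's actual kernel. The types (`Vec3`, `Plane`, `AABB`, `Frustum`) and functions (`aabbInFrustum`, `cullEntity`) are hypothetical, and the sketch assumes the six frustum planes have normals pointing inward. The test is conservative: a box that straddles a plane still counts as visible.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0, with n pointing into the frustum.
struct Plane { Vec3 n; float d; };

// World-space axis-aligned bounding box.
struct AABB { Vec3 min, max; };

using Frustum = std::array<Plane, 6>;

// Rejects the box only if it lies fully behind at least one plane.
inline bool aabbInFrustum(const AABB& box, const Frustum& frustum) {
    for (const Plane& p : frustum) {
        // Pick the box corner farthest along the plane normal.
        const Vec3 v{
            p.n.x >= 0.0f ? box.max.x : box.min.x,
            p.n.y >= 0.0f ? box.max.y : box.min.y,
            p.n.z >= 0.0f ? box.max.z : box.min.z};
        if (p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d < 0.0f) {
            return false;  // even the farthest corner is behind this plane
        }
    }
    return true;
}

// Stand-in for one GPU thread: test one entity and append its ID if visible.
inline void cullEntity(std::uint32_t entityId, const AABB& box,
                       const Frustum& frustum,
                       std::vector<std::uint32_t>& visibleIds,
                       std::atomic<std::uint32_t>& visibleCount) {
    assert(!visibleIds.empty());
    if (!aabbInFrustum(box, frustum)) return;
    // fetch_add hands each writer a unique slot, so threads never collide.
    const std::uint32_t slot =
        visibleCount.fetch_add(1, std::memory_order_relaxed);
    if (slot < visibleIds.size()) visibleIds[slot] = entityId;
}
```

After all entities are processed, `visibleCount` tells the CPU how many entries of `visibleIds` to read. The atomic add does not preserve entity order, so anything that depends on draw order would need a sort afterwards.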

The Results

From the same view location:

  • FPS jumped from 29 → 37.
  • CPU Frame Time: 33.9 ms → 26.7 ms
  • Metal Encoder Setup Time: 31.0 ms → 14.6 ms
  • GPU Frame Time: 8.1 ms (unchanged)

Improved FPS

Engine still CPU-bound, but improved

Summary

Here’s the before-and-after snapshot:

  • FPS: 29 to 37 (+27% improvement)
  • CPU Frame Time: 33.9 ms to 26.7 ms (Encoder bottleneck reduced)
  • GPU Frame Time: 8.1 ms to 8.1 ms (Unchanged)
  • Metal Encoder Setup Time: 31.0 ms to 14.6 ms (Biggest gain)
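For anyone who wants to check the percentages, the relative change for each metric takes one line of arithmetic. The helper below is a small illustrative sketch; `percentChange` is a hypothetical name.

```cpp
#include <cassert>

// Relative change in percent; a negative result means the metric went down.
inline double percentChange(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the numbers above, CPU frame time changes by about -21.2% (33.9 ms to 26.7 ms) and encoder setup time by about -52.9% (31.0 ms to 14.6 ms). The +27% FPS figure is consistent with the frame times: 1000 / 33.9 is about 29.5 FPS and 1000 / 26.7 is about 37.5 FPS, a change of roughly +27%.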

Metal Encoder Duration decreased

Where Things Stand

The engine is still CPU-bound, but it’s in a noticeably better state than it was a week ago. By filtering out invisible objects early, I reduced the CPU’s workload and freed up encoder time. It’s not at 60 FPS yet—but the path forward is clearer.

What’s Next

Frustum culling was just the first step. To keep pushing toward 60 FPS, here is the next optimization I plan to explore:

  • Metal Bindless (Argument Buffers) – Instead of rebinding textures and material properties for every draw, I’ll move to a bindless model. All materials will live in a single argument buffer, and each draw will reference them with a simple index. This should drastically cut down CPU encoder overhead and pair nicely with GPU-driven culling.
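To illustrate the idea, here is a CPU-side C++ sketch of the data layout a bindless model implies. It is not Metal API code and not the engine's planned implementation. `Material`, `DrawCommand`, and `MaterialTable` are hypothetical names, and the texture fields are plain integer handles standing in for whatever the argument buffer would actually hold.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical material record; in Metal this would be one entry
// in an argument buffer rather than a plain struct.
struct Material {
    std::uint32_t baseColorTexture;
    std::uint32_t roughnessTexture;
    std::uint32_t metallicTexture;
};

// Per-draw data shrinks to a mesh handle plus one index into the table.
struct DrawCommand {
    std::uint32_t meshId;
    std::uint32_t materialIndex;
};

// The table is meant to be bound once per pass, not once per draw.
struct MaterialTable {
    std::vector<Material> materials;

    // Registers a material and returns the index that draws will reference.
    std::uint32_t add(const Material& m) {
        materials.push_back(m);
        return static_cast<std::uint32_t>(materials.size() - 1);
    }

    // What the shader side would do: fetch the material by index.
    const Material& lookup(const DrawCommand& draw) const {
        assert(draw.materialIndex < materials.size());
        return materials[draw.materialIndex];
    }
};
```

The point of the layout is that the per-draw CPU work drops from several texture and property binds to writing a single integer, which is why it should pair well with a GPU-produced list of visible entity IDs.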

That’s where the engine stands today: better than last week, not yet where it needs to be. But the direction is clear, and each step forward is one step closer to real-time rendering.

Thanks for reading.

Harold Serrano

Computer Graphics Enthusiast. Currently developing a 3D Game Engine.