Now that my game engine has most of the features that make it usable, I wanted to spend the next couple of months focusing on performance.
I decided to start with the Light Pass. For testing, I loaded a scene with around 214,000 vertices. Not a huge scene, but enough to get meaningful profiler data. After running the profiler, these were the numbers for the Light Pass:
- GPU Time: 2.53 ms
- ALU Limiter: 37.65%
- ALU Utilization: 35.77%
- Texture Read Limiter: 57.8%
- Texture Read Utilization: 26.83%
- MMU Limiter: 32.19%
The biggest limiter was Texture Read, at almost 58%. This means the GPU was spending a lot of time fetching data from textures, likely because they weren't being cached efficiently. In hindsight, I should have started by tackling the biggest limiter. Instead, I went after the MMU limiter first. Not the best choice, but that's how you learn.
A high MMU Limiter means GPU performance is constrained by memory address translation and data fetches, not arithmetic.
Optimization 1: Buffer Pre-Loading
Buffer pre-loading combines related data into a single buffer so the GPU can fetch it more efficiently, instead of bouncing between multiple buffers. In my original shader, I was sending light data through separate buffers:
```metal
// Two separate buffers: the light array and, separately, its count.
constant PointLightUniform *pointLights [[buffer(lightPassPointLightsIndex)]],
constant int *pointLightsCount [[buffer(lightPassPointLightsCountIndex)]]
```
I restructured this into a single struct that packages the light data together. This change reduced the MMU Limiter from 32.19% to 26.43%.
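Here's a minimal sketch of the restructuring. The `LightData` name, the fixed array bound, the simplified `PointLightUniform` fields, and the buffer index are illustrative, not my engine's exact types:

```metal
#include <metal_stdlib>
using namespace metal;

// Simplified stand-in for the engine's per-light uniform.
struct PointLightUniform {
    float4 position;
    float4 color;
};

// Count and light array packaged in one struct, so the light loop
// walks a single contiguous buffer instead of bouncing between two.
struct LightData {
    int pointLightCount;
    PointLightUniform pointLights[64];  // fixed upper bound for the sketch
};

fragment float4 lightPassFragment(float4 fragCoord [[position]],
                                  constant LightData &lights [[buffer(0)]])
{
    float3 radiance = float3(0.0);
    for (int i = 0; i < lights.pointLightCount; ++i) {
        // accumulate this light's contribution (shading elided)
        radiance += lights.pointLights[i].color.rgb;
    }
    return float4(radiance, 1.0);
}
```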
Optimization 2: Use .read() Instead of .sample()
During the Light Pass, I fetch data from multiple G-buffer textures such as Position, Normal, Albedo, and SSAO. Originally, I used .sample(), but this does more work than necessary—it applies filtering and mipmap logic, which adds both memory traffic and math. Switching to .read() (a direct texel fetch) gave a noticeable improvement:
- Texture Read Limiter: 57.8% → 46.79%
- Texture Read Utilization: 26.83% → 30.03%
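For reference, the change looked roughly like this (texture names and binding indices are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

fragment float4 lightPassFragment(float4 fragCoord [[position]],
                                  texture2d<float> normalTexture [[texture(0)]])
{
    // Before: the full sampling path, with filtering, mip selection,
    // and address-mode logic that a 1:1 G-buffer lookup never needs:
    //   constexpr sampler s(filter::nearest);
    //   float4 n = normalTexture.sample(s, uv);

    // After: a direct texel fetch at integer pixel coordinates.
    uint2 pixel = uint2(fragCoord.xy);
    return normalTexture.read(pixel);
}
```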
Optimization 3: Reduce G-Buffer Precision
Next, I reduced the G-Buffer textures from 32-bit float to 16-bit half precision. I expected this to lower bandwidth usage, but to my surprise it made things worse:
- Texture Read Limiter: 46.79% → 56.9%
- Texture Read Utilization: 30.03% → 32.31%
Sometimes optimizations backfire, and this was one of those cases.
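For completeness, here's roughly what the change looked like. The real switch happens host-side when allocating the G-Buffer (a 16-bit float pixel format such as MTLPixelFormatRGBA16Float instead of RGBA32Float); on the shader side it surfaces as half textures. Names, bindings, and the placeholder shading are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

// G-buffer attachments bound as half precision: .read() now returns
// half4, halving the per-texel payload compared to float4.
fragment half4 lightPassFragment(float4 fragCoord [[position]],
                                 texture2d<half> albedoTexture [[texture(0)]],
                                 texture2d<half> normalTexture [[texture(1)]])
{
    uint2 pixel = uint2(fragCoord.xy);
    half4 albedo = albedoTexture.read(pixel);
    half4 normal = normalTexture.read(pixel);
    half ndotup = max(normal.y, half(0.0));  // placeholder shading for the sketch
    return half4(albedo.rgb * ndotup, 1.0);
}
```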
Optimization 4: Half-Precision Math in Lighting
I then focused on ALU utilization by switching parts of my lighting calculations to half-precision (half) math; see the sketch after this list. Specifically:
- Diffuse contribution → half precision
- Specular contribution → full precision (float)
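Here's a sketch of the split, with a simplified Lambert/Blinn-Phong pair standing in for my actual BRDF (function names and parameters are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

// Diffuse term in half precision: all of this maps to F16 ALU ops.
static half3 diffuseTerm(half3 albedo, half3 n, half3 l, half3 lightColor)
{
    half ndotl = max(dot(n, l), half(0.0));
    return albedo * lightColor * ndotl;
}

// Specular term kept in full 32-bit float precision.
static float3 specularTerm(float3 n, float3 v, float3 l,
                           float3 lightColor, float shininess)
{
    float3 h = normalize(v + l);
    float ndoth = max(dot(n, h), 0.0);
    return lightColor * pow(ndoth, shininess);
}
```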
The results were underwhelming:
- ALU Limiter: 37.65% → 37.55%
- ALU Utilization: 35.77% → 35.83%
- F32 Utilization: 27.86% → 25.12%
- F16 Utilization: 0% → 1.11%
Half-precision math showed up in the counters, but didn’t change the overall bottleneck.
Final Results
Here’s the overall improvement from all the optimizations:
Before → After
- GPU Time: 2.53 ms → 2.31 ms (~8.7% faster)
- Texture Read Limiter: 57.81% → 50.93%
- MMU Limiter: 32.19% → 23.41%
- ALU Limiter: 37.65% → 37.55% (flat)
- F32 Utilization: 27.86% → 25.12%
- F16 Utilization: 0.00% → 1.11%
- Integer & Complex Limiter: 14.25% → 8.82%
- Texture Read Utilization: 26.83% → 31.62%
What the Numbers Say
- I shaved about 9% off the Light Pass—a real improvement, but not dramatic.
- The biggest wins came from reducing memory-side pressure (MMU ↓, Texture Read Limiter ↓).
- ALU stayed flat, which shows the pass is still memory/texture-bound, not math-bound.
- Half-precision math registered, but didn’t help much since math wasn’t the bottleneck.
- Removing unnecessary integer/complex math improved things locally, but again, the frame was dominated by texture fetch bandwidth.
Takeaway
Optimizations don’t always yield big wins, but each attempt brings clarity. In this case, the profiler clearly shows that the Light Pass is memory/texture-bound. My next steps will focus directly on reducing texture fetch cost, rather than trimming ALU math.
Thanks for reading.