
gfxstrand

@gfxstrand@mastodon.gamedev.place

Linux 3D graphics developer. Author of the NIR optimizing compiler core in Mesa as well as open-source Vulkan drivers for Intel and Nvidia GPUs. Engineering Fellow @Collabora. I enjoy good food, especially BBQ, tacos, and pizza. 🏳️‍⚧️ (she/her)

karolherbst, in Random (English)

ahh, hail at 30 °C

gfxstrand ,

@karolherbst @Lyude That's sounding a lot like Texas about a month ago.

gfxstrand, in Random (English)

This week's project: Reworking NVK cbuf support. We've had a lot of issues with too much internal stalling and I think a lot of them come down to the fact that we're re-binding cbufs every draw call.

My plan for root constants is to do inline updates with the LOAD_CONSTANT_BUFFER command. I don't know how much of a difference it makes, but I strongly suspect it pipelines much better.

For bound cbufs, I'm planning to just make our dirty tracking way more competent.

We'll see how it goes!

gfxstrand OP ,

So far, it's pretty clear that re-binding cbufs repeatedly is causing a significant bottleneck. By using inline cbuf updates for cb0, disabling my bound cbuf optimization, and switching back to global memory reads for UBOs, we can get a 70% perf boost in The Witness on a 4090. Yeah, clearly something is serializing inside that monster. IDK what all but something.

gfxstrand OP ,

Next up: Bindless UBOs.

On NVIDIA, bindless UBOs use a 64-bit descriptor with 40 bits of base address and 14 bits of size / 4. Those can be referenced directly from ALU instructions and should get nearly the same shader perf as bound cbufs.
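
To make the descriptor format concrete, here is a minimal Python sketch that packs and unpacks such a 64-bit value. Only the field widths (40 bits of base address, 14 bits of size / 4) come from the post. The field placement (address in the low bits, size / 4 directly above it) and the helper names are assumptions for illustration and may not match the hardware.

```python
ADDR_BITS = 40  # width given in the post
SIZE_BITS = 14  # width given in the post; the field stores size / 4

# ASSUMPTION: the bit positions below are hypothetical, not taken from the post.
ADDR_SHIFT = 0
SIZE_SHIFT = ADDR_BITS


def pack_bindless_ubo(base_addr: int, size_bytes: int) -> int:
    """Pack a base address and byte size into one 64-bit descriptor."""
    if not 0 <= base_addr < (1 << ADDR_BITS):
        raise ValueError("base address does not fit in 40 bits")
    if size_bytes < 0 or size_bytes % 4 != 0:
        raise ValueError("size must be a non-negative multiple of 4")
    size_div4 = size_bytes // 4
    if size_div4 >= (1 << SIZE_BITS):
        raise ValueError("size / 4 does not fit in 14 bits")
    return (base_addr << ADDR_SHIFT) | (size_div4 << SIZE_SHIFT)


def unpack_bindless_ubo(desc: int) -> tuple[int, int]:
    """Return (base_addr, size_bytes) from a packed descriptor."""
    base_addr = (desc >> ADDR_SHIFT) & ((1 << ADDR_BITS) - 1)
    size_div4 = (desc >> SIZE_SHIFT) & ((1 << SIZE_BITS) - 1)
    return base_addr, size_div4 * 4
```

Under these assumptions the largest size that can be encoded is (2^14 - 1) * 4 = 65532 bytes.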

The real trick here, though, is that they require the use of the uniform register file. This, unfortunately, has funky register allocation implications because of how it interacts with uniformity and reconvergence.

gfxstrand OP ,

I believe the rule that I want is that UGPRs can only ever be assigned in uniform control-flow and remain live until their last uniform use. If a UGPR is live-in to a divergent branch instruction, then the merge which forces re-convergence is considered to be a use as well.

gfxstrand OP ,

I don't think they can ever safely be assigned outside of uniform control-flow. On NVIDIA HW, it's possible that two different portions of the wave may execute the same block at the same time without being converged. This means that we can't even assume the usual SSA interference rules within a single block unless we're guaranteed that all non-exited invocations are converged as they execute said block.

gfxstrand OP ,

@karolherbst That doesn't matter in this scenario. In this scenario, there are two different HW threads executing simultaneously with different subsets of the warp but out-of-sync. In theory, you could write a register in this scenario and hope that both threads writing the same value don't race. However, if you ever rewrite the register, even for something as simple as i++, the two threads would stomp on each other.
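
The stomping can be shown with a toy Python model. This is not how the hardware schedules anything: it just assumes two unconverged thread groups, A and B, run the same hypothetical program against one shared uniform register, and that `schedule` gives the order in which their instructions happen to execute.

```python
# Hypothetical program: i = 0; i++; i++; then read i. Each group expects 2.
PROGRAM = ["init", "inc", "inc", "use"]


def simulate(schedule):
    """Run PROGRAM for each group named in `schedule`, sharing one register.

    `schedule` lists which group executes its next instruction at each step.
    Returns the value each group observes at its "use" instruction.
    """
    ur0 = 0                               # the single shared uniform register
    pc = {group: 0 for group in set(schedule)}
    observed = {}
    for group in schedule:
        op = PROGRAM[pc[group]]
        pc[group] += 1
        if op == "init":
            ur0 = 0
        elif op == "inc":
            ur0 += 1
        elif op == "use":
            observed[group] = ur0
    return observed


# One group runs to completion before the other starts: both see 2.
in_order = simulate(list("AAAABBBB"))

# B's "init" lands between A's two increments: A sees 1 and B sees 3.
out_of_sync = simulate(list("AABAABBB"))
```

Only the rewrite is the problem here: if every write went to a fresh register, neither group could clobber a value the other still needs.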

gfxstrand OP ,

@karolherbst At that point the liveness rule would be that any register written in non-uniform control-flow is live until we get back into uniform control-flow. You could write but every value has to be a new register because nothing ever gets killed until you get back to uniform.

gfxstrand OP ,

@karolherbst Right, so if you can prove that only one invocation in the subgroup ever writes the value, then it's safe to re-use for that same invocation. Like we could maybe detect if (elect()) { } blocks and allow uniform registers there.

gfxstrand OP ,

Yeah, so this is a good bit more challenging than it looks at first.

Not being able to assign UGPRs outside of uniform control-flow also means I can't [un]spill them. This is fine for most things as non-uniform instructions (the only kind allowed in non-uniform control-flow) can accept either GPRs or UGPRs. If a UGPR is spilled to a GPR, I can just use the GPR.

gfxstrand OP ,

For bindless UBO usage, however, it MUST be a UGPR. If it's not, then I have to emit a different sequence which does a global memory load.

Unfortunately, making that decision after RA isn't really tractable as it also has implications for copy-propagation and other things. I really want to know up-front which UBO handles I can use directly and which I can't.

gfxstrand OP ,

My current plan is to add a NIR use_bindless_ubo_handles_nv intrinsic which goes in the uniform block, right before we diverge. Then only bindless handles listed in that intrinsic are allowed to be used inside the non-uniform control-flow. This will map to a similar intrinsic in NAK.

The problem I have yet to solve is how to teach copy-propagation to treat that as a barrier of sorts and not propagate LDC past it into non-uniform control-flow unless the handle is in that instruction.

gfxstrand OP ,

I've got the LDC lowering pass written and I think it probably works. I'm also reasonably happy with how it's structured.

Today I've been looking into wiring up uniform instructions and ouch... NVIDIA's uniform hardware is not as friendly as one would like it to be. 😩

gfxstrand OP ,

To start with, uniform ops are only allowed for integers. None of the float ops (except f2f for some reason?!?) have a uniform version. That means if you mix float and integer at all, you're going to end up going to the full wave.

Uniform predicates exist except that you can't usually use uniform predicates with wave ops. This means that if you do an integer calculation that results in a predicate and try to use that predicate in a float something, you have to manually convert it first.

gfxstrand OP ,

The one exception appears to be PLOP3 which is able to take a uniform predicate as src2. This means I can do a conversion to a wave predicate by PLOP3 with a LUT that gives me just src2.
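
For anyone unfamiliar with 3-input LUT ops, here is a small Python sketch of how such a truth table is evaluated and which LUT passes src2 through. It uses the convention documented for PTX `lop3.b32` (src0 = 0xF0, src1 = 0xCC, src2 = 0xAA). That PLOP3 encodes its LUT the same way is an assumption here, and the function name is made up for illustration.

```python
# Truth-table masks in the PTX lop3 convention.
# ASSUMPTION: PLOP3 uses the same encoding for its LUT.
SRC0, SRC1, SRC2 = 0xF0, 0xCC, 0xAA


def eval_lut3(lut: int, a: bool, b: bool, c: bool) -> bool:
    """Evaluate an 8-bit, 3-input truth table on boolean inputs."""
    index = (int(a) << 2) | (int(b) << 1) | int(c)
    return bool((lut >> index) & 1)


# A LUT equal to the src2 mask ignores src0 and src1 and returns src2.
PASS_SRC2 = SRC2

# Other functions are built by combining the masks, e.g. src0 AND src1.
AND_SRC0_SRC1 = SRC0 & SRC1
```

With this encoding, the "copy a uniform predicate to a wave predicate" trick is a 3-input op whose LUT is 0xAA, with the uniform predicate in the src2 slot.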

Also, there's no encoding for uniform predicates being used as actual instruction predicates as far as I know. This means control-flow can't take them.

About the only thing you can use a uniform predicate for is USEL. Better than nothing, I guess. 🤷🏻‍♀️

gfxstrand OP ,

This makes the ideal approach for NAK really unclear.

Do I try as hard as I can to use uniform stuff to save register pressure and then just hope that I don't have too much UR2R and R2UR sprinkled through my code?

Do I turn uniform ops that feed a UR2R into wave ops to avoid the UR2R? Most things are going to feed into a wave op eventually. At what point do I say "I've had enough" and emit some uniform code?

gfxstrand OP ,

Also, wave ops can consume uniform registers... sort of. It comes with all the same restrictions as consuming a cbuf or immediate value, meaning that you get exactly one "special" thing. So I can't have a wave op that takes a UGPR and an immediate. I can only have UGPR and GPR.
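
The "exactly one special thing" rule can be written down as a tiny legality check. This is a sketch of the constraint as described in the post, not NAK's actual code. The `Src` enum and `is_legal_wave_op` are hypothetical names, and real instructions have further per-slot restrictions that are ignored here.

```python
from enum import Enum


class Src(Enum):
    """Hypothetical source kinds for a wave (non-uniform) instruction."""
    GPR = "gpr"
    UGPR = "ugpr"
    CBUF = "cbuf"
    IMM = "imm"


def is_legal_wave_op(srcs) -> bool:
    """Allow at most one source that is not a plain GPR."""
    special = sum(1 for s in srcs if s is not Src.GPR)
    return special <= 1


ugpr_and_gpr = is_legal_wave_op([Src.UGPR, Src.GPR])   # allowed
ugpr_and_imm = is_legal_wave_op([Src.UGPR, Src.IMM])   # two special sources
```

When the check fails, one of the special sources has to be moved into a GPR first, which is where the extra MOV and UR2R traffic comes from.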

gfxstrand OP ,

This may not be too bad in practice. With the exception of a handful of 3-src ops like fma, if both sources are a UGPR, the destination probably is too, right? Of course not! That would be too easy. 😂

If I have a uniform op in uniform control-flow and then something which takes entirely uniform things in non-uniform control-flow, I can't make it a uniform op unless I hoist it. I mean, I could do that some but register pressure...

gfxstrand OP ,

I'm still trying to think through all this.

In the very short term, I may add a NIR op that I can use to force my bindless cbuf handles into UGPRs and then I can at least test my cbuf code.

gfxstrand OP ,

I did this hack and now I have dEQP-VK.ubo.* passing. It's not something I can ship in production but at least it gave me enough to be able to finally test some of my code.

gfxstrand OP ,

The hack also gets The Witness running and I'm seeing it go from 112 FPS to 137 FPS (+22%) on the RTX 4060 in my laptop when combined with NVK_DEBUG=no_cbuf. That's just the difference between LDC and LDG.CONSTANT.

When combined with the cbuf0 reworks from last week, I'm pretty optimistic that we're going to see a pretty decent perf bump once this all lands. 😁

Now I just have to get uniform ops working for real...

gfxstrand OP ,

I've been quiet this week but that doesn't mean nothing has happened! I've been hard at work on uniform ALU. As of earlier today, my uniform-alu branch is now regression-free according to the Vulkan CTS.

The thing I still have yet to do is to write the optimization that tries to get rid of all the unnecessary R2UR and MOV instructions we scatter everywhere.

gfxstrand OP ,

The biggest things that tripped me up this week were spilling and the instruction dependency tracker.

Spilling was fairly straightforward and my original plan mostly worked. The biggest annoyance is dealing with the fact that we can't unspill uniform values in non-uniform control-flow because that would mean writing to a uniform register in non-uniform control-flow which, as described above, is taboo.

gfxstrand OP ,

For UGPRs this isn't too bad. Since we can't have uniform instructions in non-uniform control-flow, I only have to worry about the case where a warp instruction is accessing a UGPR. However, every warp instruction that accesses a UGPR can also access a GPR just as easily. Since the spiller spills UGPRs to GPRs instead of directly to memory, this is just a matter of using the spilled value instead of unspilling.

gfxstrand OP ,

For predicates, it gets trickier. We still have the same rule that warp instructions accessing UPreds can also use a Pred. However, I'm not spilling UPred to Pred but am spilling UPred to UGPR. This is because we don't have many predicates of either form and spilling to UGPR is probably more efficient.

However, we can unspill a UGPR to a Pred by using ISETP to do UGPR != 0, as long as the UGPR is in src1. So whenever we would unspill, we unspill to Pred and switch the instruction to use that.

gfxstrand OP ,

The other issue with spilling is that the Braun-Hack algorithm we implement doesn't quite work under the constraints of uniform registers.

The way the Braun-Hack algorithm works is to walk the CFG in a dominance-preserving order. For each block, you compute sets W and S of resident and spilled values, respectively, and then process the block. The W and S sets at the start of a block are initialized based on the predecessor blocks and the paper provides a strategy for initializing them.

gfxstrand OP ,

The problem was that the way you're supposed to initialize W and S for loops ignores the predecessor blocks and instead looks only at the loop's internals. This is good for optimizing spilling around a loop but it totally ignores one of the core invariants we need for uniform values: You can't [un]spill them inside non-uniform control-flow.

Instead, I added a special case for uniform registers in non-uniform control-flow that takes the W and S sets from the predecessors verbatim.

gfxstrand OP ,

This should work because nothing else is going to [un]spill in non-uniform control-flow, so the only way those sets can change before everything re-converges is if values are killed. Nothing new will ever become live. This means that I can re-use the sets and never need to worry about whether or not the union of the predecessors' sets will use too many values.
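
A rough Python sketch of that special case follows. It is a guess at the shape of the logic, not the NAK implementation. The function and parameter names are hypothetical, "verbatim" is interpreted as the plain union of the predecessors' exit sets, and `default_init` stands in for the strategy from the paper.

```python
def init_block_sets(pred_exit_sets, is_uniform_file, in_non_uniform_cf,
                    default_init):
    """Pick the W (resident) and S (spilled) sets at the start of a block.

    pred_exit_sets: list of (W, S) pairs from already-processed predecessors.
    default_init:   callable implementing the usual initialization strategy
                    (hypothetical placeholder).
    """
    if is_uniform_file and in_non_uniform_cf:
        # ASSUMPTION: take the predecessors' sets as they are. Nothing can
        # be spilled or unspilled here, so values can only die, never appear.
        w = set().union(*(w for w, _ in pred_exit_sets))
        s = set().union(*(s for _, s in pred_exit_sets))
        return w, s
    return default_init(pred_exit_sets)


preds = [({"ur0", "ur1"}, {"ur2"}), ({"ur0"}, {"ur2", "ur3"})]
w, s = init_block_sets(preds, True, True, lambda p: (set(), set()))
```

Here `w` ends up as {"ur0", "ur1"} and `s` as {"ur2", "ur3"}; no register-limit check is applied to the union, for the reason given above.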

gfxstrand OP ,

The other thing that gave me headaches, and still is, is the instruction dependency tracker.

While debugging various UGPR fails, I noticed a class of hazards that hasn't been an issue before: Write-after-read hazards.

We have a dependency tracker that understands read-after-write hazards and inserts delays as needed to avoid them. However, we assumed that pipelining took care of the rest: that reads happened fast enough and writes slow enough that there would never be a problem here.

gfxstrand OP ,

Well, it turns out that the LDC instruction is so blazing fast that if you have a sequence like

ur3 = uiadd3 ur0, ur1, ur2
ur0 = uldc c[0][0x20]

the ULDC might overwrite ur0 before the UIADD3 gets around to reading it.
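
To spell out what a write-after-read check has to look for, here is a minimal Python sketch that scans a straight-line sequence for pairs where a later instruction writes a register that an earlier one reads. The `Instr` type and `find_war_pairs` are hypothetical and this is not NAK's dependency tracker. A real tracker would also need per-instruction latencies to decide whether a pair actually needs a delay.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical, simplified instruction: one destination, some sources."""
    op: str
    dst: str
    srcs: tuple


def find_war_pairs(instrs):
    """Return (reader_index, writer_index, register) for each WAR candidate."""
    pairs = []
    for i, early in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            late = instrs[j]
            if late.dst in early.srcs:
                pairs.append((i, j, late.dst))
    return pairs


# The sequence from the post.
sequence = [
    Instr("uiadd3", "ur3", ("ur0", "ur1", "ur2")),
    Instr("uldc", "ur0", ("c[0][0x20]",)),
]
war = find_war_pairs(sequence)
```

For the sequence above, `war` is `[(0, 1, "ur0")]`: the ULDC at index 1 writes ur0, which the UIADD3 at index 0 still has to read.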
