I'm running BenchmarkDotNet against the following code on .NET 8:
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[StructLayout(LayoutKind.Explicit)]
public readonly struct Counter
{
    [FieldOffset(0)] public readonly int Value;
    [FieldOffset(1)] public readonly byte MidByte;

    public Counter(int value)
        : this()
    {
        this.Value = value;
    }

    public Counter Increment(int count) // 0 <= count <= 32
    {
        int value = this.Value + count;
        int lowByte = (byte)value;
        if (lowByte > 32)
            value = value - 32 + (this.MidByte < 16 ? 0x100 : 0xF100);
        return new Counter(value);
    }

    public static Counter operator ++(Counter counter) => counter.Increment(1);
}
public class Benchmarks
{
    [Benchmark]
    public Counter Benchmark1()
    {
        Counter c = new(0x10101);
        for (int i = 0; i < 10000; ++i) {
            c = c.Increment(1);
        }
        return c;
    }

    [Benchmark]
    public Counter Benchmark2()
    {
        Counter c = new(0x10101);
        for (int i = 0; i < 10000; ++i) {
            ++c;
        }
        return c;
    }
}
The only difference between Benchmark1 and Benchmark2 is that Benchmark2 calls operator ++ (which in turn calls Increment(1)), whereas Benchmark1 just calls Increment(1) directly. Because the JIT compiler will likely inline operator ++, I expected these two benchmarks to perform the same. To my great surprise and bafflement, ++c absolutely demolishes c = c.Increment(1):
BenchmarkDotNet v0.13.8, Windows 10 (10.0.19045.4651/22H2/2022Update)
11th Gen Intel Core i7-11800H 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.303
[Host] : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Method      Mean       Error      StdDev     Code Size
Benchmark1  18.376 us  0.1999 us  0.1870 us  68 B
Benchmark2   6.564 us  0.0436 us  0.0408 us  81 B
Why is ++c so much faster than c.Increment(1) when the former simply calls the latter?
UPDATE #1: A couple commenters questioned whether ++c simply discards the result instead of updating c. I verified that both Benchmark1 and Benchmark2 return the same Counter value (0x00140911), which proves that both benchmarks perform the same 10000 calculations.
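For reference, that check can be reproduced with a small standalone program; this is my own sketch (no BenchmarkDotNet required), reusing the Counter struct verbatim from the question:

```csharp
using System;
using System.Runtime.InteropServices;

// Drive both loop bodies side by side and compare the results.
Counter a = new(0x10101);
Counter b = new(0x10101);
for (int i = 0; i < 10000; ++i)
{
    a = a.Increment(1); // Benchmark1's loop body
    ++b;                // Benchmark2's loop body
}
Console.WriteLine($"{a.Value:X8} {b.Value:X8}"); // prints "00140911 00140911"

[StructLayout(LayoutKind.Explicit)]
public readonly struct Counter
{
    [FieldOffset(0)] public readonly int Value;
    [FieldOffset(1)] public readonly byte MidByte;

    public Counter(int value) : this() => this.Value = value;

    public Counter Increment(int count) // 0 <= count <= 32
    {
        int value = this.Value + count;
        int lowByte = (byte)value;
        if (lowByte > 32)
            value = value - 32 + (this.MidByte < 16 ? 0x100 : 0xF100);
        return new Counter(value);
    }

    public static Counter operator ++(Counter counter) => counter.Increment(1);
}
```

The arithmetic also checks out by hand: the low byte counts 1..32 and the middle byte counts 1..16, so 10000 = 19 * 512 + 8 * 32 + 16 advances the bytes from 0x01/0x01/0x01 to 0x14/0x09/0x11.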
UPDATE #2: If I replace the reference to this.MidByte in Counter.Increment with the equivalent expression (byte)(value >> 8), then the performance gap between Benchmark1 and Benchmark2 vanishes:
Method      Mean      Error      StdDev
Benchmark1  4.662 us  0.0489 us  0.0433 us
Benchmark2  4.595 us  0.0399 us  0.0373 us
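In code, the UPDATE #2 change looks like this (my sketch; only the body of Increment differs from the question's version):

```csharp
using System;
using System.Runtime.InteropServices;

Counter c = new(0x10101);
for (int i = 0; i < 10000; ++i)
    c = c.Increment(1);
Console.WriteLine($"{c.Value:X8}"); // prints "00140911", same as before

[StructLayout(LayoutKind.Explicit)]
public readonly struct Counter
{
    [FieldOffset(0)] public readonly int Value;
    [FieldOffset(1)] public readonly byte MidByte; // still declared, but no longer read below

    public Counter(int value) : this() => this.Value = value;

    public Counter Increment(int count) // 0 <= count <= 32
    {
        int value = this.Value + count;
        int lowByte = (byte)value;
        if (lowByte > 32)
            // (byte)(value >> 8) stands in for this.MidByte: the low byte stays
            // small enough here that adding count never carries into bits 8-15.
            value = value - 32 + ((byte)(value >> 8) < 16 ? 0x100 : 0xF100);
        return new Counter(value);
    }

    public static Counter operator ++(Counter counter) => counter.Increment(1);
}
```

With no read of the overlapping MidByte field, nothing in the hot loop depends on the explicit layout anymore.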
Best Answer
There is definitely a difference in the JITted code between .NET 6 and .NET 8, and I think it's due to the explicit struct layout and the use of the overlapping field (MidByte), and also this issue and the comments/linked issues. This GitHub issue from 2023 seems very relevant to our case, especially the two comments.

First of all, since there is a lot of inlining going on, I've simplified the Increment method (still preserving the differences in emitted code) (full sharplab):

EDIT: The following only holds true for our very special case of having an explicit struct layout AND using the explicitly laid-out MidByte field in the method. If you experiment in the sharplab I linked, you'd find that the JIT compiler produces identical code for Benchmark1 and Benchmark2 if we have default sequential layout (if you manage to find a corner case there too, please comment).

There is no difference between .NET 6 and .NET 8 in how Counter.Increment and Counter.op_Increment are jitted: even at this point the call to Counter.Increment is inlined in the op_Increment method.
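To illustrate the EDIT above, here is a sketch of a default-sequential-layout variant (turning MidByte into a computed property is my assumption about how one would drop the explicit layout while preserving the semantics):

```csharp
using System;

Counter c = new(0x10101);
for (int i = 0; i < 10000; ++i)
    ++c;
Console.WriteLine($"{c.Value:X8}"); // prints "00140911", matching the original

// Default (sequential) layout: no [StructLayout], no overlapping fields.
public readonly struct Counter
{
    public readonly int Value;

    // Same value the explicit [FieldOffset(1)] field aliased.
    public byte MidByte => (byte)(this.Value >> 8);

    public Counter(int value) => this.Value = value;

    public Counter Increment(int count) // 0 <= count <= 32
    {
        int value = this.Value + count;
        int lowByte = (byte)value;
        if (lowByte > 32)
            value = value - 32 + (this.MidByte < 16 ? 0x100 : 0xF100);
        return new Counter(value);
    }

    public static Counter operator ++(Counter counter) => counter.Increment(1);
}
```

Per the sharplab experiments described above, a variant like this JITs to identical code for Benchmark1 and Benchmark2.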
The difference between the methods lies in the fact that Counter.Increment is an instance method and relies on this being in [ecx], whereas Counter.op_Increment(Counter) is a static method that has a reference to the counter instance on the stack ([esp+8]). The logic is otherwise the same, as the instance method is really inlined into the static method.

There definitely seems to be an optimization in Benchmark1 (which makes calls to Counter.Increment(Int32)) between .NET 6 and .NET 8 (stripping the prologue/epilogue/loop code): as you can see, the unnecessary struct-copying ceremony is optimized away.
Now to the case of Benchmark2: all the unnecessary copying through temp variables is gone. The interesting thing is that here we don't have the initial assignment to a local variable; instead we do:
My speculation is that it's because we inline a static method which relies on mutation of a local variable and not of this. In Benchmark1, this was optimized to be [ebp-4], which was already declared, so we reused it. In Benchmark2 we don't mutate this, but we need to share a local variable from our method (Benchmark2) as a local variable for the static inlined method (Counter.op_Increment(Counter)). The static method also needs an argument; it would normally get it from the stack, but here we are inlined, so the best thing is to receive it in a register, eax. In the spirit of removing ceremony, we omitted all the temp stack variables, so we can use only the register for holding the counter.

Looking at the assembly of Benchmark1 vs Benchmark2 shows why the latter is faster: we have one more instruction in Benchmark1 because our "outer" variable is on the stack and not in a register.

Needless to say, this observed behavior can and probably will change in future versions of .NET, as there are more plans to optimize structs.
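If you want to reproduce the assembly comparison yourself, BenchmarkDotNet's disassembly diagnoser can dump the JITted code alongside the results (the "Code Size" column in the question's output indicates such a diagnoser was already attached). A sketch of the harness, reusing the Counter struct from the question:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Benchmarks>();

// maxDepth > 1 also disassembles inlined/called methods such as Increment.
[DisassemblyDiagnoser(maxDepth: 3)]
public class Benchmarks
{
    [Benchmark]
    public Counter Benchmark1()
    {
        Counter c = new(0x10101);
        for (int i = 0; i < 10000; ++i)
            c = c.Increment(1);
        return c;
    }

    [Benchmark]
    public Counter Benchmark2()
    {
        Counter c = new(0x10101);
        for (int i = 0; i < 10000; ++i)
            ++c;
        return c;
    }
}
```

This requires the BenchmarkDotNet package and the Counter type from the question; the disassembly listings are written to the BenchmarkDotNet.Artifacts output folder.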