TL;DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. Debug builds have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled/reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used), so the entire triple-nested loop could and would compile to zero instructions in a real build, running infinitely faster. Or, more realistically, running 1 cycle per iteration instead of ~6, even without optimizing it away or doing major transformations.
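As a concrete stand-in (hypothetical code, since the full inner loop wasn't shown), imagine something of this shape:

    void triple_loop(void)
    {
        /* No side effects: at -O2/-O3 the whole thing can legally compile to nothing.
           At -O0, i, j and k live in stack slots, so every increment and compare
           is a reload and re-store of memory. */
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                for (int k = 0; k < 1000; k++) {
                    /* results never used */
                }
    }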
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs, and thus your inner loop is limited to running once per ~6 cycles, the latency of a memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp), which can sustain a throughput of one per 6 cycles when run back-to-back, might instead only create a bottleneck of one iteration per 4 or 5 cycles.
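As a hedged illustration of the kind of "useless" extra statement in question (hypothetical code; the original loop body wasn't shown):

    void loop_with_filler(void)
    {
        int sink = 0;
        for (int k = 0; k < 100000000; k++) {
            sink = k;   /* extra independent load+store at -O0: more unrelated memory work
                           between the store/reload pairs that keep k's value in its stack slot */
        }
        (void)sink;
    }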
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming.)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, and may make things worse on some other CPUs. See discussion in comments.
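As a hedged sketch of that situation (illustrative code, not from the question): a tiny out-of-line helper that returns its result through a pointer forces a store in the callee and a reload in the caller, so a chain of such calls is serialized on store-forwarding latency unless the function inlines.

    #include <stdint.h>

    void step(const uint64_t *in, uint64_t *out)    /* "returns" by reference */
    {
        *out = *in * 3 + 1;
    }

    uint64_t run(uint64_t x, int n)
    {
        uint64_t a = x, b;
        for (int i = 0; i < n; i++) {
            step(&a, &b);   /* store in step(), reload here: roughly one store-forwarding
                               round trip per call if step() doesn't inline */
            a = b;
        }
        return a;
    }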
Best Answer
As @RossRidge says, the overhead will mostly be lost in the noise of servicing an interrupt on a modern x86 (probably at least 100 clock cycles, and many many more if this is part of a modern OS with Meltdown + Spectre mitigation set up).
If MAX_VAL is a power of 2, counter %= MAX_VAL is excellent, especially if counter is unsigned (in which case it's just a simple and, or in this case a movzx from byte to dword, which can have zero latency on Intel CPUs. It still has a throughput cost of course: Can x86's MOV really be "free"? Why can't I reproduce this at all?)

Is it possible to fill the last 255-240 entries with something harmless, or repeats of something?

As long as MAX_VAL is a compile-time constant, though, counter %= MAX_VAL will compile efficiently to just a couple of multiplies, shifts, and adds. (Again, more efficient for unsigned.) Why does GCC use multiplication by a strange number in implementing integer division?

But a check for wrap-around is even better. A branchless check (using cmov) has lower latency than the remainder using a multiplicative inverse, and costs fewer uops for throughput. As you say, a branchy check can take the check off the critical path entirely, at the cost of a mispredict sometimes.
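For concreteness, here's a minimal sketch of the two C versions being compared (my own illustrative code with assumed names; not necessarily the exact source behind the Godbolt link below):

    #include <stdint.h>
    #define MAX_VAL 240                 /* table size from the question */
    extern uint32_t table[MAX_VAL];
    extern void output(uint32_t v);
    static unsigned counter;

    void isr_modulo(void)               /* remainder via multiplicative inverse: more uops, higher latency */
    {
        counter = (counter + 1) % MAX_VAL;
        output(table[counter]);
    }

    void isr_wrap_check(void)           /* add + cmp + cmov (or a predictable branch) */
    {
        unsigned next = counter + 1;
        counter = (next >= MAX_VAL) ? 0 : next;
        output(table[counter]);
    }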
I compiled many versions of this on the Godbolt compiler explorer, with recent gcc and clang.
(For more about static performance analysis of throughput and latency for short blocks of x86 asm, see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, and other links in the x86 tag wiki, especially Agner Fog's guides.)
clang uses branchless cmov for both versions. I compiled with -fPIE in case you're using that in your kernels. If you can use -fno-pie, then the compiler can save an LEA and use mov edi, [table + 4*rcx], assuming you're on a target where static addresses in position-dependent code fit in 32-bit sign-extended constants (e.g. true in the Linux kernel, but I'm not sure if they compile with -fPIE or do kernel ASLR with relocations when the kernel is loaded).

The block of 8 instructions starting at the load of the old counter value is a total of 8 uops (on AMD, or on Intel Broadwell and later, where cmov is only 1 uop). The critical-path latency from counter being ready to table[++counter % MAX_VAL] being ready is 1 (add) + 1 (cmp) + 1 (cmov) + load-use latency for the load, i.e. 3 extra cycles. That's the latency of one mul instruction. Or 1 extra cycle on older Intel where cmov is 2 uops.

By comparison, the version using modulo is 14 uops for that block with gcc, including a 3-uop mul r32. The latency is at least 8 cycles; I didn't count exactly. (For throughput it's only a little bit worse, though, unless the higher latency reduces the ability of out-of-order execution to overlap execution of stuff that depends on the counter.)

Other optimizations
Use the old value of counter, and prepare a value for next time (taking the calculation off the critical path).

Use a pointer instead of a counter. Saves a couple instructions, at the cost of using 8 bytes instead of 1 or 4 of cache footprint for the variable. (uint8_t counter compiles nicely with some versions, just using movzx to 64-bit.)

This counts upward, so the table can be in order. It increments after loading, taking that logic off the critical-path dependency chain for out-of-order execution.
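A hedged sketch of what the pointer version could look like (illustrative code with assumed names; not necessarily the exact source on Godbolt):

    #include <stdint.h>
    #define MAX_VAL 240
    extern uint32_t table[MAX_VAL];
    extern void output(uint32_t v);
    static const uint32_t *ptr = table;     /* 8 bytes of state instead of 1 or 4 */

    void isr_pointer(void)
    {
        uint32_t v = *ptr;                               /* load first, via a simple [reg] addressing mode */
        const uint32_t *next = ptr + 1;                  /* increment after loading */
        ptr = (next == table + MAX_VAL) ? table : next;  /* branchless wrap for next time */
        output(v);
    }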
This compiles really nicely with both gcc and clang, especially if you're using -fPIE, so the compiler needs the table address in a register anyway.

Again, 8 uops (assuming cmov is 1 uop). This has even lower latency than the counter version possibly could, because the [rax] addressing mode (with RAX coming from a load) has 1 cycle lower latency than an indexed addressing mode on Sandybridge-family. With no displacement, it never suffers the penalty described in Is there a penalty when base+offset is in a different page than the base?

Or (with a counter) count down towards zero: that potentially saves an instruction if the compiler can use the flags set by the decrement to detect the value becoming negative. Or we can always use the decremented value and do the wrap-around after that: so when counter is 1, we'd use table[--counter] (table[0]), but store counter=MAX_VAL
. Again, takes the wrap check off the critical path.If you wanted a branch version, you'd want it to branch on the carry flag, because
sub eax,1 / jc can macro-fuse into 1 uop, but sub eax,1 / js can't macro-fuse on Sandybridge-family. x86_64 - Assembly - loop conditions and out of order. But with branchless code it's fine: cmovs (mov if the sign flag is set, i.e. if the last result was negative) is just as efficient as cmovc (mov if the carry flag is set).

It was tricky to get gcc to use the flag result from dec or sub without also doing a cdqe to sign-extend the index to pointer width. I guess I could use an intptr_t counter, but that would be silly; might as well just use a pointer. With an unsigned counter, gcc and clang both want to do another cmp eax, 239 or something after the decrement, even though the flags are already set just fine by the decrement. But we can get gcc to use SF by checking (int)counter < 0, for example:
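Something along these lines (a hedged, illustrative reconstruction; the exact Godbolt source isn't reproduced here):

    #include <stdint.h>
    #define MAX_VAL 240
    extern uint32_t table[MAX_VAL];
    extern void output(uint32_t v);
    static unsigned counter = MAX_VAL - 1;

    void isr_countdown(void)
    {
        uint32_t v = table[counter];    /* load with the current value; not gated by the wrap check */
        counter--;                      /* the sub/dec sets SF if the value went negative */
        if ((int)counter < 0)           /* reuses SF instead of another cmp; typically a cmov, not a branch */
            counter = MAX_VAL - 1;      /* wrap for next time, off the critical path */
        output(v);
    }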
Still 8 uops (should be 7), but no extra latency on the critical path. So all of the extra decrement and wrap instructions are juicy instruction-level parallelism for out-of-order execution.