C++ Optimization – Why Custom Loop is Faster: Compiler Issues, Unsafe Code, or Cache Hits?

assemblyc++optimization

i just started learning assembly and making some custom loop for swapping two variables using C++ 's asm{} body with Digital-Mars compiler in C-Free 5.0

Enabled the -o (optimization)

And got the results:

 time of for-loop(cycles)        844
 time of while-loop(cycles)      735
 time of custom-loop-1(cycles)   562
 time of custom-loop-2(cycles)   469

i couldnt find Digital-Mars compiler "asm output" option to compare.
There is no other optimisation options in the build options.
Should i change my compiler? if yes, which one?
Can you look at the codes below and tell me why custom loops are faster?

here is the standard for loop:

t1=clock(); 
for(int i=0;i<200000000;i++)
{
    temp=a;//instruction 1
    a=b;//instruction 2
    b=temp;//3 instructions total   
}   
t2=clock();
printf("\n time of for-loop(increasing) %i  \n",(t2-t1));

here is the standard while loop:

t1=clock();
while(j<200000000)
{
    temp=a;//again it is three instructions
    a=b;
    b=temp; 
            j++;
}
t2=clock();
printf("\n time of while-loop(cycles)  %i  \n",(t2-t1));

here is my custom loop 1:

t1=clock();
j=200000000;//setting the count
    __asm
    {
        pushf           //backup
        push eax        //backup
        push ebx        //backup
        push ecx        //backup
        push edx        //backup

        mov ecx,0       //init of loop range(0 to 200000000)
        mov edx,j

        do_it_again:    //begin to loop


        mov eax,a       //basic swap steps between cpu and mem(cache)
        mov ebx,b       
        mov b,eax       
        mov a,ebx       //four instructions total

        inc ecx         // j++
        cmp ecx,edx     //i<200000000  ?
        jb do_it_again  // end of loop block

        pop edx     //rolling back to history   
        pop ecx         
        pop ebx         
        pop eax         
        popf            
    }

t2=clock();
printf("\n time of custom-loop-1(cycles)   %i   \n",(t2-t1));

here is my second custom loop:

t1=clock();
j=200000000;//setting the count
    __asm
    {
        pushf           //backup
        push eax        
        push ebx        
        push ecx        
        push edx        

        mov ecx,0       //init of loop range(0 to 200000000)
        mov edx,j

        mov eax,a       //getting variables to registers
        mov ebx,b

        do_it_again2:   //begin to loop

        //swapping with using only 2 variables(only in cpu)
        sub eax,ebx         //a is now a-b
        add ebx,eax         //b is now a
        sub eax,ebx         //a is now -b
        xor eax,80000000h   //a is now b and four instructions total

        inc ecx         // j++
        cmp ecx,edx     //i<200000000  ?
        jb do_it_again2  // end of loop block

        pop edx         //rollback
        pop ecx         
        pop ebx         
        pop eax         
        popf            
    }

t2=clock();
printf("\n time of custom-loop-2(cycles)  %i   \n",(t2-t1));

full code:

#include<stdio.h>
#include<stdlib.h>
#include<time.h>

int main()
{
int j=0;

int a=0,b=0,temp=0;

srand(time(0));
time_t t1=0;
time_t t2=0;


t1=clock(); 
for(int i=0;i<200000000;i++)
{
    temp=a;//instruction 1
    a=b;//instruction 2
    b=temp;//3 instructions total   
}   
t2=clock();
printf("\n time of for-loop(cycles) %i  \n",(t2-t1));


t1=clock();
while(j<200000000)
{
    temp=a;//again it is three instructions
    a=b;
    b=temp; 
    j++;
}
t2=clock();
printf("\n time of while-loop(cycles)  %i  \n",(t2-t1));


t1=clock();
j=200000000;//setting the count
    __asm
    {
        pushf           //backup
        push eax        //backup
        push ebx        //backup
        push ecx        //backup
        push edx        //backup

        mov ecx,0       //init of loop range(0 to 200000000)
        mov edx,j

        do_it_again:    //begin to loop


        mov eax,a       //basic swap steps between cpu and mem(cache)
        mov ebx,b       
        mov b,eax       
        mov a,ebx       //four instructions total

        inc ecx         // j++
        cmp ecx,edx     //i<200000000  ?
        jb do_it_again  // end of loop block

        pop edx     //rolling back to history   
        pop ecx         
        pop ebx         
        pop eax         
        popf            
    }

t2=clock();
printf("\n time of custom-loop-1(cycles)   %i   \n",(t2-t1));


t1=clock();
j=200000000;//setting the count
    __asm
    {
        pushf           //backup
        push eax        
        push ebx        
        push ecx        
        push edx        

        mov ecx,0       //init of loop range(0 to 200000000)
        mov edx,j

        mov eax,a       //getting variables to registers
        mov ebx,b

        do_it_again2:   //begin to loop

        //swapping with using only 2 variables(only in cpu)
        sub eax,ebx         //a is now a-b
        add ebx,eax         //b is now a
        sub eax,ebx         //a is now -b
        xor eax,80000000h   //a is now b and four instructions total

        inc ecx         // j++
        cmp ecx,edx     //i<200000000  ?
        jb do_it_again2  // end of loop block

        pop edx         //rollback
        pop ecx         
        pop ebx         
        pop eax         
        popf            
    }

t2=clock();
printf("\n time of custom-loop-2(cycles)  %i   \n",(t2-t1));

return 0;

}

i am just learning c++ and assembly and wondered how things going on.
Thank you

windows xp, pentium 4 (2 GHz) Digital-Mars in C-Free

Best Answer

The code generated by that compiler is pretty horrible. After disassembling the object file with objconv, here's what I got in regards to the first for loop.

?_001:  cmp     dword [ebp-4H], 200000000               ; 0053 _ 81. 7D, FC, 0BEBC200
        jge     ?_002                                   ; 005A _ 7D, 17
        inc     dword [ebp-4H]                          ; 005C _ FF. 45, FC
        mov     eax, dword [ebp-18H]                    ; 005F _ 8B. 45, E8
        mov     dword [ebp-10H], eax                    ; 0062 _ 89. 45, F0
        mov     eax, dword [ebp-14H]                    ; 0065 _ 8B. 45, EC
        mov     dword [ebp-18H], eax                    ; 0068 _ 89. 45, E8
        mov     eax, dword [ebp-10H]                    ; 006B _ 8B. 45, F0
        mov     dword [ebp-14H], eax                    ; 006E _ 89. 45, EC
        jmp     ?_001                                   ; 0071 _ EB, E0

The issues should be clear to anybody who ever looked at some assembly.

  1. The loop is very tightly dependent on the value that is put in eax. This makes any out-of-order execution practically impossible due to dependencies created on that register by every next instruction.

  2. There are six general-purpose registers available (since ebp and esp aren't really general-purpose in most of the setups), but your compiler uses none of them, falling back to using the local stack. This is absolutely unacceptable when speed is the optimization goal. We can even see that the current loop index is stored at [ebp-4H], while it could've been easily stored in a register.

  3. The cmp instruction uses a memory and an immediate operand. This is the slowest possible mix of operands and should never be used when performance is at stake.

  4. And don't get me started on the code size. Half of those instructions are just unnecessary.

All in all, the first thing I'd do is ditch that compiler at the earliest possible chance. But then again, seeing that it offers "memory models" as one of its options, one can't really seem to have much hope.

Related Question