
Difference using SSE on Intel and AMD processors (?)

Last post 09-25-2006, 2:11 PM by JimDempseyAtTheCove. 4 replies.
 09-16-2006, 9:11 PM 30223971  

Difference using SSE on Intel and AMD processors (?)

Hi, I have the following problem: I use SSE in a database system for selection (the SQL select command). I wrote the code on an AMD Duron 1800 MHz processor and it works fine. But when I tested it on an Intel Pentium 4 and a Pentium D, it gave bad results; it doesn't work properly. In C, the selection could be written in a simplified way (for the SQL command "select * from TABLE where VALUE > key"):

----------------------------------
float *input;      // address of input data stored in an array
                   // (data page with 820 entries of type float)

float key;         // find key
...
for (i = 0; i < get_values_count_in_column(); i++)
    {
    if (input[i] > key)
       add_row_to_output_table(i);
    }

----------------------------------

My SSE code is:

----------------------------------
float *input;
float *key;         // address of the search key
...
__asm
    {
    mov    esi, input
    mov    edi, key
    xor    ecx, ecx                  // counter
    xor    edx, edx
    mov    ebx, values_count_in_column

    movss    xmm1, [edi]             // xmm1 <- key
    shufps   xmm1, xmm1, 0           // broadcast
                   
    prefetchnta    [esi+32]

START_LOOP:
    movaps   xmm0, [esi]             // xmm0 <- input

    // The following line is problematic. On the AMD processor it is
    // not needed, while on Intel processors, without this line, XMM1
    // loses its contents after the first call to the procedure
    // sse_add_row (see below).
    movaps    xmm1, [edi]           // xmm1 <- key


    cmpnleps  xmm0, xmm1            // compare input > key
    movmskps  edx, xmm0             // store mask to edx

    // for testing purposes, we show the xmm1 register (see below)
    push   eax
    push   ecx
    push   edx
    call   show_xmm1
    pop    edx
    pop    ecx
    pop    eax

    test   edx, edx         // if nothing found, skip testing bits
    jz     NOT_FOUND_3

FOUND:
    test   edx, 1          // test bit 0
    jz     NOT_FOUND_0     // if not set, jump to test bit 1

    // bit is set, we have to store data into output
    // selection table - it is done by function sse_add_row
    push   eax
    push   ecx
    push   edx
    call   sse_add_row     // sse_add_row stores the entry with
                           // offset in ecx to the output table in the DBS
    pop    edx
    pop    ecx
    pop    eax

NOT_FOUND_0:
    test   edx, 2          // test bit 1
    jz     NOT_FOUND_1     // if not set, jump to test bit 2

    push   eax
    push   ecx
    push   edx
    add    ecx, 1
    call   sse_add_row               
    pop    edx
    pop    ecx
    pop    eax

NOT_FOUND_1:
    test   edx, 4         // test bit 2
    jz     NOT_FOUND_2    // if not set, jump to test bit 3

    push   eax
    push   ecx
    push   edx
    add    ecx, 2
    call   sse_add_row
    pop    edx
    pop    ecx
    pop    eax

NOT_FOUND_2:
    test   edx, 8         // test bit 3
    jz     NOT_FOUND_3    // if not set, jump to end of bit testing

    push   eax
    push   ecx
    push   edx
    add    ecx, 3
    call   sse_add_row
    pop    edx
    pop    ecx
    pop    eax

NOT_FOUND_3:      
    add    esi, 16
    add    ecx, 4
    cmp    ecx, ebx
    jne    START_LOOP
    }
...

// write entry to output table of the DBS
// sse_ecx is offset of found entry
void __fastcall sse_add_row(int sse_ecx)
    {
    Row *row = algebra -> generateRow(table, page, sse_ecx);
    algebra -> syscat  -> addRowData (output_table, row);
    delete row;
    }

// print the contents of XMM1 (for testing purposes only)
void __fastcall show_xmm1(int sse_ecx)
    {
    float *o = (float *)malloc(4 * sizeof(float));
    __asm
        {
        mov    edi, o
        movups [edi], xmm1
        }

    printf("%d: %f %f %f %f\n", sse_ecx, o[0], o[1], o[2], o[3]);
    free(o);
    }
----------------------------------

On the AMD Duron 1800 MHz processor, the line marked as problematic above (movaps xmm1, [edi] inside the loop) is not needed because XMM1 is already loaded (movss and broadcast) and its contents are constant. But on Intel, its contents are constant only until the procedure sse_add_row is called. After the first call the contents of XMM1 are changed: they are overwritten with the components 0.00000    2.90625    0.00000    0.00000    and then stay constant at these values.

I don't understand which part of the code is wrong or has some strange side effect, why it runs fine on AMD, and why the new contents of XMM1 are exactly 0.00000    2.90625    0.00000    0.00000. I studied the manuals covering the instructions and the function calling conventions, but I couldn't find what could modify the contents of XMM1, or why it happens only on Intel processors.

I have now tested it on an AMD Turion and it runs the same way as on Intel: the contents of XMM1 are overwritten. So my program runs correctly only on the AMD Duron 1800 MHz.

Can somebody find the clue? Thanks in advance.
Jozef
 
 09-19-2006, 3:53 PM 30224080 in reply to 30223971  

Re: Difference using SSE on Intel and AMD processors (?)

I think the calling convention for the mmx/xmm registers requires that they be "caller saved" which means that they could be modified by called functions.  You have to save them before calls and restore them afterwards.  Check the following links:

Windows:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/Kernel_d/hh/Kernel_d/64bitamd_6848c803-89d3-4f19-82b2-6fae5e63ec13.xml.asp

Intel compiler changes:

http://www.intel.com/support/performancetools/c/windows/sb/cs-020438.htm

Of passing interest:

SysV AMD64 ABI:

http://www.x86-64.org/documentation/abi-0.96.pdf#search=%22linux%20IA32%20ABI%20SSE%22
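To illustrate the "caller saved" point, here is a minimal sketch of the same selection written with SSE intrinsics instead of inline assembly. When the key vector is an ordinary C variable, the compiler is the one responsible for keeping it valid across calls (by spilling and reloading it if the callee may overwrite XMM registers). The names select_greater, row_callback, match_list and collect_row are made up for this example and stand in for the poster's sse_add_row machinery; unaligned loads are used so the sketch does not depend on how the data page is aligned.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical callback type standing in for sse_add_row. */
typedef void (*row_callback)(size_t row_index, void *ctx);

/* Hypothetical collector used only to demonstrate the callback. */
struct match_list {
    size_t rows[16];
    size_t count;
};

void collect_row(size_t row_index, void *ctx)
{
    struct match_list *list = (struct match_list *)ctx;
    if (list->count < 16)
        list->rows[list->count++] = row_index;
}

/* Calls on_match(i, ctx) for every i in [0, count) with input[i] > key. */
void select_greater(const float *input, size_t count, float key,
                    row_callback on_match, void *ctx)
{
    /* Broadcast the key once. The compiler, not the programmer, must keep
       this value intact across the on_match calls below. */
    const __m128 key_vec = _mm_set1_ps(key);
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        __m128 values = _mm_loadu_ps(input + i);
        /* Mirrors cmpnleps + movmskps; differs from '>' only for NaN. */
        int mask = _mm_movemask_ps(_mm_cmpnle_ps(values, key_vec));

        for (int lane = 0; lane < 4; ++lane)
            if (mask & (1 << lane))
                on_match(i + (size_t)lane, ctx);
    }

    /* Scalar tail for counts that are not a multiple of 4. */
    for (; i < count; ++i)
        if (input[i] > key)
            on_match(i, ctx);
}
```

In hand-written assembly there is no such help: anything kept in XMM0-XMM7 across a call has to be saved or reloaded by the code that makes the call.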

 
 09-19-2006, 5:29 PM 30224086 in reply to 30224080  

Re: Difference using SSE on Intel and AMD processors (?)

Can you step into the function calls and see if some instruction is explicitly writing to the XMM1 register?
 
 09-21-2006, 1:27 AM 30224203 in reply to 30223971  

Re: Difference using SSE on Intel and AMD processors (?)

Another response to the original question, forwarded to us by engineering:

Which compiler are you using? XMM registers in 32-bit mode are volatile (a called function may overwrite them), and the question appears to assume they are preserved. It is very likely that the compiler is calling an optimized memory routine in the Intel case.

==

Lexi S.

Intel(R) Software Network Support

http://www.intel.com/software


 

 
 09-25-2006, 2:11 PM 30224387 in reply to 30223971  

Re: Difference using SSE on Intel and AMD processors (?)

Jozef,

Some comments on your code:

First

    movss    xmm1, [edi]             // xmm1 <- key
    shufps   xmm1, xmm1, 0           // broadcast

is not equivalent to

    movaps    xmm1, [edi]           // xmm1 <- key

unless edi points to four identical single-precision FP values (I assume it does).
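A small sketch of that difference using SSE intrinsics rather than inline assembly. The helper names (broadcast_first, load_four, all_lanes_equal) are invented for this illustration. Note that movaps also requires a 16-byte aligned address; the sketch uses the unaligned load so it runs on any buffer.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* movss + shufps ..., 0: read ONE float and copy it into all four lanes. */
__m128 broadcast_first(const float *p)
{
    __m128 v = _mm_load_ss(p);                             /* p[0], 0, 0, 0 */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  /* p[0] in every lane */
}

/* movaps-style load: read FOUR consecutive floats from memory
   (unaligned variant, so no alignment assumption is needed here). */
__m128 load_four(const float *p)
{
    return _mm_loadu_ps(p);
}

/* Returns 1 when all four lanes of a and b compare equal, else 0. */
int all_lanes_equal(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}
```

With a buffer holding the key four times, both loads give the same register contents; with a buffer holding the key followed by other data, only the broadcast version gives four copies of the key.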

Second, as per Lexi's suggestion, step through your code. You will most likely find that the code called by your sse_add_row is modifying XMM1 (it is the caller's responsibility to preserve/restore XMM registers). If you find this is the case, then insert the

    movaps    xmm1, [edi]           // xmm1 <- key

following each call to sse_add_row. In this manner the overhead only occurs when needed. (Remove the movaps at the top of the loop, the one you thought was unnecessary.)

Third, if your data is such that the majority of compares are "not found", then rearrange the code to place the NOT_FOUND_3 section directly after the first test:

START_LOOP:
    ...
    test   edx,edx
    jnz    FOUND
NOT_FOUND_3:      
    add    esi, 16
    add    ecx, 4
    cmp    ecx, ebx
    jne    START_LOOP
    jmp    DONE

FOUND:
    ...
DONE:
}
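The same fast-path idea can be sketched in C with SSE intrinsics. This is only an illustration under assumptions: select_greater_rows is a made-up name, matches are written to a caller-supplied index buffer instead of calling sse_add_row, and the compiler decides the final branch layout, so the source only expresses the intent that the "nothing found" case goes straight to the next group of four.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Writes every index i with input[i] > key into out_rows and returns
   the number of indices written. out_rows must have room for count entries. */
size_t select_greater_rows(const float *input, size_t count, float key,
                           size_t *out_rows)
{
    const __m128 key_vec = _mm_set1_ps(key);
    size_t found = 0;
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        __m128 values = _mm_loadu_ps(input + i);
        int mask = _mm_movemask_ps(_mm_cmpnle_ps(values, key_vec));

        if (mask == 0)
            continue;   /* common case: no match in these four values */

        /* Rare case: decode the four mask bits. */
        if (mask & 1) out_rows[found++] = i;
        if (mask & 2) out_rows[found++] = i + 1;
        if (mask & 4) out_rows[found++] = i + 2;
        if (mask & 8) out_rows[found++] = i + 3;
    }

    /* Scalar tail for counts that are not a multiple of 4. */
    for (; i < count; ++i)
        if (input[i] > key)
            out_rows[found++] = i;

    return found;
}
```

Collecting indices first and adding the rows afterwards also keeps function calls out of the SSE loop, which sidesteps the register clobbering discussed above.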

There are a few more tweaks, but I will let you find them for yourself.

Jim Dempsey

 

 

 