Jozef,
Some comments on your code:
First,
movss xmm1, [edi] // xmm1 <- key
shufps xmm1, xmm1, 0 // broadcast
is not equivalent to
movaps xmm1, [edi] // xmm1 <- key
unless edi points to four identical single-precision FP values (I assume it does).
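To make the difference concrete, here is a minimal sketch in C with SSE intrinsics, assuming an x86 target with SSE. The function names load_broadcast, load_four and loads_agree are hypothetical and are not from your code. Note that the movaps form also requires the address to be 16-byte aligned.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* movss + shufps: read ONE float and copy it into all four lanes. */
__m128 load_broadcast(const float *p) {
    __m128 v = _mm_load_ss(p);                              /* movss  xmm1, [p] -> {p[0], 0, 0, 0} */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));   /* shufps xmm1, xmm1, 0 */
}

/* movaps: read FOUR consecutive floats; p must be 16-byte aligned. */
__m128 load_four(const float *p) {
    return _mm_load_ps(p);                                  /* movaps xmm1, [p] */
}

/* Returns 1 when both loads produce the same four lanes, 0 otherwise. */
int loads_agree(const float *p) {
    __m128 eq = _mm_cmpeq_ps(load_broadcast(p), load_four(p));
    return _mm_movemask_ps(eq) == 0xF;                      /* all four lanes compared equal */
}
```

With a 16-byte-aligned buffer holding {k, k, k, k} the two loads agree; as soon as any of the last three values differs from the first, they do not.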
Second, as per Lexi's suggestion, step through your code. You will most likely find that the code called by your sse_add_row is modifying XMM1 (it is the caller's responsibility to preserve/restore XMM registers). If you find this is the case, then insert
movaps xmm1, [edi] // xmm1 <- key
after each call to sse_add_row. That way the overhead occurs only when needed. (Remove the movaps that you had thought was unnecessary.)
Third, if your data is such that the majority of compares are "not found", then rearrange the code so that the NOT_FOUND_3 section immediately follows the first test, making the common case the fall-through path:
START_LOOP:
...
test edx,edx
jnz FOUND
NOT_FOUND_3:
add esi, 16
add ecx, 4
cmp ecx, ebx
jne START_LOOP
jmp DONE
FOUND:
...
DONE:
}
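The same control flow can be sketched in C with SSE intrinsics. This is only an illustration under assumptions: I am guessing that the elided parts of the loop compare four floats against the key and test the resulting mask, and find_key is a hypothetical name, not something from your code. The compiler decides the final branch layout, so the sketch shows the shape of the loop rather than guaranteed code placement.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: index of the first element equal to key, or -1.
   Assumes data is 16-byte aligned and n is a multiple of 4. */
long find_key(const float *data, size_t n, float key) {
    const __m128 k = _mm_set1_ps(key);              /* broadcast the key once, outside the loop */
    for (size_t i = 0; i < n; i += 4) {
        __m128 eq = _mm_cmpeq_ps(_mm_load_ps(data + i), k);
        int mask = _mm_movemask_ps(eq);             /* one bit per matching lane */
        if (mask == 0)
            continue;                               /* common case: not found, go to next 4 values */
        for (int lane = 0; lane < 4; ++lane)        /* rare case: work out which lane matched */
            if (mask & (1 << lane))
                return (long)(i + lane);
    }
    return -1;                                      /* key not present */
}
```

The "not found" path does nothing but advance and loop, which corresponds to placing NOT_FOUND_3 directly after the test so the frequent case needs no taken jump to reach it.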
There are a few more tweaks, but I will let you find them for yourself.
Jim Dempsey