
ICC 11.0.26 - why this code won't vectorize?

Last post 07-10-2008, 9:36 PM by Igor Levicki. 11 replies.
 07-02-2008, 9:25 AM 30258126  

I guess Dale will be interested to see this:


#include <stdlib.h>

void test(short *a, float *b, int n, int c1, int c2, float f1, float f2)
{
	#pragma vector aligned
	for (int i = 0; i < n; i++) {
		a[i] = __max(f1 * b[i * c1 + c2] + f2, 0.0f);
	}
}

This should be very easy to vectorize, but the compiler thinks it is not possible:


C:\test\icl11\non_vec\test.cpp(7): (col. 2) remark: loop was not vectorized: unsupported data type.

Changing the data type of a[] from short to int is not an option because it doubles the memory requirements and requires another conversion later, before saving the file.


Regards,
Igor Levicki
http://www.levicki.net/
 
 07-03-2008, 1:31 PM 30258240 in reply to 30258126  

Re: ICC 11.0.26 - why this code won't vectorize?

I'm always interested to see your posts, Igor :-)

Well, like all good things in life, it's complicated. A major issue here is that there currently isn't an easy way to convert floats to shorts with SSE instructions. Given the language requirements about overflow exceptions and the like, there's no good safe way to do it. Add in the complexity of the subscript, the max, and the changing element sizes, and it becomes fairly difficult for the compiler to do efficiently. It could be argued that we could do a somewhat better job in certain cases dealing with ints to shorts, but I'm not sure how much bang for the buck there is (even given the low value of a buck these days :-) for a case like the above. If you write the loop using SSE intrinsics, how much performance improvement do you get?

Dale

 
 07-03-2008, 2:56 PM 30258251 in reply to 30258240  

Re: ICC 11.0.26 - why this code won't vectorize?

Dale,

If I write a vectorized intrinsic version of this loop, I get the same performance with a single thread as with the threaded version of the above non-vectorized loop. So clearly there is a lot to be gained by vectorizing this.

Vectorized code is actually very simple, and I am surprised that there are any restrictions at all regarding overflow.

At any rate, float to int can also give bogus results in some border cases, but the compiler still vectorizes the code if a[] is changed to int.

What I am trying to say is that if I am using the short data type, it is because I am sure that there can't be overflow, and there should be a way to override the heuristics that prohibit the compiler from vectorizing the above loop.

The problem is that writing the above code with intrinsics makes it harder to thread, or at least less readable for maintenance.

As for the code, if you change a[] to int and compile with /QxSSE4.1 /c /FAs to see the assembler listing of the vectorized loop, all that you have to change to get the short version is to put PACKSSDW/MOVQ instead of MOVDQA in the loop after CVTTPS2DQ.
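
A minimal sketch of that CVTTPS2DQ + PACKSSDW + MOVQ sequence written with SSE2 intrinsics may make the point concrete. It is not the code from my application: it assumes unit stride (c1 == 1), uses unaligned loads, and the function names scale_to_short_sse2 and scale_to_short_scalar are made up for illustration. On targets without SSE2 it simply falls back to the scalar loop.

```cpp
#include <algorithm>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Scalar reference: same arithmetic as the original loop, with c1 assumed to be 1.
void scale_to_short_scalar(short* a, const float* b, int n, int c2, float f1, float f2)
{
    for (int i = 0; i < n; i++) {
        a[i] = static_cast<short>(std::max(f1 * b[i + c2] + f2, 0.0f));
    }
}

// Hypothetical hand-vectorized version: four floats in, four shorts out per iteration.
void scale_to_short_sse2(short* a, const float* b, int n, int c2, float f1, float f2)
{
    int i = 0;
#ifdef SKETCH_HAS_SSE2
    const __m128 vf1 = _mm_set1_ps(f1);
    const __m128 vf2 = _mm_set1_ps(f2);
    const __m128 zero = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(b + i + c2);       // unaligned load of 4 floats
        v = _mm_add_ps(_mm_mul_ps(vf1, v), vf2);   // f1 * b[...] + f2
        v = _mm_max_ps(v, zero);                   // clamp negatives to 0.0f
        __m128i d = _mm_cvttps_epi32(v);           // CVTTPS2DQ: truncate to 4 x int32
        __m128i s = _mm_packs_epi32(d, d);         // PACKSSDW: int32 -> int16, saturating
        _mm_storel_epi64(reinterpret_cast<__m128i*>(a + i), s); // MOVQ: store low 4 shorts
    }
#endif
    // Scalar tail (and the whole loop when SSE2 is not available).
    for (; i < n; i++) {
        a[i] = static_cast<short>(std::max(f1 * b[i + c2] + f2, 0.0f));
    }
}
```

Note that PACKSSDW saturates to the range -32768..32767, so for in-range data the result matches the scalar loop, and for out-of-range data it clamps instead of producing an unspecified value.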


Regards,
Igor Levicki
http://www.levicki.net/
 
 07-04-2008, 12:29 PM 30258307 in reply to 30258251  

Re: ICC 11.0.26 - why this code won't vectorize?

Hi Igor,

This time I won't argue with you Igor: I agree sincerely! According to Wirth's law "Software gets slower faster than hardware gets faster" :-). Thus, easier access to extremely advanced vector hardware may possibly lead to a cool and unexpected way to extend Moore's Law into the future! A magic compiler is what we need ;-). As far as I know there are many cases where either a better compiler or a better language would lead to increased performance of both existing and future hardware. Obviously threading a loop over 256 cores may be stupid when a single core can keep all 256 values in a single register, but according to Wirth's law that's exactly what most people will do...

Best regards and a good summer to you Igor!

Lars Petter

 

 

 
 07-09-2008, 4:02 PM 30258697 in reply to 30258240  

Re: ICC 11.0.26 - why this code won't vectorize?

So Dale, should I submit this to Premier Support as a feature request then or not?


Regards,
Igor Levicki
http://www.levicki.net/
 
 07-09-2008, 6:49 PM 30258704 in reply to 30258697  

Re: ICC 11.0.26 - why this code won't vectorize?

IgorLevicki:

So Dale, should I submit this to Premier Support as a feature request then or not?



It probably wouldn't hurt.

Dale
 
 07-10-2008, 1:02 AM 30258720 in reply to 30258697  

Re: ICC 11.0.26 - why this code won't vectorize?

Igor, I don't think your problem will be resolved after you submit it to Premier Support. The compiler can't do everything, even if we think it is so natural.

Almost a month ago I posted a thread in the Intel® AVX and CPU Instructions forum: "Anyone thinking of setting low-width integars with high-width integars? " I think your problem is the same as mine. In your case, the compiler can actually convert float to int, but it can't convert int to short automatically.

I haven't had a satisfying answer in my thread yet.

 
 07-10-2008, 5:09 AM 30258731 in reply to 30258720  

Re: ICC 11.0.26 - why this code won't vectorize?

kalven,

I disagree. First, submitting a request would make the engineers aware that we need support for such a feature. Second, I believe that the compiler should honor my request to convert to short (or even to byte if I say so) because, after all, I am the one who knows more about the data being processed, not the compiler. If I guarantee that there will be no overflow, then I certainly should have an option to use smaller types in auto-vectorizable constructs.


Regards,
Igor Levicki
http://www.levicki.net/
 
 07-10-2008, 7:37 AM 30258744 in reply to 30258720  

Re: ICC 11.0.26 - why this code won't vectorize?

I agree with Igor.

If you as a programmer write

   short aShort;
   int aInt = 123;
   ...
   aShort = aInt;

the compiler should at most warn you of a possible loss of precision (or not warn if an appropriate #pragma is in use), but the compiler should generate the code as specified (or at least to the best of its capabilities).
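
A minimal sketch of the same idea with the narrowing spelled out, assuming the values are known to fit in a short. The names narrow_to_short and narrow_array are hypothetical; the explicit cast is the usual way to tell the compiler that the truncation is intended, and it typically silences the loss-of-precision warning.

```cpp
// Explicit narrowing: the cast states that the programmer accepts the truncation.
inline short narrow_to_short(int value)
{
    return static_cast<short>(value);
}

// Element-wise int -> short copy; the caller guarantees every value fits in a short.
void narrow_array(short* dst, const int* src, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i] = narrow_to_short(src[i]);
    }
}
```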

Think of low-width -> high-width as a scatter operation, and high-width -> low-width as a gather operation

While you are fixing this instruction you might consider adding high-width-high -> low-width as well as low-width -> high-width-high.

This could be done as an extension of shuffle whereby the source or destination is double width: bytes/words, words/dwords, dwords/qwords, qwords/dqwords.

Jim Dempsey

 
 07-10-2008, 8:23 AM 30258748 in reply to 30258744  

Re: ICC 11.0.26 - why this code won't vectorize?

Thanks everyone for the support. I have submitted a feature request to Premier Support.


Regards,
Igor Levicki
http://www.levicki.net/
 
 07-10-2008, 8:39 PM 30258804 in reply to 30258731  

Re: ICC 11.0.26 - why this code won't vectorize?

Honestly, I even disagree with myself too. But I have to accept reality. And the reality is that Intel says "you should write compiler-friendly code"; we don't get to say to Intel "you should write a compiler friendly to my code". We will be blessed if one day compilers honor our code a little. And I think that day will come.

I have used the Intel compiler for about two months, and the most common reason for a loop not being vectorized is "unsupported data type". I was angry at first, because I learned that I have to be familiar with the data types required by all the SIMD instructions in order to write compiler-friendly and efficient code.

If you get any news from Premier Support on this issue, I would very much appreciate it if you posted it here.

PS:
And now Intel packages IPP together with the compiler for sale in China. IPP is just useless to me, as I have to write IPP-friendly code, not just compiler-friendly code.

 
 07-10-2008, 9:36 PM 30258812 in reply to 30258804  

Re: ICC 11.0.26 - why this code won't vectorize?

kalven,

The issue has been reproduced and escalated. I submitted it as a feature request but it turns out it may be a bug. We'll see.

As for the unsupported data type, you seem to be misunderstanding the issue: vector int to short conversion is supported in the x86 instruction set (PACKSSDW), just not in the compiler's code generator.

By the way, no language or compiler rule can justify using 2.44 MB of RAM, thus thrashing almost the whole Penryn L2 cache, when 1.22 MB is enough to hold the data. For this particular case I know the data range is between 0 and 32767, and I am using short on purpose.
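
The arithmetic behind those two figures is just element count times element size. A small sketch, where the element count of 640,000 is a hypothetical value chosen only because it lands near the numbers above (it is not the actual size of my data set), and footprint_mib is a made-up helper:

```cpp
#include <cstddef>

// Footprint in MiB of `count` elements of type T.
template <typename T>
constexpr double footprint_mib(std::size_t count)
{
    return static_cast<double>(count * sizeof(T)) / (1024.0 * 1024.0);
}

// Hypothetical element count, used only to illustrate the 2x difference.
constexpr std::size_t kElementCount = 640000;
```

With a 2-byte short and a 4-byte int, footprint_mib<short>(kElementCount) is about 1.22 and footprint_mib<int>(kElementCount) is about 2.44.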

In my opinion, the compiler should do what it usually does when type conversions are involved: issue a warning about possible loss of precision and vectorize the loop. If the vectorization is not desirable, it can always be suppressed with #pragma novector.


Regards,
Igor Levicki
http://www.levicki.net/
 