Intel® Software Network Knowledge Base Wiki


Constructing Nav Tree
One Moment...

(refresh menu)



 
Welcome, Guest | Quick Login | Register

Develop for Core processor


Optimize the AES Algorithm for the Intel® Pentium® 4 Processor

Version 3, Changed by LINDA SWINK on 3/21/2008
Created by: KYLEX.S.LEWIS@INTEL.COM

Challenge

Optimize performance of the Advanced Encryption Standard (AES) algorithm for the Pentium® 4 processor. The Data Encryption Standard (DES) had served as an important cryptographic algorithm for over two decades. However, the growth of computing power during that time had compromised the security of that algorithm. There was considerable evidence that it was time to replace DES with a new standard. In October 2000, the National Institute of Standards and Technology (NIST) selected the Rijndael algorithm as the new encryption standard to replace the current DES algorithm. For further information on the selection process and AES itself see http://csrc.nist.gov/CryptoToolkit/aes/*.

Antoon Bosselaers and Vincent Rijmen, who proposed Rijndael as the AES, wrote the following implementation of the Rijndael algorithm in the C programming language:



1 int rijndaelEncrypt (word8 a[16], word8 b[16], word8 rk[MAXROUNDS+1][4][4])
2 {
3
4 int r;
5 word8 temp[4][4];
6
7 *((w32*)temp[0]) = *((w32*)a) ^ *((w32*)rk[0][0]);
8 *((w32*)temp[1]) = *((w32*)(a+4)) ^ *((w32*)rk[0][1]);
9 *((w32*)temp[2]) = *((w32*)(a+8)) ^ *((w32*)rk[0][2]);
10 *((w32*)temp[3]) = *((w32*)(a+12)) ^ *((w32*)rk[0][3]);
11
12 *((w32*)b)= *((w32*)T1[temp[0][0]]) ^ *((w32*)T2[temp[1][1]])
13 ^ *((w32*)T3[temp[2][2]])
14 ^ *((w32*)T4[temp[3][3]]);
15
16 *((w32*)(b+4))= *((w32*)T1[temp[1][0]]) ^ *((w32*)T2[temp[2][1]])
17 ^ *((w32*)T3[temp[3][2]])
18 ^ *((w32*)T4[temp[0][3]]);
19
20 *((w32*)(b+8))= *((w32*)T1[temp[2][0]]) ^ *((w32*)T2[temp[3][1]])
21 ^ *((w32*)T3[temp[0][2]])
22 ^ *((w32*)T4[temp[1][3]]);
23
24 *((w32*)(b+12))=*((w32*)T1[temp[3][0]]) ^ *((w32*)T2[temp[0][1]])
25 ^ *((w32*)T3[temp[1][2]])
26 ^ *((w32*)T4[temp[2][3]]);
27
28 for(r = 1; r < ROUNDS-1; r++) {
29 *((w32*)temp[0]) = *((w32*)b) ^ *((w32*)rk[r][0]);
30 *((w32*)temp[1]) = *((w32*)(b+4)) ^ *((w32*)rk[r][1]);
31 *((w32*)temp[2]) = *((w32*)(b+8)) ^ *((w32*)rk[r][2]);
32 *((w32*)temp[3]) = *((w32*)(b+12)) ^ *((w32*)rk[r][3]);
33
34 *((w32*)b)= *((w32*)T1[temp[0][0]]) ^ *((w32*)T2[temp[1][1]])
35 ^ *((w32*)T3[temp[2][2]])
36 ^ *((w32*)T4[temp[3][3]]);
37
38 *((w32*)(b+4)) = *((w32*)T1[temp[1][0]]) ^ *((w32*)T2[temp[2][1]])
39 ^ *((w32*)T3[temp[3][2]])
40 ^ *((w32*)T4[temp[0][3]]);
41
42 *((w32*)(b+8)) = *((w32*)T1[temp[2][0]]) ^ *((w32*)T2[temp[3][1]])
43 ^ *((w32*)T3[temp[0][2]])
44 ^ *((w32*)T4[temp[1][3]]);
45
46 *((w32*)(b+12))= *((w32*)T1[temp[3][0]]) ^ *((w32*)T2[temp[0][1]])
47 ^ *((w32*)T3[temp[1][2]])
48 ^ *((w32*)T4[temp[2][3]]);
49 }
50 }



The array a contains 16 bytes of input text, the array b will contain the output and the array rk contains 16 bytes of key for each round of encryption. For 128-bit Rijndael encryption, the value of MAXROUNDS and ROUNDS will be 14 and 10, respectively. The array temp is the state table containing indexes into four fixed-sized, constant tables T1, T2, T3 and T4. Lines 7-10 calculate the initial value of the state table. Lines 12-26 perform the table lookups and the XOR operations to calculate the encrypted text for the first iteration, which also becomes the input to the next iteration. The state table for the next iteration is calculated in lines 29-32, which is used in lines 34-49 to calculate the encrypted text for that round.

Solution

Rearrange the code and hide dependencies. We will start optimizing the original C code shown in the Challenge section by rearranging the code. We will move lines 12-15 inside the for loop as shown below. This reduces the number of assignment statements, avoiding assignment to the temp array through the b array. Instead, we use the temp array to communicate between iterations of the loop.

Since the Pentium 4 processor can execute several instructions in parallel and out of the order they appear in the code, it is often useful to rearrange the code to let the processor start a long computation so that the result will be ready for later calculations that depend on it. This is often referred to as 'hiding the latency' of an instruction. The first four bytes of the encrypted text are available after the calculation in line 12. In order to hide the latency of instructions, we can start the operations of line 29 at line 12. In this way, we can avoid building a dependency chain that can interfere with the out-of-order and parallel execution in the processor. The modified code below includes changes to start computing the state table as soon as possible. It also partially unrolls the for loop:



1 int rijndaelEncrypt (word8 a[16], word8 b[16], word8 rk[MAXROUNDS+1][4][4])
2 {
3 register int r;
4 word8 temp0[4][4], temp1[4][4];
5
6 *((w32*)temp0[0]) = *((w32*)a) ^ *((w32*)rk[0][0]);
7 *((w32*)temp0[1]) = *((w32*)(a+4)) ^ *((w32*)rk[0][1]);
8 *((w32*)temp0[2]) = *((w32*)(a+8)) ^ *((w32*)rk[0][2]);
9 *((w32*)temp0[3]) = *((w32*)(a+12)) ^ *((w32*)rk[0][3]);
10
11
12 for (r = 1; r< (ROUNDS-1); r= r+2) {
13 *((w32*)temp1[0]) = *((w32*)rk[r][0])
14 ^ *((w32*)T1[temp0[0][0]])
15 ^ *((w32*)T2[temp0[1][1]])
16 ^ *((w32*)T3[temp0[2][2]])
17 ^ *((w32*)T4[temp0[3][3]]);
18
19 *((w32*)temp1[1]) = *((w32*)rk[r][1])
20 ^ *((w32*)T1[temp0[1][0]])
21 ^ *((w32*)T2[temp0[2][1]])
22 ^ *((w32*)T3[temp0[3][2]])
23 ^ *((w32*)T4[temp0[0][3]]);
24
25 *((w32*)temp1[2]) = *((w32*)rk[r][2])
26 ^ *((w32*)T1[temp0[2][0]])
27 ^ *((w32*)T2[temp0[3][1]])
28 ^ *((w32*)T3[temp0[0][2]])
29 ^ *((w32*)T4[temp0[1][3]]);
30
31 *((w32*)temp1[3]) = *((w32*)rk[r][3])
32 ^ *((w32*)T1[temp0[3][0]])
33 ^ *((w32*)T2[temp0[0][1]])
34 ^ *((w32*)T3[temp0[1][2]])
35 ^ *((w32*)T4[temp0[2][3]]);
36
37 *((w32*)temp0[0]) = *((w32*)rk[r+1][0])
38 ^ *((w32*)T1[temp1[0][0]])
39 ^ *((w32*)T2[temp1[1][1]])
40 ^ *((w32*)T3[temp1[2][2]])
41 ^ *((w32*)T4[temp1[3][3]]);
42
43 *((w32*)temp0[1]) = *((w32*)rk[r+1][1])
44 ^ *((w32*)T1[temp1[1][0]])
45 ^ *((w32*)T2[temp1[2][1]])
46 ^ *((w32*)T3[temp1[3][2]])
47
48 ^ *((w32*)T4[temp1[0][3]]);
49 *((w32*)temp0[2]) = *((w32*)rk[r+1][2])
50 ^ *((w32*)T1[temp1[2][0]])
51 ^ *((w32*)T2[temp1[3][1]])
52 ^ *((w32*)T3[temp1[0][2]])
53 ^ *((w32*)T4[temp1[1][3]]);
54
55 *((w32*)temp0[3]) = *((w32*)rk[r+1][3])
56 ^ *((w32*)T1[temp1[3][0]])
57 ^ *((w32*)T2[temp1[0][1]])
58 ^ *((w32*)T3[temp1[1][2]])
59 ^ *((w32*)T4[temp1[2][3]]);
60 }
61 //Code for Last round -1
62 //Code for Last round

Source

Optimizing Performance of the AES Algorithm for the Intel® Pentium® 4 Processor



Served
23 Knowledge Bases
604 Pages
Search
Powering Up Search...


Vote on this Page

Tags For This Page
Loading Tags..

Tag This



Additional legal information