





Sidebar: Matrix SIMD in Intel Chips Intel has announced AMX – the Advanced Matrix Extensions. It looks like this will multiply 16x16 matrices of data types fp16, int16, and int8. AMX will be appearing starting with the 4th Generation Xeon Scalable Processors. This is being billed as an "Al Acceleration Engine". I suspect this is much like the Tensor Cores on Nvidia GPUs.







Requirements for a For-Loop to be SIMD'd

• If there are nested loops, the one to vectorize must be the inner one.

• There can be no jumps or branches. "Masked assignments" (an if-statement-controlled assignment) are OK, e.g.,

if( A[ i ] > 0. )

B[ i ] = 1.;

• The total number of iterations must be known at runtime when the loop starts

• There can be no inter-loop data dependencies such as:

a[ i ] = a[ i-1 ] + 1.;

1010 element a[100] = a[99] + 1.; // this crosses an SSE boundary, so it is ok
a[101] = a[100] + 1.; // this is within one SSE operation, so it is not OK

• It helps performance if the elements have contiguous memory addresses.

/

This all sounds great!
What is the catch?

The catch is that compilers haven't caught up to producing really efficient SIMD code. So, while there are great ways to express the desire for SIMD in code, you won't get the full potential speedup ... yet.

One way to get a better speedup is to use assembly language.
Don't worry – you wouldn't need to write it.

Here are two assembly functions:

1. SimdMul: C[0:len] = A[0:len] \* B[0:len]

2. SimdMulSum: return ( \( \sum\_{A} \) [0:len] \* B[0:len] )

Warning – due to the nature of how different compilers and systems handle local variables, these two functions only work on flip and rabbit using gcc/g++, without any optimization!!!

9

Array\*Array Multiplication Speed

| Social Speed | Array\*Array Multiplication Speed |

12

10

8

```
Avoiding Assembly Language: SIMD using the OpenMP SIMD Pragma 13
  Array * Array
    SimdMul( float *a float *b) float *c, int len)
              #pragma omp simd
              for( int i= 0; i < len; i++ )
c[i] = a[i] * b[i];
  Array * Scalar
   void
SimdMul(float *a, float b) float *c, int len )
              #pragma omp simd
for( int i = 0; i < len; i++ )
c[ i ] = a[ i ] * b;
```

```
Avoiding Assembly Language: SIMD using the OpenMP SIMD Pragma 14
                    #pragma omp simd
for( int i = 0; i < ArraySize; i++ )</pre>
                             c[i] = a[i] * b[i];
                                                                  #pragma omp simd
```

14

16

13

```
Avoiding Assembly Language: the Intel Intrinsics
Intel has a mechanism to get at the SSE SIMD without resorting to assembly language. These are called Intrinsics.
        Intrinsic
                                         Meaning
  m128
                         Declaration for a 128 bit 4-float word
                          Load a __m128 word from memory
 _mm_loadu_ps
 _mm_storeu_ps
                          Store a __m128 word into memory
 _mm_mul_ps
                          Multiply two __m128 words
  _mm_add_ps
                          Add two __m128 words
```

SimdMul using Intel Intrinsics void SimdMul(float\*a, float\*b, float\*c, int len) int limit = ( len/SSE\_WIDTH ) \* SSE\_WIDTH; register float \*pa = a; register float \*pb = b; register float \*pc = c; for( int i = 0; i < limit; i += SSE\_WIDTH) \_mm\_storeu\_ps(pc, \_mm\_mul\_ps(\_mm\_loadu\_ps(pa ), \_mm\_loadu\_ps(pb ) ) ); pa += SSE WIDTH; pb += SSE WIDTH; pc += SSE\_WIDTH; for( int i = limit; i < len; i++ ) c[i] = a[i] \* b[i];U Oregon State University Computer Graphics

15

SimdMulSum using Intel Intrinsics float
SimdMulSum(float \*a, float \*b, int len )  $\begin{aligned} & \text{float sum}[4] = \{\,0.,\,0.,\,0.,\,0.\,\}; \\ & \text{int limit} = (\,\text{len/SSE\_WIDTH}\,) \,\,\text{* SSE\_WIDTH}; \\ & \text{register float *pa} = a; \\ & \text{register float *pb} = b; \end{aligned}$ \_\_m128 ss = \_mm\_loadu\_ps(&sum[0]); for(int i = 0; i < limit; i += SSE\_WIDTH)  $ss = _mm\_add\_ps(ss, \_mm\_mul\_ps(\_mm\_loadu\_ps(pa), \_mm\_loadu\_ps(pb))); \\ pa += SSE\_WIDTH; \\ pb += SSE\_WIDTH;$ mm\_storeu\_ps( &sum[0], ss ); for(int i = limit; i < len; i++) sum[0] += a[ i ] \* b[ i ]; return sum[0] + sum[1] + sum[2] + sum[3]; 17

Intel Intrinsics Intrinsics for SIMD C[i] = A[i]\*B[i] sum = sum + A[i]\*B[i] SpeedUp SIMD Array Size 18

#define NUM\_ELEMENTS\_PER\_CORE (ARRAYSIZE / NUMT)

...

omp\_set\_num\_threads( NUMT );
double maxMegaMultsPerSecond = 0,;
double time0 = omp\_get\_wtime();
#pragma omp parallel

{
 int thisThread = omp\_get\_thread\_num();
 int first = thisThread = NUM\_ELEMENTS\_PER\_CORE;
 SimdMul( &A[first], &B[first], &C[first], NUM\_ELEMENTS\_PER\_CORE);
}
double time1 = omp\_get\_wtime();
double time1 = omp\_get\_wtime();
computer Graphics

organistate
University
Computer Graphics

Notes:

Remember that #pragma omp parallel creates a thread team and that all threads execute everything in the curly braces.

The variable thisThread is the thread number of the thread who is executing this code right now. There will eventually be NUMT threads who get to execute this code. Thus, all the instances of thisThread will be between 0 and NUMT-1.

The variable first is the first array element number that thisThread will execute.

Starting the SIMD multiplications at &A[first], &B[first], &C[first] gives each thread its very own set of contiguous array elements to work on. The SimdMul function depends on this.

19

Combining SIMD with Multicore

Speedup for Multicore, SIMD, and Multicore+SIMD

16x

16x

4 cores alone,
1 core alone
1 x

Array Size

Array Size

- "cores alone" = a for-loop with mo multicore or SIMD.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores signe" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.
- "cores alone" = a for-loop with morphallel for.

20

Prefetching 22

Prefetching is used to place a cache line in memory before it is to be used, thus hiding the latency of fetching from off-chip memory.

There are two key issues here:

1. Issuing the prefetch at the right time
2. Issuing the prefetch at the right distance

The right time:

If the prefetch is issued too late, then the memory values won't be back when the program wants to use them, and the processor has to wait anyway.

If the prefetch is issued too early, then there is a chance that the prefetched values could be evicted from cache by another need before they can be used.

The right distance:

The "prefetch distance" is how far ahead the prefetch memory is than the memory we are using right now.

Too far, and the values sit in cache for too long, and possibly get evicted.

21

23

The Effects of Prefetching on SIMD Computations

Array Multiplication
Length of Arrays (NUM): 1,000,000
Length per SIMD call (ONETIME): 256

for( int i = 0; i < NUM; i += ONETIME )
{
 \_\_builtin\_prefetch ( &A[i+PD], WILL\_READ\_ONLY, LOCALITY\_LOW );
 \_\_builtin\_prefetch ( &C[i+PD], WILL\_READ\_ONLY, LOCALITY\_LOW );
 \_\_builtin\_prefetch ( &C[i+PD], WILL\_READ\_AND\_WRITE, LOCALITY\_LOW );
 SimdMul(A, B, C, ONETIME );
}

22

The Effects of Prefetching on SIMD Computations

24

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

500.0

24

