







### **Storage Level Characteristics**


|                          | L1       | L2        | L3         | Memory      | Disk      |
|--------------------------|----------|-----------|------------|-------------|-----------|
| Type of Storage          | On-chip  | On-chip   | On-chip    | Off-chip    | Disk      |
| Typical Size             | 100 KB   | 8 MB      | 32 MB      | 32 GB       | Many GBs  |
| Typical Access Time (ns) | 0.25     | 0.50      | 10.8       | 50          | 5,000,000 |
| Scaled Access Time       | 1 second | 2 seconds | 43 seconds | 3.3 minutes | 231 days  |
| Managed by               | Hardware | Hardware  | Hardware   | OS          | OS        |

Adapted from: John Hennessy and David Patterson, *Computer Architecture: A Quantitative Approach*, Morgan-Kaufmann, 2007. (4<sup>th</sup> Edition)
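The "Scaled Access Time" row is just the "Typical Access Time" row stretched so that one L1 access takes 1 second. Here is a minimal sketch of that arithmetic. The function name `ScaledSeconds` and the constant name are made up for illustration; the 0.25 ns baseline comes from the table.

```c
#include <stdio.h>

#define L1_ACCESS_NS 0.25   // baseline from the table: L1 access time in nanoseconds

// Scale an access time so that an L1 access (0.25 ns) counts as 1 second.
// ScaledSeconds is a hypothetical helper name, not something from the slides.
double ScaledSeconds( double access_ns )
{
    return access_ns / L1_ACCESS_NS;
}

// Print one table entry in seconds, minutes, and days so it can be compared with the table.
void PrintScaled( const char *level, double access_ns )
{
    double secs = ScaledSeconds( access_ns );
    printf( "%-6s %12.1f seconds = %10.2f minutes = %8.2f days\n",
            level, secs, secs / 60., secs / 86400. );
}
```

Calling `PrintScaled( "Disk", 5000000. )` from your own `main` shows why a trip to disk feels like most of a year at this scale.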

Usually there are two L1 caches: one for instructions and one for data. You will often see this referred to in data sheets as "L1 cache: 32KB + 32KB" or "I and D cache".



mjb - March 9, 2023


### **Cache Hits and Misses**

When the CPU asks for a value from memory, and that value is already in the cache, it can get it quickly.

This is called a **cache hit**.

When the CPU asks for a value from memory, and that value is not already in the cache, it will have to go off the chip to get it.

This is called a **cache miss**.

While a cache might hold multiple kilobytes or megabytes, the bytes are transferred in much smaller quantities, each called a **cache line**. The size of a cache line is typically just **64 bytes**.


Performance programming should strive to avoid as many cache misses as possible. That's why it is very helpful to know the cache structure of your CPU.

Oregon State
University
Computer Graphics






# **Demonstrating the Cache-Miss Problem – Across Rows**

```
#define NUM 10000

float Array[NUM][NUM];
double MyTimer( );

int
main( int argc, char *argv[ ] )
{
    float sum = 0.;
    double start = MyTimer( );
    for( int i = 0; i < NUM; i++ )
    {
        for( int j = 0; j < NUM; j++ )
        {
            sum += Array[ i ][ j ];    // access across a row
        }
    }
    double finish = MyTimer( );
    double row_secs = finish - start;
    return 0;
}
```

# **Demonstrating the Cache-Miss Problem – Down Columns**

```
float sum = 0.;
double start = MyTimer();
for( int i = 0; i < NUM; i++ )
{
    for( int j = 0; j < NUM; j++ )
    {
        sum += Array[j][i];  // access down a column
    }
}
double finish = MyTimer();
double col_secs = finish - start;
```
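To try the two traversals side by side, here is a minimal sketch that wraps each loop in a function and times it. It is not the slides' program: `Grid`, `SIZE`, `FillGrid`, `SumAcrossRows`, `SumDownColumns`, and `TimeTraversal` are names made up for this example, the array is smaller than the 10000 x 10000 one above, and the standard `clock( )` function stands in for `MyTimer( )`, whose definition is not shown.

```c
#include <stdio.h>
#include <time.h>

#define SIZE 2048                 // assumed size: 2048 x 2048 floats = 16 MB

static float Grid[SIZE][SIZE];    // hypothetical stand-in for Array[NUM][NUM]

// Fill every element with the same value so both traversals must give the same sum.
void FillGrid( float value )
{
    for( int i = 0; i < SIZE; i++ )
        for( int j = 0; j < SIZE; j++ )
            Grid[ i ][ j ] = value;
}

// Inner loop walks along a row: consecutive accesses are 4 bytes apart,
// so 16 floats are served by each 64-byte cache line.
float SumAcrossRows( void )
{
    float sum = 0.;
    for( int i = 0; i < SIZE; i++ )
        for( int j = 0; j < SIZE; j++ )
            sum += Grid[ i ][ j ];
    return sum;
}

// Inner loop walks down a column: consecutive accesses are SIZE*4 bytes apart,
// so each access lands in a different cache line.
float SumDownColumns( void )
{
    float sum = 0.;
    for( int i = 0; i < SIZE; i++ )
        for( int j = 0; j < SIZE; j++ )
            sum += Grid[ j ][ i ];
    return sum;
}

// Time one traversal with clock( ) and print what it found.
double TimeTraversal( const char *label, float (*traverse)( void ) )
{
    clock_t start = clock( );
    float sum = traverse( );
    double secs = (double)( clock( ) - start ) / CLOCKS_PER_SEC;
    printf( "%-8s sum = %.0f, seconds = %.4f\n", label, sum, secs );
    return secs;
}
```

From your own `main`, call `FillGrid( 1. )` and then `TimeTraversal( "rows", SumAcrossRows )` and `TimeTraversal( "columns", SumDownColumns )`. The column version usually comes out slower, but how much slower depends on your CPU's cache sizes and your compiler flags.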








# **Computer Graphics is often a Good Use for Array-of-Structures**

In memory, the coordinates are interleaved: X0 Y0 Z0 X1 Y1 Z1 X2 Y2 Z2 X3 Y3 Z3

```
struct xyz
{
    float x, y, z;
} Array[N];

glBegin( GL_LINE_STRIP );
for( int i = 0; i < N; i++ )
{
    glVertex3f( Array[ i ].x, Array[ i ].y, Array[ i ].z );
}
glEnd( );
```

# **A Good Use for Structure-of-Arrays**

In memory, each array is contiguous: X0 X1 X2 X3 . . . Y0 Y1 Y2 Y3 . . . Z0 Z1 Z2 Z3 . . .

```
float X[N], Y[N], Z[N];
float Dx[N], Dy[N], Dz[N];

for( int i = 0; i < N; i++ )
{
    Dx[i] = X[i] - Xnow;
    Dy[i] = Y[i] - Ynow;
    Dz[i] = Z[i] - Znow;
}
```
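One way to see the difference between the two layouts is to measure how far apart consecutive x values sit in memory. This is only a sketch: the names `AosPoints`, `SoaX`, `AosStrideBytes`, and `SoaStrideBytes` are made up for illustration, and `N` is set to 4 to match the figures.

```c
#include <stddef.h>

#define N 4

// Array-of-Structures: each element holds x, y, z together.
struct xyz
{
    float x, y, z;
};

struct xyz AosPoints[N];          // hypothetical name for the AoS array
float SoaX[N], SoaY[N], SoaZ[N];  // hypothetical names for the SoA arrays

// Distance in bytes from one x value to the next in the Array-of-Structures layout.
size_t AosStrideBytes( void )
{
    return (size_t)( (char *)&AosPoints[1].x - (char *)&AosPoints[0].x );
}

// Distance in bytes from one x value to the next in the Structure-of-Arrays layout.
size_t SoaStrideBytes( void )
{
    return (size_t)( (char *)&SoaX[1] - (char *)&SoaX[0] );
}
```

With 4-byte floats, the AoS stride is the size of the whole structure while the SoA stride is a single float. So a loop that touches only x values uses every byte of each cache line in the SoA layout, while a loop that needs x, y, and z together (like the `glVertex3f` call) finds all three side by side in the AoS layout.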

























# How Different Cores' Cache Lines Keep Track of Each Other


Each core has its own separate L2 cache, but a write by one can impact the state of the others.

For example, if one core writes a value into one of its own cache lines, any other core using a copy of that same cache line can no longer count on its values being up-to-date. In order to regain that confidence, the core that wrote must flush that cache line back to memory and the other core must then reload its copy of that cache line.

To maintain this organization, each cache line in a core's L2 cache is tagged with one of 4 states (MESI):

1. Modified
2. Exclusive
3. Shared
4. Invalid




## A Simplified View of How MESI Works


1. Core A reads a value. Those values are brought into its cache. That cache line is now tagged Exclusive.
2. Core B reads a value from the same area of memory. Those values are brought into its cache, and now both cache lines are re-tagged Shared.
3. Core B writes into that value. Its cache line is re-tagged Modified and Core A's cache line is re-tagged Invalid.
4. Core A tries to read a value from that same part of memory. But its cache line is tagged Invalid. So, Core B's cache line is flushed back to memory and then Core A's cache line is reloaded from memory. Both cache lines are now tagged Shared.

| Step | Cache Line A | Cache Line B |
|------|--------------|--------------|
| 1    | Exclusive    |              |
| 2    | Shared       | Shared       |
| 3    | Invalid      | Modified     |
| 4    | Shared       | Shared       |



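This flush-and-reload cycle is a huge performance hit. When the two cores are actually working on different values that just happen to sit in the same cache line, it is referred to as False Sharing.

Note that False Sharing doesn't create incorrect results - it just creates a performance hit. If anything, False Sharing prevents getting incorrect results.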
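The four MESI steps above can be traced with a tiny state machine. This is only a sketch of the simplified two-core view on this slide, not a model of real hardware. The names `MesiState`, `LinePair`, `CoreReads`, and `CoreWrites` are made up for illustration, and a core that has not loaded the line yet is assumed to start out Invalid.

```c
enum MesiState { INVALID, SHARED, EXCLUSIVE, MODIFIED };

// The tag on the same cache line as seen by two cores:
// state[0] is Core A's copy, state[1] is Core B's copy.
struct LinePair
{
    enum MesiState state[2];
};

// A core reads a value that lives in this cache line.
void CoreReads( struct LinePair *line, int core )
{
    int other = 1 - core;

    if( line->state[core] != INVALID )
        return;                             // already holds a valid copy: cache hit

    if( line->state[other] == INVALID )
    {
        line->state[core] = EXCLUSIVE;      // nobody else has it: load and own it
    }
    else
    {
        // The other core holds the line. If its copy is Modified, this is where
        // it would be flushed to memory before this core reloads it.
        line->state[other] = SHARED;
        line->state[core]  = SHARED;
    }
}

// A core writes a value that lives in this cache line.
void CoreWrites( struct LinePair *line, int core )
{
    line->state[core]     = MODIFIED;       // this copy is now the only good one
    line->state[1 - core] = INVALID;        // the other core's copy is stale
}
```

Starting from both copies Invalid, the calls `CoreReads( &line, 0 )`, `CoreReads( &line, 1 )`, `CoreWrites( &line, 1 )`, `CoreReads( &line, 0 )` walk through steps 1 to 4. If the two cores keep alternating writes and reads, steps 3 and 4 repeat, and each repeat costs a flush and a reload.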


























































### malloc'ing on a cache line


What if you are malloc'ing, and want to be sure your data structure starts on a cache line boundary?

Knowing that cache lines start on fixed 64-byte boundaries lets you do this. Consider a memory address. The top N-6 bits tell you what cache line number this address is a part of. The bottom 6 bits tell you what offset that address has within that cache line. So, for example, on a 32-bit memory system:

| 32 - 6 = 26 bits  | 6 bits: 0-63              |
|-------------------|---------------------------|
| Cache line number | Offset in that cache line |

So, if you see a memory address whose bottom 6 bits are 000000, then you know that that memory location begins on a cache line boundary.
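Splitting an address this way takes one shift and one mask. Here is a minimal sketch, assuming 64-byte cache lines as above. The function names are made up for illustration, and `uintptr_t` is used so the same code works for 32-bit and 64-bit addresses.

```c
#include <stdint.h>

#define OFFSET_BITS 6       // 2^6 = 64 bytes per cache line
#define OFFSET_MASK 0x3f    // bottom 6 bits are all 1's

// The top N-6 bits: which cache line this address is a part of.
uintptr_t CacheLineNumber( uintptr_t address )
{
    return address >> OFFSET_BITS;
}

// The bottom 6 bits: the offset (0-63) within that cache line.
unsigned int CacheLineOffset( uintptr_t address )
{
    return (unsigned int)( address & OFFSET_MASK );
}

// Non-zero if the address begins on a cache line boundary.
int StartsOnCacheLine( uintptr_t address )
{
    return CacheLineOffset( address ) == 0;
}
```

To test a real pointer `ptr`, cast it first: `StartsOnCacheLine( (uintptr_t)ptr )`.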


### malloc'ing on a cache line


Let's say that you have a structure and you want to malloc an ARRAYSIZE array of them. Normally, you would do this:

```
struct xyzw *p = (struct xyzw *) malloc( (ARRAYSIZE)*sizeof(struct xyzw) );
struct xyzw *Array = &p[0];
. . .
Array[ i ].x = 10. ;
```

If you wanted to make sure that array of structures started on a cache line boundary, you would do this:

```
unsigned char *p = (unsigned char *) malloc( 64 + (ARRAYSIZE)*sizeof(struct xyzw) );
int offset = (long int)p & 0x3f;    // 0x3f = bottom 6 bits are all 1's
struct xyzw *Array = (struct xyzw *) &p[64-offset];
. . .
Array[ i ].x = 10. ;
```
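One way to keep the original pointer and the aligned pointer from getting separated is to carry them around together. Here is a minimal sketch of the same trick wrapped in two helper functions. The `struct xyzw` layout, the `AlignedXyzw` wrapper, and the function names are all assumptions made for this example.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64

struct xyzw                 // assumed layout: the slides do not show its members
{
    float x, y, z, w;
};

// Keeps the pointer malloc returned next to the aligned pointer you actually use.
struct AlignedXyzw
{
    unsigned char *raw;     // what malloc returned: this is what must be freed
    struct xyzw   *array;   // first cache line boundary inside the raw block
};

// Allocate count structures starting on a cache line boundary.
// On failure, both pointers in the returned wrapper are NULL.
struct AlignedXyzw AllocAlignedXyzw( size_t count )
{
    struct AlignedXyzw a = { NULL, NULL };
    a.raw = (unsigned char *) malloc( CACHE_LINE_SIZE + count * sizeof( struct xyzw ) );
    if( a.raw != NULL )
    {
        // bottom 6 bits of the address = offset within its cache line
        size_t offset = (size_t)( (uintptr_t)a.raw & ( CACHE_LINE_SIZE - 1 ) );
        a.array = (struct xyzw *) &a.raw[ CACHE_LINE_SIZE - offset ];
    }
    return a;
}

// Free the block through the original pointer, never through the aligned one.
void FreeAlignedXyzw( struct AlignedXyzw *a )
{
    free( a->raw );
    a->raw   = NULL;
    a->array = NULL;
}
```

After `struct AlignedXyzw a = AllocAlignedXyzw( ARRAYSIZE );` you would use `a.array[ i ].x` just like `Array[ i ].x` above. If your compiler supports C11, the standard `aligned_alloc( )` function is another way to get an aligned block.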

Remember that when you want to free this malloc'ed space, be sure to say:

free( p );

not:

free( Array );




### Now, Consider This Type of Computation




Should you allocate the data as one large global-memory block (i.e., shared)? Or, should you allocate it as separate blocks, each local to its own core (i.e., private)?

Does it matter? Yes!

If you allocate the data as one large global-memory block, there is a risk that you will get False Sharing at the individual-block boundaries. Solution: make sure that each individual block starts and ends on a cache line boundary, even if you have to pad it. (**Fix #1!**)

If you allocate the data as separate blocks, then you don't have to worry about False Sharing (**Fix #2!**), but you do have to worry about the logic of your program remembering where to find each Node #i-1 and Node #i+1.
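For Fix #1, the padding amounts to rounding each block's size up to a multiple of the cache line size. Here is a minimal sketch of that arithmetic, assuming 64-byte cache lines and one large allocation that itself starts on a cache line boundary. The function names are made up for illustration.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64

// Round a block size in bytes up to the next multiple of the cache line size.
size_t PaddedBytes( size_t bytes )
{
    return ( bytes + CACHE_LINE_SIZE - 1 ) & ~(size_t)( CACHE_LINE_SIZE - 1 );
}

// Byte offset of block number `block` inside the one large allocation,
// when every block is padded out to whole cache lines.
size_t BlockOffset( size_t block, size_t bytes_per_block )
{
    return block * PaddedBytes( bytes_per_block );
}
```

For example, a block of 100 floats is 400 bytes, which pads out to 448 bytes (7 cache lines). Each core's block then starts at a multiple of 448 bytes from the aligned base, so no cache line is ever split between two cores' blocks.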
