Fundamental Idea: Locality of Reference in Compiler Optimization

Locality of reference, also termed the principle of locality, is a common idea in compiler optimization. In a nutshell, it means that programs execute faster when they are written so that each memory access falls within a small range of the previous one. This works because accessing content from the CPU cache is far faster than accessing it from main memory.

# Introduction

In the context of a modern CPU, when the processor accesses a memory location, it also caches a handful of the memory locations just next to it in order to gain better performance. Suppose that at some instant the CPU accesses memory location 0x55ac388b49d0. Then, depending on the size of the cache line, the processor will also cache the contents of 0x55ac388b49d1, 0x55ac388b49d2, 0x55ac388b49d3, and so on.

This means that if the processor now accesses 0x55ac388b49d1, it does not have to fetch anything: the content is already in the cache and is simply used from there.

But if, after accessing 0x55ac388b49d1, the program reads memory at location 0xffe34445ffff, that content probably won't be in the cache (the cache of an average laptop is only around 6-12 MB). To operate on that location, the processor has to step away from the cache and access main memory, which is far slower than reading from the cache or a register.
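On Linux you can check these numbers for your own machine. The sysfs paths in the sketch below are standard on Linux but may not exist on other platforms, so treat this as a quick experiment rather than portable code:

```rust
use std::fs;

fn main() {
    // sysfs entries describing the first-level data cache of CPU 0 (Linux-specific).
    let size_path = "/sys/devices/system/cpu/cpu0/cache/index0/size";
    let line_path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size";

    match fs::read_to_string(size_path) {
        Ok(s) => println!("L1 data cache size: {}", s.trim()),
        Err(e) => println!("could not read {}: {}", size_path, e),
    }
    match fs::read_to_string(line_path) {
        Ok(s) => println!("cache line size: {} bytes", s.trim()),
        Err(e) => println!("could not read {}: {}", line_path, e),
    }
}
```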

For a quick refresher, this is the order of components by increasing access latency:

access cost: CPU register < L1 cache < L2 cache < main memory < secondary storage (SSD/HDD)

But their monetary cost runs in the opposite order: equipping a PC with more L1 cache is far more expensive than adding the same amount of main memory. That's why a PC has only a few MB of CPU cache, tens of GB of main memory, and hundreds of GB of secondary storage. They come in order of increasing capacity, decreasing cost, and decreasing read/write speed.
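You can get a feel for this latency gap with a rough timing experiment. The sketch below (the array size, stride, and use of std::hint::black_box are my own choices, not from this article) touches the same amount of memory twice: once sequentially, so most accesses hit already-cached lines, and once with a large stride, so almost every access pulls a fresh line from main memory. Exact numbers vary wildly between machines, so treat the output as indicative only:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: usize = 64 * 1024 * 1024; // 64 MiB: much larger than typical CPU caches
    const STRIDE: usize = 4096; // jump far enough that each access touches a new cache line
    let mut data = vec![0u8; N];

    // Sequential walk: consecutive bytes, so most accesses hit the cache.
    let t = Instant::now();
    for i in 0..N {
        data[i] = data[i].wrapping_add(1);
    }
    black_box(&data); // keep the compiler from discarding the work
    println!("sequential: {:?}", t.elapsed());

    // Strided walk: the same number of accesses, but spread far apart in memory.
    let t = Instant::now();
    for offset in 0..STRIDE {
        let mut i = offset;
        while i < N {
            data[i] = data[i].wrapping_add(1);
            i += STRIDE;
        }
    }
    black_box(&data);
    println!("strided:    {:?}", t.elapsed());
}
```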

Having this in mind, let's walk through a quick example.

# Understanding with an Example

Have a quick glance at the example below:

```rust
fn main() {
    let mut array_one = [true; 1_000_000]; // allocate 1,000,000 bytes of memory: 1 byte per bool value
    let mut array_two = [false; 1_000_000]; // allocate another 1,000,000 bytes of memory

    array_one[0] = false; // location: 0x7ffc93f560a8
    array_one[1] = false; // location: 0x7ffc93f560a9
    array_one[2] = false; // location: 0x7ffc93f560aa
                          // ............
                          // location: 0x7ffc9404a2e7 for the last element

    array_two[0] = true; // location: 0x7ffc9404a2e8
    array_two[1] = true; // location: 0x7ffc9404a2e9
    array_two[2] = true; // location: 0x7ffc9404a2ea
                         // ............
                         // location: 0x7ffc9413e527 for the last element
}
```
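The addresses in the comments are illustrative, not values you should expect verbatim. You can print the actual addresses on your own machine with Rust's `{:p}` pointer formatter, as in the sketch below; the values change between runs, but consecutive elements will always be 1 byte apart:

```rust
fn main() {
    let array_one = [true; 1_000_000];
    let array_two = [false; 1_000_000];

    // `{:p}` prints the address that a reference points to.
    println!("array_one[0]: {:p}", &array_one[0]);
    println!("array_one[1]: {:p}", &array_one[1]); // exactly 1 byte after array_one[0]
    println!("array_one[2]: {:p}", &array_one[2]);
    println!("array_two[0]: {:p}", &array_two[0]);
}
```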

Now let's look at this program line by line:

  1. Declaration of the main function.

  2. Allocate an array named array_one holding one million true boolean values. A single boolean value in Rust takes 1 byte of memory, so this statement allocates one million bytes, say from 0x7ffc93f560a8 to 0x7ffc9404a2e7.

  3. Allocate another one million bytes of memory for one million false boolean values in the variable array_two, from 0x7ffc9404a2e8 to 0x7ffc9413e527.

  4. Access the first element of array_one, which is at memory location 0x7ffc93f560a8.

  5. Access the second element of array_one, at memory location 0x7ffc93f560a9. Note that by this time the CPU has probably already cached the contents of array_one[2], array_one[3], array_one[4], and so on.

  6. Access the third element of array_one, at memory location 0x7ffc93f560aa. As this value is already in the cache, the statement executes very quickly. If the program went on to access array_one[3], that would be near-instant as well, since that value is already cached too.

  7. Access the first element of array_two, at memory location 0x7ffc9404a2e8. Up to this point the CPU cache was holding the contents of 0x7ffc93f560aa and the following locations. But now the program has to access memory that is about a million bytes away. Since the CPU only caches memory near recently accessed locations, the processor has to fetch this content from main memory, which is slow compared to reading from the cache.

  8. Access the second element of array_two. At this point, depending on your CPU's caching algorithm, it may need to evict the earlier cache lines from array_one and fill new ones from array_two.

  9. Access the third element of array_two, which is hopefully in the cache by now, so it may be near-instant.

# What Will the Compiler Do

Before talking about how the compiler optimizes a program to better align with the principle of locality, note that if the logic in your code is too complex for the compiler to analyze, it is free to do absolutely nothing. Optimizations also differ across compilers; for example, a compiler for a custom embedded processor may not perform this kind of optimization at all.

In addition, the compiler must preserve the behaviour of the program: in its eagerness to optimize the code, it must never make the program behave differently from what the programmer wrote. As someone correctly put it,

> Correctness of a program should be prioritized over performance.

OK, now let's look at another program that most compilers will hopefully optimize with the principle of locality in mind.

```rust
fn main() {
    let mut a = vec![2; 1000];
    let mut b = vec![1; 1000];
    let mut c = vec![3; 1000];

    // Set all elements to 0
    for i in 0..1000 {
        a[i] = 0;
        b[i] = 0;
        c[i] = 0;
    }
}
```

If the elements of vector a are stored in contiguous memory starting at address Q, they are arranged in memory as Q -> Q+1 -> Q+2 -> ... -> Q+998 -> Q+999.

Similarly, the elements of b run from R -> R+1 -> ... -> R+999,

and the elements of c from S -> S+1 -> ... -> S+999.

Then, as the program is written, memory will be accessed in the order:

Q -> R -> S -> Q+1 -> R+1 -> S+1 -> ... -> Q+999 -> R+999 -> S+999

Now if the CPU caches Q+1, Q+2, and so on while reading Q, that work is completely wasted, because the next access is to location R.

Again, when R is accessed, the CPU may cache the addresses R+1, R+2, and so on. This is still useless, since the next memory access goes to location S.

This repeats over and over. Now we can see how an innocent-looking program can be so bad in terms of caching. If we were asked to write a program with exactly the same behaviour, but arranged so that cache utilization is maximized, it might look something like this:

```rust
fn main() {
    let mut a = vec![2; 1000];
    let mut b = vec![1; 1000];
    let mut c = vec![3; 1000];

    for i in 0..1000 {
        a[i] = 0;
    }
    for i in 0..1000 {
        b[i] = 0;
    }
    for i in 0..1000 {
        c[i] = 0;
    }
}
```

Now the memory accesses proceed as:

Q -> Q+1 -> Q+2 -> ... -> Q+999 -> R -> R+1 -> ... -> R+999 -> S -> S+1 -> S+2 -> ... -> S+999

Breaking down the accesses again: while reading Q, Q+1, and Q+2, the CPU will have cached the contents of Q+3, Q+4, and further locations, depending on the size of the cache. When the cached run is exhausted, say around Q+31, the processor has to fetch Q+32 and Q+33 from main memory, but by that point the following locations will have been cached in turn. This is a far better utilization of the cache than the previous version.

A compiler's optimizer is expected to perform exactly this transformation (splitting one loop into several is known as loop fission). So as long as your compiler has an optimizer, it may generate assembly for the first version of this program that is equivalent to the assembly of the second, as long as the behaviour of the program is preserved. If you are using a modern, mature compiler (for example one backed by LLVM, such as rustc or Clang) and you compile with the optimizer enabled (release mode in the context of Rust), you can leave such transformations to the compiler. But if your program logic is complex and you doubt the compiler will manage it, it is of course a better idea to restructure the code yourself, as long as doing so does not make your codebase a mess that is harder to reason about.
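If you want to test this on your own machine, a rough sketch follows. The array length, element type, and timing approach are my own choices for illustration; in release mode the optimizer may transform both functions into similar code, so a measurable gap is not guaranteed:

```rust
use std::hint::black_box;
use std::time::Instant;

const N: usize = 10_000_000; // large enough that the arrays do not all fit in cache

fn interleaved(a: &mut [u64], b: &mut [u64], c: &mut [u64]) {
    // One loop that hops between three distant memory regions on every iteration.
    for i in 0..N {
        a[i] = 0;
        b[i] = 0;
        c[i] = 0;
    }
}

fn split(a: &mut [u64], b: &mut [u64], c: &mut [u64]) {
    // Three loops, each walking one region sequentially.
    for i in 0..N {
        a[i] = 0;
    }
    for i in 0..N {
        b[i] = 0;
    }
    for i in 0..N {
        c[i] = 0;
    }
}

fn main() {
    let mut a = vec![2u64; N];
    let mut b = vec![1u64; N];
    let mut c = vec![3u64; N];

    let t = Instant::now();
    interleaved(&mut a, &mut b, &mut c);
    black_box((&a, &b, &c)); // keep the work from being optimized away
    println!("interleaved: {:?}", t.elapsed());

    let t = Instant::now();
    split(&mut a, &mut b, &mut c);
    black_box((&a, &b, &c));
    println!("split:       {:?}", t.elapsed());
}
```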

# Further Readings

  1. Temporal locality

    This is the assumption that the same memory address will be accessed again in the near future.

  2. Spatial locality

    This is the assumption that if a memory location is accessed, some nearby location will be accessed soon after. The example above illustrates this type of locality: the processor assumes that if address Q is accessed, something nearby such as Q+1 will be accessed next. A small sketch contrasting these first two kinds of locality follows this list.

  3. Branch locality

  4. Equidistant locality
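As a quick illustration of the first two kinds of locality (the loop bounds and data here are arbitrary choices of mine):

```rust
fn main() {
    let data = vec![1u32; 1024];
    let mut sum = 0u64;

    // Temporal locality: `sum` and the whole of `data` are re-read on every
    // outer iteration, so after the first pass they stay in the cache.
    for _ in 0..1000 {
        // Spatial locality: within each pass, `data` is walked front to back,
        // so consecutive accesses land on the same or neighbouring cache lines.
        for &x in &data {
            sum += u64::from(x);
        }
    }
    println!("{sum}");
}
```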

