CS61C Review#
Parallel Processing#
Loop Unrolling#
Loop unrolling decreases the number of branch and comparison instructions executed, which are often bottlenecks in the datapath.
Rolled#
int x;
for (x = 0; x < 100; x++)
{
    delete(x);
}
There are 100 sets of branch instructions, one compare-and-branch per iteration.
Unrolled#
int x;
for (x = 0; x < 100; x += 5)
{
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}
There are now only 20 sets of branch instructions, one per five elements.
Instead of writing the repeated statements by hand, one may leave the loop rolled; many compilers recognize such loops and unroll them automatically. When the trip count is not a multiple of the unroll factor, a short cleanup loop handles the leftover iterations, as in the sketch below.
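A minimal sketch of this pattern, using a hypothetical process() function and a cleanup loop for trip counts that are not a multiple of the unroll factor:

void process(int i);   // placeholder for whatever per-element work the loop does

void run_unrolled(int n) {
    int x;
    // Main loop: four elements per iteration, so roughly n/4 compare-and-branch sequences.
    for (x = 0; x + 3 < n; x += 4) {
        process(x);
        process(x + 1);
        process(x + 2);
        process(x + 3);
    }
    // Cleanup loop: handles the remaining n % 4 elements.
    for (; x < n; x++) {
        process(x);
    }
}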
SIMD - Intel SSE Intrinsics#
Single instruction, multiple data (SIMD) is a data-level parallelism design that applies a single instruction to multiple data elements at once. Let’s use the Intel SSE intrinsics as an example.
Implemented by using wider registers called vector registers. The SSE vector registers are named xmm0, xmm1, ..., xmmN and are 128 bits each. They are often used for int and float data, thus holding four 32-bit elements.
Syntax#
| | Syntax | Example |
|---|---|---|
| Datatype | `__m128` (a 128-bit vector of four floats) | `__m128 ret_vec;` |
| Functions | `_mm_<operation>_ps` | `_mm_mul_ps(a, b)` |
Functions#
| | Syntax | Description | Example |
|---|---|---|---|
| Load | `_mm_loadu_ps(addr)` | Load the next vector of four floats from address `addr` | `_mm_loadu_ps(X + i*4)` |
| Arithmetic | `_mm_<op>_ps(a, b)` | Apply `<op>` (add, sub, mul, div, ...) element-wise to vectors `a` and `b` | `_mm_mul_ps(ret_vec, v)` |
| Initialize / Set | `_mm_set1_ps(x)` | Create a vector with all values equal to `x` | `_mm_set1_ps(1)` |
| Store | `_mm_storeu_ps(addr, vec)` | Store the elements of vector `vec` at address `addr` | `_mm_storeu_ps(m, ret_vec)` |
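As a small illustration of the functions above, a sketch that adds two float arrays element-wise (assuming the length n is a multiple of 4) could look like this:

#include <stddef.h>
#include <xmmintrin.h>

// c[i] = a[i] + b[i], processing four floats per loop iteration.
void add_arrays(float *a, float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // load four floats from a
        __m128 vb = _mm_loadu_ps(b + i);   // load four floats from b
        __m128 vc = _mm_add_ps(va, vb);    // element-wise add
        _mm_storeu_ps(c + i, vc);          // store four results into c
    }
}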
Example of SSE Code#
#include <stddef.h>
#include <xmmintrin.h>

float eval_lagrange_fast(float *X, float *Y, float c, size_t n, size_t k) {
    float retval = 1, m[4];
    size_t i;
    __m128 ret_vec = _mm_set1_ps(1); // single-precision (sp) float vector <1, 1, 1, 1>
    // Vectorized loop: process X four elements at a time, skipping the chunk that contains k.
    for (i = 0; i < n/4; i += 1) {
        if (i == k/4)
            continue;
        ret_vec = _mm_mul_ps(ret_vec, _mm_sub_ps(_mm_set1_ps(c),
                                                 _mm_loadu_ps(X + i*4)));
        ret_vec = _mm_div_ps(ret_vec, _mm_sub_ps(_mm_load1_ps(X + k),
                                                 _mm_loadu_ps(X + i*4)));
    }
    // Scalar cleanup: handle the chunk containing index k, skipping k itself.
    for (i = k/4*4; i < k/4*4 + 4; i += 1) {
        if (i == k)
            continue;
        retval *= c - X[i];
        retval /= X[k] - X[i];
    }
    _mm_storeu_ps(m, ret_vec);  // write the four partial products back to memory
    return Y[k] * retval * m[0] * m[1] * m[2] * m[3];
}
MIMD - OpenMP#
Multiple instruction, multiple data (MIMD) is a thread-level parallelism design.
Multithreading on a single core gives the illusion of simultaneous multiprocessing; the operating system creates this illusion by scheduling and interleaving the threads.
Each thread has separate registers and a separate PC, but the memory is shared.
Let’s use OpenMP as an example.
OpenMP Directives#
Directives are placed in the code to mark which blocks of code are to be distributed across threads and what type of multithreaded processing is to be done with those blocks.
#pragma omp [options]
{
[body]
}
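For concreteness, a minimal complete program using a bare parallel directive might look like the following sketch (assuming an OpenMP-capable compiler, e.g. invoked with a flag such as -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    // Every thread in the team executes this block once.
    #pragma omp parallel
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}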
Parallel#
The parallel option spawns a team of threads that all execute the enclosed block; it is used when specific parts of memory can be assigned to each thread.
The example below uses 2 threads to compute \(2^{100}\). Logically this should be about twice as fast as a single thread (though in practice the speedup is implementation dependent).
#define NUM_THREADS 2
double prod[2] = {1, 1};
double result = 1;
#pragma omp parallel num_threads(NUM_THREADS)
{
    int id = omp_get_thread_num();
    // Each thread multiplies its own slot by 2 for 50 of the 100 factors.
    for (int i = id; i < 100; i += NUM_THREADS) {
        prod[id] *= 2;
    }
}
result *= prod[0] * prod[1];   // 2^50 * 2^50 = 2^100
Critical#
The critical option creates a critical section, which only one thread may execute at a time; it is used when threads must update shared memory.
#define NUM_THREADS 2
double prod[2] = {1, 1};
double result = 1;
#pragma omp parallel num_threads(NUM_THREADS)
{
    int id = omp_get_thread_num();
    for (int i = id; i < 100; i += NUM_THREADS) {
        prod[id] *= 2;
    }
    // Only one thread at a time may execute this update of the shared result.
    #pragma omp critical
    result *= prod[id];
}
Parallel For#
The parallel for option splits the iterations of the following loop across the threads, typically in consecutive chunks.
int a[] = {1, 2, 3, 4};
int i;
#pragma omp parallel for
for (i = 0; i < 4; i++) {
    printf("%d ", a[i]);
}
// prints 1 2 3 4 in some order, depending on which thread finishes first.
int NUM_THREADS = omp_get_max_threads();
double result = 1;
double prod[NUM_THREADS];
for (int t = 0; t < NUM_THREADS; t++) prod[t] = 1;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    // Iterations are split across threads; each thread updates only its own slot.
    prod[omp_get_thread_num()] *= 2;
}
for (int t = 0; t < NUM_THREADS; t++) result *= prod[t];  // result = 2^100
Cache Coherency#
Bus: the connection between the cores (and main memory).
Snoop: each cache monitors (snoops) the bus to see whether another core reads or writes a block it holds, so it can update or invalidate its own copy.
MOESI Protocol#
Reminder: Dirty means memory is not up to date.
- Modified: only this cache has a copy; dirty.
- Owned: more than one cache has a copy; dirty; this cache is responsible for writing the block back to memory.
- Exclusive: only this cache has a copy; clean.
- Shared: more than one cache has a copy; either dirty or clean; this cache is not responsible for writing to memory.
- Invalid: unusable since the block is not up to date.
graph RL;
M --R/W--> M
M --Bus R--> O
M --Bus W--> I
O --R / Bus R--> O
O --Bus W--> I
O --W--> M
E --R--> E
E --W--> M
E --Bus W--> I
E --Bus R--> S
S --R / Bus R--> S
S --W--> M
S --Bus W--> I
I --Reset--> I
I --R Miss, Shared--> S
I --W Miss--> M
I --R Miss, Exclusive--> E
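The transitions above can also be written out as code. Below is a minimal, hypothetical C sketch (the type and function names are made up for illustration) of how one cache's copy of a block changes state on local accesses and on snooped bus traffic:

typedef enum { MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID } moesi_state;

/* Local processor write: the cache ends up with the only up-to-date (dirty) copy.
   From INVALID this is a write miss, which also fetches the block. */
moesi_state on_cpu_write(moesi_state s) {
    (void) s;
    return MODIFIED;
}

/* Local processor read: hits keep their state; a miss (from INVALID) becomes
   SHARED or EXCLUSIVE depending on whether another cache already holds the block. */
moesi_state on_cpu_read(moesi_state s, int another_cache_has_copy) {
    if (s == INVALID)
        return another_cache_has_copy ? SHARED : EXCLUSIVE;
    return s;
}

/* Snooped bus read: another core reads the block. */
moesi_state on_bus_read(moesi_state s) {
    if (s == MODIFIED)
        return OWNED;    /* keep the dirty data and supply it to other caches */
    if (s == EXCLUSIVE)
        return SHARED;   /* no longer the only copy */
    return s;            /* OWNED, SHARED, INVALID are unchanged */
}

/* Snooped bus write: another core writes the block, so this copy becomes stale. */
moesi_state on_bus_write(moesi_state s) {
    (void) s;
    return INVALID;
}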
Cache#
Cache Policy#
Write Through#
All changes are written to the memory immediately.