Saturday, April 18, 2020

Project - Stage 3





This is the last part of the optimization project.
 I chose the open-source project "Zopfli" for optimization.
 Profiling shows that the hottest part of Zopfli is the ZopfliUpdateHash() function.

 Here is the result of profiling.







Here is the hottest part in the "ZopfliUpdateHash()" function.

Here is the "ZopfliUpdateHash" function code in hash.c.


I plan to optimize the while loop in this function, identified by profiling, using several approaches.

First, GCC optimization options.

When I change the GCC optimization level with the -O flag, the results are interesting. I changed only the level, from 0 through 3 and then fast, and the results differ considerably. With -Ofast, the real and sys times are the fastest. Also, comparing -O0 to -O3, the user time drops as the optimization level increases.
-O0 – no optimization
-O1 – first level of optimization
-O2 – second level of optimization
-O3 – highest level of optimization
-Ofast – like -O3, but also enables optimizations that disregard strict standards compliance (e.g. -ffast-math)

time    -O0         -O1         -O2         -O3         -Ofast
real    3m39.489s   1m41.897s   1m33.908s   2m19.740s   1m23.712s
user    3m39.023s   1m41.606s   1m33.620s   1m22.996s   1m23.455s
sys     0m0.142s    0m0.127s    0m0.138s    0m0.163s    0m0.119s


Second, Code Rewriting optimizations

* Original code
while (pos + amount + 1 < end && array[pos] == array[pos + amount + 1]
       && amount < (unsigned short)(-1)) {
    amount++;
}

1) Common subexpression elimination.

In this while loop, the subexpression (pos + amount + 1) is evaluated twice per iteration. It can be computed once, stored in a variable, and reused, saving a redundant addition each time.

size_t newVal = pos + amount + 1;
while (newVal < end && array[pos] == array[newVal]
       && amount < (unsigned short)(-1)) {
    amount++;
    newVal++;
}


2) Short Circuit

In this while loop there are three conditions. With a logical AND, C evaluates the operands left to right and stops at the first false one (short-circuit evaluation), so the second and third conditions are not evaluated when the first is false. Putting the cheapest, most likely to fail condition first and making the early exits explicit gives:

size_t newVal = pos + amount + 1;
while (newVal < end) {
    if (array[pos] != array[newVal]) break;
    if (amount >= (unsigned short)(-1)) break;
    amount++;
    newVal++;
}


Third,  SIMD

Vectorization  is the unrolling of a loop combined with the generation of packed SIMD instructions by the compiler. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization, to emphasize that the compiler identifies and optimizes suitable loops on its own, without requiring any special action by the programmer.

* Original code
void ZopfliUpdateHash(const unsigned char* array, size_t pos, size_t end,
                      ZopfliHash* h) {
  ...
  while (pos + amount + 1 < end &&
         array[pos] == array[pos + amount + 1] &&
         amount < (unsigned short)(-1)) {
    amount++;
  }
  ...
}

* Optimized code
Inform the compiler that no overlap is possible, which is done in standard C with the restrict qualifier on the pointers, and state the alignment. Note that __builtin_assume_aligned returns a pointer that must be used; calling it without using the result has no effect.
void ZopfliUpdateHash(const unsigned char* restrict array, size_t pos,
                      size_t end, ZopfliHash* restrict h) {
  ...
  array = __builtin_assume_aligned(array, 16);
  h = __builtin_assume_aligned(h, 16);
  while (pos + amount + 1 < end &&
         array[pos] == array[pos + amount + 1] &&
         amount < (unsigned short)(-1)) {
    amount++;
  }
  ...
}

Finally, I will try to optimize ZopfliUpdateHash() with the code below, which combines the rewrites above.

void ZopfliUpdateHash(const unsigned char* restrict array, size_t pos,
                      size_t end, ZopfliHash* restrict h) {
  ...
  array = __builtin_assume_aligned(array, 16);
  h = __builtin_assume_aligned(h, 16);

  size_t newVal = pos + amount + 1;
  while (newVal < end) {
    if (array[pos] != array[newVal]) break;
    if (amount >= (unsigned short)(-1)) break;
    amount++;
    newVal++;
  }
  ...
}


What I learned...
During this project, I learned two things.
First, when coding I have to think about whether my code is optimized. My coding style can affect processing speed, so I need to keep performance in mind while programming. Second, there are many ways to optimize; my plan covers only some of them, and I will try to learn others. This project was very interesting.




Tuesday, April 7, 2020

Project - Stage2



1. Profile your software to identify where the software is spending its execution time and the amount of memory that the software is using. Perform this profiling using the test case you created in Stage 1, on both x86_64 and AArch64 platforms. Do not compare the absolute performance on different platforms, but do compare the relative performance on a given platform.

What is profiling?
Profiling is a form of dynamic program analysis. It measures the time and space usage of a program, the frequency of function calls, and other information. Profiling is mainly done to guide program optimization.

How did I profile?

1. To get a profile, I did the steps below with gprof.
What is gprof: it uses instrumentation and sampling, so it gives us a top-quality call graph, but the program must be rebuilt with special instrumentation code.
1) Modify the Makefile: add the -pg option
2) Do "make clean"
3) Do "make all"
4) Do "gprof zopflipng gmon.out -p > result.txt"

I obtained the profiles below for each file size on each server.
AArch64 : 10MB

AArch64 :  20 MB

AArch64 : 30MB

X86-64 10MB

X86-64 20MB

X86-64  30MB

2. To get a profile, I did the steps below with perf.
* perf: this uses sampling rather than instrumentation, so no rebuild is needed, but it does not give us a call graph by default.
1) Modify the Makefile: remove the -pg option I had added for gprof.
2) Do "make clean"
3) Do "make all"
4) Do "perf record ./zopflipng 10mb.png 10test.png"
5) Open the profile with "perf report"
6) I got the profiling report.
* 10MB

*20MB

Looking at the perf profile, the ZopfliUpdateHash function takes about 30% of the running time, ZopfliFindLongestMatch takes 19%, and ZopfliStoreLitLenDist takes 15%. However, when I profiled with gprof, the hottest function was ZopfliFindLongestMatch.
Therefore I will look at ZopfliFindLongestMatch in more detail.

* 10MB





* 20MB




3. Identify the functions or methods that are taking the majority of the CPU time.
Most of the time is taken by two functions. The first is ZopfliFindLongestMatch, which took around 58% of the total time on the x86_64 server (40% on the AArch64 server). The other is encodeLZ77(univector*, Hash*, unsigned char const*, unsigned long, unsigned long, unsigned int, unsigned int, unsigned int, unsigned int), which took around 23% of the total time on the x86_64 server (13% on the AArch64 server). However, according to the perf profile, the hottest function is ZopfliUpdateHash (30%), and the second is ZopfliFindLongestMatch (19%). I will optimize one of these functions for Stage 3.

Monday, April 6, 2020

Lab 6 - Algorithm Selection Lab


In this lab, we will investigate the impact of different algorithms which produce the same effect.
We will test and select one of three algorithms for adjusting the volume of PCM audio sample based on benchmarking.

vol1.c: the basic or naive algorithm
This approach multiplies each sound sample by 0.75, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating point can be expensive.

vol2.c: A lookup-based algorithm
This approach uses a pre-calculated table of all 65536 possible results, and looks up each sample in that table instead of multiplying.

vol3.c: A fixed-point algorithm
This approach uses fixed-point math and bit shifting to perform the multiplication without using floating-point math.


Make a prediction
We will process 500,000,000 samples.
Reading the code, I predict that vol2 and vol3 will perform similarly and vol1 will be the slowest.
Most of the code is the same in all three programs; only the "interesting part" differs slightly.
vol1.c calls "data[x] = scale_sample(data[x], 0.75);" inside the loop, converting each sample to floating point and back. vol3.c instead computes "vf = 0.75 * 65536;" once, outside the loop, and uses only the integer variable "vf" inside it.


Here is the code we have for vol1.c:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {

        // Allocate memory for large data array
        int16_t*        data;
        data = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int             x;
        int             ttl = 0;

        // Seed the pseudo-random number generator
        srand(1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                data[x] = (rand()%65536)-32768;
        }

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        // (timed on its own, so setup is excluded)
        clock_t start, stop;
        start = clock();
        for (x = 0; x < SAMPLES; x++) {
                data[x] = scale_sample(data[x], 0.75);
        }
        stop = clock();
        printf("Scaling time: %f s\n", (double)(stop - start) / CLOCKS_PER_SEC);
        // ######################################

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+data[x])%1000;
        }

        // Print the sum
        printf("Result: %d\n", ttl);

        return 0;

}

Here is the code we have for vol2.c:
[hskwon@bbetty algorithm_selection_lab]$ cat vol2.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

// Function to initialize the scaling table
static inline void scale_init(int16_t* table, float volume_factor) {
        for (int i=-32768; i<=32767; i++) {
                table[(uint16_t) i] = ((float) i) * volume_factor;
        }
        return;
}

int main() {

        // Allocate memory for large data array
        int16_t*        data;
        data = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int             x;
        int             ttl = 0;

        int16_t         table[65536];

        // Seed the pseudo-random number generator
        srand(1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                data[x] = (rand()%65536)-32768;
        }

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        scale_init(table, 0.75);
        for (x = 0; x < SAMPLES; x++) {
                data[x] = table[(uint16_t) data[x]];
        }
        // ######################################

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+data[x])%1000;
        }

        // Print the sum
        printf("Result: %d\n", ttl);

        return 0;

}


Here is the code we have for vol3.c:
[hskwon@bbetty algorithm_selection_lab]$ cat vol3.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

        // Allocate memory for large data array
        int16_t*        data;
        data = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int             x;
        int             ttl = 0;
        int             vf;


        // Seed the pseudo-random number generator
        srand(1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                data[x] = (rand()%65536)-32768;
        }

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        vf = 0.75 * 65536;

        for (x = 0; x < SAMPLES; x++) {
                data[x] = (data[x] * vf) >>16;
        }
        // ######################################

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+data[x])%1000;
        }

        // Print the sum
        printf("Result: %d\n", ttl);

        return 0;

}


Build and test each of the programs
Find a way to measure performance without the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure only the time taken to scale the samples. 

I measured only the scaling loop, using functions from <sys/time.h>.

Change the number of samples so that each program takes a reasonable amount of time to execute
* sample: 500,000,000
Architecture        Program   1st          2nd          3rd
AArch64 (bbetty)    vol1      934.655029   935.343994   934.682007
                    vol2      481.979004   479.429993   480.885986
                    vol3      782.955017   782.963013   782.968018
x86_64 (xerxes)     vol1      342.062012   341.799988   342.772003
                    vol2      273.645020   274.656006   273.539001
                    vol3      212.174011   212.164993   212.121002

* sample: 1,000,000,000
Architecture        Program   1st           2nd           3rd
AArch64 (bbetty)    vol1      1871.629028   1871.630981   1870.562012
                    vol2      85.615005     76.682007     73.014008
                    vol3      1568.022949   1568.958008   1568.334961
x86_64 (xerxes)     vol1      682.671997    682.812988    683.776978
                    vol2      546.969971    581.843018    548.658020
                    vol3      425.340027    424.556030    425.140991


* sample: 2,000,000,000
Architecture        Program   1st           2nd           3rd
AArch64 (bbetty)    vol1      3743.591064   3743.652100   3744.966064
                    vol2      1099.267090   1089.684082   1064.688965
                    vol3      3138.756104   3138.928955   3138.953125
x86_64 (xerxes)     vol1      1401.909912   1523.831055   1409.390015
                    vol2      1119.677002   1138.452026   1105.764038
                    vol3      874.038025    848.578003    873.478027

vol1 is the slowest and vol2 the fastest on the AArch64 server. On xerxes, vol3 is the fastest and vol1 the slowest. When the number of samples was doubled (500,000,000 -> 1,000,000,000), the time also roughly doubled. Likewise, when the number of samples was quadrupled (500,000,000 -> 2,000,000,000), the time was about 4 times longer, except for vol2 on AArch64.

Memory usage of each program
Aarch64 : bbetty

X86-64 : xerxes

What I learned..
Even with the same source file, the running speed differs depending on the server.