SPO600 - Hyouk Sun Kwon: March 2020

Project – stage 1.

For the SPO600 final project, I choose one open source software package to optimize.

For the first stage, I choose an open source package and build the software, benchmark the performance of the current implementation on AArch64 and x86-64 systems, and finally experiment with build options to determine whether they have any impact on performance.

1. Choose an open source package

I chose Zopfli (https://github.com/google/zopfli). It is a compression algorithm made by Google. Zopfli is written in C for portability. It is a compression-only library. Zopfli is bit-stream compatible with the compression used in gzip, Zip, PNG, HTTP requests, and others.

Compared with other compression tools, Zopfli is slow. Even against gzip at its highest level (gzip -9), Zopfli is more than 80 times slower.

*https://www.lifehacker.com.au/2013/03/a-look-at-zopfli-googles-open-source-compression-algorithm/

The Zopfli GitHub page also says: “Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression.”

So, I want to optimize this one.

2. Build the software

2-1. x86_64

1) Clone the code to the server

I cloned the source code from the Zopfli GitHub repository.

2) Add an image to the server

For my testing, I chose the 10mb PNG file from https://www.sample-videos.com/download-sample-png-image.php.

3) Benchmark the performance

To benchmark the performance, I used 10mb, 20mb and 30mb files.

Real: elapsed real (wall clock) time used by the process, in seconds.

User: total number of CPU-seconds that the process used directly (in user mode), in seconds.

Sys: total number of CPU-seconds used by the system on behalf of the process (in kernel mode), in seconds.
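To compare runs numerically, the real/user/sys strings first have to be turned into seconds. The following is a minimal sketch, assuming the times are written in the `NmS.SSSs` form used in the tables of this post (for example `2m19.740s`). The helper name `parse_time` is only illustrative, and the sample values are the 10mb x86_64 run.

```python
import re

def parse_time(text):
    """Convert a string such as '2m19.740s' into seconds as a float."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Values from the 10mb x86_64 run.
real = parse_time("2m19.740s")
user = parse_time("1m22.996s")
sys_time = parse_time("0m0.163s")

# user + sys is the CPU time the process consumed. For a single-threaded
# process, real minus CPU time is the time it spent off the CPU
# (waiting for I/O or for other processes on the machine).
cpu = user + sys_time
print(f"real={real:.3f}s cpu={cpu:.3f}s off-cpu={real - cpu:.3f}s")
```

A large gap between real and user + sys suggests the machine was busy with something else, so user time is probably the steadier number to compare.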

time	10mb	20mb	30mb
real	2m19.740s	3m38.955s	5m38.456s
user	1m22.996s	3m38.287s	5m37.188s
sys	0m0.163s	0m0.320s	0m0.728s

[Images: 10mb, 20mb, 30mb]

I chose the 10mb file and ran it 5 times with the -O3 build option.

10mb	1st	2nd	3rd	4th	5th
Real	1m23.846s	1m23.742s	1m23.551s	1m24.355s	4m58.047s
User	1m23.571s	1m23.469s	1m23.282s	1m24.060s	1m33.316s
Sys	0m0.132s	0m0.133s	0m0.132s	0m0.150s	0m0.154s
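The fifth run has a much longer real time than the other four, so a plain average would be misleading. Here is a small sketch of how the five real times could be summarized. The values are the real times from the table converted to seconds by hand, and using the median is my own choice here, not part of the measurement itself.

```python
import statistics

# Real times of the five 10mb runs on x86_64, in seconds.
real_runs = [83.846, 83.742, 83.551, 84.355, 298.047]

mean = statistics.mean(real_runs)
median = statistics.median(real_runs)

# The mean is pulled up by the one slow run; the median is not.
print(f"mean   = {mean:.3f}s")
print(f"median = {median:.3f}s")
```

The median stays close to the four similar runs, while the mean is dominated by the slow one.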

4) Experiment with build options

I used the 10mb.png file with various build options.

-O0: no optimization

-O1: first level optimization

-O2: second level optimization

-O3: highest standard optimization level

-Ofast: all -O3 optimizations plus ones that may break strict standards compliance, such as -ffast-math

time	-O0	-O1	-O2	-O3	-Ofast
real	3m39.489s	1m41.897s	1m33.908s	2m19.740s	1m23.712s
user	3m39.023s	1m41.606s	1m33.620s	1m22.996s	1m23.455s
sys	0m0.142s	0m0.127s	0m0.138s	0m0.163s	0m0.119s
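To make the effect of each option easier to see, the times can be expressed as a speedup over -O0. This is a rough sketch that uses the user times from the x86_64 table, converted to seconds by hand. I use user time rather than real time on the assumption that it is less affected by other activity on the server.

```python
# User times for the 10mb file on x86_64, in seconds.
user_seconds = {
    "-O0": 219.023,
    "-O1": 101.606,
    "-O2": 93.620,
    "-O3": 82.996,
    "-Ofast": 83.455,
}

baseline = user_seconds["-O0"]

# Speedup = time without optimization / time with the given option.
speedup = {opt: baseline / t for opt, t in user_seconds.items()}

for opt, factor in speedup.items():
    print(f"{opt:7s} {user_seconds[opt]:8.3f}s  {factor:.2f}x")

fastest = min(user_seconds, key=user_seconds.get)
print("lowest user time:", fastest)
```

By user time, -O3 and -Ofast are within half a second of each other, so a single run is not enough to say which one is really faster.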

[Images: -O0, -O1, -O2, -O3, -Ofast]

2-2. AArch64

1) Benchmark the performance

To benchmark the performance, I used 10mb, 20mb and 30mb files.

time	10mb	20mb	30mb
real	8m8.914s	18m56.510s	29m23.732s
user	8m7.776s	18m53.535s	29m18.217s
sys	0m0.229s	0m0.847s	0m2.001s

[Images: 10mb, 20mb, 30mb]

I chose the 10mb file and ran it 5 times with the -O3 build option.

10mb	1st	2nd	3rd	4th	5th
Real	8m19.454s	8m42.835s	8m31.470s	8m24.040s	8m8.914s
User	8m18.075s	8m41.506s	8m30.148s	8m22.796s	8m7.776s
Sys	0m0.339s	0m0.299s	0m0.339s	0m0.319s	0m0.229s

2) Experiment with build options

I used the 10mb.png file with various build options.

time	-O0	-O1	-O2	-O3	-Ofast
real	23m39.314s	10m5.855s	8m49.292s	8m8.914s	8m7.720s
user	23m36.622s	10m4.457s	8m48.034s	8m7.776s	8m6.445s
sys	0m0.330s	0m0.329s	0m0.310s	0m0.229s	0m0.349s

[Images: -O0, -O1, -O2, -O3, -Ofast]

When I change the build option, the running time changes too. No optimization (-O0) is the slowest on both systems, and -Ofast has the shortest real time, although -O3 and -Ofast are nearly identical in user time. The performance changes even though the code itself is unchanged and only the build option is different. It is really interesting to me.
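The two systems can also be compared option by option. The sketch below divides the AArch64 user time by the x86_64 user time for each build option, using the values from the two build-option tables converted to seconds by hand. The two servers are different machines, so this ratio only describes these two servers and is not a general statement about the architectures.

```python
# User times for the 10mb file, in seconds.
x86_user = {
    "-O0": 219.023, "-O1": 101.606, "-O2": 93.620,
    "-O3": 82.996, "-Ofast": 83.455,
}
aarch64_user = {
    "-O0": 1416.622, "-O1": 604.457, "-O2": 528.034,
    "-O3": 487.776, "-Ofast": 486.445,
}

# How many times longer the AArch64 run took for the same option.
ratio = {opt: aarch64_user[opt] / x86_user[opt] for opt in x86_user}

for opt, r in ratio.items():
    print(f"{opt:7s} AArch64 / x86_64 = {r:.2f}")
```

In these measurements the AArch64 run takes roughly six times as long at every optimization level.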

For stage2, I will profile the software to determine which part of the code is doing most of the work.

In Lab 5 we compare AArch64 and x86-64 assembler. We will print "Hello, World" 30 times, each with its order number.

First part: AArch64 assembler

1. Use the "objdump -d" command and find <main>

"hello"

This hello object code comes from the C code below.

"hello2"

This hello2 object code comes from the C code below.

"hello3"

This hello3 object code comes from the C code below.

2. How to solve ...

To add the order number after "Hello, World", we need to know where the number goes in the string and how to turn the number into ASCII characters.

So we use the digit_1 and digit_2 subroutines together with the loop and print subroutines.

digit_1: writes a one-digit number after "Hello, World"

digit_2: calculates the two digits of a two-digit number and writes them after "Hello, World"

print: prints the message and increments the order number until it reaches max

loop: checks whether the order number has one digit or two digits
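The digit handling can be modelled in a few lines of Python before writing it in assembler. This is only a sketch of the idea: the function name `format_line` is made up for illustration, and it assumes the message has two placeholder bytes at offsets 14 and 15, just before the newline.

```python
# Message with two placeholder bytes (offsets 14 and 15) before the newline.
MSG = b"Hello World!:   \n"

def format_line(index):
    """Return the message with the order number written into it."""
    buf = bytearray(MSG)
    if index > 9:
        # Two digits: quotient and remainder of division by 10,
        # like udiv followed by msub in AArch64.
        tens, ones = divmod(index, 10)
        buf[14] = ord("0") + tens
        buf[15] = ord("0") + ones
    else:
        # One digit: adding the code of '0' gives the ASCII character.
        buf[14] = ord("0") + index
    return bytes(buf)

for i in range(30):
    print(format_line(i).decode(), end="")
```

Adding the character code of '0' works because the ASCII digits '0' to '9' are consecutive.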

3. Whole code

.text
.globl _start

_start:
    mov x19, min /* store the min value in x19 as the loop index */
    mov x22, division /* store the divisor (10) in x22; x18 is avoided because the ABI sets it aside as the platform register */

loop:
    mov x0, 1 /* file descriptor: 1 is stdout */

    cmp x19, 9 /* compare x19 (loop index) with 9 */
    b.gt digit_2 /* greater than 9 (two digits): go to digit_2 */
    /* otherwise (one digit) fall through to digit_1 */

digit_1:
    add x20, x19, '0' /* convert the digit to its ASCII character */

    adr x24, msg+14 /* location of the digit within the string (x30 is the link register, so it is not used as scratch) */
    strb w20, [x24] /* store the digit at that location */
    b print /* plain branch: nothing returns here, so bl is not needed */

digit_2:
    udiv x25, x19, x22 /* x25 = index / 10 (tens digit) */
    msub x26, x25, x22, x19 /* x26 = index - x25 * 10 (ones digit) */

    add x21, x25, '0' /* tens digit as an ASCII character */
    add x20, x26, '0' /* ones digit as an ASCII character */

    adr x24, msg+14 /* location of the tens digit within the string */
    strb w21, [x24] /* store the tens digit */
    adr x24, msg+15 /* location of the ones digit within the string */
    strb w20, [x24] /* store the ones digit */
    /* fall through to print */

print:
    adr x1, msg /* address of the message */
    mov x2, len /* length of the message */

    mov x8, 64 /* write is syscall #64 */
    svc 0 /* invoke syscall */

    add x19, x19, 1 /* increment the loop index */
    cmp x19, max /* compare x19 with the max value */

    b.ne loop /* if the index is not equal to max, loop again */

    mov x0, 0 /* status -> 0 */
    mov x8, 93 /* exit is syscall #93 */
    svc 0 /* invoke syscall */

.data
msg: .ascii "Hello World!:   \n" /* offsets 14 and 15 are placeholders for the digits */
len = . - msg
min = 0
max = 30
division = 10

4. The result

5. What I learned

Through Lab 5 I learned another assembly language, AArch64. It is similar to 6502, so I was already familiar with this coding process. It is still not easy, but I will keep trying.

SPO600 - Hyouk Sun Kwon

Tuesday, March 31, 2020

Project - stage1

Friday, March 13, 2020

Lab 5 - Aarch64