Tuesday, March 31, 2020

Project - stage1




For the SPO600 final project, I choose one open source package to optimize.
In the first stage, I choose an open source package and build it, benchmark the current implementation on AArch64 and x86-64 systems, and finally experiment with build options to see whether they affect performance.


1.     Choose an open source package

I chose Zopfli (https://github.com/google/zopfli), an open source compression algorithm made by Google. Zopfli is written in C for portability. It is a compression-only library. Zopfli is bit-stream compatible with compression used in gzip, Zip, PNG, HTTP requests, and others.
Compared with other compression tools, Zopfli is slow. Against gzip at level 9, Zopfli is more than 80 times slower.

*https://www.lifehacker.com.au/2013/03/a-look-at-zopfli-googles-open-source-compression-algorithm/
 
The Zopfli GitHub page also says, “Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression.”
So I want to optimize it.



2.     Build the software
2-1.  x86_64
1)    Clone the code to the server
I cloned the source code from the Zopfli GitHub repository.


2) Add a test image to the server

For my testing, I chose the 10mb PNG file from https://www.sample-videos.com/download-sample-png-image.php.

3) Benchmark the performance
To benchmark the performance, I used 10mb, 20mb and 30mb files.
Real: elapsed real (wall clock) time used by the process, in seconds.
User: total number of CPU-seconds that the process used directly (in user mode), in seconds.
Sys: total number of CPU-seconds used by the system on behalf of the process (in kernel mode), in seconds.

time    10mb        20mb        30mb
real    2m19.740s   3m38.955s   5m38.456s
user    1m22.996s   3m38.287s   5m37.188s
sys     0m0.163s    0m0.320s    0m0.728s
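As a rough illustration of how the three numbers relate, the sketch below measures real, user, and sys time for a child process in Python. The command it runs is only a placeholder, and the `./zopfli 10mb.png` invocation mentioned in the comment is an assumption, not the exact command used for the table. Real time can be much larger than user + sys when the process spends time waiting, for example on a busy shared server.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (real, user, sys) in seconds, like the shell's `time`."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    after = os.times()
    # children_user / children_system accumulate the CPU time of
    # child processes that have been waited for.
    user = after.children_user - before.children_user
    sys_time = after.children_system - before.children_system
    return real, user, sys_time


if __name__ == "__main__":
    # Placeholder workload; swap in the real compressor call, for example
    # ["./zopfli", "10mb.png"] (binary and file name are assumed here).
    real, user, sys_time = time_command(
        [sys.executable, "-c", "sum(range(10**6))"])
    print(f"real {real:.3f}s  user {user:.3f}s  sys {sys_time:.3f}s")
```

The child CPU counters from `os.times()` are only filled in on Unix-like systems.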



I chose the 10mb file and ran it 5 times with the -O3 build option.
10mb    1st         2nd         3rd         4th         5th
real    1m23.846s   1m23.742s   1m23.551s   1m24.355s   4m58.047s
user    1m23.571s   1m23.469s   1m23.282s   1m24.060s   1m33.316s
sys     0m0.132s    0m0.133s    0m0.132s    0m0.150s    0m0.154s
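The fifth run has a real time far above the other four, while its user time rose only a little, so a single average would be misleading. A minimal sketch of summarizing the five runs is shown below; the helper name `to_seconds` is my own, and the values are the x86_64 numbers from this table. The median is less sensitive to one slow run than the mean.

```python
import re
import statistics


def to_seconds(text):
    """Convert a `time`-style value such as '1m23.846s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# The five x86_64 -O3 runs on the 10mb file.
real = [to_seconds(t) for t in
        ["1m23.846s", "1m23.742s", "1m23.551s", "1m24.355s", "4m58.047s"]]
user = [to_seconds(t) for t in
        ["1m23.571s", "1m23.469s", "1m23.282s", "1m24.060s", "1m33.316s"]]

for name, runs in (("real", real), ("user", user)):
    print(f"{name}: mean {statistics.mean(runs):.3f}s, "
          f"median {statistics.median(runs):.3f}s")
```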


4) Experiment with build options
I used the 10mb.png file with various build options.
-O0: no optimization
-O1: first level of optimization
-O2: second level of optimization
-O3: highest standard optimization level
-Ofast: -O3 plus speed optimizations that may ignore strict standards compliance
time    -O0         -O1         -O2         -O3         -Ofast
real    3m39.489s   1m41.897s   1m33.908s   2m19.740s   1m23.712s
user    3m39.023s   1m41.606s   1m33.620s   1m22.996s   1m23.455s
sys     0m0.142s    0m0.127s    0m0.138s    0m0.163s    0m0.119s



2-2.  AArch64
1) Benchmark the performance
To benchmark the performance, I used 10mb, 20mb and 30mb files.
time    10mb        20mb        30mb
real    8m8.914s    18m56.510s  29m23.732s
user    8m7.776s    18m53.535s  29m18.217s
sys     0m0.229s    0m0.847s    0m2.001s


I chose the 10mb file and ran it 5 times with the -O3 build option.
10mb    1st         2nd         3rd         4th         5th
real    8m19.454s   8m42.835s   8m31.470s   8m24.040s   8m8.914s
user    8m18.075s   8m41.506s   8m30.148s   8m22.796s   8m7.776s
sys     0m0.339s    0m0.299s    0m0.339s    0m0.319s    0m0.229s



2) Experiment with build options
I used the 10mb.png file with various build options.
time    -O0         -O1         -O2         -O3         -Ofast
real    23m39.314s  10m5.855s   8m49.292s   8m8.914s    8m7.720s
user    23m36.622s  10m4.457s   8m48.034s   8m7.776s    8m6.445s
sys     0m0.330s    0m0.329s    0m0.310s    0m0.229s    0m0.349s


Changing only the build option, without changing any code, changes the running time. No optimization (-O0) is the slowest on both platforms. -O3 and -Ofast are the fastest and are close to each other: -Ofast has the lowest times on AArch64, while on x86_64 -Ofast has the lowest real time but -O3 has a slightly lower user time. It is really interesting to me.
For stage 2, I will profile the software to determine which part of the code is doing most of the work.
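To put numbers on the effect of the build options, the sketch below computes the speedup of each option relative to -O0 from the user times in the two build-option tables. I use user time rather than real time here because the x86_64 -O3 real time (2m19.740s) is far above its user time, which suggests the wall-clock value included waiting. The helper names are my own.

```python
def to_seconds(text):
    """Convert a `time`-style value such as '3m39.023s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# User times for the 10mb file, copied from the build-option tables.
USER_TIMES = {
    "x86_64": {"-O0": "3m39.023s", "-O1": "1m41.606s", "-O2": "1m33.620s",
               "-O3": "1m22.996s", "-Ofast": "1m23.455s"},
    "AArch64": {"-O0": "23m36.622s", "-O1": "10m4.457s", "-O2": "8m48.034s",
                "-O3": "8m7.776s", "-Ofast": "8m6.445s"},
}


def speedups(times):
    """Return {option: time(-O0) / time(option)} for one platform."""
    baseline = to_seconds(times["-O0"])
    return {opt: baseline / to_seconds(value) for opt, value in times.items()}


for platform, times in USER_TIMES.items():
    row = "  ".join(f"{opt} {ratio:.2f}x"
                    for opt, ratio in speedups(times).items())
    print(f"{platform}: {row}")
```

Each of these figures comes from a single run per option, so small differences such as -O3 versus -Ofast may be within run-to-run noise.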




Friday, March 13, 2020

Lab 5 - AArch64



In Lab 5 we compare AArch64 and x86-64 assembler. The program prints "Hello, World" 30 times, each time followed by its order number.

First part: AArch64 assembler

1. Use the "objdump -d" command and find <main>

"hello"
This hello object code comes from the C code below.

"hello2"


This hello2 object code comes from the C code below.

"hello3"


This hello3 object code comes from the C code below.


2. How to solve ...

To add the order number after "Hello, World", we need to know where the number goes in the string and how to turn the number into ASCII characters.

So the program uses four labels: loop, digit_1, digit_2 and print.
loop: checks whether the order number has one digit or two
digit_1: writes a one-digit number after "Hello, World"
digit_2: splits a two-digit number into its tens and ones digits and writes both after "Hello, World"
print: prints the message and increases the order number until it reaches max
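The digit handling can be sketched in Python before reading the assembly. This is only an illustration of the idea, not part of the lab code: the function name `format_line` is my own, while the message text and the offsets 14 and 15 are taken from the assembly listing in this post. The tens digit is the quotient of a division by 10, and the ones digit is what remains after subtracting quotient * 10, which is what the udiv and msub instructions compute.

```python
def format_line(index):
    """Build one output line the way the digit_1 / digit_2 paths do."""
    msg = bytearray(b"Hello World!:   \n")
    if index > 9:
        tens = index // 10            # like udiv: the quotient
        ones = index - tens * 10      # like msub: index - quotient * 10
        msg[14] = ord("0") + tens     # tens digit at offset 14
        msg[15] = ord("0") + ones     # ones digit at offset 15
    else:
        msg[14] = ord("0") + index    # single digit at offset 14
    return msg.decode("ascii")


for i in range(0, 30):
    print(format_line(i), end="")
```

Adding the digit value to the character '0' works because the ASCII codes for '0' through '9' are consecutive.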

3. Whole code

.text
.globl _start

_start:
        mov     x19,min         /*store the min value into x19 as a loop index*/
        mov     x18,division    /*store the division value(10) into x18*/
loop:
        mov     x0, 1           /* file descriptor: 1 is stdout */

        cmp     x19, 9          /* compare x19 (loop index value) with 9 */
        b.gt    digit_2         /* if the value is greater than 9 (2-digit), go to digit_2 */
        b       digit_1         /* if the value is 9 or less (1-digit), go to digit_1; plain branch, nothing returns here */

digit_1:
        add     x20, x19, '0'   /* convert the digit to its ASCII character */

        adr     x21, msg+14     /* the digit location within the string (x21 as scratch instead of the link register x30) */
        strb    w20, [x21]      /* store the digit at the location */
        b       print           /* go to print */


digit_2:
        udiv    x25, x19, x18   /* divide the value by 10 and store the value into the x25 */
        msub    x26, x25, x18, x19 /* store the remainder into x26 */

        add     x21, x25, '0'   /* ascii number character */
        add     x20, x26, '0'   /* ascii number character */

        adr     x25, msg+14     /* the digit location within string */
        strb    w21, [x25]      /* store the digit at the location */
        adr     x26, msg+15     /* the digit location within string */
        strb    w20, [x26]      /* store the digit at the location */

        b       print           /* go to print */

print:
        adr     x1, msg         /* store the location of the message */
        mov     x2, len         /* store the string length into x2 */

        mov     x8, 64          /* write is syscall #64 */
        svc     0               /* invoke syscall */

        add     x19, x19, 1     /* increment x19 value which is loop index */
        cmp     x19, max        /* compare x19 with max value */

        b.ne    loop            /* if the value is not equal to max value, loop it again */

        mov     x0, 0           /* status -> 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
msg:    .ascii      "Hello World!:   \n"
len= .- msg
min = 0
max = 30
division=10

4. The result


5. What I learned

Through Lab 5 I learned another assembly language, AArch64. It is similar to 6502, so the coding process felt familiar. It is still not easy, but I will keep trying.