I have "inherited" an OmniOS (illumos-based) server.
I noticed that rsync is significantly slower compared to my reference, FreeBSD 12-CURRENT, running on exactly the same hardware.
Using the same hardware and the same command, with the same source and target disks, OmniOS r151026 gives:
test@omniosce:~# time rsync -aPt /zarc/images /home/test/
real 17m25.428s
user 28m33.792s
sys 2m46.217s
In FreeBSD 12-CURRENT:
test@freebsd:~ % time rsync -aPt /zarc/images /home/test/
374.651u 464.028s 11:30.63 121.4% 567+210k 791583+780083io 2pf+0w
I noticed that, under FreeBSD, rsync was running as 3 processes, all with nice=0, two of them consistently using 50% to 70% CPU time.

On OmniOS, rsync was also running as 3 processes, also with nice=0, but each one never used more than 3%.
Could the difference in CPU usage be the reason the execution time is so different between FreeBSD and illumos on the same hardware?
I tried renicing the rsync processes to -20, with similar results. I am also aware of priocntl and scheduling classes, but I was unable to change the speed.
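For reference, this is roughly the kind of invocation I tried (a sketch only: priocntl exists only on Solaris/illumos, so the snippet prints a note elsewhere; the pid in the commented line is hypothetical):

```shell
# Sketch: inspect scheduling classes, then (commented out) move a process
# into the fixed-priority (FX) class. priocntl is illumos/Solaris-only,
# so fall back to a note on other systems.
if command -v priocntl >/dev/null 2>&1; then
    priocntl -l                                   # list available scheduling classes
    # priocntl -s -c FX -m 60 -p 60 -i pid 1234   # hypothetical pid; needs root
else
    echo "priocntl not available (illumos-only)"
fi
```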
How can I improve `rsync` execution time?
Thank you in advance.
Is this a local or remote rsync? Does this involve a network protocol? Or are you rsync'ing between local disks?
If it is remote, then poor performance can be caused by the operating system delaying ACKs (trying to aggregate them into one packet). AIX and Solaris in particular do this.
Thank you very much for your thoughts on this issue. The rsync is between two local disks: the source is an 8-disk vdev, the target an SSD. No network is involved.
I am travelling and will be able to connect to the server on Tuesday; I will then test using only one file, with source and target on the same local disk, to eliminate as many factors as possible.
That rsync is slow while using very little CPU points to an I/O bottleneck: the processes are spending most of their time waiting for disk. Perhaps your system ended up using a fallback driver of some sort.
I think rsync uses 3 processes, and the handshakes between them might not be optimal on a Solaris-based OS.
The Internet suggests using the -W option to reduce handshakes. An upgrade to rsync > 3.0 might also help.
Thanks to the advice received here, I was able to refine my testing, and I now believe it could be an issue with the OmniOS core rsync. I have opened a ticket on github.com (omniosorg/omnios-build/issues/820). I will report back here.
Are you running with a ZFS filesystem which is over 80% full? That can cause a real "go slow".
No, I have been running tests on a new system with just the OS installed:
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 476G 136G 340G - - 0% 28% 1.00x ONLINE -
After the advice received on this forum, I simplified testing; everything now runs on a local SSD.
I created a 10G file (from urandom, to avoid ZFS caching) and got the following results (tested many times; best result reported):
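For reference, the test file can be created like this (scaled down to 64 MiB here for illustration; the actual test used 10G):

```shell
# Create an incompressible test file from /dev/urandom (GNU dd syntax);
# random data cannot be compressed or deduplicated, so the copy cannot
# be satisfied from previously cached identical blocks.
dd if=/dev/urandom of=random.bin bs=1M count=64 2>/dev/null
wc -c < random.bin
```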
### cp to same SSD
root@omniosce:~# time cp random.bin random1.bin
real 0m5.671s
user 0m0.134s
sys 0m4.974s
root@omniosce:~# time rsync -a random.bin random2.bin
real 1m25.644s
user 2m24.261s
sys 0m14.273s
### rsync'ing to same SSD with Joyent pkgsrc rsync
root@omniosce:~# time /opt/local/bin/rsync -a random.bin random1.bin
real 0m31.302s
user 0m40.634s
sys 0m13.994s
The last result is very close to what I get with FreeBSD on the exact same hardware.
So it seems to me that a possible cause lies in the OmniOS core rsync.
The mystery has finally been solved: switching the OmniOS rsync build to 64-bit and enabling optimisation more than doubles the speed of the checksumming code.
Now, on the exact same hardware, rsync on OmniOS is even a bit faster than FreeBSD's.
Thank you to all here, and thanks to the excellent OmniOS devs that helped so promptly and exhaustively.
And thank you for updating us!
FWIW - on an x86-based server, a process running in 32-bit mode has access to eight 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP). ESP is reserved as the stack pointer and EBP is conventionally used as the frame pointer, so unless the compilation process includes optimizations like '-fomit-frame-pointer', the process gets only six freely usable general-purpose registers.
In 64-bit mode, processes have access to sixteen 64-bit general-purpose registers.
So an unoptimized 32-bit process gets to actually use six 32-bit registers, and a 64-bit process gets to actually use fourteen 64-bit registers.
Guess which one's faster on the exact same hardware.
Thank you for the explanation!