Rsync quite slow (using very little cpu): how to improve its speed?

I have "inherited" an OmniOS (illumos-based) server.

I noticed rsync is significantly slower than on my reference system, FreeBSD 12-CURRENT, running on exactly the same hardware.

Using the same hardware and the same command, with the same source and target disks, OmniOS r151026 gives:

test@omniosce:~# time rsync -aPt /zarc/images /home/test/
real    17m25.428s
user    28m33.792s
sys     2m46.217s
 

In FreeBSD 12-CURRENT:

test@freebsd:~ % time rsync -aPt /zarc/images /home/test/
374.651u 464.028s 11:30.63 121.4%    567+210k 791583+780083io 2pf+0w

  • I noticed that, under FreeBSD, rsync was running as 3 processes, all with nice=0, two of them consistently using 50% to 70% CPU time.

  • On OmniOS, rsync was also running as 3 processes, also with nice=0, but none of them ever used more than 3%.

Is the different CPU usage perhaps the reason the execution time differs so much between FreeBSD and illumos on the same hardware?

I tried renicing the rsync process to -20, with similar results.

I am also aware of priocntl and scheduling classes, but I was unable to change the speed with them.

How can I improve `rsync` execution time?

Thank you in advance.

Is this a local or remote rsync? Does this involve a network protocol? Or are you rsync'ing between local disks?

If it is remote, then poor performance can be caused by the operating system delaying ACKs (in an attempt to aggregate them into one packet). AIX and Solaris in particular try to do this.

Thank you very much for your thoughts on this issue. The rsync is between two local disks: the source is an 8-disk vdev, the target an SSD. No network is involved.

I am travelling and will be able to connect to the server on Tuesday, but I will test using only 1 file, with source and target on the same local disk, to eliminate as many factors as possible.

The fact that rsync is slow while using very little CPU points to an I/O bottleneck: the rsync processes are spending most of their time waiting for disk. Maybe your system ended up using a fallback driver of some sort.

I think rsync uses 3 processes, and the hand-shakes between them might not be optimal in a Solaris-based OS.
Online sources suggest using the -W option to reduce hand-shakes. Also, an upgrade to rsync > 3.0 might help.

Thanks to the advice received here, I was able to refine my testing, and I now believe it could be an issue with the rsync shipped in OmniOS core. I have opened a ticket on github.com (omniosorg/omnios-build/issues/820). I will report back here.

Are you running with a ZFS filesystem which is over 80% full? That can cause a real "go slow".

No, I have been running the tests on a new system, with just the OS installed:

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   476G   136G   340G        -         -     0%    28%  1.00x  ONLINE  -

Following the advice on this forum, I simplified the testing, now using only the local SSD.

I created a 10G file (with urandom, to avoid zfs caching), and got the following results (tested many times, best result reported):

### cp to same SSD

root@omniosce:~# time cp random.bin random1.bin

real    0m5.671s
user    0m0.134s
sys     0m4.974s

root@omniosce:~# time rsync -a random.bin random2.bin

real    1m25.644s
user    2m24.261s
sys     0m14.273s

### rsync'ing to same SSD with Joyent pkgsrc rsync

root@omniosce:~# time /opt/local/bin/rsync -a random.bin random1.bin

real    0m31.302s
user    0m40.634s
sys     0m13.994s
 

The last result is very close to what I get with FreeBSD on the exact same hardware.
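For anyone wanting to repeat the comparison, here is a minimal sketch of the test, assuming a POSIX shell with `dd`, `mktemp` and `date +%s` available. The `SIZE_MIB` and `WORKDIR` variables and the function names are illustrative placeholders. The file is kept small by default, so set `SIZE_MIB=10240` to match the 10G file used above.

```shell
#!/bin/sh
# Sketch only: build a random test file, then copy it with cp and with
# whichever rsync is first on PATH. SIZE_MIB and WORKDIR are hypothetical knobs.
SIZE_MIB=${SIZE_MIB:-16}
WORKDIR=${WORKDIR:-$(mktemp -d)}

# make_random_file PATH SIZE_MIB: write SIZE_MIB MiB of /dev/urandom data to PATH.
make_random_file() {
    dd if=/dev/urandom of="$1" bs=1048576 count="$2" 2>/dev/null
}

# elapsed CMD...: run CMD and print the wall-clock time taken, in whole seconds.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

make_random_file "$WORKDIR/random.bin" "$SIZE_MIB"
echo "cp:    $(elapsed cp "$WORKDIR/random.bin" "$WORKDIR/random1.bin")s"
if command -v rsync >/dev/null 2>&1; then
    echo "rsync: $(elapsed rsync -a "$WORKDIR/random.bin" "$WORKDIR/random2.bin")s"
fi
```

The `elapsed` helper only has one-second resolution, so for figures like the ones above it is better to prefix the commands with the shell's own `time`.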

So it seems to me that a possible cause could lie in the OmniOS core rsync.
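The timings above also give a rough hint of where the time goes. As a minimal sketch (the helper names `to_seconds` and `cpu_pct` are made up for illustration, and it assumes a POSIX shell with awk and durations in the `1m25.644s` form printed by bash), dividing user + sys by real shows how much CPU a run consumed. A value well below 100% would suggest waiting on disk, while a value above 100% suggests the processes were busy computing.

```shell
# to_seconds DURATION: convert a bash-style "1m25.644s" duration to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# cpu_pct REAL USER SYS: print (user + sys) / real as a percentage.
cpu_pct() {
    awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" -v s="$(to_seconds "$3")" \
        'BEGIN { printf "%.1f\n", (u + s) / r * 100 }'
}

# Figures copied from the three runs above.
cpu_pct 0m5.671s  0m0.134s  0m4.974s    # cp                -> 90.1
cpu_pct 1m25.644s 2m24.261s 0m14.273s   # OmniOS core rsync -> 185.1
cpu_pct 0m31.302s 0m40.634s 0m13.994s   # pkgsrc rsync      -> 174.5
```

Both rsync runs come out well above 100%, so neither was waiting on the SSD. The difference is in user time: about 144 s for the core rsync against about 41 s for the pkgsrc build.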

The mystery has finally been solved, by switching the OmniOS rsync to 64-bit and enabling optimisation. That more than doubles the speed of the checksumming code.
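To check which flavour of binary is installed, the ELF header can be inspected directly: the byte at offset 4 (EI_CLASS) is 1 for a 32-bit object and 2 for a 64-bit one. A small sketch, assuming a POSIX shell with `od`; `elf_class` is a hypothetical helper name, not an existing tool:

```shell
# elf_class FILE: print 32 or 64 according to the EI_CLASS byte of an ELF file.
elf_class() {
    # od prints the first five bytes as unsigned decimals, e.g. "127 69 76 70 2".
    set -- $(od -An -tu1 -N5 "$1")
    # The first four bytes must be the ELF magic: 0x7f 'E' 'L' 'F'.
    if [ "$1 $2 $3 $4" != "127 69 76 70" ]; then
        echo "not-elf"
        return 1
    fi
    case $5 in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Report on whichever rsync is first on PATH, if there is one.
if rsync_path=$(command -v rsync); then
    echo "$rsync_path: $(elf_class "$rsync_path")"
fi
```

Running `file` on the binary should report the same thing in its "ELF 32-bit" or "ELF 64-bit" description.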

Now, on the exact same hardware, rsync on OmniOS is even a bit faster than FreeBSD's.

Thank you to all here, and thanks to the excellent OmniOS devs who helped so promptly and exhaustively.

And thank you for updating us!

FWIW - on an x86-based server, processes running in 32-bit mode have access to eight 32-bit general-purpose registers, two of which are normally taken by SP (stack pointer) and FP (frame pointer); the PC (program counter) is a separate register. So, unless the compilation process includes optimizations like '-fomit-frame-pointer', the process gets all of six general-purpose registers.

In 64-bit mode, processes have access to sixteen 64-bit general-purpose registers.

So an unoptimized 32-bit process gets to actually use six 32-bit registers, and an optimized 64-bit process gets to actually use fifteen 64-bit registers.

Guess which one's faster on the exact same hardware. :wink:

Thank you for the explanation!