ZFS Filesystem

Hi,
We recently got a new Oracle T5 server and set it up for our database. The database files sit on a single ZFS filesystem. When I run iostat -xc I get the output below. As you can see, the values for vdc4 are quite high.

                 extended device statistics                      cpu
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
vdc0      0.6    3.9   10.8   37.5  0.0  0.0    1.9   0   0  10  7  0 83
vdc1     12.9    2.6 1644.2  309.9  0.0  0.1    7.6   0   1
vdc2      9.5    2.8 1208.8  351.9  0.0  0.1    8.4   0   1
vdc3      0.2    2.4   11.9   38.1  0.0  0.0    1.9   0   0
vdc4    266.6   83.1 32967.7 7561.5  0.0  3.2    9.1   0  65
vdc5      2.4    3.3  301.1  378.2  0.0  0.1   12.6   0   1
vdc6      5.8   52.1  715.3  718.0  0.0  0.1    2.4   0   6
vdc7      3.9   52.1  474.5  717.9  0.0  0.1    2.1   0   6
vdc8      0.0    0.0    0.0    0.0  0.0  0.0    2.3   0   0
nfs1      0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0

When I look at the ::memstat output, ZFS file data is taking a high percentage of memory.

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     355350              2776    8%
ZFS File Data             1660358             12971   40%
Anon                      1874388             14643   45%
Exec and libs               12338                96    0%
Page cache                 176508              1378    4%
Free (cachelist)             6483                50    0%
Free (freelist)            108879               850    3%
Total                    4194304             32768

Is this normal? When we run a full database backup the DB hangs, although the server load stays normal during the backup. Could that be related to the ZFS filesystem settings? I hope you can enlighten me on this.

It has been a long time since I worked with ZFS, but I don't think it is unusual for ZFS to use otherwise unused memory as a cache for ZFS disk data.

Reading 33 MB/s and writing 7.5 MB/s may seem high, but with zero wait time on the device it doesn't appear to be a problem.

Are you seeing a high swap rate (or any other indication that running processes are performing poorly due to a lack of available memory)?
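
For reference, the standard Solaris tools will show that; this is just a generic check, nothing specific to your box:

vmstat 5 5                   # watch the "sr" column; a sustained non-zero scan rate means memory pressure
swap -s                      # summary of swap reserved/allocated/available
echo "::memstat" | mdb -k    # the breakdown you already posted, worth re-checking during the backup window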

It is not unusual for ZFS to eat almost all available memory.
You don't want that with a database, even if you are running it on ZFS filesystems.

I would not recommend running databases on ZFS filesystems, since it takes a lot of tuning to get right. There is also an unresolved issue with fragmentation, so for large implementations I would avoid ZFS for the DB. ASM is the law :wink:

Are those FC or internal disks?
What patchset are you running (hypervisor & LDOM, since I see it is an LDOM)?

Can you please tell us the values of these kernel parameters:

ssd:ssd_max_throttle
zfs:zfs_vdev_max_pending
zfs:zfs_arc_max
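
If it helps, these can usually be read from the running kernel with mdb; treat this as a generic Solaris example (the ssd driver may not even be loaded in an LDOM guest, in which case the first command will report an unknown symbol):

echo "ssd_max_throttle/D" | mdb -k       # sd/ssd per-LUN throttle
echo "zfs_vdev_max_pending/D" | mdb -k   # per-vdev queue depth
echo "zfs_arc_max/E" | mdb -k            # 0 means the built-in default is in effect
echo "::arc" | mdb -k                    # current ARC size, target and ceiling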

Can you post the output of the following command during the problem?

sar -d 2 10 

Take a look at avque; I suspect it is very high during the unresponsive period.
If not, your issue possibly resides with arc_max (confirm that the machine is not swapping, as Don suggested). Lower it to a sane value so your database doesn't run out of PGA space (otherwise it will start swapping, causing extreme slowness).
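
For instance, capping the ARC at 4 GB would be a line like this in /etc/system followed by a reboot (4 GB is purely an illustration; pick a value that leaves room for your SGA/PGA):

set zfs:zfs_arc_max = 0x100000000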

This is wrong; please take a look at the following documentation and read it well:
Tuning ZFS for Database Products - Oracle Solaris 11.1 Tunable Parameters Reference Manual

In short, you will need multiple zpools on different spindles, with different setups for the various DB functions (REDO, ARCH, DATA), and you must keep them under 80% full (this is very important).
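
A quick way to keep an eye on that is plain zpool list; the CAP column is the one to watch:

zpool list    # keep CAP below 80% on every pool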

Hi Peasant,

We are using external NetApp SAN storage. We are actually already using multiple zpools for our database.

The values of our kernel parameters are also below.

How do I get the ssd:ssd_max_throttle value?

The output of sar -d 2 10 is below:

17:18:59   device        %busy   avque   r+w/s  blks/s  avwait  avserv
17:19:01   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       5      16     0.0     0.8
           vdc2             79     7.8     363   87485     0.0    21.6
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     383   95701     0.0    26.1
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:03   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              3     0.1      39     502     0.0     3.3
           vdc2              0     0.0       0       0     0.0     0.0
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     555  141827     0.0    18.0
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:05   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2             79     7.8     470  119166     0.0    16.7
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     448  114029     0.0    22.3
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:07   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.4
           vdc2            100    10.0     528  133947     0.0    18.9
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     358   89443     0.0    27.9
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:09   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2            100     9.6     589  144250     0.0    16.2
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     434  109155     0.0    23.0
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:11   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2            100    10.0     658  167231     0.0    15.2
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     379   97137     0.0    26.3
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:13   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2            100    10.0     585  148283     0.0    17.1
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     424  108129     0.0    23.6
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:15   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2             14     1.2      95   14436     0.0    12.8
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     462  118123     0.0    21.6
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:17   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2             40     3.9     225   56873     0.0    17.3
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     485  122068     0.0    20.5
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0
17:19:19   nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       0       0     0.0     0.0
           vdc2            100    10.0     542  137370     0.0    18.4
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     459  109974     0.0    21.7
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0

Average    nfs1              0     0.0       0       0     0.0     0.0
           vdc0              0     0.0       0       0     0.0     0.0
           vdc1              0     0.0       4      52     0.0     3.0
           vdc2             71     7.0     405  100893     0.0    17.3
           vdc3              0     0.0       0       0     0.0     0.0
           vdc4            100    10.0     439  110551     0.0    22.8
           vdc5              0     0.0       0       0     0.0     0.0
           vdc6              0     0.0       0       0     0.0     0.0
           vdc7              0     0.0       0       0     0.0     0.0
           vdc8              0     0.0       0       0     0.0     0.0

We are using a guest domain (VM). Any advice?

Maybe a good starting point:

There is a lot of misunderstanding around this topic. All file systems will eat as much memory as they find useful, not just ZFS; unused memory is wasted memory anyway.

The big differences are:

  • ZFS memory, including the ARC, is reported as used/unavailable, while other file systems' memory (the buffer cache and the page cache) is reported as free/available.

  • ZFS memory is released asynchronously and gradually in response to RAM demand, while other file systems' memory is released synchronously and (almost) instantaneously. Where that matters is when an application requests a very large amount of non-pageable memory, as the allocation might fail. The arc_max tuning prevents ZFS from using all the RAM, helping these allocations to succeed.
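
If you want to watch that behaviour on a live system, the ARC counters are exposed through kstat; something along these lines shows the current size against its target and ceiling:

kstat -p zfs:0:arcstats:size     # current ARC size in bytes
kstat -p zfs:0:arcstats:c        # current target size
kstat -p zfs:0:arcstats:c_max    # ceiling (what zfs_arc_max caps)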

Also: snapshots. Minimize them. More snapshots will mean more I/O. I had issues similar to this a while back, and it all came down to snapshots and zfs_arc_max.

Can you elaborate on that point?
I would not expect the presence of snapshots to have a significant effect on the number of I/Os.

More snapshots = more writing to the delta log, which contributes to greater I/O. If you have complex ZFS layouts it's exponential. snapshot -r /rpool will snapshot every subordinate filesystem and will then cause any changes to have to be written against each snapshot. 50 child filesystems, 10 snapshots each, you get the idea...

I'm afraid I don't get it. Snapshots are read-only by design, so they cannot be the target of write operations. On the other hand, creating them has a small overhead, and destroying them might have a bigger one. The latter is to be balanced against the fact that having snapshots reduces the number of I/Os in case of file removal, as the data blocks, still being referenced by the snapshot(s), need not be marked as free.

While I admittedly don't know the specifics of how it works, I do know that a ZFS snapshot is a delta of the filesystem, so it must be recording those deltas somewhere. The older the snap, the larger it grows, and the more snaps, the more writes.

I can only tell you from practical experience that removing snapshots DOES improve performance.

The delta is not written. The data already exists in the original filesystem.

For instance, say you have 4 files of 20 GB each on a ZFS filesystem inside a 100 GB zpool.
The zpool's current space utilization is 80%.
For the sake of argument, there is only one filesystem in that zpool.

A snapshot is made of that ZFS filesystem.
You delete one of the four 20 GB files.

The zpool will stay at 80%: since the snapshot still references the deleted data, the data is not actually removed from the zpool.

You issue zfs destroy on the snapshot. That operation is what actually frees the data in the zpool.

This is how I understand it; feel free to correct me :smiley:
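
You can watch exactly that accounting with the zfs space columns; the dataset names below are just placeholders taken from the earlier output:

zfs list -o space ora2pool/ora2                            # USEDSNAP is space held only by snapshots
zfs list -r -t snapshot -o name,used,referenced ora2pool   # per-snapshot usage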

As for the ARC:

The problem is that it doesn't work well if a program requests a very large memory chunk (such as an Oracle database): it will request the memory, and if it is not granted within a certain time, the system will start swapping.

This is why I generally avoid ZFS filesystems for Oracle databases and use ASM with a limited ZFS ARC.
Take a look at the documentation regarding ZFS and databases; it requires a lot of love and attention.

I'd rather give that love to something else and run ASM :cool:
Chip in a good SSD or a local flash cache card as a cache device for Oracle, pin a couple of monster indexes in it, and go get some beer :slight_smile:

On the other hand, on a several-TB Solaris Cluster with ZFS acting as an NFS server, I haven't touched that tunable.
The machines work fine with 95% of memory consumed, mostly by the filesystems using the memory as ARC (which is desired there).

This is incorrect; a snapshot is frozen dataset content. What you call the delta is written once, on the live file system.

The snapshot's delta is already there; there is no need to record it.

Yes, if the file system is evolving.

No, there is no write inflation.

Perhaps you had rolling snapshots in place?

Hi jlliagre, Peasant, Don,

Any idea how to solve my issue with the DB hanging during backup (RMAN)? I already gave the sar -d output earlier. How do I fine-tune the ZFS parameters, especially arc_max?
Will changing the parameter cause any data loss? Please help.

Did you read the document DukeNuke2 posted?

Hi jlliagre,
I have read it and found that some ZFS parameter values need to be changed. Is it safe to change the recommended parameters? Will it affect the data on that filesystem?

Which ones?
What is the busy file system used for?

Hi jlliagre,

We currently use the affected filesystem to store database-related files such as tables and indexes.

Below are our current ZFS settings. I have mostly heard that the arc_max, zfs:zfs_vdev_max_pending, and ssd:ssd_max_throttle parameters need fine-tuning. Is that right?

arc_reduce_dnlc_percent = 0x3
zfs_arc_max = 0x0
zfs_arc_min = 0x0
arc_shrink_shift = 0x7
zfs_mdcomp_disable = 0x0
zfs_prefetch_disable = 0x0
zfetch_max_streams = 0x8
zfetch_min_sec_reap = 0x2
zfetch_block_cap = 0x100
zfetch_array_rd_sz = 0x100000
zfs_default_bs = 0x9
zfs_default_ibs = 0xe
metaslab_aliquot = 0x80000
mdb: variable reference_tracking_enable not found: unknown symbol name
mdb: variable reference_history not found: unknown symbol name
spa_max_replication_override = 0x3
spa_mode_global = 0x3
zfs_flags = 0x0
zfs_txg_synctime_ms = 0x1388
zfs_txg_timeout = 0x1e
zfs_write_limit_min = 0x2000000
zfs_write_limit_max = 0xfb4d0c00
zfs_write_limit_shift = 0x3
zfs_write_limit_override = 0x0
zfs_no_write_throttle = 0x0
zfs_vdev_cache_max = 0x4000
zfs_vdev_cache_size = 0x0
zfs_vdev_cache_bshift = 0x10
vdev_mirror_shift = 0x15
zfs_vdev_max_pending = 0xa
zfs_vdev_min_pending = 0x4
zfs_vdev_future_pending = 0xa
zfs_scrub_limit = 0xa
zfs_no_scrub_io = 0x0
zfs_no_scrub_prefetch = 0x0
zfs_vdev_time_shift = 0x6
zfs_vdev_ramp_rate = 0x2
zfs_vdev_aggregation_limit = 0x20000
fzap_default_block_shift = 0xe
zfs_immediate_write_sz = 0x8000
zfs_read_chunk_size = 0x100000
zfs_nocacheflush = 0x0
zil_replay_disable = 0x0
metaslab_gang_threshold = 0x100001
metaslab_df_alloc_threshold = 0x100000
metaslab_df_free_pct = 0x4
zio_injection_enabled = 0x0
zvol_immediate_write_sz = 0x8000

One more thing: the documentation says the logbias setting needs to be changed for the database filesystem.

My current database filesystem settings are below:

NAME           PROPERTY       VALUE         SOURCE
ora2pool/ora2  primarycache   all           default
ora2pool/ora2  recordsize     128K          default
ora2pool/ora2  compressratio  1.00x         -
ora2pool/ora2  compression    off           default
ora2pool/ora2  available      351G          -
ora2pool/ora2  used           484G          -
ora2pool/ora2  quota          none          default
ora2pool/ora2  logbias        latency       default

You missed the "Number One rule".

As the file system stores tables and indexes, tune the recordsize setting. It should probably be 8k instead of 128k, but it is too late for the parameter to affect the existing files; look for "Important Note:" in the white paper for a workaround.
Properly tuning the record size is known to dramatically reduce the number of I/Os in some use cases, although not necessarily in yours.
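
In practice that means something like the following on the data filesystem (dataset name taken from your output; 8k assumes the default Oracle db_block_size, and a new recordsize only applies to files written after the change, so existing datafiles have to be copied or restored to pick it up):

zfs set recordsize=8k ora2pool/ora2
zfs set logbias=throughput ora2pool/ora2    # if I read the white paper right: throughput for datafiles, latency for redo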

A little late here, but...

It's much worse than that on a server running Oracle database instance(s). The ZFS ARC does not play nice with Oracle databases. At all:

  1. ZFS ARC expands to use all free memory - as 4k pages.
  2. Oracle DB has a transient demand for memory - but it requests large pages (4 MB IIRC).
  3. Entire server comes to an effective screeching halt while VM management is hung coalescing large pages.
  4. Oracle DB releases the large pages, ZFS ARC grabs them and fragments them.
  5. Repeat.

If the server is used just as a database server, limit the ARC to under 1 GB, if not smaller. After rebooting, check to be sure the ARC is actually limited to what you specified; if you go too small, your limit will be ignored.
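
A quick way to verify that after the reboot is the kstat ARC counters (a generic Solaris check, nothing box-specific):

kstat -p zfs:0:arcstats:c_max    # the ceiling the kernel actually accepted
kstat -p zfs:0:arcstats:size     # what the ARC is currently using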