Memory fragmentation in a Linux set-top box

As a moderator at openATV, a forum for Linux set-top boxes, I have seen reports of, and sometimes experience myself, artefacts during video playback or timeshift.
As the artefacts are not repeatable (rewinding and watching the same scene again does not show them), I can exclude a corrupted video source.

We found that each artefact (on average up to one per minute) correlates 100% with an entry in /var/log/messages like:

Jun 29 14:54:54 ventonhdx user.warn kernel: enigma2: page allocation failure: order:5, mode:0xd0
Jun 29 14:54:54 ventonhdx user.warn kernel: Call Trace:
Jun 29 14:54:54 ventonhdx user.warn kernel: [<805ff9d0>] dump_stack+0x8/0x34
Jun 29 14:54:54 ventonhdx user.warn kernel: [<80091f0c>] warn_alloc_failed+0xe4/0x124
Jun 29 14:54:54 ventonhdx user.warn kernel: [<80094690>] __alloc_pages_nodemask+0x434/0x6e8
Jun 29 14:54:54 ventonhdx user.warn kernel: [<800cb6c8>] cache_alloc_refill+0x318/0x8c0
Jun 29 14:54:54 ventonhdx user.warn kernel: [<800cbdc4>] __kmalloc+0x154/0x19c
Jun 29 14:54:54 ventonhdx user.warn kernel: [<800a8900>] memdup_user+0x24/0x94
Jun 29 14:54:54 ventonhdx user.warn kernel: [<8045e198>] dvbdmx_write+0x48/0xd0
Jun 29 14:54:54 ventonhdx user.warn kernel: [<800cf8b0>] vfs_write+0x9c/0x184
Jun 29 14:54:54 ventonhdx user.warn kernel: [<800cfcd8>] sys_write+0x50/0xb0
Jun 29 14:54:54 ventonhdx user.warn kernel: [<8000e928>] stack_done+0x20/0x44
Jun 29 14:54:54 ventonhdx user.warn kernel: Mem-Info:
Jun 29 14:54:54 ventonhdx user.warn kernel: Normal per-cpu:
Jun 29 14:54:54 ventonhdx user.warn kernel: CPU    0: hi:  186, btch:  31 usd:   0
Jun 29 14:54:54 ventonhdx user.warn kernel: CPU    1: hi:  186, btch:  31 usd: 173
Jun 29 14:54:54 ventonhdx user.warn kernel: active_anon:11037 inactive_anon:11111 isolated_anon:0
Jun 29 14:54:54 ventonhdx user.warn kernel:  active_file:4120 inactive_file:24772 isolated_file:0
Jun 29 14:54:54 ventonhdx user.warn kernel:  unevictable:0 dirty:6121 writeback:1050 unstable:0
Jun 29 14:54:54 ventonhdx user.warn kernel:  free:12176 slab_reclaimable:1301 slab_unreclaimable:1821
Jun 29 14:54:54 ventonhdx user.warn kernel:  mapped:997 shmem:69 pagetables:129 bounce:0
Jun 29 14:54:54 ventonhdx user.warn kernel: Normal free:62716kB min:2876kB low:3592kB high:4312kB active_anon:44148kB inactive_anon:44444kB active_file:16480kB inactive_file:85132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:518144kB mlocked:0kB dirty:
Jun 29 14:54:54 ventonhdx user.warn kernel: lowmem_reserve[]: 0 0
Jun 29 14:54:54 ventonhdx user.warn kernel: Normal: 3575*4kB 4431*8kB 2239*16kB 78*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 88260kB
Jun 29 14:54:54 ventonhdx user.warn kernel: 21111 total pagecache pages
Jun 29 14:54:54 ventonhdx user.warn kernel: 3465 pages in swap cache
Jun 29 14:54:54 ventonhdx user.warn kernel: Swap cache stats: add 4601, delete 1136, find 7/9
Jun 29 14:54:54 ventonhdx user.warn kernel: Free swap  = 14448kB
Jun 29 14:54:54 ventonhdx user.warn kernel: Total swap = 32764kB
Jun 29 14:54:54 ventonhdx user.warn kernel: 131072 pages RAM
Jun 29 14:54:54 ventonhdx user.warn kernel: 58359 pages reserved
Jun 29 14:54:54 ventonhdx user.warn kernel: 14100 pages shared
Jun 29 14:54:54 ventonhdx user.warn kernel: 34321 pages non-shared
Jun 29 14:54:54 ventonhdx user.warn kernel: SLAB: Unable to allocate memory on node 0 (gfp=0xd0)
Jun 29 14:54:54 ventonhdx user.warn kernel:   cache: size-131072, object size: 131072, order: 5
Jun 29 14:54:54 ventonhdx user.warn kernel:   node 0: slabs: 4/4, objs: 4/4, free: 0

This seems to indicate severe memory fragmentation: although enough total memory is available, the supply of contiguous 128kB (order 5) blocks is low.
This is not always the case. After starting the box, or after "echo 3 > /proc/sys/vm/drop_caches", there is a lot of memory available:

root@gbquad:~# cat /proc/buddyinfo
Node 0, zone   Normal    133    314    249    919   1558    655    178     23      0      0      1

Within the next few minutes, the caches fill up until approx. 6MB of RAM are left. Each column of /proc/buddyinfo counts the free blocks of one size, doubling from 4kB on the left to 4MB on the right. In the "good state", fragmentation is low; note the trailing 1, a free 4MB segment:

root@gbquad:~# cat /proc/buddyinfo
Node 0, zone   Normal    232    160      0      0      0      0      0      0      0      0      1

In the "bad state", memory is severely fragmented, resulting in allocation failures and playback artefacts:

root@gbquad:~# cat /proc/buddyinfo
Node 0, zone   Normal   1409    350     16      1      0      0      0      0      0      0      0 
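
For easier reading, here is a small awk sketch (assuming 4kB pages and the line format shown above) that translates the buddyinfo counts into block sizes:

awk '/Normal/ { size = 4; for (i = 5; i <= NF; i++) { printf "%dkB: %s free\n", size, $i; size *= 2 } }' /proc/buddyinfo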

Unfortunately, we have not yet found out what is causing the "bad state". I have so far only seen it after configuring the timeshift buffer to be on a USB stick and then moving it back to the HDD.

We have tried some approaches that treat the symptoms (without addressing the root cause):

Clearing caches
"echo 3 > /proc/sys/vm/drop_caches" is freeing up memory, and executing this every 3 minutes in a cron job seems to be helpful.
Many Linux users may say that dropping caches is a bad idea. And yes, dropping them and allowing them to fill again in a cyclic manner is definitely a waste of performance, so avoiding or reducing caching from the start would probably be better. In contrast to a Linux PC executing the OS and programs from HDD, these settop boxes never execute code from HDD but from built-in Flash memory, so caching of CPU code is not required. The cache used for video data may be required for "reading ahead" and thus guaranteeing a continuous stream, but data actually may be dropped after playing it. Having said this, this may actually be critical, there is a risk that data is dropped that is just about to be played. I know very little about this and cannot say whether this is an issue. I'm also not sure what other data is being cached, I can see the cache fill up (much more slowly) with timeshift disabled.

Memory compaction
"echo 1 > /proc/sys/vm/compact_memory", (with CONFIG_COMPACTION=y), executed regularly in a cron job may be helpful as well, though I have not been able to test yet whether in the "bad state" the fragmentation is actually improving (on my box, the "bad state" is rare).

Swap
Some users reported an improvement after installing swap on a USB stick. Other experiments show that swap, though installed, is hardly being used, and I'm also a bit concerned about the access time of swap on a USB stick.

I would be grateful for thoughts and hints, especially about strategies for finding the root cause of the memory fragmentation, knowing that this may be very difficult without detailed knowledge of the set-top box internals.

I have made the observation that the Linux kernel has performance hiccups under continuous I/O.
This has to do with its buffering of anything and everything, according to the dogma "every unused byte of memory is wasted memory". If allocation happens too fast, freeing buffers can take a long time.
But the term "memory fragmentation" does not fit here.
All commercial Unix kernels have configurable limits for buffers and caches, and behave smoothly in such a situation (and a little slower in others).
Long ago I had a SuSE PAE kernel that even invoked the OOM killer under continuous I/O. The following program, invoked every 5 minutes, helped:

#!/bin/bash
dropcaches=/proc/sys/vm/drop_caches
meminfo=/proc/meminfo
[ -e $dropcaches ] || exit
# read LowFree from /proc/meminfo; keep the value only if below 100MB
lowfree=$(
while read key val kb
do
  if [ "$key" = "LowFree:" ]; then
    [ "$val" -lt 102400 ] && echo "$val"
    break
  fi
done < $meminfo
)
# plenty of low memory free: nothing to do
[ -z "$lowfree" ] && exit
#enable shortly
echo 0 > $dropcaches
sleep 1
# drop the buffer/page cache, leave only the dentry/inode cache
sync
echo 1 > $dropcaches

On a 64-bit kernel (where /proc/meminfo has no LowFree line), only the following should be called every 5 or 10 minutes:

#!/bin/bash
dropcaches=/proc/sys/vm/drop_caches
echo 1 > $dropcaches

In addition, try to make the kernel write dirty data to disk in smaller portions, in /etc/sysctl.conf:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

I don't think that dropping caches will help much at all. They are already considered "Free", and the non-free stuff remains non-free, so how can dropping caches defragment anything?

What is your kernel?

Do you know if your application uses hugepages?

What's using all that cache?

IO? For streaming large files, caching is irrelevant because you're almost never going to need to replay any one block. Use direct IO and bypass the page cache, and quit thrashing memory.

The cache cannot be considered free if the mentioned drop_caches takes over 1 minute... please measure yourself!

Thanks for your replies.

@MadeInGermany, such a script is certainly better than unconditionally dropping caches every few minutes. I may adjust it so that it drops caches based on fragmentation rather than LowFree (see the sketch below). Still, it is only a (better) way of treating symptoms.
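
For illustration, a minimal sketch of such an adjusted script, assuming 4kB pages (so that order 5 corresponds to the failing 128k allocations), a single "Normal" zone line in /proc/buddyinfo, and an arbitrary threshold of 8 blocks that would need tuning:

#!/bin/bash
# Sketch: drop caches only when contiguous 128kB blocks run low.
buddyinfo=/proc/buddyinfo
dropcaches=/proc/sys/vm/drop_caches
[ -e $buddyinfo ] || exit
# The last 6 columns of the "Normal" line count the free blocks of
# order 5..10, i.e. 128kB..4MB with 4kB pages; sum them up.
high=$(awk '/Normal/ { n = 0; for (i = NF - 5; i <= NF; i++) n += $i; print n; exit }' $buddyinfo)
[ -z "$high" ] && exit
# enough large blocks left: do nothing
[ "$high" -ge 8 ] && exit
# running low on 128kB blocks: flush dirty data, then drop the page cache
sync
echo 1 > $dropcaches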

@achenle: Yes, it's IO cache, and I share your view that video stream data should not be cached. Can you give me more details on how the cache can be bypassed, or is this too specific to the application?

@Corona688: It's a 3.3.8 kernel. How would I detect whether hugepages are used? By looking at a defconfig file? Can you give me a keyword?

In the meantime I found that in the "bad state", executing "echo 1 > /proc/sys/vm/compact_memory" improves the fragmentation only a little. A colleague thinks this is because many small memory pages are blocked.

Is there any command for finding out which memory pages have been allocated by which process?

Does your app request physically contiguous pages of heap memory for its cache? Oracle does that. This triggers the onset of heap (memory) fragmentation especially quickly in a NUMA environment, i.e., on most modern commodity CPUs, for example multicore x86. It also causes system performance problems by forcing a core to access memory increasingly inefficiently, all the way across the bridge via another core.

This article discusses ways to invoke memory compaction and, among other things, deals with memory fragmentation. It may not help if your app does something verging on the unreasonable, like the Oracle DB engine does. Also note: Java apps that use huge heap sizes may force the JVM to reallocate memory and cause memory fragmentation for the same reason Oracle does. Usually this is controllable with the Java parameter settings for the maximum JVM heap size.

Memory compaction [LWN.net]

You use direct IO on Linux by calling open() with the O_DIRECT flag. The buffers used for read()/write() calls might need to have a specific alignment, most likely page alignment; see the man pages for valloc() and memalign(). Also, both the number of bytes transferred by each read()/write() call and the file offset read from/written to might need to be an exact multiple of a fundamental filesystem or hardware block size. That could cause problems reading/writing the last block of a file, depending on the version of Linux you have, the file system you're using, and maybe even your hardware.
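
As an aside, the effect can be observed from the shell without touching the application, since GNU dd can request O_DIRECT. A sketch, assuming GNU coreutils and a made-up recording path; as described above, the block size must be compatible with the filesystem block size:

# read a recording without going through the page cache
# (/media/hdd/movie/test.ts is a hypothetical example path)
dd if=/media/hdd/movie/test.ts of=/dev/null bs=1M iflag=direct
# compare how much the page cache grows with and without iflag=direct
grep ^Cached: /proc/meminfo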

I can see why this would cause issues when contiguous pages aren't available, but why does using contiguous pages cause memory to become discontinuous?

Oracle doesn't technically request "contiguous" pages, it requests large pages. To get those large pages, the OS must coalesce smaller pages.

Oracle uses the large page, releases it, then something else requests normal size pages and the large page gets fragmented. Oracle comes back and requests larges pages...

Rinse, lather, repeat.

The coalescing necessary to create the large pages can have some nasty performance impacts, as it tends to lock up virtual memory management while it's happening. And in any OS instance, VM management tends to get very single-threaded when memory gets tight. Processes won't start, or they hang, because fork() and brk() calls block while the virtual memory manager thrashes about.

This can be really bad with Oracle on Solaris using ZFS because the ZFS ARC cache uses lots of small pages and isn't exactly quick in letting them go, and the standard recommendation is "let the ARC cache get as big as it wants, it doesn't hurt anything." Umm, wrong.

Wonderful explanation, thank you.

I had ZFS ARC 'issues' on Oracle database machines (databases using ASM).

Since there is a lot of free memory, ZFS eats most of it. Then the applications start to hit the database, causing a significant rise in PGA.

Although ZFS holds a lot of GB, it is slow to release them, and the machine starts swapping, bringing everything almost to a halt.

I resolved it by lowering the ARC cache maximum to 4 GB on every database machine.
These are LDOMs on Solaris 11.1 T5 SPARC machines.

Regards
Peasant.

You can also look at changing the kernel parameter vm.swappiness.

The best thing is to have enough memory so that you don't swap. If you have enough memory (lots), you can look at vm.nr_hugepages to improve system performance, so the kernel doesn't have to scan through as many pages. It all depends on the applications and on what else you are running.
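
For reference, a sketch of the corresponding /etc/sysctl.conf entries; the values below are arbitrary placeholders that would have to be tuned for the box, not recommendations:

vm.swappiness = 10
vm.nr_hugepages = 16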

Yes, I have enough memory (typ. ~400MB "MemAvailable" used for caches, with ~6MB really free, and no swap required or ever used), so I will look into "hugepages", thanks.

But I'm wondering whether my understanding of the problem has been wrong all along. When a "page allocation failure" like the one in post #1 or the one below is reported, which of the following does it mean?

Jul  1 23:31:47 gbquad user.warn kernel: enigma2: page allocation failure: order:4, mode:0xd0
Jul  1 23:31:47 gbquad user.err kernel: Call Trace:
Jul  1 23:31:47 gbquad user.err kernel: [<806092f8>] dump_stack+0x8/0x34
Jul  1 23:31:47 gbquad user.err kernel: [<8008b60c>] warn_alloc_failed+0xe4/0x124
Jul  1 23:31:47 gbquad user.err kernel: [<8008dcc4>] __alloc_pages_nodemask+0x448/0x6e0
Jul  1 23:31:47 gbquad user.err kernel: [<800beb60>] cache_alloc_refill+0x390/0x65c
Jul  1 23:31:47 gbquad user.err kernel: [<800bf004>] __kmalloc+0x108/0x130
Jul  1 23:31:47 gbquad user.err kernel: [<8009ecd0>] memdup_user+0x24/0x94
Jul  1 23:31:47 gbquad user.err kernel: [<80488218>] dvbdmx_write+0x44/0xd8
Jul  1 23:31:47 gbquad user.err kernel: [<e1125910>] dev_dmx_demux_write_hook+0xc0/0xec [dvb]
Jul  1 23:31:47 gbquad user.err kernel: [<e1125c50>] dev_dmx_dvr_write_hook+0x15c/0x190 [dvb]
Jul  1 23:31:47 gbquad user.err kernel: [<800c29dc>] vfs_write+0x9c/0x184
Jul  1 23:31:47 gbquad user.err kernel: [<800c2e04>] sys_write+0x50/0xb0
Jul  1 23:31:47 gbquad user.err kernel: [<8000d8e8>] stack_done+0x20/0x44
Jul  1 23:31:47 gbquad user.err kernel: Mem-Info:
Jul  1 23:31:47 gbquad user.err kernel: Normal per-cpu:
Jul  1 23:31:47 gbquad user.err kernel: CPU    0: hi:  186, btch:  31 usd:   0
Jul  1 23:31:47 gbquad user.err kernel: CPU    1: hi:  186, btch:  31 usd:  11
Jul  1 23:31:48 gbquad user.err kernel: active_anon:17978 inactive_anon:48 isolated_anon:0
Jul  1 23:31:48 gbquad user.err kernel:  active_file:54255 inactive_file:55640 isolated_file:0
Jul  1 23:31:48 gbquad user.err kernel:  unevictable:0 dirty:1955 writeback:0 unstable:0
Jul  1 23:31:48 gbquad user.err kernel:  free:1490 slab_reclaimable:1770 slab_unreclaimable:3355
Jul  1 23:31:48 gbquad user.err kernel:  mapped:2828 shmem:80 pagetables:116 bounce:0
Jul  1 23:31:48 gbquad user.err kernel: Normal free:5712kB min:4072kB low:5088kB high:6108kB \
    active_anon:71912kB inactive_anon:192kB active_file:217020kB inactive_file:222832kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1038336kB mlocked:0kB dirty:
Jul  1 23:31:48 gbquad user.err kernel: lowmem_reserve[]: 0 0
Jul  1 23:31:48 gbquad user.err kernel: Normal: 716*4kB 298*8kB 38*16kB 3*32kB 4*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6208kB
Jul  1 23:31:48 gbquad user.err kernel: 109816 total pagecache pages
Jul  1 23:31:48 gbquad user.err kernel: 0 pages in swap cache
Jul  1 23:31:48 gbquad user.err kernel: Swap cache stats: add 0, delete 0, find 0/0
Jul  1 23:31:48 gbquad user.err kernel: Free swap  = 0kB
Jul  1 23:31:48 gbquad user.err kernel: Total swap = 0kB
Jul  1 23:31:48 gbquad user.err kernel: 327680 pages RAM
Jul  1 23:31:48 gbquad user.err kernel: 184924 pages reserved
Jul  1 23:31:48 gbquad user.err kernel: 19064 pages shared
Jul  1 23:31:48 gbquad user.err kernel: 123681 pages non-shared

(a) No contiguous 128k block is available within the "really free" 6MB, but one is then freed automatically within the 400MB "MemAvailable" by dropping caches, so the allocation succeeds in the end; the time required for this causes the dropout and the artefacts in the video stream.

(b) No contiguous 128k block is available at all, neither in the "really free" 6MB nor in the 400MB "MemAvailable", but one can be made available by memory compaction (again, the time required causes the artefacts in the video stream).

(c) No contiguous 128k block is available at all and none can be made available, not even by memory compaction, because the whole ~400MB of memory is so badly fragmented with blocked small allocations that the 128k allocation fails permanently, and the missing block causes the dropout in the video stream.

Is anyone able to answer this reliably, or is additional information from my side required?

This:
lowmem_reserve[]: 0 0

and this:
active_anon:71912kB inactive_anon:192kB active_file:217020kB

These say the following to me (I mention ZFS because I'm most familiar with it, not because I think you have ZFS filesystems):

anon is heap in use; active file is file caching.

anon requires contiguous blocks of a certain minimum size. But file caching ate it all, by chewing up small chunks all over the place. And file caching is considered 'available space', even though making it available can take eons in computer terms. Think of having to sort a scrambled deck of cards every time you want to make a new play in your card game. Game play slows down. A lot.

Or.

Think of a half-full parking lot. You have 100 spaces in total; 50 are used, 50 are free. What is the largest block of free spaces? Knowing that 50 are free does not mean that 50 spaces next to each other are available. The parking lot is owned by Mr Kernel, who thinks he has lots of room for more parking. But you, by some arbitrary dictate, require the spaces for the 40 more cars you want to park to be contiguous.

A lot of applications require allocating contiguous pages, but there can be small pieces between them. What happens when the process that owns such a small chunk goes away? You get a memory fragment. So stingy memory programming in one app and major land grabs of memory in another often lead to memory fragmentation.

As achenle mentioned, some OS implementations of ZFS do the stingy thing for caching, while Oracle does the county-wide land-grab memory-hog thing. The two do not play well together. Contrariwise, large amounts of RAM are more likely to trigger worse fragmentation, because caching "thinks" the whole world is open and pollutes larger areas with small chunks. Cleanup becomes more and more costly time-wise.

In your case it seems that the player is directly reading huge files (in large chunks) into RAM, and file caching may then be polluting free space by littering the landscape with small file-cache chunks.

Thank you, that sounds very reasonable.

I would think that the main process playing back a file, or the timeshift function, is caching video data in large chunks as you describe. So if these processes were running alone, memory could probably be freed up quickly when needed, right?

Now my problem is identifying which process(es) are allocating the "small chunks" that cause the problem. Any idea? Is there maybe some tool for displaying which memory chunk has been allocated by which process?

That depends on how you're doing your IO operations. How are your apps coded to read and/or stream your files? Do you have control over your applications' IO? What filesystem(s) are you using for data?

Assuming you're streaming video files without much random seeking, you should be using direct IO and bypassing the cache, since it's extremely unlikely that the proper file data will be cached when you do seek.

Direct IO will be faster and it won't fragment memory because you won't be using the page cache.

Caching of file data only helps when data can be held in memory long enough to allow multiple reads of the same data, or when write operations are small and/or slow enough to be effectively coalesced into a smaller number of write operations. Streaming or copying large files fits neither of those criteria.

I agree that caching a video stream does not make sense during pure playback (as long as there is a read-ahead buffer guaranteeing a continuous data stream if IO is briefly interrupted). However, there could be situations like rewinding or jumping backwards where the cache is useful, and I guess it has always been like this in enigma2 and may be difficult to change for practical or political reasons.

The filesystem is ext4. Your questions about how the apps are coded are fair, of course, but there are so many system tasks and plugins active that analyzing each of them is unfortunately not practicable. That is why I was looking for a way of dumping which process has allocated which memory blocks, hoping to identify one worst offender that could be rewritten to use larger segments and cause less fragmentation.

Enigma2? Isn't that written all in Python?

You're trying to run a service that pushes the limits of the hardware it runs on, written in a scripting language?

You can't run hardware at its design limits and ignore the design. Script languages ignore underlying details.

The GUI of enigma2 as well as the plugins ('apps') are written in Python, but all the timing-critical core stuff, including video playback and timeshift, is written in C++, so that on recent chips the CPU load is usually extremely low during such basic tasks, even when recording multiple services in parallel.

But you may be right that even a single Phyton script running on top of this may be allocating memory in small chunks, causing the problems.