OS: Tuning Manajemen Memory

Sumber: http://doc.opensuse.org/documentation/html/openSUSE/opensuse-tuning/cha.tuning.memory.html

Untuk mengerti dan men-tune perilaku manajemen memory di kernel, akan sangat penting untuk mengetahui bagaimana cara dia bekerja dengan subsistem lainnya.

Subsistem menajemen memory, biasa di sebut virtual memory manager, yang selanjutnya akan di sebut "VM". Fungsi VM adalah untuk mengatur alokasi dari memory fisik (RAM) untuk seluruh kernel dan user program. VM juga bertanggung jawab untuk memberikan lingkungan virtual memory untuk keperluan user (di managed melalui API POSIX dengan extensi Linux). Terakhir, VM bertanggung jawab untuk membebaskan RAM jika terjadi kekurangan, baik dengan memotong cache atau swapping "anonymous" memory.

Hal paling penting untuk dimengerti ketika mempelajari dan men-tune VM adalah bagaimana cache di manaje. Tujuan utama VM cache adalah untuk meminimalkan cost dai I/O yang di sebabkan oleh swaping dan operasi file sistem (termasuk network file system). Hal ini dapat dicapai dengan menghindari I/O secara keseluruhan, atau dengan memasukan I/O ke pola yang lebih baik.

Free memory akan digunakan dan di isi oleh cache jika dibutuhkan. Semakin banyak memory yang tersedia untuk cache dan anonymous memory, semakin effektif operasi cache dan swapping. Akan tetapi, jika terjadi kekurangan memory, cache akan dipangkas atau memory akan di swap-kan.

Untuk beban kerja tertentu, hal pertama yang dapat dilakukan untuk memperbaiki performance adalah menaikan memory dan mengurangi frekuensi dimana memory harus dipangkas atau di swap-kan. Hal kedua adalah ubah cara mengatur cache dengan mengubah parameter kernel.

Terakhir, beban kerja itu sendiri harus dipelajari dan di tune. Jika aplikasi di ijinkan untuk menjalankan lebih banyak proses atau thread, effektifitas cache VM akan berkurang, jika setiap proses beroperasi di wilayah-nya sendiri di file system. Overhead memory akan bertambah. Jika aplikasi meng-alokasikan sendiri buffer atau cache yang digunakan, semakin besar cache berarti lebih sedikit memory yang tersedia untuk cache VM. Akan tetapi, semakin banyak proses dan thread juga berarti lebih banyak kemungkinan untuk overlap dan pipeline I/O, dan akan lebih baik dalam mengambil manfaat dari multiple core. Experimen perlu dilakukan untuk memperoleh hasil yang terbaik.

Penggunaan Memory

Alokasi memory secara umum dapat di lihat sebagai "pinned" (juga dikenal sebagai “unreclaimable”), “reclaimable” atau “swappable”.

Anonymous Memory

Anonymous memory cenderung merupakan program heap dan stack memory (contoh, >malloc()). Memory ini reclaimable, kecuali kasus khusus seperti mlock atau jika tidak tersedia space swap. Anonymous memory harus di tulis ke swap sebelum dia dapat di reclaim. Swap I/O (baik swapping in dan swapping out page) cenderung untuk tidak effisien daripada pagecache I/O, karena allocation dan pola akses.

15.1.2. Pagecache

A cache of file data. When a file is read from disk or network, the contents are stored in pagecache. No disk or network access is required, if the contents are up-to-date in pagecache. tmpfs and shared memory segments count toward pagecache.

When a file is written to, the new data is stored in pagecache before being written back to a disk or the network (making it a write-back cache). When a page has new data not written back yet, it is called “dirty”. Pages not classified as dirty are “clean”. Clean pagecache pages can be reclaimed if there is a memory shortage by simply freeing them. Dirty pages must first be made clean before being reclaimed.

15.1.3. Buffercache

This is a type of pagecache for block devices (for example, /dev/sda). A file system typically uses the buffercache when accessing its on-disk “meta-data” structures such as inode tables, allocation bitmaps, and so forth. Buffercache can be reclaimed similarly to pagecache.

15.1.4. Buffer Heads

Buffer heads are small auxiliary structures that tend to be allocated upon pagecache access. They can generally be reclaimed easily when the pagecache or buffercache pages are clean.

15.1.5. Writeback

As applications write to files, the pagecache (and buffercache) becomes dirty. When pages have been dirty for a given amount of time, or when the amount of dirty memory reaches a particular percentage of RAM, the kernel begins writeback. Flusher threads perform writeback in the background and allow applications to continue running. If the I/O cannot keep up with applications dirtying pagecache, and dirty data reaches a critical percentage of RAM, then applications begin to be throttled to prevent dirty data exceeding this threshold.

15.1.6. Readahead

The VM monitors file access patterns and may attempt to perform readahead. Readahead reads pages into the pagecache from the file system that have not been requested yet. It is done in order to allow fewer, larger I/O requests to be submitted (more efficient). And for I/O to be pipelined (I/O performed at the same time as the application is running).

15.1.7. VFS caches

15.1.7.1. Inode Cache

This is an in-memory cache of the inode structures for each file system. These contain attributes such as the file size, permissions and ownership, and pointers to the file data.

15.1.7.2. Directory Entry Cache

This is an in-memory cache of the directory entries in the system. These contain a name (the name of a file), the inode which it refers to, and children entries. This cache is used when traversing the directory structure and accessing a file by name.

15.2. Reducing Memory Usage

15.2.1. Reducing malloc (Anonymous) Usage

Applications running on openSUSE 12.3 can allocate more memory compared to openSUSE 10. This is due to glibc changing its default behavior while allocating userspace memory. Please see http://www.gnu.org/s/libc/manual/html_node/Malloc-Tunable-Parameters.html for explanation of these parameters.

To restore a openSUSE 10-like behavior, M_MMAP_THRESHOLD should be set to 128*1024. This can be done with mallopt() call from the application, or via setting MALLOC_MMAP_THRESHOLD environment variable before running the application.

15.2.2. Reducing Kernel Memory Overheads

Kernel memory that is reclaimable (caches, described above) will be trimmed automatically during memory shortages. Most other kernel memory can not be easily reduced but is a property of the workload given to the kernel.

Reducing the requirements of the userspace workload will reduce the kernel memory usage (fewer processes, fewer open files and sockets, etc.)

15.2.3. Memory Controller (Memory Cgroups)

If the memory cgroups feature is not needed, it can be switched off by passing cgroup_disable=memory on the kernel command line, reducing memory consumption of the kernel a bit.

15.3. Virtual Memory Manager (VM) Tunable Parameters

When tuning the VM it should be understood that some of the changes will take time to affect the workload and take full effect. If the workload changes throughout the day, it may behave very differently at different times. A change that increases throughput under some conditions may decrease it under other conditions.

15.3.1. Reclaim Ratios

/proc/sys/vm/swappiness

This control is used to define how aggressively the kernel swaps out anonymous memory relative to pagecache and other caches. Increasing the value increases the amount of swapping. The default value is 60.

Swap I/O tends to be much less efficient than other I/O. However, some pagecache pages will be accessed much more frequently than less used anonymous memory. The right balance should be found here.

If swap activity is observed during slowdowns, it may be worth reducing this parameter. If there is a lot of I/O activity and the amount of pagecache in the system is rather small, or if there are large dormant applications running, increasing this value might improve performance.

Note that the more data is swapped out, the longer the system will take to swap data back in when it is needed.

/proc/sys/vm/vfs_cache_pressure

This variable controls the tendency of the kernel to reclaim the memory which is used for caching of VFS caches, versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed.

It is difficult to know when this should be changed, other than by experimentation. The slabtop command (part of the package procps) shows top memory objects used by the kernel. The vfs caches are the "dentry" and the "*_inode_cache" objects. If these are consuming a large amount of memory in relation to pagecache, it may be worth trying to increase pressure. Could also help to reduce swapping. The default value is 100.

/proc/sys/vm/min_free_kbytes

This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim). This should not normally be lowered unless the system is being very carefully tuned for memory usage (normally useful for embedded rather than server applications). If “page allocation failure” messages and stack traces are frequently seen in logs, min_free_kbytes could be increased until the errors disappear. There is no need for concern, if these messages are very infrequent. The default value depends on the amount of RAM.

15.3.2. Writeback Parameters

One important change in writeback behavior since openSUSE 10 is that modification to file-backed mmap() memory is accounted immediately as dirty memory (and subject to writeback). Whereas previously it would only be subject to writeback after it was unmapped, upon an msync() system call, or under heavy memory pressure.

Some applications do not expect mmap modifications to be subject to such writeback behavior, and performance can be reduced. Berkeley DB (and applications using it) is one known example that can cause problems. Increasing writeback ratios and times can improve this type of slowdown.

/proc/sys/vm/dirty_background_ratio

This is the percentage of the total amount of free and reclaimable memory. When the amount of dirty pagecache exceeds this percentage, writeback threads start writing back dirty memory. The default value is 10 (%).

/proc/sys/vm/dirty_ratio

Similar percentage value as above. When this is exceeded, applications that want to write to the pagecache are blocked and start performing writeback as well. The default value is 40 (%).

These two values together determine the pagecache writeback behavior. If these values are increased, more dirty memory is kept in the system for a longer time. With more dirty memory allowed in the system, the chance to improve throughput by avoiding writeback I/O and to submitting more optimal I/O patterns increases. However, more dirty memory can either harm latency when memory needs to be reclaimed or at data integrity (sync) points when it needs to be written back to disk.

15.3.3. Readahead parameters

/sys/block/<bdev>/queue/read_ahead_kb

If one or more processes are sequentially reading a file, the kernel reads some data in advance (ahead) in order to reduce the amount of time that processes have to wait for data to be available. The actual amount of data being read in advance is computed dynamically, based on how much "sequential" the I/O seems to be. This parameter sets the maximum amount of data that the kernel reads ahead for a single file. If you observe that large sequential reads from a file are not fast enough, you can try increasing this value. Increasing it too far may result in readahead thrashing where pagecache used for readahead is reclaimed before it can be used, or slowdowns due to a large amount of useless I/O. The default value is 512 (kb).

15.3.4. Further VM Parameters

For the complete list of the VM tunable parameters, see /usr/src/linux/Documentation/sysctl/vm.txt (available after having installed the kernel-source package).

15.4. Non-Uniform Memory Access (NUMA)

Another increasingly important role of the VM is to provide good NUMA allocation strategies. NUMA stands for non-uniform memory access, and most of today's multi-socket servers are NUMA machines. NUMA is a secondary concern to managing swapping and caches in terms of performance, and there are lots of documents about improving NUMA memory allocations. One particular parameter interacts with page reclaim:

/proc/sys/vm/zone_reclaim_mode

This parameter controls whether memory reclaim is performed on a local NUMA node even if there is plenty of memory free on other nodes. This parameter is automatically turned on on machines with more pronounced NUMA characteristics.

If the VM caches are not being allowed to fill all of memory on a NUMA machine, it could be due to zone_reclaim_mode being set. Setting to 0 will disable this behavior.

15.5. Monitoring VM Behavior

Some simple tools that can help monitor VM behavior:

vmstat: This tool gives a good overview of what the VM is doing. See Section 2.1.1, “vmstat” for details.

/proc/meminfo: This file gives a detailed breakdown of where memory is being used. See Section 2.4.2, “Detailed Memory Usage: /proc/meminfo” for details.

slabtop: This tool provides detailed information about kernel slab memory usage. buffer_head, dentry, inode_cache, ext3_inode_cache, etc. are the major caches. This command is available with the package procps

Referensi

http://doc.opensuse.org/documentation/html/openSUSE/opensuse-tuning/cha.tuning.memory.html