LinuxDevCenter.com
oreilly.comSafari Books Online.Conferences.

advertisement


When Linux Runs Out of Memory
Pages: 1, 2

Inside the Allocator

The real work actually takes place inside the glibc memory allocator. The allocator hands out blocks to the application, carving them from the heap that comes (however infrequently) from the kernel.



The allocator is the manager, while the kernel is the worker. With this in mind, it's easy to understand that maximum efficiency comes from a good allocator, not from the kernel.

glibc uses an allocator named ptmalloc. Wolfram Gloger created it as a modified version of the original malloc library created by Doug Lea. The allocator manages the allocated blocks in terms of "chunks." Chunks represent the memory block you actually requested, but not its size. There is an extra header added inside this chunk besides the user data.

The allocator uses two functions to get a chunk of memory from the kernel:

  • brk() sets the end of the process's data segment.
  • mmap() creates a new VMA and passes it to the allocator.

Of course, malloc() uses these functions only if there are no more free chunks in the current pool.

The decision on whether to use brk() or mmap() requires one simple check. If the request is equal or larger than M_MMAP_THRESHOLD, the allocator uses mmap(). If it is smaller, the allocator calls brk(). By default, M_MMAP_THRESHOLD is 128KB, but you may freely change it by using mallopt().

In the OOM context, how ptmalloc frees memory blocks is interesting. Blocks allocated via mmap() get freed via an unmap() call, and thus become completely released. Freeing blocks allocated via brk() means marking them as free, but they remain under the allocator's control. It can reassign free chunks to satisfy another malloc() if the request's size is less than or equal to the chunk's size. The allocator can consolidate multiple free chunks, as long as they are adjacent. It may even split a free chunk into smaller chunks to satisfy smaller future requests.

This implies that a free chunk may go abandoned if the allocator cannot fit future requests within it. Failure to coalesce free chunks may also trigger faster OOM. This is usually an indication of moderate to bad memory fragmentation.

Recovery

Once an OOM situation occurs, now what? The kernel will terminate one process for sure. Why kill? This is the only way to stop further memory requests. The kernel can not assume there is a sophisticated mechanism inside the process to stop further requests automatically, so it has no other choice but to kill it.

How does the kernel know exactly which process to kill? The answer lies inside mm/oom_kill.c of the Linux source code. This C code represents the so-called OOM killer of the Linux kernel. The function badness() give a score to each existing processes. The one with highest score will be the victim. The criteria are:

  1. VM size. This is not the sum of all allocated pages, but the sum of the size of all VMAs owned by the process. The bigger the VM size, the higher the score.
  2. Related to #1, the VM size of the process's children are important too. The VM size is cumulative if a process has one or more children.
  3. Processes with task priorities smaller than zero (niced processes) get more points.
  4. Superuser processes are important, by assumption; thus they have their scores reduced.
  5. Process runtime. The longer it runs, the lower the score.
  6. Processes that perform direct hardware access are more immune.
  7. The swapper (pid 0) and init (pid 1) processes, as well as any kernel threads immune from the list of potential victims.

The process with the biggest score "wins" the election and the OOM killer will kill it very soon.

The heuristic isn't perfect, but usually it works well for most situations. Criteria #1 and #2 clearly show that it is the VMA size that matters, not the number of actual pages a process has. You might think that measuring VMA size will trigger a false alarm, but luckily it doesn't. The badness() call occurs inside the page allocation functions when there are few free pages left and page frame reclamation fails, so the VMA size closely matches the number of pages owned by the process.

Why not just count the actual number of pages? That would require more time and require the use of locks, thus making the procedure too expensive to make a fast decision. Knowing that OOM killer isn't perfect, you must be ready for a wrong kill.

The kernel uses the SIGTERM signal to inform the target process that it should stop.

How to Reduce OOM Risk

The simple rule to avoid OOM risk is actually simple: don't allocate beyond the machine's current free space. However, many factors come into play, so there are further refinements to the strategy.

Reduce Fragmentation by Properly Ordering Allocation

There is no need to use any sophisticated allocator. You can reduce fragmentation by properly ordering memory allocation and deallocation. As holes easily happen, employ the LIFO strategy: the last one you allocate is the first you need to free.

For example, instead of doing:

        void *a;
        void *b;
        void *c;
        ............
        a = malloc(1024);
        b = malloc(5678);
        c = malloc(4096);

        ......................

        free(b);
        b = malloc(12345);

It's better to do:

        a = malloc(1024);
        c = malloc(4096);
        b = malloc(5678);
        ......................

        free(b);
        b = malloc(12345);

This way, there won't be any hole between the a and c chunks. You can also consider realloc() to resize any existingmalloc()-ed blocks.

Two example programs (fragmented1.c and fragmented2.c) demonstrate the effect of allocation rearrangement. Reports at the end of both programs give the number of bytes allocated by the system (kernel and glibc allocator) and the number of bytes actually used. For example, on kernel 2.6.11.1, with glibc 2.3.3-27 and executing without giving an explicit parameter, fragmented1 wasted 319858832 bytes (about 305 MB) while fragmented2 wasted 2089200 bytes (about 2MB). That's 152 times smaller!

You can do further experiments by passing various numbers as the program parameter. This parameter acts as the request size of the malloc() call.

Tweak Kernel's Overcommit Behavior

You can change the behavior of the Linux kernel through the /proc filesystem, as documented in Documentation/vm/overcommit-accounting in the Linux kernel's source code. You have three choices when tuning kernel overcommit, expressed as numbers in /proc/sys/vm/overcommit_memory:

  • 0 means that the kernel will use predefined heuristics when deciding whether to allow such an overcommit. This is the default.
  • 1 always overcommits. Perhaps you now realize the danger of this mode.
  • 2 prevents overcommit from exceeding a certain watermark. The watermark is also tunable through /proc/sys/vm/overcommit_ratio. Within this mode, the total commit can not exceed the swap space(s) size + overcommit_ratio percent * RAM size. By default, the overcommit ratio is 50.

The default mode usually work quite fine in most situation, but mode #2 offers better protection toward overcommit. On the other hand, mode #2 requires you to predict carefully how much space all running applications need. You certainly don't want to see your application unable to get more memory chunks just because the limit is too strict. However, mode #2 is a best way to avoid having a program killed suddenly.

Suppose that you have 256MB of RAM and 256MB of swap and you want to limit overcommit at 384MB. That means 256 + 50 percent * 256MB, so put 50 on /proc/sys/vm/overcommit_ratio.

Check for NULL Pointer after Memory Allocation and Audit for Memory Leak

This is a simple rule, but it sometimes goes omitted. By checking for NULL, at least you know that the allocator could extend the memory area, although there is no obvious guarantee that it will allocate the needed pages later. Usually, you need to bail out or delay the allocation for a moment, depending on your scenarios. Together with overcommit tunables, you have a decent tool to anticipate OOM because malloc() will return NULL if it believes that it cannot acquire free pages later.

Memory leak is also a source of unnecessary memory consumption. A leaked memory block is one that the application no longer tracks, but that the kernel will not reclaim because, from the kernel's point of view, the task still has it under control. Valgrind is a nice tool to find out such occurrences inside your code without the need to re-code.

Always Consult Memory Allocation Statistics

The Linux kernel provides /proc/meminfo as a way to find complete information about memory conditions. This /proc entry is also an information source for utilities such as top, free, and vmstat.

What you need to check is the free and reclaimable memory. The word "free" needs no further explanation, but what does "reclaimable" mean? It refers to buffers and page caches--the disk cache. They are reclaimable because, when memory is tight, the Linux kernel can simply flush them out back to the disk. These are file-backed pages. I've lightly edited this example of memory statistics:

   $ cat /proc/meminfo
   MemTotal:       255944 kB
   MemFree:          3668 kB
   Buffers:         13640 kB
   Cached:         171788 kB
   SwapCached:          0 kB
   HighTotal:           0 kB
   HighFree:            0 kB
   LowTotal:       255944 kB
   LowFree:          3668 kB
   SwapTotal:      909676 kB
   SwapFree:       909676 kB

Based on this above output, the free virtual memory is MemFree + Buffers + Cached + SwapFree = 1098772 kB.

I failed to find any formalized C (glibc) function to find out free (including reclaimable) memory space. The closest I found is by using get_avphys_pages() or sysconf() (with the _SC_AVPHYS_PAGES parameter). They only report the amount of free memory, not the free + reclaimable amount.

That means to get precise information, you must programmatically parse the /proc/meminfo and calculate it by yourself. If you're lazy, take the procps source package as a reference on how to do it. This package contains tools such as ps, top, and free. It is available under the GPL.

Experiments with Alternative Memory Allocators

Different allocators yield different ways to manage memory chunks and to shrink, expand, and create virtual memory areas. Hoard is one example. Emery Berger from the University of Massachusetts wrote it as a high performance memory allocator. Hoard seems to work best for multi-threaded applications; it introduces the concept of per-CPU heap.

Use 64-bit Platforms

Users who need larger user address spaces can consider using 64-bit platforms. The Linux kernel no longer uses the 3:1 VM split for these machines. In other words, user space becomes quite large. It can be a good match for machines with more than 4GB of RAM.

This has no connection to extended addressing schemes, such as Intel's Physical Address Extension (PAE), which allows a 32-bit Intel processor to address up to 64GB of RAM. This addressing deals with physical address, while in the virtual address context, the user space itself is still 3GB (assuming the 3:1 VM split). This extra memory is reachable, but not all mappable into the address space. Unmappable portions of RAM are unusable.

Consider Packed Types on Structures

Packed attributes can help to squeeze the size of structs, enums, and unions. This is a way to save more bytes, especially for array of structs. Here is a declaration example:

struct test
   {
        char a;
        long b;
   } __attribute__ ((packed));

The con for this action is that it makes certain field(s) unaligned and thus it costs more CPU cycles to access the field. "Aligned" here means the variable's address is a multiple of its data type's natural size. The net result is that, depending on the data access frequency, the runtime may get relatively slower. However, take into account page alignment and cache coherence.

Use ulimit() for User Processes

With ulimit -v, you can limit the address space a process can allocate with mmap(). When you reach the limit, all mmap(), and hence malloc(), calls will return 0 and the kernel's OOM killer will never start. This is most useful in a multi-user environment where you cannot trust all of the users and want to avoid killing random processes.

Acknowledgement

The author gives credits to several people for their assistance and help: Peter Ziljtra, Wolfram Gloger, and Rene Hermant. Mr. Gloger also contributed the ulimit() technique.

References

  1. "Dynamic Storage Allocation: A Survey and Critical Review," by Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Proceeding 1995 International Workshop of Memory Management.
  2. Hoard: A Scalable Memory Allocator for Multithreaded Applications, by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson
  3. "Once upon a free()" by Anonymous, Phrack Volume 0x0b, Issue 0x39, Phile #0x09 of 0x12.
  4. "Vudo: An Object Superstitiously Believed to Embody Magical Powers," by Michel "MaXX" Kaempf. Phrack Volume 0x0b, Issue 0x39, Phile #0x08 of 0x12.
  5. "Policy-Based Memory Allocation," by Andrei Alexandrescu and Emery Berger. C/C++ Users Journal.
  6. "Security of memory allocators for C and C++," by Yves Younan, Wouter Joosen, Frank Piessens, and Hans Van den Eynden. Report CW419, July 2005
  7. Lecture notes (CS360) about malloc(), by Jim Plank, Dept. of Computer Science, University of Tennessee.
  8. "Inside Memory Management: The Choices, Tradeoffs, and Implementations of Dynamic Allocation," by Jonathan Bartlett
  9. "The Malloc Maleficarum," by Phantasmal Phantasmagoria
  10. Understanding The Linux Kernel, 3rd edition, by Daniel P. Bovet and Marco Cesati. O'Reilly Media, Inc.

Mulyadi Santosa is a freelance writer who lives in Indonesia.


Return to the Linux DevCenter.


Linux Online Certification

Linux/Unix System Administration Certificate Series
Linux/Unix System Administration Certificate Series — This course series targets both beginning and intermediate Linux/Unix users who want to acquire advanced system administration skills, and to back those skills up with a Certificate from the University of Illinois Office of Continuing Education.

Enroll today!


Linux Resources
  • Linux Online
  • The Linux FAQ
  • linux.java.net
  • Linux Kernel Archives
  • Kernel Traffic
  • DistroWatch.com


  • Sponsored by: