
How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it’s only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?

Note: I’m not talking about real-time stuff here. By performance-critical, I mean stuff where throughput matters, but latency doesn’t necessarily.

Edit: Although I mention malloc, this question is not intended to be C/C++ specific.

It’s significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications therefore write their own fixed-size block allocators (e.g., they ask the OS for memory 16 MB at a time and then parcel it out in fixed blocks of 4 KB, 16 KB, etc.) to avoid this issue.
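To make that concrete, here’s a minimal single-threaded sketch (the FixedBlockPool name and layout are mine, not from any shipping engine): one slab is requested up front, an intrusive free list is threaded through it, and both allocate and free are O(1) pointer swaps with no searching.

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Hypothetical fixed-size block allocator: one big slab from the
    // system, parceled out in fixed blocks via an intrusive free list.
    class FixedBlockPool {
    public:
        FixedBlockPool(std::size_t blockSize, std::size_t blockCount)
            : blockSize_(blockSize < sizeof(void*) ? sizeof(void*) : blockSize),
              slab_(static_cast<char*>(std::malloc(blockSize_ * blockCount))),
              freeList_(nullptr) {
            if (!slab_) throw std::bad_alloc();
            // Thread a singly linked free list through the slab.
            for (std::size_t i = 0; i < blockCount; ++i) {
                void* block = slab_ + i * blockSize_;
                *static_cast<void**>(block) = freeList_;
                freeList_ = block;
            }
        }
        ~FixedBlockPool() { std::free(slab_); }

        void* allocate() {                   // O(1): pop the free-list head
            if (!freeList_) return nullptr;  // pool exhausted
            void* block = freeList_;
            freeList_ = *static_cast<void**>(block);
            return block;
        }
        void deallocate(void* block) {       // O(1): push back onto the list
            *static_cast<void**>(block) = freeList_;
            freeList_ = block;
        }

    private:
        std::size_t blockSize_;
        char* slab_;
        void* freeList_;
    };

Both paths are a couple of pointer moves, which is why a pool like this beats a general-purpose malloc that has to search for a fit.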

In games I’ve seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or, with carefully written and optimized block allocators, as little as 5%. Given that a game has to maintain a consistent 60 Hz frame rate, stalling for 500 ms while a garbage collector runs occasionally isn’t practical.

For long-running applications, fragmentation is the biggest allocation problem.

Neither “long-running” nor “heap-heavy” is a great indicator of heap performance; as with using CPU caches well, technique is what matters. My financial simulations ran for ~8 hours, but objects were allocated high up in the call tree, so they were used billions of times while being allocated only once, and 99% of the memory came from the heap. Microsoft used to support multiple heaps within a single process (and maybe still does), so a tree and a linked list could each allocate from their own heap, with their own sizes, and avoid the fragmentation that would otherwise result. Likewise, keeping the allocations within a heap to multiples of some basic unit size helps. These two canons help a lot.
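On Windows the multiple-heaps technique looks roughly like this; HeapCreate/HeapAlloc/HeapFree/HeapDestroy are the actual Win32 calls, while the TreeNode/ListNode types are just hypothetical stand-ins for the tree and linked list above.

    #include <windows.h>
    #include <cstdio>

    // Hypothetical node types, one per container.
    struct TreeNode { int key; TreeNode* left; TreeNode* right; };
    struct ListNode { double value; ListNode* next; };

    int main() {
        // Give each data structure a private heap so their differently
        // sized allocations never interleave and fragment each other.
        HANDLE treeHeap = HeapCreate(0, 0, 0);  // 0, 0 = growable heap
        HANDLE listHeap = HeapCreate(0, 0, 0);

        TreeNode* t = static_cast<TreeNode*>(HeapAlloc(treeHeap, 0, sizeof(TreeNode)));
        ListNode* l = static_cast<ListNode*>(HeapAlloc(listHeap, 0, sizeof(ListNode)));
        std::printf("tree node at %p, list node at %p\n",
                    static_cast<void*>(t), static_cast<void*>(l));

        HeapFree(treeHeap, 0, t);
        HeapFree(listHeap, 0, l);
        HeapDestroy(treeHeap);  // tears down everything in a heap at once
        HeapDestroy(listHeap);
        return 0;
    }

A side benefit: if a whole data structure dies at once, HeapDestroy frees it in one call with no per-node bookkeeping.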

Stack usage is more about the lifetime of the object than about performance; in a well-constructed program the performance is identical. Stack allocation does make for easy cleanup when you exit the scope. _alloca() is a nice cheat for dynamic memory allocation from the stack, but apart from that easy cleanup, and maybe preventing fragmentation, it has no advantage over malloc().
https://caligari.dartmouth.edu/doc/ibmcxx/en_US/doc/libref/concepts/cumemmng.htm
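A tiny sketch of the _alloca() cheat, assuming MSVC (which declares it in <malloc.h>; POSIX systems spell it alloca() in <alloca.h>):

    #include <malloc.h>   // MSVC: _alloca; on POSIX use <alloca.h>/alloca
    #include <cstdio>

    void formatRow(int width) {
        if (width <= 0) return;
        // The buffer lives on this function's stack frame: it is reclaimed
        // automatically on return, so there is nothing to free and nothing
        // to fragment. Caveat: a large width can overflow the stack.
        char* buf = static_cast<char*>(_alloca(width));
        std::snprintf(buf, static_cast<std::size_t>(width), "row of width %d", width);
        std::puts(buf);
    }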

Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.

In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention, locks are far from free and should be avoided as much as possible.

Java and C# are better at this because threading was designed in from the start, and their memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn’t automatic.
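Here’s roughly what doing it by hand can look like in C++ (poolAlloc/poolFree are hypothetical helpers, with a single fixed block size to keep the sketch short):

    #include <cstddef>
    #include <cstdlib>

    constexpr std::size_t kBlockSize = 64;  // one fixed size, for simplicity

    struct FreeNode { FreeNode* next; };
    thread_local FreeNode* tlFreeList = nullptr;  // each thread's own list

    void* poolAlloc() {
        if (tlFreeList) {                // fast path: no lock, no malloc
            FreeNode* n = tlFreeList;
            tlFreeList = n->next;
            return n;
        }
        return std::malloc(kBlockSize);  // slow path: hit the shared heap
    }

    void poolFree(void* p) {
        // Recycle into the calling thread's list instead of the global heap.
        FreeNode* n = static_cast<FreeNode*>(p);
        n->next = tlFreeList;
        tlFreeList = n;
    }

The common path never touches a shared lock; only the first allocations go through malloc. A real version would cap the list length and return surplus blocks to the heap, but the principle is the same.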

First off, since you said malloc, I assume you’re talking about C or C++.

Memory allocation and deallocation tend to be a significant bottleneck for real-world programs. A lot goes on “under the hood” when you allocate or deallocate memory, and all of it is system-specific; memory may actually be moved or defragmented, pages may be reorganized, and there’s no platform-independent way to know what the impact will be. Some systems (like a lot of game consoles) also don’t do memory defragmentation, so on those systems you’ll start to get out-of-memory errors as memory becomes fragmented.

A typical workaround is to allocate as much memory up front as possible, and hang on to it until your program exits. You can either use that memory to store big monolithic sets of data, or use a memory pool implementation to dole it out in chunks. Many C/C++ standard library implementations do a certain amount of memory pooling themselves for just this reason.
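A bare-bones sketch of that workaround (the Arena type is hypothetical): one big block grabbed at startup, handed out by bumping a pointer, and returned to the system only when the arena dies; individual frees are deliberately impossible.

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <new>

    // Hypothetical arena: one big upfront allocation, bump-pointer doling.
    class Arena {
    public:
        explicit Arena(std::size_t capacity)
            : base_(static_cast<char*>(std::malloc(capacity))), cur_(base_),
              end_(base_ ? base_ + capacity : nullptr) {
            if (!base_) throw std::bad_alloc();
        }
        ~Arena() { std::free(base_); }  // everything returned at once

        void* allocate(std::size_t size,
                       std::size_t align = alignof(std::max_align_t)) {
            // Round the cursor up to the requested (power-of-two) alignment.
            std::uintptr_t p = reinterpret_cast<std::uintptr_t>(cur_);
            std::uintptr_t aligned = (p + align - 1) & ~(align - 1);
            char* out = reinterpret_cast<char*>(aligned);
            if (out + size > end_) return nullptr;  // arena exhausted
            cur_ = out + size;
            return out;
        }

    private:
        char* base_;
        char* cur_;
        char* end_;
    };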

No two ways about it, though: if you have a time-sensitive C/C++ program, doing a lot of memory allocation/deallocation will kill performance.

In general, the cost of memory allocation is probably dwarfed by lock contention, algorithmic complexity, or other performance issues in most applications. I’d say it’s probably not in the top ten performance issues I’d worry about.

Now, grabbing very large chunks of memory might be an issue. And grabbing memory but never properly releasing it is something I’d worry about.

In Java and JVM-based languages, new’ing objects is now very, very, very fast.

Here’s one decent article by a guy who knows his stuff with some references at the bottom to more related links: http://www.ibm.com/developerworks/java/library/j-jtp09275.html

In Java (and potentially other languages with a decent GC implementation) allocating an object is very cheap; in the Sun JVM it needs only about 10 CPU cycles, because allocation is usually just a pointer bump inside a thread-local allocation buffer. A malloc in C/C++ is much more expensive, simply because it has to do more work per call.

Still, even though allocating objects in Java is very cheap, doing it for a lot of users of a web application in parallel can still lead to performance problems, because more garbage-collector runs will be triggered. So there are indirect costs of an allocation in Java, caused by the deallocation work the GC does. These costs are difficult to quantify because they depend very much on your setup (how much memory you have) and your application.

Regards, Markus (http://kohlerm.blogspot.com/)

http://stackoverflow.com/questions/470683/memory-allocation-deallocation-bottleneck