Lab Guide: Embedded Memory Management

(I wrote this article for students using our simple RTOS; we are not using the RTOS in SENG 466 any more, but much of this information is still relevant.  The summary is: don’t use C++ or dynamic memory allocation.  You might lose marks if you do.)

Development tools like WinAVR and Atmel Studio allow programmers to write embedded code in C++; this enables tantalizingly familiar programming approaches that are popular among high-level application developers but are not always a judicious choice for embedded development.  For example, many of the libraries provided by Arduino are written as C++ classes.  C++ allows low-level control in much the same way as C, and it lets us avoid “old-fashioned” C coding structures that can, on occasion, seem awkward and abstruse.  Unfortunately, taking advantage of C++ in embedded systems is not necessarily a good idea.

When we’re writing desktop computer applications we often operate under an assumption that memory and storage are infinite.  A few kilobytes of waste is nothing in that context, and small memory leaks are common and not really worth fixing.  It sounds almost dystopian, but it truly doesn’t matter because in many ways memory and storage are effectively infinite on desktop computers.

On embedded systems, memory and storage are extremely limited: an Arduino Mega comes with an ATmega1280, which has all of 8 KB of RAM; the AT90USBKey comes with a luxurious 16 MB of external Flash storage, although each 512-byte page takes thousands of cycles to access.  If we develop firmware in the same way we develop desktop applications, we’ll quickly run into trouble.  The biggest risk in memory management on embedded systems is dynamic memory allocation.  C and C++ both provide dynamic memory allocation, but because C++ supports object-oriented programming, and because OOP patterns are designed for (and learned with) dynamic allocation on fast desktop computers with lots of memory, we must be especially cautious when using C++.

The information in this document is for avr-gcc, which is the compiler behind WinAVR, Atmel Studio, and Arduino.  Other compilers may do things differently, especially the interrupt syntax.

Data Memory Layout

The programmer has full control of where memory is laid out in the AVR’s Flash and SRAM, but usually we use the default.  There are four main default segments (or sections) of data memory:

  1. .data
  2. .bss
  3. Heap
  4. Stack

The .data section holds space for initialized data, like string literals and initialized static variables.  The initial values are copied from program memory (the Flash) to the .data section by the C runtime when the program resets.  The .bss section is where uninitialized static data are allocated; the C runtime loops over the .bss section and sets everything to 0 when the program resets.  Then comes the heap, where dynamic memory is allocated, starting at the address after the end of the .bss segment; and finally the stack, which starts at the maximum data address and grows downwards towards the .data, .bss, and heap segments.  Note that memory on the stack, while it is allocated “as needed”, is not dynamically allocated, since the sizes of stack allocations are known at compile time and can’t be changed on the fly.  This article uses the term dynamic allocation to refer to heap allocation performed using the new keyword in C++ or the malloc() family of functions in C.  Static allocations occur in the .data and .bss segments.
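As a concrete illustration of the layout just described (the variable names are made up, and the segment placement is the avr-gcc default), each kind of variable below lands in a different segment:

```cpp
#include <cstdint>
#include <cstdlib>

uint8_t counter = 42;   // initialized static data: .data, value copied from Flash at reset
uint8_t scratch[32];    // uninitialized static data: .bss, zeroed at reset

void demo(void)
{
    uint8_t local = 7;                      // stack: exists only while demo() runs
    uint8_t *dyn = (uint8_t *)malloc(16);   // heap: grows up from the end of .bss
    (void)local;
    free(dyn);                              // heap memory must be freed explicitly
}
```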

Dynamic Memory Allocation

The version of gcc that comes with WinAVR does not include libstdc++, so the new operator is not implemented (n.b. The Arduino library implements new).  This is great, because dynamic memory allocation should be avoided in embedded systems in all but the most exceptional cases.

AVR gcc does implement malloc for dynamic allocation in both C and C++, and it should be avoided just like new.  Using dynamic memory allocation is risky for several reasons, including:

  • Heap allocation, and therefore how data get laid out in memory, is difficult to predict.
  • When you allocate on the heap you run the risk of memory leaks—that is, allocating memory and not freeing it.  Memory leaks are critically dangerous in low-memory systems, for obvious reasons.  These embedded systems do not have virtual memory to fall back on if they run out of space.
  • The memory management system adds processing and memory overhead for every allocation and de-allocation (in contrast, stack and static allocation are nearly free).
  • Dangling references are more likely when dealing with pointers to freed dynamic memory, and there’s no OS oversight to report segmentation violations.
  • malloc() and free() are not re-entrant so they need some boilerplate when used in a program that uses interrupts.
  • Unexpected collisions between the stack and the heap are possible, especially when the heap becomes fragmented.
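On the re-entrancy point in particular, the usual boilerplate is to wrap each allocation in a critical section.  Here is a sketch (safe_malloc and the critical-section stubs are invented names; on the AVR the stubs would save SREG, execute cli(), and restore SREG afterwards):

```cpp
#include <cstdlib>

// Stand-ins for interrupt control; on real hardware these would disable
// interrupts (cli()) and restore the previous interrupt state afterwards.
static void enter_critical(void) { /* cli() on the AVR */ }
static void exit_critical(void)  { /* restore SREG on the AVR */ }

static void *safe_malloc(size_t n)
{
    enter_critical();       // malloc() is not re-entrant, so no ISR may run
    void *p = malloc(n);    // while the heap structures are being updated
    exit_critical();
    return p;
}
```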

The last point is especially important in the RTOS that we use in CSC 460.  The obvious way to detect a stack-heap collision is to check if the heap space is being allocated above the stack pointer, or if the stack pointer is being incremented to inside the heap space.  But our RTOS only uses the main stack area for the kernel; it gives each task its own stack, located in the (static) .bss segment of memory.  While the system runs, the stack pointer register is going to jump around the .bss segment as the RTOS moves from task to task.  Most of the time there’s simply no way for the heap manager to identify when a heap allocation will clobber the kernel stack.

If dynamic allocation is absolutely required, it should not be performed inside of loops.  It should only be used where the stack pointer can still be used to detect collisions, preferably once at the start of the program.
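If it comes to that, the pattern looks like this (a sketch; init_buffers, rx_buffer, and run are hypothetical names): allocate everything once before the system starts running, check for failure immediately, and never free.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstdlib>

static uint8_t *rx_buffer = 0;          // hypothetical buffer

static int init_buffers(size_t n)
{
    rx_buffer = (uint8_t *)malloc(n);   // the program's only allocation
    return rx_buffer != 0;              // fail fast, before tasks start
}

int run(void)
{
    if (!init_buffers(64))
        return 1;                       // out of memory: stop here, not mid-flight
    // ... main loop; no further malloc()/free() calls ...
    return 0;
}
```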

Avoiding Dynamic Allocation

The key to avoiding dynamic memory allocation is to make thoughtful design choices.  In software engineering we are constantly told never to make assumptions or hard-code limitations into our software, but in embedded systems it is almost always necessary to do so.  For example, the RTOS we use has a hard limit of 8 concurrent tasks, and each task has a default limit of 256 bytes of stack space.  Having these values be dynamic is not only infeasible, it’s also not very useful.  We are in full control of the software that runs on the microcontroller—that is to say, notwithstanding abstraction layers like the Arduino library, there is very little “spooky action at a distance”—and we can design and document limitations to make more efficient use of the scarce resources available to us.
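To make the idea concrete, here is a sketch of the hard-limit style of design (the struct and function names are invented, not the RTOS’s actual API): a fixed task table, sized at compile time, instead of a dynamically grown list.

```cpp
#include <cstdint>

#define MAX_TASKS 8               // hard limit, like the RTOS described above
#define TASK_STACK_SIZE 256       // default per-task stack space

struct Task {
    void (*entry)(void);                // task entry point
    uint8_t stack[TASK_STACK_SIZE];     // each task's stack lives in .bss
};

static Task tasks[MAX_TASKS];     // all task memory reserved at compile time
static uint8_t num_tasks = 0;

static void idle_task(void) { }   // example task body

// Returns the new task's index, or -1 when the documented limit is hit.
static int task_create(void (*entry)(void))
{
    if (num_tasks >= MAX_TASKS)
        return -1;                // the limit is enforced and visible
    tasks[num_tasks].entry = entry;
    return num_tasks++;
}
```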

We can use object-oriented techniques without dynamic memory allocation.  The Arduino libraries define objects statically in the global namespace as shown in the following template:

/** ClassName.h */
#ifndef __CLASSNAME_H__
#define __CLASSNAME_H__
#include <stdint.h>

class ClassName
{
private:
    uint8_t arga;
    uint8_t argb;
public:
    ClassName(uint8_t arga, uint8_t argb);
    // rest of the definition
};

extern ClassName object1;
extern ClassName object2;
#endif

/** ClassName.cpp */
#include "ClassName.h"

ClassName::ClassName(uint8_t a, uint8_t b) :
    arga(a),  // initialize the instance variables
    argb(b)
{ }

// create and instantiate the static objects
// (default1a etc. are placeholder constants)
ClassName object1(default1a, default1b);
ClassName object2(default2a, default2b);

This is a good way to do it because the objects are available globally and can be accessed by any module without having to muck about with pointers.  Global variables are not inherently evil in embedded development.  Furthermore, the space for the objects is allocated statically; with avr-gcc the storage for such objects lands in the .bss segment, and their constructors run during startup, before main().

Anything that allocates global static memory must be put in a source code file (.c or .cpp).  If it’s defined in the header, then every translation unit that includes the header will allocate its own copy of the same memory, and the linker will report multiple definitions.  The object1 definition in the .cpp file allocates static global memory for the instance and calls the constructor with the hard-coded default arguments.  The extern declaration in the header makes the object1 variable visible to external modules that include ClassName.h, but it doesn’t allocate space for a new object.

Recursion

Computer science theorists love recursion as much as software engineers love dynamic allocation.  Men and women of insight will recognize the evil of recursion in all cases, not just in embedded systems (just kidding, sort of).

Recursion, like dynamic memory allocation, makes memory usage difficult to predict at compile time and increases the risk of stack overruns.  This is especially true in the RTOS, where each task has only 256 bytes for its stack.  Recursion should only be used when the function’s recursion depth is predictably small (e.g. O(log n) functions).
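As an example of acceptable recursion (a plain recursive binary search, invented for illustration): the depth is bounded by log2(n), so a 256-entry array never needs more than about 8 nested frames, which is easy to budget for even in a 256-byte stack.

```cpp
#include <cstdint>

// Recursive binary search over a sorted array.  The recursion depth is at
// most log2(n) + 1, so the worst-case stack usage is known at design time.
static int bsearch_u8(const uint8_t *a, int lo, int hi, uint8_t key)
{
    if (lo > hi) return -1;               // key is not present
    int mid = lo + (hi - lo) / 2;
    if (a[mid] == key) return mid;
    if (a[mid] < key)  return bsearch_u8(a, mid + 1, hi, key);
    return bsearch_u8(a, lo, mid - 1, key);
}
```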

Interrupts

With the RTOS, an interrupt service routine will run on the stack belonging to whatever task is running at the time the interrupt was triggered (interrupts are disabled when the kernel stack is active).  Even if a task itself will never overflow its own stack, an interrupt might occur while that task is running and push its stack pointer out of its workspace.  When an interrupt is triggered it pushes the CPU execution context, which is several dozen bytes, onto the stack.  This is a very erratic and difficult bug to track down, and the solution is to keep the stack small by moving data out of the stack (see “Memory Usage Tracking” below) and by avoiding recursion.  This isn’t so much of a problem when using just one stack space, because presumably there is plenty of room for the stack in SRAM.

Interrupt Handling in C++

Interrupt service routines are not object oriented in avr-gcc.  One might be able to stick an ISR definition into a C++ class, but I haven’t figured out how to do that.  If using C++ is required for some reason we can pretend that the ISR is part of the class using the friend keyword, which gives the ISR access to the class’s private members.  To do this we have to decompose the ISR() macro as the following example for the timer 1B match interrupt vector shows:

/** ClassName.h */
#ifndef __CLASSNAME_H__
#define __CLASSNAME_H__
#include <stdint.h>

// prototype from the ISR macro
extern "C" void TIMER1_COMPB_vect (void) __attribute__ ((signal, used, externally_visible));

class ClassName
{
    friend void TIMER1_COMPB_vect(void);    // give the ISR access to private members
    ...
private:
    static uint8_t var1;
};
#endif

/** ClassName.cpp */
#include "ClassName.h"

// static class members need to be defined (and initialized) in a source file
uint8_t ClassName::var1 = 0;

// function definition from the ISR macro
void TIMER1_COMPB_vect()
{
    ClassName::var1 = 1;
}

The RTOS uses this same technique of ISR decomposition so that it can add the naked attribute to the ISR prototype (avr-libc now has a more elegant mechanism for this: the ISR() macro accepts attribute arguments, e.g. ISR(TIMER1_COMPB_vect, ISR_NAKED)).  The above code is exactly the same as declaring ISR(TIMER1_COMPB_vect) in the .cpp file, except the function prototype is moved above the class definition so that it is in the right scope to declare it a friend.

But really, that whole thing is kind of goofy and useless.  The better solution is to ignore OOP habits and just make the class member public.  The code is clearer, since the familiar WinAVR ISR macro sugar remains intact.  Why not?  It’s not like we’re creating million-line multi-team libraries for widespread public distribution here.  Clarity and simplicity are much more important than strictly adhering to OOP rules.
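The public-member version is a one-liner by comparison.  In this sketch a plain function stands in for the ISR body (on hardware it would be written with the usual ISR(TIMER1_COMPB_vect) macro); the volatile qualifier is added here because an ISR modifies the variable asynchronously.

```cpp
#include <cstdint>

class ClassName
{
public:
    static volatile uint8_t var1;   // public, so the ISR needs no friend access
};
volatile uint8_t ClassName::var1 = 0;

// On the AVR this body would live inside ISR(TIMER1_COMPB_vect) { ... };
// a plain function stands in for it in this sketch.
void timer1_compb_handler(void)
{
    ClassName::var1 = 1;
}
```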

Memory Usage Tracking

It’s hard to tell how much memory a program is actually using, because many data are allocated at runtime on the stack (or, if you’re using the RTOS, stacks).  When we compile a program the compiler reports how much data memory is allocated statically, but not how much memory will be allocated at run-time.

AVR is based on the Harvard architecture, so it has two areas of memory.  Program Flash memory is large and contains executable code.  One can also instruct the compiler to put read-only data (e.g. large look-up tables) in program memory, but program memory can’t be written to at run-time, apart from self-programming through the bootloader.  Data memory, held in expensive SRAM, is much smaller than program memory, but it is read-write at run-time.  Data memory contains several compiler-defined segments, as mentioned previously.  Data that are allocated at run-time on the stack(s) and heap obviously can’t be tracked easily at compile-time, but if you tell the compiler to allocate them statically, in the .bss or .data segments, then it can include them in its memory usage calculation.  (There are tools for predicting stack usage in very simple programs.)  Normally at the end of compilation the AVR tools will report the sizes of the compiled code and statically allocated data memory.

The following example illustrates stack allocation vs. static allocation.  This is stack allocation:

void do_something()
{
    char str[64];      // 64-byte string is allocated on the stack.
    snprintf(str, sizeof(str), "Some text\n\r");
}

The following two functions showing static allocation are equivalent, aside from the str and str2 variables’ scopes:

void do_something()
{
    static char str[64];       // 64-byte string is allocated in static memory (.bss)
    snprintf(str, sizeof(str), "Some text\n\r");
}
 
char str2[64];       // 64-byte string is allocated in static memory (.bss)
void do_something_else()
{
    snprintf(str2, sizeof(str2), "Some text\n\r");
}

If you compile the first function, the compiler will report 0 extra bytes of data memory, because the “str” variable will be allocated on the stack at run-time.  For the second and third functions the compiler will report 64 extra bytes each of data memory, because “str” and “str2” will be placed in static memory.  This has two benefits: it gives a more accurate picture of memory usage at compile time, and it makes the stack smaller by offloading the data to static memory.  Stack space will still be used for the function call itself (the frame pointer, arguments, etc.), but that’s just a few bytes.

The downside is that static allocation is less memory efficient.  In the first function, the array is de-allocated from the stack when the function returns.  In the second and third functions the static memory used by the string is permanently allocated in the .bss segment and can’t be recovered.  If too many data are allocated statically, the problem shows up at compile time in the memory report, and some of the load can be moved back onto the stack.

One thing to keep in mind with static allocation in a function is that if you initialize the statically allocated memory when it is defined, the initialization will happen once at system startup, and will probably only be valid the first time the function is executed.  You might have to re-initialize the data in an executable statement, e.g.:

void do_something()
{
    uint8_t i = 0;          // i is not static and is therefore initialized to 0 at every function call
    ...
    i = 0;                  // any decent compiler will optimize out this second assignment
    while (i != 1) { ... }
}
void do_something()
{
    static uint8_t i;         // i is static and is therefore initialized to 0 once, by the C runtime at startup
    ...
    i = 0;                    // i is re-initialized manually at runtime each time the function is called
    while (i != 1) { ... }
}

All static variables are implicitly initialized to 0, so initializing them to 0 explicitly is redundant.  The second function runs no slower than the first, but if you forget to re-initialize the static variable at run-time, it will run with data left over from the last time the function ran.
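The once-only initialization is easy to demonstrate (call_count is a hypothetical helper):

```cpp
#include <cstdint>

// The static local is zeroed once at startup and then persists across
// calls, so each call sees the value left behind by the previous one.
static uint8_t call_count(void)
{
    static uint8_t count;   // implicitly initialized to 0, exactly once
    return ++count;
}
```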