Memory access and alignment
When developing kernel code, it is usually important to consider constraints and requirements of architectures other than your own. Otherwise, your code may not be portable to other architectures, as I recently discovered when an unaligned memory access bug was reported in a driver which I develop. Not having much familiarity with the concepts of unaligned memory access, I set out to research the topic and complete my understanding of the issues.
Certain architectures rule that memory accesses must meet certain alignment criteria or are otherwise illegal. The exact criteria that determine whether an access is suitably aligned depend upon the address being accessed and the number of bytes involved in the transaction, and vary from architecture to architecture. Kernel code is typically written to obey natural alignment constraints, a scheme that is sufficiently strict to ensure portability to all supported architectures. Natural alignment requires that every N-byte access be aligned on a memory address boundary of N. We can express this in terms of the modulus operator: addr % N must be zero. Some examples:

- Accessing 4 bytes of memory from address 0x10004 is aligned (0x10004 % 4 = 0).
- Accessing 4 bytes of memory from address 0x10005 is unaligned (0x10005 % 4 = 1).
Processors provide instructions for accessing memory in chunks of various sizes (e.g. movb, movw, and movl in x86 assembly). It is relatively easy to relate these to C statements; for example, the instructions generated when the following code is compiled would likely include a single instruction that accesses two bytes (16 bits) of data from memory:
    void example_func(unsigned char *data)
    {
        u16 value = *((u16 *) data);
        [...]
    }
The effects of unaligned access vary from architecture to architecture. On architectures such as ARM32 and Alpha, a processor exception is raised when an unaligned access occurs, and the kernel is able to catch the exception and correct the memory access (at large cost to performance). Other architectures raise processor exceptions but the exceptions do not provide enough information for the access to be corrected. Some architectures that are not capable of unaligned access do not even raise an exception when unaligned access happens, instead they just perform a different memory access from the one that was requested and silently return the wrong answer.
Some architectures are capable of performing unaligned accesses without raising bus errors or processor exceptions; i386 and x86_64 are common examples. Even so, unaligned accesses can degrade performance on these systems, as Andi Kleen has explained.
At the end of the day, if you write code that causes unaligned accesses then your software will not work on some systems. This applies to both kernel-space and userspace code.
The theory is relatively easy to get to grips with, but how does this apply
to real code? After all, when you allocate a variable on the stack, you
have no control over its address. You don't get to control the addresses
used to pass function parameters, or the addresses returned by the memory
allocation functions. Fortunately, the compiler understands the alignment
constraints of your architecture and will handle the common cases just
fine; it will align your variables and parameters to suitable boundaries,
and it will even insert padding inside structures to ensure that accesses to
members are suitably aligned. Even when using the GCC-specific packed
attribute (which tells GCC not to insert padding), GCC will
transparently insert extra instructions to ensure that standard accesses to
potentially unaligned structure members do not violate alignment
constraints (at a cost to performance).
In order to illustrate a situation that might cause unaligned memory
access, consider the example_func()
implementation from
above. The first line of the function accesses two bytes (16 bits) of data
from a memory address passed in as a function parameter; however, we do not
have any other information about this address. If the data
parameter points to an odd address, for example 0x10005, then we end up
with an unaligned access. The main places where you will potentially run
into unaligned accesses are when accessing multiple bytes of data (in a
single transaction) through a pointer, and when casting variables to
longer types.
Conceptually, the way to avoid unaligned access is to use byte-wise memory
access because accessing single bytes of memory cannot violate alignment
constraints. For example, for a little-endian system we could replace the
example_func()
implementation with the following:
    void fixed_example_func(unsigned char *data)
    {
        u16 value = data[0] | data[1] << 8;
        [...]
    }
memcpy()
is another possible alternative in the general case,
as long as either the source or destination is a pointer to an 8-bit data
type (i.e. char
). Inside the kernel, two macros are provided
which simplify unaligned accesses: get_unaligned()
and
put_unaligned()
. It is worth noting that using any of these
solutions is significantly slower than accessing aligned memory, so it is
wise to completely avoid unaligned access where possible.
Another option is to simply document the fact that
example_func()
requires a 16-bit-aligned data parameter, and
rely on the call sites to ensure this or simply not use the
function. Linux's optimized routine for comparing two Ethernet addresses
(compare_ether_addr()) is a real-life example of this: the
addresses must be 16-bit-aligned.
I have applied my newfound knowledge to the task of writing some kernel documentation, which covers this topic in more detail. If you want to learn more, you may want to read the most recent revision (as of this writing) of the document. Additionally, the initial revision of the document generated a lot of interesting discussion, but be aware that the initial attempt contained some mistakes. Finally, chapter 11 of Linux Device Drivers touches upon this topic.
I'd like to thank everyone who helped me improve my understanding of unaligned access, as this article would not have been possible without their assistance.
Index entries for this article
GuestArticles | Drake, Daniel
Memory access and alignment
Posted Dec 6, 2007 5:00 UTC (Thu) by cventers (guest, #31465) [Link]
One place where alignment does matter on x86 is in SMP. As noted by the glibc documentation, aligned word-sized reads and writes are atomic on all known POSIX platforms. If you respect memory visibility issues, there are certain ways you can exploit this fact to avoid the overhead of locks. In fact, if you notice, the kernel's atomic_t type is pretty straightforward on most platforms - especially the simple read and store operations. The only requirement is alignment and then it is atomic for free.
Memory access and alignment
Posted Dec 6, 2007 21:12 UTC (Thu) by dsd (guest, #49212) [Link]
A slightly tweaked version of the document has now been submitted for inclusion with the kernel documentation. You can read it here: http://article.gmane.org/gmane.linux.kernel/609571
Memory access and alignment
Posted Dec 9, 2007 18:28 UTC (Sun) by oak (guest, #2786) [Link]
On some ARM hardware some of the unaligned accesses may not provide an exception, see "Unaligned memory access" here: http://www.wirelessnetdesignline.com/howto/199901786
Here's an idea
Posted Dec 9, 2007 23:27 UTC (Sun) by pr1268 (subscriber, #24648) [Link]
Here's an idea: Let's all go back to 8-bit architectures and we won't have this problem anymore. ;-)
Okay, that was my one dorky sarcastic comment for the day.
Seriously, I'm curious about what happens without programmer intervention: Recently I had to code for a struct that looked like this (a similar example is given in the packed attribute link in the article):
    struct my_object {
        uint32_t  a;
        char      c;
        char      filler[3];
        uint32_t *p1;
        char    **p2;
        char    **p3;
    };
I'm using a 32-bit computer, so I know all pointers occupy 4 bytes. Deal is, the char filler[3] array was not going to be used in any shape or form in my program, but I instinctively put it there to pad the whole structure to a multiple of 4 bytes. Would GCC have done that for me automatically if I had not included the char filler[3]? Or, would GCC have re-arranged things had I moved the char filler[3] to the bottom of the structure (leaving char c where it is)? How does the -Os optimization affect this? Thanks!
Padding structures elsewhere
Posted Dec 9, 2007 23:34 UTC (Sun) by pr1268 (subscriber, #24648) [Link]
By the way, in a prior life I programmed mainframes in COBOL where we used fixed-length records. Thus explaining my choice of the identifier filler. Filler padding gets interesting in COBOL when working with packed-decimal numbers (not to mention the joys of a S0C7 exception), but I digress...
Here's an idea
Posted Dec 10, 2007 7:25 UTC (Mon) by jzbiciak (guest, #5246) [Link]
Yes. On an architecture with alignment constraints, the "filler[3]" field isn't necessary. The compiler will insert padding. You can check this out with the offsetof() macro.
For example, if I compile the following program on my 64-bit Opteron, you can see the pointers all get aligned to 8-byte boundaries like they're supposed to. If I compile it on a 32-bit machine, they get aligned to 4-byte boundaries. This is regardless of whether that filler field is there.
    #include <stdio.h>
    #include <stddef.h>

    typedef unsigned int uint32_t;

    typedef struct obj_1 {
        uint32_t  a, b;
        char      c;
        char      filler[3];
        uint32_t *p1;
        char    **p2;
        char    **p3;
    } obj_1;

    typedef struct obj_2 {
        uint32_t  a, b;
        char      c;
        uint32_t *p1;
        char    **p2;
        char    **p3;
    } obj_2;

    int main()
    {
        printf("offset of obj_1.a:      %5d bytes\n", (int) offsetof(obj_1, a));
        printf("offset of obj_1.b:      %5d bytes\n", (int) offsetof(obj_1, b));
        printf("offset of obj_1.c:      %5d bytes\n", (int) offsetof(obj_1, c));
        printf("offset of obj_1.filler: %5d bytes\n", (int) offsetof(obj_1, filler));
        printf("offset of obj_1.p1:     %5d bytes\n", (int) offsetof(obj_1, p1));
        printf("offset of obj_1.p2:     %5d bytes\n", (int) offsetof(obj_1, p2));
        printf("offset of obj_1.p3:     %5d bytes\n", (int) offsetof(obj_1, p3));
        putchar('\n');
        printf("offset of obj_2.a:      %5d bytes\n", (int) offsetof(obj_2, a));
        printf("offset of obj_2.b:      %5d bytes\n", (int) offsetof(obj_2, b));
        printf("offset of obj_2.c:      %5d bytes\n", (int) offsetof(obj_2, c));
        printf("offset of obj_2.p1:     %5d bytes\n", (int) offsetof(obj_2, p1));
        printf("offset of obj_2.p2:     %5d bytes\n", (int) offsetof(obj_2, p2));
        printf("offset of obj_2.p3:     %5d bytes\n", (int) offsetof(obj_2, p3));
        putchar('\n');
        printf("sizeof(int)   = %d bytes\n", (int) sizeof(int));
        printf("sizeof(long)  = %d bytes\n", (int) sizeof(long));
        printf("sizeof(void*) = %d bytes\n", (int) sizeof(void *));
        return 0;
    }
Output on a 32-bit machine:
    offset of obj_1.a:          0 bytes
    offset of obj_1.b:          4 bytes
    offset of obj_1.c:          8 bytes
    offset of obj_1.filler:     9 bytes
    offset of obj_1.p1:        12 bytes
    offset of obj_1.p2:        16 bytes
    offset of obj_1.p3:        20 bytes

    offset of obj_2.a:          0 bytes
    offset of obj_2.b:          4 bytes
    offset of obj_2.c:          8 bytes
    offset of obj_2.p1:        12 bytes
    offset of obj_2.p2:        16 bytes
    offset of obj_2.p3:        20 bytes

    sizeof(int)   = 4 bytes
    sizeof(long)  = 4 bytes
    sizeof(void*) = 4 bytes
Output on a 64-bit machine:
    offset of obj_1.a:          0 bytes
    offset of obj_1.b:          4 bytes
    offset of obj_1.c:          8 bytes
    offset of obj_1.filler:     9 bytes
    offset of obj_1.p1:        16 bytes
    offset of obj_1.p2:        24 bytes
    offset of obj_1.p3:        32 bytes

    offset of obj_2.a:          0 bytes
    offset of obj_2.b:          4 bytes
    offset of obj_2.c:          8 bytes
    offset of obj_2.p1:        16 bytes
    offset of obj_2.p2:        24 bytes
    offset of obj_2.p3:        32 bytes

    sizeof(int)   = 4 bytes
    sizeof(long)  = 8 bytes
    sizeof(void*) = 8 bytes
Here's an idea
Posted Dec 10, 2007 7:40 UTC (Mon) by jzbiciak (guest, #5246) [Link]
Oh, and bit-fields (that is, fields of somewhat arbitrary bit widths) tend to be based around the size of "int" on a given machine. That is, they tend to be word oriented. The fields pack together to form words, and a field doesn't straddle two words.
Note that I say "tend to." Bit field layout and struct layout are actually ABI issues (ABI == Application Binary Interface). For example, here's the SVR4 i386 ABI. Take a look starting at page 27. In the case of the SVR4 ABI, it appears bitfields are actually packed in terms of their base type. I believe the latest C standard only wants you to use signed int, unsigned int and _Bool, though.
Bitfields in C++
Posted Dec 10, 2007 21:31 UTC (Mon) by pr1268 (subscriber, #24648) [Link]
Bjarne Stroustrup discusses a C++ Standard Template Library (STL) vector of bools (The C++ Programming Language, Special Edition, p. 458) - this is designed to overcome the wasted space of a 1-bit data structure taking up 16, 32, or 64 bits of memory.
Bitfields in C++
Posted Dec 11, 2007 14:24 UTC (Tue) by jzbiciak (guest, #5246) [Link]
Keep in mind that a bitvector (sometimes called a bitmap) is something rather different from a bitfield member of a struct or class. Bitvectors have array-like semantics and are quite typically used to represent things such as set membership (e.g. a 1 indicates membership, a 0 indicates lack of membership). Bitfield members, on the other hand, are scalar quantities. There is no indexing associated with them, and their range is often much more than 0 to 1.
Bitfield members are often useful for storing values with limited range in a compact manner. For example, consider dates and times. The month number only goes from 1-12 and the day number only goes from 1-31. You can store those in 4 and 5 bits respectively. If you store year-1900 instead of the full 4 digit number, you can get by with 7 or 8 bits there. Hours fit in 5 bits, minutes and seconds in 6 bits each. That leads to something like this:
    struct packed_time_and_date {
        unsigned int year   : 7;   /* 7 bits for year - 1900    */
        unsigned int month  : 4;   /* 4 bits for month (1 - 12) */
        unsigned int day    : 5;   /* 5 bits for day (1 - 31)   */
        unsigned int hour   : 5;   /* 5 bits for hour (0 - 23)  */
        unsigned int minute : 6;   /* 6 bits for minute (0 - 59) */
        unsigned int second : 6;   /* 6 bits for second (0 - 59) */
    };
Now, if I did my math right, that adds up to 33 bits. Under the x86 UNIX ABI, the compiler will allocate two 32-bit words for this. The first five fields will pack into the first 32-bit word, and the sixth field will be in the lower 6 bits of the second word. That is, sizeof(struct packed_time_and_date) will be 8. If you can manage to squeeze a bit out of the year (say, by limiting yourself to a smaller range of years), this fits into 32 bits, and sizeof(struct packed_time_and_date) will be 4. In either case, the compiler will update these fields-packed-within-words with an appropriate sequence of loads, shifts, masks, and stores. No C++ or STL necessary.
(If you want to see a more in-depth (ab)use of this facility, check out this source file. It constructs a union of structs, each struct modeling a different opcode space in an instruction set.)
Anyhow, this is what I was referring to as "bit-fields" in the comment you replied to. As you can see, it's rather different than bitvectors. :-)
Bitvectors in C++ and STL have their own problems. I've been reading up on C++ and STL, and vectors of bits apparently have a checkered history in STL. There was one implementation (I believe at SGI) of bitvector (separate of vector<bool>) that didn't make it into the standard, and yet another (apparently a specialization of vector<> especially for bool) that many feel doesn't actually provide proper iterators. I wish I could provide references or details.
All I know is that it's enough for me to avoid STL on bitvectors, and just resort to coding these manually. It's served me well so far. And I don't see a need to invoke generics like sort on them. Something tells me most generic algorithms aren't as interesting on bitvectors. ;-)