ez-alloc, libeze, that music player thing
So it's been a pretty long week of hacking. I went deep on the memory allocator and eventually got one debugged and working. I made some very silly mistakes which made what is fairly straightforward code a lot harder to get working than it should have been but that's the nature of the beast.
It grew into about 1000 lines w/ comments, but then again GNU libc malloc is 5000 in just the main file.
I'm kinda bummed out about the performance after all that though, which is leading down another hole ...
So basically I started by taking a few ideas from GNU libc malloc:
- Separate free lists for every size of small allocations.
- Coalescing of free blocks is done before a large allocation not at free.
- Best fit.
- There is an 8 byte (size_t) allocation overhead which contains some management bits.
- The minimum allocation size is 32 bytes.
- There is a previous size field if the previous block is free to speed up coalescing.
Plus some other ideas and changes.
- Large free blocks are stored in a balanced binary tree rather than an approximate skip-list (I think this is what malloc uses).
- Tiny allocations use a completely different page based allocator.
- They have zero per-allocation overhead;
- The minimum size is 8 bytes (void *) with an 8 byte alignment;
- Power of 2 sizes are aligned to their size;
- Allocations are never coalesced.
- If a 'large' free block is within the tiny allocation range then it will never be found by the large allocator so to avoid polluting the index they are stored in a 'lost' list. These are not really lost and will be coalesced automatically.
Before I got this working I decided to drop the per-size small lists. It was cluttering the code and I wanted to see how it worked without it.
I also added a huge allocator which uses a separate mmap for very large allocations, which is also what GNU libc does.
For accuracy testing I created a LD_PRELOAD module which replaced the memory allocation calls with my own and ran some binaries like ls or xterm. I can also use it to create a log of memory actions which I can replay so I can actually debug the thing.
For performance and overhead testing I have a small not-particularly representative test using libavl (it just allocates 1 larger block then thousands of 32-byte blocks). I read /usr/share/dict/words into an array. Then in order I insert them all into the tree, look them all up, then remove them. And another also not-particularly representative loop which frees and allocates random amounts.
Things that work well
I think the tiny allocator quite neat. It uses page alignment and a magic number to be able to allow ez_free() to quickly check whether an allocation is a tiny allocation and handle it appropriately.
It helps avoid fragmentation in the main memory and can save a lot of memory if there are many allocations of the same size.
It would be a big problem if a pointer was mis-identified as a tiny address when it wasn't. The possibility of this is not zero without additional checks, although very very unlikely.
Using ez-list and ez-tree makes the code relatively straightforward and easy to read. Having the same node pointers being used for either is a big plus - I can move nodes between trees or lists easily. List nodes can be removed without knowing the list header so I can have many lists but not care where they come from when performing a coalsecing (I use a single status bit to indicate wheather the block is in a list or the one tree). This makes the coalescing function 'just work' without having to worry about maintaining other state as mergable blocks are merged.
Things I really fucked up
Probably the worst was that I forgot to mask out the status bits of the size in the tree compare function. So ... nodes would go missing when I fiddled with the bits! These bits don't get changed very often while the node is in the tree so that lead to difficult to track down problems.
gcc and weirdness
The C99+ aliasing rules broke certain sequences of ez-list calls
as well. For example if you perform node.remove on the head of
the list and then list.addhead, it would write the pointers back
in the wrong order. Another subtle and hard to find bug and gcc
didn't warn about it. I found a workaround using a soft
asm volatile ("" : : : "memory"); in
node.remove but I don't really like it as I can't be sure other
sequences might break. Using -fno-strict-aliasing works too of
course. Maybe i'll just make the list and node pointers volatile,
I don't think it will affect performance negatively doing this and
the code pretty much requires them to behave that way.
I've had some other very strange behaviour with gcc as well, sometimes the optimiser makes radically different decisions for no obvious reason. It appears to do it even when I haven't made any code changes, but I don't think my memory is reliable to be absolutely sure on that.
Another waste of time?
The xterm test was quite interesting. With the tiny allocator enabled it quickly reaches a steady state with 21 pages of tiny allocations having over 50% occupancy. There are only 6 free large blocks, and 13 lost ones.
However pmap indicates more total memory is consumed vs glibc malloc which was a bit of a bummer.
For the performance tests I fix my CPU at a specific frequency to get more consistent and repeatable results.
The tree test runs about 10% faster than system malloc and saves roughly 1MB of maximum rss (/usr/bin/time), about 8.5MB vs 9.5MB. At least that's a winner. Actually if I remove the lookup loop so that the alloc code is more of the total time then the performance nearly equalises. So most of the performance improvement is down to allocation alignment and not using a tiny allocator. Which is quite useful all the same.
However the random test runs about 2-3x slower than system malloc and uses about the same total amount of memory. That was a real bummer and every time I compile the code it seems to get a bit slower (and use more peak memory to boot). But ... this is absolutely nothing like the allocation used in a real program such as xterm so I don't even know if it's worth worrying about. I should try with something more demanding but they likely already use pools or the like.
I suppose even if it is a bit slower it can be used as a low-overhead free-supporting pool allocator which might be useful.
I did some shitty profiling and found that the biggest cost was probably the tree manipulation during ez_alloc(). There are some simplish things I haven't tried that should help it.
For example when i split a free block I always remove the node and then re-insert it if there's any of it left (either the lost list or back into the tree). If I can determine that the new node has not changed ordering then it is possible to just replace the entry directly and save both the remove and insert.
When I coalesce nodes I find full consecutive runs and remove all the ones i've found as I go. Then insert the coalesced one back. It should be possible to do the same check here and save one of the removes and the insert in some cases but it is a bit more difficult to determine (maybe just find the largest node that belongs to the tree and try that?).
All non-tiny allocs trigger a coalesce and then search the free block tree. GNU malloc can sometimes use the most recently freed block here, which in this case would avoid the tree altogether.
Finally, I could always add back the specific-sized 'small' lists for recently freed blocks but I want to avoid that for various reasons.
I can't see these doubling the speed for this silly random test-case but I could be wrong.
I suppose I could also try other indexing structures as an alternative to the balanced tree such as a skip-list (really, a skip-index-array I guess). I've not used them before but I had a break yesterday and today it seems like it shouldn't be too difficult.
Update: 3/2/2019 I tried the skip-index idea. And lo, once I got it working the random test is now on par with GNU libc. I guess I will go with that then. I suppose then all i've really done is re-implement GNU libc malloc in the end; although the tiny allocator might benefit some usages. And well - the code is 1/5 the size and considerably more readable. And i'll wrap it up in a way that can be used as a pool allocator. I'm back to work this week so i'll have to give it a break for a while anyway; i'm also exhausted and need to get away from it. Spent some time in the garden today although I discovered the 47C head might've killed more roses out the front. Sigh.
I haven't really looked into C profiling for years. Well apart from /usr/bin/time or gettimeofday() (although I just found clock_gettime()).
I suppose people use oprofile? I should at least look into it.
gprof is pretty much useless. Well I guess it did identify the tree manipulation as the major cost but looking at the code would tell you that anyway. It just doesn't sample often enough (i'm getting well under 1% sampling of running time).
Searches tend to turn up advice to use cachegrind, ... but I can't see anything resembling cycle counts in any of it's output. Ubuntu wont let me install kcachegrind because of some dependency issue (probably because i've blacklisted some snot like fucking-shit-audio) but I can't see how that can make up numbers which aren't there already
So ... I started working on my own as well. Yeah like that's a good idea! Ok, maybe it's really time to give it a break. It's really not something I need and i'm not sure -finstrument-functions is going to do a good enough job. I'll see i'll see.
I did snarf some more code out of ezesdk that I've added to libeze. I've got a very small ez-elf that lets you do basic but useful things like lookup function names.
That music player thing
So this has been a pretty extended foray well beyond the code I needed to implement a music player! I haven't touched that for weeks - on the other hand I have just left it running for those weeks and it's still going fine. The remote works well. I've got some tools to send commands around and it all just works.
The last time i mentioned it I was wondering how to do a menu but that night I came up with the obvious answer for a device with no display. Just make it talk!
So I poked around with espeak-ng for a while and fiddled with the player code so I can inject synthesised signal into the audio stream.
I got as far as implementing a say-this-song function. When you press the lips button it says the title and artist of the current song if they're available or just the filename if they're not. As I saw mention in a mailing list; it's quite a clear voice but unintentially quite hilarious at times! It also speaks too fast and doesn't pause anywhere nearly long enough for full stops but at least the former can be adjusted.
I then hit a bit of a design snag. The keyboard monitor takes the keys pressed and turns them into high level player commands. For example the cursor keys are fast-forward/rewind/next-track/previous-track. But during a menu these would change behaviour and be for navigating the menu.
So I wasn't sure where I should put the menu driver logic. If it was in the player then the keyboard module would just send through raw keypresses for the other keys and the player would behave differently depending on what's going on.
If it's in the keyboard module then it could route messages around to potentially other components and the player would implement higher level functions like 'enter menu mode', 'say this', 'preview that', 'rename playlist' and so on.
My initial ideas were a but muddled and the second case seemed a bit messy. But now I think about it again it makes a lot more sense and I can leave the player to playing music and it only really needs to maintain a 'playing' or 'not playing' state and do some low level stream pause management to make sure it all works with alsa.
Regardless, I do have a goal here and once I add a basic menu that would basically be a working player but i'm not sure when i'll get back to it.