modular cross-platform jjmpeg
So well this isn't an announcement of a new release because i'm
basically too lazy to package it up at the moment.
But I have created a modular jjmpeg for Java 11 and added support
for cross compilation. I'm going to be playing with some videos
for work so I thought it might be useful if I generate any tools
for other users.
As part of it i've moved it's use of CObject over to using the
previously mentioned nativez stuff. I also moved a bunch of the
JNI utility functions over to nativez because they're generally
Who knows when i'll get to publishing it.
ez-alloc, libeze, that music player thing
So it's been a pretty long week of hacking. I went deep on the
memory allocator and eventually got one debugged and working. I
made some very silly mistakes which made what is fairly
straightforward code a lot harder to get working than it should
have been but that's the nature of the beast.
It grew into about 1000 lines w/ comments, but then again GNU libc
malloc is 5000 in just the main file.
I'm kinda bummed out about the performance after all that though,
which is leading down another hole ...
So basically I started by taking a few ideas from GNU libc malloc:
- Separate free lists for every size of small allocations.
- Coalescing of free blocks is done before a large allocation not at free.
- Best fit.
- There is an 8 byte (size_t) allocation overhead which contains
some management bits.
- The minimum allocation size is 32 bytes.
- There is a previous size field if the previous block is free
to speed up coalescing.
Plus some other ideas and changes.
- Large free blocks are stored in a balanced binary tree rather
than an approximate skip-list (I think this is what malloc
- Tiny allocations use a completely different page based allocator.
- They have zero per-allocation overhead;
- The minimum size is 8 bytes (void *) with an 8 byte alignment;
- Power of 2 sizes are aligned to their size;
- Allocations are never coalesced.
- If a 'large' free block is within the tiny allocation range
then it will never be found by the large allocator so to avoid
polluting the index they are stored in a 'lost' list. These
are not really lost and will be coalesced automatically.
Before I got this working I decided to drop the per-size small
lists. It was cluttering the code and I wanted to see how it
worked without it.
I also added a huge allocator which uses a separate mmap for very
large allocations, which is also what GNU libc does.
For accuracy testing I created a LD_PRELOAD module which replaced
the memory allocation calls with my own and ran some binaries like
ls or xterm. I can also use it to create a log of memory actions
which I can replay so I can actually debug the thing.
For performance and overhead testing I have a small
not-particularly representative test using libavl (it just
allocates 1 larger block then thousands of 32-byte blocks). I
read /usr/share/dict/words into an array. Then in order I insert
them all into the tree, look them all up, then remove them.
And another also not-particularly representative loop which frees
and allocates random amounts.
Things that work well
I think the tiny allocator quite neat. It uses page alignment and
a magic number to be able to allow ez_free() to quickly check
whether an allocation is a tiny allocation and handle it
It helps avoid fragmentation in the main memory and can save a lot
of memory if there are many allocations of the same size.
It would be a big problem if a pointer was mis-identified as a
tiny address when it wasn't. The possibility of this is not zero
without additional checks, although very very unlikely.
Using ez-list and ez-tree makes the code relatively
straightforward and easy to read. Having the same node pointers
being used for either is a big plus - I can move nodes between
trees or lists easily. List nodes can be removed without knowing
the list header so I can have many lists but not care where they
come from when performing a coalsecing (I use a single status bit
to indicate wheather the block is in a list or the one tree).
This makes the coalescing function 'just work' without having to
worry about maintaining other state as mergable blocks are merged.
Things I really fucked up
Probably the worst was that I forgot to mask out the status bits
of the size in the tree compare function. So ... nodes would go
missing when I fiddled with the bits! These bits don't get
changed very often while the node is in the tree so that lead
to difficult to track down problems.
gcc and weirdness
The C99+ aliasing rules broke certain sequences of ez-list calls
as well. For example if you perform node.remove on the head of
the list and then list.addhead, it would write the pointers back
in the wrong order. Another subtle and hard to find bug and gcc
didn't warn about it. I found a workaround using a soft
asm volatile ("" : : : "memory"); in
node.remove but I don't really like it as I can't be sure other
sequences might break. Using -fno-strict-aliasing works too of
course. Maybe i'll just make the list and node pointers volatile,
I don't think it will affect performance negatively doing this and
the code pretty much requires them to behave that way.
I've had some other very strange behaviour with gcc as well,
sometimes the optimiser makes radically different decisions for no
obvious reason. It appears to do it even when I haven't
made any code changes, but I don't think my memory is reliable to
be absolutely sure on that.
Another waste of time?
The xterm test was quite interesting. With the tiny allocator
enabled it quickly reaches a steady state with 21 pages of tiny
allocations having over 50% occupancy. There are only 6 free
large blocks, and 13 lost ones.
However pmap indicates more total memory is consumed vs glibc
malloc which was a bit of a bummer.
For the performance tests I fix my CPU at a specific frequency to
get more consistent and repeatable results.
The tree test runs about 10% faster than system malloc and saves
roughly 1MB of maximum rss (/usr/bin/time), about 8.5MB vs 9.5MB.
At least that's a winner. Actually if I remove the lookup loop so
that the alloc code is more of the total time then the performance nearly
equalises. So most of the performance improvement is down to
allocation alignment and not using a tiny allocator. Which is quite useful all the same.
However the random test runs about 2-3x slower than system malloc
and uses about the same total amount of memory. That was a real
bummer and every time I compile the code it seems to get a bit
slower (and use more peak memory to boot). But ... this is
absolutely nothing like the allocation used in a real program such
as xterm so I don't even know if it's worth worrying about. I
should try with something more demanding but they likely already
use pools or the like.
I suppose even if it is a bit slower it can be used as a
low-overhead free-supporting pool allocator which might be useful.
I did some shitty profiling and found that the biggest cost was
probably the tree manipulation during ez_alloc(). There are some
simplish things I haven't tried that should help it.
For example when i split a free block I always remove the node and
then re-insert it if there's any of it left (either the lost list
or back into the tree). If I can determine that the new node has
not changed ordering then it is possible to just replace the entry
directly and save both the remove and insert.
When I coalesce nodes I find full consecutive runs and remove all
the ones i've found as I go. Then insert the coalesced one back.
It should be possible to do the same check here and save one of
the removes and the insert in some cases but it is a bit more
difficult to determine (maybe just find the largest node that
belongs to the tree and try that?).
All non-tiny allocs trigger a coalesce and then search the free
block tree. GNU malloc can sometimes use the most recently freed
block here, which in this case would avoid the tree altogether.
Finally, I could always add back the specific-sized 'small' lists
for recently freed blocks but I want to avoid that for various
I can't see these doubling the speed for this silly random
test-case but I could be wrong.
I suppose I could also try other indexing structures as an
alternative to the balanced tree such as a skip-list (really, a
skip-index-array I guess). I've not used them before but I had a
break yesterday and today it seems like it shouldn't be too
Update: 3/2/2019 I tried the skip-index idea. And lo, once
I got it working the random test is now on par with GNU libc. I
guess I will go with that then. I suppose then all i've really
done is re-implement GNU libc malloc in the end; although the tiny
allocator might benefit some usages. And well - the
code is 1/5 the size and considerably more readable. And
i'll wrap it up in a way that can be used as a pool allocator.
I'm back to work this week so i'll have to give it a break for a
while anyway; i'm also exhausted and need to get away from it.
Spent some time in the garden today although I discovered the 47C
head might've killed more roses out the front. Sigh.
I haven't really looked into C profiling for years. Well apart
from /usr/bin/time or gettimeofday() (although I just found
I suppose people use oprofile? I should at least look into it.
gprof is pretty much useless. Well I guess it did identify the
tree manipulation as the major cost but looking at the code would
tell you that anyway. It just doesn't sample often enough (i'm
getting well under 1% sampling of running time).
Searches tend to turn up advice to use cachegrind, ... but I can't
see anything resembling cycle counts in any of it's output.
Ubuntu wont let me install kcachegrind because of some dependency
issue (probably because i've blacklisted some snot like
fucking-shit-audio) but I can't see how that can make up numbers
which aren't there already
So ... I started working on my own as well. Yeah like that's a
good idea! Ok, maybe it's really time to give it a break. It's
really not something I need and i'm not sure
-finstrument-functions is going to do a good enough job. I'll see
I did snarf some more code out of ezesdk that I've added to
libeze. I've got a very small ez-elf that lets you do basic but
useful things like lookup function names.
That music player thing
So this has been a pretty extended foray well beyond the code I
needed to implement a music player! I haven't touched that for
weeks - on the other hand I have just left it running for those
weeks and it's still going fine. The remote works well. I've got
some tools to send commands around and it all just works.
The last time i mentioned it I was wondering how to do a menu but
that night I came up with the obvious answer for a device with no
display. Just make it talk!
So I poked around with espeak-ng for a while and fiddled with the
player code so I can inject synthesised signal into the audio
I got as far as implementing a say-this-song function. When you
press the lips button it says the title and artist of the current
song if they're available or just the filename if they're not. As
I saw mention in a mailing list; it's quite a clear voice but
unintentially quite hilarious at times! It also speaks too fast
and doesn't pause anywhere nearly long enough for full stops but
at least the former can be adjusted.
I then hit a bit of a design snag. The keyboard monitor takes the
keys pressed and turns them into high level player commands. For
example the cursor keys are
fast-forward/rewind/next-track/previous-track. But during a menu
these would change behaviour and be for navigating the menu.
So I wasn't sure where I should put the menu driver logic. If it
was in the player then the keyboard module would just send through
raw keypresses for the other keys and the player would behave
differently depending on what's going on.
If it's in the keyboard module then it could route messages around
to potentially other components and the player would implement
higher level functions like 'enter menu mode', 'say this',
'preview that', 'rename playlist' and so on.
My initial ideas were a but muddled and the second case seemed a
bit messy. But now I think about it again it makes a lot more
sense and I can leave the player to playing music and it only
really needs to maintain a 'playing' or 'not playing' state and do
some low level stream pause management to make sure it all works
Regardless, I do have a goal here and once I add a basic menu that
would basically be a working player but i'm not sure when i'll get
back to it.
Fixed a couple of bugs in the blogz code to do with post urls, and
added newer/older links on the individual post page.
I think I can now setup robots.txt so it doesn't over-index.
Should be able to just block '/tag' and '/blog?' I think. So it
just indexes the front page and then anything /post/* it finds
from it. Any cralwer should be able to find the whole blog from
one post now they're linked.
Well i've put it in and i'll keep an eye on the logs to see what
I can't add the post title to the prev/next post links yet but
i've been working on a database based index which will allow it
and some other useful things. I could just hack another index
into the compiled-in indices but I think that path has outlived
Memory, IPC. Tradeoffs.
I've been playing around with a few ideas for libeze but it's not
gotten anywhere particular yet. I'm getting stuck on the
tradeoffs - even though they don't really matter much.
IPC - Serialisation
It started when I read a rant against serialsiation formats; this
I agree with. An internal blob isn't that important but I thought
i'd look at some to see if any where useful.
This is old and simple. It was written by a smart bunch from
a smart company from a time when smarter decisions were being
made about things like this. The biggest drawback is the
big-endian nature, but I could always fudge it and just pass
native-endian values and call it RDX or something.
I then had the bright idea of being able to decode structures
without any additional memory - merely referencing the packet
data. But this creates a couple of additional areas where it
would diverge: strings would need to include their trailing nul
byte to be useful, and embedded arrays might need specific
So really, it's not quite RFX anymore either. OTOH R is
listed as `supporting' XDR but it's version is modified as
XDR is also not self describing but descriminated
descriminated unions go a long way to handling versioning
which is the main reason I might want such a feature.
- ASN.1 BER/DER/CER/PER
There is actually some stuff I quite like about BER. It's
self describing and relatively compact.
There's a lot that makes it a pain. A full implementation
needs to implement a lot as it is vastly over-engineered; this
means it is complex and likely to be buggy. REAL numbers are
encoded in a way that needs a lot of fiddling about on either
It's not all wrong, but it's not too right either.
- Protocol buffers
I've had the misfortune to have to deal with these for work.
The actual on-wire format isn't too bad. It's not
self-describing but you can safely skip buts you don't
understand. It's relatively compact. It's also a bit
The SDK is a complete pile of shit though. The Java
implementation compiles gigantic code that runs fucking slow
(List<Float>???). It moves all of the protocol variance
decisions into the calling code so it's a messy pain to use as
well: sort of negating one of the reasons to use a
self-describing format in the first place.
For work I wrote my own decoder when I could - a couple of
dozen lines of code. But some projects use such complex
structures you end up with a gigantic library to decode it
(supplied by protoc) and then a gigantic library to marshall
it's output into something usable (that you write by hand).
It's just fucked. Even the same implementation can't decide
how to canonically encode empty or single-element arrays, null
or emptry strings, and so on.
It's not very compact and while it's realtively simple to
parse it's not trivial either.
And then it's clumsy as fuck to use from any non-dynamic
language. Getting fields you even know are there is a pain.
Handling variances is worse.
- CBOR (and half a dozen others)
Well this is just binary JSON. It fixes the end-to-end
canonical encodingness and is arguably faster to parse but
it's still going to be clumsy to use and not significnatly
more compact except for arrays of small integers.
It also has some questionable design decisions and is already
bloating with a dozen extension types - it seems like it wants
to look as 'valid' as ASN.1/BER but with an even messier
The whole "look we have a completely-defined 256 element jump
table" is just a fucking weird thing to have in a
serialisation format RFC. And nothing says simple like 256
right? What on earth is that even about?
This is another RFC I came across along the way - not sure how
as it just took me 10 minutes to re-find it.
Honestly it's a weird arsed format that looks like someone's
internal stuff they decided to publish. I can't imagine
anyone uses it.
I would note that with some small changes the meta-data I have can
support pretty much all of these formats.
But I pretty much wasted a few hours of my life on this. After all
that fucking around I would be inclined to just dump structures
around if i didn't need strings! I have several desirable
features which can't be covered by a single format anyway,
although two would probably suffice.
e.g. here's some use cases.
- Persistent storage.
- Fast decoding
- Objects are good enough to be used directly without having
- Complex and nested structures?
- Self describing?
- Local IPC
- Very fast encoding and decoding
- Possibility to decode without auxilliary memory
- Simple and flat structures
- Keep security in mind
- Internet IPC, External interfaces.
- Possibly to encode and decode in a streaming fashion
- Simple and unambiguous to help avoid writing-in security issues
- Self describing?
Ok, maybe 3 formats. Although i'm not really interested in the
3rd at the moment.
The Local IPC case is basically a subset of XDR, aside from the
The rest? Plenty there but none are a great fit.
Memory allocation in C is simple enough but managing it can be a
bit of a pain, hence pools.
Another issue is that a given implementation enforces certain
constraints which can waste (significant) memory: for example on
an amd64 platform, malloc() allocates at least 32-bytes for each
allocation even if it's only a 5 byte string (this is is the
minimum required to track free blocks). And every allocation also
saves a size_t before the memory block so it knows how big it was
when you call free: when you have many small objects of the same
size this wastes a lot of memory. A more subtle problem is that
memalign() must also save this size, so that if you start
allocating many aligned structures you start to quickly get a
Dealing with strings in general is a bit of a pain (and a source
of many security issues) and only the GNU-sepcific obstack really
provides any good support for it.
So I experimented with a few ideas.
High level internals of malloc() & free()
An implementation of this pair basically all work the same.
malloc will find a block of free memory big enough to hold the
requested size plus space to hold that size. Some area of this
block will be reserved for the allocation and the free block will
either be reduced by the same amount, or removed (if the whole
block is used).
free() will take the saved size and address and mark it as free.
As part of this process it will find out if there are other free
blocks immediately adjacent to this new free block and coalesce
them into a single larger block if so. There are only 4 cases:
isolated free, adjacent before, adjacent after, adjacent before
The memory in the free block itself is used to hold the
controlling structures required for the two functions. They cease
to exist once the memory is allocated.
Additionally, for finding blocks there are two basic strategies:
The free list is stored sorted by address. It is scanned from
the start and the first block with enough memory to fit the
request is used.
The free list is sorted by size. It is scanned from a block
which is larger-or-equal to the desired size and the first one
Each has their benefits and trade-offs.
first-fit may be slower to allocate but will more effectively
utilise memory - less total memory will be required. I did some
basic and non-exhaustive testing that gave me this result. It can
also be implemented with very simple code and the same list can be
used for performing coalescing.
best-fit can be faster to allocate as you find the correct block
directly. Based on my testing it requires more total memory.
Additional data structures are required to implement the free()
So finally, these choices any other indices arequired determine
the data structures required to track and find the free blocks,
and thus determine the minimum allocation size.
GNU libc malloc() uses a size_t for the block size, a
double-linked list for the free list, and another size_t at the
end of the free block for fast coalescing. So the minimum size of
a free block (and hence the minimum allocation size) is 32 bytes
on a 64-bit system 16 bytes on a 32-bit system. I think it uses
The first memory allocator I knew was the one from AmigaOS via
AllocMem(). This only has a single constraint in that the minimum
allocation is 8 bytes. There is no fixed allocation overhead to
store the allocated size as you need to also provide the size to
FreeMem(). Free nodes are stored as a single-linked list.
It takes very little code to implement this and it utilises memory
very well (it can't really be more utilised than 100%).
The drawback is that both allocation and deallocation are O(N)
operations since both have to walk the free list. I guess 'slow'
is relative as they did ok on a 7Mhz CPU with a global(!) shared
memory list. But because one knew it was slow you wrote code to
avoid it (SunOS also had a dreadfully slow allocator so it wasn't
unique in this regard).
On memory constrainted systems I think it's still a useful
technique. free() can also be accelerated somewhat by using a
balanced tree by bumping up the minimum allocation.
So the first idea I played with was a page allocator which would
underly more specific algorithms. The idea would be that it would
require the free size like FreeMem() and so allow pages to be
allocated with no overhead and no wastage.
I didn't really get anywhere with this and then revisited it in a
different way later on (ez-alloc-alloc).
I diverted for a little while trying to work on a string
allocator. It would take some number of pages as an allocation
block. In this case there would be no way to free individual
strings. But they wouldn't require any alignment or size overhead
Strings can be built a character at a time, and large strings are
moved over to their own block.
Allocation can be made very fast by only searching the most recent
block. Or it can be made more compact by searching all blocks.
Or some trade-off in-between.
I think this could be the basis of a good no-free high performance
On the way I came across apr_pool and it's code is pretty ugly and
I was suprised it didn't support free() of individual allocations,
given how complex it is.
I've written multiple memory allocators in the past and one of the
easiest and most useful is an allocator for fixed-sized blocks.
It was particularly important for glib container classes.
Free is trivial as you already know the size and never have to
coalesce free blocks and you can just plop the memory into the
head of a single-linked list.
Allocate is likewise as trivial as a single-linked list.
However you end up having to have multiple allocators for each
size so you just end up with a lot of wasted space anyway.
Eh I dunno, i'm thinking about whether this one is useful at all.
I had a look at an allocator that could support memalign() without
wastage. And basically do-away with a need for a special page
This is basically a tree-based version of AlocMem()/FreeMem() but
with a very large alignment. Because there is no size overhead
all memory is used and allocating new page-aligned pages doesn't
leave unfillable holes.
Performance problems with the algorithm would be mitigated by
having fewer, larger allocations and leaving the internal
allocations to other more specific algorithms.
Still I dunno, it's many drawbacks in this form but I like the
idea. Probably some sort of external index (radix-tree) would be
better and enforcing N*page sized allocations.
I tried to write an allocator that supported free but with the
minimum overhead. I got it down to a minimum allocation size of 8
bytes (void *) with no overhead.
Basically I wrote another AllocMem()/FreeMem() but this time much
closer to the original. To support the minimum allocation size of
8 bytes on a 64-bit platform the lsb of the link pointer is used
to indicate whether the size is 8 bytes (no room for size) or
larger (has size).
This still suffers from the same performance problems, although it
can be mitigated by being used only for short-term memory or with
The Story So Far. A Big Waste Of Fucking Time.
In the end i'm not really sure I like anything I came up with.
I'll probably change the existing ezelib serialiser, and likely
split the code up so that the description part is separate from
the (multiple) encoding part(s). I'll probably have something
XDR-like and another more compact one, maybe tagged. Each may not
support all possible structures.
The memory stuff was really just fucking around; I don't remotely
need any of this. But I will poke at it some more I guess.
Probably a pageish allocator which manages system memory without
allocation overhead (at point of allocation), then a couple of
allocators which sit atop of it and can be used as pools. A
general purpose one and a high-performance non-freeing one that
also supports incremental string/structure creation.
The general purpose pool might just end up being very similar to
malloc() but with the (rather useful) benefit of being able to be
freed en-mass. I can possibly get the minimum allocation size
down to 16 bytes although I will need to borrow some bits from
ez-tree's ez_node to achieve this. I can possibly also get
first-fit to work efficiently (order tree by size then address?).
I think i'm going to have to save the size in allocations though
(thus 8-byte allocation overhead): it's going to be needed a lot
of the time and I can't do fast coalescing without it. If I have
a particualrly fast page allocator i might be able to do something
neat for small fixed-sized allocs and that would also cover the
interesting memalign cases.
Ultumately they would also need valgrind instrumentation as well.
Probably back to work next week, should get away from the screen.
Wheather's nice and warm but not too hot. Finally getting a few
tomatoes and the sweet-corn is nearly there.
blogz - a personal blog engine
Time to announce the release of another new project
Actually it's an old project, it is the software which is running
the blog portion of this site. I mentioned months ago that I
would get around to releasing it at some point and this is that
Go read the homepage for a quick summary. As usual the source has
a decent README.
Well that was shit
It ended up hitting 46.6 - yep the hottest day on record - just as
I left to hit the town. The town was dead. I spent most of the
time drinking alone outside on the street although I used splashes
of water to keep me cooler than I would have been otherwise.
Even the cockroaches were succumbing to the heat - well at least
one expired while I was sitting there! So much for the nuclear
As a health exercise it was a complete failure and probably worse
than just staying home.
At least it's supposed to cool down tomorrow for a few days
although at 23:00 it's still about 35 so it's going to be a
Hell on Earth
Welcome to Adelaide, South Australia.
So the mercury just hit 46.6C, which would make it hottest day on
record if that sticks as an official reading. Thats a prelimary
number from the BOM and i'm not sure which site they use as the
official Adelaide number (it was 47.7 in Kent Town). Not even 4pm
yet so it might get higher although some clouds are rolling in and
there's some `breeze' (more like a fan-forced oven).
Probably not the best day for it but I have to get out of the
house for health reasons so time to go see what a burning city
looks like first hand. Going to be a warm ride in.
libeze - C utility library
This is announcing a new small project
It is a small collection of C utilities as a static link library.
- Aim for best implementation of an algorithm;
- Clean readable source;
- Small source size;
- Small code size;
- Consistent api and features where it makes sense;
- Low level and highly re-usable building blocks where it makes sense.
Rather than copy the rest of it's current home page just go there
and read it. There is much more in the README and source.
It's an offshoot (read: I got very distracted) of the mele music
player i've been working on as well as some previous code
Copyright (C) 2018 Michael Zucchi, All Rights Reserved.Powered by gcc & me!