GC JNI, HashTables, Memory
I had a very busy week at work porting libraries and applications to Java modules - that wasn't really the busy part. I also looked into making various implementations pluggable using services, and then creating several pluggable implementations, often utilising native code. Just having some (much faster) implementations of parts also opened other opportunities, and it sort of cascaded from there.
Anyway, along the way I revisited my implementation of Garbage Collection with JNI and started working on a modular version that can be shared between libraries without having to copy core objects around, and then found bugs and things to improve.
Here are some of the more interesting pieces I found along the way.
JNI call overheads
The way I write JNI these days is typically to write the method signature as if it were a Java method and just mark it native, letting JNI handle the Java-to-C mappings directly. This is different to how I first started doing it, and flies in the face of the convention I've typically seen amongst JNI implementations, where the Java just passes the pointers as a long and has a wrapper function which resolves these longs as appropriate.
The primary reason is to reduce boilerplate and significantly simplify the Java class writing without having a major impact on performance. I have done some performance testing before, but I re-ran some tests and they confirm the design decisions used in zcl, for example.
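The two styles look roughly like this on the Java side (hypothetical class and method names; the native bodies are omitted, and the natives are never invoked here so no library needs loading):

```java
// Sketch (hypothetical names) contrasting the two binding styles discussed
// above. Style A passes the Java object (or its arrays) straight through and
// lets the C side resolve any pointer field; style B passes the pointer as a
// long and needs an extra Java wrapper per method.
public class Binding {
    long p; // native pointer; in style A the C side reads it via GetLongField

    // Style A: the native signature mirrors the public Java API directly.
    public native void apply(float[] src, float[] dst);

    // Style B: conventional wrapper resolving the pointer in Java.
    public void applyWrapped(float[] src, float[] dst) {
        apply0(p, src, dst);
    }

    private static native void apply0(long p, float[] src, float[] dst);

    public static void main(String[] args) {
        // Just count the declared variants via reflection; nothing native runs.
        long n = java.util.Arrays.stream(Binding.class.getDeclaredMethods())
                .filter(m -> m.getName().startsWith("apply"))
                .count();
        System.out.println(n); // prints 3
    }
}
```

Style B needs the wrapper and the `apply0` declaration for every method; style A is one declaration per method.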
First, I tested some mechanisms for accessing arrays. I passed two arrays to a native function and had it perform various tests:
- No op;
- GetPrimitiveArrayCritical on both arrays;
- GetArrayElements for read-only access on both arrays (released with JNI_ABORT);
- GetArrayElements for read-only on one array and read-write on the other (released with JNI_ABORT and 0/commit respectively);
- GetArrayRegion for read-only, to memory allocated using alloca;
- GetArrayRegion and SetArrayRegion for one array, to memory allocated using alloca;
- GetArrayRegion for read-only, to memory allocated using malloc;
- GetArrayRegion and SetArrayRegion for one array, to memory allocated using malloc.
I then ran these tests for different sized float arrays, for 1 000 000 iterations; the results in seconds are below. This is on some Intel laptop.
| Size | NOOP | Critical | Elements (r/o) | Elements (r/w) | Region/alloca (r/o) | Region/alloca (r/w) | Region/malloc (r/o) | Region/malloc (r/w) |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.014585537 | 0.116005779 | 0.199563981 | 0.207630731 | 0.104293268 | 0.127865782 | 0.185149189 | 0.217530639 |
| 2 | 0.013524620 | 0.118654092 | 0.201340322 | 0.209417471 | 0.104695330 | 0.129843794 | 0.193392346 | 0.216096210 |
| 4 | 0.012828157 | 0.113974453 | 0.206195102 | 0.214937432 | 0.107255090 | 0.127068808 | 0.190165219 | 0.215024016 |
| 8 | 0.013321001 | 0.116550424 | 0.209304277 | 0.205794572 | 0.102955338 | 0.130785133 | 0.192472825 | 0.217064583 |
| 16 | 0.013228272 | 0.116148320 | 0.207285227 | 0.211022409 | 0.106344162 | 0.139751496 | 0.196179709 | 0.222189471 |
| 32 | 0.012778452 | 0.119130446 | 0.229446026 | 0.239275912 | 0.111609011 | 0.140076428 | 0.213169077 | 0.252453033 |
| 64 | 0.012838540 | 0.115225274 | 0.250278658 | 0.259230054 | 0.124799171 | 0.161163577 | 0.230502836 | 0.260111468 |
| 128 | 0.014115022 | 0.120103332 | 0.264680542 | 0.282062633 | 0.139830967 | 0.182051151 | 0.250609001 | 0.297405818 |
| 256 | 0.013412645 | 0.114502078 | 0.315914219 | 0.344503396 | 0.180337154 | 0.241485525 | 0.297850562 | 0.366212494 |
| 512 | 0.012669807 | 0.117750316 | 0.383725378 | 0.468324904 | 0.261062826 | 0.358558946 | 0.366857041 | 0.466997977 |
| 1024 | 0.013393850 | 0.120466096 | 0.550091063 | 0.707360155 | 0.413604094 | 0.576254053 | 0.518436072 | 0.711689270 |
| 2048 | 0.013493996 | 0.118718871 | 0.990865614 | 1.292385065 | 0.830819392 | 1.147347700 | 0.973258653 | 1.284913436 |
| 4096 | 0.012639675 | 0.116153318 | 1.808592969 | 2.558903773 | 1.628814486 | 2.400586604 | 1.778098089 | 2.514406096 |
Some points of note:
- Raw method invocation is around 14 nanoseconds, pretty much irrelevant once you do any work.
- Get/SetArrayElements is pretty much the same as using Get/SetArrayRegion with malloc, but with less flexibility.
- For small arrays, the two calls to malloc/free are nearly 50% of the processing time. Given the gay abandon with which most C programmers throw these around as if they cost nothing, the extra JNI overhead is modest.
- For larger arrays memcpy time dominates.
- For one-way transfers shorter than about 64 floats, using Get/SetArrayRegion to the stack or pre-allocated memory is the fastest.
- For all other cases including any-sized two-way transfers, GetPrimitiveArrayCritical is the fastest. But it has other overheads and isn't always applicable.
I didn't look at ByteBuffer because it doesn't really fit what I'm doing with these functions.
Anyway - the overheads are unavoidable with JNI but are quite modest. The function in question does nothing with the data and so any meaningful operation will quickly dominate the processing time.
Object Pointer resolution
The next test I did was to compare various mechanisms for transferring the native C pointer from Java to C.
I created a Native object with two long fields, final long p and long q.
- No operation;
- C invokes getP() method which returns p;
- C invokes getQ() method which returns q;
- C access to .p field;
- C access to .q field;
- The native signature takes a pointer directly, call it resolving the .p field in the caller;
- The native signature takes a pointer directly, call it resolving the .p field via a wrapper function.
Again invoking it 1 000 000 times.
| NOOP | getP() | getQ() | (C).p | (C).q | (J).p | J wrapper |
|---|---|---|---|---|---|---|
| 0.016606942 | 0.293797182 | 0.294253973 | 0.020146810 | 0.020154508 | 0.015827028 | 0.016979563 |
- final makes no difference.
- method invocation is 15x slower than a field lookup!
- Field lookups are much slower in C than in Java, but the absolute cost is insignificant at ~2.5 ns per lookup.
In short, just passing Java objects directly and having the C resolve the pointer via a field lookup is slightly slower, but requires much less boilerplate, and so is the preferred solution.
After I sorted out the basic JNI mechanisms I started looking at the reference tracking implementation (I'll call this NativeZ from here on).
For debugging, and to make it a more re-usable library, I had added logging to various places in the C code using Logger.getLogger(tag).fine(String.format(...));
It turns out this was really not a wise idea: the logging calls alone were taking approximately 50% of the total execution time - more than the Java-to-C-to-Java transitions, hashtable lookups and synchronisation blocks combined.
Simply changing to use the Supplier versions of the logging functions approximately doubled the performance.
Logger.getLogger(tag).fine(String.format(...)); -> Logger.getLogger(tag).fine(() -> String.format(...));
But I also decided to make including any of this code optional by bracketing each call with a test against a static final boolean compile-time constant.
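Both changes can be sketched in plain Java; the tag, message and flag name here are hypothetical, and the counter just makes the evaluation visible:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogDemo {
    // Hypothetical compile-time flag: when false, the guarded calls are dead
    // code and javac/the JIT can drop them entirely.
    static final boolean DEBUG = false;

    static int evaluations;

    static String expensive() {
        evaluations++; // count how often the message string is actually built
        return String.format("released %x", 0xdeadL);
    }

    public static void main(String[] args) {
        evaluations = 0;
        Logger log = Logger.getLogger("nativez");
        log.setLevel(Level.INFO);        // FINE messages are disabled

        log.fine(expensive());           // eager: string built, then discarded
        log.fine(() -> expensive());     // lazy: the supplier is never invoked

        if (DEBUG)                       // compiled out when DEBUG is false
            log.fine(() -> expensive());

        System.out.println("evaluations=" + evaluations); // prints evaluations=1
    }
}
```

The Supplier overload only evaluates the lambda when the level is actually loggable, which is where the doubled performance comes from.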
This checking indirectly confirmed that the reflection invocations aren't particularly onerous, assuming they're doing any work.
Now the other major component of the NativeZ object tracking is using a hash-table to map C pointers to Java objects. This serves two important purposes:
- Allows the Java code to resolve separate C pointers to the same object;
- Maintains a hard reference to each WeakReference, without which they just don't work (the reference object itself must stay strongly reachable).
For simplicity I just used a HashMap for this purpose. I knew it wasn't ideal but I did the work to quantify it.
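A minimal sketch of this simple approach, assuming a CReference that subclasses WeakReference and records the C pointer; the class shapes are as described, the method names are mine:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// The node value: a WeakReference that also carries the native pointer.
class CReference extends WeakReference<Object> {
    final long p; // the C pointer this Java object wraps

    CReference(long p, Object referent) {
        super(referent);
        this.p = p;
    }
}

public class PointerMap {
    // Boxed Long key -> CReference value; the map itself supplies the hard
    // reference that keeps each WeakReference alive.
    static final Map<Long, CReference> map = new HashMap<>();

    static synchronized void register(long p, Object o) {
        map.put(p, new CReference(p, o));
    }

    static synchronized Object resolve(long p) {
        CReference ref = map.get(p);
        return ref != null ? ref.get() : null;
    }

    public static void main(String[] args) {
        Object o = new Object();
        register(0x7f00_0000_1000L, o);
        // Two lookups of the same C pointer resolve to the same Java object.
        System.out.println(resolve(0x7f00_0000_1000L) == o); // prints true
    }
}
```

Note the per-entry cost: each registration allocates a HashMap.Node, a boxed Long key and the CReference, which is what the size analysis below quantifies.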
Using jol and perusing the source, I got some numbers for a JVM using compressed oops and an 8-byte object alignment.
| Object | Bytes | Notes |
|---|---|---|
| HashMap.Node | 32 | Used for short hash chains. |
| HashMap.TreeNode | 56 | Used for long hash chains. |
| Long | 24 | The node key. |
| CReference | 48 | The node value; subclass of WeakReference. |
Thus the incremental overhead for a single C object is either 104 bytes when a linear hash chain is used, or 128 bytes when a tree is used.
Actually it's a bit more than that because the hashtable (by default) uses a 75% load factor, so it also allocates about 1.5 pointers for each object; but that's neither here nor there, and it's a feature of the algorithm regardless of implementation.
But there are other, bigger problems: the Long.hashCode() method just mixes the low and high words together using xor. If all C pointers are 8 (or worse, 16) byte aligned, essentially only every 8th (or 16th) bucket can ever be used. So apart from the wasted buckets, the HashMap is very likely to end up using Trees to store each chain.
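The effect is easy to demonstrate in plain Java, simulating 16-byte-aligned pointers and HashMap's index computation (the table size and pointer range here are arbitrary choices of mine):

```java
import java.util.HashSet;
import java.util.Set;

public class AlignedHashDemo {
    // Count how many buckets of a 1024-entry table 1000 16-byte-aligned
    // "pointers" can ever reach under Long.hashCode() plus HashMap's spreader.
    static int distinctBuckets() {
        int buckets = 1024;                  // power-of-two table, as in HashMap
        Set<Integer> used = new HashSet<>();
        for (long p = 1 << 20; p < (1 << 20) + 16 * 1000; p += 16) {
            int h = Long.hashCode(p);        // (int)(p ^ (p >>> 32))
            h ^= h >>> 16;                   // HashMap.hash() spreader
            used.add(h & (buckets - 1));     // index = hash & (n - 1)
        }
        return used.size();
    }

    public static void main(String[] args) {
        // 16-byte alignment leaves the low 4 index bits untouched, so at most
        // 1024 / 16 = 64 buckets are reachable.
        System.out.println(distinctBuckets()); // prints 64
    }
}
```

With 1000 entries crammed into 64 buckets, the average chain length passes HashMap's treeify threshold of 8 easily.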
So I wrote another hashtable implementation which addresses this by using the primitive long stored in the CReference directly as the key, and the CReference itself as the bucket node. I also used a much better hash function. This reduced the memory overhead to just the 48 bytes for the CReference, plus a (tunable) overhead for the root table - anywhere from 1/4 to 1 entry per node works quite well with the improved hash function.
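The post doesn't show the actual hash function, so as a hedged stand-in, here is a Fibonacci-style multiplicative mixer applied to the same aligned pointers as above; any mixer that folds the high bits down would illustrate the point:

```java
import java.util.HashSet;
import java.util.Set;

public class MixedHashDemo {
    // Hypothetical mixer: multiply by a large odd constant (2^64 divided by
    // the golden ratio, as in Fibonacci hashing) and keep the top 10 bits,
    // which directly form a 0..1023 bucket index.
    static int hash(long p) {
        return (int) ((p * 0x9E3779B97F4A7C15L) >>> (64 - 10));
    }

    static int distinctBuckets() {
        Set<Integer> used = new HashSet<>();
        // The same 1000 16-byte-aligned "pointers" as in the earlier demo.
        for (long p = 1 << 20; p < (1 << 20) + 16 * 1000; p += 16)
            used.add(hash(p));
        return used.size();
    }

    public static void main(String[] args) {
        // Alignment no longer restricts which buckets are reachable, so far
        // more than 64 of the 1024 buckets end up in use.
        System.out.println(distinctBuckets() > 64); // prints true
    }
}
```

Because the multiply propagates every input bit into the high bits, low-bit alignment no longer matters, which is what makes the sparse root table (1/4 to 1 entry per node) viable.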
This uses less memory and runs a bit faster - mostly because the GC runs less often.
So I'm still working on wrapping this all up in a module, notzed.nativez, which will include the Java base class and a shared library for other JNI libraries to link to; the latter includes the (trivial) interface to the NativeZ object and some helpers for writing small and robust JNI libraries.
And then of course eventually port jjmpeg and zcl to use it.