Tuesday, 09 October 2018, 10:07

Other JNI bits: NativeZ, jjmpeg.

Yesterday I spent a good deal of time continuing to experiment and tune NativeZ. I also ported the latest version of jjmpeg to a modularised build and to use NativeZ objects.

Hashing C Pointers

C pointers obtained from malloc are aligned to 16-byte boundaries on 64-bit GNU systems, so the lower 4 bits are always zero. Standard malloc also allocates from a contiguous virtual address range which is extended using sbrk(2), which means the upper bits rarely change. It is therefore sufficient to generate a hashcode which only takes into account the lower bits (excluding the bottom 4).

I did some experimenting with hashing the C pointer values using various algorithms, from Knuth's Magic Number to various integer hashing algorithms (e.g. hash-prospector), to Long.hashCode(), to a simple shift (both 64-bit and 32-bit). The performance analysis was based on Chi-squared distance between the hash chain lengths and the ideal, using pointers generated from malloc(N) for different fixed values of N for multiple runs.

Although it wasn't the best statistically, the best performing algorithm was a simple 32-bit, 4-bit shift, due to its significantly lower cost. And it typically compared quite well statistically regardless.

static int hashCode(long p) {
    return (int)p >>> 4;
}

In the nonsensical event that 28 bits are not sufficient for the hash bucket index, it can be extended to 32 bits:

static int hashCode(long p) {
    return (int)(p >>> 4);
}

And despite all the JNI and reflection overheads, using the two-round function from the hash-prospector project increased raw execution time by approximately 30% over the trivial hashCode() above.
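For reference, this is a Java transcription of one of the two-round integer hashes published by the hash-prospector project (the `lowbias32' constants); whether this is exactly the variant used in the tests above, and how the 64-bit pointer is folded down to 32 bits first, are my assumptions:

static int hashCode(long p) {
    int x = (int)(p >>> 4);   // drop the alignment bits first (assumption)
    x ^= x >>> 16;
    x *= 0x7feb352d;
    x ^= x >>> 15;
    x *= 0x846ca68b;
    x ^= x >>> 16;
    return x;
}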

Whilst it might not be ideal for 8-byte aligned allocations it's probably not that bad in practice either. One thing I can say for certain though is NEVER use Long.hashCode() to hash C pointers!

Concurrency

I also tuned the use of synchronisation blocks very slightly to make critical sections as short as possible whilst maintaining correct behaviour. This made enough of a difference to be worth it.
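As a rough sketch of the shape of that change (hypothetical names, not the actual NativeZ code): keep only the table bookkeeping inside the lock and do the native free once the lock has been released.

void release(NativeReference ref) {
    synchronized (this) {
        remove(ref);           // unlink from the pointer table: short and fast
    }
    free(ref.p);               // the native free() happens outside the lock
}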

I also tried more complex synchronisation mechanisms - read-write locks, hash bucket row-locks and so on, but at best they were a bit slower than using synchronized{}.

The benchmark I was using wasn't particularly fantastic - just one thread creating 10^7 `garbage' objects in a tight loop whilst the cleaner thread freed them. No resolution of existing objects, no multiple threads, and so on. But apart from the allocation rate it isn't an entirely unrealistic scenario either, and I was just trying to identify raw overheads.

Reflection

I've only started looking at the reflection used for allocating and releasing objects on the Java side, and in isolation these are the highest costs of the implementation.

There are ways to reduce these costs but at the expense of extra boilerplate (for instantiation) or memory requirements (for release).
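For instantiation, I'd guess the reflective path looks something like the following (a hypothetical helper, purely for illustration); the boilerplate alternative would be a hand-written factory method per class:

static <T> T create(Class<T> klass, long p) throws Exception {
    // look up the (long) constructor of the concrete wrapper class and
    // invoke it with the C pointer returned from the native side
    java.lang.reflect.Constructor<T> cc = klass.getDeclaredConstructor(long.class);
    cc.setAccessible(true);
    return cc.newInstance(p);
}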

Still ongoing. And whilst the relative cost over C is very high, the absolute cost is still only a few hundred nanoseconds per object.

From a few small tests it looks like the maximum I could achieve is a 30% reduction in object instantiation/finalisation costs, but I don't think it's worth the effort or overheads.

Makefile foo

I'm still experimenting with this. I used some macros and implicit rules to get most things building ok, but I'm not sure it couldn't be better. The basic makefile is working ok for multi-module stuff so I think I'm getting there. Most of the work is just done by the jdk tools as they handle modules and so on quite well and mostly dictate the disk layout.

I've broken jjmpeg into 3 modules - the core, the javafx related classes and the awt related classes.

Tagged java, jjmpeg, nativez.
Sunday, 07 October 2018, 08:51

GC JNI, HashTables, Memory

I had a very busy week with work, porting libraries and applications to Java modules - although that wasn't really the busy part. I also looked into making various implementations pluggable using services and then creating various pluggable implementations, often utilising native code. Just having some (much faster) implementations of parts also opened other opportunities, and it sort of cascaded from there.

Anyway, along the way I revisited my implementation of garbage collection with JNI and started working on a modular version that can be shared between libraries without having to copy the core objects around, and in the process found bugs and things to improve.

Here are some of the more interesting pieces I found along the way.

JNI call overheads

The way I write JNI these days is typically to just write the method signature as if it were a Java method and mark it native, letting JNI handle the Java to C mappings directly. This is different to how I first started doing it and flies in the face of the convention I've typically seen amongst JNI implementations, where the Java just passes the pointers around as a long and has a wrapper function which resolves these longs as appropriate.

The primary reason is to reduce boilerplate and significantly simplify the Java class writing without having a major impact on performance. I have done some performance testing before but I re-ran some tests and they confirm the design decisions used in zcl, for example.
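To make the two styles concrete, here's a minimal sketch (class and method names are invented for illustration). The first is the approach described above - the native method carries the real signature and the C side resolves the object's pointer field itself; the second is the conventional long-plus-wrapper approach:

// Style used here: pass the object, let the JNI code look up the pointer field.
class Frame {
    final long p;               // C pointer, resolved by the native code

    Frame(long p) { this.p = p; }

    public native int getWidth();
}

// Conventional style: a Java wrapper resolves the pointer and passes a long
// down to a static native stub.
class FrameByLong {
    private final long p;

    FrameByLong(long p) { this.p = p; }

    public int getWidth() {
        return getWidth(p);
    }

    private static native int getWidth(long p);
}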

Array Access

First, I tested some mechanisms for accessing arrays. I passed two arrays to a native function and had it perform various tests:

  1. No op;
  2. GetPrimitiveArrayCritical on both arrays;
  3. GetArrayElements for read-only arrays (call Release(ABORT));
  4. GetArrayElements for read-only on one array and read-write on the other (call Release(Abort, Commit));
  5. GetArrayRegion for read-only, to memory allocated using alloca;
  6. GetArrayRegion and SetArrayRegion for one array, to memory using alloca;
  7. GetArrayRegion for read-only, to memory allocated using malloc;
  8. GetArrayRegion and SetArrayRegion for one array, to memory using malloc.

I then ran these tests for different sized float[] arrays, for 1 000 000 iterations; the results in seconds are below. Columns 0-7 correspond to tests 1-8 above, and the first column is the array length. It's some Intel laptop.

       NOOP         Critical     Elements                  Region/alloca             Region/malloc
            0            1            2            3            4            5            6            7
    1  0.014585537  0.116005779  0.199563981  0.207630731  0.104293268  0.127865782  0.185149189  0.217530639
    2  0.013524620  0.118654092  0.201340322  0.209417471  0.104695330  0.129843794  0.193392346  0.216096210
    4  0.012828157  0.113974453  0.206195102  0.214937432  0.107255090  0.127068808  0.190165219  0.215024016
    8  0.013321001  0.116550424  0.209304277  0.205794572  0.102955338  0.130785133  0.192472825  0.217064583
   16  0.013228272  0.116148320  0.207285227  0.211022409  0.106344162  0.139751496  0.196179709  0.222189471
   32  0.012778452  0.119130446  0.229446026  0.239275912  0.111609011  0.140076428  0.213169077  0.252453033
   64  0.012838540  0.115225274  0.250278658  0.259230054  0.124799171  0.161163577  0.230502836  0.260111468
  128  0.014115022  0.120103332  0.264680542  0.282062633  0.139830967  0.182051151  0.250609001  0.297405818
  256  0.013412645  0.114502078  0.315914219  0.344503396  0.180337154  0.241485525  0.297850562  0.366212494
  512  0.012669807  0.117750316  0.383725378  0.468324904  0.261062826  0.358558946  0.366857041  0.466997977
 1024  0.013393850  0.120466096  0.550091063  0.707360155  0.413604094  0.576254053  0.518436072  0.711689270
 2048  0.013493996  0.118718871  0.990865614  1.292385065  0.830819392  1.147347700  0.973258653  1.284913436
 4096  0.012639675  0.116153318  1.808592969  2.558903773  1.628814486  2.400586604  1.778098089  2.514406096

Some points of note:

I didn't look at ByteBuffer because it doesn't really fit what I'm doing with these functions.

Anyway - the overheads are unavoidable with JNI but are quite modest. The function in question does nothing with the data and so any meaningful operation will quickly dominate the processing time.

Object Pointer resolution

The next test I did was to compare various mechanisms for transferring the native C pointer from Java to C.

I created a Native object with two long pointer fields: a final long p, and a non-final long q.

  1. No operation;
  2. C invokes getP() method which returns p;
  3. C invokes getQ() method which returns q;
  4. C access to .p field;
  5. C access to .q field;
  6. The native signature takes a long pointer directly; the caller resolves the .p field itself at the call site;
  7. The native signature takes a long pointer directly; the .p field is resolved by a Java wrapper function.

Again invoking it 1 000 000 times.

NOOP         getP()       getQ()       (C).p        (C).q        (J).p        J wrapper
     0            1            2            3            4            5            6
0.016606942  0.293797182  0.294253973  0.020146810  0.020154508  0.015827028  0.016979563

In short, just passing Java objects directly and having the C resolve the pointer via a field lookup is slightly slower but requires much less boilerplate and so is the preferred solution.

Logger

After I sorted out the basic JNI mechanisms I started looking at the reference tracking implementation (I'll call this NativeZ from here on).

For debugging, and to make it more re-usable as a library, I had added logging to various places in the C code using Logger.getLogger(tag).fine(String.format(...));

It turns out this was really not a wise idea: the logging calls alone were taking approximately 50% of the total execution time - as much as the Java to C to Java transitions, hashtable lookups and synchronisation blocks combined.

Simply changing to use the Supplier versions of the logging functions approximately doubled the performance.

  Logger.getLogger(tag).fine(String.format(...));
->
  Logger.getLogger(tag).fine(() -> String.format(...));
    

But I also decided to make including any of this code optional, by bracketing each call with a test against a static final boolean compile-time constant.
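Something along these lines (the flag and class names are made up); because the flag is a compile-time constant, javac can drop the guarded statement entirely when it is false:

class Debug {
    // compile-time constant: when false, the guarded logging code is compiled out
    static final boolean LOG = false;
}

if (Debug.LOG)
    Logger.getLogger(tag).fine(() -> String.format(...));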

This checking indirectly confirmed that the reflection invocations aren't particularly onerous, assuming they're doing any actual work.

HashMap<Long,WeakReference>

Now the other major component of the NativeZ object tracking is using a hash-table to map C pointers to Java objects. This serves two important purposes:

  1. Allows the Java side to resolve separately-obtained C pointers to the same Java object;
  2. Maintains a hard reference to the WeakReference, without which they just don't work.

For simplicity I just used a HashMap for this purpose. I knew it wasn't ideal but I did the work to quantify it.

Using jol and perusing the source I got some numbers for a JVM using compressed oops and an 8-byte object alignment.

  Object             Size (bytes)   Notes
  HashMap.Node       32             Used for short hash chains.
  HashMap.TreeNode   56             Used for long hash chains.
  Long               24             The node key.
  CReference         48             The node value. Subclass of WeakReference.
Thus the incremental overhead for a single C object is either 104 bytes when a linear hash chain is used, or 128 bytes when a tree is used.

Actually it's a bit more than that, because the hashtable (by default) uses a 75% load factor and so also allocates around 1.5 table pointers for each object; but that's neither here nor there, and it's a feature of the algorithm regardless of implementation.

But there are other, bigger problems: the Long.hashCode() method just mixes the low and high words together using xor. If all C pointers are 8 (or worse, 16) byte aligned you essentially only ever use every 8th (or 16th) bucket. So apart from the wasted buckets, the HashMap is very likely to end up using trees to store each chain.
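A rough illustration of the effect, using made-up pointer values (Long.hashCode() is defined as (int)(value ^ (value >>> 32))):

import java.util.HashSet;
import java.util.Set;

public class AlignedHashDemo {
    public static void main(String[] args) {
        Set<Integer> used = new HashSet<>();
        long base = 0x7f1234500000L;              // hypothetical malloc arena base
        for (int i = 0; i < 1024; i++) {
            long p = base + i * 16L;              // 16-byte aligned pointers
            int h = Long.hashCode(p);             // (int)(p ^ (p >>> 32))
            used.add(h & 255);                    // bucket index in a 256-entry table
        }
        System.out.println(used.size());          // prints 16: only 1 bucket in 16 is ever used
    }
}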

So I wrote another hashtable implementation which addresses this by using the primitive long stored in the CReference directly as the key, and the CReference itself as the bucket node. I also used a much better hash function. This reduced the memory overhead to just the 48 bytes for the CReference, plus a (tunable) overhead for the root table - anywhere from 1/4 to 1 entry per node works quite well with the improved hash function.
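A simplified sketch of the idea (invented names, no resizing or synchronisation, and using the plain shift hash from the earlier post rather than the better mixing function - so not the actual NativeZ code): the weak reference itself is the bucket node, keyed directly on the raw pointer.

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class CReference extends WeakReference<Object> {
    final long p;             // the C pointer this object wraps
    CReference next;          // next node in the same bucket

    CReference(long p, Object referent, ReferenceQueue<Object> q) {
        super(referent, q);
        this.p = p;
    }
}

class PointerTable {
    private final CReference[] table = new CReference[64];

    private static int hash(long p) {
        return (int)(p >>> 4);                    // drop the alignment bits
    }

    Object get(long p) {
        for (CReference r = table[hash(p) & (table.length - 1)]; r != null; r = r.next)
            if (r.p == p)
                return r.get();
        return null;
    }

    void put(CReference r) {
        int i = hash(r.p) & (table.length - 1);
        r.next = table[i];
        table[i] = r;
    }
}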

This uses less memory and runs a bit faster - mostly because the GC runs less often.

notzed.nativez

So I'm still working on wrapping this all up in a module, notzed.nativez, which will include the Java base class and a shared library for other JNI libraries to link to; the latter includes the (trivial) interface to the NativeZ object and some helpers for writing small and robust JNI libraries.

And then of course eventually port jjmpeg and zcl to use it.

Tagged java, nativez.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved. Powered by gcc & me!