Thursday, 11 October 2018, 20:29

Concurrent Hash Tables

So I got curious about whether the GC in NativeZ would cause any bottlenecks in highly contended situations - something I had already faced with an earlier iteration of the library. The specific case was running out of memory when many threads were creating short-lived native objects; the single thread consuming values from the ReferenceQueue wasn't able to keep up.

The out-of-memory situation was fairly easily addressed by just running a ReferenceQueue poll whenever a new object is created, but I was still curious about the locking overheads.
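
Something like the following captures the idea - a minimal sketch only, assuming a 'references' queue and a release() helper that frees the native memory (both names invented here):

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

static void drain(ReferenceQueue<Object> references) {
    Reference<? extends Object> r;
    while ((r = references.poll()) != null)
        release(r); // hypothetical: frees the C pointer this reference tracks
}

Calling this at each object creation ties the cleanup rate to the allocation rate itself.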

A field of tables

I made a few variations of hash tables which support the interface I desired, which is to store a value which also provides the key itself, as a primitive long. So it's more of a set, but with the ability to remove items by the key directly. For now I'll just summarise them.
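
As a rough sketch of that interface (names invented here, not the actual NativeZ ones):

interface LongKeyed {
    long key(); // the stored value supplies its own primitive key
}

interface LongSet<T extends LongKeyed> {
    T put(T value);     // insert, keyed by value.key()
    T get(long key);    // lookup by the primitive key
    T remove(long key); // remove by the key directly, no boxing
}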

SynchronisedHash

This subclasses HashMap and adds the desired interfaces, each of which is synchronised on the object.

Rather than use the pointer directly as the key, it is shifted right by 4 bits (all mallocs are 16-byte aligned here) to avoid the Long.hashCode() pitfalls on pointers.

ConcurrentHash

This subclasses ConcurrentHashMap and adds the desired interfaces, but no locking is required.

The same shift trick is used for the key as above.

PointerTable

This was the use-WeakReference-as-node implementation mentioned in the last post - a simple single-threaded chained hash table implementation. I just synchronised all the entry points.

A load factor of 2.0 is used for further memory savings.

CopyHash

This is a novel(?) approach in which all modifications to the bucket chains are implemented by copying the whole linked list of nodes and replacing the table entry with the new list using compare-and-set (CAS). A couple of special cases can avoid any copies.

In the event of a CAS failure it means someone else updated the chain, so it simply retries. Some logic akin to a simplified version of what ConcurrentHashMap does is used to resize the table when needed, potentially concurrently.

It uses a load factor of 1.0.
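
The insert path looks something like this - a sketch only, assuming a power-of-two table and the shift trick for the index; the names and the recursive copy are mine, not the actual code:

import java.util.concurrent.atomic.AtomicReferenceArray;

final class Node {
    final long key;
    final Object value;
    final Node next;

    Node(long key, Object value, Node next) {
        this.key = key; this.value = value; this.next = next;
    }
}

final class CopySketch {
    final AtomicReferenceArray<Node> table = new AtomicReferenceArray<>(64);

    void put(long key, Object value) {
        int i = (int)(key >>> 4) & (table.length() - 1);
        for (;;) {
            Node head = table.get(i);
            // copy the entire chain, dropping any node with the same key
            Node list = new Node(key, value, copyExcept(head, key));
            if (table.compareAndSet(i, head, list))
                return;
            // CAS failure means another thread changed the chain; retry
        }
    }

    static Node copyExcept(Node n, long key) {
        if (n == null) return null;
        Node rest = copyExcept(n.next, key);
        return n.key == key ? rest : new Node(n.key, n.value, rest);
    }
}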

I also tested a fast-path insert which doesn't try to find an existing value (this occurs frequently with this design), but that so overloaded the remove() mechanism it wasn't actually a good fit!

ArrayHash

This is similar to CopyHash, but instead of a linked list of container nodes it either stores the value directly in the table or it stores an array of objects. Again any modification requires rewriting the table entry. Again resize is handled specially and all threads can contribute.

This also has a small modification in that the retry loop includes a fibonacci-increasing Thread.sleep(0, nanos) delay if it fails, which may slow down wall-clock time but can improve the cpu load.
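
The retry delay has roughly this shape - tryRemove() and the nanosecond cap are stand-ins, not the actual code:

void removeWithBackoff() throws InterruptedException {
    int a = 1, b = 1;
    while (!tryRemove()) {
        Thread.sleep(0, Math.min(a, 999_999)); // nanosecond-resolution sleep
        int t = a + b; a = b; b = t;           // fibonacci-increasing delay
    }
}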

I had some ideas for other approaches but I've had enough for now.

Tests

I created a native class which allocates and frees memory of a given size, and a simple 'work' method which just adds up the octets within. I ran two tests: one which just allocated 16 bytes which immediately went out of scope, and another which allocated 1024 bytes, added them up (a small cpu-bound task), then let it fall out of scope.

Note that lookups are never tested in this scenario apart from the implicit lookup during get/put. I did implement get() - which is always non-blocking in the two latter implementations (even during a resize) - but I didn't test or use it here.

I then created 8 000 000 objects in a tight loop in one or more threads. Once the threads finish I invoke System.gc(), then wait for all objects to be freed. The machine I'm running it on has 4 cores/8 threads so I tried it with 1 and with 8 threads.
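
The rough shape of the test, assuming a NativeThing.create() factory which allocates the native memory (the real class and the free-counting are not shown):

static void run(int nthreads, int total) throws InterruptedException {
    Thread[] threads = new Thread[nthreads];
    for (int t = 0; t < nthreads; t++)
        threads[t] = new Thread(() -> {
            for (int i = 0; i < total / nthreads; i++)
                NativeThing.create(16); // immediately unreachable
        });
    for (Thread t : threads) t.start();
    for (Thread t : threads) t.join();
    System.gc();
    // ... then poll until the native free count reaches 'total'
}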

Phew, ok, that's a bit of a mouthful that probably doesn't make sense - bit tired. The numbers below are pretty rough and are from a single run using OpenJDK 11+28 with -Xmx1G.

                   8 threads, alloc  8 threads, alloc+sum        1 thread, alloc     1 thread, alloc+sum
                     x 1 000 000           x 1 000 000            x 8 000 000             x 8 000 000
                  Elapsed     User      Elapsed     User       Elapsed     User        Elapsed     User
		   
 SynchronisedHash    8.3       38           11       54           8.4       27             16       40
 ConcurrentHash      6.9       35           11       52           8.2       27             17       38

 PointerTable        7.2       26           13       44           7.9       17             19       29
 CopyHash            6.6       31            8.2     42           8.1       24             18       33
 ArrayHash           6.0       28            8.2     39           8.5       23             16       24
 ArrayHash*          6.9       23           12       30           8.1       21             17       23

		    * indicates using a delay in the retry loop for the remove() call
  

To be honest I think the only real conclusion is that this machine doesn't have enough threads for this task to cause a bottleneck in the hash table! Even in the most contended case (alloc only) simple synchronised methods are just about as good as anything else. And whilst this is representative of a real-life scenario, it's just a bloody pain to properly test high-concurrency code and I'm just not that into it for a hobby (the journey not the destination and so forth).

I haven't shown the numbers above but in the case of the Copy and Array implementations I count how many retries are required for both put and get calls. In the 8-thread cases where there is no explicit delay it can be in the order of a million times! And yet they still run faster and use less memory. Shrug.

All of the non-java.util implementations also benefit in both time and space from using the primitive key directly without boxing, storing the key as part of the containee object, and not having to support the full Collections and Streams interfaces. PointerTable also benefits from fewer gc passes due to not needing any container nodes and having a higher load factor.

BTW one might note that this is pretty bloody slow compared to C or pure Java - the overheads of the JNI back-and-forths are undeniably high.

The other thing of note is that, well, hashtables are hashtables - they're all pretty good at what they're doing here because of the algorithm they share. There's not really all that much practical difference between any of them.

But why?

I'm not sure what possessed me to look into it this deeply but I've done it now. Maybe I'll post the code a bit later; it may have enough bugs to invalidate all the results but it was still a learning experience.

For the lockless algorithms I made use of VarHandles (Java 9's 'safe' interface to Unsafe) to do the CAS and other operations, and some basic monitor locking for coordinating a resize pass.
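
The plumbing is only a few lines - sketched here for an Object[] bucket table, with my own names:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class Slots {
    static final VarHandle TABLE =
        MethodHandles.arrayElementVarHandle(Object[].class);

    static boolean cas(Object[] table, int i, Object expect, Object update) {
        return TABLE.compareAndSet(table, i, expect, update);
    }
}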

The idea of 'read, do work, fail and retry' is something I originally learnt from a hardware feature of the CELL SPU (and the Power-based PPU). On those you can reserve a memory location (128 bytes on the SPU, 4 or 8 on the PPU), and if the ticket is lost by the time you write back to it, the write fails and you know it failed and can retry. So rather than [spin-lock-doing-nothing WORK unlock], you spin on [reserve WORK write-or-retry]. I guess it's a speculative reservation. It's not quite as clean when using READ work CAS (particularly without the 128-byte blocks of the SPU!) but the net result is (mostly) the same. One significant difference is that with CAS you actually have to write a different value back, and in this instance being able to merely indicate change could have been useful.

ConcurrentHashMap does something similar but only for the first insert into an empty hash chain, after that it locks the entry and only ever appends to the chain.

Some of the trickiest code was getting the resize mechanism to synchronise across threads, but I really only threw it together without much thought using object monitors and AtomicInteger. Occasionally I'm getting hangs but they don't appear to make sense: some number of threads will be blocked waiting to enter a synchronised method, while a few others are already in a wait() inside it, all the while another thread is calling it at will - without blocking. If I get keen I'll revisit this part of the code.

Tagged java, nativez.
Tuesday, 09 October 2018, 10:07

Other JNI bits / NativeZ, jjmpeg.

Yesterday I spent a good deal of time continuing to experiment and tune NativeZ. I also ported the latest version of jjmpeg to a modularised build and to use NativeZ objects.

Hashing C Pointers

C pointers obtained by malloc are aligned to 16-byte boundaries on 64-bit GNU systems. Thus the lower 4 bits are always zero. Standard malloc also allocates a contiguous virtual address range which is extended using sbrk(2) which means the upper bits rarely change. Thus it is sufficient to generate a hashcode which only takes into account the lower bits (excluding the first 4).

I did some experimenting with hashing the C pointer values using various algorithms, from Knuth's Magic Number to various integer hashing algorithms (e.g. hash-prospector), to Long.hashCode(), to a simple shift (both 64-bit and 32-bit). The performance analysis was based on Chi-squared distance between the hash chain lengths and the ideal, using pointers generated from malloc(N) for different fixed values of N for multiple runs.

Although it wasn't the best statistically, the best performing algorithm was a simple 32-bit, 4-bit shift due to its significantly lower cost. And it typically compared quite well statistically regardless.

static int hashCode(long p) {
    return (int)p >>> 4;
}

In the nonsensical event that 28 bits are not sufficient for the hash bucket index, it can be extended to 32 bits:

static int hashCode(long p) {
    return (int)(p >>> 4);
}

And despite all the JNI and reflection overheads, using the two-round function from the hash-prospector project increased raw execution time by approximately 30% over the trivial hashCode() above.

Whilst it might not be ideal for 8-byte-aligned allocations, it's probably not that bad in practice either. One thing I can say for certain though is NEVER use Long.hashCode() to hash C pointers!
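
To see why: Long.hashCode() just xors the two halves together, so the zero low bits of an aligned pointer stay zero:

// what Long.hashCode(p) computes
static int longHash(long p) {
    return (int)(p ^ (p >>> 32));
}
// with 16-byte-aligned pointers the low 4 bits of p are zero, and since the
// high word rarely changes, the low 4 bits of the hash are near-constant:
// only about 1 bucket in every 16 ever gets used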

Concurrency

I also tuned the use of synchronisation blocks very slightly to make critical sections as short as possible whilst maintaining correct behaviour. This made enough of a difference to be worth it.

I also tried more complex synchronisation mechanisms - read-write locks, hash bucket row-locks and so on - but at best they were a bit slower than using synchronized{}.

The benchmark I was using wasn't particularly fantastic - just one thread creating 10^7 'garbage' objects in a tight loop whilst the cleaner thread freed them. No resolution of existing objects, no multiple threads, and so on. But apart from the allocation rate it isn't an entirely unrealistic scenario either, and I was just trying to identify raw overheads.

Reflection

I've only started looking at the reflection used for allocating and releasing objects on the Java side, and in isolation these are the highest costs of the implementation.

There are ways to reduce these costs but at the expense of extra boilerplate (for instantiation) or memory requirements (for release).

Still ongoing. And whilst the relative cost over C is very high, the absolute cost is still only a few hundred nanoseconds per object.

From a few small tests it looks like the maximum I could achieve is a 30% reduction in object instantiation/finalisation costs, but I don't think it's worth the effort or the overheads.

Makefile foo

I'm still experimenting with this. I used some macros and implicit rules to get most things building ok, but I'm not sure it couldn't be better. The basic makefile is working ok for multi-module stuff so I think I'm getting there. Most of the work is just done by the jdk tools as they handle modules and so on quite well and mostly dictate the disk layout.

I've broken jjmpeg into 3 modules - the core, the javafx related classes and the awt related classes.

Tagged java, jjmpeg, nativez.
Sunday, 07 October 2018, 08:51

GC JNI, HashTables, Memory

I had a very busy week with work, porting libraries and applications to Java modules - though that wasn't really the busy part: I also looked into making various implementations pluggable using services and then creating various pluggable implementations, often utilising native code. Just having some (much faster) implementations of parts also opened other opportunities and it sort of cascaded from there.

Anyway, along the way I revisited my implementation of Garbage Collection with JNI and started working on a modular version that can be shared between libraries without having to copy the core objects around, and then along the way found bugs and things to improve.

Here are some of the more interesting pieces I found along the way.

JNI call overheads

The way I'm writing JNI these days is typically to just write the method signature as if it were a Java method and mark it native, letting JNI handle the Java-to-C mappings directly. This is different to how I first started doing it, and flies in the face of the convention I've typically seen amongst JNI implementations, where the Java side passes pointers as a long and a wrapper function resolves these longs as appropriate.

The primary reason is to reduce boilerplate and significantly simplify the Java class writing without a major impact on performance. I have done some performance testing before, but I re-ran some tests and they confirm the design decisions used in zcl, for example.

Array Access

First, I tested some mechanisms for accessing arrays. I passed two arrays to a native function and had it perform various tests:

  1. No op;
  2. GetPrimitiveArrayCritical on both arrays;
  3. GetArrayElements for read-only arrays (call Release(ABORT));
  4. GetArrayElements for read-only on one array and read-write on the other (call Release(ABORT, COMMIT));
  5. GetArrayRegion for read-only, to memory allocated using alloca;
  6. GetArrayRegion and SetArrayRegion for one array, to memory using alloca;
  7. GetArrayRegion for read-only, to memory allocated using malloc;
  8. GetArrayRegion and SetArrayRegion for one array, to memory using malloc.

I then ran these tests on different sized float[] arrays, for 1 000 000 iterations. The results in seconds are below; the first column is the array length. It's some Intel laptop.

       NOOP         Critical     Elements                  Region/alloca             Region/malloc
            0            1            2            3            4            5            6            7
    1  0.014585537  0.116005779  0.199563981  0.207630731  0.104293268  0.127865782  0.185149189  0.217530639
    2  0.013524620  0.118654092  0.201340322  0.209417471  0.104695330  0.129843794  0.193392346  0.216096210
    4  0.012828157  0.113974453  0.206195102  0.214937432  0.107255090  0.127068808  0.190165219  0.215024016
    8  0.013321001  0.116550424  0.209304277  0.205794572  0.102955338  0.130785133  0.192472825  0.217064583
   16  0.013228272  0.116148320  0.207285227  0.211022409  0.106344162  0.139751496  0.196179709  0.222189471
   32  0.012778452  0.119130446  0.229446026  0.239275912  0.111609011  0.140076428  0.213169077  0.252453033
   64  0.012838540  0.115225274  0.250278658  0.259230054  0.124799171  0.161163577  0.230502836  0.260111468
  128  0.014115022  0.120103332  0.264680542  0.282062633  0.139830967  0.182051151  0.250609001  0.297405818
  256  0.013412645  0.114502078  0.315914219  0.344503396  0.180337154  0.241485525  0.297850562  0.366212494
  512  0.012669807  0.117750316  0.383725378  0.468324904  0.261062826  0.358558946  0.366857041  0.466997977
 1024  0.013393850  0.120466096  0.550091063  0.707360155  0.413604094  0.576254053  0.518436072  0.711689270
 2048  0.013493996  0.118718871  0.990865614  1.292385065  0.830819392  1.147347700  0.973258653  1.284913436
 4096  0.012639675  0.116153318  1.808592969  2.558903773  1.628814486  2.400586604  1.778098089  2.514406096

Some points of note:

I didn't look at ByteBuffer because it doesn't really fit what I'm doing with these functions.

Anyway - the overheads are unavoidable with JNI but are quite modest. The function in question does nothing with the data and so any meaningful operation will quickly dominate the processing time.

Object Pointer resolution

The next test I did was to compare various mechanisms for transferring the native C pointer from Java to C.

I created a Native object with two long fields: a final long p, and a non-final long q.

  1. No operation;
  2. C invokes a getP() method which returns p;
  3. C invokes a getQ() method which returns q;
  4. C accesses the .p field directly;
  5. C accesses the .q field directly;
  6. the native signature takes a pointer directly; the caller resolves the .p field;
  7. the native signature takes a pointer directly; a wrapper function resolves the .p field.

Again invoking it 1 000 000 times.

NOOP         getP()       getQ()       (C).p        (C).q        (J).p        J wrapper
     0            1            2            3            4            5            6
0.016606942  0.293797182  0.294253973  0.020146810  0.020154508  0.015827028  0.016979563

In short, just passing Java objects directly and having the C resolve the pointer via a field lookup is only slightly slower, but requires much less boilerplate, and so is the preferred solution.
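
The two styles side by side, with illustrative names only:

public class Frame {
    final long p; // the C pointer

    Frame(long p) { this.p = p; }

    // conventional style: a Java wrapper resolves the pointer for C
    public void sum() { sum(p); }
    private static native void sum(long p);

    // the style used here: C receives the object and reads .p itself
    public native void work();
}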

Logger

After I sorted out the basic JNI mechanisms I started looking at the reference tracking implementation (I'll call this NativeZ from here on).

For debugging, and to make it a more re-usable library, I had added logging to various places in the C code using Logger.getLogger(tag).fine(String.format(...)).

It turns out this was really not a wise idea: the logging calls alone were taking approximately 50% of the total execution time - versus the Java-to-C-to-Java calls, hashtable lookups and synchronisation blocks.

Simply changing to use the Supplier versions of the logging functions approximately doubled the performance.

  Logger.getLogger(tag).fine(String.format(...));
->
  Logger.getLogger(tag).fine(() -> String.format(...));
    

But I also decided to make the inclusion of any of this code optional by bracketing each call with a test against a static final boolean compile-time constant.
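
That is, the usual compile-time guard pattern (the DEBUG name here is mine); with a constant false the whole statement is elided by javac, so even the lambda never exists:

import java.util.logging.Logger;

class NativeLog {
    static final boolean DEBUG = false;

    static void logFree(long p) {
        if (DEBUG)
            Logger.getLogger("nativez").fine(() -> String.format("free %016x", p));
    }
}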

This checking indirectly confirmed that the reflection invocations aren't particularly onerous, assuming they're doing any work.

HashMap<Long,WeakReference>

Now, the other major component of the NativeZ object tracking is a hash table which maps C pointers to Java objects. This serves two important purposes:

  1. It allows the Java side to resolve separate C pointers to the same object;
  2. It maintains a hard reference to the WeakReference, without which they just don't work.

For simplicity I just used a HashMap for this purpose. I knew it wasn't ideal but I did the work to quantify it.
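
As a sketch of the scheme (CReference is the real name; the constructor shape is my guess):

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;

class CReference extends WeakReference<Object> {
    final long p; // the C pointer, still needed for cleanup after the referent dies

    CReference(long p, Object referent, ReferenceQueue<Object> queue) {
        super(referent, queue);
        this.p = p;
    }
}

class Registry {
    // maps C pointer -> weak reference; the map entry is the hard reference
    // which keeps the WeakReference itself alive
    final HashMap<Long, CReference> map = new HashMap<>();
    final ReferenceQueue<Object> queue = new ReferenceQueue<>();
}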

Using jol and perusing the source I got some numbers for a jvm using compressed oops and an 8-byte object alignment.

 Object            Size  Notes
 HashMap.Node        32   Used for short hash chains.
 HashMap.TreeNode    56   Used for long hash chains.
 Long                24   The node key.
 CReference          48   The node value; a subclass of WeakReference.

Thus the incremental overhead for a single C object is either 104 bytes when a linear hash chain is used, or 128 bytes when a tree is used.

Actually it's a bit more than that because the hashtable (by default) uses a 75% load factor, so it also allocates about 1.5 pointers for each object; but that's neither here nor there, and it's a feature of the algorithm regardless of implementation.

But there are other, bigger problems: the Long.hashCode() method just mixes the low and high words together using xor. If all C pointers are 8 (or worse, 16) byte aligned, essentially only every 8th (or 16th) bucket is ever in use. So apart from the wasted buckets, the HashMap is very likely to end up using trees to store each chain.

So I wrote another hashtable implementation which addresses this by using the primitive long stored in the CReference directly as the key, and using the CReference itself as the bucket node. I also used a much better hash function. This reduced the memory overhead to just the 48 bytes for the CReference, plus a (tunable) overhead for the root table - anywhere from 1/4 to 1 entry per node works quite well with the improved hash function.
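
That is, the reference doubles as the hash node - extending the sketch above with a chain link (the field names are mine):

class CReference extends WeakReference<Object> {
    final long p;      // primitive key, no boxed Long
    CReference next;   // hash-chain link, no separate node objects

    CReference(long p, Object referent, ReferenceQueue<Object> queue) {
        super(referent, queue);
        this.p = p;
    }
}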

This uses less memory and runs a bit faster - mostly because the gc is run less often.

notzed.nativez

So I'm still working on wrapping this all up in a module, notzed.nativez, which will include the Java base class and a shared library for other JNI libraries to link to; the latter includes the (trivial) interface to the NativeZ object and some helpers to help write small and robust JNI libraries.

And then of course eventually port jjmpeg and zcl to use it.

Tagged java, nativez.