About

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed

Tags

android (44)
beagle (63)
biographical (87)
blogz (7)
business (1)
code (63)
cooking (30)
dez (7)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (434)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (224)
java ee (3)
javafx (48)
jjmpeg (77)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (8)
opencl (119)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (3)
zcl (1)
zedzone (21)
Monday, 19 March 2012, 18:25

JNI overheads

Update: Also see the next post which has a slightly more real-world example.

So, i've been poking around with some JNI binding stuff and this time I was experimenting with different types of interfaces: I was thinking of doing more work on the C side of things so I can just directly use the native interfaces rather than having to go through wrappers all the time.

So I was curious to see how much overhead a single field access would be.

I'll go straight to the results.

This is the type of stuff I'm testing:

class HObject {
  long handle;
  public void call1() { invokes static native void call(this.handle); }
  public void call2() { invokes native void call(this.handle); }
  native public void call3();
}
class BObject {
  ByteBuffer handle;

  public void call1() { invokes static native void call(this.handle); }
}

All each native function does is resolve the 'handle' to the desired pointer, and assign it to a static location. e.g. for a ByteBuffer it must call JNIEnv.GetDirectBufferAddress(), for long it can just use the parameter directly, and for an object reference is must call back into the JVM to get the field before doing either of those.The timings ... are for

10^6

10x10^6 invocations, spread over 10050 objects (some attempt to avoid unrealistic optimisations), repeated 5 times: the last result is shown.

What \ time
0 static native void call(HObject o)            0.15
1 HObject.call1()                               0.10
2 static native void call(long handle)          0.10
3 static native void call(HObject.handle)       0.10
4 HObject.call2()                               0.12
5 HObject.call3()                               0.14
6 static native void call (BObject o)           0.59 (!)
7 BObject.call1()                               0.36 (!)

The timings varied a bit so I just showed them to 2 significant figures.

Whilst case 2 isn't useful, cases 1 and 3 show that hotspot has effectively optimised out the field dereferences after a few runs. Obviously this is the fastest way to do it. Although it's interesting that the static native method (1) is measurably different to the object native method (4).

The last case is basically how I implemented most of the bindings i've used so far, so I guess I should have a re-think here. There are historical reasons I implemented jjmpeg this way - I was going to write java-side struct accessors. But since I dropped that idea as impractical, it may make sense for a rethink here. PDFZ does have some java-side struct builders, but these are only for simple arrays which are already typed appropriately.

I didn't test the more complex double-level used in jjmpeg which allows it's native objects to be garbage collected more easily.

Conclusions

So I was thinking I could implement code using case 0 or 5: this way the native calls can just be used directly without any extra glue-code.

There are overheads compared to cases 1 and 4, but it's less than 50%, and relatively speaking it will be far less than that. And most of this is due to the fact that hotspot can remove the field access entirely (which is of course: very cool).

Although it is interesting that a static native call from a local method is faster than a local native call from a local method. Whereas a static native call with an object parameter is slower than a local native call with an object parameter.

Although

10^6

10x10^6 calls are a lot of calls, so the absolute overhead is pretty insignificant even for the worst-case. Even if it's 5x slower, it's still only 59 vs 10 ns per call.

Small Arrays

This has me curious now: I wonder what the overhead for small arrays are, versus using a ByteBuffer/IntBuffer, etc.

I ran some tests with ByteBuffer/IntBuffer vs int[], using Get/ReleaseArrayElements vs using alloca(), and Get(Set)ArrayRegion. The test passes from 0 to 60 integers to the C code, which just adds up the elements.

Types used:

IntBuffer ib = ByteBuffer.allocateDirect(nels * 4)
  .order(ByteOrder.nativeOrder()).asIntBuffer;
ByteBuffer bb = ByteBuffer.allocateDirect(nels * 4)
  .order(ByteOrder.nativeOrder());
int[] ia = new int[nels];

Interestingly:

But this was just using the same array. What about if i update the content each time? Here I am using the same object, but setting it's content before invocation.

Obviously the ByteBuffer suffers from calling setInt(), but all the values are within 30% of each other so it's much of a muchness really.

And finally, what if the object is created every time as well?

So this contains few few curious results.

Firstly (well perhaps not so curious), only if you know the direct Buffer has been allocated beforehand is it always going to win. Dynamic allocation will be a killer; a cache might even it up, but i'm doubtful it would put any Buffer back to a winning spot.

Secondly - again not so curious: small array allocation is pretty damn fast. The timings hint that these small loops might be optimising away the allocation completely which cannot be done for the direct buffers.

And finally the strangest result; copying the whole array to the stack is usually faster than trying to access it directly. Possibly the latter case is either having to take the memory from the heap first and is effectively just doing the same thing. Or it needs to lock the region or perform other GC-related things which slows it down.

Small arrays aren't the only thing needed for a JNI binding, but they do come up often enough. Now I know they're just fine to use, I will look at using them more: they will be easier to use on the Java side too.

Update: So I realised I'd forgotten Get/ReleasePrimitiveArrayCritical: for the final test cases, this was always a bit faster even than Get/SetArrayRegion. I don't know if it has other detrimental effects in an MT application though.

However, it does seem to work fine for very large arrays too, so it might be the simple one-stop shop, as at least on Oracle's JVM it always seems to be the fastest solution.

I tried some runs of 1024 and 10240 elements, and oddly enough the same relative results hold in all cases. Direct buffers only win when pre-allocated, GetIntArrayRegion is equal/faster to GetIntArrayElements, and GetCriticalArray is the fastest.

Tagged hacking, java, jjmpeg, pdfz.
More JNI'ing about | Filter Graph
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!