About

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed

Tags

android (44)
beagle (63)
biographical (97)
blogz (9)
business (1)
code (73)
cooking (31)
dez (7)
dusk (30)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (451)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (229)
java ee (3)
javafx (49)
jjmpeg (80)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (9)
opencl (120)
os (17)
panamaz (3)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (3)
zedzone (23)
Sunday, 03 May 2015, 00:29

Fixing netbeans' most utterly useless and annoying mis-feature

I finally got sick of hitting escape every time i used code completion to remove the damn tooltip which always shows up - despite having "turned off" tooltips and all the in-line popups i could find to turn off. This is the one that shows a grey box (in my colour scheme - everything is grey) which hovers just above the line your entering and shows a list of stuff I can't read anyway.

So I worked out what the mis-feature was actually called - it's called "show method parameters".

Then I downloaded the netbeans source code (all of it, i used a zip - netbeans-src, not sure if i could've just got the ide module, mercurial looked like it was going to take forever so i didn't bother).

I worked out where the offending bit of snot lived: "editor.completion"

I fixed it:

--- editor.completion/src/org/netbeans/modules/editor/completion/CompletionImpl.java~   2014-11-18 19:07:58.000000000 +1030
+++ editor.completion/src/org/netbeans/modules/editor/completion/CompletionImpl.java    2015-05-02 22:39:20.935817693 +0930
@@ -1271,16 +1271,6 @@
      * May be called from any thread but it will be rescheduled into AWT.
      */
     public void showToolTip() {
-        if (!SwingUtilities.isEventDispatchThread()) {
-            // Re-call this method in AWT if necessary
-            SwingUtilities.invokeLater(new ParamRunnable(ParamRunnable.SHOW_TOOL_TIP));
-            return;
-        }
-
-        if (ensureActiveProviders()) {
-            toolTipCancel();
-            toolTipQuery();
-        }
     }
 
     /**

And then after a few false starts trying to work out how to compile the module, I got it compiled. I put "cluser.config=ide" in nbbuild/user.build.properties and then ran 'ant jar' in the editor.completion directory. I'm not exactly sure what made it work in the end because things sort of failed and then they didn't. It refuses to build on jdk8 (as documented) but i had a jdk7 lying about already. It didn't take very long - i was genuinely impressed.

I found where the generated module went (nbbuild/netbeans/ide/modules/org-netbeans-modules-editor-completion.jar) and then copied that into both the system (/usr/local/netbeans-8.0/ide/modules/) and local (~/.netbeans/8.0/modules) module dir.

And then I started netbeans and checked to see if it took. So far looks ok.

What were they thinking?

It's a pretty strange feature to start with given it's functional overlap with code completion but it's a mystery why it is also turned on automatically every time you do any code completion or parameter lookup.

For starters the code completion window already shows this information - why pop up another less readable tooltip to duplicate it?

The other problem is that it shows the parameter lists for ALL functions of the same name not just the one specified by the current arguments (or even the type of the argument under the cursor). This just makes it a big mess with insufficient local context to make it readable at 'thinking speed'.

And the icing on the cake is that once you've 'done' your completion or parameter lookup task and moved the cursor - it decides it still wants to hang around a bit longer and jumps to the parameters of the function call you're inside of - i.e. if you're in an in-line lambda or a anonymous class definition it will move up to the function call that the lambda or new Thing is a parameter of. i.e. that code you already finished, perhaps months ago.

So ... out comes ESC. And again, and again. Almost EVERY TIME I use code completion. When i'm cutting lots of code this can be up to dozens of times in a given minute. Of course I type ctrl-space to initiate it that often too, but that's something I asked for and not some undecipherable clutter to piss off.

I already have all the popups and tooltips turned off and only call them up explicitly using ctrl-space; it got to the point there was so many little boxes flashing across the screen as i typed it was like standing around while flies keep landing on your face - once or twice is no big deal but constantly it becomes more than just a bit annoying. Swatting flies is a very apt description of using some ide's in their default configs.

That's it, ... for now ...

But given it was so easy to build I may look at another annoying behaviour. The 'show matching parenthesis' function shows the matching parenthesis of any bracket the cursor is merely touching and gives no indication of which side of the bracket the cursor is on. This just makes it harder to work out where the cursor insertion point is when you're otherwise not interested in the matching bracket.

It could just be familiarity (although i only recently turned it on in emacs so it really isn't) or it's lisp implementation but the way emacs works makes it more useful. It will only highlight the matching bracket if the cursor is to the right of a close bracket or sitting on an open bracket (insertion point is to the left). If it's both, it will highlight the opening bracket of a close bracket since that's more useful. emacs uses a block cursor rather than a line which I also find easier to use for fixed-width character editors for a couple of reasons including that it goes hollow instead of vanishing when the window loses focus - which makes it easier to relocate if you're switching between windows on the same screen(s).

Tagged code, java, rants.
Friday, 01 May 2015, 13:01

stream of abuse

I know it's a "cool new feature" and all, but really guys, think before you put everything in a stream.

I wont link to the article but i came across one that included some code that used streams everywhere for some examples. Apparently it's "obvious" how using streams simplifies the code and so on.

For a start the code isn't really any simpler; it avoids needing to calculate the output size but that isn't a very complicated calculation. Unless all you knew was streams it certainly isn't any more expressive or concise either.

I converted one function directly to arrays and it was shorter and runs at twice the speed for a given test case. It also uses about 1/6th of the memory and roughly 30 000 fewer memory allocations (oh my poor breaking gc, still, java alloc is rather fast isn't it?).

Apart from that ... oh boy it does some bad bad things to a poor innocent atomic counter.

My first thought was "very poor and unscalable use of atomic counter". Then I looked closer.

    // names changed to protect the innocent
    AtomicInteger count=new AtomicInteger();
    int bob[] = foo.stream()
        .map(f->{
            Thing t=list.get(count.getAndIncrement());
            int p0=f.a; int p1=f.b; int p2=f.c;
            int t0=t.a; int t1=t.b; int t2=t.c;
            return IntStream.of(p0, t0, p1, t1, p2, t2);
        }).flatMapToInt(i->i).toArray();

So what this is attempting to do is iterate through two and merge pairs at matching indices into a flattened array of integers.

Except one of those arrays has been forced into a stream (for no reason other than it can be done it seems) - thus losing the position information. So to "recover" this information so that it can be correctly indexed into the other array an atomic counter has been added. But this solution is both dangerous and confusing. It's confusing because it looks like it should be concurrent code - that's exactly what atomic counters are for and also a common desired side-effect of using streams, but this loop is not being invoked in a concurrent context. It's probably there because it was left-over from such an attempt. It's dangerous because it is vanishingly unlikely to actually work if it actually was invoked on a parallel stream because atomic counters by definition just don't provide the required constraints and thus there is no guarantee of obtaining matching pairs. At best it looks like a thoroughly cheeky and worst-practice approach to work around the final rules for lambdas.

Whatever it is, it's sick. Don't ever do this (in any language).

In the source example this is part of a larger function where the previous two loops generate one each of these lists independently and in-order and then after this merge they are simply discarded. i.e. there is not any practical reason for any of this intermediate garbage creation apart from saving a little arithmetic in pre-calculating the required size of the array required. The two loops could be retained and just write interleaved results and without losing the expressiveness of the implementation.

But for arguments sake if you really did want to merge two array lists of a known length to an interleaved integer array, here's one approach that has worked for a few decades and is still about as good as it's going to get:

 int[] bob = new int[foo.size() * 6];
 for (int i=0, j=0; i < foo.size(); i++) {
   Point2D f = foo.get(i), t=list.get(i);
   bob[j++] = f.a; bob[j++] = t.a;
   bob[j++] = f.b; bob[j++] = t.b;
   bob[j++] = f.c; bob[j++] = t.c;   
 }

Lets also say for arguments sake that the stream example did actually work in parallel, and you would gain anything from it (hint: you wont, see the end), so you got that for free right? What about that ancient and daggy old for loop?

 int[] bob = new int[foosize()*6];
 IntStream.range(0, foo.size()).parallel().forEach((i)-> {
   int j = i*6;
   Point2D f = foo.get(i), t=list.get(i);
   bob[j++] = f.a; bob[j++] = t.a;
   bob[j++] = f.b; bob[j++] = t.b;
   bob[j++] = f.c; bob[j++] = t.c;   
 }

Code re-use

One argument for using streams is code-reuse - which falls over once you have side-effects and so the initial example is no better than the last in that respect.

If you really absolutely positively must use streams for this because of whatever reason (there are some legitimate ones): write a proper spliterator and use StreamSupport.stream() to turn it into a stream. It will have to take the two lists in it's constructor and iterate over a container object which holds the matching pair (like Entry<,>).

One will note however that for example the linked list spliterator just breaks the iterable into batches of arrays which are then spliterated over. Whilst this conceptually may seem that it could be a win due to locality of reference and doing the work piece-meal: in reality the spliterators are run to their limit before anything starts for scheduling purposes. So all it's really doing is breaking a single allocation and copy loop into many (sqrt(n)) smaller ones; which is always guaranteed to run slower and use less memory. i.e. you could just flatten both lists to arrays and then use the per-item or array methods above and get the same result - more efficiently - and with less programmer effort.

So whilst streams can save a lot of effort in writing concurrent code; it still many of the same gotchas that can introduce performance side-effects for the ignorant. For example if you're using thread synchronisation primitives at all in any stream processing chain including custom spliterators or collectors then you're literally "doing it wrong" as:

the entire point of the parallel part of the stream framework is to exterminate this unscalable approach

(I thought it needed a bit more emphasis than em/strong/underline could provide :)

EntrySpliterator

I couldn't leave it there so I tried implementing an "Entry" spliterator: one that takes two streams and passes each one into the stream as an Entry.

Because you really don't want to run this on a linked list i forced it to take arraylists. But there are few cases (random deletes or deletes from the head) where you would want to use linked lists and the container approach to linked lists creates pretty shit lists since you lose the ability to delete randomly in O(1). So even when they might be a win for the name of the data structure they often aren't due to the implementation details. But I digress.

So using a spliterator which is under 10 lines of significant code:

    bob = StreamSupport.stream(new SpliteratorEntry<>(
        (ArrayList<Thing>) foo,
        (ArrayList<Thing>) list), false)
    .map(e -> {
        Thing f = e.getKey();
        Thing t = e.getValue();
        int p0 = f.a; int p1 = f.b; int p2 = f.c;
        int t0 = t.a; int t1 = t.b; int t2 = t.c;
        return IntStream.of(p0, t0, p1, t1, p2, t2);
}).flatMapToInt(i -> i).toArray();

It doesn't make any appreciable difference to the running time or to those underwhelming allocation overheads; but at least it now allows for a reusable function and it isn't just plain "wrong".

FWIW making this concurrent just makes it a bit slower while using more cpu cycles and energy. But I could have guessed "not worth it" beforehand since it just isn't doing enough work to compensate for all the overheads on such a small problem (a few thousand items).

Tagged hacking, humour, java, rants.
Thursday, 30 April 2015, 23:15

80ks!

My shitty scales said i broke through the 80kg floor this morning. It usually wavers a little over a week but the trend has been clear of late so even if it's not stable it's not far off being so.

Once I got to about 81kg I actually started to actually feel lighter, although the belt is the first indicator followed by the mirror. I've got a waist again.

Not sure how much more will go but 10kg from the start of Feb to the end of April is alright and i'm still eating like a chook.

Now if only my foot worked ...

Tagged biographical.
Thursday, 30 April 2015, 20:31

hmmm

Playing with spliterators again.

This was my first go of the row spliterator forEachRemaining() func, and does what I believe the stream api expects:

public void forEachRemaining(Consumer<? super ByteRow> action) {
    for (int y = this.y, ey = y + height; y < ey; y++) {
        ByteRow row = ...;
        ...
        pixels.getRow(y, row);
        action.accept(row);
    }
}

And this is the way I would rather do it, because well any other way is a bit dumb:

public void forEachRemaining(Consumer<? super ByteRow> action) {
    ByteRow row = pixels.createByteRow();
    for (int y = this.y, ey = y + height; y < ey; y++) {
        ...
        pixels.getRow(y, row);
        action.accept(row);
    }
}

Because spliterators are single-threaded this actually works for some (all?) of the stream operations that make sense like "map()" (even multiple stages), "collect()", "forEach()", and a couple more. This is because these (or the parts which take the partial result) are all run on the same context as the spliterator. It wont provide meaningful results for others like "toArray()" or "sort()" but those operations wouldn't even if they did and one could always map to a new copy.

Of course, a different streams implementation could change that but I think it's one of those "i'll fix it if it breaks" things and I suspect it wont ever be able to be altered because someone who pays money to oracle will also think the same thing but wont want to fix it on their purse.

Maybe keeping valid instances around makes more sense for tiles, but if needed a copying collector could do that instead.

Update: Ok I decided not to post this originally because i just wasn't feeling like it, but now here goes, so this is a pre-post "update".

I followed on these ideas and did a bunch of re-arranging and settled on a consistent approach: by default any in the "stream" are just references to per-spliterator instances. If they need to be unique for further prorcessing they can be duplicated by clone() or the like. If it turns out this assumption is not sufficiently useful i can just change the spliterators then.

I did a bit of refactoring to try and save some small common bits of code in the spliterators but it just seemed like fiddling at the edges and adding extra classes for no real benefit.

Then I ran out of steam a bit and haven't really done much else on it.

Been doing some mildly interesting things with javafx though - i was going to say a few days ago how nice it is to work with, but then i hit some exceedingly frustrating problems with very simple layout needs which really gave me the shits. I got it working but damn it took way longer than it should have. I also spent too long trying to work out how to blend against a generated mask before I remembered about using the clip node. It's only a simple compositing task but having it running interactive speed with "so little work" is nice.

Tagged hacking, java.
Monday, 27 April 2015, 19:12

ASM + Consumer = ?

I ended up adding bulk operations to my 'row accessor' for the remapping functions; which let me get away with a single driver implementation whilst still retaining most of the performance. I have a separate row accessor for each type and image depth - this is essentially an N-channel vector. Possibly a solution to the problem of complex numbers in java actually so there's something else to try.

So i kinda cleaned a bunch of it up and started a code repository and whatnot.

But after getting all that stuff sorted out; I found the api just wasn't very convenient.

For starters I tried the "andThen" thing of a consumer and found it doesn't work if you have a function which takes two arguments since the second one will still be processing the original input. But I guess thats what you should expect when you abuse the contract; what i really want is a function not a consumer. I'm just trying to avoid all that temporary garbage that would ensue if i kept allocating arrays for each output stage. I just changed andThen() to take only an in-place processor so it 'works', but this isn't very flexible.

This is probably something I can work through the accessor one way or another although anything that isn't trivially simple might be faster just leaving to new[]. Maybe provide in/out/tmp accessors and a way to toggle which one is which as it passes through the processing chain.

ASM bytecode editor

It would still also be nice to just write simple per-pixel operations without the likely potential of losing so much performance.

So I had a "quick look" this morning at using ASM to do some on-the-fly bytecode foo to make it happen. And well, I think here's a rabbit hole that might keep me occupied for some time ...

My first cut was based on the observation that hotspot will in-line a function call if the code-loop which calls it only invokes up to two instances of the function. So my first goal was to make this 'true' and let the JVM handle the optimisation. I have a a class which implements the Pixel accessor interface and can track a row. It has a forEach function which sets itself up and then invokes Consumer.accept() on itself for each location. When I want to run a particular Consumer over pixel data, I create a new class which is a copy of the original but renamed to be unique and then load this via a class loader. I then create an instance of this class and set it up for the target image and then invoke the forEach function against the Consumer. So this is effectively "specialising" the whole class for each target type.

Does it work? Well it's about 3-4x faster than when hot-spot de-optimises the function which is pretty good; but it's still about 2x slower than it was before hot-spot did the de-optimisation. By passing the consumer interface in the constructor i can knock a bit more off. Hmm, I did think maybe i could do something a bit more involved but now i think about it i'm not sure I can actually.

For an indication of the proportion I have a case that goes from 3.5ms (optimised) to 22ms (deoptimised) to 7ms (per-callback specialisation) to 5.8ms (callback in constructor). Better than a poke in the eye anyway and didn't take much work.

Well I guess the short of it is that whether or not I pursue it at this point it means I don't really have to worry too much about the deoptimisation drop-off when considering the api; if it really becomes an issue I do have the option of doing something about it. And i'm sure there are existing libraries for this if I really needed it (but for now i'm more interested in the mechanism than the results). ASM is something i've wanted to look at for a while anyway.

API humdrum

I can go back to thinking about that api again. Actually maybe the row-based one isn't so bad after-all, I originally made a mistake with the way I started my 'op library' and I was creating a new class for each operation which was a bit cluttery. When I fixed some of the definitions I could change the fixed-function stuff to lambdas and that improved it somewhat.

I'm not that happy with the in-place vs/in-addition-to the out-of-place functions - is it worth all that effort for a bit of speed in some cases? I'll have to at least try a functional/value-returning version to see what impact that has on performance and whether it can be mitigated using some thread-job-specific buffering. I guess that'll keep me off the streets for a little while longer yet.

I also started a basic gui 'thing' to exercise everything and build a useful tool I want for myself. But that will remain a slow-burner for the time being.

So since a 'quick look' turned into 'I guess I totally wrote that day off' I thought i may as well try the row streaming thing before dinner. Running time is ok - about 8.8ms for that earlier test except it's writing to a new image, but as one would expect it can be very very garbage heavy. I'm assuming the contract here is that the row needs to be allocated every call - i suspect in many cases it can just be allocated per-spliterator which makes a huge difference to the gc overhead although it probably breaks the streams contract(?). Using a custom map/collect which runs a batch of rows in a specific thread allows me to get rid of all that garbage and running time is in the order of 4.4ms which is acceptable. Unlike the earlier tests each function includes it's own loop but given each is working on it's own copy of the data it makes them more re-usable.

I'm fast approaching the point I just want to settle on something that works and throw away all these new experiments. Infact I think i'll do that right now; that should fill out the evening and round out the day.

Tagged hacking, java.
Sunday, 26 April 2015, 03:06

ahh

So I discovered from observation that the whole performance thing is basically some fundamental speed/space trade-off in the compiler. It seems that for any (abstract) class which implements a loop and calls a method defined in a sub-class inside that loop, it will be inlined (if possible and/or based on other parameters) for up to two sub-classes, but if you add a third those both will be thrown away and replaced with a version they all share which uses a method call instead.

This dynamic recompilation (or rebinding) is actually pretty cool but has some implications for the whole mechanism on which java streams/lambdas are built. It's only much of a "problem" for tasks that don't do much work or for platforms where call overhead is high and where the operation is being called multiple-millions of times. But lambdas rather encourage the former.

Micro-benchmarks overstate the impact and in any event it can still be worked around or avoided in performance-critical code. But regardless I know what to look for now if i see funny results.

I did some experimentation with yet another way to access and process the pixels: by row accessor. There are still on-going but my current approach is to have a Consumer which is supplied a type-specific row accessor. This allows flat-mapped access via a get(i)/set(i,v) interface pair and exposes the internal channel ordering (which is fixed and interleaved). To handle different storage formats the forEach function will perform data conversion and be supplied data-directional hints so that each implementation isn't forced to implement every possible storage format.

Performance seems to be pretty good and there don't seem to be any weird speed-degradation gotchas so far. The Consumer code still needs to implement a loop but it's somewhat cleaner than having to have an array+base index passed around and used everywhere and the inner loop being local seems to be the easiest way to guarantee good performance. Oh and I have a way to handle multi-image combining operations by just defining multi-row interfaces which can be passed with a single parameter.

I don't know yet whether I want to also go "full-on-stream" with this as well as the tiles, and/or whether i even want a "Pixel" interface anywhere. I'm getting ideas to feed back and forth between them though like the memory transfer hints and automatic format conversion which might also be applicable to tiles. Having a `Pixel' is still kind of handy for writing less code even if there are performance trade-offs from using it. Then again, how many interfaces do I want to write and build the scaffolding for? Then there is the question of whether these accesors are used in the base apis themselves outside of just supplying ways to read or write the information.

Well that was a short diversion to end the day but I actually spent most of today (actually almost a full working day's worth) implementing what I thought was a fairly modest task: adding a parameterisable coordinate mapper to an all-edge-cases handling "set pixels" function. I'm using it as part of the tile extraction so that for example you can extract a tile from anywhere and have the edges extended in different ways that suit your use.

I already had a mapper enum to handle this for other cases but performing the calculation per-pixel turned out to be rather slow, actually oddly slow now I think about it - like 100x slower then a straight copy and it's not really doing a lot of work. I tried writing an iterator which performed the calculation incrementally and although it was somewhat faster it was still fairly slow and exposed too much of the mess to the caller. After a few other ideas I settled on adding a batch row-operation method to the enum where it handles the whole calculation for a given primitive data type.

The mirror-with-edge-repeat case turned out to way more code than I would like but it's something like 20x faster than a trivial per-pixel mapped version. It breaks the target row into runs of in-order and reverse-order pixels and copies them accordingly. I can get a bit more performance (~20-30%) coding specific versions for each channel count/element size; but it's already enough code to have both float and byte versions. Hmm, maybe this is something that could take a row-pair-accessor, 4 arguments instead of 8, .. ahh but then i'd probably want to add bulk operations to the interface and it's starting to get complicated.

Well now i'm super-tired and not thinking straight so time to call it a night. My foot seems to be going south again and i've been downing all the tea and water I can stomach so i'll probably be up all night anyway :(

Tagged hacking, java.
Saturday, 25 April 2015, 13:32

Streaming tiles

Yesterday I pretty much got lost all day poking around streams and iterators and puzzling over compiler nuances. It was a horrid cold and windy day and my foot's in no state to do anything so I didn't really have much else to do on an rdo. I learnt way more than I really wanted to ...

I was trying to come up with ways to make pixel-programming 'easy' and fit it into the stream framework yet still run at near-optimal efficiency. Streams of primitives are fine for single-channel images but since you can only return a single value the typical 'function' interfaces just don't work.

This means that you have to return multi-element values in some other structure and given it's stored that way; just writing directly to the storage array with an offset works.

My initial approach was to internalise the loop over rows or spans of consective pixels and just pass offsets and a length. i.e. all operators are batch/vector operators. This lets one amortise per-row addressing costs and the compiler can do some other optimisations fairly simply.

The problem is that you're still left writing yet-another-fucking-for-loop each time you write a new function ... and apart from that you've lost the positional information which is sometimes required. Ho hum.

So then I tried functions which operate on a single pixel but are given the coordinates and the channel number as parameters. This looked fairly promising and the performance is still good but it doesn't really solve the initial problem in that each channel is still being processed separately.

Pixel

Then I tried a Pixel accessor object. This somewhat suprisingly ... was actually very fast.

Well not really. Is it? I thought it was? Things get a little odd here. I had various other cases but the most recent and relevant one is that as soon as the same class of an iterator is used for more than two callbacks suddenly the performance plummets by 5-10x; and stays there forever more. This would explain the strange results that I encountered in the previous post. It seemed like a compiler bug soI tried some older jdks and they were identical, but since i don't feel like removing all the lambdas they were all 1.8.x. It may just be a space/speed tradeoff rather than a bug.

So, ... well shit to that.

And I tried everything ... I first noticed that sharing a spliterator class between different target classes seemed to be the problem, but then i found i could use a general spliterator and then use a custom iterator for the inner-most loop, but then I found even that drops in speed once you have too many functions. So I tried per-target iterators and spliterators and forEach() implementations with out luck.

Since I had some pretty poor performance from immutable base types before I abused the interface somewhat anyway: each pixel is the same instance and I just change the parameters for each consumer invocation. This means the object can't be passed out of the first map() or forEach() stage. It still works for many cases though.

For example:

// greyscale quantise to 16 levels
src.pixels().forEach((Pixel p) -> {
    int v = p.get(0) & 0xf0;
    p.set(v | v>>4);
}

Just for kicks I might see how well an immutable accessor works - the pixel data will still be stored in the original image but the 'Pixel' object will track it's location so it can be passed through a map function. I can't see much use for chaining per-pixel processing in this way though as you can just add the extra steps in the same function.

If i'm really bored I might try one where the pixel data itself is also copied so you'd need to use a collector to reconstruct an image. Again; I can't see how this would be useful though.

Tile

Because I thought i'd solved the Pixel performance problem; and I had at least exhausted all my current ideas, I thought i'd tackle the next big one: tile processing.

So the goal here is to break an image into tiles, process each one independently, and then re-form the result. This ... actually looks a lot like a stream processor and indeed it fits the model quite well assuming you retain the tile home position through the stream (does this make it not really a stream? I don't know, I think it's close enough and the arguments otherwise are a bit pointless).

After a couple of iterations (you can do a lot in 14 hours) I'm getting pretty close to something i'm pretty pleased with. It supports a copy mode which can pass tiles through a processing pipeline and an accessor mode which can be used for readonly access or per-pixel operations. Tiles can be overlapped. I've got a collector which forms the tiles back into a single image based on their coordinates.

A more more ideas to try but this is the sort of thing I've got working already:

// process an image by tiles into a new image
BytePixels1 src = ...;
dst = src.tiles(64, 64, 0, 0)
    .map((Tile t) -> {
            System.out.printf("do stuff @ %d,%d\n", t.getX(), t.getY());
            return t;
        })
    .collect(new TileCollector(src.create()));

Because there are fewer of them there aren't so many concerns with gc blow-out as with the pixel case so these are pipelineable objects.

The practical api

I'm not going for api perfection but i'm working on practicality with a heavy dose of performance.

So I like the tile streaming bit: this is just a necessary feature and it's a pleasantly good fit for java streams and works well with threads.

I'm still not sure on approach to pixels yet, so far i've got convenience or performance but not both (at least not in the general case). This latter target seems to be in a bit of a fight with the compiler but there might be a way to put the specialisation in the correct place so the compiler knows how to optimise it or at least not slow down other cases. Another issue is that any temporary storage needs to be allocated in each callback as you don't have access to thread-instance information in order to index pre-allocated information. So anything that needs extra workspace will add even more overhead.

Ideally I would like to support java streams ... but it may be more practical to just use different mechanisms for pixels and leave the stream and multi-thread handling to tiles. i.e. i could just have serial map(), forEach(), and reduce() calls which can be invoked per-tile and leave the combining to the higher levels. Yes that sounds reasonable.

And there are also some cases where the stream interface doesn't really fit such as a calculation across the same location of different images (apart from using a spliterator to create a multi-source accessor?).

As an old boss used to say, just gotta suck it and see. Time to suck?

Steaming piles

The syntax for generics are giving me the shits a bit. Nesting gets ugly very quickly and when things get complicated it's a bit of pot-luck working out what is needed. This is obviously something that could be learnt but it's not very useful knowledge and i'd rather be doing more interesting things. So I'm mostly just twiddling strings and seeing what netbeans tells me - when it decides to respond.

Something has 'happened' to netbeans too. It's using way too much memory and running far too slow. I'm constantly getting out of memory errors which then throws the whole system into a confused state. I've tried closing all the projects i'm not working on and regularly close all windows and even disabling all the plugins i'm not immediately using right now. Barely makes any difference. It starts at 500MB and keeps growing until I have to restart it.

And that's besides all of it's really annoying behaviours which are apparently not bugs. I've turned off all the tooltips and everything automatic I can find but it still wants to pop up this shitty argument list thing every time I do an expansion via ctrl-space: i'm constantly having to hit escape just to fuck that shit off. And it seems to have given up guessing types altogether now. Type `Foo f = new <ctrl-space>' - No, I do not want or could possibly want to ever create a new AbstractMethodError. It's 'smart' build system is constantly out of date - i never know when i'll need to clean-rebuild so rather than face that hassle I just have to do it every time; and even then it executes some other set of classes so can be out of sync. It seems to randomly decide when it'll let you run the same application twice or not. That's one of those utterly pointless features that just isn't necessary and should be removed as soon as it doesn't work perfectly every single time - there's been enough bug reports about it.

And we're talking fairly small toy projects with little or not dependencies written by one man. This image library is only 5KLOC of code lines or 12KOC of source.

Emacs is looking more attractive but that has big problems formatting straightforward C these days (all too often forcing a file close - reopen for a momentary reprieve) and with the size of the java runtime api it's just not practical typing out every function call, package, and type name.

Tagged hacking, java.
Thursday, 23 April 2015, 17:17

spliterators are just iterators

I've been mucking about on-and-off with a 2D pixel library (i.e images) and as one part of that I was trying to utilise the stream functions from java 8 and/or at least look into some functional callbacks.

I can't just use the jre stuff because they don't have byte or float streams and they don't handle 2D data; I wanted to be able to support sub-regions of a rectangular array accessed by offset with different strides.

It took me a little while to wrap my head around spliterators because I was looking at them as if their primary job was to split the work up into smaller chunks to be worked across threads; but this is completely the wrong way to frame them. They are instead just iterators that can be broken into non-overlapping parts.

Once I realised that I had no trouble coming up with some 2D array streams of various types and trying to work out how to get better performance.

Given the code complexity i'm fairly impressed with the performance of the stream code in general but it doesn't always seem to interact with the jvm very well. Some code I have gets slower once hotspot has completed it's optimisations; which is odd. And sometimes it's really not too fast at all.

So I have a test-case which creates a 2D byte array of 1024x1024 elements and sums up every element as an integer 1000 times in tight loop.

The basic object i'm using for these tests is:

class BytePixels {
    public final byte[] pixels;
    public final int width, height, stride, offset;

    public int offset(int x, int y)  { return x+y*stride+offset; }
    public int getb(int x, int y)    { return pixels[offset(x,y)] & 0xff; }
    public int getb(int i)           { return pixels[i] & 0xff; }
}

The machine is a quad-core cpu and 4 threads are used whenever concurrency.

I'll go through some of the cases I tried and show the inner loops. For the most part i'm assuming offset is not zero since that case is a little simpler but its difference to runtime is minimal. I'm applying obvious micro-optimisations like copying loop bounds or array pointers to local variables.

The first examples are just using basic Java & jre facilities.

376ms Nested loops with getb(x,y)

This loops over the dimensions of the image and calls getb(x,y) to retrieve each element.

for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        sum += bp.getb(x, y);

This sets a base-line for the most obvious implementation.

387ms Nested loops with pixels[o++]

This loops over the dimensions but removes the internal array calculation by calculating the offset at the start of each row and then accessing the pixel values directly.

for (int y = 0; y < height; y++)
    for (int x = 0, o = bp.offset(0, y); x < width; x++)
        sum += pixels[o++] & 0xff;

This is supposedly optimised ... but executes marginally slower or about the same. Ahah what? I can't explain this one.

328ms Nested loops with pixels[o++] take 2

This is the same as the previous but changes the inner loop bounds checking so that the incrementing value is the same one used as the bounds check.

for (int y = 0; y < height; y++)
    for (int o = bp.offset(0, y), e = o + width; o < < e; o++)
        sum += pixels[o] & 0xff;

Given the simplicity of the change; this makes a pretty big difference. I must remember this for the future.

I only tried this whilst writing this post so went back and applied it to some of the other algorithms ...

2980ms map/reduce in parallel with flat indices.

This uses IntStream.range().parallel() to create a stream of all indices in the array and uses .map() to turn it into an array of elements. Because range is only 1-dimentional this is not a general solution merely 'best case'.

IntStream.range(0, width * height)
    .parallel()
    .map((i) -> getb(i))
    .sum();

So ... it is not very 'best'. Actually a further wrinkle is unlike other examples this one runs slower after hotspot has completed it's optimisations and the first-run is the quickest. And we're talking 2-3x slower here not just a bit.

3373ms map/reduce in parallel with 2D indices.

This uses IntStream.range().parallel() to create a stream of all indices within the bounds of the image and then remaps these back to 2D coordinates inside the .map() function. This is the general solution which implements the required features.

IntStream.range(0, width * height)
    .parallel()
    .map((i) -> getb(offset + (i / width) * stride + (i % width)));
    .sum();

Not surprisingly it's a bit slower, and like the previous example also gets slower once optimised.

Basically; that's a no then, this just isn't going to cut it. So time to see if a custom spliterator will do it.

1960ms map/reduce in parallel with custom 2D spliterator

This uses a custom 2D spliterator that splits (at most) into rows and then uses map() to retrieve the pixel values.

StreamSupport.intStream(new Spliterator2DIndex(width, height, offset, stride), true)
    .map((i) -> getb(i))
    .sum();

Still "slow as shit" to use the correct technical jargon.

This is about when I worked out how to use spliterators and tried the next obvious thing: using the spliterator to do the map() since that seems to be "problematic".

125ms stream reduce in parallel with custom 2D array spliterator

The custom 2D spliterator performs the array lookups itself and feeds out the values as the stream. It supports the full features required.

StreamSupport.intStream(new Spliterator2DByteArray(width, height, offset, stride, pixels), true)
    .sum();

Ok, so now this is more like it. It's finally managed to beat that single threaded code.

Until this point I was ready to ditch even bothering with the stream code; one can deal with a bit of a performance drop for some convenience and code re-use, but 10x is just unacceptable.

115ms stream reduce in parallel with custom 1D array spliterator

This loops over all elements of an array and feeds out the array values. This is a non-conformant solution to determine the overheads of the 2D indexing.

StreamSupport.intStream(new SpliteratorByteArray(offset, width * height, pixels), true)
    .sum();

The overheads of the 2D case are not zero but they are modest.

I tried a bunch of other things right through to a completely different multi-threaded forEach() call and polled queues and a more OpenCL-like approach to the processing model; my best result was under 100ms.

By `OpenCL approach' I made the thread index an explicitly available parameter so that for example a reduction step can just write a partial result directly to a pre-allocated array of values by 'group id' rather than having to allocate and pass back working storage. This together with an exotic single-reader work queue produced the best result. I also learned how to use park(), ... maybe something for a later post.

Optimisations

I found it particularly critical to optimise the spliterator iteration function forEachRemaining(). Even trivial changes can have a big impact.

The difference between this:

private final byte[] pixels;
private final int width, stride;
private int height, offset;
private int x, y;

public void forEachRemaining(IntConsumer action) {
    while (y < height) {
        while (x < width)
            action.accept(pixels[(x++) + y * stride + offset] & 0xff);
        x = 0;
        y += 1;
    }
}

And this:

private final byte[] pixels;
private final int width, stride;
private int height, offset;
private int x, y;

public void forEachRemaining(IntConsumer action) {
    int y = this.y;
    int height = this.height;
    int x = this.x;
    byte[] pixels = this.pixels;
    int offset = this.offset;

    while (y < height) {
        for (int o = x + y * stride + offset, e = y * stride + offset + width; o < e; o++)
            action.accept(pixels[o] & 0xff);
        y += 1;
        x = 0;
    }
    this.x = x;
    this.y = y;
}

Is an over 600% difference in running time.

For images

These streams aren't really very useful for a lot of image processing; only handling single channel data and losing the positional information leaves only a few basic and not widely useful operations left over. The other stream interfaces like collections just wont scale very well to images either: they require intermediate blocks to be allocated rather than just referencing sub-ranges.

So I have been and probably will continue to use the streams in these cases as multi-threaded forEach() and gain the concurrency serialisation implicitly by each task working on independent values. But I can also just borrow ideas and implement similar interfaces even if they are not compatible, e.g. a reduce call that just runs directly on the data without forming a stream first. This might let me get some of the code re-usability and brevity benefits without trying to force a not-quite-right processing model onto the problem.

But being able to create custom/multi-dimensional spliterators is pretty handy all the same.

Tagged hacking, java.
Newer Posts | Older Posts
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!