ASM + Consumer = ?
I ended up adding bulk operations to my 'row accessor' for the remapping functions, which let me get away with a single driver implementation whilst still retaining most of the performance. I have a separate row accessor for each type and image depth - each is essentially an N-channel vector. It's possibly a solution to the problem of complex numbers in Java actually, so there's something else to try.
So I kinda cleaned a bunch of it up and started a code repository and whatnot.
But after getting all that stuff sorted out, I found the API just wasn't very convenient.
For starters I tried the "andThen" thing of a consumer and found it doesn't work if you have a function which takes two arguments (input and output), since the second one will still be processing the original input. But I guess that's what you should expect when you abuse the contract; what I really want is a function, not a consumer. I'm just trying to avoid all the temporary garbage that would ensue if I kept allocating arrays for each output stage. I just changed andThen() to take only an in-place processor so it 'works', but this isn't very flexible.
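A minimal sketch of the problem and the workaround, not the actual library code: InPlaceOp and AndThenDemo are hypothetical names, and a plain float[] stands in for the row accessor. The two-argument chain uses the standard BiConsumer.andThen(), which passes the same (src, dst) pair to both stages.

```java
import java.util.Arrays;
import java.util.function.BiConsumer;

/** Hypothetical in-place row operation: reads and overwrites one buffer. */
@FunctionalInterface
interface InPlaceOp {
    void apply(float[] row);

    /** Chain two in-place ops; the second sees the output of the first. */
    default InPlaceOp andThen(InPlaceOp next) {
        return row -> {
            apply(row);
            next.apply(row);
        };
    }
}

class AndThenDemo {
    /** Two-argument chain: both stages read the ORIGINAL src. */
    static float[] twoArgChain(float[] src) {
        BiConsumer<float[], float[]> scale = (s, d) -> {
            for (int i = 0; i < s.length; i++) d[i] = s[i] * 2f;
        };
        BiConsumer<float[], float[]> offset = (s, d) -> {
            for (int i = 0; i < s.length; i++) d[i] = s[i] + 1f;
        };
        float[] dst = new float[src.length];
        // offset reads src again and overwrites what scale wrote to dst
        scale.andThen(offset).accept(src, dst);
        return dst;
    }

    /** In-place chain: each stage works on the previous stage's result. */
    static float[] inPlaceChain(float[] src) {
        InPlaceOp scale = row -> {
            for (int i = 0; i < row.length; i++) row[i] *= 2f;
        };
        InPlaceOp offset = row -> {
            for (int i = 0; i < row.length; i++) row[i] += 1f;
        };
        float[] row = src.clone(); // copy only so the demo leaves src intact
        scale.andThen(offset).apply(row);
        return row;
    }

    public static void main(String[] args) {
        float[] src = {1f, 2f, 3f};
        System.out.println(Arrays.toString(twoArgChain(src)));  // [2.0, 3.0, 4.0]
        System.out.println(Arrays.toString(inPlaceChain(src))); // [3.0, 5.0, 7.0]
    }
}
```

The in-place version gives the composed result (x*2+1) without allocating per stage, but only for operations that can safely overwrite their own input.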
This is probably something I can work through the accessor one way or another, although anything that isn't trivially simple might be faster if it's just left to new. Maybe provide in/out/tmp accessors and a way to toggle which one is which as the data passes through the processing chain.
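One way the in/out/tmp toggling could look, as a rough sketch only: RowOp and PingPongChain are made-up names and plain float[] buffers stand in for the accessors. Two scratch buffers are allocated once per chain and swap roles after each stage, so the caller's input is never written to.

```java
/** Hypothetical out-of-place row operation: reads src, writes dst. */
interface RowOp {
    void apply(float[] src, float[] dst);
}

class PingPongChain {
    /** Runs ops in order, toggling which buffer is "in" and which is "out". */
    static float[] run(float[] input, RowOp... ops) {
        float[] in = input;
        float[] out = new float[input.length];
        float[] tmp = new float[input.length];
        for (RowOp op : ops) {
            op.apply(in, out);
            // The buffer just read becomes the next output, except for the
            // caller's input, which is swapped out for tmp after stage one.
            float[] next = (in == input) ? tmp : in;
            in = out;
            out = next;
        }
        return in; // result of the last stage (or input itself if no ops)
    }
}
```

This needs only two allocations regardless of how many stages are chained, at the cost of every stage having to write a full output row.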
ASM bytecode editor
It would still also be nice to just write simple per-pixel operations without the likely potential of losing so much performance.
So I had a "quick look" this morning at using ASM to do some on-the-fly bytecode foo to make it happen. And well, I think here's a rabbit hole that might keep me occupied for some time ...
My first cut was based on the observation that HotSpot will inline a function call if the loop which calls it only invokes up to two instances of the function. So my first goal was to make this 'true' and let the JVM handle the optimisation. I have a class which implements the Pixel accessor interface and can track a row. It has a forEach function which sets itself up and then invokes Consumer.accept() on itself for each location. When I want to run a particular Consumer over pixel data, I create a new class which is a copy of the original but renamed to be unique, and then load this via a class loader. I then create an instance of this class, set it up for the target image, and invoke the forEach function against the Consumer. So this is effectively "specialising" the whole class for each target type.
Does it work? Well it's about 3-4x faster than when HotSpot de-optimises the function, which is pretty good; but it's still about 2x slower than it was before HotSpot did the de-optimisation. By passing the consumer interface in the constructor I can knock a bit more off. Hmm, I did think maybe I could do something a bit more involved, but now I think about it I'm not sure I can actually.
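To make the shape of that accessor concrete, here is a guess at its structure, reduced to a single float channel. The Pixel methods and the FloatRowAccessor name are assumptions, not the real interface. The ASM step is not shown; it would copy this class under a fresh name for each Consumer so that every copy's forEach loop only ever sees one target for accept().

```java
import java.util.function.Consumer;

/** Hypothetical single-channel pixel view; the real interface is not shown in the post. */
interface Pixel {
    float get();
    void set(float v);
}

/** Accessor that tracks a position and hands ITSELF to the consumer. */
final class FloatRowAccessor implements Pixel {
    private final float[] data;
    private final int width;
    private final int height;
    private int index; // current pixel location

    FloatRowAccessor(float[] data, int width, int height) {
        this.data = data;
        this.width = width;
        this.height = height;
    }

    @Override public float get() { return data[index]; }
    @Override public void set(float v) { data[index] = v; }

    /** Visits every pixel row by row; no per-pixel objects are allocated. */
    void forEach(Consumer<Pixel> op) {
        for (int y = 0; y < height; y++) {
            index = y * width;
            for (int x = 0; x < width; x++) {
                op.accept(this); // the call site HotSpot needs to inline
                index++;
            }
        }
    }
}
```

A per-pixel operation is then just a lambda, for example `acc.forEach(p -> p.set(p.get() * 2f))`.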
For an indication of the proportion I have a case that goes from 3.5ms (optimised) to 22ms (deoptimised) to 7ms (per-callback specialisation) to 5.8ms (callback in constructor). Better than a poke in the eye anyway and didn't take much work.
Well I guess the short of it is that, whether or not I pursue it at this point, I don't really have to worry too much about the deoptimisation drop-off when considering the API; if it really becomes an issue I do have the option of doing something about it. And I'm sure there are existing libraries for this if I really needed it (but for now I'm more interested in the mechanism than the results). ASM is something I've wanted to look at for a while anyway.
I can go back to thinking about that API again. Actually maybe the row-based one isn't so bad after all. I originally made a mistake with the way I started my 'op library': I was creating a new class for each operation, which was a bit cluttery. When I fixed some of the definitions I could change the fixed-function stuff to lambdas and that improved it somewhat.
I'm not that happy with the in-place vs/in-addition-to the out-of-place functions - is it worth all that effort for a bit of speed in some cases? I'll have to at least try a functional/value-returning version to see what impact that has on performance and whether it can be mitigated using some thread-job-specific buffering. I guess that'll keep me off the streets for a little while longer yet.
I also started a basic gui 'thing' to exercise everything and build a useful tool I want for myself. But that will remain a slow-burner for the time being.
So since a 'quick look' turned into 'I guess I totally wrote that day off' I thought I may as well try the row streaming thing before dinner. Running time is ok - about 8.8ms for that earlier test, except it's writing to a new image - but as one would expect it can be very very garbage heavy. I'm assuming the contract here is that the row needs to be allocated every call - I suspect in many cases it can just be allocated per-spliterator, which makes a huge difference to the gc overhead although it probably breaks the streams contract(?). Using a custom map/collect which runs a batch of rows in a specific thread allows me to get rid of all that garbage, and running time is in the order of 4.4ms which is acceptable. Unlike the earlier tests each function includes its own loop, but given each is working on its own copy of the data it makes them more re-usable.
I'm fast approaching the point I just want to settle on something that works and throw away all these new experiments. In fact I think I'll do that right now; that should fill out the evening and round out the day.
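The batching idea might look something like the following. It is a stand-in using a parallel IntStream over batches rather than the custom map/collect mentioned above, and BatchedRows, its parameters and the float[] row type are all assumptions. Each batch allocates one scratch row and reuses it for every row it processes.

```java
import java.util.function.Consumer;
import java.util.stream.IntStream;

class BatchedRows {
    /**
     * Applies op to a copy of each row of src and writes the result to a new
     * image. Rows are grouped into batches; each batch runs on one thread and
     * reuses a single scratch row instead of allocating one per row.
     */
    static float[] map(float[] src, int width, int height, int batch,
                       Consumer<float[]> op) {
        float[] dst = new float[src.length];
        int nbatches = (height + batch - 1) / batch; // round up
        IntStream.range(0, nbatches).parallel().forEach(b -> {
            float[] row = new float[width]; // one allocation per batch
            int yEnd = Math.min(height, (b + 1) * batch);
            for (int y = b * batch; y < yEnd; y++) {
                System.arraycopy(src, y * width, row, 0, width);
                op.accept(row); // the op loops over its own private copy
                System.arraycopy(row, 0, dst, y * width, width);
            }
        });
        return dst;
    }
}
```

Batches write to disjoint ranges of dst, so no synchronisation is needed beyond the stream's own join.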