So, over the last few days I've been looking into how to re-do the processing graph stuff in my customer's application. The first reason is that I need to change the way it works, from a real-time interface to an interactive/batched one. The second is that thinking about how Aparapi works (or could work with added concurrency) got me thinking that I could do a much better job of simplifying the whole lot, whilst making it more useful.
Currently I have a processing tree, and use threads and multiple queues to feed the data around so the main feeding thread isn't blocked on long-running tasks. There are basically 3 levels of abstraction. The top level is the tree itself, which consists of processing nodes invoked in the correct order; these might call kernels directly or call higher-level components. The middle level is only there sometimes, and includes more complex routines which combine several OpenCL kernels in various ways along with their work data (e.g. something like KLT). And the lowest level is the direct kernel bindings (which usually do not manage their own data), very much the same as the ones in socles. The tree only defines the invocation order; data relationships are statically created/assigned at tree creation time or dynamically synchronised at run-time.
This works ok, but it is pretty messy and none of the top level is re-usable in the least (and there's little middle level to re-use). It is actually fairly efficient - 'real time' means that for some tasks I have plenty of spare time to waste and give up to background jobs, so small bits of cpu-synchronous code aren't a deal-breaker. Obviously I want to simplify the usage of this, whilst increasing the possibilities for automatic job-level parallelism.
My first re-cut of this was to take much the same idea, but make a couple of alterations. Firstly, to re-arrange the abstraction levels so that the top level doesn't do so much direct kernel invocation, moving that code into second-tier components which are hopefully more re-usable. Either way this should be a big plus. And secondly, to simplify the data management: use the tree to define data flow, add automatic data conversion between stages, plus a bit of double buffering to cope with cpu synchronisation and some cpu parallelism (at least, one-way cpu-to-gpu). But otherwise it is basically just a synchronous fixed call-tree managed by a single queue.
But I don't think this is going to cut it either. It's not really flexible enough, and unless I have lots of batch processes running concurrently (which I won't) the device will be underutilised in many cases.
Last night I started working on a socles version of something similar, but quickly got bogged down in the data-conversion issues, which get messy pretty fast (I want to be able to mix and match image, array, and Java-native graph nodes without each having to worry about where the data is coming from). And that was before I got onto the synchronisation stuff.
Yesterday was just a crap day anyway: very little sleep, grumpy as hell, and not able to think straight, so I was flying blind a bit and just not on the ball ...
Then this morning I realised that OpenCL already has all of the stuff required to manage the processing graph (and it can handle a graph, not just a tree). I'm sure now that I knew this, and that it is one reason I didn't want to do a call-graph interface in the first place; I'd just forgotten about it. I might have realised earlier if my searches hadn't kept pointing to the 1.0 spec. The graph invoker just has to call the processing nodes in the right order, and it can build and maintain the events used to link them all together. User events can be used to synchronise with (asynchronous/multi-threaded) Java-side processing, so I don't need to stop that branch entirely just to do some cpu code. All the processing nodes need to do in addition to their work is follow the standard OpenCL convention of taking an event wait list and returning a completion event, which ensures synchronisation. I can possibly even manage the queue stuff automatically. The data conversion stuff will still be a pain - but it's just a pain anyway and can't be avoided.
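To make the event idea concrete, here is a rough sketch of the shape of it in plain Java. It does not use OpenCL at all: `CompletableFuture` stands in for an OpenCL event, the hypothetical `enqueue` helper stands in for an enqueue call that takes a wait list and returns an event (as `clEnqueueNDRangeKernel` does), and a manually completed future plays the role of a user event (`clCreateUserEvent` / `clSetUserEventStatus` in OpenCL 1.1 and later). The node names are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch only: CompletableFuture is a stand-in for an OpenCL event.
class EventGraphSketch {

    // Hypothetical helper: "enqueue" a node that starts only once every
    // event in waitList has completed, and return its own completion event.
    static CompletableFuture<Void> enqueue(List<String> log, String name,
                                           CompletableFuture<?>... waitList) {
        return CompletableFuture.allOf(waitList)
                .thenRunAsync(() -> log.add(name)); // the node's "work"
    }

    // Diamond graph: load -> {blur, cpu-stage} -> combine
    static List<String> runDiamond() {
        List<String> log = new CopyOnWriteArrayList<>();

        CompletableFuture<Void> load = enqueue(log, "load");
        CompletableFuture<Void> blur = enqueue(log, "blur", load);

        // Stand-in for a user event: host-side code completes it by hand
        // when the asynchronous Java stage has finished.
        CompletableFuture<Void> cpuDone = new CompletableFuture<>();
        load.thenRunAsync(() -> {
            log.add("cpu-stage");
            cpuDone.complete(null);
        });

        // combine waits on both branches; neither branch blocks the other.
        CompletableFuture<Void> combine = enqueue(log, "combine", blur, cpuDone);
        combine.join();
        return log;
    }

    public static void main(String[] args) {
        // "load" is always first and "combine" always last;
        // the two middle entries may appear in either order.
        System.out.println(runDiamond());
    }
}
```

The point of the sketch is only that the invoker never waits between nodes: it hands each node its wait list and moves on, and the only blocking call is the final `join()`.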
Representing the graph in a simple way and turning that into an invocation sequence with events is another issue, but at least it gives me something new (to me) and useful to learn about.
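One simple way to approach that, as a sketch rather than what I'll necessarily end up with: describe each node by name together with the names of the nodes it reads from, then use a topological sort (Kahn's algorithm) to get a valid invocation order. Each node's input list then doubles as its event wait list. The class, method and node names below are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a graph described as "node name -> names of its inputs".
class GraphOrder {
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    GraphOrder node(String name, String... inputs) {
        deps.put(name, Arrays.asList(inputs));
        return this;
    }

    // The events a node must wait on are simply those of its inputs.
    List<String> waitList(String name) {
        return deps.get(name);
    }

    // Kahn's algorithm: repeatedly emit nodes whose inputs are all emitted.
    List<String> invocationOrder() {
        Map<String, Integer> pending = new LinkedHashMap<>();
        Map<String, List<String>> consumers = new LinkedHashMap<>();
        for (String n : deps.keySet()) {
            pending.put(n, 0);
            consumers.put(n, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            for (String in : e.getValue()) {
                if (!deps.containsKey(in)) {
                    throw new IllegalArgumentException("unknown input: " + in);
                }
                consumers.get(in).add(e.getKey());
                pending.merge(e.getKey(), 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n);
            for (String c : consumers.get(n)) {
                if (pending.merge(c, -1, Integer::sum) == 0) ready.add(c);
            }
        }
        if (order.size() != deps.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        GraphOrder g = new GraphOrder()
                .node("load")
                .node("blur", "load")
                .node("edges", "load")
                .node("combine", "blur", "edges");
        System.out.println(g.invocationOrder()); // [load, blur, edges, combine]
        System.out.println(g.waitList("combine")); // [blur, edges]
    }
}
```

The invoker would walk the order once, enqueue each node with the events of its `waitList`, and keep the returned event for later nodes to wait on.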