post google code post
Well nobody bothered to comment about the stuff i removed from google code apart from the one lad or lass who lamented the loss of some javafx demos.
I had comments open+moderated for a few weeks but got hit by spammers a couple of days ago so had to go back to id+moderated. Maybe something got lost in those 500 bits of snot but i don't think so. The spam was quite strange; most mentioned web sites but didn't provide links or weren't very readable so i'm not sure what the point was. Perhaps they're just fishing for open sites or naive moderators they can then exploit. Like the "windows computer department" that keeps calling and calling hoping i'll not tell them to fuck off every time (sigh, no i don't normally say that although i would tonight).
I've still got the subversion clones but i'm not inclined to do much with any of it for the foreseeable future and i'm not even sure if i'm going to continue publishing other bits of code i play with going forward.
Desktop Java, OpenCL, ARM assembly language; these things are just not very common in the Free Software world. Server Java is pretty common but that's just, well, `open sauce' companies sharing costs and not hobbyists. So i think all i'm really doing is providing hints or solutions for some student's homework or help for graduate programmers to keep their jobs. And even then it's so niche it wouldn't be many, if any.
As an example of niche, I was looking up some way to communicate with adobe photoshop that doesn't involve psd format and one thing i came across was someone linking to one of my projects for some unfinished experiments with openraster format - on the first page of results. This happens rarely but still too often. Of course it could just be the search engine trying to be smart and tuning results to the user, which is a somewhat terrifying possibility (implications beyond these types of searches of course). FWIW I came to the conclusion photoshop is just one of those proprietary relics from the past which intentionally refuses to support other formats so its idiot users can continue to be arse-reamed by its inflated price.
It's just a hobby
As a hobby i have no desire to work on larger projects of my own or other established projects in my spare time. Occasionally i'll send in a patch to a project but if they want a bunch of fucking around then yeah, ... naah. In hindsight i somewhat regret how we did it on evolution but i think i've mentioned that before. Neither do i need to solicit work or build a portfolio or just gain experience.
I'm not sure how many hobbyists are around; anyone with remotely close to enough skill seems to be jumping into the wild casinos of app-stores or services and expecting to make billion$ and not just doing it for the fun of it. Some of those left over just seem to be arrogant egotistical fuckwits (and some would probably think the same of me). Same as it ever was I guess.
I suppose I will continue to code-drop even if it's just out of habit.
For another hobby I made kumquat marmalade on the weekend. Spent a couple of hours in the sun slicing the tiny fruit and extracting seeds (2-3 cups worth of seeds) and cooked it the next day. Unfortunately after all that effort it looks like it wasn't cooked quite enough and it probably won't set - it's a bit runny but at least it tastes good. Not sure what i'll do with 2-odd litres of the stuff though.
ahh shit, google code is being scrapped
Joy. I guess it's not making google enough money.
I'll make a copy of all my projects and then delete them. Probably sooner rather than later (probably immediately because i'd rather get it out of the way, i have a script going now). I'm keeping all the commit history for now although i never find it particularly useful. Links in old posts may break.
Some of them may or may not appear later somewhere else, or they may just sit on my hdd until i forget about them and delete them by accident or otherwise.
Wherever they end up though it won't be in one of the other commercial services because i'm not interested in doing this again.
I need to do the same to this blog, ... but that can wait.
Update: So ... it seems someone did notice. Don't worry i've got a full backup of everything. In part I was pissed off with google and in part I wanted to make it immediately obvious it was going away to see if anyone actually noticed. Who could know from the years of feedback i've never received.
If you are interested in particular projects then add a comment about them anywhere on this blog. All comments are moderated so they won't appear till i let them through but there is no need to post more than once (i'll delete any nonsense or spam unless it makes you look like a bit of a wanker). I will see about enabling anonymous comments for the moment if they are not already on. I can then decide what to do depending on the interest.
I will not be using github or sourceforge nor the like.
In a follow-up post i'll see where i'm going with them as well as pondering the future location and shape of the blog itself. I've already written some stuff to take the blogger atom stuff, strip out the posts, download images, and fix the urls.
(in resp to Peter below) I knew the writing was on the wall when google code removed binary hosting: it's kinda useless for its purpose without it. This is why all my new projects subsequent to that date are hosted differently.
So google have decided to disable downloads on google code.
So I have decided to stop using it.
... although as yet I have no concrete plans or timeline for when this decision will take effect.
Whilst they claim it's about abuse, one can only assume that is just a "likely-sounding excuse" for what in reality is just another straight-up lie from the PR department of a supra-national conglomerate, and it's really just a way to cut costs and promote their 'drive' service (a useless microsoft/apple only service as far as i'm concerned).
Nobody seems to have reported that they have also gimped their POP interface to gmail a couple of days ago. No more UID support. This makes POP a lot less reliable/useful as a mail store (although in honesty it was never designed for that purpose). I proceeded to delete all the mail in gmail to help them free up some disk space.
I guess over-all the writing is on the wall. We all know that at some point 'google account' will mean 'google+', and blogger may be retired at any time.
So it seems my on-going-but-totally-lax search for alternatives to 'everything google for convenience' just got another big kick up the rump-side.
As my projects are all pretty small and low-volume I might look at a local solution because every network based solution faces the same problem. I have a couple of beagleboards doing nothing although getting a running and secure-enough system might be more pain than it's worth.
It's a bit of a pain to have to deal with.
Simple LBP Object Detector
After mucking about getting nowhere with a simple local binary pattern (LBP) object detector algorithm I finally had a bit of a breakthrough. Rather than getting dozens of false positives, i'm finally getting a concise enough answer to be useful for further processing.
Here is an example based on a photo I found on the net with a few faces in it (it's from here - i only found this using an image search, that page is just where i found it but I haven't otherwise read it). The yellow box is their result, the white boxes are mine (i'm not quite centring/scaling it properly yet).
The only tuning is a threshold and the grouping limit.
I'm actually quite surprised some of those faces are even found because this is one of the first tests i've done on more than a couple of images - and it has trouble with Lenna for example (but i suspect that is some aliasing issues with downsampling the larger countenance). I originally started with the eye set (from the OpenCV eye cascade), but was having trouble with false positives - e.g. eyebrows, mouth, dark spots. It seems to work much better with the face data. But most eye detectors on their own seem a bit noisy anyway so perhaps I was expecting too much.
I gave up on the cascade idea and this simply tests every position using the LBP u2 8,1 code (only 59 values) against a binary lookup table, and then each position votes on the outcome - once I get enough positive votes, it's considered a hit. This is similar to the LBP feature test in the OpenCV LBP detector, except that one uses the full 8 bits of LBP code, and of course the code is calculated on regions, not pixels.
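As a rough illustration of the per-pixel code described above, here is a minimal sketch of the LBP 8,1 u2 mapping and the voting step in plain Java. The class and method names are made up for this example, and the per-position `allowed` table is assumed to come out of training; the real detector may lay things out quite differently.

```java
// Sketch only: LbpU2Sketch, lbp() and votes() are hypothetical names, not code from the post.
class LbpU2Sketch {
    // Maps each raw 8-bit LBP code to one of 59 bins:
    // 58 "uniform" patterns (at most 2 bit transitions) plus 1 shared bin for the rest.
    static final int[] U2 = buildU2Table();

    static int[] buildU2Table() {
        int[] table = new int[256];
        int next = 0;
        for (int code = 0; code < 256; code++) {
            // Rotate by one bit and XOR to count 0/1 transitions around the circle.
            int rotated = ((code << 1) | (code >>> 7)) & 0xff;
            int transitions = Integer.bitCount(code ^ rotated);
            table[code] = (transitions <= 2) ? next++ : 58;
        }
        return table;
    }

    // u2 code of pixel (x, y); the caller must keep a 1 pixel border around it.
    static int lbp(int[] grey, int width, int x, int y) {
        int centre = grey[y * width + x];
        // The 8 neighbours at radius 1, visited in circular order.
        int[] dx = {-1, 0, 1, 1, 1, 0, -1, -1};
        int[] dy = {-1, -1, -1, 0, 1, 1, 1, 0};
        int code = 0;
        for (int i = 0; i < 8; i++) {
            if (grey[(y + dy[i]) * width + (x + dx[i])] >= centre) {
                code |= 1 << i;
            }
        }
        return U2[code];
    }

    // Each window position votes if its code is marked as face-like in the
    // (assumed, trained) lookup table; the caller compares the count to a threshold.
    static int votes(int[] codes, boolean[][] allowed) {
        int count = 0;
        for (int i = 0; i < codes.length; i++) {
            if (allowed[i][codes[i]]) {
                count++;
            }
        }
        return count;
    }
}
```

The u2 table is what gets the code down from 256 to 59 values: only patterns with at most two transitions around the circle keep a bin of their own.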
I am only using the face and non-face images from the CBCL face dataset available here, which isn't a particularly good quality set of images. The only pre-processing i'm doing is mirroring the faces to double the training set. Training is very fast - after the images are loaded and converted to LBP, it's only taking 0.038s on my machine to `train' the 4858 positive and 4548 negative images (very plain single-threaded Java).
On the CPU the lookup isn't particularly fast (0.4s for the test image above) but I will look at porting it to OpenCL - it should be a very good fit for a GPU. If weighting isn't required, the feature description itself can be made very compact as it only requires 2 integers (64 bits) for each x,y location in the pattern - i.e. under 2.5KB for a 17x17 test pattern (e.g. 19x19 training set, as the LBP requires a 1 pixel border) which can easily fit in the constant cache.
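To make the size claim concrete, here is a small sketch of the unweighted table packed as one 64-bit mask per window position (a Java `long` standing in for the 2 integers). The names are hypothetical and the `allowed` table is assumed to already exist.

```java
// Sketch only: one 64-bit mask per position, bit k set means u2 code k counts as a positive vote.
class LbpMaskSketch {
    static long[] pack(boolean[][] allowed) {
        long[] masks = new long[allowed.length];
        for (int i = 0; i < allowed.length; i++) {
            for (int code = 0; code < allowed[i].length; code++) {
                if (allowed[i][code]) {
                    masks[i] |= 1L << code; // 59 codes fit comfortably in 64 bits
                }
            }
        }
        return masks;
    }

    // The lookup becomes a shift and a mask per position.
    static int votes(int[] codes, long[] masks) {
        int count = 0;
        for (int i = 0; i < codes.length; i++) {
            count += (int) ((masks[i] >>> codes[i]) & 1L);
        }
        return count;
    }

    // Table size in bytes for a w x h test pattern.
    static int tableBytes(int w, int h) {
        return w * h * Long.BYTES;
    }
}
```

For the 17x17 pattern that is 17 x 17 x 8 = 2312 bytes.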
There are still some tuning issues such as that a given threshold doesn't work equally well on all images, but it is still a promising result and there are still plenty of ideas to try.
Update (ok not really an update i hadn't published this yet ...) ... I coded something up in OpenCL and the performance is really very good - kernel time for the scaling (using a mip-map like thing), lbp building, and running the detector is around 2ms (same scales as the cpu example above). But this time doesn't include peak detection, thresholding and grouping. Still, this is pretty favourable compared to the VJ cascade as that does far fewer probes in its 10ms runtime (and takes a week to train - if you can get that to work). Here i'm doing over 200 000 17x17 probes through 5 scales ...
I also played with a more statistically valid accumulation mechanism (as each test is independent): multiplication rather than addition (statistics isn't my strength by any stretch ... sigh). This leads to much more specific peaks as can be seen by the following picture, although i'm not sure if it leads to a more consistent threshold value (I think it does, and if that's true, I probably don't even need to do peak detection ...).
Both images are normalised.
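A toy comparison of the two accumulation schemes, as a sketch only. The post doesn't say exactly what gets multiplied, so this assumes each position contributes a score in (0,1] and all names are invented for the example.

```java
// Sketch only: compares averaging per-position scores with multiplying them.
class AccumulateSketch {
    // Additive accumulation, normalised to the number of positions.
    static double mean(double[] scores) {
        double sum = 0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    // Multiplicative accumulation, treating the tests as independent.
    static double product(double[] scores) {
        double p = 1;
        for (double s : scores) {
            p *= s;
        }
        return p;
    }

    // Same thing in log space, which avoids underflow over hundreds of positions.
    static double logProduct(double[] scores) {
        double sum = 0;
        for (double s : scores) {
            sum += Math.log(s);
        }
        return sum;
    }
}
```

With four scores of 0.9 the mean is 0.9 and the product about 0.656; drop one score to 0.1 and the mean only falls to 0.7 while the product falls to about 0.073. One bad position hurts the product far more, which would be consistent with the sharper peaks.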
Update 2: Had a bit more of a play this morning, tried a couple of different kernel topologies and using LDS to reduce memory bandwidth requirements. Got the kernel time of my test case down to 1.3ms, vs the 2.1ms yesterday (on a Radeon HD 7970). I also found the new combination metric is working well - I can use a specific value as the threshold and remove the peak detection stage entirely. It doesn't work too well with the eye data (way too many false positives), but it's pretty good with faces, so far.
Viola & Jones Revisited
So after the last post it got me thinking about just how I did implement the viola-jones haar cascade in socles.
The code runs in a loop with no communication from the CPU to the GPU, and it only runs about 10-15 iterations anyway (depending on the settings): thus the loop overheads are fairly small. But it still does require a 'scale features' step, which is useful on a CPU to avoid excessive calculations but isn't so important on a GPU.
So I tried a slightly different approach - that is, to perform the scaling inside the detector kernel, which allows each kernel to then work on different scales. i.e. to do all scales in one step.
My first attempt at this wasn't much faster - but that's because I was invoking the kernel for too many probes. So then I tried changing the way it works: each work-group still works together on a single feature test stage. But instead of calculating its location and scale from the 2d work coordinates I create a 4 element descriptor with some of the information required and it just uses that. This gives me a bit more flexibility in the work assignment, e.g. I can utilise persistent work-groups and tune the work size to fit the hardware more directly. It requires less temporary memory since the features are scaled in-situ.
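For illustration, the host side of such a scheme might build one flat list of probe descriptors covering every scale. This is a sketch under assumptions: the 4-element layout used here ({x, y, window size, scale index}) is a guess for the example, as are the names and parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds one job descriptor per probe so a single kernel launch can cover all scales.
class ProbeJobsSketch {
    // Hypothetical descriptor layout: {x, y, window size in pixels, scale index}.
    static List<int[]> buildJobs(int width, int height, int baseWindow,
                                 double scaleStep, int scales, int stride) {
        List<int[]> jobs = new ArrayList<>();
        double scale = 1.0;
        for (int s = 0; s < scales; s++, scale *= scaleStep) {
            int window = (int) Math.round(baseWindow * scale);
            if (window > width || window > height) {
                break; // the window no longer fits in the image
            }
            for (int y = 0; y + window <= height; y += stride) {
                for (int x = 0; x + window <= width; x += stride) {
                    jobs.add(new int[] {x, y, window, s});
                }
            }
        }
        return jobs;
    }
}
```

Because each job carries its own position and scale, the work size handed to the device no longer has to match the image geometry, which is what leaves room for persistent work-groups.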
This change was definitely worth it, for a given test on the webcamfx code, I got the face detection down to around 13ms total time, vs 19ms - about 4ms overhead is fixed. A stand-alone test of Lenna registers about 8ms vs 19ms, so over 100% improvement.
Comparisons with other hardware are difficult - mostly because it depends a great deal on the subject matter and the settings and i haven't kept track of those - but I was pretty disappointed with the AMD performance up until now and I think this gets it on par with the nvidia hardware at least. Although really the 7970 should do measurably better ...
My guess is that the performance gained is mostly because with the greater amount of work done, it can more efficiently fit the total problem onto the hardware. There is usually a small amount of 'modulus' where a given problem won't fill all hardware units leaving some idle, and in this newer version it only happens once rather than 10-15 times. Actually I did some more timing (and updated the numbers above), and 100% seems too much for this. Maybe? Oh I also changed the parallel sum mechanism - but I changed it in both implementations and it made no difference anyway. I changed the region description to a float array too, although that only affects the scaling function in the first instance.
If I run this on a CPU the performance is very poor - around 1.5s for this test case. If I go back to a test CPU version I wrote it's a more reasonable 240ms so i'm still getting a good 30x speedup over an Intel i7 X 980. Given I was getting 90ms before with the cpu driver and the nvidia test case i'm not really sure what's going on there.
I haven't checked the code in yet as it's a bit hacked up.
Update: I checked some stuff in, although left both implementations intact for now.
Update 2: So I did some further analysis of the cascades I have: it turns out the way i'm splitting the work is very wasteful of GPU resources. I'm using at least 64 work items per stage - using one work item per feature. But the earlier stages have only a small number of features to test - and the vast majority of probes don't go past the first few stages. e.g. the default cascade only has 9 tests. I tried a few variations to address this but the overheads of multiple kernel calls and the global communication required outweighed any better utilisation.
Update 3: So curiosity kept me poking. First I realised that using fixed scheduling for persistent kernels might not be ideal. So I use an atomic to dole out work in a first-come-first-served consumer way. Made a tiny difference.
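The same idea on the CPU, as a minimal sketch: Java threads stand in for persistent work-groups and an `AtomicInteger` stands in for the global atomic counter. The names are invented for the example and the "work" is just a counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: workers pull the next job index from a shared atomic counter
// instead of being handed a fixed slice of the job list up front.
class AtomicQueueSketch {
    // Returns how many jobs each worker ended up taking.
    static int[] run(int jobCount, int workers) {
        AtomicInteger next = new AtomicInteger();
        int[] perWorker = new int[workers];
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                // Whoever asks first gets the next job.
                while (next.getAndIncrement() < jobCount) {
                    perWorker[id]++; // the real probe would run here
                }
            });
            threads[w].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return perWorker;
    }
}
```

The per-worker counts vary from run to run but always add up to `jobCount`; a slow worker simply takes fewer jobs rather than holding up the others.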
Then I thought I would try to see if using fewer work-items per feature stage would help. In this case I use 4x16 or 2x32 thread groups to work on 4 or 2 tests concurrently - with all the necessary (messy) logic to ensure all barriers are hit by all threads, etc. This was measurable - the lenna test case I have is now down to around 7ms (unfortunately when using sprofile the algorithm fails for some unknown reason - so this is now time measured with System.nanoTime()).
One big thing left to try is to see if localising the wide work queue would help. e.g. rather than call multiple kernels for each stage and having each work-item busy working on a sub-set of problems, do it within the kernel. e.g. if the stage count is 9, 12, ... do stage 1 with 7 concurrent jobs, if any pass then add them to a local queue. Then do stage 2 with (64/12) = 5 concurrent jobs, if any pass add them to a local queue. etc. Once you get to a stage longer than 32 items, just use 64 threads for all the rest. This way I get good utilisation with small stages as well as with large stages. I'm not sure whether this will be worth all the hassle, and the extra addressing mathematics required (and it's already using a lot of registers); but as i'm really curious to know if it would help I might attempt it.
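The arithmetic for how many jobs fit side by side is simple enough to sketch. The names are hypothetical, and the clamp to one job is this example's way of handling stages that are too long to share the work-group.

```java
// Sketch only: how many probes a work-group can process side by side for a given stage.
class StageTopologySketch {
    static int concurrentJobs(int featuresInStage, int workGroupSize) {
        // A stage longer than half the group gets the whole group to itself.
        return Math.max(1, workGroupSize / featuresInStage);
    }

    // Work-items left with nothing to do for this stage.
    static int idleItems(int featuresInStage, int workGroupSize) {
        int used = concurrentJobs(featuresInStage, workGroupSize) * featuresInStage;
        return Math.max(0, workGroupSize - used);
    }
}
```

For a 64 wide work-group, a 9 feature stage gives 7 concurrent jobs with 1 work-item idle, and a 12 feature stage gives 5 jobs with 4 idle.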
Given that I now use a work queue, another possibility open is to re-arrange the jobs to see if any locality of reference can be exploited. Given the huge memory load this might help: although the image cache is so small it might not.
Update 4: Curiosity got the better of me, it's been crappy cold weather and I hurt my foot (i don't know how) so I had another look at the complex version this morning ...
Cut a long story short, too many overheads, and although it isn't slow it isn't faster than just using 16 or 32 threads per feature test. Too many dynamic calculations which cannot be optimised out, and so on. It's around 9.5ms on my test case.
Structurally it's quite interesting however ....
1. Find out how many concurrent tests can be executed for stage 0, dequeue that many jobs from the work queue and copy them to a local work queue.
2. If we exceeded the job length, stop.
3. Work out how many jobs can be done for the current stage
4. Process one batch of jobs.
5. Parallel sum the stage sum.
6. If it advances to the next stage, copy to a next-stage queue.
7. Go back to 4 unless finished the in queue.
8. If any are in the next-stage queue, copy them over, advance to the next stage.
9. Go back to 3 if we had any work remaining.
10. If we ran the full stage count, copy any work jobs remaining in the queue to the result list.
11. Go back to 1.
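The control flow of the steps above can be sketched as a sequential simulation in Java. This only mirrors the queue handling: the `passes` predicate is a stand-in for the parallel stage sum and threshold test, and all names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Sketch only: one work-group's lock-step walk through the cascade, simulated on the CPU.
class LockStepCascadeSketch {
    // stageSizes[s] is the number of feature tests in stage s.
    // passes.test(job, stage) stands in for "sum the stage and compare to its threshold".
    static List<Integer> run(int jobCount, int[] stageSizes, int groupSize,
                             BiPredicate<Integer, Integer> passes) {
        List<Integer> hits = new ArrayList<>();
        int nextJob = 0;
        while (nextJob < jobCount) { // stop once the job list is exhausted
            // Dequeue as many jobs as stage 0 can run concurrently into a local queue.
            int firstBatch = Math.max(1, groupSize / stageSizes[0]);
            ArrayDeque<Integer> in = new ArrayDeque<>();
            for (int i = 0; i < firstBatch && nextJob < jobCount; i++) {
                in.add(nextJob++);
            }
            int stage = 0;
            while (!in.isEmpty() && stage < stageSizes.length) {
                // How many jobs fit side by side in the current stage.
                int batch = Math.max(1, groupSize / stageSizes[stage]);
                ArrayDeque<Integer> nextStage = new ArrayDeque<>();
                while (!in.isEmpty()) {
                    // Process one batch; survivors go to the next-stage queue.
                    for (int i = 0; i < batch && !in.isEmpty(); i++) {
                        int job = in.poll();
                        if (passes.test(job, stage)) {
                            nextStage.add(job);
                        }
                    }
                }
                // Copy the survivors over and advance the whole group one stage.
                in = nextStage;
                stage++;
            }
            // Anything still queued after the last stage is a detection.
            hits.addAll(in);
        }
        return hits;
    }
}
```

On a GPU every one of those loop boundaries needs a barrier that all work-items reach, which is where the messy logic comes from.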
So each stage is fully processed in lock-step, and then advanced. The DEFAULT cascade starts with 9 feature tests, so it never has more than 7 items in the queue (7 feature tests of 9 elements = 63 work items). As the stages get deeper the number of work-items assigned to solve the problem widens, up to a limit of 64 - i.e. the work item topology dynamically alters as it runs through the stages in an attempt to keep most work-items busy.
There's a lot of messy logic used to make sure every thread in the workgroup executes every barrier, and there are lots of barriers to make sure everything works properly (i'm using locals a lot to communicate global info like the stage and topology information). So the code runs on a CPU (i.e. I got the barriers correct), although very inefficiently.
As is often the case with GPUs, the simpler version works better even if on paper it is less efficient at filling the ALU slots. Although I haven't confirmed this is the case mathematically: apart from stage 0, the more complex method will also have un-even slot fillage - it's one of those discrete maths/Knuth style problems I simply give up on.
AMD Fusion Summit, HSA, etc.
Been looking forward to watching the AMD Fusion summit this year after watching a bunch of very interesting videos last year. I knew they were coming up but this month has gone faster than I thought ...
So far i've just watched the 'programmer' keynote from Phil - it's a pity about the emphasis on C++ which is such a shit language - but what are you gonna do eh? His talk on the viola-jones haar cascade algorithm was interesting, how HSA could be used to split up algorithms to move the problem to where it is most efficiently solved (not sure how it compares to face-detector in socles, as I solved the problem of idle work-items in a different way). But yeah, looking forward to that capability in the future; during my last visit to OpenCL in the last month or so I kept thinking that being able to run stuff on the CPU where it made sense would ... make sense.
I slightly disagree that the problem with the GPU parallel programming is just that it is too hard to write - all good software is hard to write - I think it more has to do with the availability of the platform. e.g. PS3 is hard to write for too, but there seems to be plenty of that now because everyone's writing to the same platform. If I was a commercial developer writing software, right now it's only going to be a niche (photoshop is a niche). This is ok - because niche customers are probably already using capable hardware or don't mind buying it - but for mass market adoption it requires mass market availability of stable, quality, compatible platforms. This is still some way off.
The videos are on the summit broadcast site which requires a freely available login.
Update: Blah, ahh well, mostly a bit dull & sparse this year, or maybe they just weren't all put up on the net. The HSA stuff is the most interesting again from a software perspective.
Update 2: Apparently more content will be added over time, I guess last year I didn't spot it for a few months so had a lot more to look at off the bat.
On more reflection the HSA foundation and the HSAIL stuff is pretty big news. People don't seem to understand why it's so important though. It's really about the H in HSA - heterogeneous. Being able to support many CPUs with the same code and even the same compiler. Being able to target the code at run-time to execute on the most efficient hardware available in the current system. And being able to do that in a practical way that isn't tied to some vendor-specific secret sauce using broken proprietary compilers. At the bottom of it, it's just another attempt at 'write once, run everywhere' technology, but this time for computationally intensive processing and not for desktop user applications. I guess time will tell to see how it goes without nvidia and intel though. And the same as to whether this finally allows free software to take part.
The other part of it is coming up with a set of re-usable libraries so that the performance is opened up to non-gun-hackers (or in their terms, non-'ninja'-programmers), although TBH I don't see that is any different to any other modern hierarchical programming environment full of frame-works and tool-kits. This can already be done with OpenCL anyway, but I suppose there is still messy crap to deal with from the idiot-programmer's perspective. e.g. separate memory spaces, device-host copy overheads and so on. HSA with code transparently intermingling with plain old host code means the same could be done without the overheads and make it more attractive.
I still think the biggest hurdle for application developers is platform support. Any extra work has to be justifiable if it is only going to benefit a part of your customer base.
Update: I never got around to seeing the actual talks at the time but I just found that Stream Computing have a nice index of all the OpenCL specific talks. I'm not a regular reader of their blog but every now and then I do a search in which it turns up and I do a catch up ...
Idle minds ...
So it turns out I have a bit of a break between contracts again - i'm always happy to have extra time off, so there's nothing to complain about there!
I sat down on the weekend and yesterday to play with some socles code, but so far it's been really slow going. I just don't feel like getting too much into it and it's easier just to put it down if i hit a problem; I guess I really do need a bit of a break. I also have tons of crap to do in the back yard, shed, and even around the house as well; but i've been a bit lazy on that front the last year or so, as such I doubt much will happen there.
But yeah, I guess eventually over this break I will get the opencl ransac stuff sorted out in socles, and probably then re-visit jjmpeg to at least check in the code I've already done on the android stuff.
I tried the 12.6 beta catalyst driver yesterday - and thankfully it seems a lot better than 12.4, so far, touch wood, etc. At least it doesn't keep throwing up CL_OUT_OF_HOST_MEMORY errors after a half a dozen code runs, AND the xinerama twin-screen desktop is back up to decent performance. So after finally getting a GCN GPU I would like to have a play with that and see what I can get out of it. I should probably try and come up with a specific application I want to try to implement and work toward it as well, rather than just poking in random algorithms to socles. The thing is, computers mostly just do what I need them to do (run emacs and a terminal in overlapping windows?), so i'm not particularly driven at this point.
I'm keeping an eye on the ARM stuff; the rhombus tech guys, the open pandora (who knows if i'll ever get the one i ordered - at least an email confirming the order and address once a year would be nice), but with a bunch of beagleboards sitting idle already it doesn't seem much point in me getting another dev board to poke at. Just not enough hours to look at everything that is interesting ...
On the RANSAC code, I pretty much have it done - it's just that messy testing to go. In this version I tried to do most work in the one kernel - I will see if that added complexity makes it slower, or the lower memory demands help it overall. I also tried to parallelise absolutely everything, from coordinate normalisation/result denormalisation to matrix setup. So far i'm getting a strange result in that just the SVD is somewhat slower than just the SVD I had before: although for all intents they are the same design. Once I have it going I will try double arithmetic to see if that generates better results.
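For reference, the coordinate normalisation step might look something like the following sketch. The post doesn't say which normalisation is used; this assumes the common Hartley-style one (centroid moved to the origin, mean distance scaled to sqrt(2)), and the names are made up.

```java
// Sketch only: Hartley-style point normalisation, assumed rather than taken from the post.
class NormaliseSketch {
    // Returns {scale, cx, cy} so that x' = scale * (x - cx) and y' = scale * (y - cy).
    // Degenerate input (all points identical) would divide by zero and is not handled.
    static double[] transform(double[][] pts) {
        double cx = 0, cy = 0;
        for (double[] p : pts) {
            cx += p[0];
            cy += p[1];
        }
        cx /= pts.length;
        cy /= pts.length;
        double meanDist = 0;
        for (double[] p : pts) {
            meanDist += Math.hypot(p[0] - cx, p[1] - cy);
        }
        meanDist /= pts.length;
        return new double[] {Math.sqrt(2) / meanDist, cx, cy};
    }

    static double[][] apply(double[][] pts, double[] t) {
        double[][] out = new double[pts.length][2];
        for (int i = 0; i < pts.length; i++) {
            out[i][0] = t[0] * (pts[i][0] - t[1]);
            out[i][1] = t[0] * (pts[i][1] - t[2]);
        }
        return out;
    }
}
```

Each point is independent once the centroid and scale are known, so the apply step is trivially parallel; the two sums are reductions.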
Well, after a hair-pulling week (well I need a haircut, and i'm having a break for a few weeks next week) I'm finally getting somewhere with the HD7970. Just not having it crash the machine on me every test run makes for a much better day.
Some of that time was spent trying to track down crashes inside the clsurf code ... but they were all because I didn't notice that it needed images rounded up to 16 pixels wide ... sigh. Oops. Most of the rest was some barrier issues with my new code - it's been a while and I forgot some of the finer points. Getting it working on a CPU driver was a good help there because if you get the barriers wrong you just get nonsense results.
There was also a lot of time wasted rebooting - not only because of the code that crashes the driver, but because it still decides to start returning CL_OUT_OF_HOST_MEMORY all of a sudden. And I didn't realise till last night I can just log out of/back into X to fix this until it happens again. And time wasted verifying my drivers were ok too - which probably was wasted (and now i have a broken dependency map and catalyst libraries splatted over lib64 to boot). And finally I think I found a bug in the AMD driver as well, it's getting a divide-by-zero signal (which causes the jvm to abort!) when using a local worksize < 64 - this isn't something I normally do, but the occasional algorithm benefits from it. It's not too difficult to work around at least.
I finally have some RANSAC code working on the new card. And it's a screamer.
I'm getting around 2-3x total performance boost compared to the HD6950 for one run of the RANSAC code. Although I can up the number of RANSAC random probes by 4x and still run about 2x faster (this was not the case with the 6950, 2x probes meant 2x time taken) (so it's about 8x faster then). I thought i'd make a plot of the scalability to see how it does.
The stuff below 40 is pretty much 1.0ms, the ups and down are just sampling noise.
Here 'work-groups per compute unit' means that enough jobs were queued to give each compute unit that many work-groups (wave-fronts). The 7970 has 32 compute units and each work-group does 7 matrices concurrently, so 40 on the X axis equates to 8960 RANSAC probes. i.e. solving 8960 9x9 matrices using SVD, and forming the homographic matrix with a couple of 3x3 matrix multiplies on the result, takes about 1ms.
So, anything under 9000 checks is wasting resources on this machine.
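The probe count is just a product of three numbers from the text, shown here as a tiny sketch with made-up names.

```java
// Sketch only: probes in flight for a given number of work-groups per compute unit.
class ProbeCountSketch {
    static int probes(int computeUnits, int matricesPerWorkGroup, int workGroupsPerUnit) {
        return computeUnits * matricesPerWorkGroup * workGroupsPerUnit;
    }
}
```

`ProbeCountSketch.probes(32, 7, 40)` gives 8960, the point below which the run time stays flat at about 1ms.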
So whilst writing this post and after doing all the timing i revisited a tiny part of the algorithm - the heaviest bit of the SVD is the error calculation which involves 3 sums of products across all 8 rows. For the HD 6950 I got a 2x speedup by using a simple loop vs a parallel sum - calculate the products in parallel but sum them in series directly in registers, but only in 1 thread of 9. I just noticed the ALU usage was a bit low on the 7970, and I turned back on the parallel sum. Well what do you know, ALU instruction count dropped from 9500 to 5900 and reduced the biggest case above from 2.3ms to 1.7ms (which is closer to a linear scaling anyway).
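The two summing strategies, as a CPU-side sketch with invented names. The inner loop of the tree version is what separate work-items would each do one iteration of, with a barrier between rounds; it assumes a power-of-two length, which the 8 rows satisfy.

```java
// Sketch only: serial sum in one "thread" versus a tree-style parallel sum.
class ParallelSumSketch {
    // One work-item adds everything up directly in registers.
    static float serialSum(float[] values) {
        float sum = 0;
        for (float v : values) {
            sum += v;
        }
        return sum;
    }

    // Tree reduction: log2(n) rounds, halving the active work-items each round.
    // Assumes values.length is a power of two.
    static float treeSum(float[] values) {
        float[] local = values.clone(); // stands in for local memory
        for (int stride = local.length / 2; stride > 0; stride /= 2) {
            for (int i = 0; i < stride; i++) { // each i would be its own work-item
                local[i] += local[i + stride];
            }
            // a work-group barrier would go here on the GPU
        }
        return local[0];
    }
}
```

For 8 values the tree takes 3 rounds instead of 7 dependent adds, at the cost of the barriers and local memory traffic; which one wins clearly depends on the hardware.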
Sigh, now to debug some older and far more complex code that is not working 100%.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved. Powered by gcc & me!