Modern computers tend to have multiple processors. By making use of
               	    all the processing ability of your machine your programs can run
               	    much faster. Writing code that takes advantage of multiple
               	    processors in Java usually involves either manually creating and
               	    managing threads, or using a higher level concurrent programming
               	    abstraction library, such as the classes found in the excellent
               	    java.util.concurrent package.
               	  
            
               	    A common use-case for multithreading in multimedia analysis is the
               	    application of an operation to a collection of objects - for
               	    example, the extraction of features from a list of images. This kind
               	    of task can be effectively parallelised using Java’s concurrent
               	    classes, but requires a large amount of boiler-plate code to be
               	    written each time. To help reduce the programming overhead
               	    associated with this kind of parallel processing, OpenIMAJ includes
               	    a Parallel class that contains a number of
               	    methods that allow you to efficiently and effectively write
               	    multi-threaded loops.
               	  
            
               	    To get started playing with the Parallel class,
               	    create a new OpenIMAJ project using the archetype, or add a new
               	    class and main method to an existing project. Firstly, lets see how
               	    we can write the parallel equivalent of a
               	    for (int i=0; i<10; i++) loop:
               	  
            
Parallel.forIndex(0, 10, 1, new Operation<Integer>() {
	public void perform(Integer i) {
	    System.out.println(i);
	}
});
            	  Try running this code; you’ll see that all the numbers from 0 to 9 are printed, although not necessarily in the correct order. If you run the code again, you’ll probably see the order change. It’s important to note that the when parallelising a loop that the order of operations is not deterministic.
Now let’s explore a more realistic scenario in which we might want to apply parallelisation. We’ll build a program to compute the normalised average of the images in a dataset. Firstly, let’s write the non-parallel version of the code. We’ll start by loading a dataset of images; in this case we’ll use the CalTech 101 dataset we used in the Classification 101 tutorial, but rather than loading record object, we’ll load the images directly:
VFSGroupDataset<MBFImage> allImages = Caltech101.getImages(ImageUtilities.MBFIMAGE_READER);
We’ll also restrict ourselves to using a subset of the first 8 groups (image categories) in the dataset:
GroupedDataset<String, ListDataset<MBFImage>, MBFImage> images = GroupSampler.sample(allImages, 8, false);
We now want to do the processing. For each group we want to build the average image. We do this by looping through the images in the group, resampling and normalising each image before drawing it in the centre of a white image, and then adding the result to an accumulator. At the end of the loop we divide the accumulated image by the number of samples used to create it. The code to perform these operations would look like this:
List<MBFImage> output = new ArrayList<MBFImage>();
ResizeProcessor resize = new ResizeProcessor(200);
for (ListDataset<MBFImage> clzImages : images.values()) {
    MBFImage current = new MBFImage(200, 200, ColourSpace.RGB);
    for (MBFImage i : clzImages) {
        MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);
        tmp.fill(RGBColour.WHITE);
        MBFImage small = i.process(resize).normalise();
        int x = (200 - small.getWidth()) / 2;
        int y = (200 - small.getHeight()) / 2;
        tmp.drawImage(small, x, y);
        current.addInplace(tmp);
    }
    current.divideInplace((float) clzImages.size());
    output.add(current);
}
            	  
               	    We can use the DisplayUtilities class to display
               	    the results:
               	  
            
DisplayUtilities.display("Images", output);
            	  
               	    Before we try running the program, we’ll also add some timing code
               	    to see how long it takes. Before the outer loop (the one over the
               	    groups provided by images.values()), add the
               	    following:
               	  
            
Timer t1 = Timer.timer();
and, after the outer loop (just before the display code) add the following:
System.out.println("Time: " + t1.duration() + "ms");
            	  You can now run the code, and after a short while (on my laptop it takes about 7248 milliseconds (7.2 seconds)) the resultant averaged images will be displayed. An example is shown below. Can you tell what object is depicted by each average image? For many object types in the CalTech 101 dataset it is quite easy, and this is one of the reasons that the dataset has been criticised as being too easy for classification experiments in the literature.
|  | 
               	    Now we’ll look at parallelising this code. We essentially have three
               	    options for parallelisation; we could parallelise the outer loop,
               	    parallelise the inner one, or parallelise both. There are many
               	    tradeoffs that need to be considered including the amount of memory
               	    usage and the task granularity in deciding how to best parallelise
               	    code. For the purposes of this tutorial, we’ll work with the inner
               	    loop. Using the Parallel.for method, we can
               	    re-write the inner-loop as follows:
               	  
            
Parallel.forEach(clzImages, new Operation<MBFImage>() {
    public void perform(MBFImage i) {
        final MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);
        tmp.fill(RGBColour.WHITE);
        final MBFImage small = i.process(resize).normalise();
        final int x = (200 - small.getWidth()) / 2;
        final int y = (200 - small.getHeight()) / 2;
        tmp.drawImage(small, x, y);
        synchronized (current) {
            current.addInplace(tmp);
        }
    }
});
            	  
               	    For this to compile, you’ll also need to make the
               	    current image final by adding
               	    the final keyword to the line in which it is
               	    created:
               	  
            
final MBFImage current = new MBFImage(200, 200, ColourSpace.RGB);
               	    Notice that in the parallel version of the loop we have to put a
               	    synchronized block around the part where we
               	    accumulate into the current image. This is to
               	    stop multiple threads trying to alter the image concurrently. If we
               	    now run the code, we should hopefully see an improvement in the time
               	    it takes to compute the averages. You might also need to increase
               	    the amount of memory available to Java for the program to run
               	    successfully.
               	  
            
               	    On my laptop, with 8 CPU cores the running time drops to ~3100
               	    milliseconds. You might be thinking that because we have gone from 1
               	    to 8 CPU cores that the speed-up would be greater; there are many
               	    reasons why that is not the case in practice, but the biggest is
               	    that the process that we’re running is rather I/O bound because the
               	    underlying dataset classes we’re using retrieve the images from disk
               	    each time they’re needed. A second issue is that there are a couple
               	    of slight bottlenecks in our code; in particular notice that we’re
               	    creating a temporary image for each image that we process, and that
               	    we also have to synchronise on the current
               	    accumulator image for each image. We can factor out these problems
               	    by modifying the code to use the partitioned
               	    variant of the for-each loop in the Parallel
               	    class. Instead of giving each thread a single image at a time, the
               	    partitioned variant will feed each thread a collection of images
               	    (provided as an Iterator) to process:
               	  
            
Parallel.forEachPartitioned(new RangePartitioner<MBFImage>(clzImages), new Operation<Iterator<MBFImage>>() {
	public void perform(Iterator<MBFImage> it) {
	    MBFImage tmpAccum = new MBFImage(200, 200, 3);
	    MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);
	    while (it.hasNext()) {
	        final MBFImage i = im.next();
	        tmp.fill(RGBColour.WHITE);
	        final MBFImage small = i.process(resize).normalise();
	        final int x = (200 - small.getWidth()) / 2;
	        final int y = (200 - small.getHeight()) / 2;
	        tmp.drawImage(small, x, y);
	        tmpAccum.addInplace(tmp);
	    }
	    synchronized (current) {
	        current.addInplace(tmpAccum);
	    }
	}
});
            	  
               	    The RangePartitioner in the above code will break
               	    the images in clzImages into as many
               	    (approximately equally sized) chunks as there are available CPU
               	    cores. This means that the perform method will be
               	    called many fewer times, but will do more work - what we’ve done is
               	    called increasing the task granularity. Notice that we created an
               	    extra working image called tmpAccum to hold the
               	    intermediary results. This means memory usage will be increased,
               	    however, also notice that whilst we still have to synchronise on the
               	    current image, we do far fewer times (once per CPU core in fact).
               	    Try running the improved version of the code; on my laptop it
               	    reduces the total running time even further to ~2900 milliseconds.
               	  
            
As we discussed earlier in the tutorial, there were three primary ways in which we could have approached the parallelisation of the image-averaging program. Instead of parallelising the inner loop, can you modify the code to parallelise the outer loop instead? Does this make the code faster? What are the pros and cons of doing this?