Hyperion Cluster: Using MATLAB on the Cluster

Contents

Overview
Submitting a Parallel Job
Submitting a Distributed Job

Overview

SNS has a license for the MATLAB Parallel Computing Toolbox (PCT) so that you may take advantage of parallel computing to speed up your computations. To allow jobs to run on the Hyperion Cluster, we also have a license for the MATLAB Distributed Computing Server for up to 16 workers.

For more information on the MATLAB Parallel Computing Toolbox, please see the Parallel Computing Toolbox documentation on the MathWorks website.

How to write parallel MATLAB programs is beyond the scope of this document. This document only covers how to submit parallel or distributed jobs to the Hyperion Cluster.

MathWorks describes a distributed job as

one whose tasks do not directly communicate with each other. The tasks do not need to run simultaneously, and a worker might run several tasks of the same job in succession. Typically, all tasks perform the same or similar functions on different data sets in an embarrassingly parallel configuration.

and a parallel job as

A parallel job consists of only a single task that runs simultaneously on several workers, usually with different data. More specifically, the task is duplicated on each worker, so each worker can perform the task on a different set of data, or on a particular segment of a large data set. The workers can communicate with each other as each executes its task. In this configuration, workers are referred to as labs.

To submit a MATLAB job to the cluster, you do not use SGE's qsub command to submit a job script. Instead, you write a MATLAB function that submits the job for you. In the next section, I'll describe how to do this for a parallel job.

Submitting a Parallel Job


Here's a simple MATLAB function that uses MATLAB's parfor function to execute a loop in parallel, and returns the elapsed time it took to run the loop. We'll be using this code as an example.

function [elapsedTime] = test_parfor()

N = 15000;
A = zeros(N,1);
tic;
parfor i=1:N
    E = eig(rand(100))+i;
    A(i) = E(1);
end
elapsedTime = toc;
disp(elapsedTime);
end

If you'd like to follow along with this example, copy the above code and save it into a text file named 'test_parfor.m'.

The MATLAB code below can be used to submit the above function to the cluster as a parallel job.

% Example MATLAB submission script for running a parallel job on the 
% SNS Hyperion Cluster
% Mar-2014 Lee Colbert, lcolbert@ias.edu

% Modify these lines to suit your job requirements.
cluster = parallel.cluster.Generic('JobStorageLocation', '/home/lcolbert/matlab');
h_rt = '00:15:00';
exclusive = 'false';

% Do not modify these lines
set(cluster, 'HasSharedFilesystem', true);
set(cluster, 'ClusterMatlabRoot', '/usr/local/matlab');
set(cluster, 'OperatingSystem', 'unix');
set(cluster, 'NumWorkers',16);
set(cluster, 'IndependentSubmitFcn', {@independentSubmitFcn,h_rt,exclusive});
set(cluster, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn,h_rt,exclusive});
set(cluster, 'GetJobStateFcn', @getJobStateFcn);
set(cluster, 'DeleteJobFcn', @deleteJobFcn);

% Create parallel job with default settings. The example task below
% uses parfor, which requires a 'pool'-type job; use 'SPMD' instead
% for spmd-style code.
pjob = createCommunicatingJob(cluster, 'Type', 'pool');
%pjob = createCommunicatingJob(cluster, 'Type', 'SPMD');


% Specify the number of workers required for execution of your job
pjob.NumWorkersRange = [1 16];

% Add a task to the job. Here we run the test_parfor function defined
% above. The commented line shows how to call a built-in (sum) instead.
%createTask(pjob, @sum, 1, {[1 2 3]});
createTask(pjob, @test_parfor, 1, {});

% Submit the job to the cluster
submit(pjob);

% Wait for the job to finish running, and retrieve the results.
% This is optional. Your program will block here until the parallel
% job completes. If your program writes its results to a file, you
% may not want this, or you might want to move it further down in your
% code, so you can do other work while pjob runs.
wait(pjob, 'finished');
results = fetchOutputs(pjob);

% This checks for errors from individual tasks and reports them.
% very useful for debugging
errmsgs = get(pjob.Tasks, {'ErrorMessage'});
nonempty = ~cellfun(@isempty, errmsgs);
celldisp(errmsgs(nonempty));

% Display the results
disp(results);

% Destroy job
% For parallel jobs, I recommend NOT using the destroy command, since it
% causes the SGE jobs to exit with an error due to a race condition. If you
% insist on using it to clean up the 'Job' files and subdirectories in your
% working directory, you must include the pause statement to avoid the job
% finishing in SGE with an error.
%pause(16);
%destroy(pjob);

Cut and paste the above code into a text file, and save it with a '.m' extension somewhere in your MATLAB path so you can call it from within MATLAB. Give it a descriptive filename such as 'sgesubmitpar.m' or 'parallelsge.m', or whatever else you like.

Once you have saved the file, you will need to change a few lines. These lines are specified in the comments, but I'll briefly go over them here in the order they appear in the file. The first line you need to change is this one:

cluster = parallel.cluster.Generic('JobStorageLocation', '/home/lcolbert/matlab');

This line tells MATLAB what working directory to use for this job. It's similar to the '-wd' switch to qsub. Change '/home/lcolbert/matlab' to the actual path to your data.

The next two lines,

h_rt = '00:15:00';
exclusive = 'false';

are related to SGE. The first line specifies the estimated runtime (h_rt) for your job to SGE in HH:MM:SS format. Set this to a value longer than your job actually needs; once the h_rt limit is reached, SGE will kill your job. If, for some reason, you want your job to have exclusive use of a cluster node, set exclusive to 'true'. Otherwise, leave it set to 'false'.
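For example, here is how these two settings might look for a hypothetical job expected to run about two hours that also needs a node to itself (the values shown are illustrative, not recommendations):

```matlab
h_rt = '02:30:00';    % allow 2.5 hours of wall-clock time, with headroom
exclusive = 'true';   % request exclusive use of a cluster node
```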

Next you need to specify whether you are going to create your parallel job using a type of 'pool' or 'SPMD'.

pjob = createCommunicatingJob(cluster, 'Type', 'SPMD');

or

pjob = createCommunicatingJob(cluster, 'Type', 'pool');

Only one of these lines should be uncommented; which one depends on which MATLAB parallel computing functions you use. job = createCommunicatingJob(...,'Type','pool',...) creates a communicating job of type 'pool'. This is the default if 'Type' is not specified. A 'pool' job runs the specified task function with a MATLAB pool available to run the body of parfor loops or spmd blocks. job = createCommunicatingJob(...,'Type','spmd',...) creates a communicating job of type 'spmd', where the specified task function runs simultaneously on all workers, and lab* functions can be used for communication between workers.
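To sketch the difference, an 'spmd'-type task function can use the PCT built-ins labindex and numlabs to split work across workers by hand. The test_spmd function below is a hypothetical example for illustration, not part of the scripts in this document:

```matlab
% Hypothetical SPMD-style task function: each worker (lab) computes
% which slice of the indices 1..N it would handle.
function r = test_spmd()
N = 15000;
chunk = ceil(N/numlabs);                 % indices per worker
first = (labindex-1)*chunk + 1;          % this worker's first index
last  = min(labindex*chunk, N);         % this worker's last index
r = [first last];
end
```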

The final line you need to change is the one that actually specifies what program or function you want MATLAB to run in parallel:

createTask(pjob, @test_parfor, 1, {});

You'll want to change this to the name of the MATLAB file containing the actual MATLAB program or function you want to run.
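The arguments to createTask are the job, a function handle, the number of output arguments the function returns, and a cell array of input arguments. For example, to run MATLAB's built-in sum on a vector (the same call shown in the commented line of the submission script):

```matlab
% One task calling the built-in sum: one output, one input argument.
createTask(pjob, @sum, 1, {[1 2 3]});
```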

Once you've saved your changes to this file, you can call it from within MATLAB.
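Assuming you saved the submission script as 'sgesubmitpar.m', you would run it at the MATLAB prompt like this; the disp(results) call at the end of the script prints the elapsed time returned by test_parfor (the actual value will vary from run to run):

```matlab
>> sgesubmitpar
```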

Submitting a Distributed Job

Submitting a distributed job is very similar to submitting a parallel job, with some minor differences in the functions called, and the fact that some scheduler properties (MaxNumberofWorkers, etc.) are no longer needed, since they don't make sense in this context. For this example, we are going to use a modified version of the test_parfor function from above that uses for instead of parfor:

function [elapsedTime] = test_for()

N = 15000;
A = zeros(N,1);
tic;
for i=1:N
    E = eig(rand(100))+i;
    A(i) = E(1);
end
elapsedTime = toc;
end

Copy this code and save it as test_for.m. The code below can be used to submit a distributed job using this function to the cluster:

% Example MATLAB submission script for running a distributed job on the
% SNS Hyperion Cluster
% Mar-2014 Lee Colbert, lcolbert@ias.edu

cluster = parallel.cluster.Generic('JobStorageLocation', '/home/lcolbert/work_lcolbert/testing/matlab');
h_rt = '00:15:00';
exclusive = 'false';

% Do not modify these lines
set(cluster, 'HasSharedFilesystem', true);
set(cluster, 'ClusterMatlabRoot', '/usr/local/matlab');
set(cluster, 'OperatingSystem', 'unix');
set(cluster, 'NumWorkers',16);
set(cluster, 'IndependentSubmitFcn', {@independentSubmitFcn,h_rt,exclusive});
set(cluster, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn,h_rt,exclusive});
set(cluster, 'GetJobStateFcn', @getJobStateFcn);
set(cluster, 'DeleteJobFcn', @deleteJobFcn);

% Create distributed job with default settings.  You can also attach any files that are needed.
pjob = createJob(cluster,'AttachedFiles',...
        {'/home/lcolbert/matlab/test_for.m'});

% Add a task to the job. In this example, I'm calling the test_for
% function repeatedly. Replace @test_for with the name of your MATLAB function.
createTask(pjob, @test_for, 1, {});
createTask(pjob, @test_for, 1, {});
createTask(pjob, @test_for, 1, {});

% Run the job
submit(pjob);
% Wait for the job to finish running, and retrieve the results.
% This is optional. Your program will block here until the distributed
% job completes. If your program writes its results to a file, you
% may not want this, or you might want to move it further down in your
% code.
wait(pjob);
results = fetchOutputs(pjob);

% This checks for errors from individual tasks and reports them.
% very useful for debugging
errmsgs = get(pjob.Tasks, {'ErrorMessage'});
nonempty = ~cellfun(@isempty, errmsgs);
celldisp(errmsgs(nonempty));

% Display the results
disp(results);
% destroy the job
destroy(pjob);

Cut and paste the code above into a text file and save it with a descriptive name, such as 'sgesubmitdist.m'. Save it to a location in your MATLAB path so you can call it from within MATLAB.

As with the parallel job above, you'll need to modify a couple of lines to meet the specifics of your job. The lines you need to change are clearly identified in the comments in the code. The first line you need to modify is this one:

cluster = parallel.cluster.Generic('JobStorageLocation', '/home/lcolbert/work_lcolbert/testing/matlab');

Change it to match the location of your data. This tells MATLAB what directory to use as its working directory, more or less.

The next two lines,

h_rt = '00:15:00';
exclusive = 'false';

are related to SGE. The first line specifies the estimated runtime (h_rt) for your job to SGE in HH:MM:SS format. Set this to a value longer than your job actually needs; once the h_rt limit is reached, SGE will kill your job. If, for some reason, you want your job to have exclusive use of a cluster node, set exclusive to 'true'. Otherwise, leave it set to 'false'.

You may need to specify the files that this job needs in order to run.

pjob = createJob(cluster,'AttachedFiles',...
        {'/home/lcolbert/matlab/test_for.m'});
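If your job depends on more than one file, you can list them all in the cell array passed to 'AttachedFiles'. In this sketch, the second filename is a hypothetical helper function, shown only to illustrate the syntax:

```matlab
% Attach multiple files; 'helper_fcn.m' is a hypothetical example.
pjob = createJob(cluster, 'AttachedFiles', ...
        {'/home/lcolbert/matlab/test_for.m', ...
         '/home/lcolbert/matlab/helper_fcn.m'});
```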

The final lines you need to change are the ones that actually specify what programs or functions you want MATLAB to run on the cluster:

createTask(pjob, @test_for, 1, {});
createTask(pjob, @test_for, 1, {});
createTask(pjob, @test_for, 1, {});

This is where the real distributed magic happens. In this case, I'm telling it to run the test_for function from above as three separate tasks. These tasks will run independently, which is what differentiates a distributed job from a parallel one. Instead of adding one function three times, which is pretty useless, I could have added a function that takes parameters with three different sets of parameters, or three completely different functions with completely different parameters. In actual practice, you'd change this to the name of the MATLAB file containing the actual MATLAB programs or functions you want to run, and you could add more (or fewer) than three tasks. I just chose three tasks for this example.
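For instance, here is a sketch of running the same function on three different data sets, using MATLAB's built-in sum (as in the commented example in the parallel submission script). Each task runs independently and returns one output:

```matlab
% Three independent tasks: same function, different input data.
createTask(pjob, @sum, 1, {[1 2 3]});
createTask(pjob, @sum, 1, {[4 5 6]});
createTask(pjob, @sum, 1, {[7 8 9]});
```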

Once you've saved your changes to this file, you can call it from within MATLAB. For example, I saved this file as 'sgesubmitdist.m':

>> sgesubmitdist
    [96.5408]
    [96.4077]
    [96.4140] 

And that's all there is to it. That should be all you need to know to get started using the Hyperion Cluster to run distributed or parallel jobs from MATLAB!