BLAST (Basic Local Alignment Search Tool) is a tool
used in bioinformatics to find regions of local similarity between sequences.
BLAST is a software package that contains several different
tools that search existing databases given a nucleotide or protein
sequence as input. The exact details of BLAST and how to run and use
the software are beyond the scope of this documentation. The examples that
follow illustrate how to use the program with a cluster scheduling
system, along with a variety of scripting techniques to streamline program operation.
Note: The example inputs in this section are taken from the
tcoffee package, which has many different kinds of inputs in
a variety of formats.
Note: The scripts in this section are primarily for example purposes; they
are not a ``best method'' for running a
particular program in all cases. Instead, they are meant to showcase different
scripting and job setup techniques. In general, the best submission scripts are
those where the process is standardized and organized to a degree that you
seldom (if ever) have to change the script. You are encouraged to write scripts
that suit your particular style and preferences.
Here we have already set up a blast_test directory with an input file ready
to go.
Example:
[jdpoisso@umms-amino ~]$ cd blast_test/
[jdpoisso@umms-amino blast_test]$ ls
sv.fasta
[jdpoisso@umms-amino blast_test]$
Let's say you wanted to run this without using the
scheduling system. Your command might be something like this:
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr
Note: Depending on how paths are set up on your cluster system or by
your personal settings, you may not need to use the full path
``/opt/bio/ncbi/bin/blastall''.
Also, depending on your system, you may need to be aware that multiple
versions of a piece of software can be installed, with only one as the default
at a given time. Be aware of these factors when writing your script, and be
sure to run the correct version if your input is version sensitive.
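For example, on clusters that use environment modules, you might select a
version explicitly rather than relying on the default. A sketch of the idea;
the module name, version, and versioned path below are hypothetical, so check
what your system actually provides:
# Select a specific installed version rather than the current default
# (hypothetical module name -- list real ones with `module avail`).
module load blast/2.2.26
# Alternatively, bypass defaults entirely by invoking a versioned path directly
# (again, a hypothetical path; check what is installed on your system):
/opt/bio/ncbi-2.2.26/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr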
Submitting that exact command to the scheduling system means writing it
into a script. Here is a script that will run that exact command through the
scheduling system:
blast.sh :
#!/bin/bash
cd ${PBS_O_WORKDIR}
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr
As you can see, only a small amount of setup is required just to run a
command. The first line identifies the file as a script and specifies
what shell (for our purposes, the language of the script) to use. The second
line changes the directory using the ${PBS_O_WORKDIR} environment
variable. As previously mentioned, the scheduling system may set environment
variables for a job. The ${PBS_O_WORKDIR} variable is set to the
directory from which the job is submitted, our submission
directory. So if we submit the job in our
blast_test directory, then the variable is set to
``blast_test'' (actually
the absolute path, /home/jdpoisso/blast_test ).
This is necessary because, by default, when the scheduling system starts your
job, the working directory is set to your home directory, which may not contain
your input (or worse, may contain different input), causing your job to start,
fail to find its input, and end.
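One small defensive addition (not part of the script above) is to abort the
job if that directory change ever fails, so the command never runs from the
wrong directory:
#!/bin/bash
# Exit immediately if the submission directory is unreachable,
# instead of running blastall from the wrong directory.
cd ${PBS_O_WORKDIR} || exit 1
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr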
Example :
[jdpoisso@umms-amino blast_test]$ qsub blast.sh
1231700.umms-amino.ccmb.med.umich.edu
[jdpoisso@umms-amino blast_test]$ qstat 1231700
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1231700.umms-amino        blast.sh         jdpoisso        00:03:20 R default
<----job finishes---->
[jdpoisso@umms-amino blast_test]$ ls
blast.sh  blast.sh.e1231700  blast.sh.o1231700  sv.fasta
[jdpoisso@umms-amino blast_test]$
Having run that script, the job is allowed to run and finish,
and the output is placed in the submission directory. Those familiar with
BLAST may know that when the specified command is run
without the scheduling system, all the results are printed to the screen and
not saved to a file. Here, all the results normally printed to the screen
are instead in blast.sh.o1231700 : as mentioned in a previous section,
the scheduling system captures the standard output and saves it into a file,
which is then placed back in your submission directory as a result.
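As an aside, on this kind of scheduler you can usually control the names of
those capture files with #PBS directives placed at the top of the script. A
minimal sketch; the job name and file names here are our own choices:
#!/bin/bash
#PBS -N blast_run        # job name, shown in qstat listings
#PBS -o blast_run.out    # where to place the captured standard output
#PBS -e blast_run.err    # where to place the captured standard error
cd ${PBS_O_WORKDIR}
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr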
What if your job, instead of writing its results to the
screen, writes them out to a file? Or multiple files? In this case, there may be
nothing in that blast.sh.o1231700 file. Instead, your data may be in
the files you specified, or in files specified by the program.
blast.sh :
#!/bin/bash
cd ${PBS_O_WORKDIR}
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr -o blast.out -O blast.seq
Example :
[jdpoisso@umms-amino blast_test]$ qsub blast.sh
1231986.umms-amino.ccmb.med.umich.edu
[jdpoisso@umms-amino blast_test]$ qstat 1231986
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1231986.umms-amino        blast.sh         jdpoisso        00:00:36 R default
[jdpoisso@umms-amino blast_test]$ ls
blast.out  blast.seq  blast.sh  blast.sh.e1231986  blast.sh.o1231986  sv.fasta
[jdpoisso@umms-amino blast_test]$
The command in the script has been changed to produce two output files,
instead of anything that would be printed to the screen. Both of these files
have been written to our submission directory,
as our script still explicitly changes to our submission directory.
In this example, the files are small and manageable. However, what if your
results are large files? Or your program writes megabytes or gigabytes of
temporary data while it runs? There are many ways your cluster system could be
configured to handle these situations. Some systems may have a high speed
shared space that provides the necessary performance to handle many concurrent
jobs of this type. In most cases, though, you will want to stage your job's
data to a local scratch space.
Accomplishing this is quite simple. Assuming, as in the previous example, that
all the necessary data is in the submission directory, we can modify our script
to copy out and stage our data into a local scratch space (/tmp).
blast.sh :
#!/bin/bash
cd ${PBS_O_WORKDIR}

# setup and copy workdir
LOCAL_WORKDIR=/tmp/${USER}/${PBS_JOBID}
mkdir -p ${LOCAL_WORKDIR}
cp -r * ${LOCAL_WORKDIR}
cd ${LOCAL_WORKDIR}

/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr -o blast.out -O blast.seq

# copy back data
cp -r * ${PBS_O_WORKDIR}
cd ${PBS_O_WORKDIR}

# cleanup
rm -rf ${LOCAL_WORKDIR}
Example :
[jdpoisso@umms-amino blast_test]$ qsub blast.sh
1382907.umms-amino.ccmb.med.umich.edu
[jdpoisso@umms-amino blast_test]$ qstat 1382907
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1382907.umms-amino        blast.sh         jdpoisso        00:04:10 R default
[jdpoisso@umms-amino blast_test]$ ls
blast.sh  sv.fasta
<----job finishes---->
[jdpoisso@umms-amino blast_test]$ ls
blast.out  blast.seq  blast.sh  blast.sh.e1382907  blast.sh.o1382907  sv.fasta
[jdpoisso@umms-amino blast_test]$
Note: The example script here does not take into account potential
caveats that could occur when copying files back and forth, such as running
out of disk space, or failing to copy files back to their origin. A modified
version of the script above that includes checks for these factors is included
in the scripting appendix.
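As an illustration, here is a minimal sketch of what such checks might look
like. This is our own defensive variant of the script above, not the appendix
version; the error messages and recovery behavior are one reasonable choice
among many:
#!/bin/bash
cd ${PBS_O_WORKDIR} || exit 1

# setup and copy workdir; abort rather than run in the wrong place
LOCAL_WORKDIR=/tmp/${USER}/${PBS_JOBID}
mkdir -p ${LOCAL_WORKDIR} || exit 1
cp -r * ${LOCAL_WORKDIR} || { echo "stage-in failed (disk full?)" >&2; exit 1; }
cd ${LOCAL_WORKDIR} || exit 1

/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr -o blast.out -O blast.seq

# copy back data; if this fails, leave the scratch copy in place for recovery
if cp -r * ${PBS_O_WORKDIR}; then
    cd ${PBS_O_WORKDIR}
    rm -rf ${LOCAL_WORKDIR}
else
    echo "copy back failed; results left in ${LOCAL_WORKDIR}" >&2
    exit 1
fi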
As you can see, the job may be submitted and run, and while it runs no change
appears in the submission directory. This is because the relevant data (in this
case the sv.fasta file) was copied to a local scratch space on whatever node
the program was assigned to. When the program completed, the data was copied
back, and all the results appear in your submission directory.
Using these examples, you have a simple framework and knowledge base you can
use to submit jobs to the cluster. You should be able to submit jobs and have
them run. However, you may notice certain problems when running your jobs using
modified versions of these examples. Your jobs may end prematurely. They may
not thread or distribute properly. They may run extremely slowly. They may
crash for lack of memory or disk space. This is because so far there has been
no discussion of how to request resources.
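As a preview of what resource requests look like on a Torque/PBS-style system
like the one in these examples, a short sketch follows. The specific values are
purely illustrative, and the available resource names and limits vary by site:
#!/bin/bash
# Illustrative resource requests -- actual limits and node layouts vary by cluster.
#PBS -l nodes=1:ppn=4        # one node, four processor cores
#PBS -l mem=4gb              # total memory for the job
#PBS -l walltime=24:00:00    # maximum run time
cd ${PBS_O_WORKDIR}
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr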