LSF Error Codes

Determining why a job ended unexpectedly is an essential skill for running jobs successfully on the cluster and identifying systemic errors.

The basic process is to locate the job's exit code and then translate it into plain English. Most of this is done with the bjobs and bhist commands. A script for locating job exit information is also provided in /tools/scripts.
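As a rough illustration of the first step, the sketch below pulls the exit code out of the long-format output of bjobs -l or bhist -l. It assumes the job record contains a phrase of the form "Exited with exit code N"; the exact wording can vary between LSF versions, and the sample text is made up for illustration. The names extract_exit_code and fetch_job_record are hypothetical helpers, not part of LSF or of the script in /tools/scripts.

```python
import re
import subprocess

# Assumed wording of the exit message in `bjobs -l` / `bhist -l` output.
# \s+ tolerates the line wrapping that long-format output applies between words;
# a wrap in the middle of a word is not handled here.
EXIT_PATTERN = re.compile(r"Exited\s+with\s+exit\s+code\s+(\d+)")


def extract_exit_code(job_record):
    """Return the exit code found in a job record, or None if there is none."""
    match = EXIT_PATTERN.search(job_record)
    return int(match.group(1)) if match else None


def fetch_job_record(job_id, command="bjobs"):
    """Run `bjobs -l <job_id>` (or `bhist -l <job_id>`) and return its text output.

    Only works on a host where the LSF commands are installed.
    """
    result = subprocess.run(
        [command, "-l", str(job_id)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Illustrative (made-up) fragment of a job record.
SAMPLE_RECORD = (
    "Mon Jan  1 10:00:05: Exited with exit code 139. "
    "The CPU time used is 0.2 seconds.\n"
)
```

On a cluster node, extract_exit_code(fetch_job_record(job_id)) would combine the two steps. If bjobs no longer knows about an older job, passing command="bhist" may still find it.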

Here is some information on common LSF error codes:

| Error condition | LSF exit code | System code equivalent | Meaning |
| --- | --- | --- | --- |
| Command not found | 127 | 1 or 127 | The command shell returns 1 if the command is not found. If the command cannot be found inside a job script, LSF returns exit code 127. |
| Directory not available for output | 0 | 1 | LSF sends the output back to the user by email if the directory given for output (bsub -o) is not available. |
| LSF internal error | -127, 127 | N/A | RES returns -127 or 127 for all internal problems. |
| Out of memory | N/A | N/A | The exit code depends on the error handling of the application itself. |
| LSF job states | 0 | N/A | Exit code 0 is returned for all job states. |

 

Source: http://www-01.ibm.com/support/docview.wss?uid=isg3T1013659

Error codes over 127 and less than 255 (255 is a general failure) are considered system errors, and the actual code can be obtained with the following formula:

LSF Error Code – 128 = System Error Code

So, for example, if LSF returns 139, your actual error code is 139 – 128 = 11, which is a segmentation fault (SIGSEGV) on most systems.
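The formula above can be sketched in a few lines of Python. This is a minimal illustration, not an LSF tool: system_code and signal_name are hypothetical helper names, and the lookup assumes the system error code is a signal number as numbered on the machine running the script, which may differ from the numbering on the execution host.

```python
import signal


def system_code(lsf_exit_code):
    """Apply 'LSF Error Code - 128 = System Error Code'.

    Returns None when the code is outside the range the formula covers
    (over 127 and less than 255).
    """
    if 127 < lsf_exit_code < 255:
        return lsf_exit_code - 128
    return None


def signal_name(code):
    """Return the local signal name for a system code, or None if unknown."""
    try:
        return signal.Signals(code).name
    except (ValueError, TypeError):
        # Not a signal number on this platform (or no code was given).
        return None


code = system_code(139)   # 139 - 128 = 11
name = signal_name(code)  # name of signal 11 on this machine
```

Codes such as 127 or 255 fall outside the formula, so system_code returns None for them, and the table of common codes is the better reference.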

Here is some more information from CERN:

http://information-technology.web.cern.ch/services/fe/lxbatch/howto/how-interpet-batch-job-return-codes
