LSF Error Codes
Determining why a job ended unexpectedly is an essential skill for running jobs successfully on the cluster and identifying systemic errors.
The basic process for locating an error code, and then its plain-English meaning, mostly relies on the bjobs and bhist commands. A script for locating job exit information is also provided in /tools/scripts.
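Once the job report from bjobs or bhist has been captured as text, the exit code can be pulled out automatically. The following is a minimal sketch, not the script from /tools/scripts. It assumes the report contains a phrase of the form "Exited with exit code N"; the pattern, the function name find_exit_code, and the sample text are all illustrative assumptions, so adjust them to match what your LSF version actually prints.

```python
import re

# Assumed wording of the exit line in a job report; adjust if your
# LSF version phrases it differently or wraps it across lines.
EXIT_CODE_PATTERN = re.compile(r"Exited with exit code\s+(\d+)")


def find_exit_code(job_report):
    """Return the exit code found in a job report, or None if absent."""
    match = EXIT_CODE_PATTERN.search(job_report)
    return int(match.group(1)) if match else None


# Hypothetical report text, used only to exercise the function.
sample_report = "Job <1234> finished. Exited with exit code 139."
exit_code = find_exit_code(sample_report)
```

A report without the phrase (for example, a job that finished normally) yields None, so callers can tell "no exit code reported" apart from a real code.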
Here is some information on common LSF error codes:
Error condition | LSF exit code | System exit code equivalent | Meaning |
---|---|---|---|
Command not found | 127 | 1 or 127 | The command shell returns 1 if the command is not found. If the command cannot be found inside a job script, LSF returns exit code 127. |
Directory not available for output | 0 | 1 | If the directory given for output (bsub -o) is not available, LSF sends the output back to the user by email. |
LSF internal error | -127, 127 | N/A | RES returns -127 or 127 for all internal problems. |
Out of memory | N/A | N/A | Exit code depends on the error handling of the application itself. |
LSF job states | 0 | N/A | Exit code 0 is returned for all job states. |
Source: http://www-01.ibm.com/support/docview.wss?uid=isg3T1013659
Error codes greater than 128 and less than 255 (255 is a general failure) are considered system errors, and the underlying system code can be obtained with the following formula:
LSF Error Code - 128 = System Error Code
So, for example, if LSF returns 139, the actual error code is 139 - 128 = 11, which is SIGSEGV (a segmentation fault) on most systems.
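The subtraction above can be sketched in a few lines of Python. This is an illustrative helper, not part of LSF: the function name decode_lsf_exit_code is made up for this example, and the signal names come from Python's standard signal module, so they reflect the machine where the script runs rather than the node where the job ran.

```python
import signal


def decode_lsf_exit_code(code):
    """Return (system_code, description) for an LSF exit code."""
    if 128 < code < 255:
        # Apply the formula: LSF error code - 128 = system error code.
        system_code = code - 128
        try:
            description = signal.Signals(system_code).name
        except ValueError:
            # No signal with this number is defined on this platform.
            description = "unknown signal"
        return system_code, description
    # Codes outside the range are passed through unchanged.
    return code, "application or shell exit code"


system_code, description = decode_lsf_exit_code(139)
print(system_code, description)
```

Codes outside the range, such as 127 for "command not found", are returned as-is and should be looked up in the table above instead.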
Here is some more information from CERN: