Detecting black holes

Symptoms: SAM tests are failing with one of the following errors:

  • UNKNOWN Got a job held event, reason: Job failed, no reason given by GRAM server
  • Got a job held event, reason: Globus error 94: the jobmanager does not accept any new requests (shutting down)
  • MARADONA File not available.Cannot read JobWrapper output, both from Condor and from Maradona.

Possible problem: One of the machines may have become a "black hole". One of the easiest ways to check for that is to read the mail of the SAM test's user, e.g. sgmcms000 for CMS. The most recent mails should all be complaining about the same node and should also explain what went wrong. Typically the problem is related to the scp'ing of the data.
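
The mail check can also be scripted. The following is only a minimal sketch, not a site tool: it assumes the SAM user's mail is stored in an mbox file and that node names appear in the mail bodies as fully qualified host names. The function names, the host-name pattern and the example host names are all hypothetical and need adjusting to the local setup.

```python
import mailbox
import re
from collections import Counter

# Assumption: nodes show up as fully qualified host names with at least
# two dots, e.g. "wn042.example.org". Adjust to the site's naming scheme.
HOST_RE = re.compile(r"\b(?:[a-z0-9-]+\.){2,}[a-z]{2,}\b", re.IGNORECASE)


def count_nodes(texts):
    """Count in how many mail bodies each host name is mentioned."""
    counts = Counter()
    for text in texts:
        # Use a set so a host is counted once per mail, not once per mention.
        counts.update({host.lower() for host in HOST_RE.findall(text)})
    return counts


def body_text(msg):
    """Return the decoded text parts of a message, joined by newlines."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)
            if payload:
                parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def last_mail_bodies(mbox_path, n=20):
    """Read the bodies of the last n mails from an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        messages = list(box)[-n:]
        return [body_text(msg) for msg in messages]
    finally:
        box.close()
```

For example, count_nodes(last_mail_bodies(path)).most_common(3) lists the three most frequently mentioned hosts, where path is wherever the mail of sgmcms000 is delivered on your system. A single host that dominates the recent mails is the likely black hole. The pattern also matches mail servers and other host names quoted in the bodies, so read the mails themselves before taking a node out of production.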

Page last modified on August 23, 2010, at 08:21 PM