Tuesday 22 May 2007

Several problems

On Saturday morning (19 May) the machine that submits the test jobs crashed. On Sunday the system was moved to a different machine, but now ~50% of jobs fail to complete. The failures appear to be correlated with jobs going through lcgrb02.gridpp.rl.ac.uk rather than lcgrb01.gridpp.rl.ac.uk, which looks like an RB problem rather than a problem at my end. I have now switched to the IC RB to see if this improves things.
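
One way to check a correlation like this is to tally the failure rate per RB from the job records. The following is only a minimal sketch, not the actual test system: the function name, the record layout (RB name, completed flag) and the sample values are all hypothetical placeholders, not real measurements.

```python
from collections import Counter


def failure_rate_by_rb(records):
    """Return {rb_name: fraction of jobs that failed} for (rb_name, completed) pairs."""
    total = Counter()
    failed = Counter()
    for rb, completed in records:
        total[rb] += 1
        if not completed:
            failed[rb] += 1
    return {rb: failed[rb] / total[rb] for rb in total}


# Placeholder records for illustration only (hypothetical RB names, invented outcomes).
sample = [("rb-a", True), ("rb-a", True), ("rb-b", False), ("rb-b", True)]
rates = failure_rate_by_rb(sample)
```

A markedly higher rate for one RB than for the other would support the suspicion that the problem lies with that RB rather than with the submitting machine.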

Friday 11 May 2007

More hangups

The system hung again while trying to retrieve the job output. I have changed the script so that retrieval now runs in a separate thread, which is killed after n minutes so that it cannot hang the whole script.
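
The idea of bounding the retrieval step can be sketched as follows. This is an assumption-laden illustration, not the actual script: it uses a child process with a timeout instead of a thread (Python threads cannot be killed from outside), the function name is hypothetical, and the two commands are stand-ins rather than the real output-retrieval command.

```python
import subprocess
import sys


def retrieve_with_timeout(cmd, timeout_s):
    """Run cmd; return (True, stdout) on success or (False, reason) on failure."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, result.stdout


# Stand-in commands (hypothetical): one finishes at once, one simulates a hang.
fast_cmd = [sys.executable, "-c", "print('output retrieved')"]
slow_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
```

Calling `retrieve_with_timeout(slow_cmd, 1)` gives up after about one second instead of blocking, so the main submission loop can carry on with the next job.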

Tuesday 1 May 2007

Still messed up

Things are still messed up. The script seems to hang trying to retrieve output. I have made a slight change and suspended submission until I clear the backlog.