Tuesday, 19 June 2007

More RB problems

lcgrb01 at RAL is now giving problems similar to those seen with lcgrb02 a couple of weeks ago. I've switched to using lcgrb02 only for the time being.

Monday, 18 June 2007

ATLAS Release 13 and replica problems

There have been a few problems in the last few days. Firstly, some sites installed ATLAS release 13.0.10, which broke some of the tests. At the moment the tests run only on version 12.0.6, even where 13.0.10 is installed. Secondly, something has happened to the UI at QMUL: I can no longer look up my file replicas. This prevented analysis jobs from being submitted over the weekend. For now I am working around it by using a fixed list instead of obtaining replica information dynamically.

Tuesday, 22 May 2007

Several problems

On Saturday morning (19 May) the machine that submits the test jobs crashed. On Sunday the system was moved to a different machine, but now ~50% of jobs fail to complete. The failures appear to be correlated with jobs going through lcgrb02.gridpp.rl.ac.uk rather than lcgrb01.gridpp.rl.ac.uk, which looks like an RB problem rather than a problem at my end. I have now switched to the IC RB to see if this improves things.

Friday, 11 May 2007

More hangups

The system hung up again trying to retrieve the job output. I have changed the script so that retrieval is now done in a separate thread, which is killed after n minutes so as not to hang the whole script.
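
The idea of putting a time limit on output retrieval can be sketched roughly as follows. This is not the actual script: it is a minimal Python illustration, and because Python threads cannot be forcibly killed, it runs the retrieval step as a child process with a timeout instead of a thread. The function name run_with_timeout and the command passed to it are hypothetical placeholders; the real retrieval command is not shown here.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process.

    Returns the exit code, or None if the process had to be killed
    because it was still running after timeout_s seconds.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising,
        # so nothing is left hanging and the caller can simply move on.
        return None


if __name__ == "__main__":
    # Stand-in for a retrieval command that never returns (hypothetical).
    slow = [sys.executable, "-c", "import time; time.sleep(60)"]
    if run_with_timeout(slow, timeout_s=2) is None:
        print("retrieval timed out; skipping this job for now")
```

With something like this, a retrieval that hangs costs at most the timeout, and the main loop can carry on with the next job and retry the stuck one later.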

Tuesday, 1 May 2007

Still messed up

Things are still messed up. The script seems to hang trying to retrieve output. I have made a slight change and suspended submission until I clear the backlog.

Monday, 30 April 2007

Something messed up this weekend

Over the weekend the system got messed up. Job submission seemed to be OK and the jobs ran OK but my scripts kept hanging getting the output back. No idea why. It seems to be OK again now.

Friday, 27 April 2007

Slight bug in list of logs fixed

There was a small bug in the PHP script for the web page that caused some jobs not to appear in the filtered list of jobs on the "All Logs" page, although they did appear on the main page. This is now fixed. (Stupid strpos returns an integer that can be 0 if there is a match at the first character, but false if there is no match. You have to use === or !== to test it.)