Tuesday, 27 February 2007
Proxy expired
All job submission stopped this afternoon because my proxy expired. Now running again.
Monday, 26 February 2007
Cleanup not quite right
The cleanup didn't work quite right: it cleared the entries out of the log file but not out of the if file, so they kept coming back. An attempt to fix this yesterday failed because of a typo. Trying again.
Friday, 23 February 2007
Automatic cleanup
Have introduced some automatic cleanup via the cron job. Scheduled, Ready, Running, Waiting and "Job proxy is expired" jobs are removed from the log after 12 hours. The logs are archived after 7 days. Lock files are removed after 60 minutes and an attempt is made to kill the running process.
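A minimal Python sketch of the cleanup rules described above, for illustration only. The real cron job is not shown in this post, so the entry layout (job id, state, submission time) and the function names `prune_jobs` and `lock_is_stale` are assumptions; only the list of states and the 12 hour and 60 minute thresholds come from the text.

```python
from datetime import datetime, timedelta

# States that are cleaned up, as listed in the post.
STALE_STATES = {"Scheduled", "Ready", "Running", "Waiting", "Job proxy is expired"}
JOB_MAX_AGE = timedelta(hours=12)      # jobs in a stale state are dropped after this
LOCK_MAX_AGE = timedelta(minutes=60)   # lock files are considered stale after this


def prune_jobs(entries, now):
    """Return the entries to keep.

    `entries` is an assumed layout: (job_id, state, submitted_at) tuples.
    An entry is dropped only if it is in a stale state AND older than 12 hours.
    """
    return [
        (job_id, state, submitted_at)
        for job_id, state, submitted_at in entries
        if not (state in STALE_STATES and now - submitted_at > JOB_MAX_AGE)
    ]


def lock_is_stale(lock_created_at, now):
    """True if a lock file is older than 60 minutes and should be removed."""
    return now - lock_created_at > LOCK_MAX_AGE
```

Jobs in any other state are left alone regardless of age, and a stale lock would additionally trigger an attempt to kill the process that holds it, which is not sketched here.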
OK Now
Seems to be OK now. Have put the archiving of the log files in the cron job to run once a day.
Web page not updated
The web page has not been updated since 4.30 this morning. The previous lock file was not removed for some reason. Deleted it by hand. Job submission seems to have been OK.
Thursday, 22 February 2007
Problem understood
The file system could not cope with me storing all the job outputs in one place. The code is being changed so that each day's worth of jobs is in a separate directory. Job submission will be restarted once this has been debugged.
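The per-day layout could look something like the sketch below. The base directory and the ISO date naming are assumptions made for illustration; the post does not say how the directories are actually named.

```python
from datetime import date
from pathlib import Path


def output_dir_for(base, day):
    """Return the directory holding one day's job outputs, e.g. base/2007-02-22.

    `base` is a hypothetical root directory; ISO dates are used so that
    the directories sort chronologically.
    """
    return Path(base) / day.isoformat()


def ensure_output_dir(base, day):
    """Create the day's directory if it does not exist yet and return it."""
    path = output_dir_for(base, day)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Splitting by day keeps the number of entries in any single directory bounded by one day's submissions rather than growing without limit.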
Strange State
Not sure what is going on. Summary is all yellow but individual jobs look OK. It appears atest isn't getting the status properly. Stopped submission till I sort it out.
Wednesday, 21 February 2007
Sheffield replica failed again
[lloyd@heppc005 atest]$ export LFC_HOST=lfc.gridpp.rl.ac.uk
[lloyd@heppc005 atest]$ lcg-rep -v -t 120 --vo atlas -d lcgse1.shef.ac.uk srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/atlas/generated/2007-01-19/filef607aa8a-5ee3-44a8-bf0f-f0ccbb73aecb
Using grid catalog type: lfc
Using grid catalog : lfc.gridpp.rl.ac.uk
lcg_rep: Permission denied
Tuesday, 20 February 2007
System restarted
Cleaned out all remaining jobs at IC from my system and restarted everything using the 2nd RAL RB.
Imperial RB is very slow
The Imperial RB is very slow and there is a big backlog of jobs still to be processed ('waiting'). I have stopped job submission and switched the RB to lcgrb02. I have also extended the cancellation time from 8 to 20 hours until the backlog is cleared.