Tuesday 27 February 2007

Proxy expired

All job submission stopped this afternoon because my proxy expired. Now running again.

Monday 26 February 2007

Cleanup not quite right

The cleanup didn't work quite right: it cleared the entries out of the log file but not out of the if file, so they kept coming back. An attempt to fix this yesterday failed because of a typo. Trying again.

Friday 23 February 2007

Automatic cleanup

Have introduced some automatic cleanup via the cron job. Scheduled, Ready, Running, Waiting and "Job proxy is expired" jobs are removed from the log after 12 hours. The logs are archived after 7 days. Lock files are removed after 60 minutes and an attempt is made to kill the running process.
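The cleanup rules above can be sketched roughly as follows. This is only an illustration and not the actual atest code: the job tuple layout and the function names are assumptions, and only the states and time limits come from the entry itself.

```python
from datetime import datetime, timedelta

# States and limits taken from the log entry; everything else is hypothetical.
STALE_STATES = {"Scheduled", "Ready", "Running", "Waiting", "Job proxy is expired"}
JOB_MAX_AGE = timedelta(hours=12)
LOCK_MAX_AGE = timedelta(minutes=60)


def prune_jobs(jobs, now):
    """Return the jobs to keep.

    Each job is an assumed (job_id, state, submitted) tuple. A job is dropped
    only if it is in one of the unfinished states and older than 12 hours.
    """
    return [
        (job_id, state, submitted)
        for job_id, state, submitted in jobs
        if not (state in STALE_STATES and now - submitted > JOB_MAX_AGE)
    ]


def lock_is_stale(lock_created, now):
    """True if a lock file is older than the 60 minute limit."""
    return now - lock_created > LOCK_MAX_AGE
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the functions, keeps the rules easy to check by hand against a known set of jobs.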

OK Now

Seems to be OK now. Have put the archiving of the log files in the cron job to run once a day.

Web page not updated

The web page has not been updated since 4.30 this morning. The previous lock file was not removed for some reason. Deleted it by hand. Job submission seems to have been OK.

Thursday 22 February 2007

Jobs restarted

Cleaned out all pending jobs and restarted submission.

Problem understood

The file system could not cope with me storing all the job outputs in one place. The code is being changed so that each day's worth of jobs is in a separate directory. Job submission will be restarted once this has been debugged.
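A minimal sketch of the per-day layout, assuming a YYYY-MM-DD directory name under a common base directory. The function names and the naming scheme are illustrative assumptions, not the real code.

```python
from datetime import date
from pathlib import Path


def daily_output_dir(base, day):
    """Return the directory for one day's job outputs, e.g. base/2007-02-22.

    The ISO date naming is an assumption made for this sketch.
    """
    return Path(base) / day.isoformat()


def ensure_daily_output_dir(base, day):
    """Create the day's directory if it does not exist yet and return it."""
    path = daily_output_dir(base, day)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Splitting by day keeps the number of entries in any single directory bounded by one day's submissions, which is the point of the change described above.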

Strange State

Not sure what is going on. Summary is all yellow but individual jobs look OK. It appears atest isn't getting the status properly. Stopped submission till I sort it out.

Wednesday 21 February 2007

Sheffield replica failed again


[lloyd@heppc005 atest]$ export LFC_HOST=lfc.gridpp.rl.ac.uk
[lloyd@heppc005 atest]$ lcg-rep -v -t 120 --vo atlas -d lcgse1.shef.ac.uk srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/atlas/generated/2007-01-19/filef607aa8a-5ee3-44a8-bf0f-f0ccbb73aecb
Using grid catalog type: lfc
Using grid catalog : lfc.gridpp.rl.ac.uk
lcg_rep: Permission denied

Looking Good

Everything looking much better today. Changed the kill time back to 8 hours.

Tuesday 20 February 2007

Status

Looking OK at the moment: 70% success.

System restarted

Cleaned out all remaining jobs at IC from my system and restarted everything using the 2nd RAL RB.

Imperial RB is very slow

The Imperial RB is very slow and there is a big backlog of jobs still to be processed ('waiting'). I have stopped job submission and switched the RB to lcgrb02. I have also extended the cancellation time from 8 to 20 hours until the backlog is cleared.