Steve Lloyd's ATLAS Grid Tests: 2007

Wednesday 19 December 2007

Major Disruption

There was major disruption over the last couple of days after a RAID array died and various things had to be recovered from backup. Everything should be running again now but there is some history missing between 13 and 19 Dec. This could be recovered but is probably not worth it. The system is 'at risk' from now till 3 Jan (no 24x7 here!). Merry Christmas and a Happy New Year to all who use my test results.

Thursday 6 December 2007

ATLAS Tests now all use release 13.0.30

I have finally managed to create some 13.0.30 AOD and upload it to (most) UK SEs. All the tests are now using 13.0.30. QMUL, UCL_CCC, Lancaster and Edinburgh are not getting the analysis job at the moment as I could not make the replica at these sites.

Tuesday 4 December 2007

Problems at RAL

All the RAL RBs seem to be 'stuck' and the lfc is not reachable. I have switched to the ScotGrid RB to try and get the ATLAS tests running again.

Monday 29 October 2007

Many Changes

So finally things have stabilized. The ATLAS and RB tests are running properly again using a database (rather than text files) to improve reliability. The PHP scripts that provided the summaries have been replaced by static pages to stop the webserver errors. In addition there are now tests of the UK top level BDIIs:
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/bdiitest.html
All the tests are summarised on one page (accessible from the GridPP home page):
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
Plots of SAM performance are also available:
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/samplots.html

Thursday 6 September 2007

Many problems and possible solutions

There have been many problems over the summer:
RB Tests: I still cannot submit jobs through IC or Glasgow RBs due to atlas/dteam conflicts. There have been other problems due to a build up of jobs, problems updating my (flat) database of results, an overloaded server etc.
ATLAS Tests: These died on 17 August when there was a power cut and have only recently been revived. They are OK at the moment.
SAM Tests: There was a problem in calculating 'availability' but this only affected the history (for Tier-1?) and has been fixed.

I am rewriting some of the code and replacing my flat database with sqlite. This has been done for the RB Tests and seems to be OK. There are a few more tweaks to do then I will move the ATLAS tests to the same framework. This should make it more robust.
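
A minimal sketch of what moving results from a flat file into sqlite might look like, using Python's standard sqlite3 module. The table layout, column names and helper functions are assumptions for illustration only, not the schema actually used by the tests.

```python
import sqlite3

def open_results(path=":memory:"):
    """Open (or create) a results database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " job_id TEXT PRIMARY KEY,"
        " site TEXT NOT NULL,"
        " submitted TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn

def record_status(conn, job_id, site, submitted, status):
    """Insert a job, or overwrite its row if the job is already known."""
    with conn:  # one transaction: committed on success, rolled back on error
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (job_id, site, submitted, status),
        )

def success_rate(conn, site, good_status="Done"):
    """Fraction of a site's jobs in good_status, or None if it has no jobs."""
    total, good = conn.execute(
        "SELECT COUNT(*), SUM(status = ?) FROM results WHERE site = ?",
        (good_status, site),
    ).fetchone()
    return (good or 0) / total if total else None
```

Compared with rewriting a flat file, each update here is a single transaction, so an interrupted script should not leave a half-written results file behind.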

Friday 6 July 2007

dteam Woes

In order to de-ATLAS my test jobs (especially for the RB tests) I joined the dteam. This was a disaster to start with as my ATLAS jobs failed at some sites due to VOMS/Gridmap file issues and bugs in edg-job-submit. These seem to be more or less resolved now except that all the jobs I submit through the Glasgow RB never run. They get submitted to a CE but stay in the queue for ever. No-one seems to know why.

RB Tests

There is a new set of tests targeted at UK Resource Brokers. These send (non ATLAS) "Hello World" jobs to each UK RB every 10 minutes to execute on any UK CE. See
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/rbtest.html

Tuesday 19 June 2007

RB Logging Problems

Although I switched to lcgrb02 the logging was still going to lcgrb01. This is because it's hardwired into /opt/edg/etc/edg_wl_ui_cmd_var.conf and can't be overwritten by your own conf file (how dumb is that?). Commented out LoggingDestination and it should be OK now. Stephen B reminds me that it is written here:
http://www.gridpp.ac.uk/deployment/users/faq.html#chooserb

Monday 18 June 2007

ATLAS Release 13 and replica problems

There have been a few problems in the last few days. Firstly some sites installed ATLAS release 13.0.10 which broke some of the tests. At the moment the tests are only running on version 12.0.6 even if 13.0.10 is installed. Secondly something has happened to the UI at QMUL in that I cannot find out about my file replicas any more. This prevented analysis jobs being submitted over the weekend. This is being worked around at the moment by using a fixed list and not trying to obtain replica information dynamically.

Tuesday 22 May 2007

Several problems

On Saturday morning (19 May) the machine that submits the test jobs crashed. On Sunday the system was moved to a different machine but now ~50% of jobs fail to complete. This appears to be correlated with them going through lcgrb02.gridpp.rl.ac.uk rather than lcgrb01.gridpp.rl.ac.uk and looks like an RB problem rather than a problem at my end. I have now switched to the IC RB to see if this improves things.

Friday 11 May 2007

More hangups

System hung up again trying to retrieve the job output. I have changed the script so that this is now done in a separate thread which is killed after n minutes so as not to hang the whole script.
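
A minimal sketch of the same idea in Python, assuming the output retrieval is an external command. It uses a child process with a timeout rather than a thread, and the function name and the commands passed to it are placeholders, not the real retrieval step.

```python
import subprocess

def run_with_timeout(command, timeout_seconds):
    """Run command and return (succeeded, stdout).

    If the command has not finished after timeout_seconds the child
    process is killed and (False, "") is returned, so the caller never
    blocks for longer than the timeout.
    """
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return done.returncode == 0, done.stdout
```

The calling script can then log the failure and move on to the next job instead of hanging.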

Tuesday 1 May 2007

Still messed up

Things are still messed up. The script seems to hang trying to retrieve output. I have made a slight change and suspended submission until I clear the backlog.

Monday 30 April 2007

Something messed up this weekend

Over the weekend the system got messed up. Job submission seemed to be OK and the jobs ran OK but my scripts kept hanging getting the output back. No idea why. It seems to be OK again now.

Friday 27 April 2007

Slight bug in list of logs fixed

There was a small bug in the php script for the web page causing some jobs not to appear in the filtered list of jobs on the "All Logs" page although they did appear on the main page. This is now fixed. (Stupid strpos returns an integer that can be 0 if there is a match at the first character but false if there is no match. You have to use === or !== to test it.)
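
The same trap exists outside PHP. As an illustration only (the web page itself is PHP), Python's str.find returns 0 for a match at the first character and -1 for no match, so a plain truth test gets both cases wrong. The function names and the log names in the usage note are made up.

```python
def matches_buggy(log_name, site):
    # Wrong: 0 (match at the start) is falsy and -1 (no match) is truthy.
    return bool(log_name.find(site))

def matches_fixed(log_name, site):
    # Compare explicitly against the "not found" value.
    return log_name.find(site) != -1
```

With a hypothetical log name such as "QMUL_job1.log", the buggy version rejects the site "QMUL" precisely because the match is at position 0, which is the same symptom as the missing jobs in the filtered list.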

Thursday 29 March 2007

Liverpool replica disappeared again

The replica at Liverpool disappeared again but was still in the catalogue. We do not know why. Carl has remade it now.

Wednesday 28 March 2007

Adjust times for Summer Time

All timestamps appear to be in GMT so changed the web page so that the time of the last job submitted takes summer time into account properly. There are some internal conversions that still need to be checked.
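
A minimal sketch of such a conversion in Python, assuming the timestamps are naive GMT date-times. It hard-codes the current UK rule (summer time runs from 01:00 GMT on the last Sunday of March to 01:00 GMT on the last Sunday of October); the function names are hypothetical and this is not the code used on the web page.

```python
from datetime import datetime, timedelta

def last_sunday_0100(year, month):
    """01:00 GMT on the last Sunday of a 31-day month (March or October)."""
    last_day = datetime(year, month, 31, 1, 0)
    # weekday(): Monday is 0 and Sunday is 6, so step back to the Sunday.
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)

def gmt_to_uk_local(gmt):
    """Convert a naive GMT datetime to UK local time (GMT or GMT+1)."""
    start = last_sunday_0100(gmt.year, 3)
    end = last_sunday_0100(gmt.year, 10)
    if start <= gmt < end:
        return gmt + timedelta(hours=1)
    return gmt
```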

Replicas back

The RAL catalogue seems to be fixed now and analysis jobs are running again.

Tuesday 27 March 2007

Catalogue broken?

I cannot find out any information about any of my replicas. However apparently the replica on disk at Glasgow (at least) is OK. Have raised a GGUS ticket.

Friday 16 March 2007

Summary Problem

There was a problem with the summary csv file having a few entries with only one event. This was due to entries in the original log file being out of time sequence. The file is now sorted before the summary is made.
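
A minimal sketch of why the ordering matters, assuming the summary groups consecutive entries by time period. The entry format (ISO timestamp plus status) and the grouping by day are assumptions for illustration; the real log and summary formats are not shown here.

```python
from itertools import groupby

def day_of(entry):
    """First ten characters of an ISO timestamp, i.e. YYYY-MM-DD."""
    return entry[0][:10]

def daily_counts(entries, sort_first=True):
    """Count entries per day from (iso_timestamp, status) tuples.

    groupby only merges neighbouring entries, so input that is out of
    time sequence produces several small groups for the same day unless
    it is sorted first.
    """
    if sort_first:
        # ISO timestamps sort chronologically as plain strings.
        entries = sorted(entries, key=lambda e: e[0])
    return [(day, len(list(group))) for day, group in groupby(entries, key=day_of)]
```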

Hung up

The system was hung up this morning. Although jobs were being submitted overnight my script couldn't find their status or get their output. I did it by hand and it was OK. Perhaps there was a temporary RB problem.

Thursday 15 March 2007

A Sheffield Replica Finally

Finally managed to make a replica at Sheffield!

Bug fixed

The submission system was hung up this morning. This was due to a job being in a "Waiting" state exposing a bug in the script and causing it to abort. This is hopefully now fixed and the backlog has been cleared.

Wednesday 14 March 2007

Filter Logs on Status

Added the option to filter all log files on the job status, e.g. Failed, Aborted, Not successful.

Thursday 8 March 2007

Old Logs Available

There is a new link "All Logs" which allows one to look at old log and output files. These can be filtered by institute. It's a bit flaky because the webserver isn't really up to it. Logs before February 8 have been lost.

Monday 5 March 2007

Update RB config file

In my atlas.config file change


NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";

to

NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};

to try and use load balancing.

Sunday 4 March 2007

Working again

Commented out LoggingDestination line in heppc009:/opt/edg/etc/edg_wl_ui_cmd_var.conf. Everything seems to be OK again now.

Friday 2 March 2007

RAL RB Broken

All job submission fails:


Selected Virtual Organisation name (from --config-vo option): atlas
Connecting to host lcgrb02.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
**** Error: API_NATIVE_ERROR ****  
Error while calling the "edg_wll_RegisterJobSync" native api 
Unable to Register the Job:
https://lcgrb02.gridpp.rl.ac.uk:9000/rdV_Ep9fG_oqlBypG--UBQ
to the LB logger at: lcgrb01.gridpp.rl.ac.uk:9002
No route to host (edg_wll_ssl_connect())

Why does it try and use lcgrb01.gridpp.rl.ac.uk when I have this in my conf file:


[
VirtualOrganisation = "atlas";
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";

Manchester ce01 Off

There are no suitable queues on Manchester ce01 and all my jobs fail so I've switched it off for the time being. ce02 is OK.

Thursday 1 March 2007

Splitting Manchester

At Alessandra's request I am splitting Manchester into two: ce01 and ce02, reading from dcache01 and dcache02 respectively. At the moment it isn't quite working because of a BDII problem somewhere.

Minor problem

Attempts to make everything fully automatic by killing old processes and deleting their lock files failed because of a bug. Everything stopped overnight. Hopefully now OK.

Tuesday 27 February 2007

Proxy expired

All job submission stopped this afternoon because my proxy expired. Now running again.

Monday 26 February 2007

Cleanup not quite right

The cleanup didn't work quite right as it cleared the entries out of the log file but not out of the if file and so they kept coming back. An attempt to fix this yesterday failed because of a typo. Trying again.

Friday 23 February 2007

Automatic cleanup

Have introduced some automatic cleanup via the cron job. Scheduled, Ready, Running, Waiting and "Job proxy is expired" jobs are removed from the log after 12 hours. The logs are archived after 7 days. Lock files are removed after 60 minutes and an attempt is made to kill the running process.
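
The rules above can be sketched as a pair of small decision functions. This is a hypothetical illustration in Python: the function names and the "remove"/"archive"/"keep" labels are invented, and only the thresholds and job states come from the description.

```python
from datetime import timedelta

# Job states that are cleaned out of the log, as listed above.
UNFINISHED = {"Scheduled", "Ready", "Running", "Waiting", "Job proxy is expired"}

def log_entry_action(status, age):
    """Decide what to do with a log entry of the given age (a timedelta)."""
    if status in UNFINISHED and age > timedelta(hours=12):
        return "remove"   # stuck jobs are dropped after 12 hours
    if age > timedelta(days=7):
        return "archive"  # everything else is archived after 7 days
    return "keep"

def lock_is_stale(age):
    """A lock file older than 60 minutes is treated as left behind."""
    return age > timedelta(minutes=60)
```

Keeping the decision separate from the file handling makes the thresholds easy to test without touching real logs or lock files.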

OK Now

Seems to be OK now. Have put the archiving of the log files in the cron job to run once a day.

Web page not updated

The web page has not been updated since 4.30 this morning. The previous lock file was not removed for some reason. Deleted it by hand. Job submission seems to have been OK.

Thursday 22 February 2007

Jobs restarted

Cleaned out all pending jobs and restarted submission.

Problem understood

The file system could not cope with me storing all the job outputs in one place. The code is being changed so that each day's worth of jobs is in a separate directory. Job submission will be restarted once this has been debugged.
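
A minimal sketch of the per-day layout in Python, assuming a directory name of the form YYYY-MM-DD under some base path. The naming scheme and function names are assumptions, not the layout actually used.

```python
import os

def output_dir(base, submitted):
    """Directory for the outputs of a job submitted at the given datetime."""
    return os.path.join(base, submitted.strftime("%Y-%m-%d"))

def make_output_dir(base, submitted):
    """Create the day's directory if it does not exist and return its path."""
    path = output_dir(base, submitted)
    os.makedirs(path, exist_ok=True)
    return path
```

Each directory then holds at most one day of jobs, which also makes archiving or deleting old outputs a matter of handling whole directories.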

Strange State

Not sure what is going on. Summary is all yellow but individual jobs look OK. It appears atest isn't getting the status properly. Stopped submission till I sort it out.

Wednesday 21 February 2007

Sheffield replica failed again


[lloyd@heppc005 atest]$ export LFC_HOST=lfc.gridpp.rl.ac.uk
[lloyd@heppc005 atest]$ lcg-rep -v -t 120 --vo atlas -d lcgse1.shef.ac.uk srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/atlas/generated/2007-01-19/filef607aa8a-5ee3-44a8-bf0f-f0ccbb73aecb
Using grid catalog type: lfc
Using grid catalog : lfc.gridpp.rl.ac.uk
lcg_rep: Permission denied

Looking Good

Everything looking much better today. Changed the kill time back to 8 hours.

Tuesday 20 February 2007

Status

Looking OK at the moment - 70% success

System restarted

Cleaned out all remaining jobs at IC from my system and restarted everything using the 2nd RAL RB.

Imperial RB is very slow

The Imperial RB is very slow and there is a big backlog of jobs still to be processed ('waiting'). I have stopped job submission and switched the RB to lcgrb02. I have also extended the cancellation time from 8 to 20 hours till the backlog is cleared.