Steve Lloyd's ATLAS Grid Tests: March 2007

Thursday, 29 March 2007

Liverpool replica disappeared again

The replica at Liverpool disappeared again but was still in the catalogue. We do not know why. Carl has remade it now.

Wednesday, 28 March 2007

Adjust times for Summer Time

All timestamps appear to be in GMT, so I changed the web page so that the time of the last job submitted now takes summer time into account properly. Some internal conversions still need to be checked.
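A minimal sketch of this kind of conversion, assuming the stored timestamps look like "YYYY-MM-DD HH:MM:SS" and that the page should show UK local time. The function name to_local and the timestamp layout are illustrative assumptions, not taken from the actual scripts.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, needs the system time zone database

# Assumption: the web page should display UK local time.
UK = ZoneInfo("Europe/London")


def to_local(gmt_stamp: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS' GMT timestamp to UK local time."""
    # Assumed timestamp layout; the real log format may differ.
    utc = datetime.strptime(gmt_stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # astimezone applies the summer time offset when it is in force.
    return utc.astimezone(UK).strftime("%Y-%m-%d %H:%M:%S %Z")


if __name__ == "__main__":
    print(to_local("2007-03-28 12:00:00"))
```

Attaching the time zone to the stored value and converting only for display keeps the logs themselves in GMT, so nothing has to be rewritten when the clocks change.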

Replicas back

The RAL catalogue seems to be fixed now and analysis jobs are running again.

Tuesday, 27 March 2007

Catalogue broken?

I cannot find any information about any of my replicas. However, the replica on disk at Glasgow (at least) is apparently OK. I have raised a GGUS ticket.

Friday, 16 March 2007

Summary Problem

There was a problem with the summary csv file having a few entries with only one event. This was due to entries in the original log file being out of time sequence. The file is now sorted before the summary is made.
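The sorting step might look something like the sketch below. It assumes each log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp followed by a comma; that layout and the name sort_log_lines are hypothetical, since the real log format is not shown here.

```python
from datetime import datetime


def sort_log_lines(lines):
    """Return log lines ordered by their leading timestamp."""
    def stamp(line):
        # Assumed layout: 'YYYY-MM-DD HH:MM:SS,rest-of-entry'
        return datetime.strptime(line.split(",", 1)[0], "%Y-%m-%d %H:%M:%S")

    # sorted() is stable, so entries with identical timestamps keep
    # their original relative order.
    return sorted(lines, key=stamp)


if __name__ == "__main__":
    sample = [
        "2007-03-16 02:00:00,second",
        "2007-03-16 01:00:00,first",
    ]
    for line in sort_log_lines(sample):
        print(line)
```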

Hung up

The system was hung up this morning. Although jobs were being submitted overnight, my script couldn't find their status or get their output. I did it by hand and it was OK. Perhaps there was a temporary RB problem.

Thursday, 15 March 2007

A Sheffield Replica Finally

Finally managed to make a replica at Sheffield!

Bug fixed

The submission system was hung up this morning. A job in the "Waiting" state exposed a bug in the script, which caused it to abort. This is hopefully now fixed and the backlog has been cleared.
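One way to guard against this kind of failure is to classify every status explicitly and treat anything unrecognised as "unknown" rather than letting the script abort. The sketch below is illustrative only: the function name classify and the two status sets are assumptions, not the contents of the actual script.

```python
# Assumed status names; the real script may track a different list.
FINISHED = {"Done", "Aborted", "Cancelled", "Cleared"}
PENDING = {"Submitted", "Waiting", "Ready", "Scheduled", "Running"}


def classify(status):
    """Map a job status string to 'finished', 'pending' or 'unknown'."""
    status = status.strip()
    if status in FINISHED:
        return "finished"
    if status in PENDING:
        return "pending"
    # An unexpected status is reported, not raised, so one odd job
    # cannot stop the whole submission loop.
    return "unknown"


if __name__ == "__main__":
    for s in ("Waiting", "Done", "SomethingNew"):
        print(s, "->", classify(s))
```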

Wednesday, 14 March 2007

Filter Logs on Status

Added the option to filter all log files on the job status, e.g. Failed, Aborted, Not successful.
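A filter of this sort can be sketched as follows, assuming each log entry has already been parsed into a dictionary with a "status" key. The record layout and the name filter_by_status are hypothetical.

```python
def filter_by_status(entries, wanted):
    """Keep only the entries whose status is one of the wanted values."""
    # Compare case-insensitively so 'failed' and 'Failed' both match.
    wanted = {s.lower() for s in wanted}
    return [e for e in entries if e["status"].lower() in wanted]


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    entries = [
        {"site": "Glasgow", "status": "Success"},
        {"site": "Liverpool", "status": "Failed"},
        {"site": "Sheffield", "status": "Aborted"},
    ]
    for entry in filter_by_status(entries, ["Failed", "Aborted"]):
        print(entry["site"], entry["status"])
```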

Thursday, 8 March 2007

Old Logs Available

There is a new link "All Logs" which allows one to look at old log and output files. These can be filtered by institute. It's a bit flaky because the webserver isn't really up to it. Logs before February 8 have been lost.

Monday, 5 March 2007

Update RB config file

In my atlas.config file, changed

NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";

to

NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};

to try to use load balancing.

Sunday, 4 March 2007

Working again

Commented out the LoggingDestination line in heppc009:/opt/edg/etc/edg_wl_ui_cmd_var.conf. Everything seems to be OK again now.

Friday, 2 March 2007

RAL RB Broken

All job submission fails:


Selected Virtual Organisation name (from --config-vo option): atlas
Connecting to host lcgrb02.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
**** Error: API_NATIVE_ERROR ****  
Error while calling the "edg_wll_RegisterJobSync" native api 
Unable to Register the Job:
https://lcgrb02.gridpp.rl.ac.uk:9000/rdV_Ep9fG_oqlBypG--UBQ
to the LB logger at: lcgrb01.gridpp.rl.ac.uk:9002
No route to host (edg_wll_ssl_connect())

Why does it try and use lcgrb01.gridpp.rl.ac.uk when I have this in my conf file:


[
VirtualOrganisation = "atlas";
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";

Manchester ce01 Off

There are no suitable queues on Manchester ce01 and all my jobs fail so I've switched it off for the time being. ce02 is OK.

Thursday, 1 March 2007

Splitting Manchester

At Alessandra's request I am splitting Manchester into two: ce01 and ce02, reading from dcache01 and dcache02 respectively. At the moment it isn't quite working because of a BDII problem somewhere.

Minor problem

Attempts to make everything fully automatic by killing old processes and deleting their lock files failed because of a bug. Everything stopped overnight. Hopefully now OK.
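A minimal sketch of stale-lock cleanup, assuming each lock file contains the PID of the process that created it and that a lock older than some maximum age counts as stale. The function name clear_stale_lock, the PID-in-file convention and the age threshold are all assumptions, not a description of the real scripts.

```python
import os
import signal
import time


def clear_stale_lock(lock_path, max_age_s):
    """Kill the process named in an old lock file and remove the file.

    Returns True if the lock is absent or was cleared, False if it is
    still fresh and was left alone.
    """
    pid = None
    try:
        age = time.time() - os.path.getmtime(lock_path)
        with open(lock_path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return True  # nothing to clean up
    except ValueError:
        pid = None  # unreadable PID: only the file can be removed

    if age < max_age_s:
        return False  # lock is recent, leave the process running

    if pid is not None:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process has already gone
    os.remove(lock_path)
    return True
```

Checking the age before killing anything means a job that is merely slow is left alone, and a lock whose process has already died is simply removed.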
