Steve Lloyd's ATLAS Grid Tests

Thursday, 8 March 2007

Old Logs Available

There is a new link "All Logs" which allows one to look at old log and output files. These can be filtered by institute. It's a bit flaky because the webserver isn't really up to it. Logs before February 8 have been lost.

Monday, 5 March 2007

Update RB config file

In my atlas.config file I changed

NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";

to

NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};

to try to use load balancing across the two RBs.
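If the list of RBs changes again, the two lines can be generated rather than edited by hand. This is only a sketch: the function name rb_config_lines is made up for illustration, and the ports 7772 and 9000 are simply the ones used in the entries above.

```python
def rb_config_lines(hosts, ns_port=7772, lb_port=9000):
    """Return NSAddresses/LBAddresses lines for one or more RB hosts.

    Hypothetical helper: the output format just mirrors the lines
    shown in this post (plain string for one host, braces for several).
    """
    if not hosts:
        raise ValueError("at least one RB host is required")
    ns = [f'"{h}:{ns_port}"' for h in hosts]
    lb = [f'"{h}:{lb_port}"' for h in hosts]
    if len(hosts) == 1:
        # Single RB: plain quoted strings, as in the old config.
        return [f"NSAddresses = {ns[0]};", f"LBAddresses = {lb[0]};"]
    # Several RBs: a list of NS addresses, and one inner list per LB.
    ns_value = "{" + ",".join(ns) + "}"
    lb_value = "{" + ",".join("{" + a + "}" for a in lb) + "}"
    return [f"NSAddresses = {ns_value};", f"LBAddresses = {lb_value};"]


if __name__ == "__main__":
    rbs = ["lcgrb01.gridpp.rl.ac.uk", "lcgrb02.gridpp.rl.ac.uk"]
    print("\n".join(rb_config_lines(rbs)))
```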

Sunday, 4 March 2007

Working again

I commented out the LoggingDestination line in heppc009:/opt/edg/etc/edg_wl_ui_cmd_var.conf. Everything seems to be OK again now.

Friday, 2 March 2007

RAL RB Broken

All job submission fails:


Selected Virtual Organisation name (from --config-vo option): atlas
Connecting to host lcgrb02.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
**** Error: API_NATIVE_ERROR ****  
Error while calling the "edg_wll_RegisterJobSync" native api 
Unable to Register the Job:
https://lcgrb02.gridpp.rl.ac.uk:9000/rdV_Ep9fG_oqlBypG--UBQ
to the LB logger at: lcgrb01.gridpp.rl.ac.uk:9002
No route to host (edg_wll_ssl_connect())

Why does it try to use lcgrb01.gridpp.rl.ac.uk for logging when I have this in my conf file:


[
VirtualOrganisation = "atlas";
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";
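A mismatch like this can be caught automatically by scanning the submission output for the "Connecting to host" and "Logging to host" lines and comparing the logging host with the one expected from the config. A rough sketch, assuming the output looks exactly like the excerpt above; the function names are made up for illustration.

```python
import re

# Matches lines such as "Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002".
HOST_LINE = re.compile(
    r"^(Connecting|Logging) to host ([^\s,]+), port (\d+)", re.MULTILINE
)


def hosts_used(output):
    """Map 'Connecting'/'Logging' to the (host, port) reported in the output."""
    return {
        m.group(1): (m.group(2), int(m.group(3)))
        for m in HOST_LINE.finditer(output)
    }


def logging_host_mismatch(output, expected_host):
    """Return the logging host if it differs from expected_host, else None."""
    entry = hosts_used(output).get("Logging")
    if entry is None or entry[0] == expected_host:
        return None
    return entry[0]
```

Feeding it the failing output above with lcgrb02.gridpp.rl.ac.uk as the expected host would flag lcgrb01.gridpp.rl.ac.uk.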

Manchester ce01 Off

There are no suitable queues on Manchester ce01 and all my jobs fail so I've switched it off for the time being. ce02 is OK.

Thursday, 1 March 2007

Splitting Manchester

At Alessandra's request I am splitting Manchester into two - ce01 and ce02, reading from dcache01 and dcache02 respectively. At the moment it isn't quite working because of a BDII problem somewhere.

Minor problem

Attempts to make everything fully automatic by killing old processes and deleting their lock files failed because of a bug. Everything stopped overnight. Hopefully now OK.
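For reference, the general idea of the cleanup looks something like the sketch below. This is not the actual test script: it assumes a lock file that holds the PID of the process that created it and that counts as stale once it is older than a chosen age. The function name and that file layout are both assumptions.

```python
import os
import signal
import time


def clear_stale_lock(lock_path, max_age_seconds, now=None):
    """Remove lock_path if it is older than max_age_seconds.

    Assumes (hypothetically) that the lock file contains the owner's PID.
    Returns True if the lock was removed, False if absent or still fresh.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False  # nothing to clean up
    if age < max_age_seconds:
        return False  # lock is recent, leave the running job alone
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGTERM)  # ask the old process to stop
    except (ValueError, ProcessLookupError, PermissionError):
        pass  # no usable PID, or the process has already gone
    os.remove(lock_path)
    return True
```

Checking the age before touching anything means a job that is still running normally is never killed, so a bug in the cleanup is less likely to stop everything.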