Thursday, 29 March 2007
Liverpool replica disappeared again
The replica at Liverpool disappeared again but was still in the catalogue. We do not know why. Carl has remade it now.
Wednesday, 28 March 2007
Adjust times for Summer Time
All timestamps appear to be in GMT, so I changed the web page so that the time of the last job submitted takes summer time into account properly. There are some internal conversions that still need to be checked.
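For anyone checking similar conversions, here is a minimal sketch of the GMT to UK local time step. It is not the code behind the web page. It assumes naive datetime values that are known to be in GMT and applies the rule that British Summer Time runs from 01:00 GMT on the last Sunday of March to 01:00 GMT on the last Sunday of October. The function names are made up for illustration.

```python
from datetime import datetime, timedelta

def last_sunday(year, month):
    """Return midnight on the last Sunday of the given month (naive datetime)."""
    # Take the first day of the following month and step back one day.
    if month == 12:
        first_next = datetime(year + 1, 1, 1)
    else:
        first_next = datetime(year, month + 1, 1)
    day = first_next - timedelta(days=1)
    # weekday(): Monday is 0 and Sunday is 6, so this steps back to a Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)

def gmt_to_uk_local(ts_gmt):
    """Convert a naive GMT datetime to naive UK wall-clock time (hypothetical helper)."""
    start = last_sunday(ts_gmt.year, 3).replace(hour=1)   # BST begins 01:00 GMT
    end = last_sunday(ts_gmt.year, 10).replace(hour=1)    # BST ends 01:00 GMT
    if start <= ts_gmt < end:
        return ts_gmt + timedelta(hours=1)
    return ts_gmt

# A job submitted at noon GMT on 28 March 2007 is shown as 13:00 local time.
print(gmt_to_uk_local(datetime(2007, 3, 28, 12, 0)))
```

Doing the conversion once, at display time, and keeping everything internal in GMT avoids having to track which values have already been shifted.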
Tuesday, 27 March 2007
Catalogue broken?
I cannot find any information about any of my replicas. However, the replica on disk at Glasgow (at least) is apparently OK. I have raised a GGUS ticket.
Friday, 16 March 2007
Summary Problem
There was a problem with the summary csv file having a few entries with only one event. This was due to entries in the original log file being out of time sequence. The file is now sorted before the summary is made.
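The effect is easy to reproduce with any summary that merges only adjacent entries. The sketch below is an illustration, not the real summary script: the log format (a timestamp string plus an event name), the hourly grouping and the function name are all assumptions.

```python
from itertools import groupby

def summarise(entries, presort=True):
    """Count events per hour from (timestamp, event) pairs.

    groupby only merges adjacent items, so out-of-sequence entries
    produce extra groups unless the entries are sorted first.
    """
    if presort:
        entries = sorted(entries, key=lambda e: e[0])
    # "YYYY-MM-DD HH" prefix of the timestamp string identifies the hour.
    hour = lambda e: e[0][:13]
    return [(key, len(list(group))) for key, group in groupby(entries, key=hour)]

log = [
    ("2007-03-16 08:05", "submitted"),
    ("2007-03-16 09:10", "submitted"),
    ("2007-03-16 08:40", "done"),      # out of time sequence
    ("2007-03-16 09:30", "done"),
]

print(summarise(log, presort=False))  # four groups of one event each
print(summarise(log))                 # two groups of two events each
```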
Hung up
The system was hung up this morning. Although jobs were being submitted overnight, my script couldn't find their status or get their output. I did it by hand and it was OK. Perhaps there was a temporary RB problem.
Thursday, 15 March 2007
Bug fixed
The submission system was hung up this morning. This was due to a job being in a "Waiting" state exposing a bug in the script and causing it to abort. This is hopefully now fixed and the backlog has been cleared.
Wednesday, 14 March 2007
Filter Logs on Status
Added the option to filter all log files on the job status, e.g. Failed, Aborted, Not successful etc.
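A rough sketch of this kind of filter follows. It is not the code used on the web page. It assumes each log line is comma separated with the status in the last field, and it treats "Not successful" as "anything other than a given success status". Both the line format and the function name are invented for the example.

```python
def filter_by_status(lines, wanted=None, not_status=None):
    """Select log lines by the status in their last comma-separated field.

    wanted:     iterable of statuses to keep (case-insensitive), or None
    not_status: keep every line whose status differs from this one, or None
    """
    keep = {w.lower() for w in wanted} if wanted is not None else None
    result = []
    for line in lines:
        status = line.split(",")[-1].strip().lower()
        if keep is not None and status not in keep:
            continue
        if not_status is not None and status == not_status.lower():
            continue
        result.append(line)
    return result

lines = [
    "job1,Glasgow,Done",
    "job2,Liverpool,Aborted",
    "job3,Manchester,Failed",
]

print(filter_by_status(lines, wanted=["Failed", "Aborted"]))
print(filter_by_status(lines, not_status="Done"))
```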
Thursday, 8 March 2007
Old Logs Available
There is a new link "All Logs" which allows one to look at old log and output files. These can be filtered by institute. It's a bit flaky because the webserver isn't really up to it. Logs before February 8 have been lost.
Monday, 5 March 2007
Update RB config file
In my atlas.config file change
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";
to
NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};
to try and use load balancing.
Sunday, 4 March 2007
Working again
Commented out the LoggingDestination line in heppc009:/opt/edg/etc/edg_wl_ui_cmd_var.conf. Everything seems to be OK again now.
Friday, 2 March 2007
RAL RB Broken
All job submission fails:
Selected Virtual Organisation name (from --config-vo option): atlas
Connecting to host lcgrb02.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
**** Error: API_NATIVE_ERROR ****
Error while calling the "edg_wll_RegisterJobSync" native api
Unable to Register the Job:
https://lcgrb02.gridpp.rl.ac.uk:9000/rdV_Ep9fG_oqlBypG--UBQ
to the LB logger at: lcgrb01.gridpp.rl.ac.uk:9002
No route to host (edg_wll_ssl_connect())
Why does it try and use lcgrb01.gridpp.rl.ac.uk when I have this in my conf file:
[
VirtualOrganisation = "atlas";
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";
Manchester ce01 Off
There are no suitable queues on Manchester ce01 and all my jobs fail so I've switched it off for the time being. ce02 is OK.
Thursday, 1 March 2007
Splitting Manchester
At Alessandra's request I am splitting Manchester into two: ce01 and ce02, reading from dcache01 and dcache02 respectively. At the moment it isn't quite working because of a BDII problem somewhere.
Minor problem
Attempts to make everything fully automatic by killing old processes and deleting their lock files failed because of a bug. Everything stopped overnight. Hopefully now OK.
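One cautious way to handle leftover lock files is sketched below. This is not the script that failed and it does not kill anything: it only removes a lock whose owning process has already gone. It assumes a POSIX system and a lock file that contains nothing but the owner's PID. The function names are hypothetical.

```python
import os

def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 tests for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True

def clear_stale_lock(lock_path):
    """Remove lock_path if the PID stored in it is no longer running.

    Returns True if the lock is absent or was removed, False if still held.
    """
    try:
        with open(lock_path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return True
    except ValueError:
        pid = None       # unreadable contents: treat the lock as stale
    if pid is not None and pid_alive(pid):
        return False
    os.remove(lock_path)
    return True
```

Checking that the owner is really dead before deleting the lock means a slow but healthy run is never mistaken for a hung one. Deciding when a live process is old enough to kill is a separate question that this sketch leaves out.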