Wednesday, 19 December 2007
Major Disruption
There was major disruption over the last couple of days after a RAID array died and various things had to be recovered from backup. Everything should be running again now, but there is some history missing between 13 and 19 Dec. This could be recovered but is probably not worth it. The system is 'at risk' from now till 3 Jan (no 24x7 here!). Merry Christmas and a Happy New Year to all who use my test results.
Thursday, 6 December 2007
ATLAS Tests now all use release 13.0.30
I have finally managed to create some 13.0.30 AOD and upload it to (most) UK SEs. All the tests are now using 13.0.30. QMUL, UCL_CCC, Lancaster and Edinburgh are not getting the analysis job at the moment as I could not make the replica at these sites.
Tuesday, 4 December 2007
Problems at RAL
All the RAL RBs seem to be 'stuck' and the LFC is not reachable. I have switched to the ScotGrid RB to try and get the ATLAS tests running again.
Monday, 29 October 2007
Many Changes
So finally things have stabilized. The ATLAS and RB tests are running properly again using a database (rather than text files) to improve reliability. The php scripts that provided the summaries have been replaced by static pages to stop the webserver errors. In addition there are now tests of the UK top level BDIIs:
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/bdiitest.html
All the tests are summarised on one page (accessible from the GridPP home page):
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
Plots of SAM performance are also available:
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/samplots.html
Thursday, 6 September 2007
Many problems and possible solutions
There have been many problems over the summer:
RB Tests: I still cannot submit jobs through the IC or Glasgow RBs due to atlas/dteam conflicts. There have been other problems due to a build-up of jobs, problems updating my (flat) database of results, an overloaded server, etc.
ATLAS Tests: These died on 17 August when there was a power cut and have only recently been revived. They are OK at the moment.
SAM Tests: There was a problem in calculating 'availability' but this only affected the history (for the Tier-1?) and has been fixed.
I am rewriting some of the code and replacing my flat database with SQLite. This has been done for the RB Tests and seems to be OK. There are a few more tweaks to do, then I will move the ATLAS tests to the same framework. This should make it more robust.
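For illustration, the move from a flat file to SQLite amounts to something like the sketch below (Python assumed; the table and column names are made up, not the real ones):

import sqlite3

def open_db(path="results.db"):
    # one table replacing the flat results file; the schema is illustrative only
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS results (
                       jobid     TEXT PRIMARY KEY,
                       site      TEXT,
                       submitted TEXT,
                       status    TEXT)""")
    return con

def record_result(con, jobid, site, submitted, status):
    # INSERT OR REPLACE avoids the half-updated entries a flat file can be
    # left with when a script is interrupted mid-rewrite
    con.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
                (jobid, site, submitted, status))
    con.commit()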
Friday, 6 July 2007
dteam Woes
In order to de-ATLAS my test jobs (especially for the RB tests) I joined the dteam VO. This was a disaster to start with, as my ATLAS jobs failed at some sites due to VOMS/gridmap file issues and bugs in edg-job-submit. These seem to be more or less resolved now, except that all the jobs I submit through the Glasgow RB never run. They get submitted to a CE but stay in the queue forever. No-one seems to know why.
RB Tests
There is a new set of tests targeted at UK Resource Brokers. These send (non-ATLAS) "Hello World" jobs to each UK RB every 10 minutes to execute on any UK CE. See
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/rbtest.html.
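Roughly speaking, each test amounts to something like this sketch (Python; the file names and the exact edg-job-submit options shown are illustrative, not the ones the real script uses):

import subprocess

# a trivial JDL that just echoes "Hello World" on whatever CE the RB picks
JDL = '''Executable    = "/bin/echo";
Arguments     = "Hello World";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
'''

def submit(rb_conf, vo="dteam", idfile="jobids.txt"):
    with open("hello.jdl", "w") as f:
        f.write(JDL)
    # rb_conf is a per-RB configuration file so each UK RB can be exercised
    # in turn; -o appends the returned job ID to a file for later follow-up
    return subprocess.call(["edg-job-submit", "--vo", vo, "--config", rb_conf,
                            "-o", idfile, "hello.jdl"])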
Tuesday, 19 June 2007
RB Logging Problems
Although I switched to lcgrb02 the logging was still going to lcgrb01. This is because it is hardwired into /opt/edg/etc/edg_wl_ui_cmd_var.conf and can't be overridden by your own conf file (how dumb is that?). I commented out LoggingDestination and it should be OK now. Stephen B reminds me that it is documented here:
http://www.gridpp.ac.uk/deployment/users/faq.html#chooserb
More RB problems
lcgrb01 at RAL is now giving problems similar to those seen with lcgrb02 a couple of weeks ago. I've switched to using lcgrb02 only for the time being.
Monday, 18 June 2007
ATLAS Release 13 and replica problems
There have been a few problems in the last few days. Firstly, some sites installed ATLAS release 13.0.10, which broke some of the tests; at the moment the tests only run on release 12.0.6 even if 13.0.10 is installed. Secondly, something has happened to the UI at QMUL such that I cannot find out about my file replicas any more. This prevented analysis jobs being submitted over the weekend. The current workaround is to use a fixed list of replicas rather than obtaining the replica information dynamically.
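The workaround is essentially a hard-coded lookup instead of a catalogue query, along these lines (a sketch only; the site names and SURLs are placeholders, not the real entries):

# fixed site -> replica map used while the catalogue lookup is unavailable;
# every entry here is a placeholder
REPLICAS = {
    "UKI-SCOTGRID-GLASGOW":  "srm://<se.example>/<path>/<file>",
    "UKI-NORTHGRID-LIV-HEP": "srm://<se.example>/<path>/<file>",
}

def replica_for(site):
    # returns None if a site has no replica, in which case no analysis job
    # is sent there
    return REPLICAS.get(site)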
Tuesday, 22 May 2007
Several problems
On Saturday morning (19 May) the machine that submits the test jobs crashed. On Sunday the system was moved to a different machine, but now ~50% of jobs fail to complete. This appears to be correlated with them going through lcgrb02.gridpp.rl.ac.uk rather than lcgrb01.gridpp.rl.ac.uk and looks like an RB problem rather than a problem at my end. I have now switched to the IC RB to see if this improves things.
Friday, 11 May 2007
More hangups
The system hung up again trying to retrieve the job output. I have changed the script so that this is now done in a separate thread which is killed after n minutes, so as not to hang the whole script.
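One way to get that effect (a sketch, not the actual code) is to run the retrieval command as a child process and kill it on a timer if it overruns:

import os
import signal
import subprocess
import threading

def get_output(jobid, timeout_minutes=10):
    # run the retrieval as a child process so a hung retrieval can no longer
    # stall the whole script
    p = subprocess.Popen(["edg-job-get-output", jobid])
    def _kill():
        try:
            os.kill(p.pid, signal.SIGKILL)
        except OSError:
            pass                     # process already finished
    timer = threading.Timer(timeout_minutes * 60, _kill)
    timer.start()
    rc = p.wait()      # returns as soon as the command finishes
    timer.cancel()     # don't kill a process that finished in time
    return rc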
Tuesday, 1 May 2007
Still messed up
Things are still messed up. The script seems to hang trying to retrieve output. I have made a slight change and suspended submission until I clear the backlog.
Monday, 30 April 2007
Something messed up this weekend
Over the weekend the system got messed up. Job submission seemed to be OK and the jobs ran OK but my scripts kept hanging getting the output back. No idea why. It seems to be OK again now.
Friday, 27 April 2007
Slight bug in list of logs fixed
There was a small bug in the PHP script for the web page causing some jobs not to appear in the filtered list of jobs on the "All Logs" page although they did appear on the main page. This is now fixed. (Stupid strpos returns an integer that can be 0 if there is a match at the first character, but false if there is no match. You have to use === or !== to test it.)
Thursday, 29 March 2007
Liverpool replica disappeared again
The replica at Liverpool disappeared again but was still in the catalogue. We do not know why. Carl has remade it now.
Wednesday, 28 March 2007
Adjust times for Summer Time
All timestamps appear to be in GMT, so I changed the web page so that the time of the last job submitted takes summer time into account properly. There are some internal conversions that still need to be checked.
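For reference, letting a timezone database decide whether BST applies avoids hand-rolled offset rules; a minimal sketch, assuming Python with the pytz package (not necessarily what the page itself does):

from datetime import datetime
import pytz

LONDON = pytz.timezone("Europe/London")

def gmt_to_local(ts):
    # ts is a Unix timestamp recorded in GMT/UTC; the timezone database
    # decides whether summer time applies on that date
    return datetime.fromtimestamp(ts, pytz.utc).astimezone(LONDON)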
Tuesday, 27 March 2007
Catalogue broken?
I cannot find out any information about any of my replicas. However apparently the replica on disk at Glasgow (at least) is OK. Have raised a GGUS ticket.
Friday, 16 March 2007
Summary Problem
There was a problem with the summary CSV file having a few entries with only one event. This was due to entries in the original log file being out of time sequence. The file is now sorted before the summary is made.
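The fix is simply to sort on the timestamp before summarising, along these lines (a sketch with an assumed log layout; the real format may differ):

def sorted_log_lines(path):
    # assumes one entry per line with the timestamp as the first field,
    # in a form that sorts correctly as a string (e.g. ISO date or epoch)
    with open(path) as f:
        lines = [l.rstrip("\n") for l in f if l.strip()]
    return sorted(lines, key=lambda l: l.split()[0])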
Hung up
The system was hung up this morning. Although jobs were being submitted overnight, my script couldn't find their status or get their output. I did it by hand and it was OK. Perhaps there was a temporary RB problem.
Thursday, 15 March 2007
Bug fixed
The submission system was hung up this morning. This was due to a job being in a "Waiting" state exposing a bug in the script and causing it to abort. This is hopefully now fixed and the backlog has been cleared.
Wednesday, 14 March 2007
Filter Logs on Status
Added the option to filter all log files on the job status, e.g. Failed, Aborted, Not successful, etc.
Thursday, 8 March 2007
Old Logs Available
There is a new link "All Logs" which allows one to look at old log and output files. These can be filtered by institute. It's a bit flaky because the webserver isn't really up to it. Logs before February 8 have been lost.
Monday, 5 March 2007
Update RB config file
In my atlas.config file change
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";
to
NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};
to try and use load balancing.
Sunday, 4 March 2007
Working again
Commented out the LoggingDestination line in heppc009:/opt/edg/etc/edg_wl_ui_cmd_var.conf. Everything seems to be OK again now.
Friday, 2 March 2007
RAL RB Broken
All job submission fails:
Selected Virtual Organisation name (from --config-vo option): atlas
Connecting to host lcgrb02.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
**** Error: API_NATIVE_ERROR ****
Error while calling the "edg_wll_RegisterJobSync" native api
Unable to Register the Job:
https://lcgrb02.gridpp.rl.ac.uk:9000/rdV_Ep9fG_oqlBypG--UBQ
to the LB logger at: lcgrb01.gridpp.rl.ac.uk:9002
No route to host (edg_wll_ssl_connect())
Why does it try and use lcgrb01.gridpp.rl.ac.uk when I have this in my conf file:
[
VirtualOrganisation = "atlas";
NSAddresses = "lcgrb02.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb02.gridpp.rl.ac.uk:9000";
Manchester ce01 Off
There are no suitable queues on Manchester ce01 and all my jobs fail so I've switched it off for the time being. ce02 is OK.
Thursday, 1 March 2007
Splitting Manchester
At Alessandra's request I am splitting Manchester into two - ce01 and ce02, reading from dcache01 and dcache02 respectively. At the moment it isn't quite working because of a BDII problem somewhere.
Minor problem
Attempts to make everything fully automatic by killing old processes and deleting their lock files failed because of a bug. Everything stopped overnight. Hopefully now OK.
Tuesday, 27 February 2007
Proxy expired
All job submission stopped this afternoon because my proxy expired. Now running again.
Monday, 26 February 2007
Cleanup not quite right
The cleanup didn't work quite right as it cleared the entries out of the log file but not out of the if file and so they kept coming back. An attempt to fix this yesterday failed because of a typo. Trying again.
Friday, 23 February 2007
Automatic cleanup
Have introduced some automatic cleanup via the cron job. Scheduled, Ready, Running, Waiting and "Job proxy is expired" jobs are removed from the log after 12 hours. The logs are archived after 7 days. Lock files are removed after 60 minutes and an attempt is made to kill the running process.
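The lock-file part of the cleanup is roughly the sketch below (assumed convention: the lock file holds the PID of the process that created it; the real script may differ):

import os
import signal
import time

def clear_stale_lock(lockfile, max_age_minutes=60):
    if not os.path.exists(lockfile):
        return
    if time.time() - os.path.getmtime(lockfile) < max_age_minutes * 60:
        return                         # lock is still fresh, leave it alone
    try:
        pid = int(open(lockfile).read().strip())
        os.kill(pid, signal.SIGKILL)   # try to kill the stuck process
    except (ValueError, OSError):
        pass                           # no valid PID, or it has already gone
    os.remove(lockfile)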
OK Now
Seems to be OK now. Have put the archiving of the log files in the cron job to run once a day.
Web page not updated
The web page has not been updated since 4.30 this morning. The previous lock file was not removed for some reason. Deleted it by hand. Job submission seems to have been OK.
Thursday, 22 February 2007
Problem understood
The file system could not cope with me storing all the job outputs in one place. The code is being changed so that each day's worth of jobs is in a separate directory. Job submission will be restarted once this has been debugged.
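The change amounts to something like this (a sketch; the directory layout shown is an assumption):

import os
import time

def output_dir(base="joboutput"):
    # one sub-directory per day keeps any single directory small enough for
    # the file system to cope with
    d = os.path.join(base, time.strftime("%Y-%m-%d"))
    if not os.path.isdir(d):
        os.makedirs(d)
    return d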
Strange State
Not sure what is going on. Summary is all yellow but individual jobs look OK. It appears atest isn't getting the status properly. Stopped submission till I sort it out.
Wednesday, 21 February 2007
Sheffield replica failed again
[lloyd@heppc005 atest]$ export LFC_HOST=lfc.gridpp.rl.ac.uk
[lloyd@heppc005 atest]$ lcg-rep -v -t 120 --vo atlas -d lcgse1.shef.ac.uk \
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/atlas/generated/2007-01-19/filef607aa8a-5ee3-44a8-bf0f-f0ccbb73aecb
Using grid catalog type: lfc
Using grid catalog : lfc.gridpp.rl.ac.uk
lcg_rep: Permission denied
Tuesday, 20 February 2007
System restarted
Cleaned out all remaining jobs at IC from my system and restarted everything using the 2nd RAL RB.
Imperial RB is very slow
The Imperial RB is very slow and there is a big backlog of jobs still to be processed ('waiting'). I have stopped job submission and switched the RB to lcgrb02. I have also extended the cancellation time from 8 to 20 hours till the backlog is cleared.