diff --git a/mkdocs-project-dir/docs/Status_page.md b/mkdocs-project-dir/docs/Status_page.md
index 50545110a..9806749c2 100644
--- a/mkdocs-project-dir/docs/Status_page.md
+++ b/mkdocs-project-dir/docs/Status_page.md
@@ -247,6 +247,8 @@ This page outlines that status of each of the machines managed by the Research C
 
 There are also missing files in projects so if you own a project we will give you a list of these too.
 
+#### Latest on Myriad
+
 - 2024-01-11 12:30 - Jobs on Myriad
 
   We've had questions from you about when jobs will be able to restart. We were able to assess
@@ -272,254 +274,10 @@ This page outlines that status of each of the machines managed by the Research C
 
 ### Kathleen
 
- - 2022-09-27 - Kathleen's metadata servers have started encountering the ZFS+Lustre bug that Young
-   had in the past, which causes very high load and hangs. We also discovered we were running out of
-   inodes on the metadata server - an inode exists for every file and directory, so we need a
-   reduction in the number of files on the system. We prevented new jobs from starting for the time
-   being.
-
- - 2022-10-03 - We are upgrading Kathleen's ZFS and Lustre on the metadata servers to mitigate the
-   bug. Jobs will not start running again until this is done. Quotas have been enabled. We have
-   contacted users who are currently over quota and also have jobs in the queue - their jobs are held
-   so they do not fail straight away unable to write files once jobs are restarted. These users will
-   be able to release the hold themselves once under quota again with the `qrls all` command.
-
- - 2023-03-14 09:30 - A datacentre cooling issue has caused servers in Kathleen to overheat and power
-   off. As of the afternoon, work continues to bring Kathleen back up. 16:20 - We expect to be able
-   to put Kathleen back in service tomorrow.
-
- - 2023-03-16 08:20 - Kathleen is back up and running jobs and you should be able to log in again.
-   This took a bit longer than expected as we had some configuration issues with the login nodes that
-   were fixed last thing yesterday, after which we ran some test jobs.
-   Any jobs that were running when the nodes powered down will have failed.
-
- - 2023-07-31 10:00 - Kathleen's Object Store Servers went down on Sunday at around 6am. We're
-   currently working on bringing everything back up. You won't be able to log in right now, and
-   jobs that were running at the time will have failed.
-
- - 2023-07-31 17:30 - Kathleen is running jobs again and you should be able to log in. It took a
-   bit longer to bring back up because one of the OSSs temporarily misplaced its network card -
-   after encouraging it to find it again we verified that it was working under load before
-   re-enabling jobs.
+ - 2024-01 - No current issues. The parallel filesystem will be replaced soon.
 
 ### Young
 
- - 2023-03-14 09:30 - A datacentre cooling issue has caused servers in Young to overheat and power off.
-   Some disks that hold the operating system on the Object Store Servers have failed. We cannot
-   currently bring the filesystem back up as a result. As of the afternoon, work is continuing.
-   16:20 - Work on this will need to continue tomorrow.
-
- - 2023-03-15 11:45 - Young is back up - you should be able to log in now and jobs have started running.
-   Any jobs that were running when the nodes powered down will have failed.
-   We're currently running at risk and with reduced I/O performance because we only have one OSS (Object
-   Store Server) running and it has one failed disk in its boot volume, so we are lacking in resilience
-   until that is replaced (hopefully later today - done) and the second OSS is fixed and brought back
-   up.
-
- - 2023-03-16 09:30 - Young's admin nodes have now gone offline, which will be preventing anyone from
-   logging in. We are investigating.
-
- - 2023-03-16 13:15 - Young is back up. There was a cooling issue that affected only the admin rack this
-   time which consists of admin, util and login nodes plus the Lustre storage. The compute nodes stayed
-   up and jobs kept running, but they might have failed due to Lustre being unavailable. UCL Estates
-   know the cause of the problem and so hopefully this should not happen again for the time being.
-
- - 2023-03-23 03:20 - Young's single running OSS went down due to high load and was brought back up
-   shortly before 9am. The second OSS had lost both internal drives and is likely to take several more
-   days before it can be brought back into service, so we are still running at risk.
-
- - 2023-03-24 09:00 - Young's single running OSS went down again overnight. Last night we appeared to
-   be running into a new ZFS issue, where we had a kernel panic, a reboot and a failure to start ZFS
-   resources. Our Infrastructure team brought the resources back up this morning and Lustre recovery
-   began, but now the OSS has become unresponsive again. This means that you will not be able to log in
-   at the moment and jobs will have failed because they haven't been able to access the filesystem.
-
- - 2023-03-24 13:00 - We do not currently expect to be able to bring everything up again until after
-   the weekend. Right now there is very high memory usage on the existing OSS even when only the util
-   and login nodes are running Lustre clients and all the compute nodes are off. We aren't sure why.
-
-   Next week we may need to delay bringing the compute back up until after the other OSS is fully fixed,
-   but we will update you on this then.
-
-   Sorry about this, we were running ok on one OSS until Thursday and now something is preventing us
-   from continuing like that, so we may have another underlying problem.
-
- - 2023-03-27 11:00 - The disks in the Young OSSes will be replaced tomorrow (we had another
-   failure in the OSS that is currently up, so it is down a disk again). This means that it will be
-   during Wednesday at the earliest that we're able to start jobs again, and it does depend somewhat
-   on how well everything is behaving after all the disks are replaced and we have a properly
-   resilient system again - we'll need to do some testing before allowing jobs to start even if
-   everything looks good.
-
- - 2023-03-28 17:15 - Young's filesystem is running on two OSSes again. Roughly half the compute nodes
-   are enabled and jobs are running on them. If all goes well, we'll switch on the rest of the compute
-   nodes tomorrow.
-
- - 2023-03-29 15:10 - Jobs were started on the rest of the nodes at around 10:30am and everything is
-   running ok with the exception of the GPU nodes. On the GPU nodes we are seeing Lustre client
-   evictions which cause I/O errors for jobs running on the nodes - they weren't able to complete
-   the read or write they were requesting. For now we aren't running jobs on the GPU nodes until we have
-   this sorted out.
-   You may have a job that started running there earlier today or yesterday and failed
-   or did not create all the output expected because of this - please do check your output carefully in
-   that case.
-
-   This is a new issue. We have some suspicions that it is configuration-related, since the GPU nodes
-   have two OmniPath cards each and Lustre is set up to use both. This setup was working previously; we
-   are going to investigate further.
-
- - 2023-03-30 10:40 - Last night we had two ZFS panics, one on each OSS. These occurred at just
-   before 10pm and again at 1:30am. These will have caused some jobs to fail with I/O errors. We have
-   adjusted some configuration so that if we have more panics then the filesystem should hopefully be
-   able to recover more quickly and cause fewer I/O errors.
-
-   We had a problem with ZFS panics before where it was a known issue with the versions we were running
-   and fixed it by upgrading the versions of Lustre and ZFS that we had on Young. The current issue we
-   are experiencing does not appear to be a known one, investigation continues.
-
- - 2023-04-03 10:30 - Young stayed up over the weekend and was running CPU jobs successfully. We have
-   now re-enabled the GPU nodes as well.
-
-   We are still getting some ZFS panics, but the change to how they work mean the filesystem is
-   failing over and recovering in time to only affect a few client connections. We are scanning for
-   corrupted metadata and files since this may be the cause.
-
-   We will be leaving Young up and running jobs over Easter. UCL's Easter closing is from Thurs 6 April
-   to Weds 12 April inclusive and we will be back on Thurs 13 April. If anything goes wrong during this
-   time, we won't be able to fix it until we are back.
-
- - 2023-04-04 14:20 - The `zfs scrub` we were running to look for corrupted metadata and files
-   has exposed a drive failure. We have spares and the affected ZFS pool is now recovering, which will
-   take a day or so. I/O performance may be degraded during the recovery.
-
- - 2023-06-21 16:30 - We had a filesystem issue caused by a failed disk in the metadata server
-   at about 4:30pm. We brought it back up and it was recovering at around 8:20pm and will have
-   been slower for a while overnight until recovery completed. Jobs will have failed in the
-   period when it was down, so please do check on things that were running during this time to
-   see whether they completed or not - our apologies for this.
-
- - 2023-06-27 20:15 - We’ve had an unexpectedly high number of drive failures in Young’s
-   filesystem, and this has created a significant risk of filesystem failure until we can replace
-   the drives and restore resiliency.
-
-   To try to keep the filesystem running, we’ve had to shut down the login and compute nodes,
-   terminating all running jobs and removing access to the cluster. This should massively
-   reduce the load on the filesystem, and allow us to more rapidly bring the system back to a
-   safe level of redundancy.
-
-   In future, we will also attempt to keep more spare drives on-site, which should help us avoid
-   this kind of situation.
-
-   Apologies for the inconvenience: we’re going to be replacing drives and we will hopefully have
-   an update with better news tomorrow.
-
- - 2023-06-28 11:00 The situation with the disks in Young's filesystem is worse than we thought
-   and we're currently doing some data recovery procedures so we can then evaluate what we
-   should do next.
-
-   Data in home directories is backed up.
-   We have likely lost some data from Scratch but are
-   working to find out the implications of that, and what and how much could be affected.
-
-   How long Young will be down depends on what we discover in the next few days. We will need
-   to do a full filesystem check once we have a recovered filesystem to check, and this can take
-   some time.
-
-   The detail, for those who wish to know:
-
-   A Lustre filesystem consists of object store servers and metadata servers, and the metadata
-   servers store information such as filenames, directories, access permissions, and file layout.
-
-   We had a failed drive in a metadata mirror on Wednesday last week. We have eight mirrors in
-   that metadata target and each consists of two drives. We had a replacement drive arrive, but
-   before we could fit it, the second drive in the same mirror indicated that it was beginning to
-   fail.
-
-   We fitted the replacement drive yesterday, but the rebuild of the mirror overnight did not
-   succeed and the failing disk's status degraded further. What we are going to do next is to
-   retrieve as much off the failing disk as we can, move it onto a healthy disk, and then see
-   what status the filesystem is in and what is in an inconsistent state.
-
-   As was mentioned, we had also ordered more spare drives to keep on site but this happened
-   before those arrived.
-
-   We're sorry for this situation and are working to address it. We'll keep you updated as we
-   know more.
-
- - 2023-06-29 11:30 - We've recovered what was on the failing disk and are missing approximately
-   250kB of metadata. (Since it is metadata, this doesn't directly translate to the effect on your
-   files, but we can probably be cautiously optimistic about the expected size of impact, should
-   nothing else go horribly wrong).
-
-   Replacement disks are being sent to us and should arrive today or tomorrow.
-
-   Once the disks arrive, we will proceed with the next step of moving the recovered data onto a
-   new disk and getting the filesystem to use that in replacement of the failed disk, and then we
-   can attempt to bring the filesystem up and see what status it thinks it is in. At that point we
-   will have several levels of filesystem health checking to run.
-
-   We are also progressing with some contingency plans relating to the backup data we have, to
-   make it easier to use should we need to.
-
-   I've had a couple of questions about when Young will be back up - we don't know yet, but not
-   during next week. I would expect us to still be doing the filesystem recovery and checks on
-   the first couple of days of the week even if the disks do arrive today, so we probably won't
-   have any additional updates for you until midweek.
-
-   It does take a long and unpredictable time to rebuild disks, check all the data, and recover
-   files: as an example, our metadata recovery was giving us estimates of 19 hours to 5 days while
-   it was dealing with the most damaged portions of the disk.
-
- - 2023-07-04 10:25 - I just wanted to give you an update on our progress. The short version is
-   "we are still working on it but now believe we have lost little to no user data".
-
-   We do not have an ETA for return to service but may be able to give you read-only access to
-   your files in the near future.
-
-   More detail for those who are interested:
-
-   We have fitted the replacement disks and migrated the data to the new drives. The underlying
-   ZFS filesystem now believes that everything is OK and we are running the Lustre filesystem
-   checks.
-
-   We are also restoring a backup to a different file-system so that in the event that any data
-   in /home was damaged we can restore it quickly.
-
-   Medium term, we plan to replace the storage appliance, which we are funding from UCL's budget
-   in our FY 23/24 (starts 1st August). This work should be completed by Christmas and will also
-   affect the local Kathleen system.
-
- - 2023-07-04 16:25 - I'm pleased to announce that while we are not in a position to return to
-   service yet (the Lustre filesystem check is still running), we have managed to get to a state
-   where we can safely give you read-only access to your files. You should be able to log into
-   the login nodes as usual to retrieve your files.
-
-   (Obviously, this means you cannot change any files or run jobs!)
-
- - 2023-07-07 12:30 - I didn't want to head into the weekend without giving an update – we are
-   still running the Lustre File-system Check and hope to have completed it by mid next week,
-   but unfortunately the tool that does this doesn't report its progress in a very helpful way –
-   as of this morning it had completed checking 13.7M directories on the file-system but we don't
-   have an exact number for how many there are total, only an estimate.
-
-   Again, apologies for this unplanned outage – we are working at full speed to get a replacement
-   for this hardware later this year so we should not see a recurrence of these problems once
-   that is done.
-
- - 2023-07-12 16:30 - Full access to Young is restored, and jobs are running again!
-
-   The Lustre filesystem check finished successfully earlier today and we ran some I/O-heavy test
-   jobs without issues.
-
-   We have replaced disks in the metadata server where both in a pair were still original disks,
-   to prevent this situation happening again.
-
-   We remounted the filesystem read-write and at 15:45 and 16:00 rebooted the login nodes one
-   after the other. We're now back in full service.
-
-   Please do check your space and if anything appears to be inconsistent or missing, email
-   rc-support@ucl.ac.uk
-
-   Thank you for your patience during this time.
 
 - 2023-10-26 14:50 - We seem to have a dying OmniPath switch in Young. The 32 nodes with names
   beginning `node-c12b` lost their connection to the filesystem earlier. Powercycling the switch
   only helped temporarily before it went down again. Those nodes are all currently out of service
@@ -529,6 +287,8 @@ This page outlines that status of each of the machines managed by the Research C
   (You can see in `jobhist` what the head node of a job was, and the .po file will show all the
   nodes that an MPI job ran on).
 
+ - 2024-01 - The parallel filesystem is due to be replaced soon.
+
 ### Michael
 
  - All systems are working well.
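For reference, the status notes above mention two scheduler commands: `qrls all` for releasing jobs that were held while a user was over quota, and `jobhist` together with a job's `.po` file for finding which nodes a job ran on. The following is only a minimal illustrative sketch, assuming the Grid Engine-style tooling these pages describe; exact options and output are site-specific, and the job name and ID in the `.po` filename are placeholders rather than values taken from the page.

```bash
# List your jobs; jobs held after a quota issue show a hold flag in the state column.
qstat

# Once back under quota, release the hold on your jobs (as described in the 2022-10-03 note).
qrls all

# Show your recent jobs, including the head node each one ran on
# (jobhist is the site-provided tool mentioned above; run without arguments it
# lists your own recent jobs).
jobhist

# For an MPI job, the parallel-environment output file <jobname>.po<jobid> lists
# the nodes the job ran on ("my_job" and "1234567" are placeholder values).
cat my_job.po1234567
```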