Commit 303176d (parent 23657ca), committed by heatherkellyucl on Jan 11, 2024:

Removed older outages. Mentioned new filesystem for Kathleen, Young. Added 'latest on Myriad' subheading.
1 changed file: mkdocs-project-dir/docs/Status_page.md (5 additions, 245 deletions)
@@ -247,6 +247,8 @@ This page outlines the status of each of the machines managed by the Research Computing team.
There are also missing files in projects, so if you own a project we will give you a list of
these too.

#### Latest on Myriad

- 2024-01-11 12:30 - Jobs on Myriad

We've had questions from you about when jobs will be able to restart. We were able to assess
@@ -272,254 +274,10 @@ This page outlines the status of each of the machines managed by the Research Computing team.

### Kathleen

- 2022-09-27 - Kathleen's metadata servers have started encountering the ZFS+Lustre bug that Young
had in the past, which causes very high load and hangs. We also discovered we were running out of
inodes on the metadata server - an inode exists for every file and directory, so we need a
reduction in the number of files on the system. We prevented new jobs from starting for the time
being.
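
(For reference, inode usage on a mounted filesystem can be checked with standard tools; this is a
generic illustration rather than the exact check we ran on the metadata servers, and the paths are
examples.)

```
# Show inode capacity, usage and percentage for a mount point
df -i /home

# Count the files and directories under a tree, to see where large
# numbers of files accumulate
find /home/username -xdev | wc -l
```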

- 2022-10-03 - We are upgrading Kathleen's ZFS and Lustre on the metadata servers to mitigate the
bug. Jobs will not start running again until this is done. Quotas have been enabled. We have
contacted users who are currently over quota and also have jobs in the queue - their jobs are held
so that they do not fail straight away, unable to write files, once jobs are restarted. These users
will be able to release the hold themselves once under quota again with the `qrls all` command.
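
(If your jobs were held, a minimal sketch of the check-and-release procedure, assuming the usual
login-node tools; `lquota` is the helper for checking your quota usage:)

```
# Check quota usage for your home and scratch space
lquota

# Once back under quota, release all of your held jobs
qrls all
```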

- 2023-03-14 09:30 - A datacentre cooling issue has caused servers in Kathleen to overheat and power
off. As of the afternoon, work continues to bring Kathleen back up. 16:20 - We expect to be able
to put Kathleen back in service tomorrow.

- 2023-03-16 08:20 - Kathleen is back up and running jobs and you should be able to log in again.
This took a bit longer than expected as we had some configuration issues with the login nodes that
were fixed last thing yesterday, after which we ran some test jobs.
Any jobs that were running when the nodes powered down will have failed.

- 2023-07-31 10:00 - Kathleen's Object Store Servers went down on Sunday at around 6am. We're
currently working on bringing everything back up. You won't be able to log in right now, and
jobs that were running at the time will have failed.

- 2023-07-31 17:30 - Kathleen is running jobs again and you should be able to log in. It took a
bit longer to bring back up because one of the OSSs temporarily misplaced its network card -
after encouraging it to find it again we verified that it was working under load before
re-enabling jobs.

- 2024-01 - No current issues. The parallel filesystem will soon be replaced.

### Young

- 2023-03-14 09:30 - A datacentre cooling issue has caused servers in Young to overheat and power off.
Some disks that hold the operating system on the Object Store Servers have failed. We cannot
currently bring the filesystem back up as a result. As of the afternoon, work is continuing.
16:20 - Work on this will need to continue tomorrow.

- 2023-03-15 11:45 - Young is back up - you should be able to log in now and jobs have started running.
Any jobs that were running when the nodes powered down will have failed.
We're currently running at risk and with reduced I/O performance because we only have one OSS (Object
Store Server) running and it has one failed disk in its boot volume, so we are lacking in resilience
until that is replaced (hopefully later today - done) and the second OSS is fixed and brought back
up.

- 2023-03-16 09:30 - Young's admin nodes have now gone offline, which will be preventing anyone from
logging in. We are investigating.

- 2023-03-16 13:15 - Young is back up. There was a cooling issue that affected only the admin rack this
time, which consists of the admin, util and login nodes plus the Lustre storage. The compute nodes
stayed up and jobs kept running, but they might have failed due to Lustre being unavailable. UCL
Estates know the cause of the problem, so hopefully this should not happen again for the time being.

- 2023-03-23 03:20 - Young's single running OSS went down due to high load and was brought back up
shortly before 9am. The second OSS had lost both internal drives and is likely to take several more
days before it can be brought back into service, so we are still running at risk.

- 2023-03-24 09:00 - Young's single running OSS went down again overnight. Last night we appeared to
be running into a new ZFS issue, where we had a kernel panic, a reboot and a failure to start ZFS
resources. Our Infrastructure team brought the resources back up this morning and Lustre recovery
began, but now the OSS has become unresponsive again. This means that you will not be able to log in
at the moment and jobs will have failed because they haven't been able to access the filesystem.

- 2023-03-24 13:00 - We do not currently expect to be able to bring everything up again until after
the weekend. Right now there is very high memory usage on the existing OSS even when only the util
and login nodes are running Lustre clients and all the compute nodes are off. We aren't sure why.

Next week we may need to delay bringing the compute back up until after the other OSS is fully fixed,
but we will update you on this then.

Sorry about this; we were running OK on one OSS until Thursday, and now something is preventing us
from continuing like that, so we may have another underlying problem.

- 2023-03-27 11:00 - The disks in the Young OSSes will be replaced tomorrow (we had another
failure in the OSS that is currently up, so it is down a disk again). This means Wednesday is the
earliest we will be able to start jobs again, and it depends somewhat on how well everything
behaves after all the disks are replaced and we have a properly resilient system again - we'll
need to do some testing before allowing jobs to start even if everything looks good.

- 2023-03-28 17:15 - Young's filesystem is running on two OSSes again. Roughly half the compute nodes
are enabled and jobs are running on them. If all goes well, we'll switch on the rest of the compute
nodes tomorrow.

- 2023-03-29 15:10 - Jobs were started on the rest of the nodes at around 10:30am and everything is
running ok, with the exception of the GPU nodes. On the GPU nodes we are seeing Lustre client
evictions, which cause I/O errors for jobs running there - the jobs weren't able to complete
the read or write they were requesting. For now we aren't running jobs on the GPU nodes until we
have this sorted out. You may have a job that started running there earlier today or yesterday and
failed, or did not create all the output expected because of this - please do check your output
carefully in that case.

This is a new issue. We have some suspicions that it is configuration-related, since the GPU nodes
have two OmniPath cards each and Lustre is set up to use both. This setup was working previously; we
are going to investigate further.
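
(For the curious, a rough sketch of how a dual-interface LNet client like this is typically
inspected and configured; the network and interface names below are illustrative, not our exact
configuration.)

```
# Show the LNet networks and interfaces this client is using
lnetctl net show

# Example of a multi-rail setup binding one network to two interfaces
lnetctl net add --net o2ib0 --if ib0,ib1
```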

- 2023-03-30 10:40 - Last night we had two ZFS panics, one on each OSS. These occurred at just
before 10pm and again at 1:30am. These will have caused some jobs to fail with I/O errors. We have
adjusted some configuration so that if we have more panics then the filesystem should hopefully be
able to recover more quickly and cause fewer I/O errors.

We had a problem with ZFS panics before, when it was a known issue with the versions of Lustre and
ZFS we were running, and we fixed it by upgrading them on Young. The current issue we are
experiencing does not appear to be a known one; investigation continues.

- 2023-04-03 10:30 - Young stayed up over the weekend and was running CPU jobs successfully. We have
now re-enabled the GPU nodes as well.

We are still getting some ZFS panics, but the change to recovery behaviour means the filesystem is
failing over and recovering in time to affect only a few client connections. We are scanning for
corrupted metadata and files, since this may be the cause.

We will be leaving Young up and running jobs over Easter. UCL's Easter closing is from Thurs 6 April
to Weds 12 April inclusive and we will be back on Thurs 13 April. If anything goes wrong during this
time, we won't be able to fix it until we are back.

- 2023-04-04 14:20 - The ZFS scrub (`zpool scrub`) we were running to look for corrupted metadata
and files has exposed a drive failure. We have spares and the affected ZFS pool is now recovering,
which will take a day or so. I/O performance may be degraded during the recovery.
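
(For reference, a scrub is started and monitored with the standard ZFS pool tools; the pool name
below is hypothetical.)

```
# Read and verify every block in the pool, in the background
zpool scrub mdt0

# Check scrub progress and any device errors it has exposed
zpool status -v mdt0
```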

- 2023-06-21 16:30 - We had a filesystem issue caused by a failed disk in the metadata server
at about 4:30pm. We brought it back up and it was recovering by around 8:20pm; it will have
been slower for a while overnight until recovery completed. Jobs will have failed in the
period when it was down, so please do check on things that were running during this time to
see whether they completed or not - our apologies for this.

- 2023-06-27 20:15 - We’ve had an unexpectedly high number of drive failures in Young’s
filesystem, and this has created a significant risk of filesystem failure until we can replace
the drives and restore resiliency.

To try to keep the filesystem running, we’ve had to shut down the login and compute nodes,
terminating all running jobs and removing access to the cluster. This should massively
reduce the load on the filesystem, and allow us to more rapidly bring the system back to a
safe level of redundancy.

In future, we will also attempt to keep more spare drives on-site, which should help us avoid
this kind of situation.

Apologies for the inconvenience: we’re going to be replacing drives and we will hopefully have
an update with better news tomorrow.

- 2023-06-28 11:00 - The situation with the disks in Young's filesystem is worse than we thought
and we're currently doing some data recovery procedures so we can then evaluate what we
should do next.

Data in home directories is backed up. We have likely lost some data from Scratch but are
working to find out the implications of that, and what and how much could be affected.

How long Young will be down depends on what we discover in the next few days. We will need
to do a full filesystem check once we have a recovered filesystem to check, and this can take
some time.

The detail, for those who wish to know:

A Lustre filesystem consists of object store servers and metadata servers, and the metadata
servers store information such as filenames, directories, access permissions, and file layout.
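
(As a small illustration of the split: the layout information the metadata servers hold for a file
can be viewed with the Lustre client tools; the path below is just an example.)

```
# Show which object storage targets (OSTs) hold a file's data;
# this layout is part of the metadata described above
lfs getstripe /home/username/results.dat
```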

We had a failed drive in a metadata mirror on Wednesday last week. We have eight mirrors in
that metadata target and each consists of two drives. We had a replacement drive arrive, but
before we could fit it, the second drive in the same mirror indicated that it was beginning to
fail.
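
(For anyone unfamiliar with the layout, a ZFS pool of two-drive mirrors is built roughly like
this; the pool and device names are purely illustrative.)

```
# Each "mirror" vdev pairs two drives, so the pool survives
# one drive failure per pair
zpool create mdt0 mirror sda sdb mirror sdc sdd
```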

We fitted the replacement drive yesterday, but the rebuild of the mirror overnight did not
succeed and the failing disk's status degraded further. What we are going to do next is to
retrieve as much off the failing disk as we can, move it onto a healthy disk, and then see
what status the filesystem is in and what is in an inconsistent state.

As was mentioned, we had also ordered more spare drives to keep on site but this happened
before those arrived.

We're sorry for this situation and are working to address it. We'll keep you updated as we
know more.

- 2023-06-29 11:30 - We've recovered what was on the failing disk and are missing approximately
250kB of metadata. (Since it is metadata, this doesn't directly translate to the effect on your
files, but we can probably be cautiously optimistic about the expected size of impact, should
nothing else go horribly wrong).

Replacement disks are being sent to us and should arrive today or tomorrow.

Once the disks arrive, we will proceed with the next step of moving the recovered data onto a
new disk and getting the filesystem to use that in replacement of the failed disk, and then we
can attempt to bring the filesystem up and see what status it thinks it is in. At that point we
will have several levels of filesystem health checking to run.

We are also progressing with some contingency plans relating to the backup data we have, to
make it easier to use should we need to.

I've had a couple of questions about when Young will be back up - we don't know yet, but not
during next week. I would expect us to still be doing the filesystem recovery and checks on
the first couple of days of the week even if the disks do arrive today, so we probably won't
have any additional updates for you until midweek.

It does take a long and unpredictable time to rebuild disks, check all the data, and recover
files: as an example, our metadata recovery was giving us estimates of 19 hours to 5 days while
it was dealing with the most damaged portions of the disk.

- 2023-07-04 10:25 - I just wanted to give you an update on our progress. The short version is
"we are still working on it but now believe we have lost little to no user data".

We do not have an ETA for return to service but may be able to give you read-only access to
your files in the near future.

More detail for those who are interested:

We have fitted the replacement disks and migrated the data to the new drives. The underlying
ZFS filesystem now believes that everything is OK and we are running the Lustre filesystem
checks.

We are also restoring a backup to a different filesystem so that, in the event that any data
in /home was damaged, we can restore it quickly.

Medium term, we plan to replace the storage appliance, which we are funding from UCL's budget
in our FY 23/24 (starts 1st August). This work should be completed by Christmas and will also
affect the local Kathleen system.

- 2023-07-04 16:25 - I'm pleased to announce that while we are not in a position to return to
service yet (the Lustre filesystem check is still running), we have managed to get to a state
where we can safely give you read-only access to your files. You should be able to log into
the login nodes as usual to retrieve your files.

(Obviously, this means you cannot change any files or run jobs!)

- 2023-07-07 12:30 - I didn't want to head into the weekend without giving an update - we are
still running the Lustre filesystem check and hope to have completed it by the middle of next
week. Unfortunately the tool that does this doesn't report its progress in a very helpful way:
as of this morning it had completed checking 13.7M directories on the filesystem, but we don't
have an exact number for how many there are in total, only an estimate.

Again, apologies for this unplanned outage – we are working at full speed to get a replacement
for this hardware later this year so we should not see a recurrence of these problems once
that is done.

- 2023-07-12 16:30 - Full access to Young is restored, and jobs are running again!

The Lustre filesystem check finished successfully earlier today and we ran some I/O-heavy test
jobs without issues.
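
(A generic illustration of the kind of I/O smoke test involved; the paths are examples, not our
actual test jobs.)

```
# Stream a large file to the filesystem and read it back, bypassing the
# client page cache so the filesystem itself is exercised
dd if=/dev/zero of=/home/username/iotest.bin bs=1M count=4096 oflag=direct
dd if=/home/username/iotest.bin of=/dev/null bs=1M iflag=direct
rm /home/username/iotest.bin
```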

We have replaced the disks in the metadata server where both drives in a pair were still original
disks, to prevent this situation happening again.

We remounted the filesystem read-write, and at 15:45 and 16:00 rebooted the login nodes one
after the other. We're now back in full service.

Please do check your space and if anything appears to be inconsistent or missing, email
[email protected]

Thank you for your patience during this time.

- 2023-10-26 14:50 - We seem to have a dying OmniPath switch in Young. The 32 nodes with names
beginning `node-c12b` lost their connection to the filesystem earlier. Powercycling the switch
only helped temporarily before it went down again. Those nodes are all currently out of service
@@ -529,6 +287,8 @@ This page outlines the status of each of the machines managed by the Research Computing team.
(You can see in `jobhist` what the head node of a job was, and the .po file will show all the
nodes that an MPI job ran on).
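
(A quick sketch of that check; the job ID and output-file name below are examples.)

```
# List your recent jobs, including the node each one started on
jobhist

# For an MPI job, the .po file lists every node it ran across
cat myjob.po123456
```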

- 2024-01 - The parallel filesystem will soon be replaced.

### Michael

- All systems are working well.