Short write-up of the order of events

Wednesday (2017-10-18)
Short maintenance to analyze a HW problem on one of the ESS (Elastic Storage Server) nodes
- After a controlled failover and power cycle, the machine no longer boots
- HW call has been opened; the mainboard has to be replaced

Friday (2017-10-20)
Add new ESS system (GL4S) to the GPFS cluster to extend the capacity of the core filesystem
Creation of recovery groups on the new system resulted in loss of disk access on the existing ESS systems
This procedure had recently been done for two clusters without any issues
Initial observation: might be related to incorrect and outdated /etc/hosts entries on the existing ESS systems (a consistency-check sketch is attached at the end of this write-up)
Opened a severity 1 SW call; the usual recovery mechanisms did not work out
- Debugging the systems with support + 1 GPFS developer through the night

Saturday (2017-10-21)
New ESS system has been turned off to reduce sources of error
~11:30 AM: ASAP3 cluster up and running again
Core filesystem has "gained" ~80 TB of space - unclear whether this indicates data loss or is the result of the planned deletion of data on Friday
Start GPFS on Maxwell and mount the filesystems again - ASAP3 cluster dies again
Additional debugging with 2 GPFS developers
- Maxwell node steals the internal node number of the dead ESS node
- This triggers a failover to a node without access to the disks
Developers were able to catch and understand the issue; it could be a SW bug
Recover ASAP3 cluster a final time to re-establish service for users
- GPFS on Maxwell stays stopped, the risk is too high
- ASAP3 service available again at ~11:00 PM for PETRA III and FLASH data taking
IBM: will create a test setup to reproduce this and ship a possible fix
- Mainboard replacement might happen before then

Monday (2017-10-23)
First comparisons of file lists do not indicate data loss (a comparison sketch is attached at the end of this write-up)
- SW call has been opened to understand this issue further
Technician replaces the mainboard on the broken ESS node
- Machine is still unable to boot
- HW contract is only 5x10 service, so work has to continue the next day

Tuesday (2017-10-24)
Technician will return at ~11:00 AM to fix the broken ESS node

Wednesday (2017-10-25)
Technician will return with CPU + DIMM
ESS node is now online again
If the ESS node can be repaired, Maxwell will be started.
If not, IT considers Plan B for Maxwell: NFS mounts of the GPFS filesystems instead of native GPFS mounts (see the sketch below).
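
Since outdated /etc/hosts entries on the existing ESS systems are one suspected trigger (see Friday), a quick consistency check across nodes can help. The sketch below is an illustration only: it assumes the /etc/hosts files have already been copied off the nodes (file and host names are placeholders), and it simply flags hostnames that map to different addresses in different copies.

    #!/usr/bin/env python3
    # Sketch only: compare copies of /etc/hosts collected from several nodes
    # and flag hostnames that map to different IP addresses on different nodes.
    # Invocation and file names are hypothetical, e.g.:
    #   python3 check_hosts.py hosts.ess01 hosts.ess02 hosts.maxwell01
    import sys
    from collections import defaultdict

    def parse_hosts(path):
        """Return {hostname: ip} for one hosts file, ignoring comments."""
        mapping = {}
        with open(path) as fh:
            for line in fh:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                fields = line.split()
                ip, names = fields[0], fields[1:]
                for name in names:
                    mapping[name] = ip
        return mapping

    def main(paths):
        seen = defaultdict(dict)          # hostname -> {source file: ip}
        for path in paths:
            for name, ip in parse_hosts(path).items():
                seen[name][path] = ip
        for name, by_file in sorted(seen.items()):
            if len(set(by_file.values())) > 1:
                detail = ", ".join(f"{src}={ip}" for src, ip in by_file.items())
                print(f"INCONSISTENT {name}: {detail}")

    if __name__ == "__main__":
        main(sys.argv[1:])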
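
For the file-list comparison mentioned under Monday, a minimal sketch along the following lines could be used. It assumes two plain-text listings, one taken before the incident and one afterwards (file names are placeholders), and prints the paths present in the first but missing from the second; those still need to be checked against the planned deletions from Friday.

    #!/usr/bin/env python3
    # Sketch only: compare a file listing taken before the incident with one
    # taken afterwards and print the paths that have disappeared. Listing
    # file names are hypothetical, e.g.:
    #   python3 compare_filelists.py filelist.before filelist.after
    import sys

    def load_listing(path):
        with open(path) as fh:
            return {line.rstrip("\n") for line in fh if line.strip()}

    def main(before_path, after_path):
        before = load_listing(before_path)
        after = load_listing(after_path)
        missing = sorted(before - after)
        print(f"{len(missing)} paths present before but missing afterwards")
        for p in missing:
            print(p)

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])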
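
If Plan B (NFS mounts instead of native GPFS mounts on Maxwell) is activated, it may be useful to verify per node how a given path is actually being served. A minimal sketch, assuming Linux /proc/mounts; the paths given on the command line are placeholders:

    #!/usr/bin/env python3
    # Sketch only: report which mount (and filesystem type, e.g. gpfs or nfs)
    # currently backs a given path, by scanning /proc/mounts for the longest
    # matching mount point.
    import os
    import sys

    def mount_type(path):
        """Return (mount_point, fstype) of the longest mount covering path."""
        path = os.path.realpath(path)
        best = ("", "unknown")
        with open("/proc/mounts") as fh:
            for line in fh:
                _dev, mnt, fstype = line.split()[:3]
                prefix = mnt.rstrip("/") + "/"
                if (path == mnt or path.startswith(prefix)) and len(mnt) > len(best[0]):
                    best = (mnt, fstype)
        return best

    if __name__ == "__main__":
        for p in sys.argv[1:] or ["/"]:
            mnt, fstype = mount_type(p)
            print(f"{p}: mounted at {mnt or '(none)'} as {fstype}")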