Short write-up of the order of events

Wednesday (2017-10-18)
Short maintenance to analyze a HW problem on one of the ESS (Elastic Storage Server) nodes
- After a controlled failover and power cycle, the machine no longer boots
- HW call has been opened; the mainboard has to be replaced

Friday (2017-10-20)
Add new ESS system (GL4S) to the GPFS cluster to extend the capacity of the core filesystem
Creation of recovery groups on the new system resulted in loss of disk access on the existing ESS systems
This procedure had recently been done for two clusters without any issues
Initial observation: might be related to incorrect and outdated /etc/hosts entries on the existing ESS systems (a consistency-check sketch is attached at the end of this write-up)
Opened a severity 1 SW call; the usual recovery mechanisms did not work out
- Debugging the systems with support + 1 GPFS developer through the night

Saturday (2017-10-21)
New ESS system has been turned off to reduce sources of error
~11:30 AM: ASAP3 cluster up and running again
Core filesystem has "gained" ~80 TB of space - unclear whether this indicates data loss or is the result of the planned deletion of data on Friday
Start GPFS on Maxwell and mount the filesystems again - ASAP3 cluster dies again
Additional debugging with 2 GPFS developers
- Maxwell node steals the internal node number of the dead ESS node
- This triggers a failover to a node without access to the disks
Developers were able to catch and understand the issue; it could be a SW bug
Recover ASAP3 cluster a final time to re-establish service for users
- GPFS on Maxwell stays stopped, the risk is too high
- ASAP3 service available again at ~11:00 PM for PETRA III and FLASH data taking
IBM: will create a test setup to reproduce this and ship a possible fix
- Mainboard replacement might happen before then

Monday (2017-10-23)
First comparisons of file lists do not indicate data loss (a comparison sketch is attached at the end of this write-up)
- SW call has been opened to understand this issue further
Technician replaces the mainboard on the broken ESS node
- Machine is still unable to boot
- HW contract is only 5x10 service, so work has to continue the next day

Tuesday (2017-10-24)
Technician will return at ~11:00 AM to fix the broken ESS node

Wednesday (2017-10-25)
Technician will return with CPU + DIMM
ESS node is now online again
If the ESS node can be repaired, Maxwell will be started.
If not, IT considers Plan B for Maxwell: NFS mounts of the GPFS filesystems instead of native GPFS mounts (see the sketch below).
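
Since outdated /etc/hosts entries on the existing ESS systems are one suspected trigger (see Friday), a quick consistency check across nodes can help. The sketch below is an illustration only: it assumes the /etc/hosts files have already been copied off the nodes (file and host names are placeholders), and it simply flags hostnames that map to different addresses in different copies.

    #!/usr/bin/env python3
    # Sketch only: compare copies of /etc/hosts collected from several nodes
    # and flag hostnames that map to different IP addresses on different nodes.
    # Invocation and file names are hypothetical, e.g.:
    #   python3 check_hosts.py hosts.ess01 hosts.ess02 hosts.maxwell01
    import sys
    from collections import defaultdict

    def parse_hosts(path):
        """Return {hostname: ip} for one hosts file, ignoring comments."""
        mapping = {}
        with open(path) as fh:
            for line in fh:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                fields = line.split()
                ip, names = fields[0], fields[1:]
                for name in names:
                    mapping[name] = ip
        return mapping

    def main(paths):
        seen = defaultdict(dict)          # hostname -> {source file: ip}
        for path in paths:
            for name, ip in parse_hosts(path).items():
                seen[name][path] = ip
        for name, by_file in sorted(seen.items()):
            if len(set(by_file.values())) > 1:
                detail = ", ".join(f"{src}={ip}" for src, ip in by_file.items())
                print(f"INCONSISTENT {name}: {detail}")

    if __name__ == "__main__":
        main(sys.argv[1:])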
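
For the file-list comparison mentioned under Monday, a minimal sketch along the following lines could be used. It assumes two plain-text listings, one taken before the incident and one afterwards (file names are placeholders), and prints the paths present in the first but missing from the second; those still need to be checked against the planned deletions from Friday.

    #!/usr/bin/env python3
    # Sketch only: compare a file listing taken before the incident with one
    # taken afterwards and print the paths that have disappeared. Listing
    # file names are hypothetical, e.g.:
    #   python3 compare_filelists.py filelist.before filelist.after
    import sys

    def load_listing(path):
        with open(path) as fh:
            return {line.rstrip("\n") for line in fh if line.strip()}

    def main(before_path, after_path):
        before = load_listing(before_path)
        after = load_listing(after_path)
        missing = sorted(before - after)
        print(f"{len(missing)} paths present before but missing afterwards")
        for p in missing:
            print(p)

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])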
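
If Plan B (NFS mounts instead of native GPFS mounts on Maxwell) is activated, it may be useful to verify per node how a given path is actually being served. A minimal sketch, assuming Linux /proc/mounts; the paths given on the command line are placeholders:

    #!/usr/bin/env python3
    # Sketch only: report which mount (and filesystem type, e.g. gpfs or nfs)
    # currently backs a given path, by scanning /proc/mounts for the longest
    # matching mount point.
    import os
    import sys

    def mount_type(path):
        """Return (mount_point, fstype) of the longest mount covering path."""
        path = os.path.realpath(path)
        best = ("", "unknown")
        with open("/proc/mounts") as fh:
            for line in fh:
                _dev, mnt, fstype = line.split()[:3]
                prefix = mnt.rstrip("/") + "/"
                if (path == mnt or path.startswith(prefix)) and len(mnt) > len(best[0]):
                    best = (mnt, fstype)
        return best

    if __name__ == "__main__":
        for p in sys.argv[1:] or ["/"]:
            mnt, fstype = mount_type(p)
            print(f"{p}: mounted at {mnt or '(none)'} as {fstype}")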