
    Details

    • Type: Task
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Component/s: None
    • Labels: None

      Description

      The MHS system depends on connectivity to a central hub, through which
      all instrument control and telemetry traffic is routed. This is fine
      in an environment where computing and networking are provisioned with
      that in mind, but a recent outage at Subaru highlighted our exposure
      to non-PFS design choices that we cannot alter.

      Specifically, the main Subaru computer room ("CB2F") and much of the
      observatory-wide network are not backed by the generator: once the
      CB2F UPSes run down, the CB2F computers and most observatory network
      gear shut down. As it stands, the MHS hub and all data and software
      storage are in CB2F, so when CB2F goes down, all PFS actors lose
      connection with all other parts of the system. Note that in any case,
      when CB2F is down there is no network connection to the observatory
      from the outside (Hilo).

      The SPS is much better protected: the PFS UPS on IR3 protects much of
      the SPS and is in turn backed by the diesel generator. So long as the
      generator comes online, the important parts of the SPS will never see
      a power or networking outage.

      Only the SPS has significant instrument safety requirements: if the
      PFI or MCS lose power or connectivity, no real harm will follow. Because of
      this I will treat SPS as the core part of PFS, and the only part I
      will address here. Further discussion with CDM etc. might let us do
      better with the rest of the instrument.

      To add a bit more detail, our risk appears to come from many
      components, but I will argue that we only need to worry about
      four:

      • the tron hub itself must remain up and connected to the SPS actors.
      • the shared /software filesystem has to remain available.
      • the archiver must remain connected to the hub.
      • hostname resolution and DHCP (dnsmasq) must keep working (see the
        sketch below).
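
      For concreteness, something like the following could spot-check those
      four dependencies from an SPS-side machine. This is only a sketch: the
      hostnames and port numbers are placeholders, not the real PFS values,
      and the archiver check only verifies that its host is reachable.

        #!/usr/bin/env python3
        # Rough health check for the four dependencies listed above.
        # Hostnames and ports are placeholders, not real PFS values.

        import os
        import socket

        def tcp_ok(host, port, timeout=2.0):
            """True if a TCP connection to host:port can be opened."""
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False

        def dns_ok(name):
            """True if the name resolves (i.e. dnsmasq/DNS is answering)."""
            try:
                socket.gethostbyname(name)
                return True
            except OSError:
                return False

        checks = {
            "tron hub reachable":      tcp_ok("tron-hub", 6093),  # placeholder
            "/software mounted":       os.path.ismount("/software"),
            "archiver host reachable": tcp_ok("archiver", 22),    # placeholder
            "DNS/DHCP (dnsmasq)":      dns_ok("rack5-ics"),
        }

        for name, ok in checks.items():
            print(f"{'OK ' if ok else 'FAIL'}  {name}")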

      The LAM group faced problems with the dependability of the
      site-provided /software server, and made the obvious fix: they moved
      /software onto PFS machines, near their SPS racks and piepans.

      We propose doing the same, but also moving the tron hub, the archiver,
      dnsmasq, and perhaps some stray actors onto the PFS UPS. It turns
      out that there is one computer there already ("rack5-ics"), but the
      machine/VM is not suitable as it stands. We do not need much processor
      power, but we should have a pair of SSDs for /software and a bit more
      RAM.

      As I understand the space and power constraints, we do not want to
      move the /data and postgresql servers or the large vm servers up to
      the SPS racks. Still, I think we need to better understand what
      rack5-ics is and what the real space and power limits are.

      The archiver always buffers keywords to disk before loading them into
      the database: it does not actually need the database to save telemetry,
      just reliable disk space. Granted, having the archiver database running
      at SPS would be comforting, but if we cannot have a beefier machine
      there we can live without it.
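
      As a minimal sketch of that buffer-then-load pattern (illustrative
      only, not the actual archiver code), telemetry can always be appended
      to a local spool directory, and a separate pass drains the spool into
      the database whenever it is reachable. The paths and record format
      here are made up:

        # Sketch of the buffer-then-load pattern described above; this is
        # illustrative, not the actual archiver implementation.

        import json
        import os
        import time

        SPOOL_DIR = "/var/spool/telemetry"   # placeholder: any reliable local disk

        def spool_keywords(actor, keywords):
            """Always write telemetry to local disk first; no database needed."""
            os.makedirs(SPOOL_DIR, exist_ok=True)
            record = {"ts": time.time(), "actor": actor, "keywords": keywords}
            fname = os.path.join(SPOOL_DIR, f"{record['ts']:.6f}-{actor}.json")
            with open(fname, "w") as f:
                json.dump(record, f)

        def drain_spool(insert_into_db):
            """When the database is reachable again, load and clear the spool."""
            for fname in sorted(os.listdir(SPOOL_DIR)):
                path = os.path.join(SPOOL_DIR, fname)
                with open(path) as f:
                    record = json.load(f)
                insert_into_db(record)   # caller supplies the real DB insert
                os.remove(path)          # delete only after a successful insert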

      Finally, all actors write directly to /data/logs. Switching to rsyslog
      would let us buffer logging output for a while, but not forever; once
      the buffer fills we can arrange for log traffic to simply be dropped.
      This data is not as important as the archiver telemetry.
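
      As an illustration of that buffer-then-drop behaviour (a Python sketch
      of the idea, not a specific rsyslog configuration), log records can go
      through a bounded queue whose listener drains to disk when it can keep
      up; once the queue fills, new records are simply discarded:

        # Sketch of bounded log buffering with drop-on-overflow.
        import logging
        import logging.handlers
        import queue

        log_queue = queue.Queue(maxsize=10000)   # buffer size is a placeholder

        class DroppingQueueHandler(logging.handlers.QueueHandler):
            """Queue log records; silently drop them once the buffer is full."""
            def enqueue(self, record):
                try:
                    self.queue.put_nowait(record)
                except queue.Full:
                    pass   # logs matter less than archiver telemetry

        # The listener drains the queue into the real sink (a local file here,
        # standing in for /data/logs) whenever that sink is available.
        sink = logging.FileHandler("actor.log")  # placeholder for /data/logs/...
        listener = logging.handlers.QueueListener(log_queue, sink)
        listener.start()

        logging.getLogger().addHandler(DroppingQueueHandler(log_queue))
        logging.getLogger().warning("buffered, then written or dropped")
        listener.stop()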

      I think that services run at SPS can be backed up from CB2F.

      Right now, this all boils down to a few ticketable things:

      • provision rack5-ics or similar a bit better.
      • serve /software from there.
      • move some mhs-ics and dnsmasq-ics services over (just configuration
        changes).
      • INSTRM-322

    People

    • Assignee: Unassigned
    • Reporter: cloomis