[INSTRM-1328] Move services onto PFS UPS Created: 21/Jul/21  Updated: 30/Dec/21

Status: Open
Project: Instrument control development
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: cloomis Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
is blocked by INFRA-307 Setup the new summit servers Open

 Description   

The MHS system depends on connectivity to a central hub, through which
all instrument control and telemetry traffic is routed. This is fine
in an environment where computing and networking are provisioned with
that in mind, but a recent outage at Subaru highlighted our exposure
to inalterable non-PFS design choices.

Specifically, the main Subaru computer room ("CB2F") and much of the
observatory-wide network are not backed by the generator: once the
CB2F UPSes run down, the CB2F computers and most observatory network
gear shut down. As it stands, the MHS hub and all data and software
storage is in CB2F, so when CB2F goes down, all PFS actors lose
connection with all other parts of the system. Note that in any case,
when CB2F is down there is no network connection to the observatory
from the outside (Hilo).

The SPS is much better protected: the PFS UPS on IR3 protects much of
the SPS and is in turn backed by the diesel generator. So long as the
generator comes online, the important parts of the SPS will never see
a power or networking outage.

Only the SPS has significant instrument safety requirements: if the
PFI or MCS lose power or connectivity, no real harm will follow.
Because of this I will treat the SPS as the core part of PFS, and the
only part I will address here. Further discussion with CDM etc. might
let us do better with the rest of the instrument.

To add a bit more detail, our risk appears to come from many
components, but I will argue that we only need to worry about
four:

  • the tron hub itself must remain up and connected to the SPS actors.
  • the shared /software filesystem has to remain available.
  • the archiver must remain connected to the hub.
  • hostname resolution and DHCP must keep working.

The LAM group faced problems with the dependability of the
site-provided /software server, and made the obvious fix: they moved
/software onto PFS machines, near their SPS racks and piepans.

We propose doing the same, but also moving the tron hub, the archiver,
dnsmasq, and perhaps some stray actors onto the PFS UPS. It turns
out that there is one computer there already ("rack5-ics"), but the
machine/vm is not suitable as it stands. We do not need much processor
power, but should have a pair of SSDs for /software, and a bit more
RAM.
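
As a concrete sketch (the device names and subnet here are assumptions,
not the real values), provisioning the pair of SSDs and serving
/software from them could look roughly like this:

    # Mirror the two SSDs and put /software on the resulting array.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext4 -L software /dev/md0
    mkdir -p /software && mount LABEL=software /software

    # Export /software read-only to the SPS subnet (add rw hosts as
    # needed), then reload the NFS export table.
    echo '/software 10.1.1.0/24(ro,sync,no_subtree_check)' >> /etc/exports
    exportfs -ra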

As I understand the space and power constraints, we do not want to
move the /data and postgresql servers or the large vm servers up to
the SPS racks. Still, I think we need to better understand what
rack5-ics is and what the real space and power limits are.

The archiver always buffers keywords to disk before loading them into
the database, so it does not actually need the database to save
telemetry, just reliable disk space. OK, yes, having the archiver
database running at SPS would be comforting, but if we cannot have a
beefier machine there we can live without it.

Finally, all actors write directly to /data/logs. Switching to
rsyslog would let us buffer logging output locally for a while, but
not forever; after that I believe we can arrange for the log traffic
to simply be dropped. This data is not as important as the archiver
telemetry.
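
A minimal sketch of how that could be wired up, assuming the actor
hosts forward to a CB2F loghost over TCP ("logs-cb2f", the spool
sizes, and the retry count are all placeholders):

    # /etc/rsyslog.d/pfs-forward.conf (illustrative values only)
    global(workDirectory="/var/spool/rsyslog")

    # Forward everything to the CB2F loghost, spooling to local disk while
    # the link is down; once the disk quota or retry count is exhausted,
    # messages are dropped rather than blocking the actor host.
    *.* action(type="omfwd" target="logs-cb2f" port="514" protocol="tcp"
               queue.type="LinkedList" queue.filename="pfs_fwd"
               queue.maxDiskSpace="2g" queue.saveOnShutdown="on"
               action.resumeRetryCount="1000")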

I think that services running at the SPS can be backed up from CB2F.
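
As a sketch of what such a backup could look like (the backup path
here is a placeholder), a nightly pull from a CB2F host would be
enough:

    # On a CB2F machine, pull the UPS-backed /software copy once a day;
    # other service state (archiver buffer, tron logs) would be added the
    # same way.
    rsync -a --delete rack5-ics:/software/ /data/backups/rack5-ics/software/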

Right now, this all boils down to ticketable things:

  • provision rack5-ics or similar a bit better.
  • serve /software from there.
  • move some mhs-ics and dnsmasq-ics services over (just configuration changes; a dnsmasq sketch follows this list).
  • INSTRM-322
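
For the dnsmasq-ics piece, a rough sketch of what the UPS-backed
instance might serve; the interface name, domain, and addresses are
all made up for illustration:

    # /etc/dnsmasq.d/pfs.conf (illustrative values only)
    interface=eno2                           # SPS-facing interface
    domain=pfs.example                       # placeholder instrument domain
    local=/pfs.example/                      # answer this domain locally, never forward
    dhcp-range=10.1.1.100,10.1.1.199,12h     # pool for SPS hosts
    dhcp-host=aa:bb:cc:00:11:22,sps-host-1,10.1.1.21
    server=10.0.0.53                         # upstream observatory resolver, when reachable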


 Comments   
Comment by cloomis [ 24/Jul/21 ]

Hiroshige Yoshida and I chatted, and noted the following:

  • The rack5-ics computer is from IPMU and was bought in 2017; it went off warranty in 2018. It is a 1U R230 with 16 GB RAM and 2 of 4 drive slots free. Even if we were not making this a primary PFS host it should be on the list for an update.
  • Before a replacement can be spec'ed, Hiroshige Yoshida needs to physically inspect rack5, in particular to check how deep a computer we can install and whether we are limited to 1U.
  • In the meantime, we can add a pair of hot-swap SSDs and a PCIe NVMe card. For decent gear [ 2x2TB Intels, 1TB MLC NVMe ] that would be somewhere between $1000 and $1200, purchasable from Amazon, etc. It can be installed quickly, and the parts would be usable elsewhere.

Up to the money/purchasing people, I think. I suggest doing the quick upgrade now, and letting any new computer purchase take the usual time. Since the old configuration will still be spinning in CB2F as a fallback, I also suggest switching before any significant new PFI engineering.
