The Question is:
I have a multi-site FDDI VAXcluster: Site A has 2 VAX 7000s and an SW800 in CI
configuration, and Site B has 7 VAX 7000s and an SW800, also in CI configuration.
All servers use FDDI as the cluster interconnect and Ethernet for network access.
Each server contributes 1 vote and quorum is 3. Site A has 2 votes and Site B has 3.
I had a power outage recently and both the FDDI and network switches went
down. All servers and storage were up except for 1 of the 3 servers in Site B.
The entire cluster hung for a while, and later all SW800 volumes were
software-disabled. When power was restored and the FDDI and network switches
were powered up, the down server was also powered up but failed to boot
completely because it could not mount the disabled volumes. I had to shut down
- forcibly shut down - all the servers and reboot all of them to bring the
cluster back up.
1) Will losing both FDDI and network connectivity cause the entire cluster to
hang?
2) If Site A had a higher vote total (say 4) than Site B (3 currently), would
the Site A servers still hang in this scenario?
3) In what situation would cluster partitioning occur?
Your answers to these questions, and any related input on this power outage
problem, would be greatly appreciated.
The Answer is :
Your stated facts are unclear or contradictory. You first claim 2 nodes
at Site A and 7 at Site B, each with 1 vote -- but then say there are only
3 votes at Site B. Apparently, only 3 of the systems at Site B have 1 vote
each and the other 4 have 0 votes. A cluster-wide total of 5 votes is
consistent with the quorum of 3. Perhaps the 7 was a typo?
Did the power outage hit both sites concurrently? The OpenVMS Wizard
suspects that only one site actually had a power problem, but that the
outage isolated the sites from each other.
Site A systems should enter a quorum hang (block), as that lobe has only 2
votes in total. Site B systems would have stayed online if all 3 voting
servers there had been up. Why was the one system at Site B down? With that
system down, the two remaining voting systems at Site B also trigger a
quorum hang.
If all of the cluster interconnects go down, then the isolated nodes would
all encounter quorum hangs (blocks), as none of the nodes would have the 3
votes by themselves. This is expected behavior.
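The quorum arithmetic above can be sketched as follows. This is a minimal
illustration, not OpenVMS code: the function and variable names are ours, and
the rule used is the documented quorum calculation of (EXPECTED_VOTES + 2) / 2
with the fraction dropped.

```python
# Minimal sketch (not OpenVMS code): the quorum rule applied to the
# question's configuration. Names here are illustrative assumptions.

def quorum(expected_votes):
    """OpenVMS-style quorum: (EXPECTED_VOTES + 2) / 2, fraction dropped."""
    return (expected_votes + 2) // 2

site_votes = {"A": 2, "B": 3}            # voting members as given in the question
q = quorum(sum(site_votes.values()))     # (5 + 2) // 2 == 3

# With the interconnect down, each site can count only its own votes:
for site, votes in site_votes.items():
    status = "retains quorum" if votes >= q else "quorum hang (blocked)"
    print(f"Site {site}: {votes} vote(s) against quorum {q} -> {status}")
```

With all interconnects down, neither 2 nor 3 isolated votes reach the quorum
of 3 on Site A, which is why the hang is expected rather than a malfunction.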
The system parameter MVTIMEOUT specifies how long a volume can remain in
mount verification before it is made unavailable. After the mount verify
timeout occurs, only a dismount and remount can make the volume available
again. If the volume comes back online before MVTIMEOUT expires, the stalled
I/Os are simply reissued and the applications pick up from where they left
off. You could consider increasing MVTIMEOUT to tolerate longer temporary
outages.
Although you might have been able to recover without rebooting all of the
systems in the cluster, it was probably easier to do so. Usually when the
volume is made unavailable, the stalled I/Os fail back to the application,
which reports the error and exits. Typically, you can then dismount the
volumes and remount them. The problem comes when an application keeps a
channel open to the volume despite the state change. Finding and stopping
all such applications cluster-wide can be tedious.
1: Not necessarily. Having the one system down at Site B was the problem.
2: No. With 4+3=7 total votes, quorum would have to be set to 4 (quorum is
   computed as (total votes + 2) / 2, dropping any fraction), and Site A's
   4 votes would meet it. Site A would stay online (if all systems there
   were up).
3: Cluster partitioning -- where both sides are online, despite being
unconnected -- happens most often when EXPECTED_VOTES is
misconfigured. It can also happen when a system manager forces
the blocked systems to recalculate quorum dynamically. (For more
details on VOTES and EXPECTED_VOTES, please see the OpenVMS FAQ.)
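The EXPECTED_VOTES failure mode can be illustrated with a small sketch. Again
this is Python for illustration only, with names of our own choosing; the real
quorum handling is done by the cluster connection manager.

```python
# Minimal sketch (not OpenVMS code) of the partitioning hazard: if each
# isolated side computes quorum from a too-small EXPECTED_VOTES, both
# sides can believe they have quorum and run independently.
# Numbers reuse the question's 2-vote / 3-vote split; names are ours.

def quorum(expected_votes):
    return (expected_votes + 2) // 2

# Correctly configured: EXPECTED_VOTES = 5 on every node, so quorum is 3
# and the 2-vote side always blocks; at most one side can ever run.
assert quorum(5) == 3

# Misconfigured: each side counts only its local votes.
site_a_runs = 2 >= quorum(2)   # quorum 2, met by Site A alone
site_b_runs = 3 >= quorum(3)   # quorum 2, met by Site B alone
print("partitioned:", site_a_runs and site_b_runs)  # prints: partitioned: True
```

The same "both sides satisfied" condition arises if an operator manually
forces blocked systems to recalculate quorum while the interconnect is still
down, which is why that intervention must be used with care.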