HP OpenVMS Systems Documentation
Guidelines for OpenVMS Cluster Configurations
A.7.2.1 SCSI Bus Resets
When a host connected to a SCSI bus first starts, either by being turned on or by rebooting, it does not know the state of the SCSI bus and the devices on it. The ANSI SCSI standard provides a method called BUS RESET to force the bus and its devices into a known state. A host typically asserts a RESET signal one or more times on each of its SCSI buses when it first starts up and when it shuts down. While this is a normal action on the part of the host asserting RESET, other hosts consider this RESET signal an error because RESET requires that the hosts abort and restart all I/O operations that are in progress.
A host may also reset the bus in the midst of normal operation if it
detects a problem that it cannot correct in any other way. These kinds
of resets are uncommon, but they occur most frequently when something
on the bus is disturbed. For example, an attempt to hot plug a SCSI
device while the device is still active (see Section A.7.6) or halting
one of the hosts with Ctrl/P can cause a condition that forces one or
more hosts to issue a bus reset.
When a host exchanges data with a device on the SCSI bus, there are several different points where the host must wait for the device or the SCSI adapter to react. In an OpenVMS system, the host is allowed to do other work while it is waiting, but a timer is started to make sure that it does not wait too long. If the timer expires without a response from the SCSI device or adapter, this is called a timeout.
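The timer behavior described above can be sketched as follows. This is a minimal illustration in Python, not OpenVMS driver code; the function name `wait_for_response`, its parameters, and the polling interval are assumptions made for the example.

```python
import time

def wait_for_response(poll, timeout_s, clock=time.monotonic, sleep=time.sleep):
    """Wait for a device response; return None if the timer expires first.

    poll      -- hypothetical callable returning the response, or None if
                 the device or adapter has not reacted yet
    timeout_s -- how long the host is willing to wait, in seconds
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        response = poll()
        if response is not None:
            return response      # the device reacted in time
        sleep(0.001)             # the host is free to do other work here
    return None                  # timer expired without a response: a timeout
```

A caller would treat a `None` result as the timeout condition and begin error recovery.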
There are three kinds of timeouts:
Timeout errors are not inevitable on SCSI OpenVMS Cluster systems.
However, they are more frequent on SCSI buses with heavy traffic and
those with two initiators. They do not necessarily indicate a hardware
or software problem. If they are logged frequently, you should consider
ways to reduce the load on the SCSI bus (for example, adding an
Mount verify is a condition declared by a host about a device. The host declares this condition in response to a number of possible transient errors, including bus resets and timeouts. When a device is in the mount verify state, the host suspends normal I/O to it until the host can determine that the correct device is there, and that the device is accessible. Mount verify processing then retries outstanding I/Os in a way that ensures that the correct data is written or read. Application programs are unaware that a mount verify condition has occurred as long as the mount verify completes.
If the host cannot access the correct device within a certain amount of
time, it declares a mount verify timeout, and application programs are
notified that the device is unavailable. Manual intervention is
required to restore a device to service after the host has declared a
mount verify timeout. A mount verify timeout usually means that the
error is not transient. The system manager can choose the timeout
period for mount verify; the default is one hour.
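The mount verify sequence can be pictured as a small state machine. The following Python sketch is a simulation under stated assumptions: the state names, the `mount_verify` function, and the one-second probe interval are hypothetical. Only the one-hour default timeout comes from the text above.

```python
from enum import Enum

class DeviceState(Enum):
    MOUNTED = "mounted"
    MOUNT_VERIFY = "mount verify"
    MOUNT_VERIFY_TIMEOUT = "mount verify timeout"

def mount_verify(probe_results, timeout_s=3600, probe_interval_s=1):
    """Simulate mount verify for one device.

    probe_results -- iterable of booleans, one per access attempt; True
                     means the correct device was found and is accessible
    Returns (final_state, elapsed_seconds).
    """
    elapsed = 0
    for accessible in probe_results:
        if accessible:
            # Outstanding I/Os would be retried here; applications never
            # learn that the condition occurred.
            return DeviceState.MOUNTED, elapsed
        elapsed += probe_interval_s
        if elapsed >= timeout_s:
            # Applications are told the device is unavailable; manual
            # intervention is needed to restore service.
            return DeviceState.MOUNT_VERIFY_TIMEOUT, elapsed
    return DeviceState.MOUNT_VERIFY, elapsed  # still verifying
```

For example, `mount_verify([False, False, True])` models a transient error that clears on the third attempt.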
Shadow volume processing is a process similar to mount verify, but it is for shadow set members. An error on one member of a shadow set places the set into the volume processing state, which blocks I/O while OpenVMS attempts to regain access to the member. If access is regained before shadow volume processing times out, then the outstanding I/Os are reissued and the shadow set returns to normal operation. If a timeout occurs, then the failed member is removed from the set. The system manager can select one timeout value for the system disk shadow set, and one for application shadow sets. The default value for both timeouts is 20 seconds.
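The two possible outcomes of shadow volume processing can be sketched in Python as follows. The function name, the member names, and the return values are hypothetical. The 20-second default is the value given above.

```python
def shadow_volume_processing(members, failed_member, seconds_to_regain,
                             timeout_s=20):
    """Simulate the outcome of shadow volume processing for one shadow set.

    members           -- names of the current shadow set members
    failed_member     -- the member on which the error occurred
    seconds_to_regain -- seconds until access is regained, or None if
                         access is never regained
    Returns (outcome, members_after).
    """
    regained = seconds_to_regain is not None and seconds_to_regain < timeout_s
    if regained:
        # Outstanding I/Os are reissued and the set returns to normal.
        return "normal operation", list(members)
    # Timeout: the failed member is removed from the set.
    remaining = [m for m in members if m != failed_member]
    return "member removed", remaining
```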
A.7.2.5 Expected OPCOM Messages in Multihost SCSI Environments
When a bus reset occurs, an OPCOM message is displayed as each mounted disk enters and exits mount verification or shadow volume processing.
When an I/O to a drive experiences a timeout error, an OPCOM message is displayed as that drive enters and exits mount verification or shadow volume processing.
If a quorum disk on the shared SCSI bus experiences either of these
errors, then additional OPCOM messages may appear, indicating that the
connection to the quorum disk has been lost and regained.
In the OpenVMS system, the Error Log utility allows device drivers to save information about unusual conditions that they encounter. In the past, most of these unusual conditions have happened as a result of errors such as hardware failures, software failures, or transient conditions (for example, loose cables).
If you type the DCL command SHOW ERROR, the system displays a summary of the errors that have been logged since the last time the system booted. For example:
In this case, 6 errors have been logged against host SALT's SCSI port B (PKB0), 10 have been logged against disk $1$DKB500, and so forth.
To see the details of these errors, you can use the command ANALYZE/ERROR/SINCE=dd-mmm-yyyy:hh:mm:ss at the DCL prompt. The output from this command displays a list of error log entries with information similar to the following:
For this discussion, the key elements are the ERROR TYPE and, in some
instances, the PORT STATUS fields. In this example, the error type is
03, COMMAND TRANSMISSION FAILURE, and the port status is 00000E32,
The error log entries listed in this section are likely to be logged in a multihost SCSI configuration, and you usually do not need to be concerned about them. You should, however, examine any error log entries for messages other than those listed in this section.
A.7.3 Restrictions and Known Problems
The OpenVMS Cluster software has the following restrictions when multiple hosts are configured on the same SCSI bus:
OpenVMS Cluster systems also place one restriction on the SCSI quorum disk, whether the disk is located on a single-host SCSI bus or a multihost SCSI bus. The SCSI quorum disk must support tagged command queuing (TCQ). This is required because of the special handling that quorum I/O receives in the OpenVMS SCSI drivers.
This restriction is not expected to be significant, because all disks
on a multihost SCSI bus must support tagged command queuing (see
Section A.7.7), and because quorum disks are normally not used on
The following sections describe troubleshooting tips for solving common
problems in an OpenVMS Cluster system that uses a SCSI interconnect.
Verify that two terminators are on every SCSI interconnect (one at each
end of the interconnect). The BA350 enclosure, the BA356 enclosure, the
DWZZx, and the KZxxx adapters have internal
terminators that are not visible externally (see Section A.4.4).
OpenVMS automatically detects configuration errors described in this
section and prevents the possibility of data loss that could result
from such configuration errors, either by bugchecking or by refusing to
mount a disk.
For versions prior to OpenVMS Alpha Version 7.2, there are three types of configuration errors that can cause a bugcheck during booting. The bugcheck code is VAXCLUSTER, Error detected by OpenVMS Cluster software.
When OpenVMS boots, it determines which devices are present on the SCSI bus by sending an inquiry command to every SCSI ID. When a device receives the inquiry, it indicates its presence by returning data that indicates whether it is a disk, tape, or processor.
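The bus scan described above can be sketched as a loop over SCSI IDs. This Python simulation is illustrative only: `scan_bus` and `send_inquiry` are hypothetical names, the sketch assumes an 8-ID bus, and it assumes the host skips its own ID. The peripheral device type codes shown (00h disk, 01h tape, 03h processor) are those defined by the SCSI standard.

```python
# Peripheral device type codes returned in SCSI INQUIRY data.
PERIPHERAL_TYPES = {0x00: "disk", 0x01: "tape", 0x03: "processor"}

def scan_bus(send_inquiry, own_id, bus_width=8):
    """Send an inquiry to every SCSI ID and record which devices answer.

    send_inquiry -- hypothetical callable: takes a SCSI ID and returns the
                    peripheral device type code, or None if nothing answers
    own_id       -- SCSI ID of the scanning host adapter (assumed skipped)
    """
    found = {}
    for scsi_id in range(bus_width):
        if scsi_id == own_id:
            continue
        device_type = send_inquiry(scsi_id)
        if device_type is not None:
            found[scsi_id] = PERIPHERAL_TYPES.get(device_type, "other")
    return found
```

A "processor" entry in the result corresponds to another host adapter on the bus.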
Some processor devices (host adapters) answer the inquiry without assistance from the operating system; others require that the operating system be running. The adapters supported in OpenVMS Cluster systems require the operating system to be running. These adapters, with the aid of OpenVMS, pass information in their response to the inquiry that allows the recipient to detect the following configuration errors:
A.18.104.22.168 Failure to Configure Devices
In OpenVMS Alpha Version 7.2, SCSI devices on a misconfigured bus (as
described in Section A.22.214.171.124) are not configured. Instead, error messages
that describe the incorrect configuration are displayed.
There are two types of configuration error that can cause a disk to fail to mount.
First, when a system boots from a disk on the shared SCSI bus, it may fail to mount the system disk. This happens if there is another system on the SCSI bus that is already booted, and the other system is using a different device name for the system disk. (Two systems will disagree about the name of a device on the shared bus if their controller names or allocation classes are misconfigured, as described in the previous section.) If the system does not first execute one of the bugchecks described in the previous section, then the following error message is displayed on the console:
The decoded representation of this status is:
This error indicates that the system disk is already mounted in what appears to be another drive in the OpenVMS Cluster system, so it is not mounted again. To solve this problem, check the controller letters and allocation class values for each node on the shared SCSI bus.
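To see why mismatched controller letters or allocation classes produce different device names, consider the following Python sketch. It assumes the usual OpenVMS naming pattern for SCSI disks with a nonzero allocation class, in which the unit number is the SCSI ID multiplied by 100 plus the logical unit number (so that $1$DKB500 is allocation class 1, controller B, SCSI ID 5). The function names and host settings are hypothetical.

```python
def device_name(allocation_class, controller_letter, scsi_id, lun=0):
    """Build a SCSI disk name such as $1$DKB500 (assumed naming pattern)."""
    unit = scsi_id * 100 + lun
    return f"${allocation_class}${'DK'}{controller_letter.upper()}{unit}"

def names_agree(host_a, host_b, scsi_id, lun=0):
    """Check that two hosts derive the same name for one shared disk.

    host_a, host_b -- (allocation_class, controller_letter) tuples
    """
    return (device_name(*host_a, scsi_id, lun)
            == device_name(*host_b, scsi_id, lun))

# Same allocation class and controller letter: both hosts see $1$DKB500.
print(names_agree((1, "B"), (1, "B"), scsi_id=5))
# Different controller letters: $1$DKA500 on one host, $1$DKB500 on the other.
print(names_agree((1, "A"), (1, "B"), scsi_id=5))
```

When the check fails for the system disk, the second host to boot sees the disk as already mounted under a different name and does not mount it.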
Second, SCSI disks on a shared SCSI bus will fail to mount on both systems unless the disk supports tagged command queuing (TCQ). This is because TCQ provides a command-ordering guarantee that is required during OpenVMS Cluster state transitions.
OpenVMS determines that another processor is present on the SCSI bus during autoconfiguration, using the mechanism described in Section A.126.96.36.199. The existence of another host on a SCSI bus is recorded and preserved until the system reboots.
This information is used whenever an attempt is made to mount a non-TCQ device. If the device is on a multihost bus, the mount attempt fails and returns the following message:
If the drive is intended to be mounted by multiple hosts on the same SCSI bus, then it must be replaced with one that supports TCQ.
Note that the first processor to boot on a multihost SCSI bus does not
receive an inquiry response from the other hosts because the other
hosts are not yet running OpenVMS. Thus, the first system to boot is
unaware that the bus has multiple hosts, and it allows non-TCQ drives
to be mounted. The other hosts on the SCSI bus detect the first host,
however, and they are prevented from mounting the device. If two
processors boot simultaneously, it is possible that they will detect
each other, in which case neither is allowed to mount non-TCQ drives on
the shared bus.
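The boot-order effect described above can be modeled in a few lines of Python. This is a simplified sketch: the `Host` class and the `boot` and `can_mount` functions are hypothetical, and the simultaneous-boot case, in which two hosts may detect each other, is not modeled.

```python
class Host:
    def __init__(self, name):
        self.name = name
        # Recorded during autoconfiguration and kept until the next reboot.
        self.saw_other_host = False

def boot(host, running_hosts):
    """Boot a host onto a shared bus.

    Only hosts that are already running OpenVMS answer the inquiry, so the
    first host to boot sees no other processor on the bus.
    """
    host.saw_other_host = len(running_hosts) > 0
    running_hosts.append(host)

def can_mount(host, disk_supports_tcq):
    """A non-TCQ disk mounts only if the host believes the bus is single-host."""
    return disk_supports_tcq or not host.saw_other_host

running = []
first, second = Host("FIRST"), Host("SECOND")
boot(first, running)
boot(second, running)
print(can_mount(first, disk_supports_tcq=False))   # True: unaware of SECOND
print(can_mount(second, disk_supports_tcq=False))  # False: detected FIRST
```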
Having excessive ground offset voltages or exceeding the maximum SCSI
interconnect length can cause system failures or degradation in
performance. See Section A.7.8 for more information about SCSI
Adequate signal integrity depends on strict adherence to SCSI bus
lengths. Failure to follow the bus length recommendations can result in
problems (for example, intermittent errors) that are difficult to
diagnose. See Section A.4.3 for information on SCSI bus lengths.
Only one initiator (typically, a host system) or target (typically, a peripheral device) can control the SCSI bus at any one time. In a computing environment where multiple targets frequently contend for access to the SCSI bus, you could experience throughput issues for some of these targets. This section discusses control of the SCSI bus, how that control can affect your computing environment, and what you can do to achieve the most desirable results.
Control of the SCSI bus changes continually. When an initiator gives a command (such as READ) to a SCSI target, the target typically disconnects from the SCSI bus while it acts on the command, allowing other targets or initiators to use the bus. When the target is ready to respond to the command, it must regain control of the SCSI bus. Similarly, when an initiator wishes to send a command to a target, it must gain control of the SCSI bus.
If multiple targets and initiators want control of the bus simultaneously, bus ownership is determined by a process called arbitration, defined by the SCSI standard. The default arbitration rule is simple: control of the bus is given to the requesting initiator or target that has the highest unit number (that is, the highest SCSI ID).
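The default arbitration rule can be expressed directly. The following Python sketch assumes a narrow (8-ID) bus on which a higher ID always has higher priority. The function names are hypothetical.

```python
def arbitrate(requesting_ids):
    """Return the ID that wins arbitration, or None if nobody is requesting."""
    requesting_ids = list(requesting_ids)
    if not requesting_ids:
        return None
    return max(requesting_ids)  # the highest ID wins

def service_order(requesting_ids):
    """Order in which simultaneous requesters gain the bus if none retries."""
    return sorted(requesting_ids, reverse=True)

print(arbitrate([2, 5, 7]))      # 7
print(service_order([2, 5, 7]))  # [7, 5, 2]
```

Because the highest ID always wins, a device with a low ID can wait a long time for the bus when higher-numbered devices contend for it frequently.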