Ken Moreau, Solutions Architect, OpenVMS Ambassador
This paper surveys the cluster
technologies for the operating systems available from many vendors, including
IBM AIX, HP HP-UX, Linux, HP NonStop Kernel, HP OpenVMS, PolyServe Matrix
Server, Sun Microsystems Solaris, HP Tru64 UNIX and Microsoft Windows
2000/2003. In addition, it discusses
some technologies that operate on multiple platforms, including MySQL Cluster,
Oracle 9i and 10g Real Application Clusters, and Veritas clustering products. It describes the common functions that all of
the cluster technologies perform, shows where they are the same and where they
are different on each platform, and introduces a method of fairly evaluating
the technologies to match them to business requirements. As much as possible, it does not discuss
performance, base functionality of the operating systems, or system hardware.
This paper is intended
for anyone who is technically familiar with one or
more of the clustering products discussed here and who wishes to learn about
one or more of the other clustering products, as well as anyone who is
evaluating various cluster products to find which ones fit a stated business
need.
Clustering technologies are highly interrelated, with almost everything affecting everything
else. This subject is broken down into
five areas:
- Single/multisystem views, which defines how you
manage and work with a system, whether as individual systems or as a single
combined entity.
- Cluster file systems, which defines how you work
with storage across the cluster. Cluster
file systems are just coming into their own in the UNIX world, and this article
describes in detail how they work.
- Configurations, which defines how you assemble a cluster, both physically and logically.
- Application support, which discusses how
applications running on your single standalone system today can take advantage
of a clustered environment. Do they need
to change, and if so how? What benefits
are there to a clustered environment?
- Resilience, which describes what happens when
bad things happen to good computer rooms.
This covers host-based RAID, wide area "stretch" clusters, extended
clusters, and disaster tolerant scenarios.
This article covers
the capabilities of IBM High Availability Cluster Multiprocessing (HACMP) 5.1
for AIX 5L, Linux LifeKeeper V4.3, Microsoft SQL Server 2000 Enterprise
Edition, MySQL Cluster 4.1, HP NonStop Kernel G06.22, HP OpenVMS Cluster
Software V7.3-2, Oracle 9i/10g Real Application Clusters, PolyServe Matrix
Server and Matrix HA for Linux and Windows, HP Serviceguard 11i for HP-UX, HP
Serviceguard A.11.16 for Linux, Sun Microsystems SunCluster 3.1 in a SunPlex
cluster of Solaris 9 servers, HP TruCluster V5.1b, Veritas Cluster Server and
SANPoint Foundation Suite V3.5 and Windows 2000/2003 with the Cluster
Service. It also discusses Microsoft SQL
Server 2005 Enterprise Edition, which offers additional capabilities beyond SQL
Server 2000. SQL Server 2005 has not been
officially released by Microsoft at this time, but it is close enough to release that it is
safe to describe its functionality.
For Linux, the focus is on the high availability side, not the HPTC (for example, Beowulf) technologies.
In order to evaluate
the cluster technologies fairly, you need to understand four terms:
scalability, reliability, availability and manageability.
- Availability defines whether the application
stays up, even when components of the cluster go down. If you have two systems in a cluster and one
goes down but the other picks up the workload, that application is available
even though half of the cluster is down.
Part of availability is failover time, because if it takes 30 seconds
for the application to fail over to the other system, the users on the first
system think that the application is down for those 30 seconds. Any actions that the users are forced to take
as part of this failover, such as logging in to the new system, must be
considered as part of the failover time, because the users are not doing
productive work during that time.
Further, if the application is forced to pause on the surviving system
during the failover, to the users on the second system the application is down
for those 30 seconds.
- Reliability defines how well the system performs
during a failure of some of the components.
If you get subsecond query response and if a batch job finishes in 8
hours with all of the systems in the cluster working properly, do you still get
that level of performance if one or more of the systems in the cluster is
down? If you have two systems in a
cluster and each system has 500 active users with acceptable performance, will
the performance still be acceptable if one of the systems fails and there are
now 1,000 users on a single system? Keep
in mind that the users neither know nor care how many systems there are in the
cluster; they only care whether they can rely on the environment to get their
work done.
Notice that reliability and availability are orthogonal concepts, and it is possible
to have one but not the other. How many
times have you logged into a system (it was available), but it was so slow as
to be useless (it was not reliable)?
- Scalability defines the percentage of useful
performance you get from a group of systems.
For example, if you add a second system to a cluster, do you double
performance, or do you get a few percentage points less than that? If you add a third, do you triple the
performance of a single system, or not?
- Manageability defines how much additional work
it is to manage those additional systems in the cluster. If you add a second system to the cluster,
have you doubled your workload because now you have to do everything
twice? Or have you added only a very
small amount of work, because you can manage the cluster as a single entity?
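The scalability question above can be expressed as a simple ratio. The following Python sketch is an illustration only: the function name `scaling_efficiency` and the throughput figures are hypothetical and are not measurements of any product discussed in this paper.

```python
def scaling_efficiency(single_system_throughput: float,
                       cluster_throughput: float,
                       systems: int) -> float:
    """Fraction of ideal (linear) performance that the cluster delivers."""
    if systems < 1 or single_system_throughput <= 0:
        raise ValueError("need at least one system and a positive baseline")
    ideal = single_system_throughput * systems
    return cluster_throughput / ideal


# Hypothetical figures: one system handles 1,000 transactions per second,
# and a two-system cluster handles 1,900 transactions per second.
efficiency = scaling_efficiency(1000.0, 1900.0, 2)
print(f"{efficiency:.0%} of linear scaling")
```

A result of 100% would mean that the second system doubled the performance; anything lower is the "few percentage points less" mentioned above.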
Multisystem-view clusters generally consist of two systems, where each system is dedicated
to a specific set of tasks. Storage is
physically cabled to both systems, but each file system can only be mounted on
one of the systems. Applications cannot
access the same data from both systems at the same time, and the
operating system files cannot be shared between the two systems. Therefore, a fully independent boot device
(called a "system root" or "system disk") with a full set of operating system
and cluster software files for each system is required.
The physical
environment of a multisystem-view cluster in active-active mode is identical to
that of active-passive mode. Two or more systems are physically cabled to
a common set of storage, but only able to mount each file system on one of the
systems. The difference is that multiple
systems are performing useful work as well as monitoring each other's health. However, they are not running the same application
on the same data, because they are not sharing any files between the systems.
For example, a
database environment could segment its customers into two groups, such as by
the first letter of the last name (for example, A-M and N-Z). Then each group would be set up on separate
disk volumes on the shared storage, and each system would handle one of the
groups. This is known as a "federated"
database. Or one of the systems could be
running the entire database and the other system could be running the
applications that access that database.
In the event of a failure, one system would handle both groups.
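The A-M and N-Z split described above can be sketched as a small routing function. This is a minimal illustration, not the routing logic of any particular database product; the names `owning_system`, `route`, `system_a`, and `system_b` are hypothetical.

```python
def owning_system(last_name: str) -> str:
    """Return the system that normally serves a customer (A-M or N-Z split)."""
    first = last_name.strip()[:1].upper()
    if not first.isalpha():
        raise ValueError(f"cannot route name: {last_name!r}")
    return "system_a" if first <= "M" else "system_b"


def route(last_name: str, available: set[str]) -> str:
    """Route to the owning system, or to a survivor if the owner is down."""
    preferred = owning_system(last_name)
    if preferred in available:
        return preferred
    if not available:
        raise RuntimeError("no systems available")
    # Failover: one surviving system handles both groups.
    return sorted(available)[0]


print(route("Moreau", {"system_a", "system_b"}))  # system_a
print(route("Nguyen", {"system_a", "system_b"}))  # system_b
print(route("Moreau", {"system_b"}))              # system_b (failover)
```

Note that each customer is still served by exactly one system at a time, which is the defining property of this configuration.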
This is called an N+M
cluster, because any of the systems can take over for any of the other
systems. One way to define N and M is to
think about how many tires you have on your automobile. Most people automatically say four, but
including the spare, they really have five tires. They are operating in an N+1 environment,
because four tires are required for minimum operation of the vehicle. A variation is to use the equivalent of the
"donut" tire -- a server that offers limited performance but enough to get by
for a short period of time. This can be
thought of as having 4½ tires on the vehicle.
The key is to define what level of performance and functionality you
require, and then define N and M properly for that environment.
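The tire analogy can be written as a capacity check. The sketch below assumes that each server's capacity can be expressed as a single number relative to a full-size server, which is a simplification; the function name `meets_requirement` and the capacity values are hypothetical.

```python
def meets_requirement(capacities, required, failed_indexes=()):
    """True if the surviving servers still supply the required capacity."""
    failed = set(failed_indexes)
    surviving = sum(c for i, c in enumerate(capacities) if i not in failed)
    return surviving >= required


# Four full-size servers are required for normal operation (N = 4).
full_spare = [1.0, 1.0, 1.0, 1.0, 1.0]    # N+1 with a full-size spare
donut_spare = [1.0, 1.0, 1.0, 1.0, 0.5]   # the "4½ tires" variation

print(meets_requirement(full_spare, 4.0, failed_indexes=[0]))    # True
print(meets_requirement(donut_spare, 4.0, failed_indexes=[0]))   # False
print(meets_requirement(donut_spare, 3.5, failed_indexes=[0]))   # True
```

The last call shows the point of the "donut" configuration: it is sufficient only if you have decided in advance that a reduced level of performance (3.5 here) is acceptable for a short period.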
Failover can be manual
or automatic. The "heartbeat" function can come over the network or from a
private interface. Comparing a single
system to a multisystem-view cluster in active-active mode, the availability,
reliability, scalability, and manageability characteristics are as follows:
- Availability is increased because you now have
multiple systems available to do the work.
As in the active-passive environment, the odds of all systems being
broken at the same time are fairly low but still present.
- Reliability may be increased, but is commonly
decreased. If two systems are each
running at 60% of capacity, the failure of one will force the surviving system
to work at 120% of capacity, which is not optimum because you should never
exceed about 80% of capacity.
- Scalability for any given workload is poor in
this situation because each workload must still fit into one system. There is no way to spread a single
application across multiple systems.
- Manageability is slightly worse than the
active-passive scenario, because you still have two independent systems, as
well as the overhead of the failover scripts and heartbeat.
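The reliability arithmetic above can be checked with a few lines of Python. This sketch assumes that the workload of a failed system is spread evenly across the survivors and that all systems have equal capacity; the function name `utilization_after_failure` is hypothetical.

```python
def utilization_after_failure(utilization: float, systems: int,
                              failed: int = 1) -> float:
    """Per-system utilization once the failed systems' load moves to survivors."""
    survivors = systems - failed
    if survivors < 1:
        raise ValueError("no surviving systems")
    total_load = utilization * systems
    return total_load / survivors


# Two systems at 60% of capacity, one of which fails.
load = utilization_after_failure(0.60, systems=2)
print(f"{load:.0%}")  # 120%
```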
One of the factors
affecting availability is the amount of time it takes to accomplish the
failover of a multisystem-view cluster, whether active-active or
active-passive. The surviving system
must:
- Notice that the other system is no longer available,
which is detected when the "heartbeat" function on the surviving system does
not get an answer back from the failed system.
- Mount the disks that were on the failing
system. Remember that the file systems
are only mounted on one system at a time: this is part of the definition of
multisystem-view clusters. The surviving
system must mount the disks that were mounted on the other system, and then
possibly perform consistency checking on each volume. If you have a large number of disks, or large
RAID sets, this could take a long time.
- Start the applications that were active on the failing system.
- Initiate the recovery sequence for that
software. For databases, this might
include replaying the journal logs in order to resolve any in-flight
transactions that the failing system was performing at the time of the failure.
In large environments,
it is not unusual for this operation to take 30-60 minutes. During this recovery time, the applications
that were running on the failed system are unavailable, and the applications
that were running on the surviving system are not running at full speed,
because the single system is now doing much more work.
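The first step in the list above, noticing that the other system has stopped answering, can be sketched as a timeout on the most recent heartbeat. This is a minimal illustration and not the mechanism of any specific cluster product; the class name `HeartbeatMonitor` and the five-second timeout are assumptions.

```python
import time


class HeartbeatMonitor:
    """Declares a peer failed after `timeout` seconds without a heartbeat."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = self.clock()

    def peer_failed(self) -> bool:
        """True once the peer has been silent for longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout


monitor = HeartbeatMonitor(timeout=5.0)
monitor.record_heartbeat()
print(monitor.peer_failed())  # False immediately after a heartbeat
```

The timeout is a trade-off: a short value shortens the detection part of the failover time, while a long value reduces the chance of declaring a busy but healthy system dead.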
In contrast,
single-system-view clusters offer a unified view of the entire cluster. All systems are physically cabled to all
shared storage and can directly mount all shared storage on all systems. This means that all systems can run all
applications, see the same data on the same partitions, and cooperate at a very
low level. Further, it means that the
operating system files can be shared in a single "shared root" or "shared
system disk," reducing the amount of storage and the amount of management time
needed for system maintenance. There may
be spare capacity, but there are no spare systems. All systems can run all applications at all
times.
In a single-system-view cluster, there can be many systems. Comparing a series of independent systems to
the same number of systems in a single-system-view cluster, the availability,
reliability, scalability, and manageability characteristics are as follows:
- Availability is increased because you now have
multiple systems to do the work. The
odds of all systems being broken at the same time are now much lower, because
potentially you can have many systems in the cluster.
- Reliability is much better, because with many
systems in the cluster, the workload of a single failed system can be spread
across many systems, increasing their load only slightly. For example, if each system is running at 60%
capacity and one server out of four fails, one third of the failed server's load is placed on each of
the other systems, raising their utilization to 80% of capacity, which will
not affect reliability significantly.
- Scalability is excellent because you can spread
the workload across multiple systems. If
you have an application that is simply too big for a single computer system
(even one with 64 or 128 CPUs and hundreds of gigabytes of memory and dozens of
I/O cards), you can have it running simultaneously across many computer
systems, each with a large amount of resources, all directly accessing the same
data.
- Manageability is much easier than the equivalent
job of managing this number of separate systems, because the entire cluster is
managed as a single entity. There is no
increase in management workload even when you have many systems.
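The reliability example above can be turned around to ask how heavily each system may be loaded if the cluster must survive a failure. The sketch below assumes equal-capacity systems, an even redistribution of load, and the guideline of roughly 80% maximum utilization mentioned earlier; the function name `max_safe_utilization` is hypothetical.

```python
def max_safe_utilization(systems: int, ceiling: float = 0.80,
                         failures: int = 1) -> float:
    """Highest normal utilization that stays under `ceiling` after failures."""
    survivors = systems - failures
    if survivors < 1:
        raise ValueError("no surviving systems")
    return ceiling * survivors / systems


for n in (2, 4, 8):
    print(n, f"{max_safe_utilization(n):.0%}")
```

With four systems the result is 60%, which matches the example above. With only two systems the result is 40%, which shows why larger clusters can run each system closer to its capacity.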
The advantages in
failover times over multisystem-view clusters come from not having to do quite
so much work during a failover:
- The surviving systems must detect the failure of
the system. This is common between the
two types of clusters.
- The surviving systems do not have to mount the
disks from the failed system; they are already mounted.
- The surviving systems do not have to start the
applications; they are already started.
- The execution of the recovery script is common
between the two schemes, but it can begin almost instantly in the
single-system-view cluster case. The
application recovery time will be similar on both types of clusters, but if you
have a large number of small systems, you can achieve parallelism even in
recovery, so that your recovery can be faster in this case as well.
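The comparison above can be summarized by adding up the failover phases for each cluster type. The durations in this sketch are hypothetical placeholders chosen only to show the structure of the calculation; they are not measurements, and the class name `FailoverPhases` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FailoverPhases:
    """Seconds spent in each phase of a failover."""
    detect: float    # notice the missing heartbeat
    mount: float     # mount the failed system's disks
    start: float     # start the applications
    recover: float   # run application recovery

    def total(self) -> float:
        return self.detect + self.mount + self.start + self.recover


# Hypothetical values. In the single-system-view case the disks are already
# mounted and the applications are already running, so those phases are zero.
multisystem_view = FailoverPhases(detect=10, mount=600, start=120, recover=900)
single_system_view = FailoverPhases(detect=10, mount=0, start=0, recover=900)

saved = multisystem_view.total() - single_system_view.total()
print(f"time saved: {saved} seconds")
```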
One criticism of
shared root environments with a single root for the entire cluster is that this
represents a single point of failure. If
a hardware failure causes the shared root device to be inaccessible, or an
operator error causes corruption on the shared root (such as applying a patch
incorrectly or deleting the wrong files), the entire cluster will be
affected. These concerns must be
balanced against the amount of work involved in maintaining multiple system roots. Furthermore, an incorrect patch on one system
root can cause incompatibility with the other cluster members. Such problems can be difficult to diagnose.
The system
administrator must set up the operational procedures (including the number of
shared roots) for their environment in such a way that the possibility of
failure is minimized, and services are still delivered in a cost-effective
manner. Frequent backups, hardware and
software RAID, and good quality assurance and testing procedures can help
reduce the possibility of failure in either environment.
Now that the terms are defined, you can see how different cluster products work.
| Product (platforms) | Multisystem view | Single-system view | Shared root |
|---|---|---|---|
| HACMP (AIX, Linux) | Yes | No | No |
| LifeKeeper (Linux, Windows) | Yes | No | No |
| MySQL Cluster (AIX, HP-UX, Linux, Solaris, Windows) | Yes (MySQL Server) | Yes | No |
| NonStop Kernel | Yes | Yes | Each node (16 CPUs) |
| OpenVMS Cluster Software (OpenVMS) | Yes | Yes | Yes |
| Oracle 9i/10g RAC (many operating systems) | Yes (Oracle DB) | Yes | Effectively yes ($ORACLE_HOME) |
| PolyServe Matrix (Linux, Windows) | Yes | Yes | No |
| Serviceguard (HP-UX, Linux) | Yes | No | No |
| SQL Server 2000/2005 (Windows) | Yes | No | No |
| SunCluster (Solaris) | Yes | No | No |
| TruCluster (Tru64 UNIX) | No | Yes | Yes |
| Veritas Cluster Server (AIX, HP-UX, Linux, Solaris, Windows) | Yes | No | No |
| Windows 2000/2003 Cluster Service (Windows) | Yes | No | No |
Figure 1 Types of Clusters
HACMP
High Availability
Cluster Multiprocessing (HACMP) 5.1 for AIX 5L runs on the IBM pSeries
(actually an RS/6000 using the Power4 chip), and for Linux runs on a variety of
platforms. It is a multisystem-view
cluster, where each system in the cluster requires its own system disk. Management is done either through the
included Cluster Single Point Of Control (C-SPOC) or by the layered product
Cluster Systems Management (CSM), which can manage mixed AIX and Linux systems
in the same cluster. In both cases, you
issue commands one time and they are propagated to the different systems in the
cluster. Clusters can be configured
either as active-passive (which IBM calls "standby") or active-active (which
IBM calls "takeover") configurations.
Previous versions of
HACMP came in two varieties: HACMP/ES (Enhanced Scalability) and HACMP
(Classic). V5.1 includes all of the
features of HACMP/ES.
Linux Clustering
Linux clustering is
focused either on massive system compute farms (Beowulf and others) or a high
availability clustering scheme. This
article specifically does not address the High Performance Technical Computing
market here, which breaks down a massive problem into many (hundreds or
thousands) tiny problems and hands them off to many (hundreds or thousands)
small compute engines. This is not
really a high availability environment because if any of those compute engines
fails, that piece of the job has to be restarted from scratch.
Most of the Linux high
availability efforts are focused on multisystem-view clusters consisting of a
small number of systems from which applications can fail over from one system
to the other. Cluster file system
projects such as Lustre and GFS are discussed later, but these do not offer
shared root, so systems in Linux clusters require individual system disks.
There are some other
projects that are focused on single-system-view clusters. One of these is the work being done by HP as
part of the Single System Image Linux project.
Another is from Qlusters Corporation, specifically the ClusterFrame XHA
and ClusterFrame SSI products based on OpenMosix. At this time these are focused on the HPTC
market, but if they prove themselves in the commercial high availability
market space, they could offer capabilities that match or exceed
those of the other clustering products. Visit http://openssi.org
for more information on the HP project, and http://www.qlusters.com
for more information on ClusterFrame XHA and SSI.
MySQL Cluster
MySQL Cluster is a
layer on top of MySQL, the open source database that runs on AIX, HP-UX, Linux
(Red Hat and SUSE), Mac OS X, Windows 2000 and XP, and is being planned for
OpenVMS. The software and intellectual
property were acquired from Ericsson and were integrated as open source into
the Storage Engine of MySQL Server.
There are three
components to a MySQL Cluster: application nodes, database server or storage
nodes, and management nodes. Application
nodes run MySQL Server and connect to the database server nodes running MySQL
Cluster, and are managed by the management nodes. The different nodes can either be processes
on a single server or distributed on multiple servers. MySQL Cluster is designed to work on "shared
nothing" operating systems, where each node has private storage.
MySQL offers a
multisystem view of the database, and MySQL Cluster adds single-system
view. It does not support sharing of
disks, but transparently fragments the database over the systems in the cluster
with real-time replication, so that the
database information can be accessed from any system in the cluster.
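The idea of fragmenting a database across private storage with replication can be illustrated with a toy placement function. This sketch is not MySQL Cluster's actual algorithm; it assumes a simple hash of the row key and a configurable number of copies, and the names `placement`, `node1` through `node4`, and `customer-42` are hypothetical.

```python
import zlib


def placement(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct storage nodes for a row, based on its key."""
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")
    # crc32 is used because it gives the same result on every system.
    start = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


storage_nodes = ["node1", "node2", "node3", "node4"]
copies = placement("customer-42", storage_nodes)
print(copies)  # two distinct nodes that each hold a copy of the row
```

Because every node can compute the same placement, a request arriving at any system can be directed to a node that holds the data, even though no disks are shared.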
NonStop Kernel
NonStop Kernel (NSK,
formerly the Tandem Guardian operating system) runs on the HP NonStop servers
(formerly NonStop Himalaya or Tandem Himalaya servers), and is configured as a
single-system-view cluster. It offers
true linear scalability as you add processors to the environment, because of
the shared-nothing architecture and superb cluster interconnect. From 2 to 16 processors can be configured to have
a shared root and be considered one system.
A cluster of systems, both local and geographically distributed, is
centrally managed with the Open Systems Manager (OSM) console.
OpenVMS Cluster Software
OpenVMS Cluster
Software has always been the gold standard of clustering, with almost linear
scalability as you add systems to the cluster.
It can be configured as either multisystem view or single-system view,
although the most common is single-system view.
It supports single or multiple system disks.
Oracle 9i/10g Real Application Clusters
Oracle 9i/10g Real
Application Clusters (RAC) is the next generation of Oracle Parallel Server,
and runs on the Oracle 9i and 10g database on every major computing
platform. It offers a single-system view
of the database files, such that external applications can connect to the
database instance on any of the systems in the cluster. It does not offer a multisystem view of the
database, but this is easily achieved by simply running the database without RAC.
Oracle achieves the
functionality of a shared root (called $ORACLE_HOME), but accomplishes it
differently on the different host operating systems. On single-system-view operating systems that
offer clustered file systems, $ORACLE_HOME is placed in a shared volume and
made available to all of the systems in the cluster. On multisystem-view operating systems that do
not offer clustered file systems, Oracle replicates all of the operations to
individual volumes, one per system in the cluster, without forcing the user to
take any action. The installation,
patches, and monitoring are the same whether there is one $ORACLE_HOME or
multiple, replicated $ORACLE_HOMEs.
Oracle is steadily
adding functionality to RAC, so that it requires less support from the base
operating systems. For example, 9i RAC
required the addition of Serviceguard Extensions for RAC (SGeRAC) on HP-UX,
while 10g RAC does not require SGeRAC.
Further, 10g RAC is capable of running without the underlying operating
system itself being clustered. As a
result, HACMP, Serviceguard, SunCluster, and Windows 2000/2003 Cluster Service
are now optional for Oracle 10g RAC.
PolyServe Matrix HA and Matrix Server
PolyServe Matrix
Server is a clustered file system for Linux and Windows which includes a high
availability and a cluster management component. The HA component provides automated failover
and failback of applications. Each node
in a PolyServe cluster requires its own system disk, which can be local or SAN
boot. Matrix Server allows the
underlying disk volumes to be accessed for read-write simultaneously from all
nodes. It also allows a unified view of
device management, such that device names are common across all systems in the
cluster regardless of the order that the devices were discovered during a
boot. The management application is CLI
and GUI based, and allows the cluster to be managed as a single entity from any
node in the cluster. Matrix Server is
primarily an installable file system, and so does not itself offer a
multisystem view because the underlying operating systems offer that as the
default. Similarly, Matrix Server does
not offer a shared root, because it is a layer on top of the operating system
and is activated late in the boot process.
Serviceguard
Serviceguard (also
known as MC/Serviceguard) is a multisystem-view failover cluster. Each system in a Serviceguard cluster
requires its own system disk. There are
excellent system management capabilities from the Service Control Manager and
the Event Management Service, including the ability to register software in the
System Configuration Repository, get system snapshots, compare different
systems in the cluster, and install new instances of the operating system and
applications by copying existing instances using Ignite-UX. It is also well integrated with HP OpenView.
Serviceguard Manager
can configure, administer, and manage HP-UX and Linux Serviceguard clusters
through a single interface. Each cluster
must be homogeneous; that is, each cluster
can only be running one operating system.
Business continuity solutions to achieve disaster tolerance are
available. HP-UX offers Campuscluster,
Metrocluster, and Continentalcluster.
Metrocluster functionality is offered on Linux through Serviceguard for
Linux integration with Cluster Extension XP.
Additional complementary products on Linux include Serviceguard
Extension for SAP for Linux and an application toolkit for Oracle. Contributed toolkits are available for
leading Linux applications.
SQL Server 2000/2005 Enterprise Edition
Microsoft SQL Server
2000 Enterprise Edition is a multisystem-view failover clustered database,
running on Microsoft Windows 2000/2003.
SQL Server 2005 is the next release of this product, and is available on
Windows 2003. They are available in both
32-bit and 64-bit versions for the various hardware platforms. They provide both manual and automatic
failover of database connections between servers. A database can be active on only a single
instance of SQL Server, and each server requires its own installation. Unless specifically noted, all references to
functionality in this article apply to both versions equally.
SunCluster
SunCluster 3.1 is a
multisystem-view failover cluster. A
group of Solaris servers running SunCluster software is called a SunPlex
system. Each system in a SunPlex requires
its own system disk, and Sun recommends keeping the "root" passwords the same
on all systems. This has to be done
manually, which gives you some idea
about the level of management required by a SunPlex. The Cluster File System (CFS) offers a
single-system view of those file systems that are mounted as a CFS. The Sun Management Center and SunPlex Manager
are a set of tools that manage each system as a separate entity but from a
centralized location.
TruCluster V5.1b
TruCluster V5.1b
represents a major advance in UNIX clustering technology. It can only be configured as a
single-system-view cluster. The clustering focus
is on managing a single system or a large cluster in exactly the same way, with
the same tools, and roughly the same amount of effort. It offers a fully-shared root and a single
copy of almost all system files.
Veritas
Veritas offers several
products in this area, but then offers many combinations of these products
under separate names. The base products
are:
- Veritas Cluster Server (VCS) manages systems in
a cluster through a GUI. It is
unique in that it can manage multiple different clusters at a time. It can simultaneously manage systems running
AIX, HP-UX, Linux, Solaris, and Windows running the Veritas Cluster Server
software. Each cluster must be
homogeneous; that is, each cluster can only be running one operating system.
- Veritas File System (VxFS) is a journaled file
system that works in either a standalone system or a cluster. A "light" version of this product is included
with HP-UX 11i Foundation Operating Environment, and the full version is
included in the Enterprise Operating Environment as Online JFS.
- Veritas Global Cluster Manager (GCM) manages
geographically-distributed Veritas Cluster Server clusters from a central console. Applications can be monitored across multiple
clusters at multiple sites, and can be migrated from one site to another. The application service groups in each
cluster must be set up by VCS, but can then be monitored and migrated through
GCM.
- Veritas Volume Manager (VxVM) manages volumes,
whether they are file systems or raw devices.
A "light" version of this product is included with HP-UX 11i, and offers functionality similar to that of the HP-UX
11i Logical Volume Manager and the TruCluster Logical Storage Manager.
- Veritas Cluster Volume Manager (CVM) offers the
same functionality as VxVM but does it across multiple systems in a
cluster. An important distinction is
that CVM requires that every system mount every shared volume.
- Veritas Volume Replicator (VVR) allows disk
volumes to be dynamically replicated by the host, both locally and
remotely. This is similar to the
"snap/clone" technology in the StorageWorks storage controllers.
Veritas combines these into many different packages. The two important ones for this discussion are:
- SANPoint Foundation Suite - HA (SPFS - HA),
which includes VxFS with cluster file system extensions, VxVM with cluster
extensions, and the Veritas Cluster Server
- Veritas DataBase Extension Advanced Cluster
(DBE/AC) for Oracle 9i/10g Real Application Clusters (RAC), which includes
VxFS, VxVM, and CVM, along with an implementation of the Oracle Disk Manager
(ODM) API for Oracle to use to manage the volumes
Veritas NetBackup (NBU) is not a cluster technology; therefore, it is not addressed in this paper.
Most of these products
run under AIX, HP-UX, Linux, Solaris, and Windows, but SANPoint Foundation
Suite - HA runs only under HP-UX and Solaris.
Check with Veritas for specific versions and capabilities of the software
for specific versions of the operating systems, and look for more discussion of
these in later sections of this paper.
In some cases the products replace the operating system's clusterware
(Cluster Server, Cluster Volume Manager), and in other cases they are enhancements
to the operating system's products (Cluster File System, Volume
Replicator). All of the products are
offered by both HP and Veritas, and supported by either company through the
Cooperative Service Agreement (ISSA).
Windows 2000/2003 Cluster Service
Windows 2000/2003
DataCenter is a multisystem-view failover cluster. Applications are written to fail over from
one system to another. Each system in a
Windows 2000/2003 cluster requires its own system disk, but the Cluster
Administrator tool can centralize the management of the cluster.
Cluster file systems
are how systems communicate with the storage subsystem in the cluster. There are really two technologies here: one
addresses how a group of systems communicates with volumes that are physically
connected to all of the systems, and the other addresses how a group of systems
communicates with volumes that are only physically connected to one of the
systems.
Network I/O allows all
of the systems in a cluster to access data, but in a very inefficient way that
does not scale well in most implementations.
Let's say that volume A is a disk or tape drive which is physically
cabled to a private IDE or SCSI adapter on system A. It cannot be physically accessed by any other
system in the cluster. If any other system in the cluster wants to access files
on the volume, it must do network I/O, usually by some variation of NFS.
Specifically, if
system B wants to talk to the device that is mounted on system A, the network
client on system B communicates to the network server on system A in the
following way:
- An I/O connection is initiated across the cluster interconnect from system B to system A.
- System A receives the request, and initiates the I/O request to the volume.
- System A gets the data back from the volume,
and then sends it back to system B across the cluster interconnect.
Notice that there are
three I/Os for each disk access. For NFS, there is also significant locking
overhead with many NFS clients. This
leads to poor I/O performance in an active-active system.
Every system offers
network I/O in order to deal with single-user devices that cannot be shared,
such as tapes, CD-ROM, DVD, or diskettes, and to allow access to devices that
are on private communications paths, such as disks on private IDE or SCSI
busses. This type of access is known as a
"proxy file system."
In contrast, direct
access I/O (also known as "concurrent I/O") allows each system to independently
access any and all devices, without going through any other node in the
cluster. Notice that this is different
from UNIX direct I/O, which simply bypasses the file system's cache. Most database systems do direct I/O both in a
clustered and non-clustered environment, because they are caching the data anyway,
and don't need to use the file system's cache.
Implementing direct
access I/O allows a cluster file system to eliminate two of the three I/Os
involved in the disk access in network I/O, because each system talks directly
over the storage interconnect to the volumes.
It also provides full file system transparency and cache coherency
across the cluster.
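The difference in I/O counts between the two approaches can be sketched in a few lines of Python. This is only a toy accounting model: the function names and the idea of counting one "I/O" per hop are illustrative assumptions, not measurements of any product.

```python
def network_io_steps(requester: str, owner: str) -> list:
    """List the I/Os needed when `requester` reads a volume cabled only to `owner`."""
    return [
        f"{requester} -> {owner}: request over the cluster interconnect",
        f"{owner} -> volume: I/O on the private storage bus",
        f"{owner} -> {requester}: data returned over the cluster interconnect",
    ]


def direct_access_io_steps(requester: str) -> list:
    """List the I/Os needed when `requester` is itself cabled to the shared volume."""
    return [f"{requester} -> volume: I/O over the storage interconnect"]


network = network_io_steps("system B", "system A")
direct = direct_access_io_steps("system B")
print(len(network), "I/Os with network I/O")
print(len(direct), "I/O with direct access I/O")
```

In this model the two interconnect hops disappear under direct access I/O, which is the saving described above.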
You may object that we
could overwhelm a single disk with too many requests. This is absolutely true, but this is no
different from the same problem with other file systems, whether they are clustered
or not. Single disks, and single
database rows, are inevitably going to become bottlenecks. You design and tune around them on clusters
in exactly the same way you design and tune around them on any other
single-member operating system, using the knowledge and tools you use now.
These technologies are
focused on the commercial database environments. But in the High Performance Technical
Computing (HPTC) environment, the requirements are slightly different. The IBM General Parallel File System (GPFS)
offers direct access I/O to a shared file system, but focuses on the HPTC model
of shared files, which differs from the commercial database model of shared
files in the following ways:
- The commercial model optimizes for a small
number of multiple simultaneous writers to the same area (byte range, record or
database row) of a shared file, but assumes that this occurs extremely
frequently, because commercial databases and applications require this
functionality.
- The HPTC model optimizes for throughput because,
while the number of multiple simultaneous writers to any given file may be
large (hundreds or even thousands of systems), the applications are designed so
that only one process is writing to any given byte range. In the unlikely event of multiple writers to
a single byte range of a shared file, the HPTC model switches to network I/O
semantics, and ships all of the data to a single master system for that byte
range. This has been found to be more
efficient overall because the condition occurs so infrequently in the HPTC
world.
This paper focuses on the commercial database environment.
The I/O attributes of cluster products are summarized in the following table.
| Product (O/S) | Network I/O | Direct Access I/O | Distributed Lock Manager |
|---|---|---|---|
| HACMP (AIX, Linux) | Yes | Raw devices and GPFS | Yes (API only) |
| LifeKeeper (Linux, Windows) | NFS | Supplied by 3rd parties | Supplied by 3rd parties |
| MySQL Cluster (AIX, HP-UX, Linux, Solaris, Windows) | No (supplied by native O/S) | Yes (effectively) | Yes (for database only) |
| NonStop Kernel | Data Access Manager | Effectively Yes | Not applicable |
| OpenVMS Cluster Software (OpenVMS) | Mass Storage Control Protocol | Files-11 on ODS-2 or -5 | Yes |
| Oracle 9i/10g RAC (Many O/S's) | No (supplied by native O/S) | Yes, both raw devices & Oracle file systems | Yes |
| PolyServe Matrix (Linux, Windows) | No (supplied by native O/S) | Yes | Yes (for file system only) |
| Serviceguard (HP-UX, Linux) | Yes | Supplied by 3rd parties | Supplied by 3rd parties |
| SQL Server 2000/2005 (Windows) | No (supplied by the native O/S) | No | No |
| SunCluster (Solaris) | Yes | Supplied by 3rd parties | Supplied by 3rd parties |
| TruCluster (Tru64 UNIX) | Device Request Dispatcher | Cluster File System (requires O_DIRECTIO) | Yes |
| Veritas SPFS (HP-UX, Solaris) | No (supplied by the native O/S) | Yes (SPFS or DBE/AC) | Yes (SPFS or DBE/AC) |
| Windows 2000/2003 Cluster Service (Windows) | NTFS | Supplied by 3rd parties | Supplied by 3rd parties |
Figure 2 Cluster I/O Attributes
Every system in the
world can do network I/O in order to share devices that are on private storage
busses.
HACMP, LifeKeeper, and
Serviceguard do network I/O using NFS; NonStop Kernel does it with the Data
Access Manager (DAM, also called the "disk process" or DP2); OpenVMS Cluster
Software does it with the Mass Storage Control Protocol (MSCP); SunCluster does
it with NFS or the Cluster File System; TruCluster does it both with the Device
Request Dispatcher (DRD) and the Cluster File System; and Windows 2000/2003
does it with NTFS and Storage Groups.
MySQL, Oracle, SQL Server, and Veritas use the native I/O system of
the operating system on which they are running.
The more interesting case is direct access I/O.
HACMP offers direct
access I/O to raw devices for two to eight systems in a cluster. However, HACMP does not itself handle the
locks for raw devices. Instead, it
requires that applications use the Cluster Lock Manager APIs to manage
concurrent access to the raw devices.
The Concurrent Logical Volume Manager provides "enhanced concurrent
mode," which allows management of the raw devices through the cluster
interconnect, which should not be confused with a cluster file system as
it applies only to raw devices.
On Linux, HP and Cluster File Systems Inc. are working on a project
for the US Department of Energy
to enhance the Lustre File System, originally developed at Carnegie Mellon
University. This enhancement is focused
on high-performance technical computing environments and is called the Scalable
File Server. It uses Linux and Lustre
to offer high throughput and high availability for storage, but does not expose
this clustering to the clients.
MySQL Cluster does not
offer direct access I/O, but it achieves the same effect by fragmenting the
database across the systems in the cluster and allowing access to all data from
any application node in the cluster.
This provides a unified view of the database to any application that
connects to the MySQL Cluster. Each
database node is responsible for some section of the database, and when any
data in that section is updated, the database nodes synchronously replicate the
changed information to all other database nodes in the cluster. It is in fact a "fragmented" (using MySQL
terminology) and a "federated" (using Microsoft and Oracle terminology) database,
and yet it behaves as a single-system image database.
NonStop Kernel is
interesting because, strictly speaking, all of the I/O is network I/O. But because of the efficiencies and
reliability of the NSK software and cluster interconnect, and the ability of
NSK to transparently pass ownership of the volume between CPUs within a system,
it has all of the best features of direct access I/O without the poor
performance and high overhead of all other network I/O schemes. Effectively, NSK offers direct access I/O,
even though it is done using network I/O.
The NonStop Kernel (including NonStop SQL) utilizes a "shared-nothing"
data access methodology. Each processor
owns a subset of disk drives whose access is controlled by the Data Access
Manager (DAM) processes. The DAM
controls and coordinates all access to the disk so a DLM is not needed.
OpenVMS
Cluster Software extends the semantics of the Files-11 file system
transparently into the cluster world, offering direct I/O to any volume in the
cluster from any system in the cluster that is physically connected to the
volume. For volumes that are not
physically connected to a specific system, OpenVMS Cluster Software
transparently switches to network I/O.
Opening a file for shared access by two processes on a single system,
and opening the same file for shared access by two processes on two different
systems in a cluster, works identically.
In effect, all file operations are automatically cluster-aware.
Oracle 9i/10g RAC does
not offer network I/O, but requires that any volume containing database files
be shared among all systems in the cluster that connect to the shared
database. 9i/10g RAC offers direct
access I/O to raw devices on every major operating system, with the exception
of OpenVMS, where it has used the native clustered file system for many years
(starting with the original version of Oracle Parallel Server). Oracle has implemented its own Oracle
Clustered File System (OCFS) for the database files on Linux and Windows as
part of Oracle 9i RAC 9.2, and is extending OCFS to other operating systems in
Oracle 10g as part of Automatic Storage Management (ASM).
In general, the OCFS
cannot be used for the Oracle software itself ($ORACLE_HOME), but can be used
for the database files. The following
table shows which cluster file systems can be used for Oracle software and
database files:
| Cluster product | Oracle software | Oracle database files |
|---|---|---|
| HACMP/ES on AIX | Local file system only | Raw, GPFS |
| LifeKeeper on Linux | Local file system only | Raw, OCFS |
| OpenVMS Cluster SW | OpenVMS cluster file system | OpenVMS cluster file system |
| Serviceguard on HP-UX | Local file system only | Raw, Veritas DBE/AC |
| Serviceguard on Linux | Local file system only | Raw, OCFS |
| SunClusters on Solaris | Solaris GFS | Raw, Veritas DBE/AC (Solaris GFS is not supported for database files) |
| TruCluster on Tru64 UNIX | TruCluster CFS | Raw, TruCluster CFS |
| Windows 2000/2003 Cluster | OCFS | Raw, OCFS |
Figure 3 Cluster File Systems for Oracle
"Local file system
only" means that the Oracle software ($ORACLE_HOME) cannot be placed on a
shared volume; each server requires its own copy, as described above. Interestingly, the Solaris Global File
Service does support a shared $ORACLE_HOME, but does not support shared Oracle
database files.
PolyServe Matrix
Server does not offer network I/O as such, because it is available in the
underlying operating systems. Matrix
Server performs direct access I/O to any volume of the cluster file system
under its control and uses its distributed lock manager to perform file locking
and cache coherency. It supports on-line
addition and removal of storage, and the meta-data is fully journaled. PolyServe Matrix Server and OpenVMS Cluster
Software are the only cluster products with fully distributed I/O
architectures, with no single master server for I/O. PolyServe manages the lock structure for each
file system independently of the lock structures for any other file systems, so
there is no bottleneck or single point of failure.
Serviceguard and
Windows 2000/2003 Cluster Service do not offer a direct access I/O methodology
of their own, but rely on 3rd party tools such as Oracle raw devices
or the Veritas SANpoint Foundation Suite - High Availability. Serviceguard uses an extension to the
standard Logical Volume Manager for clusters, called the Shared Logical Volume
Manager (SLVM) to create volumes that are shared among all of the systems in
the cluster. Notice that this only
creates the volume groups: the access to the data on those volumes is the
responsibility of the application or the 3rd party cluster file system.
SQL Server 2000/2005 does not offer direct access I/O.
SunCluster
does not offer direct access I/O in its cluster file system (Global File
Service, or GFS), which simply allows access to any device connected to any
system in the cluster, independent of the actual path from one or more systems
to the device. In this way devices on
private busses such as tapes or CD-ROMs can be accessed transparently from any
system in the SunPlex. The GFS is a
proxy file system for the underlying file systems, such as UFS or JFS, and the
semantics of the underlying file system are preserved (that is, applications
see a UFS file system even though it was mounted as GFS). Converting a file system to a GFS destroys
any information about the underlying file system. GFSs can only be mounted cluster-wide, and
cannot be mounted on a subset of the systems in the cluster. There must be entries in the /etc/vfstab file
on each system in the cluster, and they must be identical. (SunCluster does not provide any checks on
this or tools to help manage this.)
Multiported
disks can also be part of the GFS, but Sun recommends that only two systems be
connected to a multiported disk at a time (see below). Secondary systems are checkpointed by the
primary system during normal operation, which causes significant cluster
performance degradation and memory overhead.
The master system performs all I/O to the cluster file system upon
request by the other systems in the cluster, but cache is maintained on all
systems that are accessing it.
SunCluster
manages the master and secondary systems for multiported disks in a list of
systems in the "preferenced" property, with the first system being the master,
the next system being considered the secondary, and the rest of the systems
being considered spares. If the master
system fails, the next system on the "preferenced" list becomes the master
system for that file system and the first spare becomes the secondary. This means that the "preferenced" list must
be updated whenever systems are added to or removed from the cluster, even
during normal operation.
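The role assignment described above can be sketched as follows. This is a minimal illustration, not Sun's implementation: the `assign_roles` helper and its dictionary layout are hypothetical, and the only rule modeled is that surviving systems keep their order in the "preferenced" list.

```python
def assign_roles(preferenced, failed):
    """Assign master/secondary/spare roles from an ordered preference list.

    Hypothetical helper: surviving systems keep their list order; the first
    becomes master, the second becomes secondary, and the rest are spares.
    """
    alive = [name for name in preferenced if name not in failed]
    return {
        "master": alive[:1],
        "secondary": alive[1:2],
        "spares": alive[2:],
    }


preferenced = ["A", "B", "C", "D"]
normal = assign_roles(preferenced, failed=set())
after_failure = assign_roles(preferenced, failed={"A"})
print(normal)         # A is master, B is secondary, C and D are spares
print(after_failure)  # B is master, C is secondary, D is the only spare
```

Because the roles are derived purely from list order, a stale list would promote the wrong system, which is why the list has to be kept current as the cluster membership changes.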
TruCluster
offers a cluster file system that allows transparent access to any file system
from any system in the cluster. However,
all write operations, as well as all read operations on files smaller than 64K
bytes, are done by the CFS server system upon request by the CFS client
systems. Thus, TruCluster generally acts
as a proxy file system using network I/O.
The only exceptions are applications that have been modified to open the
file with O_DIRECTIO. Oracle is the only
application vendor that has taken advantage of this.
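The routing rule described above can be written as a small decision function. This is a sketch of the rule as stated in the text, not TruCluster code: the function name, the return labels, and the assumption that reads of files of 64K bytes or larger bypass the CFS server are illustrative.

```python
SMALL_FILE_LIMIT = 64 * 1024  # the 64K-byte threshold quoted in the text


def io_path(operation, file_size, opened_with_directio=False):
    """Return which path a CFS client's request takes under the stated rule.

    "cfs-server"    -> the request is proxied through the CFS server system
    "client-direct" -> the client performs the I/O itself (assumed behavior)
    """
    if opened_with_directio:
        # Files opened with O_DIRECTIO are the stated exception.
        return "client-direct"
    if operation == "write":
        return "cfs-server"
    if operation == "read" and file_size < SMALL_FILE_LIMIT:
        return "cfs-server"
    return "client-direct"


print(io_path("write", 10 * 1024 * 1024))
print(io_path("read", 4 * 1024))
print(io_path("read", 10 * 1024 * 1024))
print(io_path("write", 10 * 1024 * 1024, opened_with_directio=True))
```

Only the last two calls avoid the CFS server, which illustrates why the text characterizes TruCluster as generally acting like a proxy file system.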
Veritas
offers a cluster file system in two different products. The SANPoint Foundation Suite - High
Availability (SPFS - HA) enhances VxFS with cluster file system extensions on
HP-UX and Solaris, providing direct I/O to any volume from any system in the
cluster that is physically connected to the volume. SPFS requires that any volume managed this
way be physically connected to every system in the cluster. This offers direct I/O functionality for the
general case of file systems with flat files.
For Oracle 9i/10g RAC, the Veritas Database Edition/Advanced Cluster
(DBE/AC) for 9i/10g RAC supports direct access I/O to the underlying VxFS. Note that if you do not use either SPFS - HA
or DBE/AC, the Veritas Volume Manager defines a volume group as a "cluster disk
group" (a special case of a "dynamic disk group"); this is the only type of
volume that can be moved from one system in a cluster to another during
failover. This is not a cluster file
system, since Veritas emphasizes that only one system in a cluster can make use
of the cluster disk group at a time.
All of the
above systems that implement direct access I/O use a "master" system to perform
meta-data operations. Therefore,
operations like file creations, deletions, renames, and extensions are
performed by one of the systems in the cluster, but all I/O inside the file or
raw device can be performed by any of the systems in the cluster using direct
access I/O. OpenVMS Cluster Software and
PolyServe have multiple "master" systems to optimize throughput and reduce
contention.
An advantage of direct access I/O, whether implemented with a file system or with
raw devices, is that it allows applications to be executed on any system in the
cluster without having to worry about whether the resources they need are
available on a specific system. For
example, batch jobs can be dynamically load balanced across all of the systems
in the cluster, and are more quickly restarted on a surviving system if they
were running on a system that becomes unavailable. Notice that the availability of the resources
does not address any of the recovery requirements of the application, which
must be handled in the application design.
Every operating system
has a lock manager for files in a non-clustered environment. A distributed lock manager simply takes this
concept and applies it between and among systems. There is always a difference in performance
and latency between local locks and remote locks, often as much as an order of
magnitude (10x), which may affect overall performance. You must take this into account during
application development and system management.
HACMP offers the
Cluster Lock Manager, which provides a separate set of APIs for locking, in
addition to the standard set of UNIX System V APIs. All locking is strictly the responsibility of
the application. The Cluster Lock
Manager is not supported on AIX with the 64-bit kernel. HACMP also offers the General Parallel File
System, which was originally written for the High Performance Technical
Computing (HPTC) environment but is now available in the commercial space.
MySQL Cluster tracks
the locks for the database itself, but does not offer a generalized distributed
locking mechanism.
NSK does not even have
the concept of a distributed lock manager, as none is required. Ownership of all resources (files, disk
volumes, and so forth) is local to a specific CPU, and all communication to any
of those resources uses the standard messaging between CPUs and systems. The DAM responsible for a given volume keeps
its locks and checkpoints this information to a backup DAM located on a
different CPU. Because of the
efficiencies of the messaging implementation, this scales superbly.
OpenVMS Cluster
Software uses the same locking APIs for all locks, and makes no distinction
between local locks and remote locks. In
effect, all applications are automatically cluster-aware.
Oracle implements a
distributed lock manager on HACMP, Linux LifeKeeper, SunClusters, and Windows
2000/2003, but takes advantage of the native distributed lock manager on
OpenVMS Cluster Software, Serviceguard Extension for OPS/RAC, and TruCluster.
SQL Server
2000/2005 tracks the locks for the database itself, but, because only a single
instance of the database can be running at one time, there is no distributed
lock manager.
SunCluster and
TruCluster extend the standard set of UNIX APIs for file locking in order to
work with the cluster file system, resulting in a proxy for, and a layer on top
of, the standard file systems. Keep in
mind that even though the file system is available to all systems in the
cluster, almost all I/O is performed by the master system, even on shared disks.
Veritas uses the Veritas Global Lock Manager (GLM) to
coordinate access to the data on the cluster file system.
Windows 2000/2003 does not have a distributed lock manager.
Quorum
When discussing
cluster configurations, it is important to understand the concept of
quorum. Quorum devices (which can be
disks or systems) are a way to break the tie when two systems are equally
capable of forming a cluster and mounting all of the disks, but cannot
communicate with each other. This is
intended to prevent cluster partitioning, which is known as "split brain."
When a cluster is
first configured, you assign each system a certain number of votes (generally
1). Each cluster environment defines a
value for the number of "expected votes" for optimal performance. This is almost always the number of systems
in the cluster. From there, we can
calculate the "required quorum" value, which is the number of votes that are
required in order to form a cluster. If
the actual quorum value is below the required quorum value, the software will
refuse to form a cluster, and will generally refuse to run at all.
For example, assume
there are two members of the cluster, system A and system B, each with one
vote, so the required quorum of this cluster is 2.
In a running cluster,
the number of expected votes is the sum of the votes of all of the members with which the
connection manager can communicate. As
long as the cluster interconnect is working, there are 2 systems available and
no quorum disk, so the value is 2. Thus,
actual quorum is greater than or equal to required quorum, resulting in a valid
cluster.
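The quorum arithmetic can be sketched as follows. The formula is not reproduced in this section, so the sketch assumes the OpenVMS-style rule, required quorum = (expected votes + 2) / 2 with the fraction truncated, which agrees with the figures used here; other cluster products may compute quorum differently. The function names are illustrative.

```python
def required_quorum(expected_votes):
    """Assumed OpenVMS-style rule: (expected votes + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def can_form_cluster(actual_votes, expected_votes):
    """A cluster is valid only if the votes seen meet the required quorum."""
    return actual_votes >= required_quorum(expected_votes)


# Two systems, one vote each, no quorum disk.
print(required_quorum(2))      # 2
print(can_form_cluster(2, 2))  # True: both systems can see each other
print(can_form_cluster(1, 2))  # False: a system that sees only its own vote
```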
When the cluster
interconnect fails, the cluster is broken, and a cluster transition occurs.
The connection manager
of system A cannot communicate with system B, so the actual number of votes
becomes 1 for each of the systems. Applying the equation, actual quorum becomes
1, which is less than the required quorum needed to form a cluster,
so both systems stop and refuse to continue processing. This does not support the goal of high
availability; however, it does protect the data, as follows.
Notice what would
happen if both of the systems attempted to continue processing on their
own. Because there is no communication
between the systems, they both try to form a single system cluster, as follows:
- System A decides to form a cluster, and mounts all of the cluster-wide disks.
- System B also decides to form a cluster, and also mounts all of the cluster-wide disks. The cluster is now partitioned.
- As a result, the common disks are mounted on two
systems that cannot communicate with each other. This leads to instant disk corruption, as
both systems try to create, delete, extend, and write to files without locking
or cache coherency.
To avoid this, we use a quorum scheme, which usually involves a quorum device.
Picture the same
configuration as before, but now we have added a quorum disk, which is
physically cabled to both systems. Each
of the systems has one vote, and the quorum disk has one vote. The connection manager of system A can
communicate with system B and with the quorum disk, so expected votes is
3. This means that the quorum is 2. In this case, when the cluster interconnect
fails, the following occurs:
- Both systems attempt to form a cluster, but
system A wins the race and accesses the quorum disk first. Because it cannot connect to system B, and
the quorum disk watcher on system A observes that at this moment there is no
remote I/O activity on the quorum disk, system A becomes the founding member of
the cluster, and writes information, such as the system id of the founding
member of the cluster and the time that the cluster was newly formed, to the
quorum disk. System A then computes the
votes of all of the cluster members (itself and the quorum disk, for a total of
2) and observes that it has sufficient votes to form a cluster. It does so, and then mounts all of the disks
on the shared bus.
- System B comes in second and accesses the quorum
disk. Because it cannot connect to
system A, it thinks it is the founding member of the cluster, so it checks this
fact with the quorum disk, and discovers that system A is in fact the founding
member of the cluster. But system B
cannot communicate with system A, and as such, it cannot count either system A
or the quorum disk's votes in its inventory.
So system B then computes the votes of all of the cluster members
(itself only for a total of 1) and observes it does not have sufficient votes
to form a cluster. Depending on other
settings, it may or may not continue booting, but it does not attempt to form
or join the cluster. There is no
partitioning of the cluster.
In this way only one
of the systems will mount the cluster-wide disks. If there are other systems in the cluster,
the values of required quorum and expected votes would be higher, but the same
algorithms allow those systems that can communicate with the founding member of
the cluster to join the cluster, and those systems that cannot communicate with
the founding member of the cluster are excluded from the cluster.
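The tie-break described above can be simulated in a single process. This is a toy model under stated assumptions: the `QuorumDisk` class and `try_form_cluster` function are hypothetical, the quorum formula is the OpenVMS-style (expected votes + 2) / 2 truncated, and real products rely on disk watchers, timing, and mechanisms such as SCSI reservations rather than an in-memory flag.

```python
def required_quorum(expected_votes):
    """Assumed OpenVMS-style rule: (expected votes + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


class QuorumDisk:
    """Toy quorum disk that remembers which system claimed it first."""

    def __init__(self, votes=1):
        self.votes = votes
        self.founder = None

    def claim(self, system_id):
        """Record the first claimant as founding member; return the recorded founder."""
        if self.founder is None:
            self.founder = system_id
        return self.founder


def try_form_cluster(system_id, own_votes, disk, expected_votes):
    """Count the disk's vote only if this system is the recorded founder."""
    votes = own_votes
    if disk.claim(system_id) == system_id:
        votes += disk.votes
    return votes >= required_quorum(expected_votes)


# The cluster interconnect has failed, so neither system can count the other.
disk = QuorumDisk()
a_forms = try_form_cluster("A", 1, disk, expected_votes=3)  # A wins the race
b_forms = try_form_cluster("B", 1, disk, expected_votes=3)  # B arrives second
print(a_forms, b_forms)  # True False
```

System A reaches 2 votes (its own plus the quorum disk) and forms the cluster, while system B is left with 1 vote and does not, so the cluster is never partitioned.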
This example uses a
"quorum disk," but in reality any resource can be used to break the tie and
arbitrate which systems get access to a given set of resources. Disks are the most common, frequently using
SCSI reservations to arbitrate access to the disks. Server systems can also be used as tie-breakers,
a scheme that is useful in geographically distributed clusters.
The following table
summarizes important configuration characteristics of cluster products.
| Product (O/S) | Max Systems In Cluster | Cluster Interconnect | Quorum Device |
|---|---|---|---|
| HACMP (AIX, Linux) | 32 | Network, Serial, Disk bus (SCSI, SSA) (p) | No |
| LifeKeeper (Linux, Windows) | 16 | Network, Serial (p) | Yes (Optional) |
| MySQL Cluster (AIX, HP-UX, Linux, Solaris, Windows) | 64 | Network | Yes |
| NonStop Kernel | 255 | ServerNet (a) | Regroup algorithm |
| OpenVMS Cluster Software | 96 | CI, Network, MC, Shared Memory (a) | Yes (Optional) |
| Oracle 9i/10g RAC (Many O/S's) | Dependent on the O/S | Dependent on the O/S | n/a |
| PolyServe Matrix (Linux, Windows) | 16 | Network | Yes (membership partitions) |
| Serviceguard (HP-UX, Linux) | 16 | Network, HyperFabric (HP-UX only) | Yes = 2, optional >2 |
| SQL Server 2000/2005 (Windows) | Dependent on the O/S | Dependent on the O/S | n/a |
| SunCluster (Solaris) | 8 | Scalable Coherent Interface (SCI), 10/100/1000Enet (a) | Yes (Optional), recommended for each multiported disk set |
| TruCluster (Tru64 UNIX) | 8 generally, 512 w/Alpha SC | 100/1000Enet, QSW, Memory Channel (p) | Yes (Optional) |
| Veritas Cluster Server (AIX, HP-UX, Linux, Solaris, Windows) | 32 | Dependent on the O/S | Yes (using Volume Manager) |
| Windows 2000/2003 DataCenter | 4/8 | Network (p) | Yes |
Figure 4 Cluster Configuration Characteristics
HACMP can have up to
32 systems or dynamic logical partitions (DLPARs, or soft partitions) in a
cluster. Except for special environments
like SP2, there is no high speed cluster interconnect, but serial cables and
all Ethernet and SNA networks are supported as cluster interconnects. The cluster interconnect is strictly
active/passive, and multiple channels cannot be combined for higher
throughput. The disk busses (SCSI and
SSA) are also supported as emergency interconnects if the network interconnect
fails. Quorum is supported only for disk
subsystems, not for computer systems.
LifeKeeper supports up
to 16 systems in a cluster, connected by either the network or by serial
cable. These are configured for failover
only, and are therefore active/passive.
Any of the systems can take over for any of the other systems. Quorum disks are supported but not required.
MySQL Cluster can have
up to 64 systems in the cluster, connected with standard TCP/IP
networking. These can be split among any
combination of MySQL nodes, storage engine nodes, and management nodes. MySQL uses the management node as an
arbitrator to implement the quorum scheme.
NonStop Kernel can
have up to 255 systems in the cluster, but, given the way the systems interact,
it is more accurate to say that NonStop Kernel can have 255 systems * 16
processors in each system = 4,080 processors in a cluster. Each system in the cluster is independent and
maintains its own set of resources, but all systems in the cluster share a
namespace for those resources, providing transparent access to those resources
across the entire cluster, ignoring the physical location of the resources. This is one of the methods that NSK uses to
achieve linear scalability. ServerNet is
used as a communications path within each system as well as between relatively
nearby S-series systems. ServerNet
supports systems up to 15 kilometers and remote disks up to 40 kilometers, with
standard networking supporting longer distances. The ServerNet-Fox gateway provides the
cluster interconnect to the legacy K-series.
The cluster interconnect is active/active. NSK uses a message-passing quorum scheme
called Regroup to control access to resources within a system, and does not
rely on a quorum disk.
OpenVMS Cluster
Software supports up to 96 systems in a cluster, spread over multiple
datacenters up to 500 miles apart. Each
of these can also be any combination of VAX and Alpha systems, or (starting in
2005) any combination of Itanium and Alpha systems, in mixed architecture
clusters. There are many cluster
interconnects, ranging from standard networking, to Computer Interconnect (the
first cluster interconnect ever available, which was introduced in 1984), to
Memory Channel, and they are all available active/active. The quorum device can be a system, a disk, or
a file on a disk, with the restriction that this volume cannot use host-based
shadowing.
Oracle 9i/10g RAC uses
the underlying operating system functionality for cluster configuration,
interconnect, and quorum. The most
common type is 100BaseT or 1000BaseT in a private LAN, often with port
aggregation to achieve higher speeds.
For low latency cluster interconnects, HP offers HyperFabric and Memory
Channel, and Sun offers the Scalable Coherent Interface (SCI). Oracle 9i/10g RAC does not use a quorum
scheme of its own; instead, it relies on the underlying operating system for this functionality.
PolyServe Matrix
Server uses the underlying operating system functionality for cluster
configuration and interconnect.
PolyServe on both Linux and Windows primarily uses gigabit Ethernet or
InfiniBand in a private LAN. Matrix
Server uses a quorum scheme of membership partitions, which contain the
metadata and journaling for all of the file systems in the cluster. There are three membership partitions: all
data is replicated to all of them, providing redundancy. One or even two of these partitions could
fail, and PolyServe could still rebuild the information from the surviving
membership partition. These membership
partitions provide an alternate communications path, allowing servers to
correctly arbitrate ownership and coordinate the file systems, even if the
cluster interconnect fails. It is good
practice to place the three membership partitions on three separate devices
that are not the devices of the file systems themselves.
Serviceguard can have
up to 16 systems, using standard networking or HyperFabric (HP-UX only) as a
cluster interconnect, and uses Auto Port Aggregation for a high speed
active/active cluster interconnect. A
special requirement is that all cluster members must be present to initially
form the cluster (100% quorum requirement), and that >50% of the cluster must
be present in order to continue operation.
Serviceguard can use either one or two quorum disks (two in an Extended
Campus Cluster), a quorum server that is not a member of the cluster, or an
arbitrator system that is a member of the cluster. The quorum disk, quorum system, or arbitrator
system is used as a tie breaker when there is an even number of production
systems and a 50/50 split is possible.
For two nodes, a quorum device is required: either a system (on HP-UX
and Linux), a disk volume (on HP-UX), or a LUN (on Linux). Quorum devices are optional for any other
size cluster. Cluster quorum disks are
supported for clusters of 2-4 nodes, and cluster quorum systems are supported
for clusters of 2-16 systems. Single
quorum servers can service up to 50 separate Serviceguard clusters for either
HP-UX or Linux. Note that quorum servers
are not members of the clusters that they are protecting.
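The membership rules described above (all members present to form the cluster, more than 50% to continue, and a tie-breaker for an exact 50/50 split) can be sketched as two small predicates. The function names and parameters are illustrative assumptions, not Serviceguard interfaces.

```python
def can_form_initially(present, configured):
    """Initial formation requires every configured member (100% quorum)."""
    return present == configured


def can_continue(surviving, previous, wins_tie_breaker=False):
    """After a failure, a group continues with more than half of the previous
    membership, or with exactly half if it wins the tie-breaker (quorum disk,
    quorum server, or arbitrator system)."""
    if 2 * surviving > previous:
        return True
    if 2 * surviving == previous:
        return wins_tie_breaker
    return False


print(can_form_initially(3, 4))                  # False: one member is missing
print(can_continue(3, 4))                        # True: 75% survives
print(can_continue(2, 4))                        # False: 50/50 split, no tie-breaker won
print(can_continue(2, 4, wins_tie_breaker=True)) # True: tie-breaker decides
print(can_continue(1, 2, wins_tie_breaker=True)) # True: two-node cluster case
```

The last call shows why a quorum device is required for two nodes: any single failure produces an exact 50/50 split, so without a tie-breaker neither node could continue.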
SQL Server 2000/2005
uses the underlying operating system functionality for cluster configuration,
interconnect, and quorum. The maximum
number of servers in an SQL Server environment grew from four with Windows 2000
to eight with Windows 2003.
SunCluster can have up
to eight systems in a cluster, using standard networking or the Scalable
Coherent Interface (SCI) as a cluster interconnect. SCI interfaces run at up to 1GByte/second,
and up to four can be striped together to achieve higher throughput, to support
active/active cluster interconnects.
However, Sun only offers up to a four-port SCI switch, so only four
systems can be in a single SunPlex domain.
Quorum disks are recommended by Sun, and each multiported disk set
requires its own quorum disk. So, for
example, if there are four systems (A, B, C and D) with two multiported disk
arrays (X and Y) where disk array X is connected to systems A and B, and disk
array Y is connected to systems C and D, two quorum disks are required.
TruCluster can have up
to eight systems of any size in a cluster.
For the HPTC market, the Alpha System SuperComputer system farm can have
up to 512 systems. This configuration
uses the Quadrics Switch (QSW) as an extremely high-speed interconnect. For the
commercial market, TruCluster uses either standard networking or Memory Channel
as an active/passive cluster interconnect.
Both OpenVMS Cluster
Software and TruCluster recommend the use of quorum disks for 2-node clusters,
but make it optional for clusters with a larger number of nodes.
Veritas Cluster Server
can have up to 32 systems in a cluster of AIX, HP-UX, Linux, Solaris, and
Windows 2000/2003 systems. Standard
networking from the underlying operating system is used as the cluster
interconnect. Veritas has implemented a
special Low Latency Transport (LLT) to efficiently use these interconnects
without the high overhead of TCP/IP.
Veritas implements a standard type of quorum in Volume Manager, using
the term "coordinator disk" for quorum devices.
Windows 2000 clusters
can have up to four systems in a cluster, while Windows 2003 extends this to
eight. Keep in mind that Windows
2000/2003 DataCenter is a services sale, and only Microsoft qualified partners
like HP can configure and deliver these clusters. The only cluster interconnect available
is standard LAN networking, which works as active/passive.
The following table
summarizes application support provided by the cluster products.
Cluster product (platforms) | Single-instance (failover mode) | Multi-instance (cluster-wide) | Recovery methods
HACMP (AIX, Linux) | Yes | Yes (using special APIs) | Scripts
LifeKeeper (Linux, Windows) | Yes | No | Scripts
MySQL Cluster (AIX, HP-UX, Linux, Solaris, Windows) | No (MySQL Server) | Yes | Failover
NonStop Kernel | Yes (takeover) | Effectively Yes | Paired Processing
OpenVMS Cluster Software (OpenVMS) | Yes | Yes | Batch /RESTART
Oracle 9i/10g RAC (Many O/S's) | No (Oracle DB) | Yes | Transaction recovery
PolyServe Matrix (Linux, Windows) | Yes | No | Scripts
Serviceguard (HP-UX, Linux) | Yes | No | Packages and Scripts
SQL Server 2000/2005 (Windows) | Yes | No | Scripts
SunCluster (Solaris) | Yes | No | Resource Group Manager
TruCluster (Tru64 UNIX) | Yes | Yes | Cluster Application Availability
Veritas Cluster Server (AIX, HP-UX, Linux, Solaris, Windows) | Yes | No | Application Groups
Windows 2000/2003 Cluster Service (Windows) | Yes | No | Registration, cluster API

Figure 5 Cluster Support for Applications
Single-Instance and Multi-Instance Applications
With respect to
clusters, there are two main types of applications: single-instance
applications and multi-instance applications.
Notice that these are the opposite of multisystem-view and
single-system-view. Multisystem-view
clusters allow single-instance applications, providing failover of applications
for high availability, but don't allow the same application to work on the same
data on the different systems in the cluster.
Single-system-view clusters allow multi-instance applications, which
provide failover for high availability, and also offer cooperative processing,
where applications can interact with the same data and each other on different systems
in the cluster.
A good way to
determine whether an application is single-instance or multi-instance is to run
it in several processes on a single system. If the processes do not interact in
any way, so that the application runs properly whether one process or several
processes are running it on that system, then the application is
single-instance.
An example of a
single-instance application is telnet.
Multiple systems in a cluster can offer telnet services, but the
different telnet sessions themselves do not interact with the same data or each
other in any way. If a system fails, the
users on that system simply log in to the next system in the cluster and
restart their sessions. This is simple
failover. Many systems, including HACMP,
Linux, Serviceguard, SunCluster, and Windows 2000/2003 clusters, set up telnet
services as single-instance applications in failover mode.
If, on the other hand,
the applications running in the multiple processes interact properly with each
other, such as by sharing cache or by locking data structures to allow proper
coordination between the application instances, then the application is
multi-instance.
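The coordination described above can be illustrated with a small Python sketch. It is a single-machine stand-in only: threads play the role of application instances on different systems, and a `threading.Lock` plays the role of whatever cluster-wide locking the platform provides. The class and function names are hypothetical.

```python
import threading


class SharedRecord:
    """A piece of data that several application instances update."""

    def __init__(self):
        self.balance = 0
        # Stand-in for a cluster-wide lock on this record.
        self._lock = threading.Lock()

    def deposit(self, amount):
        # The lock makes the read-modify-write sequence atomic, so
        # concurrent instances cannot overwrite each other's updates.
        with self._lock:
            current = self.balance
            self.balance = current + amount


def run_instances(instance_count, deposits_each):
    """Run several instances against the same record and return the result."""
    record = SharedRecord()

    def instance():
        for _ in range(deposits_each):
            record.deposit(1)

    threads = [threading.Thread(target=instance) for _ in range(instance_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return record.balance


print(run_instances(instance_count=4, deposits_each=1000))  # 4000
```

Because every update goes through the lock, the instances interact properly on the same data, which is the defining property of a multi-instance application. A single-instance application would need no such coordination, because its processes never touch shared state.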
An example of a
multi-instance application is a cluster file system that allows the same set of
disks to be offered as services to multiple systems. This requires a cluster file system with a
single-system-view cluster, which can be offered either in the operating system
software itself (as on OpenVMS Cluster Software) or by other clusterware (as on
Oracle 9i/10g RAC). Although HACMP,
LifeKeeper, Serviceguard, SunCluster, and Windows 2000/2003 do not support
multi-instance applications as part of the base operating system clusterware,
third-party tools can add multi-instance application capability. For example, NonStop Kernel uses messaging
and DAMs to provide this functionality.
Applications, whether
single-instance or multi-instance, can be dynamically assigned network
addresses and names, so that the applications are not bound to any specific
system address or name. At any given
time, the application's network address is bound to one or more network
interface cards (NICs) on the cluster.
If the application is a single-instance application running on a single
system, the network packets are simply passed to the application. If the application is a single-instance
application running on multiple systems in the cluster (like the telnet example
above), or is a multi-instance application, the network packets are routed to
the appropriate system in the cluster for that instance of the application,
thus achieving load balancing. Future
communications between that client and that instance of the application may
either continue to be routed in this manner, or may be sent directly to the
most appropriate NIC on that server.
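The routing behavior described above can be sketched as follows. The text does not say which balancing policy any given product uses, so round-robin assignment is an assumption here, as are the class name, the system names, and the client identifiers. The affinity table models the case where later traffic from a client continues to reach the same instance.

```python
import itertools


class ClusterAlias:
    """Routes clients of one application address to systems in the cluster."""

    def __init__(self, systems):
        # Assumed policy: hand out systems in round-robin order.
        self._next_system = itertools.cycle(systems)
        # Remembers which system already serves each client.
        self._affinity = {}

    def route(self, client):
        """Return the system that should receive this client's packets."""
        if client not in self._affinity:
            self._affinity[client] = next(self._next_system)
        return self._affinity[client]


alias = ClusterAlias(["node1", "node2"])
print(alias.route("client-a"))  # node1
print(alias.route("client-b"))  # node2
print(alias.route("client-a"))  # node1 (same instance as before)
```

New clients are spread across the systems, while an existing client keeps talking to the instance that already holds its session.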
Recovery Methods
The recovery method is
the way the cluster recovers the applications that were running on a system
that has been removed, either deliberately or by a system failure, from the
cluster.
HACMP allows
applications and the resources that they require (for example, disk volumes) to
be placed into resource groups, which are created and deleted by running
scripts specified by the application developer.
These scripts are the only control that HACMP has over the applications,
and IBM stresses that the scripts must take care of all aspects of correctly
starting and stopping the application; otherwise, recovery of the application
may not occur. The resource groups can
be concurrent (that is, the application runs on multiple systems of the cluster
at once) or non-concurrent (that is, the application runs on a single system in
the cluster, but can fail over to another system in the cluster). For each resource group, the system
administrator must specify a "node-list" that defines the systems that are able
to take over the application in the event of the failure of | | |