Resource Failover in Pacemaker

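Pacemaker distinguishes three kinds of resource failure: start (the resource fails to start), monitor (a failure detected by monitoring), and stop (the resource fails to stop). When a failure occurs, Pacemaker acts according to the on-fail setting of the operation that failed. If on-fail is not specified, the default is restart, except for stop operations, where the default is fence when STONITH is enabled and block otherwise. The main values that on-fail accepts are as follows.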

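ignore : Take no action; carry on as if the resource had not failed.
block : Stop managing the failed resource and wait for the administrator to intervene.
fence : Use STONITH to reboot the server on which the resource failed, then fail the resource over.
restart : Stop the failed resource and start it again, possibly on another server. (Default)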
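
The way the default is chosen when on-fail is omitted can be sketched in a few lines of Python. This is only an illustrative model, not Pacemaker code: the function name effective_on_fail and its parameters are hypothetical, and it assumes the documented rule that a stop operation defaults to fence when STONITH is enabled and to block otherwise.

```python
def effective_on_fail(operation, configured=None, stonith_enabled=False):
    """Return the on-fail action expected for an operation (illustrative model).

    operation       -- "start", "monitor" or "stop"
    configured      -- the on-fail value written in the configuration, or None
    stonith_enabled -- value of the stonith-enabled cluster property
    """
    # An explicit on-fail setting always wins.
    if configured is not None:
        return configured
    # A failed stop cannot be fixed by restarting, so its default differs.
    if operation == "stop":
        return "fence" if stonith_enabled else "block"
    # start and monitor failures default to restart.
    return "restart"


if __name__ == "__main__":
    for op in ("start", "monitor", "stop"):
        print(op, "->", effective_on_fail(op))
```

With STONITH disabled, a stop operation without an explicit on-fail falls back to block in this model, which is why a fence setting only makes sense once a STONITH device is configured.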

Using the environment from an earlier article, let's test failover again.

Verifying Resource Failover

Check the configuration on pm01.

# crm configure show
node $id="420126ea-1e1e-4632-b4fb-eaa2a8915909" pm02 \
        attributes standby="off"
node $id="ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9" pm01 \
        attributes standby="off"
primitive apache lsb:httpd \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="30s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="fence"
primitive mnt_fs ocf:heartbeat:Filesystem \
        params device="/dev/sdb2" directory="/data" fstype="ext3" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pingd ocf:pacemaker:pingd \
        params name="default_ping_set" host_list="192.168.1.2" interval="10" timeout="10" attempts="5" multiplier="100" \
        op start interval="0" timeout="90" on-fail="restart" \
        op monitor interval="10" timeout="20" on-fail="restart" start-delay="1m" debug="true" \
        op stop interval="0" timeout="100" on-fail="block"
primitive vip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.103" cidr_netmask="24" nic="eth0" iflabel="0" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
group Cluster vip mnt_fs apache
clone clone_ping pingd
location vip_location vip \
        rule $id="vip_location-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="3"
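
The vip_location rule in this configuration gives vip a score of -INFINITY on any node where default_ping_set is undefined or lower than 100. The following is a rough Python sketch of that rule, not Pacemaker's own evaluator. The function name and the dictionary of node attributes are hypothetical, and it assumes that pingd publishes the number of reachable hosts multiplied by multiplier, so with one host in host_list and multiplier="100" the attribute is either 100 or 0.

```python
def ping_rule_bans_node(node_attrs, name="default_ping_set", threshold=100):
    """Return True if the rule
        -inf: not_defined <name> or <name> lt <threshold>
    would match for a node with the given attributes (illustrative model)."""
    value = node_attrs.get(name)
    # "not_defined": the attribute has not been set on this node.
    if value is None:
        return True
    # "lt": numeric comparison against the threshold.
    return int(value) < threshold


if __name__ == "__main__":
    print(ping_rule_bans_node({}))                           # attribute missing
    print(ping_rule_bans_node({"default_ping_set": "0"}))    # ping target unreachable
    print(ping_rule_bans_node({"default_ping_set": "100"}))  # ping target reachable
```

Because vip is the first member of the Cluster group, banning vip from a node effectively keeps the whole group off that node.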

Change the configuration so that a single resource failure triggers failover.

# crm configure edit
<...snip...>
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"

Note: the value of migration-threshold is changed from 3 to 1.

Check the resource status. Note: the -f option also shows the failure counts.

# crm_mon -f
============
Last updated: Tue May 28 03:37:46 2013
Stack: Heartbeat
Current DC: pm01 (ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pm02 pm01 ]

 Resource Group: Cluster
     vip        (ocf::heartbeat:IPaddr2):       Started pm01
     mnt_fs     (ocf::heartbeat:Filesystem):    Started pm01
     apache     (lsb:httpd):    Started pm01
 Clone Set: clone_ping
     Started: [ pm02 pm01 ]

Migration summary:
* Node pm02:
* Node pm01:

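To simulate a process failure, kill the process behind the apache resource. In the ps -ef output the second column is the PID and the third is the parent PID, so 7436 is the parent httpd process of the worker that is listed.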

# ps -ef | grep apache
apache    7438  7436  0 03:44 ?        00:00:00 /usr/sbin/httpd
# kill -9 7436

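Pacemaker detects the resource failure and increments the fail-count of the resource on that node. As long as fail-count is still below migration-threshold, the resource is restarted on the same node.
When fail-count reaches migration-threshold, the resource fails over to another node.
If the restart itself fails (a start failure), fail-count is set to INFINITY and the resource fails over immediately.
In this example migration-threshold is set to 1, so the resources fail over to pm02.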
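
The relationship between fail-count and migration-threshold can be modelled with a minimal Python sketch. The function names are hypothetical and this is not how Pacemaker is implemented. It assumes that INFINITY is the score value 1000000, that a start failure sets fail-count to INFINITY (the behaviour of the start-failure-is-fatal default), and that a migration-threshold of 0 disables the check.

```python
INFINITY = 1000000  # Pacemaker's numeric value for INFINITY (assumption of this sketch)


def record_failure(fail_count, operation, start_failure_is_fatal=True):
    """Return the new fail-count after a failed operation (illustrative model)."""
    # A failed start is treated as fatal for this node by default.
    if operation == "start" and start_failure_is_fatal:
        return INFINITY
    # Any other failure adds one, capped at INFINITY.
    return min(fail_count + 1, INFINITY)


def must_move(fail_count, migration_threshold):
    """True if the resource may no longer run on the node."""
    # 0 is treated as "threshold disabled" in this sketch.
    return migration_threshold > 0 and fail_count >= migration_threshold


if __name__ == "__main__":
    count = record_failure(0, "monitor")
    print("threshold 1:", must_move(count, 1))
    print("threshold 3:", must_move(count, 3))
```

With migration-threshold set to 1, a single monitor failure is enough for must_move to return True, which matches the failover to pm02 seen in this test. With the original value of 3, the first two failures would only restart the resource in place.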

# crm_mon -f
============
Last updated: Tue May 28 03:47:47 2013
Stack: Heartbeat
Current DC: pm01 (ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pm02 pm01 ]

 Resource Group: Cluster
     vip        (ocf::heartbeat:IPaddr2):       Started pm02
     mnt_fs     (ocf::heartbeat:Filesystem):    Started pm02
     apache     (lsb:httpd):    Started pm02
 Clone Set: clone_ping
     Started: [ pm02 pm01 ]

Migration summary:
* Node pm02:
* Node pm01:
   apache: migration-threshold=1 fail-count=1   (Note: fail-count has been incremented.)

Failed actions:
    apache_monitor_30000 (node=pm01, call=22, rc=7, status=complete): not running
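
When this check has to be repeated often, the Migration summary section can be extracted from the crm_mon -f text with a small parser. The sketch below is a hypothetical helper, not part of Pacemaker or crmsh, and it assumes the output layout shown in this article for version 1.0.11; other versions may format the section differently.

```python
import re

NODE_LINE = re.compile(r"^\* Node (\S+):")
RSC_LINE = re.compile(r"^\s+(\S+): migration-threshold=(\d+) fail-count=(\S+)")


def parse_migration_summary(text):
    """Return {node: {resource: fail_count}} from `crm_mon -f` output."""
    summary = {}
    node = None
    in_section = False
    for line in text.splitlines():
        if line.startswith("Migration summary:"):
            in_section = True
            continue
        if not in_section:
            continue
        if not line.strip():
            break  # a blank line ends the Migration summary section
        node_match = NODE_LINE.match(line)
        if node_match:
            node = node_match.group(1)
            summary[node] = {}
            continue
        rsc_match = RSC_LINE.match(line)
        if rsc_match and node is not None:
            count = rsc_match.group(3)
            # Keep non-numeric values such as "INFINITY" as strings.
            summary[node][rsc_match.group(1)] = int(count) if count.isdigit() else count
    return summary


SAMPLE = """\
Migration summary:
* Node pm02:
* Node pm01:
   apache: migration-threshold=1 fail-count=1

Failed actions:
    apache_monitor_30000 (node=pm01, call=22, rc=7, status=complete): not running
"""

if __name__ == "__main__":
    print(parse_migration_summary(SAMPLE))
```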

A node whose fail-count for a resource has reached migration-threshold can no longer run that resource.
To move the resource back to that node, the fail-count must be cleared first.

Check the state of the resources.

# crm resource show
 Resource Group: Cluster
     vip        (ocf::heartbeat:IPaddr2) Started
     mnt_fs     (ocf::heartbeat:Filesystem) Started
     apache     (lsb:httpd) Started
 Clone Set: clone_ping
     Started: [ pm02 pm01 ]

Check the fail-count of the apache resource.

# crm resource failcount apache show pm01
scope=status  name=fail-count-apache value=1

Clear the fail-count of the apache resource.

# crm resource failcount apache delete pm01

The fail-count of the apache resource is cleared and its value becomes 0.

# crm resource failcount apache show pm01
scope=status  name=fail-count-apache value=0
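
In a script, the single line printed by crm resource failcount ... show can be turned into a value to test. The following Python sketch is a hypothetical helper that assumes the scope=... name=... value=... layout shown here.

```python
def parse_failcount(line):
    """Parse 'scope=status  name=fail-count-apache value=1'
    into a (resource, fail_count) pair."""
    # Split on whitespace, then split each field at the first '='.
    fields = dict(pair.split("=", 1) for pair in line.split())
    name = fields["name"]
    prefix = "fail-count-"
    resource = name[len(prefix):] if name.startswith(prefix) else name
    value = fields["value"]
    # Keep non-numeric values such as "INFINITY" as strings.
    return resource, (int(value) if value.isdigit() else value)


if __name__ == "__main__":
    resource, count = parse_failcount("scope=status  name=fail-count-apache value=0")
    print(resource, "is clear" if count == 0 else "still has failures")
```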

The fail-count has been cleared, but the failure history (Failed actions) is still there.

# crm_mon -f
============
Last updated: Tue May 28 04:03:35 2013
Stack: Heartbeat
Current DC: pm01 (ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pm02 pm01 ]

 Resource Group: Cluster
     vip        (ocf::heartbeat:IPaddr2):       Started pm02
     mnt_fs     (ocf::heartbeat:Filesystem):    Started pm02
     apache     (lsb:httpd):    Started pm02
 Clone Set: clone_ping
     Started: [ pm02 pm01 ]

Migration summary:
* Node pm02:
* Node pm01:

Failed actions:
    apache_start_0 (node=pm01, call=36, rc=1, status=complete): unknown error

Clear the failure history (Failed actions).

# crm resource cleanup apache pm01
Cleaning up apache on pm01
Waiting for 2 replies from the CRMd..

Both the fail-count and Failed actions have now been cleared.

# crm_mon -f
============
Last updated: Tue May 28 04:07:18 2013
Stack: Heartbeat
Current DC: pm01 (ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pm02 pm01 ]

 Resource Group: Cluster
     vip        (ocf::heartbeat:IPaddr2):       Started pm02
     mnt_fs     (ocf::heartbeat:Filesystem):    Started pm02
     apache     (lsb:httpd):    Started pm02
 Clone Set: clone_ping
     Started: [ pm02 pm01 ]

Migration summary:
* Node pm02:
* Node pm01:

Verifying Resource Switchback

On pm02, run resource move to return the resources to pm01.

# crm resource move Cluster pm01 force
WARNING: Creating rsc_location constraint 'cli-standby-Cluster' with a score of -INFINITY for resource Cluster on pm02.
        This will prevent Cluster from running on pm02 until the constraint is removed using the 'crm_resource -U' command or manually with cibadmin
        This will be the case even if pm02 is the last node in the cluster
        This message can be disabled with -Q

The resources move to pm01.

# crm_mon -f
============
Last updated: Tue May 28 04:57:08 2013
Stack: Heartbeat
Current DC: pm02 (420126ea-1e1e-4632-b4fb-eaa2a8915909) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pm02 pm01 ]

 Resource Group: Cluster
     vip        (ocf::heartbeat:IPaddr2):       Started pm01
     mnt_fs     (ocf::heartbeat:Filesystem):    Started pm01
     apache     (lsb:httpd):    Started pm01
 Clone Set: clone_ping
     Started: [ pm02 pm01 ]

Migration summary:
* Node pm02:
* Node pm01:

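At the same time, location constraints are added to the configuration: one that pins the Cluster group to pm01 and one that bans it from the node it was moved away from, pm02.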

# crm configure show
node $id="420126ea-1e1e-4632-b4fb-eaa2a8915909" pm02 \
        attributes standby="off"
node $id="ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9" pm01 \
        attributes standby="off"
primitive apache lsb:httpd \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="30s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="fence"
primitive mnt_fs ocf:heartbeat:Filesystem \
        params device="/dev/sdb2" directory="/data" fstype="ext3" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pingd ocf:pacemaker:pingd \
        params name="default_ping_set" host_list="192.168.1.2" interval="10" timeout="10" attempts="5" multiplier="100" \
        op start interval="0" timeout="90" on-fail="restart" \
        op monitor interval="10" timeout="20" on-fail="restart" start-delay="1m" debug="true" \
        op stop interval="0" timeout="100" on-fail="block"
primitive vip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.103" cidr_netmask="24" nic="eth0" iflabel="0" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
group Cluster vip mnt_fs apache
clone clone_ping pingd
location cli-prefer-Cluster Cluster \
        rule $id="cli-prefer-rule-Cluster" inf: #uname eq pm01      (Note: added by resource move; pins Cluster to pm01.)
location cli-standby-Cluster Cluster \
        rule $id="cli-standby-rule-Cluster" -inf: #uname eq pm02    (Note: added by resource move; bans Cluster from pm02.)
location vip_location vip \
        rule $id="vip_location-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1369683385"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"

While the constraints created by resource move are in place, the group cannot fail over to pm02, so clear them.

# crm resource unmove Cluster

The constraints created by resource move have been removed.

# crm configure show
node $id="420126ea-1e1e-4632-b4fb-eaa2a8915909" pm02 \
        attributes standby="off"
node $id="ca3de0e0-dff2-4d13-a126-b9fbee7e3ec9" pm01 \
        attributes standby="off"
primitive apache lsb:httpd \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="30s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="fence"
primitive mnt_fs ocf:heartbeat:Filesystem \
        params device="/dev/sdb2" directory="/data" fstype="ext3" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pingd ocf:pacemaker:pingd \
        params name="default_ping_set" host_list="192.168.1.2" interval="10" timeout="10" attempts="5" multiplier="100" \
        op start interval="0" timeout="90" on-fail="restart" \
        op monitor interval="10" timeout="20" on-fail="restart" start-delay="1m" debug="true" \
        op stop interval="0" timeout="100" on-fail="block"
primitive vip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.103" cidr_netmask="24" nic="eth0" iflabel="0" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
group Cluster vip mnt_fs apache
clone clone_ping pingd
location vip_location vip \
        rule $id="vip_location-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1369683385"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"
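
Constraints left behind by resource move are easy to overlook. As a minimal sketch, the following Python helper scans the text of crm configure show for location constraints whose ids start with cli-prefer- or cli-standby-, the prefixes seen in this article. The function name is hypothetical, and other versions of the tools may use different ids.

```python
import re

# Matches lines such as: location cli-standby-Cluster Cluster \
CLI_CONSTRAINT = re.compile(
    r"^location (cli-(?:prefer|standby)-\S+) (\S+)", re.MULTILINE
)


def leftover_cli_constraints(config_text):
    """Return a list of (constraint_id, resource) pairs that were
    created by `crm resource move` and are still in the configuration."""
    return [(m.group(1), m.group(2)) for m in CLI_CONSTRAINT.finditer(config_text)]


if __name__ == "__main__":
    import sys

    # Usage (assumed): crm configure show | python check_cli_constraints.py
    found = leftover_cli_constraints(sys.stdin.read())
    for constraint_id, resource in found:
        print("leftover constraint", constraint_id, "on", resource)
    sys.exit(1 if found else 0)
```

An empty result means failover to either node is no longer blocked by a move constraint.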
