---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/safekituserguidehtml/documentation/safekituserguideen.htm
---

## 4.4 Tests of checkers common to mirror and farm

### 4.4.1 Test <errd> checker with action restart or stopstart

For a description of process/service
monitoring, refer to section 13.10.

|  |  |
| --- | --- |
| In userconfig.xml:  <errd>    <proc name="appli.exe" atleast="1" action="restart" class="prim"/>  </errd>  The checker monitors the process named appli.exe.  ·         name="appli.exe" atleast="1": at least one process appli.exe must run  ·         class  o    class="prim" for mirror module  checker started/stopped on the server in state PRIM or ALONE (Ready), after/before the application (start\_prim/stop\_prim)  o    class="both" for farm module  checker started/stopped on all servers UP (Ready), after/before the application (start\_both/stop\_both)  ·         action  If appli.exe is not running, the checker sets the resource proc.appli.exe to down. Then, it executes a restart or stopstart.  o    action="restart"  it restarts the application locally (stop\_xx; start\_xx)  o    action="stopstart"  it stops the module, as well as the application, and then automatically starts it | 1.    Kill the process appli.exe on the server in (Ready) state, that is, in state PRIM or ALONE for a mirror module, UP for a farm module.  o    messages in the log:  "Process appli.exe not running"   "Action restart\|stopstart called by errd"   o    the module becomes (Transient), respectively in state PRIM, ALONE or UP  o    in the restart case, the module becomes (Ready), respectively in state PRIM, ALONE or UP  o    in the stopstart case, the module becomes (Ready), respectively in state SECOND, ALONE or UP  message in the log:   "Action start called automatically"  Note: a stopstart on PRIM (Ready) causes a failover  2.    Repeat the test on the same server if it still runs the application (i.e., (Ready) in state ALONE, PRIM or UP).  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval described in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop" |

### 4.4.2 Test <tcp> checker with action restart or stopstart

For a description of TCP checker, refer to section 13.12.

|  |  |
| --- | --- |
| In userconfig.xml:  <check>    <tcp ident="id" when="prim" action="restart" >      <to addr="addr" port="port"/>    </tcp>  </check>  The checker checks that the application responds to connection requests.  ·         addr="addr" port="port"  test TCP connections on addr:port  ·         when  o    when="prim" for mirror module  checker started/stopped on the server in state PRIM or ALONE (Ready), after/before the application (start\_prim/stop\_prim)  o    when="both" for farm module  checker started/stopped on all servers UP (Ready), after/before the application (start\_both/stop\_both)  ·         action  If the connection fails, the checker sets the resource tcp.id to down. The associated failover rule, named **t\_**id, executes a restart or stopstart.  o    action="restart"  It restarts the application locally (stop\_xx; start\_xx)  o    action="stopstart"  It stops the module completely and then automatically starts it. | 1.    Stop the application listening on addr:port on the server in (Ready) state, that is, in state PRIM or ALONE for a mirror module, UP for a farm module:  o    messages in the log:  "Resource tcp.id set to down by tcpcheck"   "Action restart\|stopstart from failover rule t\_id"  o    the module becomes (Transient)  o    in case of restart, the module becomes (Ready), respectively in state PRIM, ALONE or UP  o    in case of stopstart, the module becomes (Ready), respectively in state SECOND, ALONE or UP  Message in the log:  "Action start called automatically"  Note: a stopstart on PRIM (Ready) causes a failover.  2.    Repeat the test on the same server if it still runs the application (i.e., (Ready) in state ALONE, PRIM or UP).  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop" |
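
The connection test that the <tcp> checker is described as performing can be illustrated with a minimal Python sketch. This is not SafeKit's implementation: the function names and the 2-second timeout are assumptions made for the example. It only shows how a successful or failed TCP connection on addr:port maps to an up or down value for a resource such as tcp.id.

```python
import socket


def tcp_responds(addr: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to addr:port can be established."""
    try:
        # create_connection completes the TCP handshake; the socket is closed at once.
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, host unreachable, or timeout: the probe has failed.
        return False


def resource_state(addr: str, port: int) -> str:
    """Map the probe result to the up/down value of a resource such as tcp.id."""
    return "up" if tcp_responds(addr, port) else "down"
```

Running such a probe by hand against addr:port before and after stopping the application is one way to confirm that the failure injected in step 1 is visible at the TCP level.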

### 4.4.3 Test <tcp> checker with action wait

For a description of TCP checker, refer to section 13.12.

|  |  |
| --- | --- |
| In userconfig.xml:  <check>    <tcp ident="id" when="pre" action="wait" >      <to addr="addr" port="port"/>    </tcp>  </check>  The checker checks that an application, external to the module, responds to connection requests.  ·         addr="addr" port="port"  It checks TCP connections on addr:port  ·         when="pre"  The checker starts before, and stops after, the application integrated into the module (in start\_xx/stop\_xx).  ·         action="wait"  If the connection fails, the checker sets the resource tcp.id to down. The associated failover rule, named **t\_**id, executes a wait.  It stops the module and its application, then puts the module in the state WAIT until the checker resets tcp.id to up. | 1.    Stop the external application listening on addr:port, when the server is in (Ready) state.  o    messages in the log:  "Resource tcp.id set to down by tcpcheck"  "Action wait from failover rule t\_id"  o    the module becomes WAIT (NotReady) on all nodes  Note: a wait on PRIM (Ready) causes a failover  2.    Restart the application listening on addr:port.  o    messages in the verbose log:  "Resource tcp.id set to up by tcpcheck"  "Action wakeup from failover rule Implicit\_wakeup"  o    the module becomes (Ready), respectively in state SECOND, ALONE, or UP  3.    Repeat the test.  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop"  Note: This test checks connectivity to an external service. However, if the external service is down or unreachable from all servers, all servers are in state WAIT (NotReady) and the application is unavailable. |

### 4.4.4 Test <interface check="on"> with action wait

For a description of the interface checker, refer to section 13.14. For its
automatic configuration with <interface check="on">, see section 13.6.5.

|  |  |
| --- | --- |
| In userconfig.xml:  <vip>   <interface\_list>   <interface **check="on"**>    <real\_interface>     <virtual\_addr addr="172.17.0.20"                where="one\_side\_alias"              check="on"/>  </real\_interface>    </interface>   </interface\_list>  </vip>  The checker checks that the Ethernet cable is connected to the interface on which the virtual IP address is set.  ·         If the cable is disconnected, the checker sets the associated resource intf.172.17.0.0 to down. The prefix is intf and the suffix is the network corresponding to the virtual IP.  ·         The default failover rule, named interface\_failure, executes a wait.  It stops the module and its application, then puts the module in the state WAIT until the checker resets intf.172.17.0.0 to up.     Note: do not use check="on" on a bonding or teaming interface, because these interfaces bring their own failover mechanisms from interface to interface | 1.    Remove the Ethernet cable from the network card (on which the virtual IP is configured) on the server in (Ready) state, that is, in state PRIM or ALONE for a mirror module, UP for a farm module.  o    messages in the log:  "Resource intf.172.17.0.0 set to down by intfcheck"  "Action wait from failover rule interface\_failure"   o    the module becomes WAIT (NotReady)  Note: a wait on PRIM (Ready) causes a failover  2.    Plug the cable in again  o    messages in the log:  "Resource intf.172.17.0.0 set to up by intfcheck"  "Action wakeup from failover rule Implicit\_wakeup"  o    the module becomes (Ready), respectively in state SECOND, ALONE or UP  3.    Repeat the test on the same server  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop"     Note: disabling the interface (instead of unplugging the Ethernet cable) leads to STOP (NotReady) if this network is also used for heartbeat. The reason is that the module cannot start (or restart) without a local IP address. |
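
To know which resource name to look for in the log, it helps to derive the network from the virtual IP address. The sketch below uses Python's standard ipaddress module. The function name is hypothetical, and the prefix length is an assumption because the configuration shown above does not state the netmask. It illustrates the naming convention only, not how SafeKit computes the name.

```python
import ipaddress


def interface_resource_name(virtual_ip: str, prefix_len: int) -> str:
    """Build a name of the form intf.<network> from a virtual IP.

    prefix_len is an assumption: use the netmask actually configured
    on the interface that carries the virtual IP.
    """
    network = ipaddress.ip_interface(f"{virtual_ip}/{prefix_len}").network
    return f"intf.{network.network_address}"


# With the address from the example above and an assumed /24 netmask:
print(interface_resource_name("172.17.0.20", 24))  # intf.172.17.0.0
```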

### 4.4.5 Test <ping> checker with action wait

For a description of ping checker, refer to
section 13.13.


|  |  |
| --- | --- |
| In userconfig.xml:  <check>    <ping ident="id" when="pre" action="wait">      <to addr="extip"/>    </ping>  </check>  The checker checks that the external device (e.g., a router) with address extip responds to ping.  ·         when="pre"  The checker starts before, and stops after, the application integrated into the module (in start\_xx/stop\_xx).  ·         action="wait"  If the ping fails, the checker sets the resource ping.id to down. The associated failover rule, named p**\_**id, executes a wait.  It stops the module and its application, then puts the module in the state WAIT until the checker resets ping.id to up. | 1.    Break the link between the pinged external device and the server in (Ready) state, that is, in state PRIM, ALONE or SECOND for a mirror module, UP for a farm module  o    messages in the log:  "Resource ping.id set to down by pingcheck"  "Action wait from failover rule p\_id"  o    the module becomes WAIT (NotReady) on all nodes  Note: a wait on PRIM (Ready) causes a failover  2.    Restore the network connection  o    messages in the verbose log:  "Resource ping.id set to up by pingcheck"  "Action wakeup from failover rule Implicit\_wakeup"  o    the module becomes (Ready), respectively in state SECOND, ALONE, PRIM or UP  3.    Repeat the test  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop"  Note: This test checks connectivity to an external device. However, if the device is down or unreachable from all servers, all servers are in state WAIT (NotReady) and the application is unavailable. |

### 4.4.6 Test <module> checker with action wait

For a description of module checker, refer
to section 13.17.

|  |  |
| --- | --- |
| In userconfig.xml of AM module:  <check>    <module name="othermodule">      <to addr="ip" port="9010"/>    </module>  </check>  The checker in AM checks the module othermodule on its virtual IP address ip.  ·         If the module othermodule is not started, the checker sets the associated resource module.othermodule\_ip to down. The prefix is module, and the suffix is the other module name and address.  ·         The default failover rule, named module\_failure, executes a wait.  It stops the module AM and its application, then puts the module in the state WAIT until the checker resets module.othermodule\_ip to up.  ·         If the module othermodule is restarted, the checker executes a stopstart on AM.  Note: if the module AM is a mirror module using file replication, then because of the rule notuptodate\_server, the module AM may remain blocked in a WAIT state if the stopstart action happens while AM is in the transition from SECOND to ALONE | 1.    Stop the module othermodule, then start the module AM on all servers.  o    messages in the log of module AM:  "Resource module.othermodule\_ip set to down by modulecheck"  "Action wait from failover rule module\_failure"  o    the module AM becomes WAIT (NotReady) on all servers  2.    Start the module othermodule  o    messages in the verbose log of module AM:  "Resource module.othermodule\_ip set to up by modulecheck"  "Action wakeup from failover rule Implicit\_wakeup"  o    the module AM becomes (Ready) on all nodes  3.    Run a restart on othermodule  o    messages in the log of module AM:  "Action stopstart called by modulecheck"  o    the module AM stops and then automatically starts  4.    Repeat the test on the same server  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop" |

### 4.4.7 Test <custom> checker with action wait

For a description of custom checker, refer
to section 13.16.

|  |  |
| --- | --- |
| In userconfig.xml:  <check>    <custom ident="id" when="pre" exec="customscript" action="wait"/>  </check>  The custom checker is an infinite loop that performs a test and sets the associated resource to up or down based on the test result.  ·         when="pre"  The checker starts before, and stops after, the application integrated into the module (in start\_xx/stop\_xx).  ·         exec="customscript"  Script located under *AM*/bin/customscript that sets the resource custom.id:  o    on error  SAFE/safekit set -r custom.id -v down -i customscript  o    on success  SAFE/safekit set -r custom.id -v up -i customscript  ·         action="wait"  When custom.id is down, the associated failover rule, named c**\_**id, executes a wait.  It stops the module, its application, and the checker, then puts the module in the state WAIT until the checker resets custom.id to up. | 1.    Cause the failure of the custom checker test when the server is in (Ready) state, that is, in state PRIM, ALONE or SECOND for a mirror module, UP for a farm module:  o    messages in the log:  "Resource custom.id set to down by customscript"  "Action wait from failover rule c\_id"  o    the module becomes WAIT (NotReady) on all nodes  Note: a wait on PRIM (Ready) causes a failover  2.    Fix the error tested by the custom checker  o    messages in the verbose log:  "Resource custom.id set to up by customscript"  "Action wakeup from failover rule Implicit\_wakeup"  o    the module becomes (Ready), respectively in state SECOND, ALONE, PRIM or UP  3.    Repeat the test on the same server.  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop" |


The action associated with the custom
checker can be defined through an explicit failover rule instead of the action
attribute, which in this case is set to noaction. The
following example is equivalent to the previous one, except for the name of the
failover rule, which is customid\_failure:

<check>
  <custom ident="id" when="pre" exec="customscript" action="noaction"/>
</check>

<failover>  
  <![CDATA[  
   customid\_failure: if (custom.id == down) then wait();  
  ]]>  
</failover>

This syntax is the one supported before
SafeKit 8.
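
The loop that a custom checker script runs, as described in this section, can be sketched in Python as follows. This is an illustration under stated assumptions, not a script shipped with SafeKit. The names run_checker, set_resource_command, test, run and max_iterations are hypothetical, the 10-second period is arbitrary, and reporting only on a state change is a design choice of the sketch. The command line itself reproduces the form shown in the table above, with SAFE left as the placeholder used in this guide.

```python
import time
from typing import Callable, List, Optional

SAFEKIT = "SAFE/safekit"  # placeholder path, written as in this guide


def set_resource_command(resource: str, value: str, identity: str) -> List[str]:
    """Build: SAFE/safekit set -r <resource> -v <value> -i <identity>."""
    return [SAFEKIT, "set", "-r", resource, "-v", value, "-i", identity]


def run_checker(test: Callable[[], bool],
                run: Callable[[List[str]], None],
                resource: str = "custom.id",
                identity: str = "customscript",
                period: float = 10.0,
                max_iterations: Optional[int] = None) -> None:
    """Loop on test() and report the resource as up or down.

    max_iterations exists only so the sketch can be exercised;
    a real checker would loop until it is stopped.
    """
    last = None
    done = 0
    while max_iterations is None or done < max_iterations:
        value = "up" if test() else "down"
        if value != last:  # assumption of this sketch: report only state changes
            run(set_resource_command(resource, value, identity))
            last = value
        done += 1
        if max_iterations is None or done < max_iterations:
            time.sleep(period)
```

In a real script, run would execute the command, for example through subprocess.run, with SAFE replaced by the actual installation directory. Passing the command runner as a parameter keeps the loop testable without SafeKit installed.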

### 4.4.8 Test <custom> checker with action restart or stopstart

For a description of custom checker, refer
to section 13.16.

#### 4.4.8.1 Action through a failover rule

|  |  |
| --- | --- |
| In userconfig.xml:  <check>    <custom ident="id" when="prim" exec="customscript" action="restart"/>  </check>  The custom checker is an infinite loop that performs a test and sets the associated resource to up or down based on the test result.  ·         when  o    when="prim" for mirror module  checker started/stopped on the server in state PRIM or ALONE (Ready), after/before the application (start\_prim/stop\_prim)  o    when="both" for farm module  checker started/stopped on all servers UP (Ready), after/before the application (start\_both/stop\_both)  ·         exec="customscript"  Script located under *AM*/bin/customscript that sets the resource custom.id:  o    on error  SAFE/safekit set -r custom.id -v down -i customscript  o    on success  SAFE/safekit set -r custom.id -v up -i customscript  ·         action  When custom.id is down, the associated failover rule, named c**\_**id, executes a restart or stopstart.  o    action="restart"  It restarts the application locally (stop\_xx; start\_xx).  o    action="stopstart"  It stops the module, its application, and the checker completely, and then automatically starts the module. | 1.    Cause the failure of the custom checker test when the server is in (Ready) state, that is, in state PRIM, ALONE or SECOND for a mirror module, UP for a farm module:  o    messages in the verbose log:  "Resource custom.id set to down by customscript"   and  "Action restart from failover rule c\_id"   or  "Action stopstart from failover rule c\_id"   o    the module becomes (Transient).  o    in case of restart, the module becomes (Ready), respectively in state PRIM, ALONE or UP  o    in case of stopstart, the module becomes (Ready), respectively in state SECOND, ALONE or UP  Message in the log:  "Action start called automatically"  Note: a stopstart on PRIM (Ready) causes a failover.  2.    Repeat the test on the same server.  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop" |


The action associated with the custom
checker can be defined through an explicit failover rule instead of the action
attribute, which in this case is set to noaction. The
following example is equivalent to the previous one, except for the name of the
failover rule, which is customid\_failure:

<check>
  <custom ident="id" when="prim" exec="customscript" action="noaction"/>
</check>

<failover>  
  <![CDATA[  
   customid\_failure: if (custom.id == down) then restart();  
  ]]>  
</failover>

This syntax is the one supported before
SafeKit 8.

#### 4.4.8.2 Action through a command in the custom checker

|  |  |
| --- | --- |
| In userconfig.xml:  <check>    <custom ident="id" when="prim" exec="customscript" action="noaction"/>  </check>  The custom checker is an infinite loop that performs a test and executes a restart or stopstart based on the test result.  ·         when  o    when="prim" for mirror module  checker started/stopped on the server in state PRIM or ALONE (Ready), after/before the application (start\_prim/stop\_prim)  o    when="both" for farm module  checker started/stopped on all servers UP (Ready), after/before the application (start\_both/stop\_both)  ·         action="noaction"  No failover rule generated.  ·         exec="customscript"  Script located under *AM*/bin/customscript that runs the action itself:  o    on error  SAFE/safekit restart -i customscript  It restarts the application locally (stop\_xx; start\_xx).  or  o    on error  SAFE/safekit stopstart -i customscript  It stops the module, its application, and the checker completely, and then automatically starts the module. | 1.    Cause the failure of the custom checker test when the server is in (Ready) state, that is, in state PRIM, ALONE or SECOND for a mirror module, UP for a farm module:  o    messages in the verbose log:  "Action restart called by customscript"   or  "Action stopstart called by customscript"    o    the module becomes (Transient).  o    in case of restart, the module becomes (Ready), respectively in state PRIM, ALONE or UP  o    in case of stopstart, the module becomes (Ready), respectively in state SECOND, ALONE or UP  Message in the log:  "Action start called automatically"  Note: a stopstart on PRIM (Ready) causes a failover.  2.    Repeat the test on the same server.  By default, on the 4th error detection within 24 hours (see maxloop and loop\_interval in section 13.3.3), the module becomes STOP (NotReady). In the log, message before stopping:  "Action stop called by maxloop"  Note: when the custom checker runs the action directly, the maxloop counter is incremented only if -i identity is passed to the restart or stopstart command. Without an identity, SafeKit considers the command an administrative operation: the counter is reset and there is no stop after 4 restarts. |
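
The interplay between maxloop, loop\_interval and the -i identity can be pictured with a small Python model. It is a toy illustration, not SafeKit's algorithm. The class and method names are hypothetical, maxloop=3 is inferred from the "4th error detection" wording above, and treating the 24 hours as a sliding window is an assumption (section 13.3.3 is the reference for the actual semantics).

```python
from typing import List, Optional


class LoopCounter:
    """Toy model of the maxloop / loop_interval rule described in this section."""

    def __init__(self, maxloop: int = 3, loop_interval_hours: float = 24.0) -> None:
        self.maxloop = maxloop
        self.window = loop_interval_hours * 3600.0  # interval in seconds
        self.events: List[float] = []

    def on_action(self, now: float, identity: Optional[str]) -> str:
        """Return 'run' if the action proceeds, 'stop' if maxloop is exceeded."""
        if identity is None:
            # No -i identity: treated as an administrative command, counter reset.
            self.events.clear()
            return "run"
        # Keep only detections still inside the interval (sliding window assumed).
        self.events = [t for t in self.events if now - t < self.window]
        self.events.append(now)
        return "stop" if len(self.events) > self.maxloop else "run"


counter = LoopCounter()
for hour in range(4):
    # Four detections one hour apart, all carrying an identity.
    print(hour, counter.on_action(hour * 3600.0, "customscript"))
```

In this model the first three detections return "run" and the fourth returns "stop", which mirrors the "Action stop called by maxloop" behavior. A call with identity set to None in between clears the history, so the fourth detection would proceed.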