---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/safekituserguidehtml/documentation/safekituserguideen.htm
---

## 13.15      IP checker - <ip>

By default, the module performs a stopstart when the IP checker detects that the IP address is not configured locally. On Windows, the checker also detects conflicts on that address.

|  |  |
| --- | --- |
| Important | If a <check> section is already defined, insert the <ip> tag into it. |

### 13.15.1  <ip> example

<check>

  <ip ident="ip\_check">

    <to addr="192.168.1.10"/>

  </ip>

</check>

·        The resource associated with the checker is named ip.ip\_check (with the prefix ip.)

·        The failover rule, which performs a stopstart when an ip class resource goes down, is static and defined by the default failover rule:

ip\_failure: if (ip.? == down) then stopstart();

For a description of checkers, refer to section 13.11.3.

 

|  |  |
| --- | --- |
| See also | The example in section 15.11 presents the configuration via the web console along with the corresponding userconfig.xml. |

### 13.15.2  <ip> syntax

  <ip

    ident="ip\_checker\_name"

    [when="prim"|"both"]

  >

    <to

      addr="IP address or name to check"

      [interval="10s"]

    />

  </ip>

### 13.15.3  <ip> attributes

|  |  |
| --- | --- |
| <ip | Note: <ip> sections are automatically generated for the virtual IPs when <virtual\_addr check="on"> is set (see the virtual IP definition in section 13.6). |
| ident="*ip\_checker\_name*" | IP checker name.  It defines the resource associated with the checker:  **ip**.*ip\_checker\_name* (with the prefix ip.) |
| [when="prim"\|"both"] | Default if not set:  ·         when="prim" for a mirror module  The checker is started after/stopped before the execution of the start\_prim/stop\_prim scripts.  ·         when="both" for a farm module  The checker is started after/stopped before the execution of the start\_both/stop\_both scripts.  When an error is detected, the action is stopstart. The name of the failover rule, **ip\_failure**, is static and predefined. |
| <to |  |
| addr="*IP address or name*" | Local IP address or name to check.  IPv4 or IPv6 address. |
| [interval="10s"] | Interval in seconds between two checks.  Default value: 10s (10 seconds).  Note: the time unit is supported since SafeKit 8.2.5 (see section 13.1). |
| </ip> |  |
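
As an illustration of the attributes above, the following sketch checks an address every 30 seconds on the primary node of a mirror module. The checker name vip\_check, the address and the interval value are assumptions chosen for the example, and the unit suffix on interval assumes SafeKit 8.2.5 or later.

```xml
<check>
  <!-- Hypothetical checker name: the resource would be ip.vip_check -->
  <ip ident="vip_check" when="prim">
    <!-- Illustrative address and interval (default interval is 10s) -->
    <to addr="192.168.1.10" interval="30s"/>
  </ip>
</check>
```

With such a configuration, the resource would remain covered by the static ip\_failure rule, since that rule applies to any ip class resource.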

## 13.16      Custom checker - <custom>

A custom checker is an executable (script
or binary) that you develop to test a resource or application. It consists of a
loop that performs a test at appropriate intervals. Its role is to set the
associated resource's status to up or down. Then, a
failover rule decides the action to be taken on the module when the resource is
down.

Since SafeKit 8, the action can be
configured using the action attribute of the <custom> tag.

|  |  |
| --- | --- |
| Important | If a <check> section is already defined, insert the <custom> tag into it. |

### 13.16.1  <custom> example

·        Example with an action other than "noaction"

<check>

  <custom ident="AppChecker" when="prim" exec="mychecker" action="stopstart"/>

</check>

o    The resource associated with the checker is named custom.AppChecker (with the prefix custom.)

o    The generated failover rule, which performs a stopstart when the resource goes down, is named c\_AppChecker (with the prefix c\_) and is equivalent to:

c\_AppChecker: if (custom.AppChecker == down) then stopstart();

·        Example with action="noaction"

<check>

  <custom ident="AppChecker" when="prim" exec="mychecker" action="noaction"/>

</check>

No failover rule is generated. The user can define one explicitly in the <failover> tag. For example:

…

<failover>

  <![CDATA[

    custom\_failure: if (custom.AppChecker == down) then stopstart();

  ]]>

</failover>

 

|  |  |
| --- | --- |
| Important | In SafeKit < 8, the action attribute did not exist; the action was configured by defining a failover rule in the <failover> tag, as shown in the example above. To maintain backward compatibility with older configurations, the default value of the action attribute is therefore equivalent to noaction. |

 

|  |  |
| --- | --- |
| See also | The example in section 15.7 presents the configuration via the web console along with the corresponding userconfig.xml. |

### 13.16.2  <custom> syntax

<custom

  ident="custom\_checker\_name"

  when="pre|prim|second|both"

  exec="executable\_path"

  arg="executable\_arguments"

  action="wait|stop|stopstart|restart|noaction"

/>

### 13.16.3  <custom> attributes

 

|  |  |
| --- | --- |
| <custom | Set as many <custom> sections as there are custom checkers. |
| ident="*custom\_checker\_name*" | Custom checker name.  It defines the resource associated with the checker:  **custom**.*custom\_checker\_name* (with the prefix custom.)  A custom checker must set the state of its associated resource itself, using the command  safekit set -r custom.custom\_checker\_name -v up\|down  Important: SafeKit automatically initializes the state of the resource to init, and the failover machine stays in the WAIT state as long as the checker has not set it. |
| when="pre"  action="wait"\|"noaction" | Use this value to test an external component before the application starts:  ·         when="pre"  The checker is started after/stopped before the execution of the prestart/poststop scripts.  Since SafeKit 8, you can configure the action to take when an error is detected with:  ·         action="wait"  wait on the module. The name of the associated failover rule is **c\_***custom\_checker\_name* (with the prefix c\_)  ·         action="noaction"  No failover rule is generated. The action must be explicitly written in the <failover> tag (see section 13.19). |
| when="prim"\|"second"\|"both"  action="stop"\|"stopstart"\|"restart"\|"noaction" | Use this value to test a component after the application starts:  ·         when="prim" for a mirror module  The checker is started after/stopped before the execution of the start\_prim/stop\_prim scripts.  ·         when="both" for a farm module  The checker is started after/stopped before the execution of the start\_both/stop\_both scripts.  ·         when="second" for a mirror module  The checker is started after/stopped before the execution of the start\_second/stop\_second scripts.  Since SafeKit 8, you can configure the action to take when an error is detected with:  ·         action="stop\|stopstart\|restart"  stop, stopstart or restart the module. The name of the associated failover rule is **c\_***custom\_checker\_name* (with the prefix c\_)  ·         action="noaction"  No failover rule is generated. The action must be explicitly written in the <failover> tag (see section 13.19). |
| exec="*executable\_path*" | Defines the path of the custom checker executable.  It can be a binary executable or a script file.  A relative *executable\_path* is resolved relative to SAFEUSERBIN. In this case, put your executable file in SAFE/modules/*AM*/bin/ of your application module. See section 10.1 for more information on path values.  We recommend a relative path and an executable inside the module.  ·         On Windows, the executable can be a binary or a ps1, vbs or cmd script  ·         On Linux, the executable can be a binary or a shell script |
| arg="*executable\_arguments*" | Defines the arguments passed to the executable when the custom checker is started. |
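
To illustrate the when="pre" case described in the table, here is a minimal sketch of a checker that tests an external component before the application starts. The names DbReady and checkdb and the argument value are hypothetical; only the attribute names and values come from the syntax above.

```xml
<check>
  <!-- Hypothetical pre-start checker: the resource would be custom.DbReady -->
  <!-- With action="wait", the generated rule would be named c_DbReady, -->
  <!-- presumably equivalent to: -->
  <!--   c_DbReady: if (custom.DbReady == down) then wait(); -->
  <custom ident="DbReady" when="pre" exec="checkdb" arg="db.example.com" action="wait"/>
</check>
```

In this sketch, the checkdb executable is expected to set the resource state itself with safekit set -r custom.DbReady -v up or down; until it does, the resource stays in the init state.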

## 13.17      Module checker - <module>

By default, the module performs a wait when the module checker detects that another SafeKit module is unavailable. The module checker also performs a stopstart when it detects that the external module has been restarted (whether by a restart, a stopstart, or because of a failover). The module checker retrieves the status of the external module by connecting to the SafeKit web service running on the server where that module is activated (see section 10.9 for details on the web service).

|  |  |
| --- | --- |
| Important | If a <check> section is already defined, insert the <module> tag into it. |

### 13.17.1  <module> example

·        Example using the default configuration of the SafeKit web service (protocol: HTTP, port: 9010):

<check>

  <module name="mysql">

    <to addr="172.24.190.21" port="9010"/>

  </module>

</check>

mysql is the name of the external module and 172.24.190.21 is its virtual IP address.

o    The resource associated with the checker is named module.mysql\_172.24.190.21 (with the prefix module.)

o    The failover rule, which performs a wait when a module class resource goes down, is static and defined by the default failover rule:

module\_failure: if (module.? == down) then wait();

·        The same example using the secured configuration of the SafeKit web service (protocol: HTTPS, port: 9453):

<check>

  <module name="mysql">

    <to addr="172.24.190.21" port="9453" secure="on"/>

  </module>

</check>

For a description of checkers, refer to section 13.11.3.

 

|  |  |
| --- | --- |
| See also | The examples in section 15.9 present the configuration via the web console along with the corresponding userconfig.xml. |

### 13.17.2  <module> syntax

<module

  [ident="module\_checker\_name"]

  name="external\_module\_name">

  [<to

    addr="IP address or name of the SafeKit server running the external module"

    port="port of the SafeKit web service"

    [interval="10s"]

    [timeout="5s"]

    [secure="on"|"off"]

  />]

</module>

### 13.17.3  <module> attributes

|  |  |
| --- | --- |
| <module | Set as many <module> sections as there are module checkers. |
| name="*external\_module\_name*" | Name of the external SafeKit module to check. |
| [ident="*module\_checker\_name*"] | Name of the module checker.  It defines the resource associated with the checker:  **module**.*module\_checker\_name* (with the prefix module.)  If this attribute is not provided, the resource name is constructed from the name and addr attributes:  **module**.*external\_module\_name*\_*address\_or\_name* |
| [<to | Definition of the server(s) running the external module to check.  Default is the local server. |
| addr="*address\_or\_name*" | IP address or name of the external module.  IPv4 or IPv6 address. |
| port="*port of the SafeKit web service*" | Port of the SafeKit web service.  9010 for HTTP; 9453 for HTTPS. |
| [interval="10s"] | Interval in seconds between two checks.  Default value: 10s (10 seconds).  Note: the time unit is supported since SafeKit 8.2.5 (see section 13.1). |
| [timeout="5s"] | Check reply timeout in seconds.  Default value: 5s (5 seconds).  Note: the time unit is supported since SafeKit 8.2.5 (see section 13.1). |
| [secure="on"\|"off"] | Use the HTTP protocol (secure="off") or HTTPS (secure="on").  Default value: off |
| />] |  |
| </module> |  |
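
The optional attributes can be combined as in the following sketch, which names the checker explicitly and spells out the polling parameters. The ident value db is a hypothetical name; the module name, address and port are taken from the example in section 13.17.1, and the unit suffixes assume SafeKit 8.2.5 or later.

```xml
<check>
  <!-- ident is set, so the resource would be module.db -->
  <!-- rather than module.mysql_172.24.190.21 -->
  <module ident="db" name="mysql">
    <!-- HTTPS web service; interval and timeout shown at their default values -->
    <to addr="172.24.190.21" port="9453" secure="on" interval="10s" timeout="5s"/>
  </module>
</check>
```

Because the default module\_failure rule matches any module class resource, renaming the resource with ident should not change the action taken (a wait) when it goes down.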

 

## 13.18      Splitbrain checker - <splitbrain>

SafeKit provides a split-brain checker suited to mirror architectures. Split-brain is a situation where, due to a temporary failure of all network links between SafeKit nodes, and possibly due to a software or human error, both nodes switch to the primary role while isolated. This is a potentially harmful state, as it implies that the application is running on both nodes. Moreover, when file replication is enabled, the data is modified independently on both nodes.

The split-brain checker detects the loss of all connectivity between nodes and selects only one node to become the primary. The other node is no longer up to date and goes into the WAIT state until:

·        the heartbeat becomes available again

or

·        the administrator runs safekit commands to force the start as primary (safekit stop then safekit prim).

 

The primary node election is based on the
ping of an IP address, called the **witness**. The network topology must be designed
so that only one node can ping the witness in case of split-brain. If this is not
the case, both nodes will go primary.

|  |  |
| --- | --- |
| Important | ·         Ping between the nodes and the witness must be enabled  ·         Since SafeKit 8.2.1, multiple witnesses can be defined. This makes it possible to tolerate the failure of a witness, provided that at least one witness remains accessible. |
| Important | If a <check> section is already defined, insert the <splitbrain> tag into it. |

### 13.18.1  <splitbrain> example

<check>

 
<splitbrain ident="witness" exec="ping" arg="192.168.1.100 192.168.2.120"/>

</check>

·        The resource associated with the checker is named splitbrain.witness (with the prefix splitbrain.)

·        In case of network isolation between nodes, the split-brain checker sets the splitbrain.uptodate resource to up or down according to whether the witness is reachable.

·        The failover rule, which performs a wait when the splitbrain.uptodate resource goes down, is static and defined by the default failover rule:

splitbrain\_failure: if (splitbrain.uptodate == down) then wait();

For a description of checkers, refer to section 13.11.3.

|  |  |
| --- | --- |
| See also | The example in section 15.8 presents the configuration via the web console along with the corresponding userconfig.xml. |

### 13.18.2  <splitbrain> syntax

  <splitbrain

    ident="witness\_name"

    exec="ping"

    arg="witness1\_IP\_name witness2\_IP\_name"

  />

|  |  |
| --- | --- |
| Important | The <splitbrain> tag and its full subtree can be changed with a dynamic configuration. |

### 13.18.3  <splitbrain> attributes

|  |  |
| --- | --- |
| <splitbrain | Set only one split-brain checker. |
| ident="*witness\_name*" | Split-brain checker name.  It defines the resource associated with the checker:  **splitbrain**.*witness\_name* (with the prefix splitbrain.)  The resource is assigned to:  ·         up, if at least one witness responds  ·         down, if no witness responds |
| [when="pre"] | Fixed value.  ·         when="pre"  The checker is started after/stopped before the execution of the prestart/poststop scripts.  On split-brain detection:  ·         The node that has access to the witness (splitbrain.witness\_name="up") sets the resource splitbrain.uptodate to up and becomes primary.  ·         The other node, which does not have access to the witness (splitbrain.witness\_name="down"), sets the resource splitbrain.uptodate to down. This triggers the wait action of the static and predefined failover rule named **splitbrain\_failure**. |
| exec="ping" | Fixed value.  Uses a pinger to ping the witness and set the state of splitbrain.witness\_name. |
| arg="*witness1\_IP\_name witness2\_IP\_name*" | List of IP addresses or names of the witnesses to ping.  IPv4 or IPv6 addresses.  Note: multiple witness definition is supported since SafeKit 8.2.1. |
| </splitbrain> |  |
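
Since the arg attribute holds a list, a configuration with a single witness can be sketched as follows; this is presumably the only form accepted before SafeKit 8.2.1. The checker name and the witness address are illustrative assumptions.

```xml
<check>
  <!-- Single witness: the resource splitbrain.witness would be up only -->
  <!-- while this one address answers the ping -->
  <splitbrain ident="witness" exec="ping" arg="192.168.1.100"/>
</check>
```

With only one witness, a failure of the witness itself cannot be tolerated, which is the limitation that multiple witnesses are meant to address.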

## 13.19      Failover machine - <failover>

SafeKit provides checkers that test a critical element and set the state of the associated resource based on the test result. When a checker detects an error, the failover machine executes an action on the module according to the failover rule associated with the checker. For a complete description, see section 13.11.

Some SafeKit components (<heart>, <rfs>,
<vipd>, <errd>) manage their own resources
and provide their own failover rules. These rules should not be modified or
deleted, as doing so may lead to abnormal behavior of SafeKit.

The failover machine periodically evaluates (by default, every 5 seconds) the overall state of all resources and applies an action based on the failover rules that evaluate to true.

In farm architecture, the failover machine
can work only on the states of local resources whereas in mirror architecture,
the failover machine can work on the states of local and remote resources.

As the states of resources are exchanged on
heartbeat channels, it is better to have several heartbeat channels (see section 13.4 for heartbeats definition).

Failover rules can be written either in a simple language specific to SafeKit or in Lua using SafeKit function calls.

### 13.19.1  <failover> example

The example rules in this section are added to the default rules or to those generated from the configuration of the checkers.

·        Example of adding a rule written in the failover machine language

<failover>

   <![CDATA[

     custom\_failure: if (custom.AppChecker == down) then stopstart();

   ]]>

</failover>

·        Example of adding a rule using the Lua language and the if\_then function call

The prefix "--Lua Rules" indicates that the following section should be interpreted using the Lua interpreter.

<failover>

   <![CDATA[

     --Lua Rules

     Rules = Rules +

     **{** custom\_failure=**if\_then**("custom.AppChecker","down",Action.stopstart),\_group="checker" **}**

   ]]>

</failover>

·        Example of a rule to disable the default rule named ip\_failure and add the rule allip\_failure

<failover>

<![CDATA[

  --Lua Rules

  **Rules.disable**("ip\_failure")

  -- Add here any Lua rules intended to replace the mentioned rules, or write the legacy rules in another CDATA section

]]>

<![CDATA[

  allip\_failure: if (ip.\* == down) then stopstart();

]]>

</failover>

|  |  |
| --- | --- |
| Important | Use a separate <![CDATA[ … ]]> section for each language. |

### 13.19.2  <failover> syntax

<failover [extends="yes"] [period="5000ms"] [handle\_time="15000ms"]>

<![CDATA[

  label: if (expression) then action;

  …

]]>

</failover>

|  |  |
| --- | --- |
| Note | The <failover> tag and its subtree cannot be changed with a dynamic configuration. |

### 13.19.3  <failover> attributes

|  |  |
| --- | --- |
| <failover |  |
| [extends="yes"\|"no"] | ·         extends="yes"  The new failover rules extend the default failover rules (see section 13.19.4 for their definition).  ·         extends="no"  The new failover rules overwrite the default ones (avoid this configuration).  Default value: yes. |
| [period="5000ms"] | Period in milliseconds between two evaluations of the failover rules.  Default value: 5000ms (5000 milliseconds).  Note: the time unit is supported since SafeKit 8.2.5 (see section 13.1). |
| [handle\_time="15000ms"] | A failover action must remain stable (the same) for at least handle\_time (in milliseconds) before being applied by the failover machine.  Default value: 15000ms (15000 milliseconds).  Note: the time unit is supported since SafeKit 8.2.5 (see section 13.1).  Important: handle\_time must be a multiple of the period value. |
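
The constraint between the two timing attributes can be illustrated with the sketch below, where handle\_time is exactly five times period. The values are illustrative assumptions, not recommendations, and the unit suffixes assume SafeKit 8.2.5 or later; the rule is the one used in the examples of section 13.19.1.

```xml
<!-- Illustrative values only: handle_time (10000ms) = 5 x period (2000ms) -->
<failover extends="yes" period="2000ms" handle_time="10000ms">
  <![CDATA[
    custom_failure: if (custom.AppChecker == down) then stopstart();
  ]]>
</failover>
```

Under these settings, the rules would be evaluated every 2 seconds, and an action would be applied only after it has stayed the same for 10 seconds, that is, over five consecutive evaluation periods.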

### 13.19.4  <failover> description

#### 13.19.4.1                      Module resources

The syntax used to designate resources is as follows:

resource ::= [**local.** | **remote.**]0/1 resource\_class**.**resource\_id   (default: **local**)

resource\_class ::= **ping** | **intf** | **tcp** | **custom** | **module** | **heartbeat** | **rfs**

resource\_id ::= **\*** | **?** | name

resource\_state ::= **init** | **down** | **up** | **unknown**

|  |  |
| --- | --- |
| init | Special initialization state of a resource when the checker is not started.  If a resource in the init state is used in a failover rule, SafeKit does not evaluate the rule. |
| up | Resource OK |
| down | Resource KO |
| unknown | Special state of a remote resource: the remote state is unknown at the time of the test (e.g. when the remote module is stopped). |

 

#### 13.19.4.2                      Failover rules

SafeKit provides default failover rules and
generated failover rules from the module checkers’ configuration. Users can
also write their own failover rules.

Default failover rules

The default failover rules for the checkers (module, intf, ip, splitbrain) are:

<failover>

<![CDATA[

  /\* rule for module checkers \*/

  module\_failure: if (module.? == down) then wait();

  /\* rule for interface checkers \*/

  interface\_failure: if (intf.? == down) then wait();

  /\* rule for ip checkers \*/

  ip\_failure: if (ip.? == down) then stopstart();

  /\* rule for splitbrain \*/

  splitbrain\_failure: if (splitbrain.uptodate == down) then wait();

]]>

</failover>

There are also:

·        failover rules dedicated to file replication management, heartbeats…

·        the Implicit\_wakeup rule, which is applied when no wait rule applies. It runs the wakeup action.

|  |  |
| --- | --- |
| Note | Since SafeKit 7.5, the default failover rules use a new syntax based on the Lua language. |

Generated failover rules

For the tcp, ping, and custom checkers, a rule is generated when the action attribute is set to stop, stopstart, restart or wait.

The rule name consists of the first letter of the checker’s name (t, p or c), followed by \_, then the value of the ident attribute (e.g. p\_router, t\_service, c\_app).

Configured failover rules

Users can also define their own rules in the <failover><![CDATA[ … ]]></failover> section. By default, these are added to the default and generated rules.

|  |  |
| --- | --- |
| See also | See the examples in section 13.19.1. |

Failover rules can be written using one of
the following syntaxes:

·        Failover machine language

label: **if** **(** expression **)** **then** action;

label ::= **string**

action ::= stop() | stopstart() | wait() | restart() | swap()

expression ::= **(** expression **)** | **!** expression  
| expression **&&** expression  
| expression **||** expression  
| expression **==** expression  
| expression **!=** expression  
| resource  
| resource\_state

resource ::= [**local.** | **remote.**]0/1 resource\_class**.**resource\_id

 

·        Lua language

o    if\_then function call to define a rule

--Lua Rules

Rules = Rules +

{ label=**if\_then**("resource","resource\_state",action),\_group="checker" }

label ::= **string**

action ::= Action.stop | Action.stopstart | Action.wait | Action.restart | Action.swap

resource ::= resource\_class**.**resource\_id

resource\_state ::= a resource state as defined in section 13.19.4.1

o    Rules.disable function call to disable a rule based on its label

--Lua Rules

Rules.disable("failover\_rule\_label")

|  |  |
| --- | --- |
| Important | Use a separate <![CDATA[ … ]]> section for each language. |
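
The failover machine language grammar given in this section allows resources to be prefixed with local. or remote. and expressions to be combined with &&. The following sketch, built from that grammar only and not verified against a running module, reacts to a local checker failure only while the same resource is up on the remote node. It assumes a mirror module (remote resources are not available in a farm) and a custom checker named AppChecker configured with action="noaction", so that no generated rule competes with this one; the rule label is hypothetical.

```xml
<failover>
  <!-- Hypothetical rule: stopstart only when the peer reports the resource as up -->
  <![CDATA[
    app_local_failure: if (local.custom.AppChecker == down && remote.custom.AppChecker == up) then stopstart();
  ]]>
</failover>
```

Since a remote resource can also be in the unknown state (for example when the remote module is stopped), a rule written this way would not fire in that case.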

#### 13.19.4.3                      Actions

The actions (restart(), stopstart(), stop(), swap()) of the failover machine are equivalent to the control commands (with the -i identity parameter) described in section 9.3.

|  |  |
| --- | --- |
| Important | maxloop / loop\_interval / automatic\_reboot are applied when -i identity is passed to the commands. This is the case when the commands are called from the failover machine or from checkers. |

