---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/slides-en/8-checkers-en.pptx
---

# PowerPoint Converted to Markdown

Source: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/slides-en/8-checkers-en.pptx


## Slide 1: SafeKit Checkers

- For mirror and farm modules


### Speaker notes

> I’m going to present the checkers of SafeKit that can be applied to mirror and farm modules.


## Slide 2

- Overview


### Speaker notes

> Let’s start with an overview.


## Slide 3: RESTART checkers

- Test the application once started. On error detection, restart it locally or remotely
- `<errd>`: Monitors processes and services
- `<tcp>`: Checks a TCP service
- `<custom>`: Checks an application with a specific script
- `<ip>`: Checks the virtual IP


### Speaker notes

> Let's begin with the restart checkers.
> Once the application is started, the restart checker checks for errors in the application and restarts the application either locally or remotely. As shown in the figure on the right, the restart checker operates in a loop, periodically testing for errors and setting a SafeKit resource to either up or down. Then, a failover rule on the resource triggers one of the following actions.
> A restart action triggers a local restart of the module, which leads to the restart of the application on the same node.
> A stopstart action first stops the module locally and triggers a failover to the other node if it is running the module. The module is then automatically started immediately after it stops.
> A stop action stops the module locally and triggers a failover to the other node if it is running the module.
> The list of restart checkers is as follows, along with their configuration tags in userconfig.xml:
> errd: Monitors processes and services.
> tcp: Checks a TCP service.
> custom: Checks an application with a specific script.
> ip: Checks the virtual IP.
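
The loop described in the notes (test, set a resource to up or down, let a failover rule pick the action) can be sketched as follows. This is only an illustrative model, not SafeKit code: `run_checker`, `failover_rule` and the simulated test are hypothetical names, and the resource name `tcp.Web_80` is borrowed from a later slide.

```python
from typing import Callable, Dict, List, Optional


def run_checker(test: Callable[[], bool], resource: str,
                resources: Dict[str, str], cycles: int) -> List[str]:
    """Run `test` once per cycle and record the resource as "up" or "down"."""
    history = []
    for _ in range(cycles):
        resources[resource] = "up" if test() else "down"
        history.append(resources[resource])
        # A real checker would sleep for its polling interval here.
    return history


def failover_rule(resources: Dict[str, str], resource: str,
                  action: str) -> Optional[str]:
    """Return the configured action while the resource is down, else None."""
    return action if resources.get(resource) == "down" else None


# Simulated application that answers twice and then fails.
answers = iter([True, True, False])
resources: Dict[str, str] = {}
history = run_checker(lambda: next(answers), "tcp.Web_80", resources, cycles=3)
triggered = failover_rule(resources, "tcp.Web_80", "restart")
print(history, triggered)
```

The checker itself only reports a state; the decision to restart, stopstart or stop stays in the failover rule, which is the separation the slide describes.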


## Slide 4: WAIT checkers

- Test an external critical component. On error detection, block and wait for its repair while the application is down
- `<intf>`: Checks a network interface
- `<tcp>`: Checks an external TCP service
- `<ping>`: Checks an external component via a ping
- `<custom>`: Checks a component with a customized script
- `<splitbrain>`: Avoids two ALONE servers
- `<module>`: Verifies status of another module


### Speaker notes

> Let’s now consider the wait checkers.
> A wait checker tests an external mandatory component and, upon error detection, blocks and waits for its repair before allowing the module to transition to its green state. As shown in the figure on the right, the wait checker operates in a loop, periodically testing the external component and setting a SafeKit resource to either up or down. Then, a failover rule on the resource triggers a wait action if the resource is down. When the resource is up again, the module is unblocked and can transition to its green state. Note that the application is not running while a module is in the WAIT state.
> The list of wait checkers is as follows, along with their configuration tags in userconfig.xml:
> intf: Checks a network interface.
> tcp: Checks an external TCP service.
> ping: Checks an external component via a ping.
> custom: Checks a resource with a customized script.
> splitbrain: Avoids two ALONE servers in case of network isolation.
> module: Verifies status of another module.


## Slide 5: Overview of userconfig.xml

- `<service [maxloop="3"] [loop_interval="24"]>`
- `<errd> </errd>`
- `<check>`
  - `<intf> </intf>`
  - `<ip> </ip>`
  - `<tcp> </tcp>`
  - `<ping> </ping>`
  - `<module> </module>`
  - `<splitbrain> </splitbrain>`
  - `<custom> </custom>`
- `</check>`
- A single section for `<errd> … </errd>`
- Another section for all other checkers
- A maximum of 3 local error detections within 24 hours before stopping the module


### Speaker notes

> Here is an overview of the userconfig.xml file for checkers. You will find all the tags discussed previously in the `errd` and `check` tags. There are two specific attributes for checkers in the `service` tag, named `maxloop` and `loop_interval`. They prevent an endless loop of error detections when a permanent error keeps a module from starting correctly. In the example, a restart checker is allowed a maximum of three local error detections within 24 hours before the module is stopped, which triggers a failover to the other node if it is running the module.
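
The effect of `maxloop` and `loop_interval` can be pictured as a counter over a sliding time window. The sketch below is a hypothetical model, not SafeKit's implementation: the class name `LoopGuard` is invented, and it assumes that the configured action is applied for the first `maxloop` detections in the window and that the next detection stops the module. The exact boundary in SafeKit may differ.

```python
from collections import deque


class LoopGuard:
    """Count error detections inside a sliding window of loop_interval hours."""

    def __init__(self, maxloop: int = 3, loop_interval_hours: int = 24):
        self.maxloop = maxloop
        self.window = loop_interval_hours * 3600  # window length in seconds
        self.detections = deque()

    def on_error(self, now: float, action: str) -> str:
        """Return the action to apply for an error detected at time `now`."""
        # Forget detections that have fallen out of the window.
        while self.detections and now - self.detections[0] >= self.window:
            self.detections.popleft()
        self.detections.append(now)
        # Assumption: detection number maxloop + 1 escalates to "stop".
        return action if len(self.detections) <= self.maxloop else "stop"


guard = LoopGuard(maxloop=3, loop_interval_hours=24)
hour = 3600
decisions = [guard.on_error(h * hour, "restart") for h in (0, 1, 2, 3)]
print(decisions)
```

With the default values, three detections spread over several days never reach the limit, because older detections leave the 24-hour window before the count exceeds `maxloop`.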


## Slide 6: Summary

- Built-in checker
  - Heartbeat loss/recovery
  - Replication flow loss/recovery
  - Actions are executed by the component, or the failover machine with predefined failover rules
- Configurable checkers
  - `<intf>`, `<ip>`, `<ping>`, `<tcp>`, `<module>`, `<splitbrain>`, `<custom>`
  - Actions are executed by the failover machine according to failover rules
- Configurable checker
  - Process and service monitoring
  - Actions are executed by errd
- Failover rules to define actions
  - Actions executed by the failover machine based on the state of resources
- Configuration tags: `<heart>`, `<farm>`, `<rfs>`, `<errd>`, `<check>`, `<failover>`


### Speaker notes

> In summary, let's talk about the different types of checkers in SafeKit.
> First, we have the built-in checkers. These include heartbeat loss and recovery, and replication flow loss and recovery. The actions for these checkers are executed either by the SafeKit component itself, or by the failover machine following predefined failover rules.
> Next, we have configurable checkers. These include firstly the `errd` checker, for process and service monitoring. The actions configured in this checker are directly executed by `errd`. Secondly, you have the checkers defined in the `check` tag, like `intf`, `ip`, `ping`, `tcp`, `module`, `splitbrain`, and `custom`. The actions are executed by the failover machine according to failover rules and the states of the resources managed by these checkers.
> Finally, failover rules can be redefined inside the failover tag of userconfig.xml. Default failover rules are provided by SafeKit, avoiding the need to configure this tag in most cases.


## Slide 7

- errd checker


### Speaker notes

> Let’s begin with the errd checker configuration.


## Slide 8: Processes and services monitoring (1/2)*

- errd - RESTART checker
- `<errd>`
- `<proc name="mysqld.exe" class="prim" [service="no"] [action="stopstart"] [atleast="1"] />`
- `<proc name="MySQL" class="prim" service="yes" action="restart"/>`
- `</errd>`
- `name`: process name or service name to monitor
- Resource: proc.mysql.exe, proc.MySQL
- `service`: "yes" for monitoring a service
- `action`: "restart", "stop", "stopstart", "noaction" by errd
  - "noaction" just puts a message in the module log (debug)
- `class`: "both" for a farm module
- `atleast`: check that at least this number of processes are running
- *See softerrd.safe for a full example


### Speaker notes

> Within the `errd` tag, you can define processes or services to monitor.
> For example, in the figure, the first `proc` tag and its attributes define the monitoring of `mysqld.exe` on the primary node only, as a process name and not a service name. The action is stopstart: if there is not at least one process with this name in the list of running processes on the primary node, the module is stopped and started again, which causes a failover to the other node if it is running the module.
> The second `proc` tag defines the monitoring of MySQL on the primary node, indicating that it is a service name, with a restart action if the service is not running. In that case, the restart action restarts the application locally on the node. If the restart does not resolve the issue after three attempts (the value of `maxloop`), the module will stop locally, and there will be a failover to the other server if it is running the module.
> For a farm module, you need to change the class from prim to both. The possible actions are restart, stop or stopstart. There is also noaction to just put a message in the module log for debugging purposes.
> As shown in the first text box, a resource is associated with each `proc` tag, giving the state of the process or service monitored by SafeKit.
> For a full example, you can refer to the softerrd.safe file.
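
The `atleast` test described above amounts to counting the running processes that carry the monitored name. The following is a minimal sketch of that decision, not errd's actual code: `errd_action` is a hypothetical helper, the process list is passed in rather than read from the system, and the default values simply mirror the bracketed values shown on the slide.

```python
from typing import Iterable, Optional


def errd_action(running: Iterable[str], name: str,
                atleast: int = 1, action: str = "stopstart") -> Optional[str]:
    """Return the configured action if fewer than `atleast` processes match."""
    count = sum(1 for proc in running if proc == name)
    return None if count >= atleast else action


print(errd_action(["mysqld.exe", "svchost.exe"], "mysqld.exe"))  # enough processes
print(errd_action(["svchost.exe"], "mysqld.exe"))                # process missing
```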


## Slide 9: Processes and services monitoring (2/2)

- errd - RESTART checker
- `polltimer`: delay, in seconds, between 2 evaluations
- `nameregex`: Linux only. Regular expression on the command name
- `argregex`: regular expression on command name and arguments
- `start_after`: delays the checker
  - assign to "3" for example, to start monitoring after 60 s
  - seconds=(start_after-1)*polltimer
- `<errd [polltimer="30"] >`
- `<proc name="oracle" argregex=".*Base1" class="prim" action="restart" [start_after="0"] />`
- `<proc name="oracle" nameregex="oracle_.*" class="prim" action="restart"/>`
- `</errd>`
- Note: `safekit -r processtree list all` lists all running processes with command name and arguments


### Speaker notes

> Let's examine the advanced parameters of the `errd` tag.
> Firstly, you can define the `polltimer` attribute, which specifies the interval, in seconds, at which `errd` checks monitored processes or services.
> Secondly, you can define a regular expression in the `argregex` attribute. This regex applies to the command name and the arguments of the running process. To list the command names and arguments of running processes, SafeKit provides a special safekit processtree command, displayed in the text box of the slide.
> Thirdly, an important attribute is `start_after`. It is particularly useful when the process or service being monitored requires some time to start before it can be effectively monitored. If the attribute is not set, monitoring begins immediately after the execution of the `start_prim` script for a mirror module, or the `start_both` script for a farm module. Otherwise, `start_after` specifies a delay in monitoring the process or the service, measured in polling cycles. For example, with `polltimer` set to 30 seconds and `start_after` set to 3, the delay will be 60 seconds.
> Fourthly, on Linux, you can set the `nameregex` attribute. This attribute applies to the command name of the process, and not to its arguments.
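
The delay formula on the slide, seconds=(start_after-1)*polltimer, is easy to check numerically. The helper below is a hypothetical illustration; clamping the result to zero for `start_after` values of 0 or 1 is an assumption, chosen so that the default shown on the slide (`start_after="0"`) means monitoring starts immediately.

```python
def monitoring_delay_seconds(start_after: int, polltimer: int = 30) -> int:
    """Delay before errd starts monitoring, per the slide's formula."""
    # Assumption: the delay never goes negative (start_after="0" means no delay).
    return max(0, (start_after - 1) * polltimer)


print(monitoring_delay_seconds(3, polltimer=30))  # the slide's example
```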


## Slide 10

- intf and ip checkers


### Speaker notes

> Let’s now detail the intf and ip checkers.


## Slide 11: Network interface and virtual IP monitoring

- intfcheck – WAIT checker; ipcheck – RESTART checker
- `<vip>`
- `<interface_list>`
- `<interface [check="on"]>`
- `<virtual_addr addr="172.24.199.100" where="one_side_alias" [check="on"]/>`
- `</interface>`
- `</interface_list>`
- `</vip>`
- Checker that detects duplicate VIP address conflict or removal
  - Resource: ip.172.24.199.100
  - Action: stopstart by predefined failover rule ip_failure
- Checker that detects interface failure
  - Resource: intf.172.24.199.0
  - Action: wait by predefined failover rule interface_failure


### Speaker notes

> As shown in the figure, when you set `check="on"` in the `interface` tag of a virtual IP configuration, the interface checker named intfcheck is activated. This checker detects failures of the network interface where the virtual IP is set. As you can see in the text box of the slide, a resource is associated with this checker, and the failover rule named `interface_failure` puts the module in the WAIT red state if there is a network interface failure. When the network interface is repaired, the interface checker detects it, and the module exits the WAIT red state.
> Now, in the `virtual_addr` tag, if you set `check="on"`, you activate the virtual IP checker named ipcheck. This checker monitors the virtual IP and detects duplicate virtual IP address conflicts or virtual IP removal. As you can see in the text box, a resource is associated with this checker, and the failover rule named `ip_failure` performs a stopstart action if there is a virtual IP failure. The stopstart action will deconfigure and then reconfigure the virtual IP address to solve the problem.


## Slide 12

- custom checker


### Speaker notes

> Let’s now explain the custom checkers.


## Slide 13: Customized check of the application*

- Custom checker – RESTART checker
- `<check>`
- `<custom ident="sql" exec="checker_sql.ps1" when="prim" action="restart" />`
- `</check>`
- checker_sql.ps1: specific script in SAFE/modules/AM/bin
- Resource: custom.sql
- Action: restart by generated failover rule c_sql
- `when`: "both" for a farm module
- *See customchecker.safe for a full example
- Action: "restart", "stop", "stopstart"


### Speaker notes

> Firstly, let's consider the writing of the custom checker script. In this example, a user has written a checker named checker_sql.ps1. This script is essentially a loop that periodically tests the application and sets a SafeKit resource named custom.sql to either up or down. This script must be placed in the bin directory of the module.
> Secondly, to activate the checker, a `custom` tag must be added in userconfig.xml, as shown on the right side of the slide. This tag defines the identity of the checker, the executable name, whether it should be executed on the primary server for a mirror module or on both servers for a farm module, and the action to take in case of failure.
> A failover rule will be automatically generated with the action. The name of this failover rule is the prefix 'c_' followed by the identity of the custom checker, in this case c_sql. This name will be displayed in the console and in the SafeKit logs when the checker detects a failure.
> There is no further configuration needed. SafeKit will take care of launching the checker script after executing the `start_prim` script in a mirror module, or the `start_both` script in a farm module. SafeKit will also handle stopping the checker script before executing either the `stop_prim` script or the `stop_both` script.
> For a full example, please refer to the customchecker.safe file.


## Slide 14: Customized check of an external component

- Custom checker – WAIT checker
- `<check>`
- `<custom ident="router" exec="checker_router.ps1" arg="IP1 IP2" when="pre" action="wait" />`
- `</check>`
- checker_router.ps1: specific script in SAFE/modules/AM/bin
- Resource: custom.router
- Action: wait by generated failover rule c_router
- Arguments:
  - 1st argument: resource name (ex. custom.router)
  - 2nd argument: module name (ex. AM)
  - Next arguments: `arg` value


### Speaker notes

> Now, let's consider the writing of a custom checker script, that checks an external component, and puts the module in the WAIT state, if it detects failures.
> As in the previous example, the custom checker script is essentially a loop, that periodically tests the external component and sets a SafeKit resource, named here custom.router, to either up or down.
> The configuration is exactly the same as the one explained in the previous slide, except for the setting of the when and action attributes, as you can see on the right of the slide.
> The when attribute is set to "pre" to indicate that the custom checker must be started in the prestart step of the module. Thus, if the custom checker detects that the external component has failed, the module will be immediately put in the WAIT state without starting the application.
> Note that, as shown in the right text box, a new attribute, named arg, can be configured to pass arguments to the custom checker.
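
The argument order listed on the slide (resource name, module name, then the `arg` value) can be illustrated with a small parser. This is a sketch in Python for illustration only, since the slide's checker is a PowerShell script. `CheckerArgs` and `parse_checker_args` are hypothetical names, and whether SafeKit passes `arg="IP1 IP2"` as one argument or as two is an assumption, so the sketch accepts both forms.

```python
from typing import List, NamedTuple


class CheckerArgs(NamedTuple):
    resource: str     # 1st argument, e.g. custom.router
    module: str       # 2nd argument, e.g. AM
    extra: List[str]  # remaining arguments, taken from the arg attribute


def parse_checker_args(argv: List[str]) -> CheckerArgs:
    """Split the checker's arguments into resource, module and arg values."""
    if len(argv) < 2:
        raise ValueError("expected at least a resource name and a module name")
    # Split on whitespace so "IP1 IP2" and ["IP1", "IP2"] give the same result.
    extra = [token for item in argv[2:] for token in item.split()]
    return CheckerArgs(argv[0], argv[1], extra)


parsed = parse_checker_args(["custom.router", "AM", "IP1", "IP2"])
print(parsed.resource, parsed.module, parsed.extra)
```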


## Slide 15

- splitbrain checker for a mirror module


### Speaker notes

> Let’s now explain the splitbrain checker for a mirror module.


## Slide 16: What is a split brain?

- Network isolation between PRIM and SECOND
- Default behavior: 2 ALONEs
- Behavior with a splitbrain checker: 1 ALONE
- Figure: node1 and node2 with heartbeats KO, shown for the default behavior and for the behavior with a splitbrain checker, plus the witness(es) (router…)
  - Node with ping OK (at least one witness): ALONE, uptodate
  - Node with ping KO (all witnesses): WAIT, not uptodate
- Ping between nodes and witnesses must be enabled


### Speaker notes

> The splitbrain checker is useful in case of network isolation between the primary and the secondary node. In this situation, all heartbeats are lost between both nodes, and the default behavior is that each node goes into the ALONE state, running the application. If you configure a splitbrain checker, you avoid the double execution of the application on the two nodes. To achieve this, you must configure the splitbrain checker with the IP address of a reliable witness, typically a router in the network. In case of network isolation between nodes and loss of all heartbeats, only the node that can access the router via a ping will go to the ALONE state. The other node will go to the WAIT not up-to-date state, waiting for the recovery of heartbeats to resynchronize its data and become secondary. When choosing the witness, ensure that only one node can reach the witness in an isolation situation. Otherwise, both nodes will go into the ALONE state during network isolation, and the issue of double execution of the application will not be resolved.
> Also, choose a reliable witness; otherwise, if the witness is down when a node wants to go to the ALONE state, it will not be able to become ALONE and will go to the WAIT red state because the witness does not respond. SafeKit allows configuring multiple witness IP addresses to remedy this.
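
The decision each node takes after losing all heartbeats can be reduced to one rule: ALONE if at least one witness answers, WAIT otherwise. The function below is a hypothetical sketch of that rule only. It takes ping results as booleans instead of sending real pings.

```python
from typing import Iterable


def state_after_heartbeat_loss(witness_replies: Iterable[bool]) -> str:
    """State of a node with a splitbrain checker once all heartbeats are lost."""
    return "ALONE" if any(witness_replies) else "WAIT"


# Network isolation: node1 still reaches one of two witnesses, node2 reaches none.
node1 = state_after_heartbeat_loss([True, False])
node2 = state_after_heartbeat_loss([False, False])
print(node1, node2)
```

The sketch also shows the failure mode the notes warn about: if both nodes can reach a witness during the isolation, both calls return ALONE and the application runs twice.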


## Slide 17: Splitbrain check

- splitbraincheck – WAIT checker
- `<check>`
- `<splitbrain ident="witness" exec="ping" arg="172.24.199.53 172.24.197.31" [when="pre"]/>`
- `</check>`
- Resource: splitbrain.witness and splitbrain.uptodate
- Action: wait by predefined failover rule splitbrain_failure
- `exec="ping"`: pings the witness address
- `arg`: one or more witness addresses/names


### Speaker notes

> The configuration of a splitbrain checker is very simple. As shown in the figure, just create a `splitbrain` tag in `userconfig.xml`, with the identity of the checker, with `exec="ping"`, and with the IP address of one or more witnesses. Configuring multiple witnesses allows the system to tolerate the failure of any witness, as the splitbrain checker only needs a response from one of them. Once again, ensure that only one node can reach witnesses in an isolation situation; otherwise, the issue of double execution of the application will not be resolved. As shown in the text box, a resource will be associated with the checker, and the predefined failover rule named `splitbrain_failure` will trigger the wait action.


## Slide 18

- tcp, ping, module checkers


### Speaker notes

> Let’s now explain the tcp, ping and module checkers.


## Slide 19: TCP connection check to the application

- tcpcheck – RESTART checker
- Resource: tcp.Web_80
- Action: restart by generated failover rule t_Web_80
- `<check>`
- `<tcp ident="Web_80" when="prim" [action="restart"]>`
- `<to addr="172.24.199.100" port="80" [interval="10"] [timeout="5"]/>`
- `</tcp>`
- `<!-- As many <tcp> as TCP connections to test -->`
- `</check>`
- `when`: "both" for a farm module
- `interval`: interval, in seconds, between 2 tests
- `timeout`: timeout, in seconds, for error detection
- `addr`, `port`: IP/port to test
- `action`: "restart", "stop", "stopstart", "noaction"
  - "noaction" to prevent generating a failover rule
- Note: before SafeKit 8.2.3, the action attribute did not exist. It was statically set to restart by the predefined failover rule tcp_failure


### Speaker notes

> The TCP checker tests if a TCP connection can be established with the application. To configure it, add a `tcp` tag in `userconfig.xml`, with the identity of the checker, specifying whether it runs only on the primary node for a mirror module or on both nodes for a farm module, and the action. The action can be, as usual, `restart`, `stop`, `stopstart`, or `noaction`. `noaction` prevents the automatic generation of a failover rule, in the special case where you want to write a custom failover rule. Then, inside the `tcp` tag, you will have to define the IP address and TCP port on which you want to test a connection. You can also define the interval, in seconds, between two tests, and the timeout, in seconds, for error detection.
> As shown in the text box, a resource will be associated with the TCP checker, and a failover rule, named 't_' followed by the checker identity, will trigger the action. This name will be displayed in the console and in the module log if the TCP checker detects connection errors.
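
A single TCP test of the kind described above can be sketched with Python's standard `socket` module. This is an illustration of the idea, not SafeKit's tcpcheck: `tcp_check` is a hypothetical helper, and the demonstration uses a throwaway listener on 127.0.0.1 instead of the slide's virtual IP.

```python
import socket


def tcp_check(addr: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to addr:port succeeds within timeout."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Demonstration against a temporary local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = tcp_check("127.0.0.1", demo_port, timeout=2.0)
listener.close()
print(reachable)
```

A checker would call such a test every `interval` seconds and report the resource as down when it returns False.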


## Slide 20: TCP connection check for an external service

- tcpcheck – WAIT checker, since SafeKit 8.2.3
- Resource: tcp.Web_80
- Action: wait by generated failover rule t_Web_80
- `<check>`
- `<tcp ident="Web_80" when="pre" action="wait">`
- `<to addr="172.24.199.100" port="80" [interval="10"] [timeout="5"]/>`
- `</tcp>`
- `<!-- As many <tcp> as TCP connections to test -->`
- `</check>`
- `interval`: interval, in seconds, between 2 tests
- `timeout`: timeout, in seconds, for error detection
- `addr`, `port`: IP/port to test


### Speaker notes

> Let’s consider a TCP checker, testing if an external TCP service is available, and putting the module in the WAIT state until this external service is ready.
> The configuration is exactly the same as the one explained in the previous slide, except for the setting of the when and action attributes, as you can see on the right of the slide.
> The when attribute is set to "pre" to indicate that the tcp checker must be started in the prestart step of the module. Thus, if the tcp checker detects that the external service has failed, the module will be immediately put in the WAIT state without starting the application.


## Slide 21: Ping check of an external device

- pingcheck – WAIT checker
- Resource: ping.router
- Action: wait by generated failover rule p_router
- `<check>`
- `<ping ident="router" [when="pre" action="wait"]>`
- `<to addr="172.24.199.1" [interval="10"] [timeout="5"]/>`
- `</ping>`
- `<!-- As many <ping> as needed -->`
- `</check>`
- `interval`: interval, in seconds, between 2 tests
- `timeout`: timeout, in seconds, for error detection
- `addr`: IP to ping
- Note: before SafeKit 8.2.3, the action attribute did not exist. It was statically set to wait by the predefined failover rule ping_failure


### Speaker notes

> The ping checker tests whether an external device answers a ping. To configure it, add a `ping` tag in `userconfig.xml`, with the identity of the checker. By default, it’s a wait checker started in the prestart step of the module. Then, inside the `ping` tag, you will have to define the IP address of the external device to ping. You can also define the interval, in seconds, between two tests, and the timeout, in seconds, for error detection.
> As shown in the text box, a resource will be associated with the ping checker, and a failover rule, named 'p_' followed by the checker identity, will trigger the wait action. This name will be displayed in the console and in the module log if the ping checker detects errors.
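
The resource and rule names shown on the custom, TCP and ping slides follow one pattern: the resource is the checker type, a dot, and the `ident` value, while the generated failover rule is a one-letter prefix followed by the same `ident`. The helpers below are hypothetical and only restate that pattern. They do not cover the predefined rules such as `tcp_failure` or `ping_failure` used before SafeKit 8.2.3.

```python
# Prefixes of generated failover rules, as given on the slides.
RULE_PREFIX = {"custom": "c_", "tcp": "t_", "ping": "p_"}


def resource_name(kind: str, ident: str) -> str:
    """Resource associated with a checker, e.g. custom.sql."""
    return f"{kind}.{ident}"


def generated_rule_name(kind: str, ident: str) -> str:
    """Name of the failover rule generated for a checker, e.g. c_sql."""
    return RULE_PREFIX[kind] + ident


for kind, ident in [("custom", "sql"), ("tcp", "Web_80"), ("ping", "router")]:
    print(resource_name(kind, ident), generated_rule_name(kind, ident))
```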


## Slide 22: External module state check (UP/PRIM/ALONE)

- modulecheck – WAIT and RESTART checker
- `<check>`
- `<module name="sqlserver">`
- `<to addr="172.24.199.140" port="9010" [interval="10"] [timeout="5"] [secure="off"]/>`
- `</module>`
- `</check>`
- Checks that an external module named sqlserver is available
- Resource: module.sqlserver_172.24.199.140
- Action: if the sqlserver module is down, wait from predefined failover rule module_failure
- Action: when the sqlserver module is restarted, stopstart from modulecheck
- `addr`: virtual IP address of the sqlserver module
- `port`, `secure`: web service access to the sqlserver cluster
  - 9010 and off for http
  - 9453 and on for https


### Speaker notes

> Let’s now consider the module checker. The module checker tests if another external module is running its application, which means that the external module is in the UP, PRIM, or ALONE state. To configure it, add a `module` tag in `userconfig.xml`, and set the name of the external module, such as `sqlserver` in the example. Then, inside the `module` tag, you will have to define the virtual IP address and the SafeKit web service port of the external module, which is 9010 if the external web service runs HTTP, or 9453 if it runs HTTPS. Also, set `secure` to `off` for HTTP or `on` for HTTPS. You can define the interval, in seconds, between two tests, and the timeout, in seconds, for error detection.
> As shown in the text box, a resource will be associated with the module checker, and a failover rule named `module_failure` will put the module in the WAIT state if the external module is not running its application. Another action is preconfigured: a stopstart of the module if the external module is restarted.
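
Two rules from this slide lend themselves to a short sketch: which states count as "running its application", and how `port` and `secure` pair up. Both helpers are hypothetical, and the base URL is only an illustration of the pairing, since the slides do not describe the web service paths.

```python
# States in which the external module is running its application.
RUNNING_STATES = {"UP", "PRIM", "ALONE"}


def module_available(state: str) -> bool:
    """True if the external module is considered available by the checker."""
    return state in RUNNING_STATES


def web_service_base_url(addr: str, secure: bool) -> str:
    """Pair the port with the secure setting: 9010/off for http, 9453/on for https."""
    scheme, port = ("https", 9453) if secure else ("http", 9010)
    return f"{scheme}://{addr}:{port}"


print(module_available("PRIM"), module_available("WAIT"))
print(web_service_base_url("172.24.199.140", secure=False))
```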


## Slide 23

- Module state transitions with checkers


### Speaker notes

> Let's now examine the module state transitions when actions are triggered by checkers.


## Slide 24: Actions of checkers in a farm module

- Figure: state-transition diagram for node1 of a farm module, showing the scripts run on each transition (`prestart`, `start_both`, `stop_both`, `poststop`) and the transitions triggered by wakeup, wait, restart, stop and stopstart
- Application is started (virtual IP set): load share is 100%, or 50% if node2 is UP
- Application is stopped: network load share is 0%


### Speaker notes

> Let's first consider a farm module and node 1 in the UP green state at the bottom of the figure.
> Firstly, the restart action will trigger the execution of the `stop_both` and `start_both` scripts, restarting the application locally without deconfiguring the virtual IP. During the restart, the module is in the UP orange state.
> Secondly, the stop action will trigger the execution of the `stop_both` script followed by the `poststop` script. At the end, the module will be in the STOP red state, the virtual IP will be unset, and the application will not run on the node. All the traffic will be managed by the other node if it is running the module.
> Thirdly, a stopstart action is first a stop action, which will execute the `stop_both` script and deconfigure the virtual IP address, followed by a start action, which will reconfigure the virtual IP address and execute the `start_both` script. At the end, the module will be in the UP green state. For a farm module, the difference between a stopstart action and a restart action is the deconfiguration and reconfiguration of the virtual IP, as well as the stop and start of checkers.
> Fourthly, a wait action, when the state is UP green, first executes `stop both` and the deconfiguration of the virtual IP address before going to the WAIT red state. The traffic managed by node 1 is then 0%, meaning that the traffic is managed by the other node if it is running the module.
> Fifthly, a failover action occurs when all heartbeats are down, and it involves taking 100% of the traffic on node 1.
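
The four checker actions described in these notes for a farm node in the UP green state can be summarized in a small lookup table. The table below is a reading aid built from the notes, with hypothetical field names. It lists only the scripts the notes mention and is not an exhaustive description of what SafeKit executes.

```python
# Effect of each action on a farm node that starts in the UP green state.
FARM_ACTIONS = {
    "restart":   {"scripts": ["stop_both", "start_both"],
                  "vip_deconfigured": False, "final_state": "UP"},
    "stop":      {"scripts": ["stop_both", "poststop"],
                  "vip_deconfigured": True, "final_state": "STOP"},
    "stopstart": {"scripts": ["stop_both", "start_both"],
                  "vip_deconfigured": True, "final_state": "UP"},
    "wait":      {"scripts": ["stop_both"],
                  "vip_deconfigured": True, "final_state": "WAIT"},
}


def actions_ending_in(state: str) -> list:
    """Names of the actions that leave the node in the given state."""
    return sorted(name for name, effect in FARM_ACTIONS.items()
                  if effect["final_state"] == state)


print(actions_ending_in("UP"))
```

Both restart and stopstart end in UP; the `vip_deconfigured` column is what tells them apart, as the notes point out.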


## Slide 25: Actions of checkers in a mirror module

- Figure: state-transition diagram for node1 of a mirror module, showing the scripts run on each transition (`prestart`, `start_prim`, `stop_prim`, `poststop`), the transitions triggered by wakeup, wait, restart, stop and stopstart, and the uptodate / not uptodate data status
- Application is started (virtual IP set)
- Application is stopped
- Application not running (virtual IP not set)


### Speaker notes

> Let’s now consider a mirror module and node 1 in the PRIM or ALONE green state at the bottom of the figure.
> Firstly, the restart action will trigger the execution of the `stop_prim` and `start_prim` scripts, restarting the application locally without deconfiguring the virtual IP. During the restart, the module is in the PRIM or ALONE orange state, depending on whether it was PRIM or ALONE green before.
> Secondly, the stop action will trigger the execution of the `stop_prim` script followed by the `poststop` script. At the end, the module will be in the STOP red state, the virtual IP will be unset, and the application will not run on node 1. If the other node was SECOND, it has switched to the ALONE state, has set the virtual IP, and has restarted the application.
> Thirdly, a stopstart action is first a stop action, which will execute the `stop_prim` script and deconfigure the virtual IP address, followed by a start action. There are two scenarios in the start action.
> If the state of node 1 was ALONE before the stopstart action, the module will return to the ALONE green state on node 1 after setting the virtual IP and restarting the application in `start_prim`. If the state was PRIM before the stopstart action, then the other node has made an automatic failover and will be in the ALONE state when node 1 starts after its stop. In this case, node 1 will start as SECOND, beginning with data resynchronization in the SECOND orange state, before becoming SECOND green.
> Fourthly, a wait action, when the state is PRIM or ALONE green, first executes `stop_prim` and deconfigures the virtual IP address before going to the WAIT red state. There is a failover if the other node is in the SECOND state: it will go to the ALONE state and restart the application.
> Fifthly, a failover action occurs when all heartbeats are down. Node 1 will transition from the PRIM green state to the ALONE green state. Nothing changes for the application and the virtual IP address. Only the replication to the SECOND node is stopped.
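
The final state of node 1 after each checker action, as described in these notes, can be written as a small function. It is a hypothetical summary of the notes, not SafeKit logic, and it assumes that a node in the PRIM state always has a partner in the SECOND state that takes over.

```python
def mirror_state_after(action: str, state: str) -> str:
    """Final state of node 1 after a checker action in a mirror module."""
    if state not in ("PRIM", "ALONE"):
        raise ValueError("the notes only cover PRIM and ALONE as starting states")
    if action == "restart":
        return state  # local restart, the role is unchanged
    if action == "stop":
        return "STOP"
    if action == "wait":
        return "WAIT"
    if action == "stopstart":
        # From PRIM, the other node has become ALONE, so node 1 rejoins as SECOND.
        return "ALONE" if state == "ALONE" else "SECOND"
    raise ValueError(f"unknown action: {action}")


print(mirror_state_after("stopstart", "PRIM"), mirror_state_after("stopstart", "ALONE"))
```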


## Slide 26: Thank you!

- Contact us here


### Speaker notes

> Thank you for your attention. If you have any questions or need further clarification, please feel free to ask.
