---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/slides-en/6-mirror-module-en.pptx
---

# PowerPoint Converted to Markdown

Source: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/slides-en/6-mirror-module-en.pptx


## Slide 1: SafeKit Mirror

_No extractable slide text found._


### Speaker notes

> These slides are timed and automatically move from one to the next after a delay. To remove this automation, go to 'Slide Show' and uncheck 'Use Timings'.
> The slides have a soundtrack represented by an audio icon on the right side of each slide. To remove the soundtrack, click on each audio icon and lower the volume to the minimum.
> I’m going to present the mirror module of SafeKit in detail, including how to configure the restart scripts, the internal parameters in userconfig.xml, and its operation with various state transitions according to failures.


## Slide 2

- Overview


### Speaker notes

> Let’s start with an overview.


## Slide 3: How does a mirror module work?

- Real-time replication
- Automatic failover
- Resynchronization
- Automatic failback
- Real-time replication
- Real-time replication, virtual IP address and failover


### Speaker notes

> Let's examine how a mirror module works and the associated states and colors in the console.
> At step 1 in the figure, the mirror module is running in the PRIM-SECOND state. The application runs only on the PRIM server, and the virtual IP has been set on its network interface. Only modifications made by the application inside files are replicated in real-time.
> At step 2, the mirror module has been stopped on node 1, or node 1 has experienced a failure. This triggers an automatic failover to node 2. The module is now in the ALONE state on node 2, the virtual IP has been set on its network interface, and the application has been restarted. There is no more replication from node 2 to node 1.
> At step 3, the mirror module is restarted on node 1, and node 1 resynchronizes its local folders from node 2. During this resynchronization, the state is SECOND and the color is orange on node 1. The application continues its execution on node 2 in the ALONE state.
> At step 4, the data has been resynchronized, and we are in the SECOND-PRIM state, ready for a failover. If you prefer the application to run on node 1, you can stop the module on node 2. This will trigger a failover and a restart of the application on node 1. You can do this manually through the console at an appropriate time, or automatically just after the resynchronization, by configuring the defaultprim parameter in userconfig.xml.
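
The four steps above can be read as a small two-node state machine. The sketch below is an illustrative Python model only, not SafeKit code: the `MirrorModel` class and its method names are hypothetical, and only the state names (PRIM, SECOND, ALONE, STOP) and the transitions come from the notes.

```python
# Illustrative model only: the class and method names are hypothetical.
# The state names and transitions follow the four steps described above.

class MirrorModel:
    def __init__(self):
        # Step 1: node 1 runs the application, node 2 mirrors it.
        self.state = {"node1": "PRIM", "node2": "SECOND"}
        self.resyncing = None  # node shown as SECOND/orange, if any

    @staticmethod
    def peer(node):
        return "node2" if node == "node1" else "node1"

    def stop(self, node):
        """Stop or failure of a node: a running peer becomes (or stays) ALONE."""
        other = self.peer(node)
        if self.resyncing == other:
            raise ValueError("case not covered by this sketch")
        self.state[node] = "STOP"
        if self.resyncing == node:
            self.resyncing = None
        if self.state[other] != "STOP":
            self.state[other] = "ALONE"

    def start(self, node):
        """Restart of a stopped node next to an ALONE node: it resynchronizes."""
        if self.state[node] != "STOP" or self.state[self.peer(node)] != "ALONE":
            raise ValueError("case not covered by this sketch")
        self.state[node] = "SECOND"
        self.resyncing = node

    def resync_done(self, node):
        """End of resynchronization: the ALONE node becomes PRIM again."""
        if self.resyncing != node:
            raise ValueError(f"{node} is not resynchronizing")
        self.resyncing = None
        self.state[self.peer(node)] = "PRIM"


m = MirrorModel()
m.stop("node1")          # step 2: failover, node 2 is ALONE
m.start("node1")         # step 3: node 1 is SECOND (orange), resynchronizing
m.resync_done("node1")   # step 4: SECOND-PRIM, ready for a failover
print(m.state)
```

Calling `m.stop("node2")` at this point would model the manual failback to node 1 mentioned in step 4.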


## Slide 4: Mirror modules

- For a new application
  - mirror.safe for a new Windows application
  - mirror.safe for a new Linux application
- List of all SafeKit solutions
  - With a free trial and quick installation guides
- Preconfigured for Linux
  - oracle.safe for Oracle
  - postgresql.safe for PostgreSQL
  - mysql.safe for MySQL
  - kvm.safe for KVM
  - …
- Preconfigured for Windows
  - sqlserver.safe for Microsoft SQL Server
  - postgresql.safe for PostgreSQL
  - mysql.safe for MySQL
  - hyperv.safe for Hyper-V
  - …
- Configuration files of a module named AM
- Real-time replication and failover


### Speaker notes

> Let’s look at the various solutions that you can implement with a mirror module. Firstly, for a new application on Windows or Linux, you can use the mirror.safe module. We offer a comprehensive list of SafeKit solutions, all available with a free trial and a quick installation guide.
> Secondly, our preconfigured solutions for Windows include sqlserver.safe for Microsoft SQL Server, postgresql.safe for PostgreSQL, mysql.safe for MySQL, hyperv.safe for Hyper-V, and more.
> Thirdly, our preconfigured solutions for Linux include oracle.safe for Oracle, postgresql.safe for PostgreSQL, mysql.safe for MySQL, kvm.safe for KVM, and more.
> Fourthly, once deployed, the configuration files are available in the modules directory of SafeKit. If the module has been deployed with the name AM, then you will find an AM subdirectory. Inside AM/bin, you will find the restart scripts named start_prim and stop_prim. And inside AM/conf, you will find the userconfig.xml file.


## Slide 5

- userconfig.xml


### Speaker notes

> Let's now examine in detail the userconfig.xml parameters that can be configured for a mirror module.


## Slide 6: Overview of userconfig.xml

- <!DOCTYPE safe>
- <safe>
- <service mode="mirror">
- <heart>
- <heartbeat name="default" />
- </heart>
- <vip>
- <interface_list>
- <interface check="on">
- <real_interface>
- <virtual_addr addr="172.24.199.100" where="one_side_alias" check="on"/>
- </real_interface>
- </interface>
- </interface_list>
- </vip>
- <rfs>
- <replicated dir="c:\test1replicated" />
- </rfs>
- <user/>
- </service>
- </safe>
- heartbeat configuration
- virtual IP configuration
- real-time replication
- module scripts activation
- Slides "Checkers":
- <errd>         process or service monitoring
- <checker>  checkers
- <failover>   failover rules


### Speaker notes

> Here is an overview of the userconfig.xml file for a mirror module.
> First, you have the heartbeat section defining the networks through which heartbeats must pass.
> Then, you have the virtual IP configuration.
> Next, you see the real-time replication and the path of folders to replicate.
> Following that, you have the user tag, which enables the execution of restart scripts.
> The three other tags on the right of the slide concern checkers and are explained in the checkers slides.


## Slide 7

- Heartbeats


### Speaker notes

> Let’s detail the heartbeats configuration.


## Slide 8: <heartbeat> in userconfig.xml

- <heart [pulse="700"] [timeout="30000"]>
- <heartbeat name="default" [ident="default"]/>
- <heartbeat name="private" ident="flow"/>
- <!-- As many <heartbeat> as desired network connections -->
- </heart>
- Heartbeats synchronize states and actions between 2 nodes through the network
- There is an automatic failover when all heartbeats are lost
- Delay, in ms, between 2 heartbeat sendings
- Name as defined in cluster.xml
- For a dedicated replication network, assign ident to "flow"
- Else replication network is assigned to the 1st heartbeat
- Timeout, in ms, for heartbeat loss detection
- Resources: heartbeat.default, heartbeat.flow
- Action:  failover on all heartbeats lost


### Speaker notes

> Heartbeats synchronize states and actions between two nodes through the network. There is an automatic failover when all heartbeats are lost.
> The heart tag allows the configuration of heartbeats in userconfig.xml. The pulse attribute defines the delay in milliseconds, between two heartbeat sendings. The timeout attribute defines the maximum time in milliseconds, before considering that heartbeats are lost and triggering a failover.
> In the heartbeat tag, the name attribute is the name of a LAN defined in the cluster.xml file. Here, two heartbeats are defined on the default and private LANs.
> The ident attribute is particularly interesting; if you assign it to the flow value, it designates the network that will support the replication flow. If no flow value is set, by default, the replication flow will be on the network of the first heartbeat.
> The ident attribute is also used for naming resources. In the example, heartbeat.default and heartbeat.flow will be the two resources giving the status of the two heartbeats.
> Finally, you can set as many heartbeat tags as you have network connections between both nodes.
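
To make the interplay of `pulse` and `timeout` concrete, here is a minimal Python sketch of the detection rule described above: a heartbeat counts as lost once nothing has been received on it for longer than the timeout, and failover is considered only when all heartbeats are lost. The function names and the timestamp bookkeeping are hypothetical; only the values 700 and 30000 and the resource names come from the slide.

```python
# Hypothetical sketch of heartbeat loss detection; not SafeKit's implementation.
PULSE_MS = 700       # delay between two heartbeat sendings (value shown on the slide)
TIMEOUT_MS = 30000   # heartbeat loss detection timeout (value shown on the slide)


def lost_heartbeats(last_received_ms, now_ms, timeout_ms=TIMEOUT_MS):
    """Return the idents of heartbeats silent for longer than timeout_ms."""
    return sorted(
        ident
        for ident, received in last_received_ms.items()
        if now_ms - received > timeout_ms
    )


def should_failover(last_received_ms, now_ms, timeout_ms=TIMEOUT_MS):
    """Failover is considered only when all configured heartbeats are lost."""
    if not last_received_ms:
        return False
    lost = lost_heartbeats(last_received_ms, now_ms, timeout_ms)
    return len(lost) == len(last_received_ms)


# Time of the last message received on each heartbeat, keyed by ident.
last_received = {"default": 0, "flow": 29000}
print(lost_heartbeats(last_received, now_ms=31000))  # only "default" is lost
print(should_failover(last_received, now_ms=31000))  # "flow" is still alive
print(should_failover(last_received, now_ms=60000))  # both are lost
```

With two heartbeats configured, losing only one network therefore changes a resource status but does not, by itself, trigger a failover.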


## Slide 9

- Virtual IP address


### Speaker notes

> Let’s now detail the virtual IP configuration.


## Slide 10: Virtual IP address in the same subnet

- VIP @ → mac1 @
- VIP @/mac1 @
- mac2 @
- VIP is automatically configured on primary (alias)
- ARP caches associate the VIP address with the mac address of the primary (mac1 or mac2)


### Speaker notes

> A virtual IP address, also named VIP in the figure on the right, is a third IP address coming in addition to the two physical IP addresses of the two nodes. In a mirror module, the virtual IP address is set as an alias on the network interface of the primary node, which runs the application.
> Both nodes must be in the same subnet so that the virtual IP address can be switched transparently from one node to the other at the MAC address level (layer 2 of the network stack). Clients, as presented in the example, are connected to the VIP. When the PRIM server is node 1, the VIP is associated with the Ethernet MAC address 1 of node 1. This mapping can be viewed in the ARP cache of clients. When node 1 fails, SafeKit automatically sets the VIP as an alias on the network interface of node 2, before restarting the application. Then, the ARP cache of clients is updated with the mapping of the VIP to the Ethernet MAC address 2 of node 2. Thus, clients are reconnected to the application on node 2.
> The virtual IP works with Ethernet interfaces in teaming, bonding, VLAN, and IPv6.


## Slide 11: <vip> in userconfig.xml

- With the default value for optional attributes [ ]
- <vip>
- <interface_list>
- <interface [check="on"]>
- <real_interface>
- <virtual_addr addr="172.24.199.100" where="one_side_alias" [check="on"]/>
- <!-- As many <virtual_addr> as there are virtual IP on this interface -->
- </real_interface>
- </interface>
- <!-- As many <interface> as there are interfaces with virtual address -->
- </interface_list>
- </vip>
- Checker that detects duplicate VIP address conflict or removal
- Resource: ip.172.24.199.100
- Action: stopstart on error
- Checker that detects interface failure
- Resource: intf.172.24.199.0
- Action: wait on failure
- Name or address of the virtual IP
- Prefer an IP address to be DNS independent
- IPv4 or IPv6 address


### Speaker notes

> Virtual IP addresses are configured in the `vip` tag of `userconfig.xml` as presented in the slide. The `check` attribute set to `on` in the `interface` tag means that a checker will detect interface failures. As shown in the text box of the slide, a resource is associated with this checker, and the action taken on interface failure is to put the module in the WAIT state.
> `virtual_addr` is the tag where the address or DNS name of the virtual IP is set. It is preferable to configure an IP address to be resilient to DNS failures. You can set an IPv4 or IPv6 address here for the virtual IP.
> `one_side_alias` means that the virtual IP address will be set as an alias of the physical IP address on the network interface. It also means that if the primary node initiates a connection to an external device, this connection will start with its physical IP address. In some cases, it may be useful to initiate connections from the primary node with the virtual IP. In this case, you can set `one_side` instead of `one_side_alias`.
> The `check` attribute set to `on` in the `virtual_addr` tag means that a checker will detect duplicate VIP address conflicts or removals. As shown in the text box of the slide, a resource is associated with this checker, and the action taken on error is to stop and then start the module.
> Finally, you can set several virtual IP addresses on the same network interface by repeating the `virtual_addr` tag. You can also repeat the `interface` tag to set virtual IP addresses on several network interfaces.


## Slide 12: Virtual IP address in different subnets

- SafeKit implements
- Healthcheck = URL /var/modules/AM/ready.txt
- OK if PRIM or ALONE
- NOT FOUND otherwise
- VIP @
- VIP @
- OK
- NOT FOUND
- Load balancer with healthcheck
- VIP is defined in the load balancer
- Sends healthcheck
- To the IP addresses of the 2 nodes
- Routes the traffic according to health checks


### Speaker notes

> We consider now two nodes that are in two different subnets. This is particularly the case when implementing a high availability solution in public cloud infrastructure like Azure, AWS, or GCP. SafeKit nodes are put in two different high availability zones, which are in two different subnets.
> In this case, the vip tag must be removed from userconfig.xml, as the rerouting at the MAC address level is no longer possible. Instead, the virtual IP must be defined at a load balancer level. The load balancer must be configured with the physical IP addresses of the two nodes, and with a health check to route the traffic to available nodes.
> SafeKit offers such a health check per module by answering a URL that points to the ready.txt file. If the module is PRIM or ALONE, the URL returns OK. If not, the URL returns NOT FOUND. Thus, the load balancer routes the traffic to the primary server, where the application is running in the mirror module. In the URL shown in the text box of the slide, replace ‘AM’ with the name of the module.
> Note that the application must support this configuration with clients connected on the VIP of the load balancer and with the application receiving connections from the load balancer on the physical IP address of the primary server.
> There is another solution, not explained in these slides, which consists of rerouting at the DNS level. But this solution does not work in most cases, because it requires clients to perform a new DNS resolution after a failover in order to be rerouted to the new server. Most often, they do not, and they continue their execution with the IP address resolved when they started.
> Moreover, the time to propagate a DNS change varies: it typically takes from a few hours up to 48 hours for the change to fully propagate across the Internet.
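
The health check rule described above is simple enough to sketch. The Python below models the decision only: it does not perform any network request, the function names and node addresses are hypothetical, and the "OK" / "NOT FOUND" strings are the answers named on the slide.

```python
# Hypothetical model of the per-module health check and of the routing
# decision a load balancer would derive from it. No network I/O is done here.

def healthcheck_answer(module_state):
    """Answer of the ready.txt health check for a given module state."""
    return "OK" if module_state in ("PRIM", "ALONE") else "NOT FOUND"


def routable_nodes(node_states):
    """Return the physical addresses that should receive the VIP traffic."""
    return [
        address
        for address, state in node_states.items()
        if healthcheck_answer(state) == "OK"
    ]


# Example addresses are placeholders for the physical IPs of the two nodes.
print(routable_nodes({"10.0.1.10": "PRIM", "10.0.2.10": "SECOND"}))
print(routable_nodes({"10.0.1.10": "STOP", "10.0.2.10": "ALONE"}))
```

Because only a PRIM or ALONE node answers OK, at most one node of a mirror module is expected to receive traffic at any time.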


## Slide 13

- Real-time file replication


### Speaker notes

> Let’s now detail the real-time file replication configuration.


## Slide 14: Prerequisites

- Replicated directories at the same location on both nodes
- It’s better to exclude from replication: binaries, temporary files, OS files
- Disable antivirus scanning of replicated directories when incompatible
- On Linux, same uid/gid for replicated files
- At the first start-up of a mirror module:
  - Start the node with the up-to-date replicated directories as primary with `safekit prim -m AM`
  - Start the other node as secondary with `safekit second -m AM`
  - Subsequent starts with `safekit start -m AM`


### Speaker notes

> Let’s begin with the prerequisites.
> Firstly, it is important to ensure that the replicated directories are located at the same location on both nodes.
> Secondly, it is better to exclude certain types of files from replication, such as binaries, temporary files, and OS files.
> Thirdly, if antivirus scanning is incompatible with the replicated directories, it should be disabled.
> Fourthly, on Linux, the same UID and GID should be used for the replicated files.
> Fifthly, at the first start-up of a mirror module, follow these steps:
> Start the node with the up-to-date replicated directories as primary using the `safekit prim` command.
> Start the other node as secondary using the `safekit second` command.
> For subsequent starts, use the `safekit start` command.
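
The first start-up order matters because it decides which node's data is treated as the reference. As a hedged illustration, the Python sketch below only builds the command lines listed on the slide for each node; it does not execute them, and the function names and host names are hypothetical.

```python
# Builds (but does not run) the safekit commands listed on the slide.
# Function names and host names are hypothetical.

def first_start_plan(uptodate_node, other_node, module):
    """Return the ordered (node, command) pairs for the very first start-up."""
    return [
        (uptodate_node, ["safekit", "prim", "-m", module]),   # holds the up-to-date data
        (other_node, ["safekit", "second", "-m", module]),    # will be resynchronized
    ]


def subsequent_start(module):
    """Command used on either node for all later starts."""
    return ["safekit", "start", "-m", module]


for node, command in first_start_plan("node1", "node2", "AM"):
    print(node, " ".join(command))
print(" ".join(subsequent_start("AM")))
```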


## Slide 15: <rfs> in userconfig.xml

- With the default value for optional attributes [ ]
- <rfs>
- <replicated dir="c:\test1replicated" [mode="read_only"]>
- <!-- Optional. As many <notreplicated> as there are directories, files to not replicate -->
- <notreplicated path="file1"/>
- <notreplicated path="subdir/bin"/>
- </replicated>
- <!-- As many <replicated> as there are directories to replicate -->
- </rfs>
- Absolute path of the directory to replicate
- Read-only access on the secondary to avoid corruption
- Relative path for not replicated files or directories inside a replicated directory
  Note: renaming between not replicated and replicated entries is not supported
- Resource: rfs.uptodate
- Action: wait remote when not uptodate


### Speaker notes

> Replication is configured in the rfs tag of userconfig.xml as presented in the slide. There is a resource associated with rfs to check if the local node has the up-to-date replicated data. If not, the action is to put the module in the WAIT state, waiting for the start of the other node to resynchronize its local data.
> The replicated tag is the main tag where you define the absolute path of a directory to replicate. The read-only mode means that the data cannot be written on the secondary to avoid corruption. Inside the replicated tag, you can define notreplicated tags, meaning that some files or subdirectories must not be replicated. In the path of the notreplicated tag, set the relative path of files or subdirectories inside the replicated directory.
> Note that renaming files or subdirectories between not replicated and replicated directories is not supported.
> Finally, you can duplicate the replicated tag as many times as there are directories to replicate.


## Slide 16: <rfs> in userconfig.xml

- Replicate all except, or replicate only
- <rfs>
- <replicated dir="c:\dir1">
- <notreplicated regexpath=".*\.tmp$" />
- <notreplicated regexpath=".*\.bak$" />
- </replicated>
- <replicated dir="c:\dir2">
- <notreplicated regexpath="!.*\.mdf$" />
- <notreplicated regexpath="!.*\.ldf$" />
- </replicated>
- </rfs>
- Regular expression
  In the c:\dir1 directory and sub-directories, replicate all except entries with the extension .tmp or .bak
- Regular expression prefixed with !
  In the c:\dir2 directory and sub-directories, replicate only entries with the extension .mdf or .ldf


### Speaker notes

> You can use regular expressions in the `notreplicated` tag, allowing you to specify that you want to replicate all entries in a replicated directory except certain ones, or that you want to replicate only specific entries in a replicated directory.
> In the first example, you can specify that in the dir1 replicated directory and its sub-directories, you do not want to replicate entries with the extension `.tmp` or `.bak`.
> In the second example, you can specify that in the dir2 replicated directory and its sub-directories, you want to replicate only the entries with a `.mdf` or `.ldf` extension.
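
The two examples can be checked with a small filter. The Python sketch below is an approximation under stated assumptions: it uses Python's `re` syntax rather than whatever regular expression engine SafeKit uses, it applies the patterns to file paths only, and the `is_replicated` helper is hypothetical. It treats a plain pattern as "exclude matching entries" and a pattern prefixed with `!` as "replicate only matching entries".

```python
import re

# Hypothetical helper approximating the regexpath rules of the slide.
# Assumption: Python's regex syntax is close enough for these simple patterns.

def is_replicated(entry, regexpaths):
    """Return True if a file entry would be replicated under the given rules."""
    excludes = [p for p in regexpaths if not p.startswith("!")]
    only = [p[1:] for p in regexpaths if p.startswith("!")]
    if any(re.search(p, entry) for p in excludes):
        return False  # matches a "replicate all except" pattern
    if only and not any(re.search(p, entry) for p in only):
        return False  # matches none of the "replicate only" patterns
    return True


dir1_rules = [r".*\.tmp$", r".*\.bak$"]      # replicate all except .tmp and .bak
dir2_rules = [r"!.*\.mdf$", r"!.*\.ldf$"]    # replicate only .mdf and .ldf

for name in ("data.db", "cache.tmp", "old.bak"):
    print("dir1", name, is_replicated(name, dir1_rules))
for name in ("base.mdf", "base.ldf", "notes.txt"):
    print("dir2", name, is_replicated(name, dir2_rules))
```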


## Slide 17

- How does real-time file replication work?


### Speaker notes

> Let’s now examine how the real-time file replication works.


## Slide 18: Interception of write access to replicated files

- Windows: mini-filter
  - RFS mini-filter intercepts write access
  - Forwards them to the local file system
  - Then to nfsbox
- Linux: mount on localhost
  - NFS client intercepts write access
  - Sends them to nfsbox
  - Then to the local file system


### Speaker notes

> The first thing to implement is the interception of write access inside replicated files. To do that, SafeKit uses a mini-filter in the Windows kernel and a localhost NFS mount on Linux.
> On Windows, the RFS mini-filter intercepts write access requests inside replicated files. Then, it forwards them to the local file system. Lastly, the mini-filter sends the data to the nfsbox process.
> On Linux, SafeKit begins by mounting the replicated directory on localhost through NFS. The NFS client then intercepts write access in replicated files. The write requests are subsequently sent to the nfsbox process. Finally, the data is transferred to the local file system for storage.
> In both cases, the nfsbox process transfers the data to the secondary node.


## Slide 19: Synchronous replication

- <rfs async="none"> in userconfig.xml
- nfsbox
- write
- ok = ok1 + ok2
- nfsbox forwards write request to the secondary over the network
- No data loss:
- waits for ok1 + ok2 before acknowledging to the application
- The last time the nodes were synchronized


### Speaker notes

> SafeKit implements synchronous replication with no data loss. This means that on a synchronous write I/O requested by the application, SafeKit waits for the acknowledgment from the secondary node before acknowledging to the application.
> Here is an explanation of the use case with `async` set to `none` in the `rfs` tag, meaning that the write I/O must be put on both disks before acknowledging to the application.
> In the figure, you see that the nfsbox process of the primary node sends the replicated data to the nfsbox process of the secondary node. Both "OK1" and "OK2" must be received before acknowledging to the application on the primary node.
> Each nfsbox process retains the last time the nodes were synchronized in terms of replication in the `synctimestamp` variable.


## Slide 20: Semi-synchronous replication

- <rfs async="second"> in userconfig.xml
- nfsbox
- write
- ok = ok1 + ok2
- ok2
- The only difference:
- modifications are cached on the secondary
- nfsbox delays write on disk


### Speaker notes

> Here is an explanation of the semi-synchronous replication use case, with `async` set to `second` in the `rfs` tag. The only difference is that the nfsbox process of the secondary node keeps the data in its cache, sends "OK2", and then writes to disk later. If only node 1 fails, there is still no data loss.
> Note that data loss remains possible in one special case: a simultaneous power outage of both nodes, with the inability to restart on the former primary node and the requirement to restart on the secondary node.
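
The difference between the two modes is where the data sits on the secondary at the moment the application receives its acknowledgment. The Python sketch below is a toy in-memory model of that difference, not SafeKit code: the `Secondary` class and `replicated_write` function are hypothetical, and lists stand in for disks and caches.

```python
# Toy in-memory model of async="none" versus async="second".
# All names are hypothetical; lists stand in for disks and caches.

class Secondary:
    def __init__(self):
        self.disk = []    # data written on the secondary disk
        self.cache = []   # data acknowledged but not yet written to disk

    def receive(self, data, async_mode):
        """Handle a replicated write and return ok2."""
        if async_mode == "none":
            self.disk.append(data)     # on disk before ok2 is sent
        elif async_mode == "second":
            self.cache.append(data)    # cached, ok2 sent right away
        else:
            raise ValueError(f"unsupported async mode: {async_mode}")
        return True

    def flush(self):
        """Delayed write of the cached data to disk (async="second")."""
        self.disk.extend(self.cache)
        self.cache.clear()


def replicated_write(data, primary_disk, secondary, async_mode):
    """Acknowledge to the application only when ok1 and ok2 are both true."""
    primary_disk.append(data)
    ok1 = True
    ok2 = secondary.receive(data, async_mode)
    return ok1 and ok2


primary_disk, secondary = [], Secondary()
replicated_write("block-1", primary_disk, secondary, "none")
replicated_write("block-2", primary_disk, secondary, "second")
print(secondary.disk, secondary.cache)   # block-2 is only cached so far
secondary.flush()
print(secondary.disk, secondary.cache)
```

In this model, losing the secondary's cache before `flush()` while the primary is also unrecoverable corresponds to the double power outage case mentioned above.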


## Slide 21: Bitmaps when ALONE

- nfsbox
- write
- ok = ok1
- Bitmaps:
- track changes within files when ALONE
- optimize next resynchronization of the SECOND
- Activated after a first successful synchronization
- Deactivated in case of crash of the SECOND


### Speaker notes

> When a node is alone without a secondary, the nfsbox process retains all modifications made by the application inside replicated files in bitmaps. The goal is to optimize file resynchronization on the secondary node when it is restarted.
> Note that the bitmaps are activated only after a first successful synchronization between nodes and are deactivated in the event of a crash of the secondary node.


## Slide 22: Resynchronization of the SECOND

- Without stopping the application
- appli
- reintegre
- 3 scenarios
  - Entire copy of all files
    - Applied at the first start of the secondary
    - With "safekit second fullsync"
  - Copy only the modified zones set in bitmaps
    - Requires a first successful synchronization
    - Requires a clean stop of the module
    - On Windows, requires activation of the USN log, to work after reboot
  - Entire copy of files modified since synctimestamp
    - Requires a first successful synchronization
    - Applied when the bitmaps are not safe
      - After a node crash
      - After an nfsbox exception
      - After modification of the replicated directories during the module stop


### Speaker notes

> Let’s examine the process of resynchronizing the secondary node, without stopping the application on the primary node.
> The resynchronization is implemented by a process named reintegre, which runs on the secondary node.
> There are three key scenarios to consider.
> First, there is the entire copy of all files. This is applied at the first start of the secondary node or when using the `safekit second fullsync` command.
> Second, there is the copy of only the modified zones set in bitmaps.
> This requires a first successful synchronization and a clean stop of the module.
> On Windows, it also requires the activation of the USN log to work after a reboot.
> Third, there is the entire copy of files modified since the `synctimestamp`.
> It also requires a first successful synchronization.
> This method is used when the bitmaps are not safe, such as after a node crash, an nfsbox exception, or modifications of the replicated directories during a module stop.
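
The three scenarios amount to a decision with two inputs: whether a first synchronization has ever succeeded, and whether the bitmaps can be trusted. The Python sketch below expresses that decision as described in the notes; the function name, the flag names, and the returned labels are hypothetical, and the real conditions (clean stop, USN log on Windows, and so on) are folded into a single `bitmaps_safe` flag.

```python
# Hypothetical decision helper summarizing the three scenarios above.
FULL_COPY = "entire copy of all files"
BITMAP_COPY = "copy of the modified zones set in bitmaps"
TIMESTAMP_COPY = "entire copy of files modified since synctimestamp"


def choose_resync(first_sync_done, bitmaps_safe, forced_fullsync=False):
    """Pick the resynchronization scenario for a secondary node."""
    if forced_fullsync or not first_sync_done:
        return FULL_COPY           # first start, or full synchronization requested
    if bitmaps_safe:
        return BITMAP_COPY         # clean stop: only modified zones are copied
    return TIMESTAMP_COPY          # crash, nfsbox exception, offline changes


print(choose_resync(first_sync_done=False, bitmaps_safe=False))
print(choose_resync(first_sync_done=True, bitmaps_safe=True))
print(choose_resync(first_sync_done=True, bitmaps_safe=False))
```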


## Slide 23

- start_prim / stop_prim scripts


### Speaker notes

> Let's now explain the start_prim and stop_prim scripts.


## Slide 24: start_prim/stop_prim scripts

_No extractable slide text found._


### Speaker notes

> Let's begin with high availability at the application level. The start_prim and stop_prim scripts are used to start and stop an application. In this solution, you need to install the application with the same settings on both nodes. Additionally, you need to configure the clients to connect to the virtual IP address, which will allow seamless failover in case of a node failure.
> Next, let's discuss high availability at the virtual machine level. The start_prim and stop_prim scripts are used to start and stop a virtual machine. In this solution, place the critical application inside a VM and configure the clients to connect to the physical IP address of the VM. There is no need for a virtual IP address in this case. The physical IP address of the VM will be rerouted when the VM is restarted on the secondary node.
> Now, let's talk about automatic boot. Remove the automatic start at boot of the application or the VM. This start will be managed by the automatic start of the module at boot. To do this, configure the SafeKit module to start at boot by adding `boot="on"` in the service tag of the userconfig.xml file or by using the `safekit boot` command.


## Slide 25: Generic scripts

- Available in mirror.safe


### Speaker notes

> Generic scripts are included in the mirror.safe module, as well as other modules, to eliminate the need for custom script creation. This approach significantly simplifies the integration of new applications. You only need to define a list of services in the macro named SERVICES in userconfig.xml. This list is then passed as an environment variable to the start_prim and stop_prim scripts.
> The start_prim script starts all services in the order specified in the list, while the stop_prim script stops all services in the reverse order. Additionally, start_prim checks the startup of each service and stops the module if any service fails to start correctly. During module configuration, the boot startup of services will automatically be set to ‘Manual’. This ensures that services do not start automatically upon system boot; instead, they are started only when the module itself is started.
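
The ordering rule of the generic scripts can be sketched in a few lines. The Python below is an illustration of the described behavior only, not the content of the shipped scripts: the function names are hypothetical, the service list is passed as a plain Python list rather than through the SERVICES macro, and the callbacks stand in for whatever actually starts and stops a service.

```python
# Hypothetical illustration of "start in order, stop in reverse order".

def start_prim(services, start_service):
    """Start services in list order; report failure as soon as one does not start."""
    started = []
    for name in services:
        if not start_service(name):
            return started, False   # the real script would then stop the module
        started.append(name)
    return started, True


def stop_prim(services, stop_service):
    """Stop services in the reverse of the configured order."""
    for name in reversed(services):
        stop_service(name)


log = []
services = ["database", "webserver"]   # example service names, not from the slides
start_prim(services, lambda name: log.append(("start", name)) or True)
stop_prim(services, lambda name: log.append(("stop", name)))
print(log)
```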


## Slide 26: Example on Windows

- start_prim.cmd
- Messages in logs
  - @echo off
  - echo "Running start_prim %*"
  - net start "myservice" /Y
  - if NOT %errorlevel% == 0 goto stop
  - goto end
  - :stop
  - "%SAFE%\safekit" printe "start_prim failed"
  - "%SAFE%\safekit" stop -i "start_prim"
  - :end
- Script log
- "Running start_prim WAIT ALONE"
- The myservice service failed to start
- Module log
- 10-20 18:28:12 … start_prim failed
- 10-20 18:28:12 … Action stop called by start_prim


### Speaker notes

> Let's now explain how to log messages from a restart script, either in the script log or in the module log.
> As shown in points 1 and 2 on the slide, all output messages of start_prim go into the script log. Thus, for debugging purposes, you can write specific messages in the script log just with the echo command. More generally, you will find the outputs of service startups and stops in the script log, along with any potential error messages that can help with debugging.
> As shown in point 3 on the slide, by using the `safekit printe` command, you can log a message in the module log.
> As shown in point 4, when executing a command such as stop with the `-i "start_prim"` option, you will have a stop message in the module log, indicating that the stop action was initiated by start_prim.


## Slide 27

- Mirror state transitions


### Speaker notes

> Let’s now examine the different mirror state transitions.


## Slide 28: Mirror module state

- SECOND


### Speaker notes

> Here are the main states of a mirror module.
> When the state is ALONE and the color is green on a node, it means a primary server without a secondary. Failover is not possible in this state.
> When the state is PRIM and the color is green, it indicates a primary server with a secondary. Failover is possible in this state.
> When the state is SECOND and the color is green, it signifies a secondary server with a primary. Failover is possible in this state.
> The STOP state means that the module is not running on the node.
> If the state is SECOND and the color is orange, it means that the node is currently resynchronizing replicated folders. Failover is not possible in this state.
> When the state is WAIT, the color is red, and the message is "not uptodate," it means that the node is waiting for the start of the other node, as it does not have up-to-date data. If the state is WAIT, the color is red, and the message is a failover rule name, it means that the node is waiting for a mandatory resource controlled by a checker before starting.
> Finally, if the state is ERROR, it means that the console cannot connect to the node, either because the node has crashed or because there is a communication issue between the console and the node, such as a firewall issue or the web service not running on the node side.


## Slide 29: Start the uptodate node1

- node1 goes from STOP to ALONE
- start_prim
- prim
- start
- Application running on node1 and virtual IP set
- prestart
- wakeup
- stop
- wait
- stop_prim
- poststop
- wait
- stop_prim
- node1 (uptodate)


### Speaker notes

> Let's consider the start of the up-to-date node 1, which goes from the STOP to ALONE state.
> In the figure on the left, when the prim or start command is executed on node 1 in the up-to-date state, the prestart script is executed first. Normally, the application integrated in the module should not be running on the node, but if it is, the prestart script performs a preventive stop of the application before installing the replication mechanisms and the virtual IP address.
> Then, the start_prim script is executed to start the application, and the state is ALONE orange.
> After the execution of the start_prim script, the module is in the stable ALONE green state, meaning that the virtual IP is set, and the application is running.
> The module can be put in the WAIT red state if a resource is set to down by a checker.
> Now let's consider the stop of the module by an administrator or a checker. In this case, it transitions from the ALONE state to the STOP state after executing the stop_prim and poststop scripts.


## Slide 30: Start the not uptodate node2

- node2 goes from STOP to SECOND; node1 goes from ALONE to PRIM
- prestart
- wakeup
- stop
- wait
- poststop
- wait
- node2 (not uptodate)
- second
- start
- node1 (uptodate)
- Synchronisation of files from primary to secondary
- Uptodate and mirrored files
- Application running on node1 and virtual IP set


### Speaker notes

> In this slide, we continue the presentation with the start of node 2, which will become the secondary node.
> In the figure on the left, when the second or start command is executed on node 2, which is in the not up-to-date state, the prestart script is executed first. Normally, the application integrated in the module should not be running on the node, but if it is, the prestart script performs a preventive stop of the application before installing the replication mechanisms. Then, node 2 resynchronizes replicated files from the primary node and stays in the SECOND orange state during this process, which can take a while, depending on the size of data to resynchronize.
> After the resynchronization, the module goes into the SECOND green stable state, ready to restart the application in the event of a failure of node 1.
> As shown in the figure in the middle, on node 1, the module transitions from the ALONE green state to the PRIM green state once the module is SECOND green on the other node.


## Slide 31: Stop or failure of the PRIM

- Automatic or manual failover on the SECOND according to <service failover="on" | "off" > in userconfig.xml
- start_prim
- Application running on node2 and virtual IP set
- node2
- failover="on"
- node2
- failover="off"
- Manual failover with
- safekit stop -m AM
- safekit prim -m AM
- node1 (PRIM)
- stops or fails


### Speaker notes

> Let’s now consider a stop or a failure of the primary node and the resulting state transitions on the secondary node.
> In the left figure, we consider the standard use case with the failover attribute set to on.
> In the event of node 1 failure or stop, node 2 transitions from the SECOND state to the ALONE state.
> During the transition, the start prim script is executed in the ALONE orange state, and when it ends and the application is started, the state changes to ALONE green.
> In the middle figure, we consider the special case with the failover attribute set to off, meaning that the administrator does not want an automatic failover on the secondary. In this case, when node 1 experiences a failure or a stop, node 2 transitions from the SECOND green state to the WAIT red state, waiting for node 1 to restart. If administrators want to force the start as primary on node 2, they can stop the module on node 2 and use the safekit prim command to force the start. Thus, with the failover set to off, the administrator manually controls when the secondary node becomes primary.
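
The effect of the failover attribute on the secondary node can be summarized in a short decision function. This is a sketch of the behavior described above, not SafeKit code: `on_primary_lost` and `manual_failover_commands` are hypothetical helpers, and the two commands are the ones listed on the slide for a module named AM.

```python
def on_primary_lost(failover):
    """Events on the SECOND green node when the primary stops or fails."""
    if failover == "on":
        # Automatic failover: start_prim runs in ALONE orange, then ALONE green.
        return ["state ALONE orange", "run start_prim", "state ALONE green"]
    if failover == "off":
        # No automatic failover: wait for the other node to come back.
        return ["state WAIT red"]
    raise ValueError(f"failover must be 'on' or 'off', got {failover!r}")


def manual_failover_commands(module="AM"):
    """Commands an administrator runs on the waiting node to force it primary."""
    return [f"safekit stop -m {module}", f"safekit prim -m {module}"]


if __name__ == "__main__":
    print(on_primary_lost("on"))
    print(on_primary_lost("off"))
    print(manual_failover_commands())
```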


## Slide 32: Stop or failure of the SECOND

- node1 goes from PRIM to ALONE
- Application still running on node1 and virtual IP set
- node1
- node2 (SECOND)
- stops or fails


### Speaker notes

> When there is a stop or failure of the secondary node, the primary node transitions from the PRIM green state to the ALONE green state. The application and the virtual IP are not impacted by this transition; only the replication to the secondary node is stopped.


## Slide 33: Network isolation between PRIM and SECOND

- On split-brain, by default PRIM and SECOND become ALONE
- heartbeats KO
- node1
- node2
- heartbeats OK
- On network isolation, by default:
- The application runs on both nodes and the virtual IP is set on both nodes
- Data are locally modified on each node
- To avoid the double ALONE case, configure
- a splitbrain checker; see "Checkers" slides
- or heartbeats on several networks
- Exit from ALONE – ALONE state:
- One node becomes the primary
- The other node resynchronizes its data and becomes SECOND
- Final status can be PRIM/SECOND or SECOND/PRIM according to the duplicate IP address checker


### Speaker notes

> Let’s now explain the implications of network isolation between the primary and secondary nodes.
> When both nodes are isolated, each node transitions to the ALONE state as all heartbeats are lost. In this situation, the application runs on both nodes, the virtual IP is set on both nodes, and the data is modified locally on each node.
> Once the isolation is repaired, one node becomes the primary and the other node resynchronizes its data and becomes the secondary. The final status can be either PRIM SECOND or SECOND PRIM, according to the duplicate IP address checker.
> Network isolation occurs when all heartbeats between node 1 and node 2 are lost. As long as there is a live heartbeat on a network, the split-brain situation cannot occur. Therefore, implementing an unbreakable private network, such as a direct Ethernet link between both nodes, can avoid this situation.
> Another way to avoid the ALONE ALONE state is to configure the split-brain checker of SafeKit. The split-brain checker is explained in the Checkers slides.
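
The condition for network isolation stated above, that every heartbeat must be lost, can be expressed as a small predicate. This is an illustrative sketch, not SafeKit code: the function names and the network names `default` and `private` are assumptions, and the split-brain checker is not modeled.

```python
def is_isolated(heartbeats):
    """heartbeats maps a network name to True if its heartbeat is still received.

    Isolation means no heartbeat is alive on any network. An empty mapping
    is treated as isolated in this sketch.
    """
    return not any(heartbeats.values())


def states_after_heartbeat_check(heartbeats):
    """Default (node1, node2) states, starting from PRIM/SECOND."""
    if is_isolated(heartbeats):
        return ("ALONE", "ALONE")  # application and virtual IP on both nodes
    return ("PRIM", "SECOND")


if __name__ == "__main__":
    # One heartbeat network: losing it isolates the nodes.
    print(states_after_heartbeat_check({"default": False}))
    # Two heartbeat networks: one surviving link prevents the double ALONE case.
    print(states_after_heartbeat_check({"default": False, "private": True}))
```

With a second heartbeat network, both links must fail at the same time before the nodes become ALONE on both sides, which is why a direct link between the nodes reduces the risk.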


## Slide 34: Restart on PRIM/ALONE

- Application is stopped then started on the same node with scripts
- restart
- stop_prim
- start_prim
- Application is restarted locally on the node
- restart
- node1
- node1
- PRIM case
- ALONE case
- stop_prim
- start_prim


### Speaker notes

> Let’s now explain the restart action initiated by an administrator or a checker.
> The restart action consists of executing the stop prim and start prim scripts on a node to restart the application locally, without triggering a failover.
> In the left figure, in the PRIM case, the state is PRIM orange during the execution of the stop prim and start prim scripts.
> In the middle figure, in the ALONE case, the state is ALONE orange during the execution of the stop prim and start prim scripts.
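
The restart action can be sketched as a function that keeps the node in its current role and only changes the color while the scripts run. The name `restart` is hypothetical, and the return to green at the end is an assumption based on the stable states described in the other slides.

```python
def restart(state):
    """Events for a local restart on a node whose module is PRIM or ALONE."""
    if state not in ("PRIM", "ALONE"):
        raise ValueError("restart applies to the PRIM and ALONE cases only")
    return [
        f"state {state} orange",  # scripts are executing
        "run stop_prim",          # application stopped locally
        "run start_prim",         # application started again on the same node
        f"state {state} green",   # assumed: back to the stable state
    ]


if __name__ == "__main__":
    print(restart("PRIM"))
    print(restart("ALONE"))
```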


## Slide 35: Swap or stopstart on PRIM

- Reverse roles of PRIM and SECOND between node1 and node2
- 2. start_prim
- swap
- stopstart
- 1. Application is stopped on node1 (virtual IP unset)
- 1. stop_prim
- node1
- node2
- 2. Application is started on node2 (virtual IP set)


### Speaker notes

> Let’s now talk about the swap action.
> This action reverses the roles of the primary and secondary nodes. The swap action is equivalent to the stopstart action executed on the primary server. The action can be initiated by an administrator or a checker.
> As shown in the figure, the application is initially running on node 1, the PRIM server. When a swap or stopstart action is initiated on node 1, the stop prim script is executed on node 1 first. During its execution, node 1 goes from PRIM green to PRIM orange, the application is stopped, and the virtual IP is unset. The virtual IP is then set on node 2 and the start prim script is executed there; node 2 goes from SECOND green to ALONE green, meaning that the application has been restarted on node 2.
> After the module has stopped on node 1, it automatically restarts and transitions to the SECOND green state, after resynchronization of data.
> Finally, the roles of primary and secondary have been reversed between node 1 and node 2.
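
The order of operations in a swap can be sketched as below. This is a simplified model, not SafeKit code: the `swap` function and the dictionary layout are hypothetical, and the step labels only name the scripts and actions mentioned in the notes.

```python
def swap(cluster):
    """Return (steps, new_roles) for a swap.

    cluster maps a node name to its role, for example
    {"node1": "PRIM", "node2": "SECOND"}.
    """
    primary = next(node for node, role in cluster.items() if role == "PRIM")
    secondary = next(node for node, role in cluster.items() if role == "SECOND")
    steps = [
        (primary, "stop_prim"),      # application stopped, virtual IP unset
        (secondary, "start_prim"),   # virtual IP set, application started
        (primary, "resynchronize"),  # module restarts and resyncs as secondary
    ]
    return steps, {primary: "SECOND", secondary: "PRIM"}


if __name__ == "__main__":
    steps, roles = swap({"node1": "PRIM", "node2": "SECOND"})
    for node, action in steps:
        print(node, action)
    print(roles)
```

Calling `swap` again on the returned roles brings the cluster back to its initial layout.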


## Slide 36: Thank you!

- Contact us here


### Speaker notes

> Thank you for your attention. If you have any questions or need further clarification, please feel free to ask.
