---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/safekituserguidehtml/documentation/safekituserguideen.htm
---

# 5. Mirror module administration

- Section 5.1 “Operating mode of a mirror module”
- Section 5.2 “State automaton of a mirror module (STOP, WAIT, ALONE, PRIM, SECOND - NotReady, Transient, Ready)”
- Section 5.3 “First start-up of a mirror module (safekit prim command)”
- Section 5.4 “Different reintegration cases (use of bitmaps)”
- Section 5.5 “Start-up of a mirror module with the up-to-date data STOP (NotReady) - WAIT (NotReady)”
- Section 5.6 “Degraded replication mode (ALONE (Ready) degraded)”
- Section 5.7 “Automatic or manual failover”
- Section 5.8 “Default primary server (automatic swap after reintegration)”
- Section 5.9 “Prim command fails: why? (safekit primforce command)”

 

To test a mirror module, see section 4.2.

To analyze a problem, see section 7.

## 5.1 Operating mode of a mirror module

| Phase | State | Description |
| --- | --- | --- |
| **1. Normal operation** | Stable state: primary with secondary | On the primary: the virtual IP is set, the application is running, and files are replicated in real time. The secondary is ready to run a failover and become primary. |
| **2. Automatic failover** | Stable state: primary without secondary | When the primary stops, the virtual IP and the application fail over automatically. |
| **3. Failback and reintegration** | Transient state: secondary reintegrating | Files are synchronized automatically without shutting down the application; only the files that were modified on the primary while the other node was stopped are updated. |
| **4. Back to normal operation** | Stable state: primary with secondary | |

## 5.2 State automaton of a mirror module (STOP, WAIT, ALONE, PRIM, SECOND - NotReady, Transient, Ready)

![State automaton of a mirror module (STOP, WAIT, ALONE, PRIM, SECOND - NotReady, Transient, Ready)](safekituserguideen_fichiers/image219.png)

## 5.3 First start-up of a mirror module (safekit prim command)

At the first start-up of a mirror module, if both servers are started with the start command, both go into the WAIT (NotReady) state with these messages in the log:

"Data may be not uptodate for replicated directories (wait for the start of the remote server)"

"If you are sure that this server has valid data, run safekit prim to force start as primary"

At the first start-up, use the special prim command on the server with the up-to-date directory and the second command on the other one. Data is then synchronized from the primary server to the secondary one. For subsequent start-ups, use the start command on both servers.

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · the mirror module has just been configured with a new directory to replicate between node1 and node2 · node1 has the up-to-date directory · node2 has an empty directory | STOP (NotReady) | STOP (NotReady) |
| **2. Command prim on node1** · use the special prim command to force node1 to become primary · for subsequent start-ups, always prefer start: see section 5.5 · message in the log: "Action prim called by admin@<IP>/SYSTEM/root" | ALONE (Ready) | STOP (NotReady) |
| **3. Command second on node2** · start the other server as secondary · the secondary reintegrates the replicated directory from the primary · message in the log: "Action second called by admin@<IP>/SYSTEM/root" | PRIM (Ready) | SECOND (Ready) |
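
The first start-up sequence can be sketched as a small plan builder. This is only an illustration: the helper name `first_startup_commands` is hypothetical, the commands are built but not executed, and the `-m AM` form of the prim and second commands is assumed by analogy with the other commands shown in this guide.

```python
# Sketch only: builds the command lines for each node, it does not run them.
def first_startup_commands(module, up_to_date_node, other_node):
    """Return (node, argv) pairs for the very first start-up of a mirror module."""
    return [
        # Force the node holding the up-to-date directory to become primary.
        (up_to_date_node, ["safekit", "prim", "-m", module]),
        # Start the other node as secondary; it reintegrates from the primary.
        (other_node, ["safekit", "second", "-m", module]),
    ]


plan = first_startup_commands("AM", "node1", "node2")
for node, argv in plan:
    print(f"{node}: {' '.join(argv)}")
```

The order matters: the primary is forced first so that the secondary has a source to reintegrate from.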

## 5.4 Different reintegration cases (use of bitmaps)

To optimize file reintegration, different cases are considered:

1. The module must have completed a reintegration (on its first start, the module runs a full reintegration) before modification tracking in bitmaps is enabled.
2. If the module was cleanly stopped on the server, then when the secondary restarts, only the modified zones of modified files are reintegrated, according to a set of modification tracking bitmaps.
3. If the server crashed (power off) or was incorrectly stopped (exception in the nfsbox replication process), or if files were modified while SafeKit was stopped, the modification bitmaps are not reliable and are therefore discarded. All files with a modification timestamp more recent than the last known synchronization point minus a grace delay (typically one hour) are reintegrated.
4. A call to the special safekit second|prim fullsync command triggers a full reintegration of all replicated directories on the secondary when it is started.

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Secondary node2 has been stopped** · data is desynchronized | ALONE (Ready) | STOP (NotReady) |
| **2. Command start on node2** · data is reintegrated with bitmap optimization (see above) | ALONE (Ready) | SECOND (Transient) |
| **3. End of reintegration** · data is the same on both servers · from this point, only modifications inside files are replicated, with real-time synchronous replication | PRIM (Ready) | SECOND (Ready) |
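
The four reintegration cases can be summarized as a decision function. This is a minimal sketch of the rules as stated in this section, not SafeKit's implementation: the function names, the boolean inputs and the mode labels ("full", "bitmap", "timestamp") are all hypothetical.

```python
from datetime import datetime, timedelta


def reintegration_mode(first_sync_done, clean_stop, modified_while_stopped,
                       fullsync_requested):
    """Pick the reintegration mode described by cases 1 to 4 (labels are illustrative)."""
    # Cases 1 and 4: no completed reintegration yet, or fullsync explicitly requested.
    if fullsync_requested or not first_sync_done:
        return "full"
    # Case 2: clean stop, so the modification tracking bitmaps can be trusted.
    if clean_stop and not modified_while_stopped:
        return "bitmap"
    # Case 3: bitmaps discarded, fall back to modification timestamps.
    return "timestamp"


def timestamp_cutoff(last_sync, grace=timedelta(hours=1)):
    """Files modified after this instant are reintegrated in the timestamp case."""
    return last_sync - grace


last_sync = datetime(2021, 8, 6, 8, 40, 20)
print(reintegration_mode(True, True, False, False))
print(reintegration_mode(True, False, False, False), timestamp_cutoff(last_sync))
```

The one-hour default for `grace` reflects the "typically one hour" stated above; the actual value used by a given installation is not specified here.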

The replication system also keeps track of the last date on which data was synchronized on each node. This synchronization date, named synctimestamp, is assigned at the end of the reintegration and is updated in the PRIM (Ready) and SECOND (Ready) states. When the module is stopped on the secondary node and then restarted, the synctimestamp is one of the reintegration criteria: all files modified around this date are potentially out of date on the secondary and must be reintegrated. Since SafeKit 7.4.0.50, the synchronization date is also used as an additional safety check. When the synchronization dates stored on the primary and on the secondary differ by more than 90 seconds, the replicated data is considered entirely unsynchronized. The reintegration is interrupted with the following message in the module log:

“| 2021-08-06 08:40:20.909224 | reintegre | E | Automatic synchronization cannot be applied due to an abnormal delta between the dates of the last synchronization”

The administrator can force a start as secondary with a full synchronization of the data by running the command: safekit second fullsync -m *AM* (replace *AM* by the module name).
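
The 90-second rule can be illustrated with a short check. This is a sketch under assumptions: the function name and the use of Python datetime values are hypothetical, and how SafeKit stores and compares the two synctimestamp values internally is not described here.

```python
from datetime import datetime, timedelta

# Threshold stated in this section for the synctimestamp comparison.
MAX_SYNC_DELTA = timedelta(seconds=90)


def automatic_sync_allowed(primary_sync, secondary_sync):
    """Return False when the two synchronization dates differ by more than 90 seconds."""
    return abs(primary_sync - secondary_sync) <= MAX_SYNC_DELTA


primary = datetime(2021, 8, 6, 8, 40, 20)
print(automatic_sync_allowed(primary, primary - timedelta(seconds=60)))
print(automatic_sync_allowed(primary, primary - timedelta(seconds=91)))
```

When the check fails, the situation corresponds to the log message above, and the administrator decides whether to run a full synchronization.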

## 5.5 Start-up of a mirror module with the up-to-date data STOP (NotReady) - WAIT (NotReady)

SafeKit determines which server must start as primary. It retains the information about which server holds the up-to-date replicated directories. To take advantage of this feature, use the start command and NOT the prim command.

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · node1 is primary ALONE · directories are up to date on this server · the module is stopped on node2 · node2 has desynchronized replicated directories | ALONE (Ready) | STOP (NotReady) |
| **2. Command stop on node1** · stop of the server with the up-to-date directories | STOP (NotReady) | STOP (NotReady) |
| **3. Command start on node2** · the module goes to the WAIT state, waiting for the start of the other server, with these messages in its log: "Data may be not uptodate for replicated directories (wait for the start of the remote server)" "Action wait from failover rule notuptodate\_server" "If you are sure that this server has valid data, run safekit prim to force start as primary" · in this case, you must start node1 to resynchronize the data of node2 · if you really want to sacrifice the up-to-date data and start node2 as primary with out-of-date data, issue a stop command then a prim command on node2 | STOP (NotReady) | WAIT (NotReady) rfs.uptodate="down" |

See also section 5.9.

## 5.6 Degraded replication mode (ALONE (Ready) degraded)

If the replication process nfsbox fails on the primary server (for instance because of an unrecoverable replication problem), the application is not swapped to the secondary server. The primary server goes to the ALONE state in a degraded replication mode. Degraded is displayed in the web console and a message is emitted in the log:

"Resource rfs.degraded set to up by nfsadmin"

safekit state -v -m *AM* returns resource rfs.degraded up (replace *AM* by the module name).

The primary server continues in the ALONE state with an nfsbox process that no longer replicates. You must stop and start the ALONE server to return to a PRIM - SECOND state with replication.

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · the mirror is in a stable state | PRIM (Ready) | SECOND (Ready) |
| **2. Failure of the replication process nfsbox on node1** · node1 becomes ALONE (Ready) degraded with this message in its log: "Resource rfs.degraded set to up by nfsadmin" · safekit state -v -m *AM* returns resource rfs.degraded=up (where *AM* is the module name) · node1 ALONE continues to execute the application without replication · node2 is in WAIT (NotReady), waiting for the replication process, with this message in its log: "Action wait from failover rule degraded\_server" and with rfs.uptodate="down" | ALONE (Ready) rfs.degraded="up" | WAIT (NotReady) rfs.uptodate="down" |
| **3. Return to replication** · the administrator runs the stop command then the start command on node1 ALONE · the nfsbox replication process is restarted on node1 · node2 reintegrates the replicated directories before becoming SECOND (Ready) · node1 becomes PRIM (Ready) | PRIM (Ready) | SECOND (Ready) |
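
The recovery rule for the degraded mode can be sketched as follows. The helper name and the dictionary of resources are hypothetical: the sketch assumes the state and resource values have already been read (for example from the web console or from safekit state -v -m *AM*) and does not parse any real command output. It also assumes that the stop and start commands take the same `-m AM` argument as the other commands in this guide.

```python
# Sketch only: decides which commands to run on the ALONE node, without running them.
def degraded_recovery_commands(state, resources, module):
    """Return the stop/start sequence when the node is ALONE in degraded replication mode."""
    degraded = state == "ALONE" and resources.get("rfs.degraded") == "up"
    if not degraded:
        return []
    # Stop then start the ALONE server to restart the nfsbox replication process.
    return [
        ["safekit", "stop", "-m", module],
        ["safekit", "start", "-m", module],
    ]


commands = degraded_recovery_commands("ALONE", {"rfs.degraded": "up"}, "AM")
for argv in commands:
    print(" ".join(argv))
```

Note that stopping the ALONE server also stops the application on it, so the restart is best scheduled at a convenient time.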

## 5.7 Automatic or manual failover

Automatic or manual failover to the secondary server is defined in userconfig.xml by `<service mode="mirror" failover="on"|"off">`. By default, if the attribute is not defined, failover="on".

The failover="off" mode is useful when failover must be controlled by an administrator. This mode ensures that the application always runs on the same primary server, whatever operations are performed on that server (reboot, temporary stop of the module for maintenance...). Only an explicit administrative action (prim command) can promote the other server to primary.

Note: The failover mode can also be set dynamically with the command safekit failover on|off -m *AM* (replace *AM* by the module name).

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · the mirror is in a stable state | PRIM (Ready) | SECOND (Ready) |
| **2. Restart with failover="on"** · if node1, the former PRIM, fails and stops, node2 automatically becomes ALONE (Ready) (default mode) | STOP (NotReady) | ALONE (Ready) |
| **3. Behavior with failover="off"** · if node1, the former PRIM, fails and stops, node2 goes to the WAIT (NotReady) state with these messages in its log: "Failover-off configured" "Action stopstart called by failover-off" "Transition STOPSTART from failover-off" "Local state WAIT NotReady" · in this situation, the administrator can restart node1: the mirror returns to its former stable state, node1 PRIM (Ready) and node2 SECOND (Ready) · the administrator can instead decide to force node2 to become primary with the commands stop then prim on node2 | STOP (NotReady) | WAIT (NotReady) |

See also section 5.9.
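
The effect of the failover attribute can be sketched by reading it from a `<service>` element and mapping it to the state reached by the secondary when the primary stops. This is an illustration only: the function names are hypothetical, and the XML strings are minimal fragments, not a complete userconfig.xml.

```python
import xml.etree.ElementTree as ET


def failover_mode(service_xml):
    """Read the failover attribute of a <service> element; it defaults to "on"."""
    service = ET.fromstring(service_xml)
    return service.get("failover", "on")


def secondary_state_after_primary_stop(mode):
    """State reached by the former secondary when the primary fails and stops."""
    return "ALONE (Ready)" if mode == "on" else "WAIT (NotReady)"


for fragment in ('<service mode="mirror"/>',
                 '<service mode="mirror" failover="off"/>'):
    mode = failover_mode(fragment)
    print(mode, "->", secondary_state_after_primary_stop(mode))
```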

 

## 5.8 Default primary server (automatic swap after reintegration)

After reintegration at failback, a server becomes secondary by default. The administrator may choose to swap the application back to the reintegrated server at an appropriate time with the swap command. This is the default behavior when `<service>` in userconfig.xml is defined without the defaultprim attribute.

If the application must automatically swap back to a preferred server after reintegration, specify a defaultprim server in userconfig.xml: `<service mode="mirror" defaultprim="hostname node1">`

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · node1 (former PRIM) fails and stops · node2, the secondary, automatically becomes ALONE | STOP (NotReady) | ALONE (Ready) |
| **2. Failback without defaultprim** · node1 is restarted with the start command · it reintegrates the replicated directories and then becomes secondary · an administrator can swap the primary to node1 with the swap command at an appropriate time · swap stops the application on node2 and restarts it on node1 | SECOND (Ready) | PRIM (Ready) |
| **3. Failback with defaultprim="hostname node1"** · node1, in STOP (NotReady) at step 1 (initial state), is restarted with the start command · it reintegrates the replicated directories · just after reintegration, an automatic swap to node1 occurs, with these messages in its log: "Transition SWAP from defaultprim" "Begin of Swap" · the application is then automatically stopped on node2 and restarted on node1 · at the end, node1 is PRIM | PRIM (Ready) | SECOND (Ready) |
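
The difference between the two failback cases can be sketched as a function of the defaultprim attribute. This is a hypothetical model, not SafeKit code: the function name is invented, and the attribute value "node1" stands in for the host name of node1, which is written "hostname node1" in the text above.

```python
import xml.etree.ElementTree as ET


def roles_after_failback(service_xml, reintegrated_node, current_primary):
    """Return the final role of each node once the reintegration is finished."""
    default_prim = ET.fromstring(service_xml).get("defaultprim")
    if default_prim == reintegrated_node:
        # Automatic swap: the application moves back to the preferred server.
        return {reintegrated_node: "PRIM", current_primary: "SECOND"}
    # No defaultprim for this node: it stays secondary until a manual swap.
    return {reintegrated_node: "SECOND", current_primary: "PRIM"}


print(roles_after_failback('<service mode="mirror"/>', "node1", "node2"))
print(roles_after_failback('<service mode="mirror" defaultprim="node1"/>',
                           "node1", "node2"))
```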

 

## 5.9 Prim command fails: why? (safekit primforce command)

A prim command may fail to start a server as primary: after attempting to start, the server goes back to STOP (NotReady).

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · node1 ALONE has the up-to-date directory · node2 is in the process of reintegrating files from node1 | ALONE (Ready) | SECOND (Transient), partially synchronized |
| **2. Command stop on node2 then on node1** · node2 is stopped during its reintegration: the stop can occur while a file is only half copied (corrupted file) · node1 is also stopped | STOP (NotReady) | STOP (NotReady), partially synchronized |
| **3. Command prim on node2** · fails with these messages in the log: "Data may be inconsistent for replicated directories (stopped during reintegration)" "If you are sure that this server has valid data, run safekit primforce to force start as primary" · in this case, you must start node1 with the start or prim command, then restart node2 with the start command to finish the reintegration of files. Until node2 reaches the SECOND (Ready) state, its data may be corrupted · if you absolutely want to start node2 as primary although it is only partially reintegrated and its data is potentially corrupted, use the command safekit primforce -m *AM* on node2 (command line only, where *AM* is the module name). Message in the log: "Action primforce called by SYSTEM/root" | STOP (NotReady) | STOP (NotReady), partially synchronized. The prim command fails since the data may be corrupted |

Note: The safekit primforce -m *AM* command forces a full reintegration of replicated directories on the secondary when it is restarted.
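
The choice between prim and primforce can be summarized with a small decision helper. This is only a sketch of the rule described in this section: the function name, its parameters and the use of None to mean "do not force this node" are hypothetical.

```python
def forcing_command(stopped_during_reintegration, accept_possible_corruption=False):
    """Return the command able to force a node as primary, or None if it should not be forced."""
    if not stopped_during_reintegration:
        # Data may be out of date but is consistent: prim is accepted.
        return "prim"
    if accept_possible_corruption:
        # Partially reintegrated data: only primforce is accepted.
        return "primforce"
    # Preferred path: start the other node and let this one finish its reintegration.
    return None


print(forcing_command(False))
print(forcing_command(True))
print(forcing_command(True, accept_possible_corruption=True))
```

Making the risky case opt-in mirrors the guidance above: restarting node1 and letting node2 finish its reintegration is the normal path.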

 

  

 

# 6. Farm module administration

- Section 6.1 “Operating mode of a farm module”
- Section 6.2 “State automaton of a farm module (STOP, WAIT, UP - NotReady, Transient, Ready)”
- Section 6.3 “Start-up of a farm module”

 

To test a farm module, see section 4.3.

To analyze a problem, see section 7.

## 6.1 Operating mode of a farm module

| Phase | State | Description |
| --- | --- | --- |
| **1. Normal operation** | Stable state: 2 active nodes | On all nodes: the virtual IP is set, the application is running, and the network load is shared among all nodes. Each node is ready to run a failover and take 100% of the load. |
| **2. Automatic failover** | Stable state: 1 active node | When the remote node stops, the network load sharing fails over automatically. |
| **3. Back to normal operation** | Stable state: 2 active nodes | |

 

  

 

## 6.2 State automaton of a farm module (STOP, WAIT, UP - NotReady, Transient, Ready)

 

![State automaton of a farm module (STOP, WAIT, UP - NotReady, Transient, Ready)](safekituserguideen_fichiers/image253.png)

Note: This is also the state automaton of a light module. A light module is identified by `<service mode="light">` in the userconfig.xml file under SAFE/modules/*AM*/conf (where *AM* is the module name). The light type corresponds to an application module that runs on one node without synchronizing with other nodes (as mirror or farm modules can do). A light module includes the start and stop of an application as well as the SafeKit checkers that can detect errors.

## 6.3 Start-up of a farm module

Use the start command on each node running the module. An example with a farm of 2 servers is presented below.

| Step | node1 | node2 |
| --- | --- | --- |
| **1. Initial state** · the farm module has just been configured on node1 and node2 | STOP (NotReady) | STOP (NotReady) |
| **2. Command start on node1 and node2** · message in the log of both servers: "farm membership: **node1 node2** (group FarmProto\_0)" "farm load: **128/256** (group FarmProto\_0)" "Local state UP Ready" · resource of the module instance on both nodes: FarmProto\_0 50% | UP (Ready) | UP (Ready) |
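
The relation between the "farm load" log message and the 50% resource value can be checked with a short parser. This is a sketch: the function name and the regular expression are hypothetical, and the log line is assumed to have exactly the form quoted above, without the bold markers.

```python
import re

# Assumed form of the log message: "farm load: <owned>/<total> (group <name>)".
FARM_LOAD = re.compile(r"farm load: (\d+)/(\d+) \(group (\w+)\)")


def load_share(log_line):
    """Return (group, percentage of the load handled by this node)."""
    match = FARM_LOAD.search(log_line)
    if match is None:
        raise ValueError("not a farm load message")
    owned, total = int(match.group(1)), int(match.group(2))
    return match.group(3), 100.0 * owned / total


group, share = load_share("farm load: 128/256 (group FarmProto_0)")
print(group, share)
```

With two active nodes each one reports 128/256, that is 50%; a node left alone after a failover would be expected to report the full load.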

  

