---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/safekituserguidehtml/documentation/safekituserguideen.htm
---

# 7. Troubleshooting

![*](safekituserguideen_fichiers/image001.png)       Section 7.1 “Connection issues with the web console”

![*](safekituserguideen_fichiers/image001.png)       Section 7.2 “Connection issues with the HTTPS web console”

![*](safekituserguideen_fichiers/image001.png)       Section 7.3 “Global environment checks (healthcheck script)”

![*](safekituserguideen_fichiers/image001.png)       Section 7.4 “How to read logs and resources of the module?”

![*](safekituserguideen_fichiers/image001.png)       Section 7.5 “How to read the commands log of the server?”

![*](safekituserguideen_fichiers/image001.png)       Section 7.6 “Stable module  ![](safekituserguideen_fichiers/image257.png)(Ready) and ![](safekituserguideen_fichiers/image257.png)(Ready)”

![*](safekituserguideen_fichiers/image001.png)       Section 7.7 “Degraded module ![](safekituserguideen_fichiers/image255.png)(Ready) and ![](safekituserguideen_fichiers/image213.png)/![](safekituserguideen_fichiers/image205.png)(NotReady)”

![*](safekituserguideen_fichiers/image001.png)       Section 7.8 “Out of service module ![](safekituserguideen_fichiers/image213.png)/![](safekituserguideen_fichiers/image206.png)(NotReady) and ![](safekituserguideen_fichiers/image213.png)/![](safekituserguideen_fichiers/image206.png)(NotReady)”

![*](safekituserguideen_fichiers/image001.png)       Section 7.9 “Module ![](safekituserguideen_fichiers/image213.png) STOP (NotReady): start the module”

![*](safekituserguideen_fichiers/image001.png)       Section 7.10 “Module ![](safekituserguideen_fichiers/image206.png)WAIT (NotReady): repair the resource="down"”

![*](safekituserguideen_fichiers/image001.png)       Section 7.11 “Module oscillating from ![](safekituserguideen_fichiers/image255.png) (Ready) to ![](safekituserguideen_fichiers/image198.png) (Transient)”

![*](safekituserguideen_fichiers/image001.png)       Section 7.12 “Message on stop after maxloop”

![*](safekituserguideen_fichiers/image001.png)       Section 7.13 “Module ![](safekituserguideen_fichiers/image255.png) (Ready) but non-operational application”

![*](safekituserguideen_fichiers/image001.png)       Section 7.14 “Mirror module ![](safekituserguideen_fichiers/image257.png)ALONE (Ready) - ![](safekituserguideen_fichiers/image205.png)WAIT/![](safekituserguideen_fichiers/image213.png)STOP (NotReady)”

![*](safekituserguideen_fichiers/image001.png)       Section 7.15 “Farm module ![](safekituserguideen_fichiers/image255.png)UP (Ready) but problem of load balancing in a farm”

![*](safekituserguideen_fichiers/image001.png)       Section 7.16 “Problem with the virtual IP after failover”

![*](safekituserguideen_fichiers/image001.png)       Section 7.17 “Problem after Boot”

![*](safekituserguideen_fichiers/image001.png)       Section 7.18 “Analysis from snapshots of the module”

![*](safekituserguideen_fichiers/image001.png)       Section 7.19 “Problem with the size of SafeKit databases”

![*](safekituserguideen_fichiers/image001.png)       Section 7.20 “Problem for retrieving the certification authority certificate from an external PKI”

![*](safekituserguideen_fichiers/image001.png)       Section 7.21 “Issue with email sending by the SafeKit notification agent”

![*](safekituserguideen_fichiers/image001.png)       Section 7.22 “Issue with antivirus”

![*](safekituserguideen_fichiers/image001.png)       Section 7.23 “Issue with SafeKit kernel modules”

![*](safekituserguideen_fichiers/image001.png)       Section 7.24 “Troubleshooting VIP ↔ MAC resolution”

![*](safekituserguideen_fichiers/image001.png)       Section 7.25 “Still in trouble”

## 7.1             Connection issues with the web console

If you encounter problems connecting the SafeKit web console to a SafeKit node,
such as no reply or a connection error, run the following checks and procedures:

![*](safekituserguideen_fichiers/image001.png)       section 7.1.1 “Browser check”

![*](safekituserguideen_fichiers/image001.png)       section 7.1.2 “Browser state clear”

![*](safekituserguideen_fichiers/image001.png)       section 7.1.3 “Server check”

Then, it may be necessary to reload the
console into the browser.

### 7.1.1         Browser check

For the web browser:

1.    check that it is a supported browser, at a supported version

2.    change the proxy settings for direct or indirect connection to the
server

3.    with Microsoft Edge, change the security settings (add the URL into
the trusted zones)

4.    clear the browser's state on upgrade as described below

5.    check that the web console and the server are at the same level
(backward compatibility may not be fully preserved)

### 7.1.2         Browser state clear

1.    Clear the browser cache

A quick way to do this is a keyboard shortcut that works on IE, Firefox, and
Chrome. Open the browser to any web page and hold CTRL and SHIFT while tapping
the DELETE key (this is NOT CTRL, ALT, DEL). A dialog box opens to clear the
browser data. Set it to clear everything and click Clear Now or Delete at the
bottom.

2.    Clear the browser SSL cache if HTTPS is used

Look at advanced
settings for the browser and search for SSL cache.

Finally close all windows for the browser,
stop the browser process still running in the background if necessary, and
re-open it fresh to test what wasn't working for you previously.

### 7.1.3         Server check

On each SafeKit cluster node check:

1.    the firewall

If this has not yet been done, run the command SAFE/private/bin/firewallcfg add,
which configures the operating system firewall. For other firewalls, add an
exception to allow connections between the web browser and the server. For
details, see section 10.3.

2.    the web server configuration

HTTP access to the web console requires authentication. If this has not yet been
done, run the command SAFE/private/bin/webservercfg -passwd pwd to initialize
(or reinitialize) this configuration with the password of the user admin. For
details, see section 11.2.1.

3.    the network and the server availability

4.    the safeadmin and safewebserver services

They must
be started.

5.    the SafeKit cluster configuration

Run the command safekit cluster confinfo (see section 9.2). On all nodes, this
command must return the same list of nodes and the same value for the
configuration signature. If not, reapply the cluster configuration on all nodes
(see section 12.2).
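
The signature comparison above can be scripted once the value has been collected from each node. The sketch below is only an illustration: it assumes you have copied the configuration signature reported by safekit cluster confinfo on each node into a dictionary (the node names and signature strings are hypothetical), and it does not parse the command output itself.

```python
from collections import defaultdict


def group_by_signature(signatures):
    """Group node names by the configuration signature each node reported.

    `signatures` maps a node name to the signature string copied from that
    node's `safekit cluster confinfo` output (hypothetical input format).
    """
    groups = defaultdict(list)
    for node, signature in signatures.items():
        groups[signature.strip()].append(node)
    return {sig: sorted(nodes) for sig, nodes in groups.items()}


def is_consistent(signatures):
    """True when every node reported the same signature."""
    return len(group_by_signature(signatures)) == 1


if __name__ == "__main__":
    collected = {"node1": "a1b2c3", "node2": "a1b2c3", "node3": "ffff00"}
    for sig, nodes in group_by_signature(collected).items():
        print(sig, "->", ", ".join(nodes))
    if not is_consistent(collected):
        print("Signatures differ: reapply the cluster configuration on all nodes")
```

More than one group in the output means at least one node holds a different cluster configuration.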

## 7.2             Connection issues with the HTTPS web console

If you encounter problems connecting the secure SafeKit web console to SafeKit
nodes, run the following checks and procedures:

![*](safekituserguideen_fichiers/image001.png)       section 7.1 “Connection issues with the web console”

![*](safekituserguideen_fichiers/image001.png)       section 7.2.1 “Check server certificates”

![*](safekituserguideen_fichiers/image001.png)       section 7.2.2 “Check certificates installed in SafeKit”

![*](safekituserguideen_fichiers/image001.png)       section 7.2.3 “Revert to HTTP configuration”

### 7.2.1         Check server certificates

The SafeKit web console connects to a
SafeKit node that is identified by a certificate. To get the SafeKit node
certificate content with Internet Explorer or Chrome, run the following:

 

1.    Click on the lock next to the URL to open the security report

2.    Click on the View certificates link. It opens a window that displays the certificate content

3.    Check the issuer, which must be the appropriate certification authority

4.    Check the validity date and the workstation date. If necessary, change the workstation date

5.    Check the validity date. If the certificate has expired, you must renew it. For a certificate generated with the SafeKit PKI, see section 11.3.1.8.1

6.    Click on the Details tab

7.    Select the Subject Alternative Name field. Its content is displayed in the bottom panel. The location set in the URL for connecting the SafeKit web console must be included in this list. Change the URL if necessary

8.    The address value for the node, set in the SafeKit cluster configuration, must be one of the values listed. If it is not, change the cluster configuration as described in section 12.2.

When using a DNS name, you must use lower case.

Important: With SafeKit <= 7.5.2.9, the server’s name must be included.
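
The last two checks can be expressed as a small helper. This is a minimal sketch, assuming the Subject Alternative Name values have been read by hand from the certificate window; the function name and the sample host names are hypothetical.

```python
def check_address(address, san_entries):
    """Return a list of problems for one address (DNS name or IP address).

    `address` is the location used in the console URL or the node address
    from the cluster configuration. `san_entries` holds the values read
    from the certificate's Subject Alternative Name field.
    """
    problems = []
    if address != address.lower():
        problems.append("address is not lower case")
    if address not in san_entries:
        problems.append("address is not listed in the certificate")
    return problems


if __name__ == "__main__":
    san = ["node1.example.com", "192.168.1.10"]  # hypothetical values
    for candidate in ("node1.example.com", "Node1.example.com", "192.168.1.11"):
        print(candidate, "->", check_address(candidate, san) or "OK")
```

The comparison is deliberately exact, so an upper-case DNS name is reported both as a case problem and as missing from the list.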

### 7.2.2         Check certificates installed in SafeKit

You can use the checkcert command to check all the certificates.

On each SafeKit node:

1.    Log in as administrator/root and open a command shell window

2.    Change directory to SAFE/web/bin

3.    Run checkcert -t all

It checks all installed certificates and returns a failure if an error is detected

4.    You can check that the server certificate contains a given DNS name or
IP address with:

checkcert -h "DNS name value"

checkcert -i "Numeric IP address value"

Important: The server certificate must contain all DNS names and/or IP addresses used for HTTPS connection. They must also be included in the SafeKit cluster configuration file.

If the command fails, it may be due to an incorrect file format.

The content of a .crt file looks like:

-----BEGIN CERTIFICATE-----

MIID+DCCAuCgAwIBAgIFAJNuUj4wDQYJKoZIhvcNAQELBQAwUjEQMA4GA1UEChMH

RXhhbXBsZTEQMA4GA1UECxMHU2FmZUtpdDEsMCoGA1UEAxMjU2FmZUtpdCBMb2Nh

bCBDZXJ0aWZpY2F0ZSBBdXRob3JpdHkwHhcNMjQwNTI5MDYzMzIxWhcNNDQwNTI0

…

H/kG9pfzpnCEtZeyRCxGiowQpEmKtqOS51Xzg+q2tI7uiOAf5SVxHbqj/8c5RNZi

/iYlZg3itzIxLTPBEn3BD6pSVmRU33yU2cHo6HMsXXwFvo/LMOWNhVrj9I33d7u6

0fooCyU3aFbFCwGx

-----END CERTIFICATE-----

 

The content of a .key file looks like:

-----BEGIN PRIVATE KEY-----

MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCbSAP0f28TR3lj

jMRNabVP6725NQoH6Wt3O238aH8uXKKiI2byzWGXVjnrvT8AK+3lraQ4yLoAGtO3

LTsxsbuOQi90kwfelKNlQsIh3WJ7V6bGltLoQhT+bDdLJAPmLH1nFHKe19Tkvqr/

..

SUl5Ap71plSqrYlvNhkiOB50Hs34r+iNtPB6GaKtnTHicBjI1i95zrU/J5JKHxBV

uRY4ghOgtJyq9LuZXb2aTOht7K7QTjLRqHS5rdy+alSByhKpD2wR6oqX44mw1w1s

eOCnWlvhpFarc9As9BIVGsw=

-----END PRIVATE KEY-----
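
When a file format problem is suspected, the framing of a .crt or .key file can be checked before running checkcert again. The sketch below is not part of SafeKit; it only verifies that the text is a single block with matching BEGIN/END lines and a valid base64 body, and the function name is hypothetical.

```python
import base64
import re

# One PEM block: a BEGIN line, a base64 body, and an END line with the same label.
PEM_RE = re.compile(
    r"\A-----BEGIN ([A-Z ]+)-----\n(.+?)\n-----END \1-----\s*\Z", re.DOTALL
)


def pem_label(text):
    """Return the PEM label (for example 'CERTIFICATE') or None if malformed."""
    match = PEM_RE.match(text.replace("\r\n", "\n").strip())
    if match is None:
        return None
    body = "".join(match.group(2).split())  # drop line breaks inside the body
    try:
        base64.b64decode(body, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    return match.group(1)


if __name__ == "__main__":
    sample_body = base64.b64encode(b"not a real certificate").decode()
    sample = f"-----BEGIN CERTIFICATE-----\n{sample_body}\n-----END CERTIFICATE-----\n"
    print(pem_label(sample))
```

A result of None points to a damaged header or footer line, mismatched labels, or non-base64 characters in the body. The check does not validate the certificate content itself.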

### 7.2.3         Revert to HTTP configuration

If the problem cannot be solved, you can revert to the HTTP configuration
(SAFE=C:\safekit in Windows if the system drive is C:, and SAFE=/opt/safekit in
Linux).

On S1 and S2:

1.    remove the file SAFE/web/conf/ssl/httpd.webconsolessl.conf

2.    run safekit webserver restart

3.    clear the browser cache as described in section 7.1.2

## 7.3             Global environment checks (healthcheck script)

The healthcheck script,
available since SafeKit 8.2.6, is a diagnostic tool that performs a global
verification of a SafeKit environment on a SafeKit node. It checks key
components such as installation paths, SafeKit services, filesystem usage,
cluster and module configuration…

This script is intended for quick troubleshooting and support. It is
automatically executed during the module dump, and its output is included in the
module snapshot.

Run the healthcheck script
with:

|  |  |
| --- | --- |
| **Windows** | safekit -r healthcheck.ps1 |
| **Linux** | safekit -r healthcheck.sh |

 

All results are written into a single log
file (healthcheck.txt) in the SAFEVAR directory with [TEST], [OK], [WARN] and [ERROR] sections to
facilitate analysis.
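
To get a quick overview of a healthcheck.txt file, the markers can be counted and the warning and error lines listed. This is a sketch under assumptions: the guide only states that the file contains [TEST], [OK], [WARN] and [ERROR] sections, so the exact line layout used in the sample below is hypothetical.

```python
import re
from collections import Counter

MARKER_RE = re.compile(r"\[(TEST|OK|WARN|ERROR)\]")


def summarize_healthcheck(text):
    """Return (marker counts, lines that contain [WARN] or [ERROR])."""
    counts = Counter(MARKER_RE.findall(text))
    problems = [
        line.strip()
        for line in text.splitlines()
        if "[WARN]" in line or "[ERROR]" in line
    ]
    return counts, problems


if __name__ == "__main__":
    # Hypothetical content; read the real file from the SAFEVAR directory instead.
    sample = (
        "[TEST] services\n"
        "[OK] first check passed\n"
        "[WARN] second check needs attention\n"
        "[ERROR] third check failed\n"
    )
    counts, problems = summarize_healthcheck(sample)
    print(dict(counts))
    for line in problems:
        print(line)
```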

## 7.4             How to read logs and resources of the module?

**Module log** and **Scripts log** for the module on one node may be analyzed with (replace below node1 by the node name and *AM* by the module name):

·         the web console at URI /console/en/monitoring/modules/*AM*/nodes/node1/logs

·         the command safekit logview -m *AM* executed on node1, for the module log

·         on node1, the files SAFEVAR/modules/*AM*/userlog\_<year>\_<month>\_<day>T<time>\_<script name>.ulog, for the scripts log

With the module log, you can understand why the module is no longer in its stable state (Ready).

With the scripts log, you can see the output messages of module scripts (start\_xxx and stop\_xxx).

Note that a module can leave its stable state (Ready) because of an administrator command: safekit stop | restart | swap | stopstart | forcestop… -m *AM*

·         You will find a list of SafeKit log messages in Log Messages Index.

·         Messages in the log after an administrator command are:

"Action start called by admin@<IP>/SYSTEM/root"

"Action stop called by admin@<IP>/SYSTEM/root"

"Action restart called by admin@<IP>/SYSTEM/root"

"Action swap called by admin@<IP>/SYSTEM/root"

"Action stopstart called by admin@<IP>/SYSTEM/root"

"Action forcestop called by admin@<IP>/SYSTEM/root"

where admin@<IP> means via the SafeKit console, SYSTEM means a command on Windows, and root means a command on Linux

·         If "Action stop called by maxloop" appears in the module log, see section 7.12

**Resources state** of the module on one node may be analyzed with (replace below node1 by the node name and *AM* by the module name):

·         the web console at URI /console/en/monitoring/modules/*AM*/nodes/node1/resources

·         the command safekit state -m *AM* -v executed on node1

The main resources are:

·     Module status: state.local, state.remote, usersetting.errd, usersetting.checker, usersetting.encryption

·     Checkers: proc.xxx, intf.xxx, custom.xxx

·     File replication: rfs.uptodate, rfs.degraded, rfs.reintegre\_failed
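
The "Action ... called by ..." messages listed above can be extracted from a saved module log to see who triggered each action. This is a sketch: it assumes the log text was saved from safekit logview -m *AM*, matches only the message wording quoted in this section, and makes no assumption about the other fields of a log line.

```python
import re

ACTION_RE = re.compile(r'Action (\w+) called by ([^\s"]+)')


def describe_caller(caller):
    """Map the caller field to its meaning as documented in this section."""
    if caller.startswith("admin@"):
        return "SafeKit console"
    if caller == "SYSTEM":
        return "command on Windows"
    if caller == "root":
        return "command on Linux"
    if caller == "maxloop":
        return "stop after maxloop (see section 7.12)"
    return "other (for example a checker)"


def find_actions(log_text):
    """Return a list of (action, caller, meaning) tuples in log order."""
    return [
        (m.group(1), m.group(2), describe_caller(m.group(2)))
        for m in ACTION_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "Action restart called by admin@192.168.1.5\n"
        "Action stop called by root\n"
        "Action stop called by maxloop\n"
    )
    for action, caller, meaning in find_actions(sample):
        print(f"{action:10} {caller:22} {meaning}")
```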

## 7.5             How to read the commands log of the server?

The safekit commands run on the server are recorded in a log.

The **commands log** can be displayed with the command safekit cmdlog.

See section 10.12 for more details.

## 7.6             Stable module  (Ready) and (Ready)

·         A stable mirror module on 2 servers is in the state ![](safekituserguideen_fichiers/image263.jpg)PRIM (Ready) - ![](safekituserguideen_fichiers/image264.jpg)SECOND (Ready): the application is running on the PRIM server; on failure, the SECOND server is ready to resume the application.

·         A stable farm module is in the state ![](safekituserguideen_fichiers/image209.jpg)UP (Ready) on all servers of the farm: the application is running on all servers.

## 7.7             Degraded module (Ready)and /(NotReady)

A degraded mirror module is in the state ![](safekituserguideen_fichiers/image239.jpg)ALONE (Ready) - ![](safekituserguideen_fichiers/image214.jpg)STOP/![](safekituserguideen_fichiers/image265.jpg)WAIT (NotReady).
There is no recovery server, but the application is running on the ALONE
server.

A degraded farm module is in the state ![](safekituserguideen_fichiers/image211.jpg)UP (Ready) on at least one server of the
farm, the other servers being in the state ![](safekituserguideen_fichiers/image266.jpg)STOP/![](safekituserguideen_fichiers/image267.jpg)WAIT (NotReady). The
application is running on the UP server.

In the degraded case, there is no emergency procedure to implement. Analysis of
the state ![](safekituserguideen_fichiers/image268.jpg)STOP/![](safekituserguideen_fichiers/image265.jpg)WAIT (NotReady) can be
done later. However, you can attempt to restart the module in a stable state:

![*](safekituserguideen_fichiers/image001.png)       See section 7.9 “Module ![](safekituserguideen_fichiers/image213.png) STOP (NotReady):
start the module”

![*](safekituserguideen_fichiers/image001.png)       See section 7.10 “Module ![](safekituserguideen_fichiers/image205.png)WAIT (NotReady):
repair the resource="down"”

## 7.8             Out of service module /(NotReady) and /(NotReady)

An out of service mirror or farm module is
in the state ![](safekituserguideen_fichiers/image221.jpg)STOP/![](safekituserguideen_fichiers/image269.jpg)WAIT (NotReady) on all
servers. In this case, the application is no longer operational on any server.
You must restore the situation and restart the module in ![](safekituserguideen_fichiers/image257.png) (Ready) on at least
one server:

![*](safekituserguideen_fichiers/image001.png)       See section 7.9 “Module ![](safekituserguideen_fichiers/image213.png) STOP (NotReady):
start the module”

![*](safekituserguideen_fichiers/image001.png)       See section 7.10 “Module ![](safekituserguideen_fichiers/image205.png)WAIT (NotReady):
repair the resource="down"”

## 7.9             Module  STOP (NotReady): start the module

1.    Start the stopped module (replace below *AM* by the module
name) with:

·         the web console via ![](safekituserguideen_fichiers/image270.png)Monitoring/![](safekituserguideen_fichiers/image271.png)on the node/![](safekituserguideen_fichiers/image272.png)Start/

·         the command safekit start -m *AM* executed on the node

2.    Check that the module becomes ![](safekituserguideen_fichiers/image257.png) (Ready).

3.    Analyze the results of the start in the module and
scripts logs (replace below node1 by the node name and *AM* by
the module name) with:

·         the web console at URI /console/en/monitoring/modules/*AM*/nodes/node1/logs

·         the command safekit logview -m *AM* on node1, for the module log

·         the files SAFEVAR/modules/*AM*/userlog\_<year>\_<month>\_<day>T<time>\_<script name>.ulog on node1, for the scripts log

## 7.10          Module WAIT (NotReady): repair the resource="down"

If the module is in the state WAIT (NotReady), it waits for the state of a resource to become up.

You must identify and fix the problem that caused the resource state to go down.

To determine the resource involved, analyze the module log and resources (see section 7.4).

**Notes:**

A wait checker is started after the prestart script and stopped before poststop.

The checker is active on all servers ALONE/PRIM/SECOND/UP (Ready).

The action of the checker upon detecting an error is to set a resource to down.

A failover rule referencing the resource performs the wait action.

The module is locally in state WAIT (NotReady) while the resource stays down.

The module exits the WAIT (NotReady) state as soon as the checker sets the resource back to up.

Messages from wait checkers:

·     files not up-to-date locally: see section 5

"Data may be not uptodate for replicated directories (wait for the start of the remote server)"

"Action wait from failover rule notuptodate\_server"

"If you are sure that this server has valid data, run safekit prim to force start as primary"

·     <interface check="on"> checker of a local network interface

"Resource intf.ip.0 set to down by intfcheck"

"Action wait from failover rule interface\_failure"

·     <ping> checker of an external IP

"Resource ping.id set to down by pingcheck"

"Action wait from failover rule p\_id"

·     <module> checker of another module

"Resource module.othermodule\_ip set to down by modulecheck"

"Action wait from failover rule module\_failure"

·     <tcp ident="id" when="pre"> checker of an external TCP service

"Resource tcp.id set to down by tcpcheck"

"Action wait from failover rule t\_id"

·     <custom ident="id" when="pre"> customized checker

"Resource custom.id set to down by customscript"

"Action wait from failover rule customid\_failure"

·     <splitbrain> checker

"Resource splitbrain.uptodate set to down by splitbraincheck"

…

"Action wait from failover rule splitbrain\_failure"

·     Files not up-to-date locally due to split-brain: see section 13.18
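
To find which resource put the module in WAIT, the two message patterns quoted above can be searched in a saved module log. This is a minimal sketch that relies only on the message wording shown in this section; the function name is hypothetical.

```python
import re

DOWN_RE = re.compile(r"Resource ([\w.]+) set to down by (\w+)")
WAIT_RE = re.compile(r"Action wait from failover rule (\w+)")


def wait_causes(log_text):
    """Return (resources set to down with their checker, failover rules that waited)."""
    return DOWN_RE.findall(log_text), WAIT_RE.findall(log_text)


if __name__ == "__main__":
    sample = (
        '"Resource intf.ip.0 set to down by intfcheck"\n'
        '"Action wait from failover rule interface_failure"\n'
    )
    downs, rules = wait_causes(sample)
    for resource, checker in downs:
        print(f"{resource} was set to down by {checker}")
    for rule in rules:
        print(f"wait triggered by failover rule {rule}")
```

The resource name returned here is the one to look up in the resources state described in section 7.4.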

## 7.11          Module oscillating from  (Ready) to  (Transient)

If a module oscillates from state (Ready) to state (Transient), it is probably a victim of a restart or stopstart checker which detects a constant error.

By default, after the 4th unsuccessful restart on a server, the module stops, and the server stabilizes in STOP (NotReady).

Use the module log to determine which checker is the source of the logs (to read logs, see section 7.4).

**Notes:**

A restart or stopstart checker is defined in userconfig.xml by:

·         when="prim" for a mirror module

The checker is started on the node PRIM/ALONE (Ready) after script start\_prim (stopped before stop\_prim). It checks the application started in start\_prim.

·         when="both" for a farm module

The checker is started on all nodes UP (Ready) after script start\_both (stopped before stop\_both). It checks the application started in start\_both.

The action of a checker on an error is to restart or stopstart the module. stopstart on PRIM (Ready) leads to a failover of the primary on the other node.

The module is in the state PRIM/UP (Transient) during the application restart.

After several oscillations, the module stops with "Action stop called by maxloop" in the module log: see section 7.12.

Messages from restart or stopstart checkers:

·         <errd> in userconfig.xml: checker of processes

"Process appli.exe not running"

"Action restart|stopstart called by errd"

·         <tcp ident="id" when="prim"|"both"> in userconfig.xml: TCP checker of the application

"Resource tcp.id set to down by tcpcheck"

"Action restart|stopstart from failover rule t\_id"

·         <custom ident="id" when="prim"|"both"> in userconfig.xml: custom checker

"Resource custom.id set to down by customscript"

"Action restart|stopstart from failover rule c\_id"

or

"Action restart|stopstart called by customscript"

## 7.12          Message on stop after maxloop

If an error detected by a checker repeats several times in succession, the module is stopped on the server in STOP (NotReady), because the error is permanent and the action of the checker cannot correct it.

If there is no maxloop / loop\_interval parameter in <service> in userconfig.xml, the defaults are maxloop="3" loop\_interval="24": if the checkers generate more than 3 unsuccessful restarts (restart, stopstart, wait) in less than 24 hours, the module is stopped in STOP (NotReady).

The counter is reset to 0 if an administrator executes an action on the module such as safekit start -m *AM* (replace *AM* by the module name) or safekit stop -m *AM* (without the option -i <identity>).

Message on stop after maxloop:

"Action stop called by maxloop"
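
The maxloop rule can be illustrated with a small counter. This is a sketch of one plausible reading of the rule (a sliding time window), not SafeKit's actual implementation; the class and method names are hypothetical, and only the default values maxloop="3" and loop\_interval="24" come from this section.

```python
from datetime import datetime, timedelta


class MaxloopCounter:
    """Counts checker actions (restart, stopstart, wait) inside a time window."""

    def __init__(self, maxloop=3, loop_interval_hours=24):
        self.maxloop = maxloop
        self.window = timedelta(hours=loop_interval_hours)
        self.events = []

    def record_checker_action(self, when):
        """Record one action; return True if the module would be stopped."""
        # Keep only the actions that are still inside the window.
        self.events = [t for t in self.events if when - t < self.window]
        self.events.append(when)
        return len(self.events) > self.maxloop

    def reset(self):
        """Administrator action such as safekit start or safekit stop."""
        self.events.clear()


if __name__ == "__main__":
    counter = MaxloopCounter()
    start = datetime(2024, 1, 1, 8, 0)
    for i in range(4):
        stopped = counter.record_checker_action(start + timedelta(hours=i))
        print(f"action {i + 1}: stop by maxloop = {stopped}")
```

With the defaults, the fourth action inside 24 hours is the first one that exceeds maxloop, which matches the "4th unsuccessful restart" mentioned in section 7.11.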

## 7.13          Module  (Ready) but non-operational application

If a server has a status of ![](safekituserguideen_fichiers/image181.jpg)PRIM (Ready), ![](safekituserguideen_fichiers/image181.jpg)ALONE (Ready) or ![](safekituserguideen_fichiers/image183.jpg)UP (Ready), the application can be
non-operational because of undetected errors on start-up. In the following,
replace node1 by the node name and *AM* by the module name.

1.    Check the output messages of the application scripts start\_prim/start\_both and stop\_prim/stop\_both. They are visible in:

·         the web console at URI /console/en/monitoring/modules/*AM*/nodes/node1/logs

·         the files SAFEVAR/modules/*AM*/userlog\_<year>\_<month>\_<day>T<time>\_<script name>.ulog, on node1, for the scripts log

Check if there are errors during start or stop of the application. Be careful:
the userlog is sometimes disabled with <user logging="none"> in userconfig.xml
of the module because it is too large.

2.    Check the application scripts start\_prim(/both) and stop\_prim(/both) of a mirror(/farm) module and userconfig.xml with:

·         the web console at URI /console/en/configuration/modules/*AM*/config

·         the directory SAFE/modules/*AM* on node1

3.    Execute a restart on the ![](safekituserguideen_fichiers/image211.jpg)PRIM/ALONE/UP (Ready) node to stop and restart the application locally (without failover)
with:

·         the web console via ![](safekituserguideen_fichiers/image270.png)Monitoring/![](safekituserguideen_fichiers/image271.png)on the node/Restart/

·         the command safekit restart -m *AM* executed on the node

4.    If the application is still non-operational, apply a stop on the ![](safekituserguideen_fichiers/image184.jpg)PRIM/ALONE/UP (Ready) node to
stop the application (a stopstart triggers a failover if the other node is Ready)
with:

·         the web console via ![](safekituserguideen_fichiers/image270.png)Monitoring/![](safekituserguideen_fichiers/image271.png)on the node/![](safekituserguideen_fichiers/image279.png)Stop/

·         the command safekit stop -m *AM* executed on the node

## 7.14          Mirror module ALONE (Ready) - WAIT/STOP (NotReady)

If a mirror module stays in state ![](safekituserguideen_fichiers/image211.jpg)ALONE (Ready) - ![](safekituserguideen_fichiers/image280.jpg)WAIT (NotReady), check
the resource state.remote on each node (to read resources, see section 7.4). If this state is UNKNOWN on
the two nodes, there is probably a communication problem between the nodes.
This problem may also lead to ![](safekituserguideen_fichiers/image281.jpg)ALONE (Ready) - ![](safekituserguideen_fichiers/image214.jpg)STOP (NotReady).

Possible root causes are:

1.    Real network problem

Check your
network configurations on the two nodes.

2.    Firewall rules on one or the two nodes

For details, see section 10.3.

3.    Not the same SafeKit cluster configuration or cluster cryptographic
keys

To communicate, cluster nodes must belong to the same cluster and have the same
configuration (see section 12):

·         The web console warns if the nodes in the cluster nodes list do not have an identical configuration

·         The command safekit cluster confinfo, run on any node of the cluster, must report an identical configuration signature for all nodes of the cluster (see section 9.2)

If the cluster configuration is not identical, re-apply the cluster
configuration on all cluster nodes as described in section 3.2.2.

4.    Not the same module cryptographic keys

When cryptography has been enabled for the module, the resource
usersetting.encryption is “on” (to check the state of resources, see section
7.4). If the nodes do not have the same keys for the module, they cannot
exchange the internal module communications.

To distribute the same module cryptographic keys, re-apply the module
configuration on all nodes.

See section 10.7 for details.

5.    Expired cryptographic keys

In SafeKit <= 7.4.0.31, the key for encrypting the module communication has a
validity period of 1 year. When it expires in a mirror module with file
replication, the secondary fails to reintegrate and the module stops with an
error message in the log:

reintegre | D | XXX clnttcp\_create: socket=7 TLS handshake failed

In SafeKit > 7.4.0.31, the message is:

reintegre | D | XXX clnttcp\_create: socket=7 TLS handshake failed. Check server time and module certificate (expiration date, hash)

To solve this problem, see section 10.7.3.1.

## 7.15          Farm module UP(Ready)but problem of load balancing in a farm

Even though all servers in the
farm are ![](safekituserguideen_fichiers/image184.jpg)UP (Ready), load balancing is not working.

### 7.15.1      Reported network load shares are not coherent

In a farm module, the sum of the network
load shares of all ![](safekituserguideen_fichiers/image239.jpg)UP (Ready) module nodes must be equal to
100%.

If this is not the case, there is probably a
communication problem between module nodes. Possible root causes are the same
as for a mirror module. See section 7.14 for possible solutions.

See also section 4.3.6.
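
The 100% rule can be checked with a few lines once the load share of each UP (Ready) node has been noted. This is an illustrative sketch: the node names, the input format and the rounding tolerance are assumptions, and the values must be read from the SafeKit console or commands by hand.

```python
def check_load_shares(shares_percent, tolerance=0.5):
    """Return (total, ok) for the load shares of the UP (Ready) nodes.

    `shares_percent` maps a node name to its reported network load share
    in percent. `tolerance` absorbs rounding in the reported values.
    """
    total = sum(shares_percent.values())
    return total, abs(total - 100.0) <= tolerance


if __name__ == "__main__":
    total, ok = check_load_shares({"node1": 50.0, "node2": 50.0, "node3": 50.0})
    print(f"total = {total}%", "coherent" if ok else "not coherent")
```

A total above 100% suggests that some nodes do not see each other and each serves part of the same traffic; a total below 100% suggests that part of the traffic is served by no node.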

### 7.15.2      Virtual IP address does not respond properly

If the virtual IP does not respond properly
to all requests for connections:

1.    choose a node in the farm that receives and processes connections on
the virtual IP address (established TCP connections):

·         in Windows, use the command netstat -an | findstr <virtual IP address>

·         in Linux, use the command netstat -an | grep <virtual IP address>

2.    stop the farm module on all nodes except the one that receives
connections and that remains ![](safekituserguideen_fichiers/image211.jpg)UP (Ready) with:

·         the web console via ![](safekituserguideen_fichiers/image282.png)Monitoring/![](safekituserguideen_fichiers/image271.png)on the node/![](safekituserguideen_fichiers/image283.png)Stop/

·         the command safekit stop -m *AM* (replace *AM* by the module name)

3.    check that all connections to the virtual IP address are handled by
the single server ![](safekituserguideen_fichiers/image284.jpg) UP (Ready)
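
The netstat filtering in step 1 can be done on saved output to list only the established connections whose local address is the virtual IP address. This is a sketch that assumes the usual column layout of netstat -an for IPv4 TCP lines on Windows (Proto, Local Address, Foreign Address, State) and Linux (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the addresses in the sample are hypothetical.

```python
def established_on_vip(netstat_output, virtual_ip):
    """Return the netstat lines for ESTABLISHED connections local to virtual_ip."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[-1] != "ESTABLISHED":
            continue
        # In both layouts the local address is the third field from the end.
        local_address = fields[-3]
        if local_address.rsplit(":", 1)[0] == virtual_ip:
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    sample = (
        "tcp   0   0 192.168.1.99:80    10.0.0.7:51234   ESTABLISHED\n"
        "tcp   0   0 192.168.1.10:22    10.0.0.7:51300   ESTABLISHED\n"
        "tcp   0   0 0.0.0.0:80         0.0.0.0:*        LISTEN\n"
    )
    for line in established_on_vip(sample, "192.168.1.99"):
        print(line)
```

A node that returns at least one line is a candidate for the node to keep UP (Ready) in step 2.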

 

For a more detailed analysis on this topic,
see the next section and:

![*](safekituserguideen_fichiers/image001.png)       section 4.3.4 “Test virtual IP address of a farm module”

![*](safekituserguideen_fichiers/image001.png)       section 4.3.5 “Test TCP load balancing on a virtual IP address”

![*](safekituserguideen_fichiers/image001.png)       section 4.3.7 “Test compatibility of the network with invisible MAC address”

## 7.16          Problem with the virtual IP after failover

Sometimes, external devices function
correctly when the primary server is node1, but they do not work properly after
failover on the other node, node2.

It may be a problem with the configuration
of external devices. Two types of TCP connections must be considered at the
level of external devices:

·         Outgoing TCP connections issued by the external devices to the SafeKit cluster.

·         Incoming TCP connections issued by the SafeKit cluster to the external device.

The outgoing TCP connections, issued by the
external devices to the SafeKit cluster, must be configured with the virtual IP
address and not the physical IP address of node1. Otherwise, they will remain
stuck to node1 in case of a failover to node2. Note that on the node side, the
application must listen on the virtual IP address to accept connections from
external devices. You can check the listening TCP sockets using the netstat
command. Generally, a listening TCP socket is bound to all IP addresses (0.0.0.0),
and in this case, there is no problem.

For incoming TCP connections, if the
application initiates a TCP connection on node1 to an external device, this
connection will start with the physical IP address of node1 as the source IP
address. After a failover to node2, the connection will start with the physical
IP address of node2. This is because the virtual IP address is set as an alias
on the network interface of the primary node, and the primary IP address of the
network interface remains the physical IP address of the node.

Therefore, if the external devices perform
a check on their incoming connections, it is necessary to configure them to
accept connections from the two physical IP addresses of the two nodes in the
cluster.
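
To see which source address a node would use toward an external device, you can ask the operating system without sending any traffic. This is a generic sketch that is not part of SafeKit: connecting a UDP socket only selects a route and a local address, and the destination address and port used below are placeholders.

```python
import socket


def source_ip_for(destination_ip, port=9):
    """Return the local IPv4 address the OS selects to reach destination_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket sends no packet; it only fixes the route.
        sock.connect((destination_ip, port))
        return sock.getsockname()[0]


if __name__ == "__main__":
    # Replace the loopback address by the IP address of the external device.
    print(source_ip_for("127.0.0.1"))
```

Run on the primary node, this is expected to show the physical IP address when the virtual IP address is configured as an alias, which is the situation described above.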

Now, if the external devices can only be
configured with a single IP address, then you need to reconfigure in the userconfig.xml:

<virtual\_addr where="one\_side\_alias">

to

<virtual\_addr where="one\_side">

This way, the primary IP address will be
the virtual IP address, and the connections from the primary node to the
external devices will use the virtual IP address as the source IP address.
Consequently, the external devices must be configured to accept incoming
connections on a single IP: the virtual IP address.

The same issue may occur if external
devices communicate with the cluster using UDP. In this case, it may be
preferable to configure one\_side.

The final possibility is that an external
device only accepts communications to a unique Ethernet MAC address of a
server. In this very specific and rare case, you need to configure the virtual
IP address with a 'vmac\_invisible' MAC address. For example, it can start with '5A:FE':

<virtual\_interface type="vmac\_invisible" addr="5A:FE:01:02:03:04">

When configured as 'vmac\_invisible', a
virtual MAC address is associated with a virtual IP address, but this MAC
address is never visible in Ethernet headers. This configuration allows packets
directed to the virtual IP address to be received by all servers within the
system without revealing the virtual MAC address to switches, which would
typically be able to locate it. Since switches cannot detect this address, they
broadcast packets intended for it across all ports in the local area network
(LAN). All nodes on the LAN receive these packets, including both nodes of the cluster,
so the primary server can run on either node1 or node2. vmac\_invisible
requires promiscuous mode on the physical Ethernet cards of both nodes.
On Linux, it also requires the 'vip' kernel module, which must be
compiled.

Preferably, configure a virtual IP address
with 'one\_side\_alias', and only use 'one\_side' or 'vmac\_invisible' if necessary.
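To check which mode is currently configured, the `where` attribute can be read from userconfig.xml. The following is a rough Python sketch, not a SafeKit utility: the function name is hypothetical, and the sample XML (the element nesting and the `addr` attribute) is an assumption made for illustration. Only the `virtual_addr` tag and its `where` attribute come from the text above.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the nesting and the addr attribute are assumptions.
SAMPLE = """
<safe>
  <service mode="mirror">
    <vip>
      <virtual_addr addr="192.0.2.10" where="one_side_alias"/>
      <virtual_addr addr="192.0.2.11" where="one_side"/>
    </vip>
  </service>
</safe>
"""

def virtual_addr_modes(userconfig_xml):
    """Return (addr, where) for every virtual_addr element, at any depth."""
    root = ET.fromstring(userconfig_xml)
    return [(el.get("addr"), el.get("where")) for el in root.iter("virtual_addr")]

if __name__ == "__main__":
    for addr, where in virtual_addr_modes(SAMPLE):
        print(addr, where)
```

Because `iter` searches at any depth, the sketch does not depend on the exact nesting of the real configuration file.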

Note that none of these issues arise with a
complete virtual machine replication and restart solution. In this case, there
is no virtual IP address involved. The VM is relaunched on the secondary node
with the same primary IP address and the same MAC address. To avoid the
aforementioned issues, you can use SafeKit solutions for Hyper-V or KVM.

## 7.17          Problem after Boot

If you encounter a problem after boot, see section 4.1.

Note that by default, modules are not
automatically started at boot. To enable this, set up the boot start in the
module’s configuration with either:

·         the web console at /console/en/configuration/modules/*AM*/config

·         the file SAFE/modules/*AM*/conf/userconfig.xml on node1, using the boot attribute of the service
tag (see section 13.3.3)

Then apply the new configuration on all
nodes.
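Before applying the configuration, the boot setting can be checked by reading the attribute back from the file. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, and the sample value `on` and the surrounding `safe` element are illustrative (section 13.3.3 gives the accepted values). Only the `service` tag and its `boot` attribute come from the text above.

```python
import xml.etree.ElementTree as ET

def boot_setting(userconfig_xml):
    """Return the boot attribute of the first service tag, or None if absent."""
    root = ET.fromstring(userconfig_xml)
    service = next(root.iter("service"), None)  # matches the root itself too
    return None if service is None else service.get("boot")

if __name__ == "__main__":
    # Illustrative content; a real userconfig.xml contains much more.
    sample = '<safe><service mode="mirror" boot="on"></service></safe>'
    print(boot_setting(sample))
```

A result of `None` means that the attribute is not set, in which case the default behavior described above applies.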

## 7.18          Analysis from snapshots of the module

When the problem is not easily
identifiable, it is recommended to take a snapshot of the module on all nodes
as described in section 3.5. A snapshot is
a zip file that collects, for one module, the configuration files, dumps… Its
content allows an offline and in-depth analysis of the module and node status.

Note that the structure and content of the snapshot vary depending on the version of SafeKit.

Since SafeKit 8.1, the structure of the
snapshot is as follows:

|  |  |
| --- | --- |
| snapshot\_centos7\_test3\_mirror/ | Directory snapshot\_nodename\_*AM*: snapshot of the module *AM* taken from the node named nodename |
| mirror/ | Directory *AM*: application module name |
| config\_2021\_05\_05\_14\_15\_42/      config\_2021\_07\_08\_10\_05\_02/      config\_2021\_08\_18\_16\_15\_25/ | Directories config\_year\_month\_day\_hour\_mn\_sec: last 3 configurations of the module, including the current one |
| dump\_2021\_05\_15\_10\_15\_40/      dump\_2021\_07\_20\_11\_05\_35/     dump\_2021\_08\_28\_08\_11\_45/ | Directories dump\_year\_month\_day\_hour\_mn\_sec: last 3 dumps of the module, including the most recent one |
| tmp/ | Directory for the level 3 support |

 

Since SafeKit 8.2.6, SafeKit provides the anontool
utility, which allows generating a filtered and anonymized snapshot from a full
snapshot. The filtered snapshot contains the minimal subset required to analyze
an issue:

·         the cluster and module configuration (the most
recent config\_year\_month\_day\_hour\_mn\_sec directory),

·         the module log, the module script logs, and the
SafeKit command log from the most recent dump\_year\_month\_day\_hour\_mn\_sec directory.

Each of these files is anonymized. See section 9.6.2.2 for more
details.

As the anonymized snapshot is restricted,
it only allows partial analysis.
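When analyzing a full snapshot by hand, the most recent config and dump directories can be identified from the timestamp embedded in their names. The following is a minimal Python sketch, not part of SafeKit or anontool: the function name `latest` is hypothetical, and the sketch assumes that directory names follow the year\_month\_day\_hour\_mn\_sec pattern shown in the table above.

```python
import re
from datetime import datetime

# Matches e.g. config_2021_08_18_16_15_25 or dump_2021_08_28_08_11_45/
_NAME = re.compile(r"^(config|dump)_(\d{4})_(\d{2})_(\d{2})_(\d{2})_(\d{2})_(\d{2})/?$")

def latest(names, kind):
    """Return the most recent '<kind>_year_month_day_hour_mn_sec' name, or None.

    names: directory names, for example from os.listdir() on the module directory.
    kind:  "config" or "dump".
    """
    dated = []
    for name in names:
        match = _NAME.match(name)
        if match and match.group(1) == kind:
            stamp = datetime(*(int(part) for part in match.groups()[1:]))
            dated.append((stamp, name))
    return max(dated)[1] if dated else None

if __name__ == "__main__":
    listing = [
        "config_2021_05_05_14_15_42/", "config_2021_08_18_16_15_25/",
        "dump_2021_07_20_11_05_35/", "dump_2021_08_28_08_11_45/", "tmp/",
    ]
    print(latest(listing, "config"))
    print(latest(listing, "dump"))
```

Parsing the timestamp, rather than sorting the names as plain strings, also rejects names that do not form a valid date.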

### 7.18.1      Module configuration files

The module configuration files are saved as
follows:

|  |  |
| --- | --- |
| config\_2021\_08\_18\_16\_15\_25/ | Directory for the module's configuration files |
| module/           bin/        conf/        web/       private/ | Directory module: contains the user configuration files  ·         bin directory: scripts start\_xx, stop\_xx, …  ·         conf directory: XML configuration userconfig.xml |

 

Check the user configuration file and
scripts when troubleshooting the integration of the application into SafeKit.

### 7.18.2      Module dump files

The dump contains the state of the module
and the SafeKit node as it was at the time of the dump.

|  |  |
| --- | --- |
| dump\_2021\_08\_28\_08\_11\_45/ | Directory for the module's dump files |
| csv/       licenses/       notifications/       userlog/       var/       web/ | ·         csv directory: logs and status in csv format  ·         licenses directory: SafeKit licenses taken from the SAFE/conf directory  ·         notifications directory: email notification agent configuration taken from the SAFE/web/notifications directory  ·         userlog directory: module script logs  ·         var directory: extract of the SAFEVAR directory  ·         web directory: web server configuration taken from the SAFE/web/conf directory |
| log.txt    logverbose.txt | Module log files (not verbose and verbose) |
| healthcheck.txt | Since SafeKit 8.2.6, a diagnostic report that provides a global verification of a SafeKit environment on a node (see section 7.3). |
| heartplug | *Information file*  *Various information about the node (list and status of installed modules, OS version, disk, and network configuration…)* |
| last.txt    systemevt.txt  Or    applicationevt.txt    systemevt.txt | System logs  ·         last.txt and systemevt.txt in Linux  Or  ·         applicationevt.txt and systemevt.txt in Windows |
| commandlog.txt | Commands log for the node |
| heart    heart.trc    nfsbox    nfsbox.trc | Trace files for level 3 support |

·         Check the license file(s) in the licenses
directory when troubleshooting the SafeKit license check

·         Check the Apache configuration files in the web directory
when troubleshooting the SafeKit web service

·         Check the module logs, log.txt
and logverbose.txt, when troubleshooting the module behavior

·         Check the module script logs userlog/userlog\_<year>\_<month>\_<day>T<time>\_<script
name>.ulog when troubleshooting application
start/stop

·         Since SafeKit 8.2.6, check for errors in the healthcheck.txt file

·         If necessary, look at the heartplug file for
information on the node, and search the system logs for events that
occurred at the same time as the problem being analyzed

·         Check the commands log commandlog.txt when
troubleshooting cluster management or distributed commands

#### 7.18.2.1  var directory

The var directory is
mainly intended for the level 3 support. It is a copy of part of the SAFEVAR
directory. In the
var/cluster directory:

·         look at the cluster.xml file to
check the cluster configuration

·         look at the cluster\_ip.xml file
to check the DNS resolution of the names used in the cluster configuration

#### 7.18.2.2  csv directory

The logs and reports are also exported in
csv format to the csv directory:

|  |  |
| --- | --- |
| csv/ | csv directory |
| logverbose.csv          resource.csv    resourcelog.csv | Logs and status of the module  ·         Verbose log        ·         Resources status and history |
| commandlog.csv    modules.csv    moduleslog.csv   clusterstate.csv | Logs and status of the node  ·         Commands log  ·         List of installed modules  ·         For the level 3 support |

 

Import the csv files into an Excel sheet to
facilitate their analysis. To import a file:

1.    Create a new sheet

2.    From the Data tab, import From Text/CSV

![](safekituserguideen_fichiers/image286.png)

3.    In the dialog box, locate and double-click the csv file to import,
then click Import

4.    Then click on Load

![](safekituserguideen_fichiers/image287.jpg)

You can use the Excel features to filter
rows, for example according to the level of the messages, and load the csv
files of each node into separate sheets.

Note that to display the exact date, format the cells with Number/Custom dd/mm/yyyy hh:mm:ss.000 (jj/mm/aaaa hh:mm:ss,000 in a French-language Excel).
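When Excel is not available, the same kind of filtering can be approximated with a short script. The following is a minimal Python sketch, not a SafeKit tool: the function name, the column names (`date`, `level`, `message`), the level code `E` and the comma delimiter are all assumptions made for illustration. Check the header row of the actual csv file and adjust them.

```python
import csv
import io

def filter_rows(csv_text, column, wanted, delimiter=","):
    """Return the rows (as dicts keyed by header) whose `column` value is in `wanted`."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [row for row in reader if row.get(column) in wanted]

if __name__ == "__main__":
    # Invented sample content; real SafeKit csv files have their own columns.
    sample = (
        "date,level,message\n"
        "2021-08-28 08:11:45,E,first message\n"
        "2021-08-28 08:11:46,I,second message\n"
    )
    for row in filter_rows(sample, "level", {"E"}):
        print(row["date"], row["message"])
```

To compare nodes, the function can be called once per node on the corresponding csv file content, for example after reading it with `open(path, newline="").read()`.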

