---
canonical: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/slides-en/9-troubleshooting-en.pptx
---

# PowerPoint Converted to Markdown

Source: https://safekit.evidian.com/wp-content/uploads/downloads_safekit/version-82/slides-en/9-troubleshooting-en.pptx


## Slide 1: SafeKit Troubleshooting

_No extractable slide text found._


### Speaker notes

> I am going to present how to troubleshoot SafeKit by analyzing logs and their messages, so you can resolve issues on your own. This process will help you identify the root causes of common issues and provide step-by-step solutions.


## Slide 2: SafeKit Troubleshooting

_No extractable slide text found._


### Speaker notes

> Here is an outline of the different topics covered for SafeKit troubleshooting:
> In 1, we encourage you to analyze the logs yourself to identify and resolve potential issues.
> In 2, we explain how to run an application without SafeKit.
> In 3, we describe how to take module snapshots at a customer site and how to diagnose and resolve issues quickly from snapshots.
> In 4, we cover issues that can happen with the web console.
> In 5, we do the same with command lines.
> In 6, we address how to solve mirror module issues.
> In 7, we focus on solving farm module issues.
> In 8, we explain how to solve checker issues.


## Slide 3

- Analyze the logs yourself


### Speaker notes

> Let's start with analyzing the logs on your own.


## Slide 4: Analyze the module log yourself

- Log analysis in the web console or with command lines
- C(ritical) messages (error detection…)
- E(vent) messages (local and remote state…)
- U(ser) messages when the user runs an action on the module (e.g., Action stop called by username)
- S(cript) messages when module scripts are executed
- Select the message type to view


### Speaker notes

> You can analyze the log either with the monitoring feature of the console or with command lines. In the console, first select the module and the node on the left side. The log of the module on that node is then displayed on the right side. The log is displayed in real time with three buttons: resume, suspend, and download.
> By downloading the log to your workstation, you can use your favorite editor to analyze the messages. As shown in the second text box, the console provides an option to select messages according to their types. Here are the different message types you can select, represented by the four icons in the console:
> The first icon represents critical messages associated with error detection.
> The second icon represents event messages related to changes in local or remote states.
> The third icon represents user messages when the user performs an action on the module.
> The fourth icon represents script messages when module scripts are executed.
> These four types of messages are saved in a log named the non verbose log.
> As shown in the third text box, you can also filter messages based on content or a date range.
> Finally, the root directory of SafeKit contains two commands that help with debugging: `safekit logview` and `safekit logsave`. The `-A` option displays the verbose log.


## Slide 5: Analyze with the module states timeline

- State on all nodes over time, starting with the newest date (since SafeKit 8.2.2)
- Click to open/close the timeline
- Click to display the module log for the node starting at this date


### Speaker notes

> Let's explore the module states timeline feature.
> To access this feature, click on the clock icon located in the window's header. The timeline will appear as shown in the figure on the left. This timeline displays the module states on all nodes in reverse chronological order, starting with the most recent date. This visual representation helps administrators quickly understand the current status and historical changes of the module.
> The timeline shown is the one available at the time of loading. To refresh the timeline with the latest state changes, simply click on the refresh icon. Additionally, you can click on an event in the timeline. As shown in the figure on the right, the module log associated with the event will be displayed.
> It's important to note that the clocks of the two nodes must be synchronized for the state changes to be accurately mapped. This ensures that the timeline provides meaningful and precise information about the module states.


## Slide 6: Module log messages (1/2)

- For a full list of main messages, see “Log Messages Index” in the User’s Guide
- For custom, TCP, and ping checkers, the rule name syntax is `<checker initial>_<ident attribute value>`


### Speaker notes

> Let's consider now some main module log messages. For a full list of messages, please refer to the "Log Messages Index" in the User’s Guide.
> In the module log, you will encounter various types of messages emitted when checkers detect an error, along with the actions triggered to resolve it.
> Firstly, the process monitoring message issued by `errd` indicates that the process `mysqld.exe` was not running. Consequently, `errd` initiated a restart action to restart the application.
> Secondly, the `ip` message shows that the virtual IP checker has set its resource to down, meaning that either there was a duplicate virtual IP on the network or the virtual IP has been removed. The action stopstart has been triggered from the failover rule `ip_failure` to reconfigure the virtual IP.
> Thirdly, the resource `custom.http` has been set to down by a custom checker, and an action restart has been initiated from the failover rule `c_http` to restart the application.
> Fourthly, the `tcp` checker has set its resource to down because it could not establish a TCP connection to the application. The action restart has been triggered from the failover rule `t_Web_80` to restart the application.
> Note that the failover rule names of TCP, ping, and custom checkers are prefixed by `t_`, `p_`, and `c_` respectively.
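
The rule-name convention above can be sketched in a few lines of Python. This is only an illustration of the `<checker initial>_<ident attribute value>` syntax and of the three prefixes named in the notes; the helper names `rule_name` and `checker_of` are hypothetical and are not part of SafeKit.

```python
from typing import Optional

# Prefixes stated in the slides: 't_' (TCP), 'p_' (ping), 'c_' (custom).
CHECKER_PREFIXES = {"t": "tcp", "p": "ping", "c": "custom"}


def rule_name(checker: str, ident: str) -> str:
    """Build a failover rule name as <checker initial>_<ident attribute value>."""
    if checker not in CHECKER_PREFIXES.values():
        raise ValueError(f"no prefixed rule name for checker {checker!r}")
    return f"{checker[0]}_{ident}"


def checker_of(rule: str) -> Optional[str]:
    """Return the checker kind for a prefixed rule name, or None otherwise."""
    initial, sep, ident = rule.partition("_")
    if sep and ident and initial in CHECKER_PREFIXES:
        return CHECKER_PREFIXES[initial]
    return None
```

With this sketch, `checker_of("t_Web_80")` identifies a TCP checker, while a rule such as `ip_failure` is not recognized as one of the three prefixed kinds.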


## Slide 7: Module log messages (2/2)

_No extractable slide text found._


### Speaker notes

> Let's consider other main module log messages coming from checkers.
> Firstly, the network interface checker message indicates that the interface was detected down. The action wait has been triggered from the failover rule interface_failure, waiting for the repair of the network interface.
> Secondly, the ping message shows that the ping checker has set its resource to down because its ping to an external component is not answering. An action wait has been triggered from the failover rule p_router, waiting for the external component to answer again to ping.
> Thirdly, the resource custom.sql has been set to down by a custom checker, and an action stopstart has been initiated from the failover rule c_sql, to restart the application on the other node if it was running the module.
> Fourthly, the module checker message means that the external SQL Server module was down and the action wait was initiated from the failover rule module_failure, waiting for the restart of the external SQL Server module.
> Fifthly, the splitbrain message shows that all heartbeats have been lost and the splitbrain checker was not able to reach the witnesses. Thus, the action wait has been triggered from the failover rule splitbrain_failure, to wait for the recovery of heartbeats.


## Slide 8: Best practices to analyze the module log

- Identify the root cause of a problem
  - Read the non verbose log first
    - See “Log Messages Index” in the SafeKit User’s Guide
  - Go to the date when the problem occurred
  - Go up in the log to identify the cause
- When there is a loop of errors
  - Go up in the log to find a stable state
    - “Local state ALONE Ready”, “Local state PRIM Ready”, “Local state SECOND Ready”, “Local state UP Ready”
  - Then, go down in the log to the initial error
- When the problem is global
  - Open the logs of node1 and node2 in 2 different windows facing each other
  - Match node1 and node2 messages by date
    - Note that the clocks on nodes may be desynchronized


### Speaker notes

> Let's consider now the best practices to analyze the module logs.
> Start by identifying the root cause of a problem. For that, firstly, read the non verbose log. Secondly, go to the date when the problem occurred. Thirdly, go up in the log to identify the cause of the problem.
> When there is a loop of errors, firstly, go up in the log to find a stable state, like "Local state ALONE Ready", "Local state PRIM Ready" or "Local state SECOND Ready" for a mirror module, or "Local state UP Ready" for a farm module. Secondly, go down in the log to find the initial error.
> When the problem is global, firstly download and open the log of node1 and node2 in two different windows, facing each other. Secondly, match node1 and node2 messages by date, to understand how the cluster has globally evolved. Be careful if the clocks on nodes were not synchronized.
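
The "go up to a stable state, then go down to the initial error" practice can be sketched as follows. This is a minimal illustration that assumes the log has already been loaded as a list of text lines, oldest first; the function names are hypothetical. Only the four stable-state strings come from the slide.

```python
# Stable-state messages listed on the slide.
STABLE_STATES = (
    "Local state ALONE Ready",
    "Local state PRIM Ready",
    "Local state SECOND Ready",
    "Local state UP Ready",
)


def last_stable_index(lines, problem_index):
    """Walk up from problem_index and return the index of the nearest
    stable-state line, or None if there is none."""
    for i in range(problem_index, -1, -1):
        if any(state in lines[i] for state in STABLE_STATES):
            return i
    return None


def lines_since_stable(lines, problem_index):
    """Return the lines between the last stable state (excluded) and the
    problem line (included); the initial error should be among them."""
    start = last_stable_index(lines, problem_index)
    begin = 0 if start is None else start + 1
    return lines[begin:problem_index + 1]
```

Reading the returned lines from top to bottom corresponds to going down in the log from the stable state to the initial error.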


## Slide 9: Analyze the script log yourself

- Log analysis in the web console or with command lines
- Click to view the output of the script execution
- `SAFEVAR/modules/AM/userlog_YYYY_MM_DDThhmmss_<script-name>.ulog`


### Speaker notes

> After examining the module log, let's now look at the script log, which contains the output messages of the restart scripts. You can analyze the script log either with the console or on the node side.
> As shown in the figure, you can click on a restart script message, such as start_prim, to open the script log for this script on the right side. You will find the outputs of service startups and stops in the script logs, along with any potential error messages that can help with debugging.
> On the node side, the script logs are located in the var directory of SafeKit under the directory of the module. Each restart script execution is logged in an individual line with its date and time of execution.
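
When many script logs accumulate in the module directory, their file names can be decoded to find the execution you need. The sketch below assumes the naming pattern shown on the slide (four-digit year, month, day, a `T`, then the time as six digits, then the script name); the function name and the sample file name in the usage note are hypothetical.

```python
import re
from datetime import datetime

# userlog_<year>_<month>_<day>T<hhmmss>_<script-name>.ulog
ULOG_RE = re.compile(r"^userlog_(\d{4}_\d{2}_\d{2}T\d{6})_(.+)\.ulog$")


def parse_ulog_name(filename):
    """Return (execution datetime, script name) for a script log file name,
    or None if the name does not follow the expected pattern."""
    match = ULOG_RE.match(filename)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y_%m_%dT%H%M%S")
    return stamp, match.group(2)
```

For example, a file named `userlog_2024_10_20T182811_start_prim.ulog` would be decoded as an execution of `start_prim` on 20 October 2024 at 18:28:11.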


## Slide 10: Best practices to analyze the script log

- Problem analysis
- Script
- Messages in logs
- Read messages from scripts in the script log
- Search for messages from scripts in the module log
- Understand application errors
  - in the event log on Windows
  - in the system log messages on Linux
  - in specific application logs (not included into the snapshot)
- start_prim.cmd
  - @echo off
  - echo "Running start_prim %*"
  - net start "myservice" /Y
  - if NOT %errorlevel% == 0 goto stop
  - goto end
  - :stop
  - "%SAFE%\safekit" printe "start_prim failed"
  - "%SAFE%\safekit" stop -i "start_prim"
  - :end
- Script log
  - ----------- 10/20 18:28:11 start_prim
  - "Running start_prim WAIT ALONE"
  - The myservice service failed to start
- Module log
  - 10-20 18:28:11 … Script start_prim
  - 10-20 18:28:12 … start_prim failed
  - 10-20 18:28:12 … Action stop called by start_prim


### Speaker notes

> Let's talk about the best practices to analyze the script log.
> Let’s walk through the problem analysis in the slide. Among the messages in the script log, such as messages 1 and 2 on the right of the slide, search for error messages, typically message 2, which indicates that the service failed to start. Scripts can log messages or execute commands that appear in the module log, as shown in points 3 and 4 on the right side of the slide. Look for error messages logged by the script, such as message 3, indicating that the start_prim script has failed. Also, search for actions initiated by the script, like message 4, which indicates that a stop action was initiated by the start_prim script.
> Finally, it is necessary to understand application errors when the application has not been started or stopped correctly. On Windows, you can find these errors in the event log. On Linux, check the system log messages. Additionally, specific application logs, which are not included in the SafeKit snapshot, can also provide valuable information.


## Slide 11

- Running the application without SafeKit


### Speaker notes

> Let's now explain how to run an application integrated inside SafeKit but without the SafeKit runtime.


## Slide 12: Running an application without SafeKit

- The need for this use case
  - You want no SafeKit processes running, only your application
  - You want to start/stop your application with the start_prim/stop_prim or start_both/stop_both scripts
  - You want to set the virtual IP address as an alias on the local network interface, if `<vip>` is defined in userconfig.xml
  - You want to pass environment variables defined in `<user>` of userconfig.xml
- Procedure for a module named AM
  - Stop AM on all nodes
  - Log in as an administrator/root
  - Open a PowerShell/shell console
  - To start the application, run the command
    - `SAFE/private/modules/AM/bin/AM_start_wrapper.{ps1,sh}`
  - To stop the application, run the command
    - `SAFE/private/modules/AM/bin/AM_stop_wrapper.{ps1,sh}`
  - Note: wrappers are generated at each configuration of the module


### Speaker notes

> Let's first discuss the need for this use case. You want no SafeKit processes running, only your application. You want to start or stop your application using the start_prim and stop_prim scripts in a mirror module, or the start_both and stop_both scripts in a farm module. You want to set the virtual IP address as an alias on the local network interface if a `<vip>` tag is defined in userconfig.xml. Lastly, you want to pass environment variables defined in the `<user>` tag of userconfig.xml to the start and stop scripts.
> Now let's examine the procedure for a module named AM.
> Firstly, stop AM on all nodes.
> Secondly, log in as an administrator on Windows or root on Linux.
> Thirdly, open a PowerShell on Windows or a shell console on Linux.
> To start the application, go to the private/modules directory and run the start wrapper script of the module.
> To stop the application, run the stop wrapper script of the module.
> Please note that wrappers are generated at each configuration of the module.
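
As a small aid, the wrapper location for a given module and platform can be computed as below. This is only a sketch of the path layout given on the slide (`SAFE/private/modules/<module>/bin/<module>_<action>_wrapper.ps1` or `.sh`); the function name is hypothetical, and the installation roots in the usage note are placeholders, not actual SafeKit defaults.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath


def wrapper_path(safe_root, module, action, system=None):
    """Return the path of the start or stop wrapper of a module.

    safe_root is the SafeKit installation root (SAFE on the slide);
    system defaults to the current platform ("Windows", "Linux", ...).
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    system = system or platform.system()
    # PowerShell wrapper on Windows, shell wrapper elsewhere.
    if system == "Windows":
        root, ext = PureWindowsPath(safe_root), "ps1"
    else:
        root, ext = PurePosixPath(safe_root), "sh"
    name = f"{module}_{action}_wrapper.{ext}"
    return str(root / "private" / "modules" / module / "bin" / name)
```

For instance, with a placeholder root of `/opt/safekit` on Linux, `wrapper_path("/opt/safekit", "AM", "start", "Linux")` yields the start wrapper of module AM. Remember that the wrappers only exist after the module has been configured.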


## Slide 13

- Take module snapshots for support


### Speaker notes

> Let’s see how to take module snapshots for supporting a customer.


## Slide 14: How to take a module snapshot?

- With the web console
  - Open the module menu
  - Select Debug/Download the snapshots
- With the safekit command
  - On each node
    - Log in as administrator/root
    - Open a PowerShell/shell console
    - Change directory to the SafeKit installation root path
    - Run with an absolute path
      - `safekit snapshot -m mirror /abs_path/snapshot_node1_mirror.zip`
  - Repeat the operation on the other node
- Take a snapshot of the module on each node for offline and in-depth analysis
- Take a dump in real-time when a replication problem arises
  - `safekit dump -m AM` (dumps are collected in snapshots)
  - Execute on both nodes when the problem arises to save real-time replication logs
  - Saved in `SAFEVAR/snapshots/modules/AM/dump_yyyy_mm_dd_hh_mm_ss`


### Speaker notes

> To support a customer, take a module snapshot on each node, either with the web console or with the safekit command. As shown in point 1 of the figure, you can download both mirror module snapshots from node 1 and node 2 by clicking on the global menu of the mirror module and opening the debug submenu. This downloads two .zip files to your workstation, one module snapshot per node.
> To do the same with the safekit command, execute the following procedure on node 1 and then on node 2:
> Firstly, log in as an administrator on Windows or root on Linux.
> Secondly, open a PowerShell on Windows or a shell on Linux.
> Thirdly, go to the root directory of SafeKit.
> Fourthly, run the safekit snapshot command, providing the module name and the absolute path of the .zip file. This file will be created and will contain the module snapshot of the node.
> Fifthly, repeat the same procedure on the other node.
> There is also a dump command that captures all SafeKit logs in real time, which is helpful when a replication problem occurs. The dumps are then included in the snapshot.


## Slide 15: What is a module snapshot?

- What is a module snapshot?
  - A snapshot is a zip file associated with a node that contains:
    - the last 3 configurations
    - the last 3 dumps of the module
- Internals of a module snapshot


### Speaker notes

> Let's now examine what a snapshot is and where you can find interesting information to analyze a problem. A snapshot is a zip file that contains the last three configurations of a module and the last three dumps of a module on a node. Inside the zip file, you will find directories associated with these configurations and these dumps.


## Slide 16: Configuration files of a module in a snapshot

- Problem analysis
- Configuration files
- Inside the snapshot
- Select the last configuration
- Verify userconfig.xml and scripts for troubleshooting with the application integration
- If necessary, compare the different configurations


### Speaker notes

> Let's examine the module configuration in a snapshot. As illustrated in the right figure, begin by opening the directory containing the latest configuration. Within this directory, you'll find the conf subdirectory, which contains the userconfig.xml file of the module. Additionally, you’ll find the bin subdirectory, which contains the start_prim and stop_prim scripts for a mirror module, or the start_both and stop_both scripts for a farm module. By examining these files, you can see how the application has been configured within the module and troubleshoot the application integration. If needed, you can compare these files with previous configurations to identify any changes or discrepancies.


## Slide 17: Text files in a dump including logs and more

- Problem analysis
- Text files
- Inside the snapshot
- Select the dump with the date closest to the problem.
- Open the module log.txt (checker actions, module state transitions).
- Check the scripts logs for application start/stop issues (userlog subdirectory)
- Search for errors in the Windows or Linux event logs
- Check cluster.xml for network communication issues between nodes
- Check heartplug.txt for a detailed network configuration of the node
- Check heartplug.txt for the OS and SafeKit versions, license, list of installed modules…
- log.txt
- logverbose.txt
- applicationevtx.txt systemevtx.txt
- cluster.xml
- heartplug.txt
- commandlog.txt


### Speaker notes

> Let’s now consider how to analyze a problem in a module dump. As shown in the figure on the right, first open the dump with the date closest to the problem. In this dump, open the module log.txt file to review checker actions and module state transitions. Then, check the script logs in the userlog subdirectory for application start or stop issues. Also, use the Windows or Linux event logs to understand application errors. If you have network communication issues between nodes, examine the cluster.xml file to see how the networks are configured. Also check the detailed network configuration of the node in the heartplug.txt file, which lists the network interfaces with their physical IP address, netmask, and virtual IP address if configured on the node. The heartplug.txt file also contains the OS and SafeKit versions, the license, the list of installed modules, the list of services installed on the node, and much more. Finally, the commandlog.txt file provides the list of SafeKit commands run on the node.
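
Selecting the dump closest to the problem can be done by eye, or with a short script such as the sketch below. It assumes that the dump directories inside the snapshot follow the `dump_yyyy_mm_dd_hh_mm_ss` naming given earlier for dumps; the function names and the sample directory names in the usage note are hypothetical.

```python
from datetime import datetime


def dump_time(name):
    """Decode the date and time carried by a dump directory name."""
    return datetime.strptime(name, "dump_%Y_%m_%d_%H_%M_%S")


def closest_dump(names, problem_time):
    """Return the dump name whose date is closest to problem_time.

    Raises ValueError if names is empty or a name is malformed.
    """
    return min(names, key=lambda name: abs(dump_time(name) - problem_time))
```

For example, given dumps taken at 18:00:00 and 18:30:00 on the same day and a problem observed at 18:28:11, the sketch selects the 18:30:00 dump.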


## Slide 18: Csv files in a dump for analysis with Excel

- Analysis with Excel
- csv files
- Smart extraction
- Import the csv files into an Excel sheet
- Create a new sheet
- From the Data tab, import From Text/CSV
- In the dialog box, locate and double-click the csv file to import and click Import, then Load
- For a more precise date, format cells with Number/Custom: dd/mm/yyyy hh:mm:ss.000
- Use the Excel features to filter rows
- logverbose.csv
- resource.csv
- resourcelog.csv
- clusterstate.csv
- commandlog.csv


### Speaker notes

> To facilitate searching and sorting through messages, the dump also provides CSV files that can be imported into Excel. For instance, with the logverbose.csv file, once it's in an Excel table, you can easily apply filters to view only specific types of messages. This makes it much simpler to focus on the information you need.
> This is a way to present a detailed report to the client about an issue that occurred in the cluster, leveraging the full power of Excel. Additionally, you can combine the logverbose.csv files from both nodes into a single Excel file and sort by date. This allows you to see how the cluster has evolved overall, providing a comprehensive view of the global cluster's behavior over time.
> The procedure on the left explains how to import a CSV file into an Excel file. Just be careful with the date formatting to ensure millisecond precision. The primary file to consider for importing into Excel is the verbose log of the module. However, you also have the files for module resources status and their history, as well as the SafeKit cluster state and the commands log of the node.
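
If you prefer a script to Excel for the combined view of both nodes, the merge can be sketched as follows. The column layout of logverbose.csv is not described here, so this example assumes that each row has already been parsed into a `(timestamp, message)` pair; the function name and the optional clock offset parameter are hypothetical additions for the case where node clocks are desynchronized.

```python
from datetime import datetime, timedelta
from itertools import chain


def merge_node_logs(node1_rows, node2_rows, node2_offset=timedelta(0)):
    """Merge two lists of (timestamp, message) rows into one timeline.

    node2_offset is added to node2 timestamps to compensate for a known
    clock difference between the nodes. Returns a list of
    (timestamp, node, message) tuples sorted by timestamp.
    """
    tagged1 = ((ts, "node1", msg) for ts, msg in node1_rows)
    tagged2 = ((ts + node2_offset, "node2", msg) for ts, msg in node2_rows)
    # sorted() is stable: on equal timestamps, node1 rows come first.
    return sorted(chain(tagged1, tagged2), key=lambda row: row[0])
```

The result shows how the cluster evolved globally, in the same spirit as sorting the combined Excel sheet by date.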


## Slide 19: Other files in a dump to check license or web configuration

- Problem analysis
- Additional files
- Inside the snapshot
- Check the license file for SafeKit license issues
- Check the Apache configuration for web service issues


### Speaker notes

> As shown in the figure on the right, there are three additional subdirectories in the module dump. The licenses subdirectory contains the SafeKit license key files. This can be useful for resolving issues related to the license key, such as hostnames in the license key file that do not match the hostnames of the nodes. The var subdirectory is a copy of the var directory of the node, including various runtime data. Lastly, the web directory contains the Apache configuration for the SafeKit web service, detailing how the web service is set up and managed. This can be useful for resolving issues with the SafeKit web service.


## Slide 20

- Web console issues


### Speaker notes

> Let's now examine the issues that may arise with the web console.


## Slide 21: Web console load or connection errors

- Check the browser
- Check the SafeKit nodes
- Proxy and security settings
- Same release number for the web console and the SafeKit nodes
- Clear the browser’s cache with CTRL and SHIFT while tapping the DELETE key
- Firewall and SafeKit web service setup
- SafeKit cluster configuration
- safeadmin and safewebserver services are running
- For HTTPS, see "Connection issues with the HTTPS web console" in the SafeKit User’s Guide


### Speaker notes

> You may encounter web console load or connection errors. First, check the browser. Verify the proxy and security settings in the browser. Ensure that the web console and the SafeKit nodes are running the same release number. Clear the browser’s cache by pressing CTRL and SHIFT while tapping the DELETE key. Secondly, check the SafeKit nodes. Check the firewall and SafeKit web service setup. Verify the SafeKit cluster configuration. Ensure that the safeadmin and safewebserver services are running. For HTTPS, refer to the "Connection issues with the HTTPS web console" section in the SafeKit User’s Guide.


## Slide 22: Authentication troubleshooting with the web console

- Same password on all nodes is mandatory
- To reset the password, on each node:
  - Log in as administrator/root
  - Open a console (PowerShell, shell, ...)
  - Run the `webservercfg` command with your password
- Authentication failure of admin user or connection errors on the other node


### Speaker notes

> To troubleshoot authentication issues with the web console, ensure that the same password for the admin user is used on all nodes. To reset the password on each node, log in as administrator on Windows or as root on Linux, open a console such as PowerShell or a shell, and run the `webservercfg` command with your password. Setting the same password on both nodes should resolve any authentication failures of the admin user or connection errors on the other node.


## Slide 23

- Command line issues


### Speaker notes

> Let's now examine the issues that may arise with the command lines.


## Slide 24: Unsuccessful global command safekit -H "..."

- Test the global command
  - Log in as administrator/root
  - Open a console (PowerShell, shell, …)
  - Example with an issue on node1: `safekit -H "*" level`
    - ---------------- Server=http://10.0.0.107:9010 ----------------
    - curl: (22) The requested URL returned error: 401 Unauthorized
    - ---------------- Server=http://10.0.0.108:9010 ----------------
    - admin action=exec
    - ----------------------------- Versions -----------------------------
    - Serveur : node2
    - SE           : Microsoft Windows Server Standard [64-bit] (10.0.17763 ) Server
    - SafeKit   : 7.5.1.50
    - Licence  : PAS de licence : Démo 3 jours
    - Success
- Reset password on all nodes
  - Same password is mandatory on all cluster nodes
  - To reset it, on each node:
    - Log in as administrator/root
    - Open a console (PowerShell, shell, …)
    - Run with pwd = your password
      - `SAFE/private/bin/webservercfg -passwd pwd`
        - reset password for the global command and the admin user of the web console
      - or
      - `SAFE/private/bin/webservercfg -rcdmpasswd pwd`
        - reset password only for the global command
- Authentication issue for the global command (or the web console)


### Speaker notes

> If a global command run with `safekit -H` fails, it may be due to an authentication issue.
> In this case, test the global command by logging in as administrator on Windows or root on Linux and opening a console such as PowerShell or a shell. Test the global command with `safekit -H "*" level`. If there is a curl error with the 401 Unauthorized message, it means that there is an authentication issue between nodes.
> This problem may occur because the password is not the same on all nodes. To reset it, on each node, log in as administrator on Windows or root on Linux, open a console such as PowerShell or a shell, and run the `webservercfg` command with the `-passwd` option to reset the admin password. Alternatively, run `webservercfg` with the `-rcdmpasswd` option to reset only the password of the global command.
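
On a cluster with many nodes, spotting the failing server in the output of the global command can be automated. The sketch below assumes the output layout shown in the slide example, with one `Server=<url>` separator line per node followed by that node's result; the function name is hypothetical and the parsing is only an illustration, not a SafeKit tool.

```python
import re

# Separator line such as:
# ---------------- Server=http://10.0.0.107:9010 ----------------
SERVER_RE = re.compile(r"^-+\s*Server=(\S+)\s*-+$")


def unauthorized_servers(output):
    """Return the server URLs whose section contains a 401 Unauthorized error."""
    failing = []
    current = None
    for line in output.splitlines():
        match = SERVER_RE.match(line.strip())
        if match:
            current = match.group(1)
        elif current and "401 Unauthorized" in line and current not in failing:
            failing.append(current)
    return failing
```

Applied to the example on the slide, the sketch reports the first server, which is the node whose password must be reset.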


## Slide 25

- Mirror module issues


### Speaker notes

> Let's now examine the issues that may arise with a mirror module.


## Slide 26: No nodes are uptodate

- When?
  - At first installation
  - After a double simultaneous power outage
  - Resource rfs.uptodate is down on both nodes (`safekit state -v -m AM`)
- Problem solving
  - Select the node that you consider as up-to-date and start it as primary
    - `safekit prim -m AM`
  - Start the other node with full synchronization of data
    - `safekit second fullsync -m AM`
- Console states at each step
  - Before: node1 (not uptodate), node2 (not uptodate)
  - After starting node1 as primary: node1 (uptodate), node2 (not uptodate)
  - PRIM - SECOND: node1 (uptodate), node2 (uptodate)


### Speaker notes

> An issue arises in a mirror module when no nodes are up to date. In this case, as shown at the bottom of the slide, the console shows two nodes in the STOP red state with the "not uptodate" message. If you try to start both nodes with the start command, they will both go into the WAIT red state, because they will detect that they are not up to date and cannot start without resynchronizing their local data.
> This situation arises at the first installation or after a double simultaneous power outage. In this case, the resource rfs.uptodate is down on both nodes, and you can check this value using the `safekit state -v` command.
> To solve this problem, select the node that you consider to be up-to-date, let's say node 1 in our example, and start it as the primary node using the safekit prim command. Node 1 will then transition to the ALONE green state. Next, start the other node, node 2 in our example. Use the safekit second fullsync command if you want a full synchronization of its data. After the synchronization, the cluster will return to the PRIM-SECOND green state.
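
The recovery order matters: the node you trust must be started as primary before the other node resynchronizes from it. The sketch below simply builds the ordered list of commands from the slide for a chosen node; the function name is hypothetical, and the commands are returned as text rather than executed.

```python
def recovery_plan(up_to_date_node, other_node, module="AM"):
    """Return the ordered (node, command) steps to recover a mirror module
    when no node is up to date."""
    if up_to_date_node == other_node:
        raise ValueError("the two nodes must be different")
    return [
        # Step 1: force the trusted node to start as primary.
        (up_to_date_node, f"safekit prim -m {module}"),
        # Step 2: start the other node with a full data synchronization.
        (other_node, f"safekit second fullsync -m {module}"),
    ]
```

Choosing the wrong node as up to date would overwrite the more recent data during the full synchronization, so the choice in step 1 should be made carefully.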


## Slide 27: Application not operational

- Application not operational
  - node1 (uptodate) in the PRIM state or in the ALONE state
- Application
  - Not started
  - Not operational
  - Inaccessible by clients
- Problem solving
  - Verify scripts start_prim/stop_prim
  - Search for errors in the script log `SAFEVAR/modules/AM/userlog_YYYY_MM_DDThhmmss_start_prim.ulog`
  - Search for errors in the application and system event logs
  - Search for errors in the error logs specific to the application
  - Try a local restart of the application with `safekit restart -m AM`
  - If PRIM, try a failover to start the application on the other node with `safekit stop -m AM`
  - Verify that the application's clients are configured to connect to the virtual IP
  - Reboot the node


### Speaker notes

> Let's consider a node that is in the PRIM green state or ALONE green state, which typically means the application is running. However, the application is not started, not operational, or inaccessible by clients.
> To address this issue, first verify the start_prim and stop_prim scripts and search for errors in the script log of the module, located in the SafeKit var directory. Then, check for errors in the application and system event logs of Windows or Linux. Additionally, search for errors in the error logs specific to the application.
> You can try a local restart of the application using the safekit restart command. If the node is in the PRIM state, attempt a failover to start the application on the other node using the safekit stop command on the PRIM node. Ensure that the application's clients are configured to connect to the virtual IP. If all else fails, reboot the node.


## Slide 28: File synchronization failures

- 3 successive synchronization failures on node 2
  - node1 (uptodate), node2 (not uptodate)
  - On synchronization failure, data are inconsistent on node2
  - `<service maxloop="3" loop_interval="24">` in userconfig.xml
    - stopstart after failure
    - stop after 3 failures
- Problem solving
  - Search for errors in the module log on both nodes
    - `safekit logview -A -m AM`
  - File access error on the secondary
    - Check that the application is correctly stopped on the secondary and that its start at boot is manual
    - Exclude replicated directories from antivirus scanning
  - Timeout on request processing
    - Check that the network and the disk are working properly
    - Start the secondary during an off-peak period


### Speaker notes

> Let's now consider a file resynchronization issue with node 2. Node 2 attempts resynchronization of its files three times (the maxloop value) before stopping. Each time, the resynchronization fails, and ultimately, the data are not resynchronized on node 2. As a result, the files remain inconsistent on node 2 due to partial resynchronization.
> When faced with this problem, the first step is to search for errors in the module log on both nodes. Start with the module log of the secondary node. Use the `safekit logview` command with the `-A` option to get the verbose log and retrieve all messages from the resynchronization process.
> If you encounter a file access error in the module log of the secondary node, ensure that no remaining process of the application locks the files, preventing the resynchronization process from succeeding. Additionally, verify that the stop_prim script stops all services of the application and that these services are set to start manually at boot.
> Also, exclude replicated directories from antivirus scanning to prevent interference between the antivirus checking files and the resynchronization process.
> In the case of a timeout on request processing, verify that both the network and the disk are functioning properly. It's also advisable to start the secondary during an off-peak period to avoid any potential issues.


## Slide 29: ALONE degraded

- Before
- ALONE degraded
- Problem solving
- PRIM-SECOND
- Functional replication
- Non-functional replication
- Application still running on ALONE
- Resource rfs.degraded is up: safekit state -v -m AM
- Return to normal mode by forcing a stop-start of the primary
- safekit stop -m AM
- safekit prim -m AM
- node1 (degraded)
- node2 (not uptodate)


### Speaker notes

> Let's now consider the issue with the ALONE degraded state on node 1. Before the degradation, the state was PRIM-SECOND, indicating a functional replication.
> Then, due to an issue in the replication mechanisms of node 1, the PRIM node transitions to the ALONE green degraded state, while the other node transitions to the WAIT red state, waiting to resynchronize data from the ALONE node. The application is still running on the ALONE degraded node, but the replication mechanisms are non-functional, and it is not possible to resynchronize data from the ALONE degraded node.
> You can check the degraded status through the resource named rfs.degraded, which is displayed by the `safekit state` command with the `-v` option.
> To solve this problem and restore the replication mechanisms on the ALONE degraded node, stop node 1 with the safekit stop command and then start it as primary with the safekit prim command. Then, node 2 in the WAIT red state will be able to resynchronize its data.


## Slide 30: Communication failure between both nodes

- Communication failure
- Cluster states are not coherent
- safekit state -v -m AM
- Problem solving
- Verify on each node:
- safeadmin service
- Same signature for the SafeKit cluster
    - safekit cluster confinfo
- Same signature and id for the module
    - safekit -H "*" cluster state
- Firewall setup
- Network connection and DNS resolver
- node1 (uptodate)
- node2 (not uptodate)


### Speaker notes

> Let’s now consider a strange and incoherent state of the cluster, where node 1 is in the ALONE green state, ready to accept a resynchronization from node 2. However, node 2 is in the WAIT red state, waiting for node 1 to accept its resynchronization. In this case, there is a communication issue between node 1 and node 2. Note that the states of each node can be displayed using the `safekit state` command.
> To solve this issue, verify the following on each node:
> Confirm that the `safeadmin` service is running properly.
> Ensure that the SafeKit cluster has the same signature on both nodes by using the `safekit cluster confinfo` command.
> Confirm that the module has the same signature and ID by using the `safekit cluster state` command with the `-H "*"` option.
> Check the firewall setup to ensure there are no blocking rules.
> Finally, verify the network connection and DNS resolver to ensure they are functioning correctly.
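
The last two checks (firewall, network connection and DNS resolver) can be partly automated. The sketch below is only an illustration, not a SafeKit tool: the function name `check_node` is hypothetical, and port 9010 is used by default because the slides name it as the port of the SafeKit web service. A successful connection to that port does not prove that every port used between the nodes is open.

```python
import socket


def check_node(host, port=9010, timeout=3.0):
    """Resolve `host` and try a TCP connection to `host:port`.

    Returns (addresses, reachable). `addresses` is empty when name
    resolution fails; `reachable` is False when the connection is
    refused, filtered by a firewall, or times out.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return [], False  # DNS resolver problem
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True
    except OSError:
        return addresses, False  # network or firewall problem
```

Run it on each node against the other node's name, for example `check_node("node2")` from node 1 (the host name is an assumption). An empty address list points to name resolution, while a resolved address with `reachable` set to `False` points to the network path or a firewall rule.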


## Slide 31

- Farm module issues


### Speaker notes

> Let's now examine the issues that may arise with a farm module.


## Slide 32: Virtual IP issues

- Virtual IP issues
- A mosaic test is delivered with SafeKit.
- Set the following load balancing rule in userconfig.xml:
    - `<rule port="9010" proto="tcp" filter="on_port"/>`
- On an external workstation, connect a browser to the URL: http://virtip:9010/safekit/mosaic.html
- Problem solving
- Check that you are on an external workstation and not on the nodes themselves
- Check that the network, on which the virtual IP is configured on both nodes, is in the same subnet for both servers (prerequisite).
  - On Windows: ipconfig /all
  - On Linux: ip addr show
- On each node, check the connections on the virtual IP
  - On Windows: `netstat -an | findstr <virtual IP>`
  - On Linux: `netstat -an | grep <virtual IP>`
- Stop/start the module on each node to check which one is taking connections


### Speaker notes

> The first thing you want to know with a farm module is whether the load balancing on the virtual IP address is functioning correctly. SafeKit provides a way to verify this with a mosaic through its web service.
> First, set a load balancing rule on TCP port 9010, which is the port of the SafeKit web service, and configure the filter with the "on_port" value. This configuration allows the TCP sessions of the same client to be load balanced across all nodes of the farm.
> Secondly, from an external workstation (be careful, not a local node of the cluster), connect a browser to the virtual IP using HTTP, port 9010, and the mosaic.html URL. In the form that follows, enter the name of the module to be tested.
> If the farm module is running on both nodes, meaning node 1 is in the UP green state displaying 50% and node 2 is also in the UP green state displaying 50%, you will see a colored mosaic. Each square in the mosaic represents a TCP connection response from either node 1 or node 2. However, if the load balancing is not working, you will only see responses from one node.
> To solve an issue with virtual IP load balancing, follow these steps:
> Firstly, ensure that you are on an external workstation and not on the nodes themselves. On a node, a browser will display only local TCP connections to that node.
> Secondly, verify that the network interface, on which the virtual IP is configured as an alias IP on both nodes, is in the same subnet for both nodes. This is a prerequisite for proper load balancing. To check the network configuration, use `ipconfig /all` on Windows and `ip addr show` on Linux.
> Thirdly, on each node, check if the node is managing connections on the virtual IP. To do this, use the netstat command with a filter on the virtual IP and check if there are any established connections or connections in the TIME_WAIT state. This will help you determine if the node is actively handling network traffic on the virtual IP.
> Fourthly, stop and start the module on each node to determine which one is taking connections and which one is not.
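
The subnet prerequisite and the netstat check can be sketched in Python as follows. This is a minimal illustration with hypothetical helper names (`same_subnet`, `count_states`). It assumes interface addresses are written in `address/prefix` form and that the text comes from `netstat -an`, where the first `address:port` column is the local address and the last column is the TCP state. Only IPv4-style addresses are handled.

```python
import ipaddress
from collections import Counter


def same_subnet(*interfaces):
    """Return True if all 'address/prefix' strings are in the same network."""
    networks = {ipaddress.ip_interface(i).network for i in interfaces}
    return len(networks) == 1


def count_states(netstat_output, virtual_ip):
    """Count TCP states of connections whose local address is `virtual_ip`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP lines, and anything too short to be a TCP entry.
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        endpoints = [f for f in fields[1:] if ":" in f]
        if not endpoints:
            continue
        local_ip = endpoints[0].rsplit(":", 1)[0]
        if local_ip == virtual_ip:
            states[fields[-1]] += 1
    return states
```

A node that returns no `ESTABLISHED` or `TIME_WAIT` entries for the virtual IP while the other node does is a candidate for the node that is not taking connections.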


## Slide 33: Communication failure between the nodes

- Communication failure
- Nodes do not see each other's state (safekit logview -A -m AM)
- The load is not balanced (100% / 100%)
- Internal protocol failure
- Problem solving
- Verify on each node:
- safeadmin service
- Same signature for the SafeKit cluster
    - safekit cluster confinfo
- Same signature and id for the module
    - safekit -H "*" cluster state
- Firewall setup
- Network connection and DNS resolver
- node2
- node1
- 100%
- 100%


### Speaker notes

> Let’s now consider a strange and incoherent state of the cluster, where node 1 is in the UP green state managing 100% of the traffic and node 2 is also in the UP green state managing 100% of the traffic. In this case, there is a communication issue between node 1 and node 2.
> To solve this issue, verify the following on each node:
> Confirm that the `safeadmin` service is running properly.
> Ensure that the SafeKit cluster has the same signature on both nodes by using the `safekit cluster confinfo` command.
> Confirm that the module has the same signature and ID by using the `safekit cluster state` command with the `-H "*"` option.
> Check the firewall setup to ensure there are no blocking rules.
> Finally, verify the network connection and DNS resolver to ensure they are functioning correctly.


## Slide 34: Application not operational

- Application not operational
- Application
- Not started
- Not operational
- Inaccessible by clients
- Problem solving
- Verify scripts start_both/stop_both
- Search for errors in the script log: SAFEVAR/modules/AM/userlog_YYYY_MM_DDThhmmss_start_prim.ulog
- Search for errors in the application and event system logs
- Search for errors in the error logs specific to the application
- Try a local restart of the application with safekit restart -m AM
- Verify that the application's clients are configured to connect to the virtual IP
- node1


### Speaker notes

> Let's consider a node that is in the UP green state, which typically means the application is running. However, the application is not started, not operational, or inaccessible by clients.
> To address this issue, first verify the start_both and stop_both scripts and search for errors in the script log of the module, located in the SafeKit var directory. Then, check for errors in the application and system event logs of Windows or Linux. Additionally, search for errors in the error logs specific to the application.
> You can try a local restart of the application using the safekit restart command.
> Ensure that the application's clients are configured to connect to the virtual IP.


## Slide 35

- Checker issues


### Speaker notes

> Let's now examine the issues that may arise with checkers.


## Slide 36: Deactivate errd

- Deactivation without reconfiguration
- Deactivate with
- the console
- Select the node / "Disable/Enable" / "Processes/services monitoring" / Disable
- (Enable to reactivate)
- the command
    - safekit errd off -m AM
    - (on to reactivate)
- Resource: usersetting.errd="off"
- Deactivation with reconfiguration
  - Edit the module configuration with
    - the console http://host:9010/console/en/configuration/modules/AM/config
    - text editor
    - SAFE/var/module/AM/conf/userconfig.xml
  - Set action="noaction" in the errd section, or comment out the section, and save
    - `<!-- Start of comment  <errd>  …  </errd>  End of comment -->`
  - Stop the module to apply the new configuration
- Application maintenance or errd bad behaviour


### Speaker notes

> For maintenance operations that require stopping the application monitored by SafeKit, it may be necessary to disable process monitoring to avoid an automatic restart triggered by errd. This might also be the case if errd triggers automatic restarts that are not desired by the user.
> Firstly, let’s examine how to deactivate errd without reconfiguring the module. To suspend or resume process and service monitoring, you can either use the console or the commands `safekit errd off` and `safekit errd on`. To check if errd monitoring is on or off, you can consult the status of the resource named usersetting.errd.
> Secondly, let’s examine how to deactivate errd with reconfiguration of the module. For this, you can edit userconfig.xml either with the console or directly on the node. You can set action="noaction" in the errd section of userconfig.xml to deactivate the monitoring of a process or a service. This way, you will only receive a message in the module log, but no action will be triggered. Alternatively, you can comment out the errd section. Note that the commented-out section must not itself contain comments, because XML comments cannot be nested.
> Then, you need to stop the module, distribute the userconfig.xml on all nodes, reconfigure the module on all nodes, and then restart the module.
> All these operations are simpler to perform using the console.
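
The rule about comments within comments can be checked before redistributing the file. The sketch below is only an illustration: `active_sections` is a hypothetical helper, and the sample documents in the usage note are simplified stand-ins, not a real userconfig.xml. It relies on the Python standard XML parser, which ignores comments and rejects a `--` sequence inside a comment.

```python
import xml.etree.ElementTree as ET


def active_sections(config_text, tag):
    """Return how many <tag> elements are still active in the configuration.

    Commented-out sections are not counted. Raises ET.ParseError when the
    text is not well-formed XML, for example when a commented-out section
    itself contains a comment.
    """
    root = ET.fromstring(config_text)
    return len(list(root.iter(tag)))
```

For a simplified document such as `<safe><service><!-- <errd>…</errd> --></service></safe>`, `active_sections(text, "errd")` returns 0. If the commented-out section still contains an inner `<!-- … -->`, the call raises `ET.ParseError` instead, which signals that the inner comment must be removed first.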


## Slide 37: Deactivate checkers

- Deactivation without reconfiguration
- When the module is started, deactivate with
- the console
- Select the node / "Disable/Enable" / "Checkers" / Disable
- (Enable to reactivate)
- the command
    - safekit checker off -m AM
    - (on to reactivate)
- Resource: usersetting.checker="off"
- Deactivation with reconfiguration
  - Edit the module configuration with
    - the console http://host:9010/console/en/configuration/modules/AM/config
    - text editor
    - SAFE/var/module/AM/conf/userconfig.xml
  - Comment out the checker and failover sections and save
    - `<!-- Start of comment  <checker>  …  </checker>  End of comment -->`
    - `<!-- Start of comment <failover> … </failover> End of comment -->`
  - Stop the module to apply the new configuration
- Application maintenance or checkers bad behaviour


### Speaker notes

> In the same way, for maintenance operations that require stopping the application monitored by SafeKit, it may be necessary to disable checkers to avoid an automatic restart triggered by them. This might also be the case if the checkers trigger automatic restarts that are not desired by the user.
> Firstly, let’s examine how to deactivate checkers without reconfiguring the module. To suspend or resume checkers, you can either use the console or the commands: safekit checker "off" and safekit checker "on". To check if the checker monitoring is on or off, you can consult the status of the resource named usersetting.checker.
> Secondly, let’s examine how to deactivate checkers with reconfiguration of the module. For this, you can edit userconfig.xml either with the console or directly on the node. You can comment out the checker section and the failover section if it exists. Note that there should be no comments included within these comments. Then, you need to stop the module, distribute the userconfig.xml on all nodes, reconfigure the module on all nodes, and then restart the module.
> All these operations are simpler to perform using the console.


## Slide 38: Maxloop: stop after 3 error detections in 24 hours

- 3 error detections in 24 hours
- Protection mechanism against false error detection
- Problem solving
- Identify the checker causing the actions
- Check that the checker is working properly
- Verify that the functionality under test is operational when the checker is active
- Delay the activation of the checker (start_after for errd; sleep at the start of a custom checker)
- Increase the error detection timeout
- Remove or comment out the checker configuration from the userconfig.xml
- `<service maxloop="3" loop_interval="24">` in userconfig.xml
- action restart or stopstart
- requested by a checker
- stop after 3 actions in 24h.
- Following messages in the module log
- Action stop called by maxloop


### Speaker notes

> A common behavior of SafeKit that is often misunderstood is a restart or stopstart loop of a module on a node followed by a stop. This loop corresponds to an error detection on a node, an attempt to trigger a restart or stopstart action that does not resolve the error, reproducing the same error detection and a new restart or stop-start attempt, and so on. The loop is bounded by the maxloop variable in userconfig.xml, which causes the module to stop on the node. In this case, you will see the message “Action stop called by maxloop” in the module log.
> To solve this issue, verify the following:
> First, identify the checker causing the actions in the module log. Next, check that the checker is working properly, especially if it is a custom checker. Verify that the functionality under test is operational when the checker is active. If needed, delay the activation of the checker by using the start_after parameter for errd or by adding a sleep command at the start of a custom checker. The checker may indeed be starting too early, before the application is fully launched. Additionally, increase the error detection timeout to allow more time for the application to respond to the repetitive tests of the checker. Finally, if necessary, remove or comment out the checker configuration from the userconfig.xml file.
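
To make the maxloop behavior concrete, here is a simplified model of the loop described above. It is not SafeKit's implementation: the function name `decide_actions` is hypothetical, and the model assumes that a detection counts toward maxloop only if it falls within the last loop_interval hours, and that the module stops on the detection that reaches maxloop. The exact counting and reset rules of the product may differ.

```python
from datetime import datetime, timedelta


def decide_actions(detections, maxloop=3, loop_interval_hours=24):
    """Return the action taken for each error detection, in order.

    `detections` is a chronologically sorted list of datetimes. Each
    detection triggers "restart" until `maxloop` detections fall inside
    the sliding `loop_interval_hours` window, which triggers "stop".
    """
    window = timedelta(hours=loop_interval_hours)
    recent = []
    actions = []
    for when in detections:
        # Forget detections that are older than the sliding window.
        recent = [t for t in recent if when - t < window]
        recent.append(when)
        if len(recent) >= maxloop:
            actions.append("stop")  # models "Action stop called by maxloop"
            break
        actions.append("restart")
    return actions
```

Under this model, three detections one hour apart end with a stop, whereas three detections spread over more than 24 hours each lead to a restart. This is why a checker that fires too early after every start quickly exhausts maxloop.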


## Slide 39: Thank you!

_No extractable slide text found._


### Speaker notes

> Thank you for your attention. If you have any questions or need further clarification, please feel free to ask.
