
Failover and Data Recovery in Search Service and Caches

This document outlines the failover and data recovery processes for both the Search Service and Cache servers, detailing behaviors during outages and recovery mechanisms.

Failover and Data Recovery in Search Service

During a failure, your ingress replicators or search service nodes may become unreachable. This section describes what happens during such an outage and how the service recovers.

Note: To help avoid non-recoverable disk failures, configure the ingress replicator journals and search service indexes to use durable storage. Allocate at least 20 GB for each ingress replicator journal and at least 50 GB for each search service index. Regularly monitor these storage volumes, ensuring at least 25% free capacity is maintained.
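As a rough illustration of that monitoring guidance, the sketch below checks that the volumes holding the journal and the index keep at least 25% free capacity. It is a minimal example, not part of the product: the mount points are hypothetical and should be replaced with the paths of your actual storage volumes.

    import shutil

    # Hypothetical mount points; substitute the volumes that actually hold
    # the ingress replicator journal and the search service index.
    VOLUMES = {
        "ingress replicator journal": "/data/ingress-journal",
        "search service index": "/data/search-index",
    }

    MIN_FREE_RATIO = 0.25  # keep at least 25% free, per the note above

    for name, path in VOLUMES.items():
        usage = shutil.disk_usage(path)
        free_ratio = usage.free / usage.total
        status = "OK" if free_ratio >= MIN_FREE_RATIO else "LOW"
        print(f"{name}: {free_ratio:.0%} free ({status})")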

Ingress Replicator Node Fails

The ingress replicator journals all activities to disk, ensuring every event is delivered at least once. If the service fails or is stopped, it will resend any remaining journaled events upon restarting. If it cannot restart due to a non-recoverable disk failure, then a full rebuild of the ingress replicator is required.
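The journaling behavior described above amounts to at-least-once delivery: an event is written to disk before it is acknowledged, and anything not yet recorded as delivered is resent after a restart. The following Python sketch illustrates that pattern in simplified form only; the file names and functions are assumptions, not the ingress replicator's actual implementation.

    import json
    import os

    JOURNAL = "journal.log"      # assumed journal file, one JSON event per line
    OFFSET = "journal.offset"    # assumed marker of the last delivered event

    def journal_event(event):
        # Persist the event to disk before acknowledging it to the sender.
        with open(JOURNAL, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay_undelivered(deliver):
        # On restart, resend everything after the last recorded offset.
        # Events may be delivered more than once, but never lost.
        start = 0
        if os.path.exists(OFFSET):
            with open(OFFSET) as f:
                start = int(f.read())
        with open(JOURNAL) as f:
            for i, line in enumerate(f):
                if i < start:
                    continue
                deliver(json.loads(line))
                with open(OFFSET, "w") as o:
                    o.write(str(i + 1))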

If both ingress replicators fail (or the only ingress replicator fails, in a single-replicator deployment), new content is not indexed during the outage. However, because the web application nodes cache the content locally, the search service catches up on the missed content once an ingress replicator comes back online, so no data is lost.

For more information, refer to Rebuilding On-prem HA Search Service.

Search Service Node Fails

When search service node 1 or 2 goes offline, the ingress replicator retains the undelivered activities. When the offline search service is restored, these activities are sent to it. While the backlog is being processed, the search indexes may be out of sync; once all activities have been received, the indexes are consistent again.
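This retention-and-replay behavior can be pictured as a per-node backlog queue that is drained once the node is reachable again. The sketch below is illustrative only; the class and method names are assumptions and do not reflect the ingress replicator's actual code.

    from collections import deque

    class ReplicatorQueues:
        # Simplified model: retain undelivered activities per search node
        # and drain the backlog once the node is restored.

        def __init__(self, nodes):
            self.pending = {node: deque() for node in nodes}

        def send(self, node, activity, deliver):
            try:
                deliver(node, activity)
            except ConnectionError:
                # Node is offline: retain the activity until it is restored.
                self.pending[node].append(activity)

        def on_node_restored(self, node, deliver):
            # The node's index stays out of sync until the backlog is empty,
            # then it converges with the other search node.
            while self.pending[node]:
                deliver(node, self.pending[node].popleft())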

If the search service cannot be restored due to a non-recoverable disk failure, the affected search service must be removed and re-added.

If a search service remains down for an extended period (e.g., several weeks), disk space may run out because the ingress replicator continues to store activities until the service is restored. If you do not plan to restore the offline service, remove it from all ingress replicator configuration files and restart the ingress replicators.

For further details, see Adding an On-Premise HA Search Server.

Failover and Data Recovery in Caches

When a cache server becomes unavailable, web application nodes will indefinitely attempt to communicate with it until it comes back online.

Cache Server Availability Behavior

A web application node determines that a cache server is unavailable after 20 consecutive failed requests. The node then waits two seconds before retrying communication with the cache server, and this retry cycle continues indefinitely until the cache server is restored. During this period, web application nodes automatically redirect requests to the next available cache server.
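The detection and failover behavior can be sketched as follows. This is a conceptual model of the rules described above (20 consecutive failures, a two-second retry interval, redirection to the next available server); the class, its methods, and the server list are illustrative assumptions, not the web application's actual code.

    import time

    FAILURE_THRESHOLD = 20  # consecutive failures before a server is marked unavailable
    RETRY_INTERVAL = 2.0    # seconds to wait between retries of an unavailable server

    class CacheFailover:
        def __init__(self, servers):
            self.servers = servers                  # ordered list of cache endpoints
            self.failures = {s: 0 for s in servers}
            self.unavailable = set()

        def record_result(self, server, ok):
            if ok:
                self.failures[server] = 0
                self.unavailable.discard(server)
            else:
                self.failures[server] += 1
                if self.failures[server] >= FAILURE_THRESHOLD:
                    self.unavailable.add(server)

        def pick_server(self):
            # Redirect requests to the next available cache server.
            for s in self.servers:
                if s not in self.unavailable:
                    return s
            return None  # none reachable: fall back to local caching

        def probe_unavailable(self, ping):
            # Retry unavailable servers every two seconds, indefinitely.
            for s in list(self.unavailable):
                time.sleep(RETRY_INTERVAL)
                if ping(s):
                    self.record_result(s, ok=True)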

If the cache server is unavailable when a web application node starts up, that node uses local caching while it continues trying to reach the external cache server.

Jive Cloud Context

In cloud deployments, the cloud infrastructure automatically manages cache server resources to minimize service disruption. Automatic redirection to available cache servers is optimized for distributed environments, improving recovery times and overall performance. In addition to local caching, cloud-based applications can implement further caching strategies to improve performance while a cache server is unavailable.