Failover and Data Recovery in the Activity Engine

Understand how the Activity Engine handles activity ingress, egress, and failures.

Activity Ingress

In a clustered Jive configuration, when a Jive web application node sends an activity, it selects an Activity Engine node to receive that activity. The web app node makes the selection based on the following criteria:
User affinity
All Jive web apps in a cluster attempt to route activity from a particular user to the same Activity Engine node.
Sensitive ordering
Because user affinity is based on node ordering, we recommend keeping the list of Activity Engine nodes defined in the Activity Engine setup consistent.
Failover behavior
If an Activity Engine node becomes unavailable, the web application bans it and reroutes the affected users to the next node in the list. The new routing remains "sticky" for the next 30 requests, after which it is discarded and the original route is attempted again.
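
The selection rules above can be sketched roughly as follows. This is an illustrative sketch only, not Jive's implementation: the class ActivityRouter, the CRC32-based hash, the is_available callback, and the choice to track stickiness per original node are all assumptions made for the example. Only the ordered node list, the fallback to the next node, and the 30-request sticky window come from the description.

```python
import zlib

STICKY_REQUESTS = 30  # from the text: a reroute sticks for 30 requests


class ActivityRouter:
    """Hypothetical sketch of node selection on the web app side."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # order matters: affinity depends on it
        self.sticky = {}          # home index -> [fallback index, requests left]

    def preferred_index(self, user_id):
        # Stable hash (assumption) so every web app node maps a user
        # to the same position in the ordered node list.
        return zlib.crc32(str(user_id).encode("utf-8")) % len(self.nodes)

    def select(self, user_id, is_available):
        home = self.preferred_index(user_id)
        entry = self.sticky.get(home)
        if entry is not None:
            if entry[1] > 0 and is_available(self.nodes[entry[0]]):
                entry[1] -= 1
                return self.nodes[entry[0]]
            del self.sticky[home]  # sticky route used up: try the original again
        if is_available(self.nodes[home]):
            return self.nodes[home]
        # Home node is unavailable: walk the list in order to the next usable node.
        for step in range(1, len(self.nodes)):
            idx = (home + step) % len(self.nodes)
            if is_available(self.nodes[idx]):
                self.sticky[home] = [idx, STICKY_REQUESTS]
                return self.nodes[idx]
        raise RuntimeError("no Activity Engine node available")
```

Because the preferred index is computed from the position in the list, reordering the list would change which node each user maps to, which is why the sketch treats the node order as significant.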

Figure: Activity Engine Architecture Diagram

After the web application selects an available Activity Engine node, it attempts the delivery. If delivery fails for any reason, the web application journals the activity and attempts it again later (potentially against a different node). Delivery can fail for the following reasons:
Unavailable node
The node is unreachable or otherwise unavailable between its selection and activity delivery.
Activity not written to the Activity Engine database
The node is unable to write the activity to the database for any reason.
Activity not queued for processing
The node is unable to add the activity to its processing queue for any reason.
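
The journal-and-retry behavior described above might look something like the following sketch. The names JournalingSender, DeliveryError, send, and pick_node are hypothetical, and the journal here is an in-memory queue purely for illustration; how Jive actually stores and schedules journaled activities is not described in this document.

```python
from collections import deque


class DeliveryError(Exception):
    """Raised by the hypothetical send function when delivery fails."""


class JournalingSender:
    """Illustrative sketch: journal failed deliveries and retry them later."""

    def __init__(self, send, pick_node):
        self.send = send            # callable(node, activity); raises DeliveryError
        self.pick_node = pick_node  # callable() -> node, re-evaluated on every attempt
        self.journal = deque()      # activities waiting for another attempt

    def deliver(self, activity):
        try:
            # The node is chosen per attempt, so a retry may hit a different node.
            self.send(self.pick_node(), activity)
            return True
        except DeliveryError:
            self.journal.append(activity)  # keep the activity for a later retry
            return False

    def retry_journal(self):
        # Attempt each journaled activity once; failures go back on the journal.
        for _ in range(len(self.journal)):
            self.deliver(self.journal.popleft())
        return len(self.journal)  # number of activities still pending
```

In this sketch, all three failure cases listed above (unavailable node, failed database write, failed queueing) would surface as the same DeliveryError, since the web application reacts to each of them in the same way.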

Streams Durability

Each Activity Engine node is backed by its own Lucene stream service. While ingress remains unchanged, each node is responsible for replicating its processed data to all other nodes in the Activity Engine cluster. Each node is made aware of its siblings during the web application's registration process.

The replication procedure works as follows:
Disk queue
Fully processed activities, as well as events (for example, reads, hides, and moves), are queued to disk in a simple pipeline destined for all siblings.
Connection pool
Each Activity Engine node establishes up to 10 connections to each of its siblings on the same port used by the web app. Note: To handle replication, each node therefore establishes up to 10 additional connections per sibling and accepts up to 10 additional connections per sibling (10 x the number of siblings in each case).
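
For capacity planning, the connection counts described above can be worked out with a small calculation. The function name replication_connections is hypothetical; the limit of 10 connections per sibling is the only figure taken from this document.

```python
MAX_CONNECTIONS_PER_SIBLING = 10  # from the text


def replication_connections(cluster_size):
    """Upper bound on replication connections for one Activity Engine node."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    siblings = cluster_size - 1
    return {
        "siblings": siblings,
        # Connections this node opens to its siblings.
        "outbound": MAX_CONNECTIONS_PER_SIBLING * siblings,
        # Connections its siblings open to this node.
        "inbound": MAX_CONNECTIONS_PER_SIBLING * siblings,
    }
```

For a four-node cluster, each node has 3 siblings, so it may open up to 30 replication connections and accept up to 30 more, in addition to the connections from the web app nodes.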

Activity Engine Failover and Recovery

If an Activity Engine node goes down:
Unprocessed activity is in limbo
While the node is down, there is no way to reclaim activities that are still queued on its disk, or activities and events that are in its replication queue. You must bring the failed node back online for its queues to be processed.
Queues lost to disk failure are recoverable
In the event of a complete disk failure, in Jive 6.0, you can recover unprocessed activities by running a stream rebuild on all nodes. While this guarantees that the unprocessed activities are correctly reflected in all indexes, it is a resource-intensive procedure. Therefore, Jive Software recommends avoiding this scenario if possible.
Lucene is completely mirrored
Because each node has its own complete Lucene index, no stream data goes missing or becomes unreachable. All other data (e.g., follows and email notifications) are persisted to the database and are unaffected by an Activity Engine node failure.
Recoverable cross-communication
If an Activity Engine node is unable to reach one of its siblings during replication, the activities and events destined for that sibling are pushed to a "retry" queue and reattempted later. After a failed node recovers, replication to that node should resume within approximately 1 minute, so the streams on the affected node may be inaccurate for a brief period.
After an Activity Engine node recovers:
Processing resumes
The Activity Engine node resumes processing its disk queue, as well as the activities and events in its replication queue. No further action is required.
Corrupted queues
If the failure has corrupted one or more disk queues, or activities and events in the replication queue, the affected queues are copied off with the suffix ".corrupt" so that they can be analyzed. However, there is no way to recover corrupt queues. If your replication queue is corrupt, we strongly recommend that you run a stream rebuild on all Activity Engine nodes.
If you decommission an Activity Engine node:
Unprocessed activity can be recovered
If you decommission an Activity Engine node with unprocessed activity in its queues, then we recommend you run a stream rebuild on all nodes. This is a resource-intensive procedure, but we recommend it over attempting to move the unprocessed queue data to another node.
Cross-communication adjusts accordingly
When you remove an Activity Engine node from the web app's Activity Engine configuration, all remaining nodes are made aware of the change. Replication to the decommissioned node ceases immediately.
If you add or re-add a node:
Cross-communication adjusts accordingly
When you add an Activity Engine node to the web app's Activity Engine configuration, all current nodes are made aware of the change. Replication to the commissioned node begins immediately.
Stream rebuild is initiated
When an Activity Engine node is registered with the web app, it is able to detect whether it was a new addition (or re-addition). The node flags itself for a stream rebuild to ensure that its index contains all required stream data.
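
The membership changes described above can be pictured with the following sketch of one node's view of its siblings. Everything here is an assumption made for illustration: the class ReplicationPeers, the on_registration method, and in particular the in-memory flag used to detect a new or re-added node. This document does not describe how a node actually detects that it was added.

```python
class ReplicationPeers:
    """Hypothetical sketch of one node's sibling list and rebuild flag."""

    def __init__(self, self_name):
        self.self_name = self_name
        self.siblings = set()
        self.registered = False
        self.needs_stream_rebuild = False

    def on_registration(self, configured_nodes):
        # configured_nodes: the Activity Engine list from the web app configuration.
        configured = set(configured_nodes)
        if self.self_name in configured and not self.registered:
            # Newly added (or re-added) node: flag a stream rebuild so the
            # local index is backfilled with all required stream data.
            self.needs_stream_rebuild = True
        self.registered = self.self_name in configured
        new_siblings = configured - {self.self_name}
        added = new_siblings - self.siblings    # start replicating to these
        removed = self.siblings - new_siblings  # stop replicating to these
        self.siblings = new_siblings
        return added, removed
```

In this sketch, replication targets are derived entirely from the configured list, so adding or removing a node in the web app's configuration is enough to start or stop replication to it.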

Note that, although we have made extra efforts to ensure that consistency is maintained automatically while you manage an Activity Engine cluster, you can correct any unexpected inconsistency or missing data by initiating a stream rebuild on the affected node(s). The Activity Engine node remains completely accessible while it performs a rebuild, during which time streams fill with users' newest activity first. In addition, Jive 6.0 supports complex "in/out" configurations, which play a large part in activity ingress.

Activity Egress

Because each node maintains its own Lucene index, users who are routed to an Activity Engine node that is performing a stream rebuild may see partly diminished streams. This state is temporary and resolves itself automatically (similar to a search or browse index rebuild). The most recent stream activity is always available first, and no user activity is ever permanently lost.

Some background tasks are performed only by an elected node. However, in Jive 6.0, the "per-user stream rebuild" task is no longer applicable and has been removed. Therefore, the only tasks that require node election are upgrade migrations. In addition, Jive 6.0 supports complex "in/out" configurations, which play a large part in activity egress.