Troubleshooting Caching and Clustering

This topic lists caching- or clustering-related problems that can arise, as well as tools and best practices.

Log Files Related to Caching

If a cache server machine name or IP address is invalid, you'll see verbose messages on the command line, as well as in the log files. The following log files are related to caching:

  • cache.log -- Output from the cache processes, showing start flags, restarts, and general errors.
  • cache-gc.log -- Output from garbage collection of the cache process.
  • cache-service.log -- Output from the cache service watchdog daemon, which restarts the cache service as needed and logs interruptions in service.
  • cache.out -- Cache startup messages.

Use the appsupport Tool to Collect Cache Logs and Configuration

The appsupport tool, which gathers system information for communicating with Jive support, also collects caching information (unless you specify not to). For more information, see appsupport Command.

Configure Address of Node Previously Set with tangosol.coherence.localhost

If, prior to version 4.5, you set the VM property -Dtangosol.coherence.localhost on your application instance, you'll need to specify the IP address (not the host name) of that node in the cluster. To do this, enter the node's IP address in the Local Cluster Address field on the Cluster page of the Admin Console. For more information, see Setting Up a Cluster.
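
For reference, the legacy setting looked like a standard JVM system property, for example -Dtangosol.coherence.localhost=10.1.2.3 (the address is illustrative). The following is a minimal sketch, assuming the property appears in a JVM startup-options file whose path you supply; it simply extracts the address so you know what to enter as the Local Cluster Address:

    import re
    import sys

    # Minimal sketch: find a legacy -Dtangosol.coherence.localhost=<address>
    # setting in a JVM startup-options file (the file path varies by
    # installation, so it's supplied on the command line) and print the
    # address to enter as the Local Cluster Address in the Admin Console.
    FLAG = re.compile(r"-Dtangosol\.coherence\.localhost=(\S+)")

    def find_local_address(options_path):
        with open(options_path) as options_file:
            for line in options_file:
                match = FLAG.search(line)
                if match:
                    return match.group(1)
        return None

    if __name__ == "__main__":
        address = find_local_address(sys.argv[1])
        if address:
            print(f"Legacy local address found: {address}")
        else:
            print("No tangosol.coherence.localhost setting found.")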

Cache Server Configuration Issue

When configuring a cache server's address, you may need to use either its IP address or its domain name, depending on the Jive version you're running. For more information, see Setting Up a Cache Server.

Misconfiguration Through Mismatched Cache Address Lists

If you have multiple cache servers, the list of cache addresses configured on each must be identical. A mismatched configuration will show up in the cache.log file. For example, if two servers have the same list but a third one doesn't, the log will include messages indicating that the third server knows about one server but not another, or that a key expected to be on one server is on another instead.

To fix the problem, ensure that each cache server is configured with an identical list of cache addresses. The list is in /etc/jive/conf/cache.conf, where the CACHE_ADDRESSES line has the value that must be identical. Shut down the cluster, open /etc/jive/conf/cache.conf on each cache server, and edit the list of cache addresses so that it's the same on all of them (one way to compare the lists is sketched after the caution below).

For more information, see Managing In-Memory Cache Servers.
CAUTION:
If you're setting up more than one cache server machine, you must use three or more; running only two cache servers is not supported and can cause data loss. List the machines in the CACHE_ADDRESSES value as a comma-separated list.
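
As an illustration, here's a minimal sketch that compares the CACHE_ADDRESSES value across copies of cache.conf gathered from each cache server. The file paths and the exact format of the setting (for example, CACHE_ADDRESSES="10.0.0.1,10.0.0.2,10.0.0.3") are assumptions; adjust the parsing to match your files. It flags any mismatch and warns if fewer than three addresses are configured:

    import sys

    # Minimal sketch: each command-line argument is a copy of one server's
    # /etc/jive/conf/cache.conf. The parsing assumes a line such as
    # CACHE_ADDRESSES="10.0.0.1,10.0.0.2,10.0.0.3"; adjust it to your files.
    def read_cache_addresses(path):
        with open(path) as conf:
            for line in conf:
                line = line.strip()
                if line.upper().startswith("CACHE_ADDRESSES") and "=" in line:
                    value = line.split("=", 1)[1].strip().strip('"')
                    return tuple(addr.strip() for addr in value.split(",") if addr.strip())
        return ()

    if __name__ == "__main__":
        lists = {path: read_cache_addresses(path) for path in sys.argv[1:]}
        if len(set(lists.values())) > 1:
            print("Mismatch: the CACHE_ADDRESSES lists are not identical.")
            for path, addresses in lists.items():
                print(f"  {path}: {','.join(addresses)}")
        elif lists and len(next(iter(lists.values()))) < 3:
            print("Warning: fewer than three cache servers are listed.")
        else:
            print("CACHE_ADDRESSES lists match on all servers.")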

Cache Server Banned Under Heavy Load

Under extreme load, an application server node may become so overwhelmed that it bans a remote cache server for a short period because responses from the cache server take too long. If this occurs, you'll see it in the application log as entries related to the ThresholdFailureDetector.

This is usually a transient failure. However, if this continues, take steps to reduce the load on the application server to reasonable levels by adding more nodes to the cluster. You might also see this in some situations where a single under-provisioned cache server (for example, a cache server allocated just a single CPU core) is being overwhelmed by caching requests. To remedy this, ensure that the cache server has an adequate number of CPU cores. The minimum is two, but four are recommended for large sites.
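
If you suspect an under-provisioned cache server, a quick check of the host's core count against that guidance can help. This is a minimal sketch, run on the cache server itself; the thresholds are simply the minimum of two and the recommended four cores mentioned above:

    import os

    # Minimal sketch: compare this host's CPU core count with the guidance
    # above (two cores minimum, four recommended for large sites).
    MINIMUM_CORES = 2
    RECOMMENDED_CORES = 4

    cores = os.cpu_count() or 0
    if cores < MINIMUM_CORES:
        print(f"{cores} core(s): below the supported minimum of {MINIMUM_CORES}.")
    elif cores < RECOMMENDED_CORES:
        print(f"{cores} cores: meets the minimum, but {RECOMMENDED_CORES} are recommended for large sites.")
    else:
        print(f"{cores} cores: adequate for a cache server.")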

Banned Node Can Result in Near Cache Mismatches

While the failure of a node won't typically cause caching to fail across the cluster (cache data lives in a separate cache server), the banning of an unresponsive node can adversely affect near caches. This will show up as a mismatch visible in the application user interface.

An unresponsive node will be removed from the cluster to help ensure that it doesn't disrupt the rest of the application (other nodes will ignore it until it's reinstated). Generally, this situation resolves itself, with the interim downside of increased database access.

If this happens, recent content lists can become mismatched between nodes in the cluster. That's because near cache updates, which capture the most recent changes, are batched and communicated across the cluster. While the cluster relationship is broken, that communication fails between the banned node and the other nodes.
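
To make the failure mode concrete, here's a toy model (illustrative only, not Jive code; the class and field names are invented) of how a node that misses a batch of near cache updates keeps serving a stale recent-content list:

    # Toy model (not Jive code): each node holds a near cache of the
    # recent-content list, and updates are batched and broadcast to cluster
    # members. A banned node misses the broadcast and serves stale data.
    class Node:
        def __init__(self, name):
            self.name = name
            self.recent_content = []
            self.banned = False

        def apply_batch(self, batch):
            self.recent_content.extend(batch)

    class Cluster:
        def __init__(self, nodes):
            self.nodes = nodes

        def broadcast(self, batch):
            # Other nodes ignore a banned node until it's reinstated,
            # so it never receives these batches.
            for node in self.nodes:
                if not node.banned:
                    node.apply_batch(batch)

    nodes = [Node("A"), Node("B"), Node("C")]
    cluster = Cluster(nodes)
    cluster.broadcast(["doc-1"])
    nodes[2].banned = True         # node C becomes unresponsive and is banned
    cluster.broadcast(["doc-2"])   # C misses this batch of near cache updates
    for node in nodes:
        print(node.name, node.recent_content)
    # A and B show ['doc-1', 'doc-2']; C still shows ['doc-1'] -- a visible mismatch.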

After First Startup, Node Unable to Leave Then Rejoin Cluster

After the first run of a cluster -- the first time you start up all of the nodes -- nodes that are banned (due to being unresponsive, for example) might appear not to rejoin the cluster when they become available again. That's because when each node registers itself in the database, it also retrieves the list of nodes that registered before it. If one of the earlier-started nodes is the cluster coordinator -- the node responsible for merging a banned node back into the cluster -- it will be unaware of nodes that started after it, and so won't notice a problem if the last-started node becomes unreachable.

To avoid this problem, after you start every node for the first time, bounce (restart) the entire cluster. That way, each node will be able to read registration information about all of the others.

For example, imagine you start nodes A, B, and C in succession for the first time. The database contains no entries for them until they start; each node enters its address in the database as it comes up. Node A starts and registers itself. Node B starts, seeing A in the database. Node C starts, seeing A and B. However, because node C wasn't in the database when A and B started, they don't know to check on node C: if it becomes unreachable, they won't know and won't inform the cluster coordinator. (Note that the coordinator might have changed since startup.)

If a node leaves the cluster, the coordinator needs the full list of nodes at hand to merge that node back into the cluster when it becomes reachable again.
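
The following toy model (illustrative only; the function and variable names are invented) walks through the A, B, C example above: each node records itself in a simulated database and reads only the nodes already registered, so earlier-started nodes never learn about later ones until the whole cluster is restarted:

    # Toy model (illustrative only): each node registers itself in a simulated
    # database and reads the list of nodes already registered at that moment.
    # Earlier-started nodes therefore never learn about later ones -- until
    # every node is restarted and re-reads the full list.
    database = []                      # simulated table of registered nodes

    def start_node(name):
        known = list(database)         # nodes already registered at startup
        database.append(name)          # register this node
        return known

    known_at_first_start = {name: start_node(name) for name in ["A", "B", "C"]}
    print(known_at_first_start)        # {'A': [], 'B': ['A'], 'C': ['A', 'B']}
    # A and B don't know about C, so they won't notice if C becomes unreachable.

    # After bouncing the cluster, each node re-reads the now-complete list:
    known_after_restart = {name: [n for n in database if n != name] for name in database}
    print(known_after_restart)         # every node now knows all of the others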