Basic Monitoring Recommendations

Here are some monitoring recommendations that are relatively easy to implement.

Consider monitoring the following items using a monitoring tool such as check_MK, Zenoss, Zyrion, IBM/Tivoli, or another tool of your choice. Set polling intervals to five minutes.

CAUTION:
If you are connecting Jive to other resources such as an LDAP server, SSO system, SharePoint, and/or NetApp storage, we strongly recommend setting up monitoring on these external/shared resources. Most importantly, if you have configured Jive to synchronize against an LDAP server or to authenticate against an SSO system, configure monitoring and alerting on that external resource so that you can properly troubleshoot login issues. In our hosted customer environments at Jive Software, we see outages caused by the LDAP server being unavailable.
Node What you should monitor Why you should monitor it
On all nodes
  • Memory utilization
  • CPU load
  • Disk space
  • Disk I/O activity
  • Network traffic
  • Clock accuracy
These checks help you monitor all the basics and should be useful for troubleshooting. We recommend performing each of the following checks every five minutes on each server.
  • Memory utilization: If your memory utilization is consistently near 75%, consider increasing the memory.
  • CPU load: On healthy web application nodes, we typically see CPU load between 0 and 10 (with 10 being high). In your environment, if the CPU load is consistently above 5, you may want to get some thread dumps using the appsnap command, and then open a support case on the Jive Community.
  • Disk space: On the web application nodes, you'll need enough disk space for search indexes (which can grow large over time) and for attachment/image/binary content caching. The default limit for the binstore cache is 512MB (configurable from Admin console: System > Settings > Storage Provider). We recommend starting with 512MB for the binstore cache. Note that you also need space for generated static resources.
  • Network traffic: While you may not need a specific alert for network traffic, recording it gives you data points that help you understand when traffic dropped off.
  • Clock accuracy: In clustered deployments, ensuring the clocks are accurate between web application nodes is critical. We strongly recommend using NTP to keep all of the server clocks in sync.
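The memory and CPU guidance above depends on readings being "consistently" high rather than on a single spike. The following is a minimal Python sketch of that idea, not part of Jive or of any monitoring tool. The function names and the three-sample window are illustrative assumptions; the 75% and 5 limits come from the text above.

```python
import shutil

# Limits taken from the guidance above; tune them for your environment.
MEMORY_PCT_LIMIT = 75.0
CPU_LOAD_LIMIT = 5.0


def consistently_at_or_above(samples, limit, min_samples=3):
    """Return True if the last `min_samples` readings are all >= limit.

    `min_samples` is an illustrative assumption: with five-minute polling,
    three samples cover roughly the last 15 minutes.
    """
    recent = list(samples)[-min_samples:]
    return len(recent) == min_samples and all(s >= limit for s in recent)


def disk_used_pct(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(memory_pct_history, load_history):
    """Turn recent memory and load readings into a list of warnings."""
    warnings = []
    if consistently_at_or_above(memory_pct_history, MEMORY_PCT_LIMIT):
        warnings.append("memory utilization consistently at or above 75%")
    if consistently_at_or_above(load_history, CPU_LOAD_LIMIT):
        warnings.append("CPU load consistently at or above 5")
    return warnings
```

In practice your monitoring tool collects the readings; the sketch only shows how a "sustained" condition differs from a one-off threshold breach.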
Jive web application(s)
We recommend running a synthetic health check against your Jive application (using a tool such as WebInject) at both of the following levels:
  • Individual web application server
  • Through the load balancer's virtual IP address

WebInject interacts with the web application to verify basic functionality. It provides functional tests beyond just connecting to a listening port. Checking individual servers, as well as the load balancer instance, verifies proper load balancer behavior. We recommend running these checks every five minutes initially. To minimize false alarms, we require two failures before an alert is sent. If these settings still produce too many false alarms, adjust them as needed.
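The "two failures before an alert" rule can be sketched as a small counter. This is an illustrative Python sketch, not WebInject functionality; the class name `FailureGate` is hypothetical, and it assumes the two failures must be consecutive, which the text above does not state explicitly.

```python
class FailureGate:
    """Signal an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            # Any success resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.consecutive_failures == self.threshold
```

Raising `threshold` trades slower detection for fewer false alarms, which is the adjustment the paragraph above describes.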

We recommend setting up WebInject tests that perform the following:
  • request the Admin Console login page (this verifies that Apache and Tomcat are running)
  • log in to the Admin Console (this verifies that the web application node can communicate with the database server)
  • request the front-end homepage (this verifies at a high level that everything is okay)

For an example of WebInject XML code that will perform all of the above, see WebInject Code Example.
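If you are not using WebInject, the same ordering of checks can be approximated in Python. The sketch below is a substitute technique, not the WebInject example referenced above. The base URL and Admin Console login path are parameters because they depend on your installation, and the login step is omitted because it needs form field names that this document does not specify.

```python
import http.client
import urllib.request


def http_ok(url, timeout=10):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def build_steps(base_url, admin_login_path, fetch=http_ok):
    """Build ordered (name, check) pairs; `admin_login_path` is site-specific."""
    return [
        ("Admin Console login page", lambda: fetch(base_url + admin_login_path)),
        ("front-end homepage", lambda: fetch(base_url + "/")),
    ]


def run_checks(steps):
    """Run checks in order; return the name of the first failure, or None."""
    for name, check in steps:
        if not check():
            return name
    return None
```

Running the same steps once against each web application server and once against the load balancer's virtual IP address shows whether a failure is local to one node. The `fetch` parameter exists so the sequence can be exercised without a live server.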

Cache server
  • Java Management Extensions (JMX) hooks (heap)
  • Disk space (logs)
JMX provides a means of checking the Java Virtual Machine's heap size for excessive garbage collection. Disk space checks ensure continued logging.
Databases (Activity Engine, Analytics, and web application)
Stats for:
  • Connections
  • Transactions
  • Longest query time and slow queries
In addition:
  • Verify ETLs are running
  • Disk space
  • Disk I/O activity
Database checks will show potential problems in the web application server which can consume resources at the database layer (such as excessive open connections to the database).
  • Connections: More connections require more memory. If you constantly see the number of connections spike, make sure the database server has enough memory to handle them, and consider adding more. The number of connections is a function of the min/max settings on each web application node. (To learn how to set those, see Getting Basic System Information.) The out-of-the-box settings for database connections are 25 minimum and 50 maximum. For high-traffic sites in our hosted environment, we set them to 25/125. Note that to handle additional traffic, you should add nodes rather than raise the number of database connections.
  • Transactions: If the database provides an easy way to measure this number, it can be helpful for understanding overall traffic volume. However, this metric is less important than monitoring the CPU/memory/IO utilization for capacity planning and alerting.
  • Longest query time and slow queries: It's helpful to monitor the slow query logs on the database server. In our hosted (PostgreSQL) deployments, we log all slow queries (queries that take more than 1000 ms) to a file and then monitor that file to find queries that are causing issues and might be helped by database indexes.
  • Verify ETLs are running: This is important only for the Analytics database. The easiest way to monitor this is by querying the jivedw_etl_job table with something like this:
      select state, start_ts, end_ts from jivedw_etl_job where etl_job_id = (select max(etl_job_id) from jivedw_etl_job);
    If the state is 1, the ETL is running. If the state is 3, there is a hard failure that you need to investigate. If the difference between start_ts and end_ts is too big, you may need to increase the resources for the Analytics database.
  • Disk space: The most critical place to monitor disk space is on the database server; you should never have less than 50% of your disk available. We recommend setting an alert if you reach more than 50% disk utilization on the database server.
  • Disk I/O activity: Record this because it helps you diagnose slow performance and excessive wait time on the web application node(s).
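The ETL rules above can be expressed as a small classifier applied to the row returned by the jivedw_etl_job query. This Python sketch makes several assumptions: start_ts and end_ts arrive as datetime values (adapt the arithmetic if your database driver returns something else), `max_duration` is a hypothetical limit you choose because the text gives no number for "too big", and states other than 1 and 3 are treated as non-alarming.

```python
from datetime import datetime, timedelta


def classify_etl(state, start_ts, end_ts, max_duration):
    """Classify the most recent ETL job row.

    state 1 = running and state 3 = hard failure, per the text above.
    Any other state is assumed to be non-alarming.
    """
    if state == 3:
        return "failed"
    if state == 1:
        return "running"
    if start_ts is not None and end_ts is not None:
        if end_ts - start_ts > max_duration:
            return "slow"
    return "ok"
```

A monitoring check might alert immediately on "failed" and raise a lower-priority warning on "slow", since the latter points at capacity on the Analytics database rather than a broken job.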
Document conversion
  • Tomcat I/O
  • Heap
  • Queue statistics (e.g., average length and wait times)
  • Running OpenOffice service statistics
  • Overall conversion success rate for each conversion step
The various service statistics are exposed as JMX MBeans and can be accessed the same way as JMX on the Tomcat Java Virtual Machine of a web application node.
Activity Engine
  • Activity Engine service
  • Java Management Extensions (JMX) hooks (heap) and ports
  • Queue statistics (e.g., average length and wait times)
JMX provides a means of checking the Java Virtual Machine's heap size for excessive garbage collection. Disk space checks ensure continued logging.