| On all nodes |
- Memory utilization
- CPU load
- Disk space
- Disk I/O activity
- Network traffic
- Clock accuracy
|
These checks help you monitor all the basics and should be useful for
troubleshooting. We recommend performing each of the following checks
every five minutes on each server.
- Memory utilization: If your memory utilization is consistently
near 75%, consider increasing the memory.
- CPU load: On healthy web application nodes, we typically see CPU
load between 0 and 10 (with 10 being high). In your environment,
if the CPU load is consistently above 5, you may want to get
some thread dumps using the appsnap command, and then open a
support case on the Jive Community.
- Disk space: On the web application nodes, you'll need enough
disk space for search indexes (which can grow large over time)
and for attachment/image/binary content caching. The default
limit for the binstore cache is 512MB (configurable from ). We recommend starting with 512MB for the
binstore cache. Note that you also need space for generated
static resources.
- Network traffic: You may not need a specific alert for this,
but recording it gives you useful data points, for example for
pinpointing when traffic dropped off.
- Clock accuracy: In clustered deployments, ensuring the clocks
are accurate between web application nodes is critical. We
strongly recommend using NTP to keep all of the server clocks in
sync.
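The memory and CPU thresholds above can be sketched as a small check. This is a minimal illustration, not part of Jive: the function name `evaluate_samples`, the sample keys, and the reading of "consistently" as "every sample in the window" are assumptions. Only the 75% and 5 thresholds come from the guidance above.

```python
from typing import Dict, List

# Thresholds taken from the guidance above.
MEMORY_PCT_THRESHOLD = 75.0   # memory utilization, percent
CPU_LOAD_THRESHOLD = 5.0      # CPU load


def evaluate_samples(samples: List[Dict[str, float]]) -> List[str]:
    """Return warnings for metrics that stay high across all samples.

    Each sample is a dict with the hypothetical keys "memory_pct" and
    "cpu_load", collected (for example) once every five minutes.
    Assumption: "consistently" means every sample in the window.
    """
    warnings: List[str] = []
    if not samples:
        return warnings
    if all(s["memory_pct"] >= MEMORY_PCT_THRESHOLD for s in samples):
        warnings.append("memory: consider increasing the memory")
    if all(s["cpu_load"] > CPU_LOAD_THRESHOLD for s in samples):
        warnings.append("cpu: collect thread dumps and open a support case")
    return warnings


if __name__ == "__main__":
    # Three made-up samples: memory stays high, CPU load spikes only once.
    window = [
        {"memory_pct": 78.0, "cpu_load": 2.1},
        {"memory_pct": 80.5, "cpu_load": 6.3},
        {"memory_pct": 76.2, "cpu_load": 1.4},
    ]
    for warning in evaluate_samples(window):
        print(warning)
```

Requiring every sample in the window to exceed the threshold keeps a single spike from raising a warning. A real deployment would feed this from whatever monitoring agent already collects the metrics.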
|
| Jive web application(s) |
We recommend running a synthetic health check against your Jive
application (using a tool such as WebInject).
- Individual web application server
- Through the load balancer's virtual IP address
|
WebInject interacts with the web application to verify basic
functionality. It provides functional tests beyond just connecting
to a listening port. Checking individual servers, as well as the
load balancer instance, verifies proper load balancer behavior. We
recommend running these checks every five minutes initially. To
minimize false alarms, we require two failures before an alert is
sent. If these settings still produce too many false alarms,
adjust them as needed.
We recommend setting up WebInject tests that perform the following:
- request the Admin Console login page (this verifies that
Apache and Tomcat are running)
- log in to the Admin Console (this verifies that the web
application node can communicate with the database
server)
- request the front-end homepage (this verifies at a high
level that everything is okay)
For an example of WebInject XML code that will perform all of the above, see
WebInject Code Example.
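The "two failures before an alert" rule can be sketched independently of WebInject. The following is a minimal illustration under stated assumptions: the class `ConsecutiveFailureAlert`, the function `run_checks`, and the target names are hypothetical, and each check is any callable that returns True on success (for example, a wrapper around a WebInject run or an HTTP request).

```python
from typing import Callable, Dict


class ConsecutiveFailureAlert:
    """Signal an alert only after N consecutive failed checks per target."""

    def __init__(self, required_failures: int = 2) -> None:
        self.required_failures = required_failures
        self._failures: Dict[str, int] = {}

    def record(self, target: str, ok: bool) -> bool:
        """Record one check result; return True if an alert should be sent."""
        if ok:
            # A successful check resets the failure streak.
            self._failures[target] = 0
            return False
        self._failures[target] = self._failures.get(target, 0) + 1
        return self._failures[target] >= self.required_failures


def run_checks(targets: Dict[str, Callable[[], bool]],
               tracker: ConsecutiveFailureAlert) -> Dict[str, bool]:
    """Run every check once and report which targets need an alert."""
    alerts: Dict[str, bool] = {}
    for name, check in targets.items():
        try:
            ok = check()
        except Exception:
            # Any error while probing counts as a failed check.
            ok = False
        alerts[name] = tracker.record(name, ok)
    return alerts


if __name__ == "__main__":
    tracker = ConsecutiveFailureAlert(required_failures=2)
    # Hypothetical targets: one individual node and the load balancer VIP.
    targets = {"node1": lambda: True, "vip": lambda: False}
    for _ in range(2):  # two monitoring cycles, e.g. five minutes apart
        print(run_checks(targets, tracker))
```

Tracking each target separately lets you tell a failing individual server apart from a failure seen only through the load balancer's virtual IP address.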
|
| Cache server |
- Java Management Extensions (JMX) hooks (heap)
- Disk space (logs)
|
JMX provides a means of checking the Java Virtual Machine's heap size
for excessive garbage collection. Disk space checks ensure continued
logging.
|
| Databases (Activity Engine, Analytics, and web application) |
Stats for:
- Connections
- Transactions
- Longest query time and slow queries
Verify ETLs are running
Disk space
Disk I/O activity
|
Database checks reveal problems in the web application server
that consume resources at the database layer (such as
excessive open connections to the database).
- Connections: More connections require more memory. If you
constantly see the number of connections spike, make sure that
the database server has enough memory to handle them, and add
memory if it does not. The number of connections is a function
of the min/max settings on each of the web application
nodes. (To learn how to set those, see Getting Basic System Information). Out-of-the-box
settings for database connections are 25 minimum, 50 maximum.
For high-traffic sites in our hosted environment, we set that to
25/125. Note that to handle additional traffic, you should add
nodes rather than more database connections.
- Transactions: If the database provides an easy way to measure
this number, it can be helpful for understanding overall traffic
volume. However, this metric is less important than monitoring
the CPU/memory/IO utilization for capacity planning and
alerting.
- Longest query time and slow queries: It's helpful to monitor
the slow query log of the database server. In our hosted
(PostgreSQL) deployments, we log all slow queries (queries that
take more than 1000 ms) to a file and then monitor that file to
find queries that might benefit from additional database
indexes.
- Verify ETLs are running: This is important only for the
Analytics database. The easiest way to monitor this is by
querying the jivedw_etl_job table with something like this:
select state, start_ts, end_ts from jivedw_etl_job where etl_job_id = (select max(etl_job_id) from jivedw_etl_job);
If the state is 1, the ETL is running. If any state is 3, there
is a hard failure that you need to investigate. If the
difference between start_ts and end_ts is too big, you may need
to increase the resources for the Analytics database.
- Disk space: On the web application nodes, you'll need enough
disk space for search indexes (which can grow large over time)
and for attachment/image/binary content caching. The default
limit for the binstore cache is 512MB (configurable from ). We recommend starting with 512MB for the
binstore cache. Note that you also need space for generated
static resources. The most critical place to monitor disk space
is on the database server; you should never have less than 50%
of your disk available. We recommend setting an alert if you
reach more than 50% disk utilization on the database
server.
- Disk I/O activity: This is good to record because it can be
important if you see slow performance on the web application
node(s) and excessive wait time.
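The ETL check from the list above can be sketched as follows. This is an illustration under stated assumptions, not Jive code: `interpret_etl_row` and `max_duration` are hypothetical names, the timestamps are assumed to be numbers in the same unit, and an in-memory SQLite table with made-up column types stands in for the Analytics database. Only the query and the meaning of states 1 and 3 come from the text above.

```python
import sqlite3
from typing import List, Optional, Tuple

# Query quoted from the guidance above.
ETL_QUERY = (
    "select state, start_ts, end_ts from jivedw_etl_job "
    "where etl_job_id = (select max(etl_job_id) from jivedw_etl_job)"
)


def interpret_etl_row(row: Optional[Tuple[int, float, Optional[float]]],
                      max_duration: float) -> List[str]:
    """Translate the latest ETL job row into findings.

    Assumptions: start_ts and end_ts are numbers in the same unit, and
    max_duration (a hypothetical, site-specific limit) uses that unit.
    """
    if row is None:
        return ["no ETL job rows found"]
    state, start_ts, end_ts = row
    findings: List[str] = []
    if state == 1:
        findings.append("ETL is running")
    elif state == 3:
        findings.append("hard failure: investigate")
    if end_ts is not None and end_ts - start_ts > max_duration:
        findings.append("ETL took too long: consider more resources")
    return findings


if __name__ == "__main__":
    # In-memory SQLite stands in for the Analytics database here;
    # the column types are assumptions made for this demo only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table jivedw_etl_job "
        "(etl_job_id integer, state integer, start_ts real, end_ts real)"
    )
    conn.executemany(
        "insert into jivedw_etl_job values (?, ?, ?, ?)",
        [(1, 3, 0.0, 600.0), (2, 1, 1000.0, None)],
    )
    latest = conn.execute(ETL_QUERY).fetchone()
    print(interpret_etl_row(latest, max_duration=3600.0))
```

Against a real Analytics database you would run the same query through your database driver and choose `max_duration` from how long your ETL runs normally take.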
|
| Document conversion |
- Tomcat I/O
- Heap
- Queue statistics (e.g., average length and wait times)
- Running OpenOffice service statistics
- Overall conversion success rate for each conversion step
|
The various service statistics are exposed as JMX MBeans and can be accessed
the same way as JMX on the Tomcat Java Virtual Machine of the web application node. |
| Activity Engine |
- Activity Engine service
- Java Management Extensions (JMX) hooks (heap) and ports
- Queue statistics (e.g., average length and wait times)
|
JMX provides a means of checking the Java Virtual Machine's heap size
for excessive garbage collection. Disk space checks ensure continued
logging.
|