Getting started with monitoring
All too often we talk about monitoring as a necessary part of routine maintenance like flossing our teeth or washing windows.
But the truth is, like some nice-to-do, good-for-you tasks, we don't consider a complete monitoring program until it's too late.
Working with customers, we're more used to a "Monitoring in time of crisis" approach.
Something suddenly goes wrong and we wish we had been paying better attention all along. So, under the stress of a down deployment or critical issue, we start a monitoring program.
This article discusses a "Monitoring in a time of peace" approach: the concepts and steps to take when everything is running perfectly and you'd really rather hang the "Gone fishin'" sign on the door.
Because everyone's workload is different, the monitoring we suggest is usually very basic:
Watch the JVM, disk space, and RAM. Use JTSMon to start noticing trends.
The mechanics are more complicated (which tool to use, whether the tool has triggers, what to do with all the JTSMon data, and so on) and are usually customized based on the resources available.
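For the disk space part of that list, a minimal sketch using only the Python standard library might look like the following. The function names and the CSV layout are illustrative assumptions, not part of JTSMon or any Jazz tooling; JVM and RAM figures would come from whichever tool you choose.

```python
import csv
import shutil
import time
from pathlib import Path

GIB = 1024 ** 3


def disk_snapshot(path="/"):
    """Return one timestamped row describing disk usage for `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "path": str(path),
        "total_gib": round(usage.total / GIB, 2),
        "used_gib": round(usage.used / GIB, 2),
        "used_pct": round(used_pct, 1),
    }


def append_row(csv_path, row):
    """Append one row to a CSV file, writing the header on first use."""
    csv_path = Path(csv_path)
    is_new = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Calling `append_row("disk_usage.csv", disk_snapshot("/"))` from a scheduled job (the file name is hypothetical) builds up the history needed to see a trend rather than a single reading.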
Reasons for monitoring
What leadership wants:
- Proof that the application is being used and is delivering value for the investment
- Capacity planning data and trends that might help plan for new hardware or expansion
What admins want:
- Clues to optimize daily and long-term operations
- "Canaries," triggers or warning signals, hints before something bad happens
- Data which might suggest tuning is needed (or prove that tuning was beneficial)
What end-users want:
- Confidence that someone is paying attention to essential systems
A basic monitoring plan
Document the topology
Use a diagram, whiteboard, any tool.
Document all properties, settings, preferences, custom configuration changes
OS settings (version, patches, etc.)
The "must gather" data collected for the support process can be used here too:
https://jazz.net/help-dev/clm/index.jsp?re=1&topic=/com.ibm.team.concert.doc/topics/t_using_the_isal
Capture baseline end-user response time data
When end-users complain, it's helpful to know what's normal behavior.
Use a lightweight automated tool to check uptime and responsiveness.
See https://jazz.net/library/article/1017/
Simple tools such as curl or ping are enough to start.
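If you would rather script the check than run curl by hand, the same idea can be sketched in Python with the standard library. `check_endpoint` and its `opener` parameter are hypothetical names chosen for this example, and the URL you probe is whatever your own deployment exposes; treat this as a starting point rather than a finished monitor.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, timeout=10.0, opener=urllib.request.urlopen):
    """Issue one HTTP GET and return (ok, status, seconds).

    `status` is None when no HTTP response was received at all
    (DNS failure, refused connection, timeout).
    """
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # include the body transfer in the timing
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error code
    except OSError:
        status = None  # URLError and socket timeouts are OSError subclasses
    elapsed = time.perf_counter() - start
    ok = status is not None and 200 <= status < 400
    return ok, status, elapsed
```

Recording the elapsed time on every run, not only the up/down result, is what produces the baseline response-time data mentioned above.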
Five essential things you must monitor
Monitoring your CLM environment can be complex and possibly a daunting prospect if you are starting from scratch.
Over the past years we've made notes on some of the most essential aspects of the environment which should be monitored.
Most importantly, collecting this information isn't as much about "point-in-time" data as it is about "trend" data: we want to collect this information and then compare it week-to-week, project-to-project, or maybe even server-to-server.
Further details will provide suggestions on how to track, capture and use this data.
License usage
Track how many licenses are being used.
From this we can infer much about the number of users, user patterns, active users vs. registered (licensed) users.
Database size
Track size (in GB) of the databases and how they grow over time.
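Once you have a few dated size readings, turning them into a growth rate is simple arithmetic. The sketch below assumes you already record the size in GB by some means of your own (the database vendor's tooling, for example); the function names and sample figures are made up for illustration.

```python
from datetime import date


def weekly_growth_gb(samples):
    """Average growth in GB per week between the first and last sample.

    `samples` is a list of (date, size_gb) tuples in any order.
    """
    ordered = sorted(samples)
    (first_day, first_size), (last_day, last_size) = ordered[0], ordered[-1]
    days = (last_day - first_day).days
    if days <= 0:
        raise ValueError("need samples from at least two different days")
    return (last_size - first_size) * 7 / days


def weeks_until_full(current_gb, capacity_gb, growth_gb_per_week):
    """Rough linear projection; returns None if the database is not growing."""
    if growth_gb_per_week <= 0:
        return None
    return (capacity_gb - current_gb) / growth_gb_per_week


# Hypothetical readings taken two weeks apart.
readings = [(date(2024, 1, 1), 100.0), (date(2024, 1, 15), 130.0)]
growth = weekly_growth_gb(readings)
print(growth)                               # 15.0
print(weeks_until_full(130.0, 190.0, growth))  # 4.0
```

A straight-line projection like this is crude, but it is often enough to start a capacity-planning conversation.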
Average JVM size per application
Track in GB and, implicitly, how the JVM size changes over a period of time.
Average CPU % per server
Track in percentage, and implicitly track how the percentage changes over time.
Data transfer
Track in GB how much data moves in and out (up and down) between the appserver and the database.
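If your data source reports transfer as a cumulative byte counter (an assumption; check what your tool actually provides), the GB moved per interval is the difference between consecutive readings. The sketch below uses a hypothetical function name and treats any reading lower than its predecessor as a counter reset, for example after a restart.

```python
GB = 1000 ** 3


def transferred_gb(counter_readings):
    """Convert cumulative byte counters into GB moved per sampling interval."""
    deltas = []
    for previous, current in zip(counter_readings, counter_readings[1:]):
        # A counter that goes backwards was reset, so count from zero.
        moved = current - previous if current >= previous else current
        deltas.append(moved / GB)
    return deltas


# Hypothetical readings: steady growth, then a reset before the last sample.
print(transferred_gb([0, 2 * GB, 5 * GB, 1 * GB]))  # [2.0, 3.0, 1.0]
```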
The reporting interval will vary depending on the tool, the method, and how much data is created.
In time of peace, sampling every 10 minutes is a good way to start.
Of course, at that sampling rate, it's possible to miss many spikes or infrequent events.
If you're starting a monitoring program from scratch, it may make sense to start with a wide interval (10 minutes) but adjust to a greater frequency (1 minute) as the collected data starts to make sense.
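The trade-off is easy to demonstrate with made-up numbers. In the sketch below, an hour of hypothetical 1-minute CPU readings contains a single one-minute spike; sampling the same hour every 10 minutes never sees it.

```python
def downsample(samples, every):
    """Keep every `every`-th reading, as a slower sampling interval would."""
    return samples[::every]


# Hypothetical hour of 1-minute CPU readings with one short spike at minute 37.
one_minute = [20] * 60
one_minute[37] = 95

ten_minute = downsample(one_minute, 10)

print(max(one_minute))  # 95
print(max(ten_minute))  # 20
```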
Do something with the data you collect
There is no point in collecting data and then ignoring it.
Validate that the data collected is useful.
- Compare one week with another.
- Find a way to graph it.
- Put it on a webpage or dashboard.
- Run scripts which look for errors or things that you are interested in.
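Comparing one week with another does not need a special tool to get started. The following sketch assumes each week's readings are already loaded into a dictionary keyed by metric name; the metric names and values are invented for the example.

```python
from statistics import mean


def week_over_week(last_week, this_week):
    """Percent change in the average of each metric present in both weeks."""
    changes = {}
    for metric in sorted(last_week.keys() & this_week.keys()):
        before = mean(last_week[metric])
        after = mean(this_week[metric])
        if before == 0:
            continue  # percent change is undefined from a zero baseline
        changes[metric] = round(100 * (after - before) / before, 1)
    return changes


# Hypothetical readings for two consecutive weeks.
last_week = {"cpu_pct": [40, 50, 60], "db_gb": [200, 200]}
this_week = {"cpu_pct": [55, 55, 55], "db_gb": [220, 220]}
print(week_over_week(last_week, this_week))  # {'cpu_pct': 10.0, 'db_gb': 10.0}
```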
Collecting data and comparing it with other data sets will most likely reveal trends, and you will probably discover something interesting, for example:
- A report that runs every day. Is it needed?
- There's an unexpected traffic pattern
- Some servers are not configured consistently
- Some environments are more prone to problems than others
- There are unexplained anomalies
- The database is growing by hundreds of GB per week
- Average CPU % is higher than 80%
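The last two findings in that list lend themselves to an automated check. This is a sketch only: the 80% CPU figure comes from the list, while the 100 GB weekly growth limit and the function name are assumptions you would tune for your own environment.

```python
from statistics import mean

CPU_AVG_LIMIT_PCT = 80    # threshold taken from the list above
DB_GROWTH_LIMIT_GB = 100  # assumed weekly limit; adjust to your environment


def find_warnings(cpu_samples_pct, db_start_gb, db_end_gb):
    """Return human-readable warnings for one week of collected data."""
    warnings = []
    cpu_avg = mean(cpu_samples_pct)
    if cpu_avg > CPU_AVG_LIMIT_PCT:
        warnings.append(
            f"average CPU {cpu_avg:.1f}% exceeds {CPU_AVG_LIMIT_PCT}%"
        )
    growth = db_end_gb - db_start_gb
    if growth > DB_GROWTH_LIMIT_GB:
        warnings.append(
            f"database grew {growth:.0f} GB this week "
            f"(limit {DB_GROWTH_LIMIT_GB} GB)"
        )
    return warnings
```

An empty list means nothing crossed a threshold; anything else is a candidate for the dashboard or an email to the admin team.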
As you collect data from week to week, emerging trends will enable proactive capacity planning.
External links: