r10 - 2016-06-09 - 15:13:33 - RosaNaranjo

Monitoring: Where to Start?

Authors: GrantCovell, MarkGravina
Build basis: CLM x.x

Getting started with monitoring

All too often we talk about monitoring as a necessary part of routine maintenance like flossing our teeth or washing windows. But the truth is, like some nice-to-do, good-for-you tasks, we don't consider a complete monitoring program until it's too late.

Working with customers, we're more used to a "Monitoring in time of crisis" approach. Something suddenly goes wrong and we wish we had been paying better attention all along, and so under the stress of a down deployment or critical issue we start a monitoring program.

This article discusses a "Monitoring in a time of peace" approach: the concepts and steps to take when everything is running perfectly and you'd really rather hang the "Gone fishin'" sign on the door.

Because everyone's workload is different, the monitoring we suggest is usually very basic: watch the JVM, disk space and RAM, and use JTSMon to start noticing trends.

The mechanics are more complicated (which tool to use, whether the tool has triggers, what to do with all the JTSMon data, etc.) and are usually customized based on the resources available.

Reasons for monitoring

What leadership wants:

  • Proof application is being used, value for investment
  • Capacity planning data and trends that might help plan for new hardware or expansion

What admins want:

  • Clues to optimize daily and long-term operations
  • "Canaries," triggers or warning signals, hints before something bad happens
  • Data which might suggest tuning is needed (or prove that tuning was beneficial)

What end-users want:

  • Confidence that someone is paying attention to essential systems

A basic monitoring plan

Document the topology

Use a diagram, whiteboard, any tool.

Document all properties, settings, preferences, custom configuration changes

Include OS settings (version, patches, etc.). The "must gather" information collected for the support process can be used here too: https://jazz.net/help-dev/clm/index.jsp?re=1&topic=/com.ibm.team.concert.doc/topics/t_using_the_isal

Capture baseline end-user response time data

When end-users complain, it's helpful to know what's normal behavior.
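As a rough sketch of what a baseline might look like, the following reduces a set of response-time samples to a median and a 95th percentile. The function name `summarize_baseline` and the sample timings are hypothetical; how the timings are gathered depends on your own tooling.

```python
import statistics

def summarize_baseline(samples_ms):
    """Return the median and 95th-percentile response time for samples in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    # method="inclusive" keeps the result within the observed range.
    p95 = statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]
    return {"median_ms": statistics.median(samples_ms), "p95_ms": p95}

# Hypothetical timings for one page load, in milliseconds.
baseline = summarize_baseline([420, 450, 390, 510, 480, 430, 2200, 460, 440, 470])
print(baseline)
```

Recording both values is a design choice: the median describes the typical experience, while the 95th percentile shows how bad the slow cases are.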

Use a lightweight automated tool to check uptime and responsiveness

See https://jazz.net/library/article/1017/. Simple tools such as curl and ping can also be used.
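If a scripted check is preferred over curl, a minimal sketch using only the Python standard library could look like this. The function name `check_url` is hypothetical, and treating any status from 200 to 399 as "up" is an assumption to adjust for your environment.

```python
import time
import urllib.error
import urllib.request

def check_url(url, timeout_s=10.0):
    """Fetch url once and return (ok, status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        status = err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, timeout.
        status = None
    elapsed = time.monotonic() - start
    ok = status is not None and 200 <= status < 400
    return ok, status, elapsed
```

For example, `check_url("https://clm.example.com/jts")` (a hypothetical address) returns whether the page answered, the HTTP status, and how long the request took. Run from a scheduler at a fixed interval, the elapsed times double as response-time data.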

Five essential things you must monitor

Monitoring your CLM environment can be complex and possibly a daunting prospect if you are starting from scratch. Over the past years we've made notes on some of the most essential aspects of the environment which should be monitored.

Most importantly, collecting this information isn't as much about "point-in-time" data as it is about "trend" data: we want to collect this information and then compare it week-to-week, project-to-project, or maybe even server-to-server.

Further details will provide suggestions on how to track, capture and use this data.

License usage

Track how many licenses are being used. From this we can infer much about the number of users, user patterns, active users vs. registered (licensed) users.

Database size

Track size (in GB) of the databases and how they grow over time.
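One simple way to express growth over time is an average rate in GB per week. The sketch below assumes the sizes have already been collected as dated samples; the function name `weekly_growth_gb` and the figures are hypothetical.

```python
from datetime import date

def weekly_growth_gb(samples):
    """Given (date, size_gb) samples, return the average growth in GB per week."""
    ordered = sorted(samples)
    (first_day, first_size), (last_day, last_size) = ordered[0], ordered[-1]
    days = (last_day - first_day).days
    if days <= 0:
        raise ValueError("samples must span at least one day")
    return (last_size - first_size) / days * 7

# Hypothetical weekly measurements of one database.
samples = [
    (date(2016, 5, 2), 120.0),
    (date(2016, 5, 9), 123.5),
    (date(2016, 5, 16), 126.0),
    (date(2016, 5, 23), 130.5),
]
print(f"{weekly_growth_gb(samples):.1f} GB per week")
```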

Average JVM size per application

Track in GB, and implicitly track how the JVM size changes over time.

Average CPU % per server

Track in percentage, and implicitly track how the percentage changes over time.
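Assuming CPU readings arrive as (server, percent) pairs from whatever tool collects them, the per-server average can be computed as in this sketch. The server names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def average_cpu_by_server(readings):
    """Given (server, cpu_percent) readings, return {server: average percent}."""
    by_server = defaultdict(list)
    for server, cpu_percent in readings:
        by_server[server].append(cpu_percent)
    return {server: mean(values) for server, values in by_server.items()}

# Hypothetical readings from two servers.
readings = [("app1", 35.0), ("app1", 45.0), ("db1", 70.0), ("db1", 90.0), ("app1", 40.0)]
averages = average_cpu_by_server(readings)
print(averages)
```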

Data transfer

Track in GB how much data moves in and out (up and down) between the appserver and the database.

Reporting interval will vary depending upon tool, method and how much data is created.

In time of peace, sampling every 10 minutes is a good way to start. Of course, at that sampling rate, it's possible to miss many spikes or infrequent events.

If you're starting a monitoring program from scratch, it may make sense to start with a wide interval (10 minutes) but adjust to a greater frequency (1 minute) as the collected data starts to make sense.
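To illustrate how a wide interval can miss a spike, the sketch below takes hypothetical per-minute CPU readings and keeps only every tenth one, as a 10-minute sampler would. The data is invented for the illustration.

```python
def downsample(per_minute, every_n):
    """Keep every n-th reading, as a sampler with a wider interval would."""
    return per_minute[::every_n]

# Hypothetical CPU readings, one per minute for an hour: flat at 30%,
# with a three-minute spike to 95% starting at minute 24.
per_minute = [30.0] * 60
per_minute[24:27] = [95.0, 95.0, 95.0]

coarse = downsample(per_minute, 10)  # keeps minutes 0, 10, 20, 30, 40, 50
print(max(per_minute))  # 95.0
print(max(coarse))      # 30.0
```

The 1-minute data shows the spike; the 10-minute data never sees it, because the spike falls between two samples.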

Do something with the data you collect

There is no point collecting the data and then ignoring it.

Validate that the data collected is useful.

  • Compare one week with another.
  • Find a way to graph it.
  • Put it on a webpage or dashboard.
  • Run scripts which look for errors or things that you are interested in.
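Comparing one week with another can be as simple as a percent change per metric. This is a minimal sketch; the metric names and weekly averages are hypothetical.

```python
def percent_change(previous, current):
    """Return the percent change from previous to current for each shared metric."""
    changes = {}
    for name in previous.keys() & current.keys():
        if previous[name] == 0:
            continue  # percent change is undefined from a zero baseline
        changes[name] = (current[name] - previous[name]) / previous[name] * 100
    return changes

# Hypothetical weekly averages.
last_week = {"cpu_percent": 40.0, "jvm_gb": 8.0, "db_gb": 200.0}
this_week = {"cpu_percent": 50.0, "jvm_gb": 8.0, "db_gb": 210.0}

for name, change in sorted(percent_change(last_week, this_week).items()):
    print(f"{name}: {change:+.1f}%")
```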

Most likely collecting data and comparing it with other data sets will reveal trends, and most likely you will discover something interesting.

  • A report that runs every day. Is it needed?
  • There's an unexpected traffic pattern
  • Some servers are not configured consistently
  • Some environments are more prone to problems than others
  • There are unexplained anomalies
  • The database is growing by hundreds of GB per week
  • Average CPU % is higher than 80%
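Some of these findings can be checked by a script. The sketch below flags two of them from a weekly summary. The dictionary keys and the function name `find_flags` are hypothetical, and the limits (80% CPU, 100 GB per week) are starting points taken from the list above, not recommendations.

```python
def find_flags(summary, cpu_limit=80.0, db_growth_limit_gb=100.0):
    """Return warning strings for one server's weekly summary."""
    flags = []
    if summary["cpu_avg_percent"] > cpu_limit:
        flags.append(f"average CPU {summary['cpu_avg_percent']:.0f}% is above {cpu_limit:.0f}%")
    if summary["db_growth_gb_per_week"] >= db_growth_limit_gb:
        flags.append(f"database grew {summary['db_growth_gb_per_week']:.0f} GB this week")
    return flags

# Hypothetical summary for one server.
summary = {"cpu_avg_percent": 86.0, "db_growth_gb_per_week": 12.0}
for flag in find_flags(summary):
    print(flag)
```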

As you collect data from week to week, emerging trends will enable proactive capacity planning.
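As one illustration of turning a trend into a capacity estimate, the sketch below fits a straight line to weekly database sizes and estimates how many weeks remain before a given capacity is reached. Linear growth is an assumption that may not hold for your data; the function name `weeks_until_full` and the figures are hypothetical.

```python
def weeks_until_full(sizes_gb, capacity_gb):
    """Fit a least-squares line to weekly sizes and return the estimated number
    of weeks after the last sample until capacity is reached, or None if the
    fitted trend is flat or shrinking."""
    n = len(sizes_gb)
    if n < 2:
        raise ValueError("need at least two weekly samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sizes_gb) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sizes_gb))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Week index where the fitted line reaches capacity, relative to the last sample.
    return (capacity_gb - intercept) / slope - (n - 1)

# Hypothetical: four weekly sizes and a 200 GB volume.
print(weeks_until_full([100.0, 110.0, 120.0, 130.0], 200.0))
```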

Related topics: Deployment web home
