r15 - 2024-10-04 - 16:34:39 - PaulEllisYou are here: TWiki >  Deployment Web > DeploymentPlanningAndDesign > ELMSizingOverview71

Sizing ELM deployments for 7.x releases

Authors: VaughnRokosz
Build basis: ELM 7.x
Date: August 2024

Introduction

Whether new users or seasoned experts, customers using IBM Engineering Lifecycle Management (ELM) products all want the same thing. They want to use the ELM products without worrying that their deployment implementation will slow them down, and that it will keep up with them as they grow. That desire leads to a few basic questions:
  • What kind of hardware is required to run ELM?
  • What are the best deployment topologies for that hardware?
  • For any given class of hardware, how many users and how much data can be supported?

This article answers those questions by drawing on the tests executed by the IBM Performance Engineering team and analysis of many customer deployments.

Estimating the deployment size that can be supported by a given set of hardware is hard to do precisely, since so much depends on what exactly the users are doing as well as the amount of data. The approach taken in this article is to define a set of standard machine sizes (memory, CPU) and then provide estimates for the number of concurrent users and repository size that can be supported for each standard size.

Think of the guidance here as an estimate or a starting point. Expect to adjust your system sizes and topology as your deployment grows. IBM strongly recommends using best practices for application monitoring to help you understand how system usage evolves over time. That helps you to anticipate when changes to your deployment will become necessary.

Standard disclaimer

The information contained in this article is derived in part from data collected from tests performed in IBM test labs under controlled conditions. Customer results may vary greatly. Sizing does not constitute a performance guarantee. IBM assumes no liability for actual results that differ from any estimates provided by IBM.

Factors that influence performance

Before getting into the details of system specifications, it is worth discussing the general factors that influence performance. These factors will help you put the later guidance into context and also help you understand whether you should make adjustments for your specific environment.


The factors are:

  • Deployment topology. How many servers will be deployed, and what are the specifications of the servers (memory, number of CPUs, disk I/O speed)?
  • Data. How big is the deployment (how many artifacts and of what kind)? At what rate is the deployment growing, as people, custom scripts or integrations create more artifacts?
  • Application usage. How many people are actively using the system at any one time? What are those people doing? Are there integrations or custom scripts in use that should be accounted for in the usage model? At what rate are new users onboarded to the deployment?

If you are deploying ELM for the first time, then start by estimating data and application usage. You can then use those estimates to choose a deployment topology and to size the servers. Project your estimates a few years into the future so that your initial deployment can handle growth.

If you are managing an existing deployment, then monitor the active servers to look for overloaded systems and monitor the size of your repository. The monitoring data will help you adjust server sizes or decide on changes to your deployment topology.

Impact of deployment topology

You can find more information on deployment topologies on the standard deployment topologies overview page. There are three main deployment types:

  • A departmental topology, which is the minimum IBM recommends for any production deployment. This topology minimizes the amount of hardware required and is best for smaller projects and smaller-sized teams.
  • An enterprise topology, which is useful for medium-size to large-size teams. This topology spreads the workload across additional servers in order to support larger teams.
  • A federated topology, which is useful for very large enterprises that deploy an ELM solution per product line or organizational division, but would still like to be able to have a program-wide view of their entire portfolio of work.

The guiding principle behind the standard topologies is that you need more CPUs to support more concurrent users, and you need more RAM to support more data. In a departmental topology, you trade off data and user scale in order to use less hardware. There is less processing power and less memory available, so fewer users can be supported. In an enterprise topology, each application runs on its own server. This provides additional processing power and memory (by deploying additional servers). This increases the number of users (and the amount of data) that can be supported. A federated topology further increases the available processing power and memory by adopting multiple instances of the enterprise topology while maintaining some common elements. The additional processing power comes from the deployment of additional servers.

Impact of processor count

The number of active users that can be supported by a deployment is most strongly influenced by the number of available CPUs. Imagine a case where a user is interacting with an ELM application through the browser, and makes a selection in the UI that runs an operation on the server. That operation will use one CPU. If 200 users were running operations on the server at the same time, that would require 200 CPUs. If the server doesn't have enough CPUs available, the server would be running at 100% CPU and the user operations would slow down.

Real users, of course, don't run operations continuously. They think about what they are seeing on the screen, or they get coffee or attend meetings. So a workload of 200 active users doesn't really need 200 processors - that's just the worse case. But the basic principle applies: more active users requires more CPUs.

A special case is users that are accessing the system through application programming interfaces (APIs). If you develop code to interact with a deployment via APIs, that code will put more load on a deployment than a real user. For sizing purposes, think of each API user as the equivalent of 20 real users. Refer to the section on Integrations and APIs for more information.

CPUs can be added in one of three ways:

  • An individual server can be given more processors
  • Applications can be separated onto their own servers, which in effect increases the total number of available processors by adding more machines
  • Applications can be clustered, which increases the available processors by having multiple cluster nodes

IBM performance tests show that the throughput which a deployment can handle (in terms of requests per second) increases linearly as the number of processors increases (up to 48 processors). There is a point of diminishing returns, however, so continuing to add processors beyond 48 has less and less benefit (depending on the workload).

Impact of memory

The reliability of a ELM deployment is most strongly influenced by the available RAM. Without enough memory, the applications can hang, crash or become slow. You'll need to allocate enough RAM to satisfy the primary consumers of memory on a ELM server, which are:

  • The memory heap for the Java run-time environment that is running the application
  • Native memory used by the Java run-time (especially critical for network operations involving file transfers)
  • Memory used by the operating system.

The amount of RAM available can also influence performance, in these ways:

  • If the JVM heap is not large enough or is not tuned correctly, system performance can be degraded by garbage collection.
  • If there is too much demand for memory, the operating system's file cache may shrink and this can impact the caching of data stored locally on the application servers.
  • High memory usage may result in swapping to disk, which degrades performance.

Impact of data (repository size)

The size of your repository can impact performance, and can influence the hardware requirements needed to deliver good performance. Repository size refers to two different things:
  • The number of artifacts of each particular type (e.g. how many work items, how many test cases, how many requirements). This is the repository size.
  • The structure of those artifacts and the relationships between them (e.g. work items can be associated with a development plan, test cases can be associated with a test plan, requirements can be added to a module, artifacts can reference each other via links). This is the repository shape.

IBM recommends that you develop a model of how your users will interact with the system, so that you can have a rough idea of how the size of the deployment will evolve over time. When we do this for our own performance tests, we start by thinking about the user population, and we identify the primary roles which individuals take on as they do their jobs. We then define the operations which each role is likely to perform, and we estimate the frequency at which each operation will occur. From this you can then estimate how many artifacts are likely to be created over time.

The size and shape of your repository influences how much memory will be needed by your servers. For example:

  • The full-text index (used for artifact searching) grows as artifacts are added. More memory is needed to cache the files associated with the full-text index.
  • The database tables which store data grow as the number of artifacts grows. Larger databases require more memory on the database server.
  • Native Java memory (specifically, direct memory buffers) is used by the Web application servers to send information over the network to browsers (or EWM Eclipse clients). The source code management features in EWM can be particularly impacted by the size of source files, the number of files in a workspace, and frequency at which workspaces are refreshed (such as through builds).
  • Java heap memory is used to carry out ELM operation. Larger repositories result in more data being transferred from server to client, which then requires additional memory in the ELM application servers. As the memory demands on the Java heap grow, performance can degrade if garbage collection gets triggered more frequently. There can also be high transient demands on memory (e.g. when running reports which return large result sets).

Sizing ELM deployments

The sizing approach used in this article uses a simplified model that looks at two dimensions:

  • Deployment sizing
  • Machine size

Each of those dimensions is divided into coarse-grained buckets, and then a machine size is assigned to each deployment size, for each product. The deployment sizes are:

Size Active Users Total Database size
Small 10 Up to 50G
Medium 10-250 Up to 250G
Large 250-500 Up to 1TB
Extra-large >500 >1TB

An active user is defined as a person that is interacting with with the system and is therefore sending requests to the applications. A registered user is a person that is able to access the system. The important dimension to the sizing is the number of active users - you can have a much higher number of registered users. A registered user only starts to consume system resources once they become active. The definition of an active user explicitly excludes user ids used by scripts or automation. Scripts usually place a much higher load on a system than real people. For planning purposes, consider each API user as the equivalent of 20 real people.


These size buckets are per-application. For example, a medium size deployment of EWM can support up to 250 users with an EWM database of up to 250G. A large deployment of ETM would support up to 500 users with an ETM database of up to 1TB. You could have a medium EWM server and a large ETM server in a single deployment.

The standard server sizes used are:

  • 2 vCPU/4G RAM
  • 4 vCPU/8G RAM
  • 8 vCPU/16G RAM
  • 16 vCPU/32G RAM
  • 48 vCPU/96G RAM
  • 64 vCPU/128G RAM

A vCPU is a virtual CPU or a logical processor, accounting for hardware multi-threading. For example, an Intel physical CPU may have 6 cores with each core providing two threads. So each 6 core CPU provides 12 vCPUs or logical processors.

The sizing estimates provided here should give you a good starting point for sizing your ELM systems, but since this is a simplified model, you should expect to make adjustments for your specific deployment. IBM strongly recommends that you use a commercial monitoring tool to look for servers that are becoming overloaded.

The following sections use these abbreviations:

  • JTS: Jazz Team Server
  • JAS: Jazz Authorization Server
  • GCM: Global Configuration Management
  • EWM: Engineering Workflow Management
  • ERM: Engineering Requirements Management DOORS Next
  • ETM: Engineering Test Management
  • DCC: Data Collection Component (part of Jazz Reporting Service)
  • RB: Report Builder (part of Jazz Reporting Service)
  • LQE rs: Lifecycle Query Engine (Relational store)
  • LDX rs: Link Index Provider (Relational store)
  • PUB: PUB Document Builder
  • ENI: Engineering Lifecycle Optimization - Engineering Insights

Sizing an enterprise topology

IBM's recommendation for an enterprise topology is to deploy each application onto its own server. The recommended sizing for each application server is listed in the table below.

SizingTable.png

Here are the principles behind these sizings:

  • More vCPUs are required to support more active users.
  • More RAM is required to support more data.
  • The database is used by all servers so it needs to be larger. The database requires more memory as a deployment grows.
    • Plan to have multiple database servers in larger deployments.
  • In 7.1, the LDX and LQE servers use the database server for storage. However, memory is still required for indexing. IBM recommends using a single LQE rs system to support both reporting and link indexing, so the LQE rs server is larger than other servers.
  • You should expect to cluster the EWM and ETM servers in the extra-large deployment. You may also need to deploy multiple ERM servers as the repository grows.
  • The Jazz Authorization Server does not require significant resources.
  • You may need multiple proxy servers for redundancy, or more vCPUs for higher user loads. A single proxy with 8 vCPUs can become overloaded for transaction rates in the 500-1000 per second range.

If you need to have multiple applications hosted on a single server, then sum up the vCPU and memory requirements for each individual application and apply that to the combined server. For example, in a small deployment, you could host ETM and EWM on the same server but that would require 8 vCPUs and 16G of RAM.

Sizing a departmental topology

A departmental topology uses fewer servers and hosts multiple applications per server. IBM recommends using such a topology for small and medium deployments only. Use the sizing estimates from the enterprise topology and sum up the vCPU and RAM requires for each hosted application.

For the standard departmental topology, the system requirements would be:

Dept.png

Here, "Apps" refers to the ELM applications hosted on a single server: EWM, ETM, ERM, JTS, ENI, PUB and GC.

Proof-of-concept sizing

A proof-of-concept (or evaluation) topology can be used for extra-small deployments:
  • Less than 10 total concurrent users
  • Less than 100,000 total artifacts

IBM strongly recommends using a departmental topology for proof-of-concept (PoC) efforts since repository size can grow quite quickly even with a low user load. The departmental topology gives you enough resources to avoid scalability issues during the PoC.

If you keep your PoC small, you can use the following system sizes:

Application vCPU RAM
JTS 2 4
JAS 2 4
Apps 8 24
DCC 4 8
LQE/LDX rs 8 16
Proxy 2 4
Database 4 12

It is possible (although not recommended) to further combine applications so that you end up with a 3 system deployment:

Application vCPU RAM
JTS/JAS/Apps/Proxy 8 32
DCC/LQE/LDX rs 8 32
Database 4 12

Special considerations

This section discusses additional factors that can impact deployment sizing.

Storage

There are two factors to consider when addressing storage needs:

  • How much disk space is needed for each application?
  • How fast does that storage need to be?

ELM application servers do not stress the local storage systems. Allocating 500G for local storage is enough for most deployments. Storage for ELM application servers need not be extremely fast - spinning disks can provide enough performance. The main use cases for storage involve logging or the creation of temporary on-disk data. The exception is older ELM components (LQE and LDX) that are reliant on Jena - refer to this section for more details.

The storage for the database servers, however, will be stressed and database servers require large amounts of fast disk space. Most application data is stored in the database and so the amount of disk space required can be large and will grow as artifacts are created or edited. ELM applications will make queries against the database to retrieve data, and this can involve reading data from the database server's storage. IBM recommends investing in a fast storage system that provides a high rate of IO operations per second (IOPS).

Artifact counts and repository size

The sizing model is based on the size of the application databases. If you are planning a new deployment, you might find it easier to estimate the number of artifacts. As a rule of thumb, assume that 1 million artifacts will require 50G of database storage. If you have enabled configuration management, then assume 50G per 1 million versions. This applies to ETM, EWM/RMM and ERM.

LQE rs will use more disk space since it will read data from all applications (as will LDX rs). LQE rs also keeps track of the versions associated with each configuration (stream or baseline), and the space required for that is based on the number of versions per configuration times the number of configurations. To estimate LQE rs database size, use the following guidelines:

  • 12G per million resources indexed
  • Add .5G per million versions added for configurations

For example, if you have 1 million requirements, this would take 12G of database storage. If those requirements were in 100 streams or baselines, then there would be an additional 50G of storage needed (.5G per million x 1 million x 100).

Older components using Jena

Older versions of some components use a triple-store database (based on Jena technology), and this will grow as artifacts are created. The components which are most impacted by the size of the triple-store are:

  • The Lifecycle Query Engine (prior to 7.0.3)
  • The Link Index Provider (LDX) - prior to 7.1

Both of those components have been updated to use a relational store, so the following comments only apply to the older versions.

The triple-store uses the operating system's file system cache to help optimize queries. The file system cache will keep the indices in memory if there is enough free RAM, but if other applications need RAM, the operating system will shrink the file system cache. You'll get the best performance out of a server with large indices if you have enough RAM to keep most of the indices in memory. As a rule of thumb, you can look at the size of the indices on disk, and estimate the total amount of RAM required by looking at the sum of: JVM max heap size + disk size of indices...and then add an additional 4G for the overhead of operating system processes.

The large and extra-large deployments can require more than 1 terabyte of RAM to fully cache large LQE or LDX Jena indexes.

Considerations for database servers

Database servers benefit from additional RAM as repositories grow. Both Oracle and Db2 use memory to cache information, so when you have more RAM, you have the ability to cache more data (which can then speed up SQL queries). If you share a single database server between the ELM application servers, the memory demands on the database server are a function of the combined load across all application servers.

If your deployment will host millions of artifacts across multiple applications like EWM, ETM, and ERM - then you should deploy database servers with 64G or more of memory. Large and extra-large deployments can usually benefit from higher amounts of database memory (e.g. 756G or more). Large and extra-large deployments may require multiple database servers.

Integrations and APIs

The ELM suite provides a set of application programming interfaces (APIs) that allow for the development of custom applications that interact with the ELM applications. These custom applications can place extreme load on an ELM deployment, depending on how the APIs are used. A custom application that calls the APIs in multiple threads with no throttling can easily add load that is equivalent to 1000 real users.

IBM recommends that you implement a governance process around API usage. Specific recommendations:

  • Maintain an inventory of API-based tools, and understand the load that each tool places on the ELM applications. In the absence of other information, assume that each API thread in a tool is equivalent to 20 real users. Add this adjusted user count to your active user totals when sizing your ELM servers.
  • Register and monitor each tool as a custom expensive scenario
  • Test integrations at scale under load in a test environment (using production-like data)
  • Create a change control board (CCB) to manage integrations and approve their deployment (as well as to manage changes to integration tools)
  • Have throttling in place to control the load from tools and adjust their load based on system conditions

Data collection Component

The Data Collection Component is not strongly impacted by deployment size. It has a fixed number of threads, and does not handle concurrent user load. A 4 vCPU/8G system will typically be enough to manage collection of data from the applications.

If you increase the default number of threads, or you have additional data sources to index (e.g. because you've deployed additional application instances), you might need up to 8vCPU/16G of RAM.

Clustering

Extra-large deployments should consider clustering when supporting more than 500 users. Clustering in ELM 7.1 is supported in these places:

  • Engineering Test Management
  • Engineering Workflow Management (and Rhapsody Model Manager)
  • Jazz Authorization Server
  • Oracle (with RAC) and Db2 (with PureScale)

Clustering for ELM involves two additional applications:

  • A distributed cache manager (DCM)
  • Eclipse Amlen

DCM and Eclipse Amlen should be deployed onto their own dedicated servers.

Size DCM Amlen
Large 8 vCPU, 8G RAM 1 instance: 8 vCPU, 16G RAM, 64G disk
Extra Large 16 vCPU, 16G RAM 2 instances: 8 vCPU, 16G RAM, 64G disk


Additional sizing information for Eclipse Amlen can be found here.

Clustering is not supported by other parts of the ELM suite. However, you can deploy additional application instances and spread out the load that way. That includes:

  • Setting up new ERM servers
  • Adding Report Builder instances
  • Adding new LQE or LDX servers to support a subset of applications

Refer to Scaling the configuration-aware reporting environment for more information.

Networking

The sizing model discussed in this document does not address networking considerations. IBM assumes the following things to be true:

  • Application servers and database servers are located within a single data center with low latency between all servers (e.g. <1 ms).
  • There is sufficient network bandwidth to support data transfer rates
  • The quality of service on the network is high (no packet retries)
  • There are no devices on the network that interfere with communication (e.g. by artificially causing socket timeouts)

Operating system

The choice of operating system (Windows or Linux) does not impact the sizing recommendations.

References

Topic attachments
I Attachment Action Size Date Who Comment
Pngpng Dept.png manage 9.8 K 2024-09-11 - 18:50 VaughnRokosz  
Pngpng SizingTable.png manage 29.7 K 2024-09-30 - 16:38 VaughnRokosz  
Edit | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r15 < r14 < r13 < r12 < r11 | More topic actions
 
This site is powered by the TWiki collaboration platformCopyright © by IBM and non-IBM contributing authors. All material on this collaboration platform is the property of the contributing authors.
Contributions are governed by our Terms of Use. Please read the following disclaimer.
Dashboards and work items are no longer publicly available, so some links may be invalid. We now provide similar information through other means. Learn more here.