Working towards continuous deployment in Jazz.net

I have worked for IBM for about 20 years, moving back and forth between development, customer support and IT operations roles. Working in these different roles helped give me perspective for my current role as the manager for Jazz Continuous Deployments/DevOps. In this role I am responsible for ensuring that our multiple staging and self-hosting environments are stable and always using the latest driver or sprint of the Collaborative Lifecycle Management (CLM) products. We are not yet at the point where we can automatically deploy a new driver daily to production, but we are working towards that goal.

Right now we have two main staging systems and one production system. One staging environment is updated daily with the latest build. The other behaves more like a classic staging system.

Once we agree on a build to deploy to production, we test the upgrade on the second staging system to ensure we can upgrade from our previously deployed build.

This second staging system also allows testing hot fixes and patches before deploying them to production.

With the current approach, upgrading, initializing and configuring a clustered system takes about three and a half hours. The steps are as follows:

Prepare the new build (10 minutes)

  • One script downloads the installable images and unpacks them.
  • Then we run the ‘prep’ task of our clmtools script. It merges properties and updates WAR files with specific data needed for our front end user interface.

Stop the server(s) (5 minutes)

  • Because we can only take offline backups today, we need to stop the servers first, and the outage starts at that point.

Take offline backups (2 hours to get an entire backup of the system)

  • We copy the DB2 content to an offline location.
  • We also back up the file indexes for work items.

Install (15 minutes)

  • Once backups are done we run the ‘inst’ task of the clmtools script.
  • This uninstalls the old WAR files from the application server, installs the new ones, and removes temporary directories.

Migrate (15 minutes)

  • We run repotools addTables on each database.
  • We have to run it even when it is not going to change anything.
  • We are working with the development team to change that behavior, so it checks if a migration is needed while the server is up.

Unclustering (10 minutes)

  • At this point, we need to uncluster the system to allow the RM migration.

Migrate RM (15 minutes)

  • We open the RM web page and run the RM migration, which migrates RM if needed.

Re-cluster (15 minutes if all goes well…)

  • Finally, we stop the server, recluster, and restart the cluster.

Validation (10 minutes)

  • The first step of our validation is to access the Admin page of every application: JTS, CCM, QM and RM.
  • Then we request a set of three URLs to ensure the following areas are working: Dashboard, Reports, Converter.
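To see where the time goes, the step durations listed above can be added up. This is only a small arithmetic sketch: the numbers are copied from the step headings, and the helper names (`total_minutes`, `share_of_total`) are made up for illustration.

```python
# Step durations in minutes, copied from the step headings above.
STEPS = [
    ("Prepare the new build", 10),
    ("Stop the server(s)", 5),
    ("Take offline backups", 120),
    ("Install", 15),
    ("Migrate", 15),
    ("Unclustering", 10),
    ("Migrate RM", 15),
    ("Re-cluster", 15),
    ("Validation", 10),
]


def total_minutes(steps):
    """Sum the duration of every step."""
    return sum(minutes for _, minutes in steps)


def share_of_total(steps, name):
    """Fraction of the total time spent in the step called `name`."""
    return dict(steps)[name] / total_minutes(steps)


if __name__ == "__main__":
    hours, minutes = divmod(total_minutes(STEPS), 60)
    print(f"Total: {hours} h {minutes} min")
    backup = share_of_total(STEPS, "Take offline backups")
    print(f"Offline backups: {backup:.0%} of the total")
```

The offline backup alone accounts for more than half of the total, which is why online backup is the first item we want to change.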

We need to do better. Ultimately, we would like to keep the production outage to no more than one hour. We plan to get there by taking backups online, running the migration tools only when needed, and no longer having to start the system unclustered just to check whether an RM migration is needed.

So instead of 3.5 hours, we expect the outage to last between 30 and 45 minutes, depending on whether we need to migrate the database and/or RM.

To accomplish this, here is what we are working on:

1) The upgrade process must be scriptable

Some of the steps in the upgrade and migration still require human intervention and some action in the user interface.

We are working on making sure all product tasks can be executed from a script. Ultimately we want them to be executed from an IBM UrbanCode Deploy engine.
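As a rough illustration of what "scriptable" means here, the sketch below runs a list of named steps in order and stops at the first failure. It is not our clmtools script and not an UrbanCode Deploy process; `run_steps` and the placeholder actions are hypothetical names used only for this example.

```python
from typing import Callable, List, Optional, Tuple

# A step is a name plus a callable that raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns the names of the steps that completed, and an error
    message for the failing step (or None if every step passed).
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any failure aborts the upgrade
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder actions; real ones would call the actual tools.
    plan = [
        ("prepare", lambda: print("preparing build")),
        ("stop", lambda: print("stopping servers")),
        ("install", lambda: print("installing WAR files")),
    ]
    done, error = run_steps(plan)
    print("completed:", done, "error:", error)
```

Stopping at the first failure matters for an upgrade: installing on top of a failed backup is exactly the situation a script should never allow.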

2) The product itself should be fast to install and configure

Some of the configuration requires the system to be up and unclustered, which means someone has to manually start the system and open the web interface. We could create a script, but I am working with the development team to allow configuration of the system without requiring us to start the application server.

This will reduce the length of outage.

3) The product should know when upgrade steps are required

The command-line upgrade should quickly check whether the version of the metadata or data has changed and return immediately if no further work is required. Right now most of our upgrade checks look only at the milestone/sprint version. But some upgrades are done within the same sprint, which means the check must use the timestamp of the build as well as some internal version of the model used (in case it needs to be upgraded).

In other words, upgrades and migrations happen between build timestamps, not between versions.
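A minimal sketch of such a check is shown below. The `BuildInfo` fields and the function names are assumptions made for this example, not the product's actual metadata or API; the point is only that two builds from the same sprint can still differ by timestamp and by model version.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BuildInfo:
    sprint: str            # milestone/sprint label (hypothetical field)
    built_at: datetime     # build timestamp (hypothetical field)
    model_version: int     # internal model version (hypothetical field)


def needs_install(installed: BuildInfo, candidate: BuildInfo) -> bool:
    """A newer build timestamp means there is something to install."""
    return candidate.built_at > installed.built_at


def needs_migration(installed: BuildInfo, candidate: BuildInfo) -> bool:
    """Data only needs migrating when the model version moved forward."""
    return candidate.model_version > installed.model_version


if __name__ == "__main__":
    current = BuildInfo("S3", datetime(2013, 5, 1, 2, 0), model_version=7)
    nightly = BuildInfo("S3", datetime(2013, 5, 2, 2, 0), model_version=7)
    # Same sprint, newer build, unchanged model:
    # install the build but skip the migration.
    print(needs_install(current, nightly), needs_migration(current, nightly))
```

A check based on the sprint label alone would treat these two builds as identical and skip the upgrade entirely.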

4) All the backup and migration must be done online

https://jazz.net/wiki/bin/view/Main/OnlineMigrationDesign

We are working on having our database and index backups run while the server is up. Again, the goal is to stop the server during an upgrade only when absolutely necessary.

Once we have more information, we will post it here: https://jazz.net/wiki/bin/view/Deployment/

5) Allow patching of a live server without restarting

In some cases we need to install a patch or a hot fix. While this is not strictly a continuous deployment issue, our product should be able to be updated while running. We use OSGi as our base layer, and we should use its features where possible. Ultimately, a continuous deployment should be thought of as a set of live hot fix updates, which would not require an outage and could be applied one step at a time.

6) Test successful migration more extensively

Our upgrade smoke test needs to be enhanced and scripted.
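As a starting point, the manual validation we do today (the Admin pages plus the Dashboard, Reports and Converter URLs) could be scripted along the following lines. This is a sketch under assumptions: the host `clm.example.com` and every path in `CHECKS` are placeholders, not our real URLs, and `smoke_test` and `http_status` are hypothetical helper names.

```python
from typing import Callable, Dict
from urllib.request import urlopen

# Placeholder URLs: the host and paths below are assumptions for illustration.
CHECKS = {
    "JTS admin": "https://clm.example.com/jts/admin",
    "CCM admin": "https://clm.example.com/ccm/admin",
    "QM admin": "https://clm.example.com/qm/admin",
    "RM admin": "https://clm.example.com/rm/admin",
}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request (raises on HTTP errors)."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def smoke_test(checks: Dict[str, str],
               fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each check name to True if its URL answered with HTTP 200."""
    results: Dict[str, bool] = {}
    for name, url in checks.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:  # URLError, HTTPError and socket timeouts are OSErrors
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in smoke_test(CHECKS).items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

Passing `fetch` as a parameter keeps the check testable without a live server. A real smoke test would also need to authenticate and inspect the page content, which a status code alone does not cover.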

For the next release, we are already planning the changes required to automate and speed up the upgrade process.

You can follow the discussion and participate in this work item.

In the next set of blog posts, I will go into the details of how we design the set of scriptable tasks on UrbanCode Deploy. I’ll also discuss the role of our Information Radiator in continuous delivery.

Christophe Elek
Jazz/CLM Deployment Manager