Jazz Community Site - Jazz Team Blog » How scalable are repository workspaces?

How scalable are repository workspaces?

by Jean-Michel Lemieux Mon, 23 Mar 2009

I was chatting with users of a rather large Rational Team Concert deployment today and they asked a very common question: “What is the footprint overhead of having repository workspaces?” The easy answer is, don’t worry about it, because disk space is cheap. But in reality, we do care about it, a lot. Imagine what would happen if you had a stream with 130,000 files and a disk footprint of 2.2GB (yes, this is the footprint of the RTC integration stream) and every time you created a repository workspace we blindly copied the entire configuration and added 130,000+ rows to the database!

So what is the overhead? Well, thanks to Dmitry Karasik and John Camelon, who designed and implemented our configuration inheritance algorithm, we reuse as much as possible when a repository workspace is created. But let’s take a look at what metadata we need for a repository workspace:

We need a record of the repository workspace itself. This is a rather small structure that has the name, description, and flow targets for the workspace. It doesn’t have the files, change sets, or file contents.
Each workspace also has a set of component instances. It’s these component instances that contain the change sets.
Then lastly, each component has a configuration that simply represents what the component would look like if it were loaded on disk (e.g., the file structure).
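
To make the three pieces of metadata above a bit more concrete, here is a minimal sketch of how they could be modelled. The class and field names are my own illustration, not the actual RTC schema; the only thing taken from the description is which piece holds what.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Configuration:
    # What the component would look like if loaded on disk:
    # file path -> id of the file state (revision) at that path.
    file_states: Dict[str, str] = field(default_factory=dict)


@dataclass
class ComponentInstance:
    name: str
    # The change sets live here, not on the workspace record.
    change_set_ids: List[str] = field(default_factory=list)
    configuration: Configuration = field(default_factory=Configuration)


@dataclass
class RepositoryWorkspace:
    # Small record: no files, change sets, or file contents.
    name: str
    description: str
    flow_targets: List[str] = field(default_factory=list)
    components: List[ComponentInstance] = field(default_factory=list)
```

The point of the sketch is the size difference: the workspace record is a handful of strings, while the component instances and their configurations grow with the history and the file count.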

As you can see, the repository workspace itself isn’t a large structure, but copying the change history, all those change sets, and the configurations could be huge. So this is where we get the most bang-for-the-buck with reuse. Most change histories are very close to one another, given the common practice of delivering and accepting to stay in sync with the team, so we can take advantage of this and reuse the change histories at the baseline level. When baselines are available, since they are immutable, we can simply have the components in the workspace reference them. Since the baselines are shared, the configurations for those baselines can be shared too. And voilà, we only store the changes you make on top of the shared baselines and configurations, and repository workspaces become a lot cheaper.

So it’s a constant process of starting with a repository workspace with references to a set of shared baselines, then adding new change sets that aren’t shared while a developer works, then re-harmonizing when the developer delivers and accepts, and things are in sync again.
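
Here is a rough sketch of that cycle, assuming a much simplified model where a baseline is just an immutable tuple of change set ids. The names (`Baseline`, `WorkspaceComponent`, `resync`, `private_change_sets`) are hypothetical and this is not how the real inheritance algorithm is implemented; it only illustrates why the private cost of a workspace is the part that isn't shared yet.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Baseline:
    """Immutable snapshot, so any number of workspaces can reference it."""
    name: str
    change_set_ids: Tuple[str, ...]


@dataclass
class WorkspaceComponent:
    baseline: Baseline  # a shared reference, not a copy
    local_change_sets: List[str] = field(default_factory=list)

    def history(self) -> List[str]:
        # Shared history first, then whatever this workspace added on top.
        return list(self.baseline.change_set_ids) + self.local_change_sets

    def resync(self, new_baseline: Baseline) -> None:
        """After deliver and accept: reference the newer shared baseline
        and drop the local change sets it already contains."""
        included = set(new_baseline.change_set_ids)
        self.baseline = new_baseline
        self.local_change_sets = [
            cs for cs in self.local_change_sets if cs not in included
        ]


def private_change_sets(components: List[WorkspaceComponent]) -> int:
    # Only the unshared change sets cost anything per workspace.
    return sum(len(c.local_change_sets) for c in components)
```

With fifty components all pointing at one 1,000 change set baseline, `private_change_sets` stays at zero until someone starts working, and drops back once their changes are part of a new shared baseline.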

Of course, the actual file contents are always shared, and new in 2.0, contents of different files that happen to be the same (e.g., same hash) are stored in the same blob. If you store two copies of the same 200MB zip file of a video in different components, only one 200MB file will be stored. We were able to reduce the footprint of our self-hosting server by 10% by sharing identical content.
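
The idea of storing identical content once can be sketched as a content-addressed store, keyed by a hash of the bytes. This is an illustration only: the `BlobStore` class is made up, and SHA-256 is my assumption, since I haven't said here which hash the server uses.

```python
import hashlib
from typing import Dict


class BlobStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}  # digest -> content

    def put(self, content: bytes) -> str:
        # SHA-256 is an assumption made for this sketch.
        digest = hashlib.sha256(content).hexdigest()
        # Only store the bytes if this digest hasn't been seen before.
        self._blobs.setdefault(digest, content)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = BlobStore()
video_zip = b"\x00" * 1024          # stand-in for the 200MB zip
key_a = store.put(video_zip)        # added to component A
key_b = store.put(video_zip)        # same bytes added to component B
```

Both components end up holding the same key, and the store holds the bytes once.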

Let me show some concrete numbers then. I’ve taken a couple of snapshots of our repository statistics from the regular RTC web interface to highlight the relationship between four measures: 1) the number of repository workspaces, 2) the number of files in the repository, 3) the footprint of those files, and 4) the footprint of the configurations. Our hope is that as the number of repository workspaces increases, the footprint of repository workspaces doesn’t. As new files are added, it’s normal to have an increase in repository workspace/configuration footprint, but it shouldn’t be that significant.

This first graph shows the number of repository workspaces that were added to our self-hosting repository over a one-year period. It looks like we’ve added about 1000 repository workspaces this year.

Next is the count of files added (this doesn’t include directories). So far this year we’ve added almost 100,000 files.

These 100,000 files total 17GB compressed, including all the states (e.g., revisions) of each file, and uncompressed it’s around 35GB.
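
For a sense of scale, here is the back-of-the-envelope arithmetic on those figures. It treats the numbers as round (exactly 100,000 files, decimal gigabytes), which is an assumption, so read the results as rough.

```python
GB = 10**9  # decimal gigabytes, an assumption for this estimate

files = 100_000
compressed = 17 * GB
uncompressed = 35 * GB

# Fraction of the original size left after compression (a bit under half).
compression_ratio = compressed / uncompressed

# Average uncompressed bytes per file, across all of its states.
avg_bytes_per_file = uncompressed / files  # 350000.0, i.e. about 350KB
```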

And lastly is the footprint of the configurations, which record which state of each file you have in your repository workspace. The good news here is that it doesn’t follow the same upward trend as the number of repository workspaces. As more files are added, there is a normal increase in the shared configurations and change histories. Also note the huge spike last year; this is when we implemented the sharing and reduced the configuration size in the repository drastically.

From January to September it stayed almost flat, as we worked mainly on maintenance. Although we created new repository workspaces, it didn’t affect the footprint. From September we added more teams to the self-hosting server and started to add a lot more files, such as all 13 language translations and the foundation split. But the footprint of the configurations did what we expected and stayed a function of file count, not of the number of repository workspaces.
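
To put a rough number on the difference, here is a toy cost model comparing blind copying with sharing. The file and workspace counts come from this post; combining them this way, and the figure of 50 private entries per workspace, are assumptions made purely for illustration.

```python
def naive_rows(workspaces: int, files: int) -> int:
    """Every workspace gets its own full copy of the configuration."""
    return workspaces * files


def shared_rows(workspaces: int, files: int, private_per_workspace: int) -> int:
    """One shared configuration plus each workspace's undelivered entries."""
    return files + workspaces * private_per_workspace


FILES = 130_000      # files in the RTC integration stream
WORKSPACES = 1_000   # repository workspaces added in a year
PRIVATE = 50         # hypothetical entries a workspace holds privately

print(naive_rows(WORKSPACES, FILES))            # 130000000
print(shared_rows(WORKSPACES, FILES, PRIVATE))  # 180000
```

In the shared model the file count dominates and the workspace count only contributes the small private part, which is the shape the graphs show.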