Sharded MongoDB in Integration Tests

Sharded MongoDB with Testcontainers


Recently I had the pleasure of fixing integration tests in one of the projects we inherited from another team. They used to run for 8 to 15 minutes (!!!), failed to run on the M1 Apple Silicon chip and left a mess on the machine where they executed. The root of those troubles was setting up a sharded MongoDB with de.flapdoodle.embed.mongo . After switching to Testcontainers and a few tweaks, the tests need around 4 minutes to finish, developers with the M1 Apple Silicon chip are able to run them, and they leave the machine in a clean state.

If you are only interested in the solution, jump ahead to the Final solution section.


Recently my team inherited a project from another team. When I'm getting to know a new project I like to run the tests and look at how they are written and what they test. I think tests are even more important than the production codebase (with great tests you can rewrite the whole project without fear) - but that's a topic for another blog post.

When I ran the tests I cried. They finished in 13 minutes.


Then my teammate said that when he runs those tests he gets errors related to processor architecture - he's got the M1 Apple Silicon chip.

When I ran those tests again later I noticed that they were even slower, and my machine consumed more RAM than usual. After looking into the logs I found out that the database that was set up is not cleaned up properly.


And indeed, there were two mongod processes left after running those integration tests.

ps aux | grep mongo
piotr.proszowski 40910  49,7  0,1 36616824  48572   ??  UN    8:44    13:40.41 /var/folders/36/ycyd3y7d4mjctf18v4_jndj80000gq/T/extract-69f1502f-d8f9-4229-b9cc-dc2200940b55extractmongod --dbpath /var/folders/36/ycyd3y7d4mjctf18v4_jndj80000gq/T/embedmongo-db-7c917d98-a591-424f-837a-04f083df7772 --noauth --noprealloc --smallfiles --nojournal --port 53955 --bind_ip --replSet test --shardsvr --syncdelay=0
piotr.proszowski 41399  49,5  0,1 36616824  47412   ??  UN    8:49    13:13.92 /var/folders/36/ycyd3y7d4mjctf18v4_jndj80000gq/T/extract-d2777c3a-f3a7-45e3-ac59-892716e7036bextractmongod --dbpath /var/folders/36/ycyd3y7d4mjctf18v4_jndj80000gq/T/embedmongo-db-44b427f3-5f11-41ca-8230-1b9013b5afca --noauth --noprealloc --smallfiles --nojournal --port 54516 --bind_ip --replSet test --shardsvr --syncdelay=0

Which could be cleaned up with the following command:

pkill mongo

During my FedEx day I decided to clean up this mess and introduce Testcontainers to my team.

What’s wrong with those tests?

  • Tests are incredibly slow

The project uses sharded MongoDB collections (it stores a lot of data), so there was a need to set up a similar database for integration tests to be sure that everything works as expected. To achieve this goal, an embedded Mongo was set up with the help of the de.flapdoodle.embed.mongo library. Initializing a sharded mongo with this library (in this particular version) takes ~4:30 minutes. Additionally, in one of the tests the authors decided to re-initialize the Spring context (you know that when using @TestPropertySource the Spring context has to be re-initialized, right?). So at this point it's obvious why the tests take so long.

  • Developers with the M1 Apple Silicon chip cannot run the tests

Because this was done some time ago, version 2.2.0 was used, which causes problems for developers with the M1 Apple Silicon chip. This problem seems to be solved in versions > 3.0.0 of the library and could probably be fixed by bumping the library version.

  • Tests are leaving a mess on the machine

Leaving mongod processes running in the background after the tests finish still seems to be an open issue.

Testcontainers to the rescue

When I saw all of that I immediately thought: it's a perfect use case for Testcontainers! Because it makes extensive use of Docker and Docker Compose, I'll start by briefly explaining what those two are.

A few words about Docker

What the heck is it?

The best article answering this question I've ever read is Understanding Docker as if it were a Game Boy .

Because such great articles already exist, I'll only paraphrase the essentials from the article quoted above.

Let's think about Docker as a Game Boy, and about Docker images as games that can be run on it. Just as we can run any kind of game on a Game Boy, we can run any kind of software on Docker. And it's dead simple - we just specify what software we want to run, and it just works (as simple as running a game on a Game Boy). Moreover, just as we can share our games with friends, we can do the same with images!

Let's say I want to run MongoDB locally. I just want to try out how it works and experiment with some commands. Without Docker I would probably head to the installation section of the MongoDB website, find the installation guide for my operating system (OS X) and then install a lot of stuff. Then I would probably encounter some strange error for which I would google random fixes. It would take me a lot of time and would be unrepeatable (what if something changes on my machine?).

Those are the problems (among others) which Docker addresses.


Docker showcase

Docker is a game changer. We can imagine that somebody made all this effort to set up MongoDB once, noted all the steps and created a script for it with the exact versions of all the elements for which it worked. This script (the image definition) is written in a Dockerfile. Since all this effort was made once, I can just take an image somebody created and shared publicly in a Docker registry.

I run the image called mongo with tag 5.0 in detached mode (-d) and with port 27017 published (-p 27017:27017).

❯ docker run -d -p 27017:27017 mongo:5.0

That's it. I've just created a Docker container with a running mongo (not necessarily exactly version 5.0.0 - the 5.0 tag resolves to a patch version, as you can see below).

I can check what Docker Containers are currently running on my machine:

❯ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED         STATUS         PORTS                                 NAMES
a79be5ee39a1   mongo:5.0   "docker-entrypoint.s…"   2 minutes ago   Up 2 minutes   0.0.0.0:27017->27017/tcp   compassionate_rosalind

And I can easily interact with those containers, for example:

❯ docker exec -it a79be5ee39a1 bash
root@a79be5ee39a1:/# mongo
MongoDB shell version v5.0.9
connecting to: mongodb://
Implicit session: session { "id" : UUID("0461b1b9-43ae-47e6-9978-be54a99d452b") }
> db.version()

What the heck just happened?

To understand what happened, let's take a look at what the mongo Docker image looks like. It's available on GitHub . Let's take a look at the Dockerfile, which is basically the definition of the image.

It starts with the line

FROM ubuntu:focal

which means that its base image is Ubuntu with tag focal. We can also find out how the Ubuntu image is built . It's only three lines:

FROM scratch
ADD ubuntu-focal-oci-amd64-root.tar.gz /
CMD ["bash"]

Again - it starts from a base image named scratch , copies and extracts a tarball, and runs the bash command. I know - it looks like cheating. If you expected some fancy stuff here (as I did), you are probably disappointed.

Going back to the mongo Dockerfile - we can see a bunch of commands being executed, some strange environment variables being set, and lots of stuff being downloaded. I'm not going to analyze what's going on there - the point is that we get a lot of work packed into an image which we can easily run.

A funny thing I've just spotted in the entrypoint script (which is called from the Dockerfile):

_mongod_hack_ensure_arg --logappend "${mongodHackedArgs[@]}"

I’m really impressed that it works!

Is containerization just a fancy name for virtualization?

Nope. Here's a great answer to that question .

Basically, containerization is less heavyweight than virtualization, which has its pros and cons. Containers are less isolated from each other, they can share some resources, and they are much faster. We can think of virtualization as emulating the hardware layer as well, whereas containerization cares only about software. All containers use the host OS kernel - that's why they are so lightweight.

So you want to tell me that Docker containers can be Composed?

Appetite comes with eating, as they say. We can easily run any software we want in seconds. Why not run a whole bunch of apps and let them communicate with each other?

There are several tools that can help with orchestrating multiple Docker containers and handle communication between them (as well as restarting them, scaling and much more). The most popular are:

  • Kubernetes
  • Docker Swarm
  • Docker Compose

With all of them we can have infrastructure as code (more or less complex) - decide which containers to run, with which parameters, and how they communicate.

I'll focus only on the third one, as it mostly serves development purposes, while the other two are production-ready solutions for orchestrating containers (not only Docker containers). For Docker Compose this config is by default called docker-compose.yml and it looks like this:

version: '3.8'

services:
  mongo:
    image: mongo:5.0
    ports:
      - 27017:27017
    restart: always
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    networks:
      internalnetwork:
        aliases:
          - mongo-network

networks:
  internalnetwork: { }

The simple configuration above sets up a single mongo service which will be restarted every time it stops, with root username root and root password example. It also binds the container's port 27017 to localhost port 27017 and sets up a network called internalnetwork - if there were more containers, they could communicate within the same network by service name.

I'm not going to deep dive into how this happens internally - for the purpose of this blog post, knowing that we have such an option is enough.

If you want to see a more useful example of docker-compose usage, I highly recommend looking at this repo , which gets you node exporters, Grafana, Prometheus etc. set up with a single docker-compose command. It's a great starting point for pet projects that need some monitoring.

What is Testcontainers?

Definition from the homepage :

Testcontainers is a Java library that supports JUnit tests, providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container.

It describes perfectly what Testcontainers does. It's worth emphasizing that it should never be used to run containers that serve production purposes - it's just not meant for that.

Testcontainers uses the Docker API to control containers (via a library that allows communicating with Docker from Java - there are similar libraries for many other languages). It contains multiple ready-to-use containers: for example, MongoDBContainer initializes a single replica set, waits for the initialization to complete and provides a URL so we can easily connect to the database in tests.
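As a minimal sketch of how this looks in code (assuming JUnit-style setup with the org.testcontainers:mongodb module and a MongoDB driver on the classpath):

```java
import org.testcontainers.containers.MongoDBContainer;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

class MongoDbContainerSketch {
    public static void main(String[] args) {
        // Pulls the image on first run, then starts a throwaway
        // single-node replica set and waits until it is initialized.
        try (MongoDBContainer mongo = new MongoDBContainer("mongo:5.0")) {
            mongo.start();

            // getReplicaSetUrl() returns a ready-to-use connection string,
            // e.g. mongodb://localhost:<random-mapped-port>/test
            try (MongoClient client = MongoClients.create(mongo.getReplicaSetUrl())) {
                System.out.println(client.getDatabase("test").getName());
            }
        } // container is stopped here; Ryuk would also reap it if the JVM died
    }
}
```

Note that the container exposes a random host port, so tests read the connection string from the container instead of hardcoding 27017.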

Advantages of using TestContainers

  • We are testing against a real database, running in an environment very similar to production
  • Thanks to Ryuk - a kind of supervisor container - containers are cleaned up after the tests
  • It's easy and cheap to set up (at least with Spring)
  • The solution is platform independent
  • Reusing containers is an option for those who want incredibly fast integration tests.
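Container reuse is opt-in: as far as I know, it has to be enabled both on the container (withReuse(true)) and via a property file in your home directory - a sketch, assuming the feature is available in your Testcontainers version:

```
# ~/.testcontainers.properties
testcontainers.reuse.enable=true
```

With this in place, a reusable container survives between test runs instead of being reaped by Ryuk, which removes the startup cost on subsequent runs.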

Disadvantages of using TestContainers

  • There is a need to have Docker installed
  • First startup is slow (due to downloading images) and internet connection is needed for it
  • Nodes for CI/CD have to support Docker

Final solution

Time to share what it can look like in real life. I created a project for that purpose:

I'm not going to describe it in depth; a few things are worth noticing:

  • docker-compose.yml setting up a sharded Mongo, found on GitHub
  • There was a need to wait for the initialization of the sharded mongo. I used the docker-compose healthcheck feature to check if the shards are in place.
  • I used a MongoInitializer in a BaseIntegrationTest from which all integration tests inherit, and I made sure that the database is initialized only once across all tests
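A sketch of what such a base class could look like - the class and file names (MongoInitializer, the compose file path, the mongos service name) are assumptions for illustration, not the exact code from the repo - assuming Spring Boot and Testcontainers' DockerComposeContainer:

```java
import java.io.File;

import org.springframework.boot.test.util.TestPropertyValues;
import org.springframework.context.ApplicationContextInitializer;
import org.springframework.context.ConfigurableApplicationContext;
import org.springframework.test.context.ContextConfiguration;
import org.testcontainers.containers.DockerComposeContainer;
import org.testcontainers.containers.wait.strategy.Wait;

// Hypothetical base class: the compose environment is started once per JVM
// (static initializer) and shared by every test class that extends it.
@ContextConfiguration(initializers = BaseIntegrationTest.MongoInitializer.class)
public abstract class BaseIntegrationTest {

    static final DockerComposeContainer<?> ENVIRONMENT =
            new DockerComposeContainer<>(new File("src/test/resources/docker-compose.yml"))
                    .withExposedService("mongos", 27017,
                            // rely on the compose healthcheck to signal that shards are ready
                            Wait.forHealthcheck());

    static {
        ENVIRONMENT.start(); // started only once; Ryuk cleans up after the tests
    }

    // Injects the mapped host/port into the Spring context before it starts,
    // so repositories connect to the sharded cluster inside the containers.
    static class MongoInitializer
            implements ApplicationContextInitializer<ConfigurableApplicationContext> {
        @Override
        public void initialize(ConfigurableApplicationContext context) {
            TestPropertyValues.of(
                    "spring.data.mongodb.host=" + ENVIRONMENT.getServiceHost("mongos", 27017),
                    "spring.data.mongodb.port=" + ENVIRONMENT.getServicePort("mongos", 27017)
            ).applyTo(context);
        }
    }
}
```

Because the container is a static field initialized once, every test class that extends BaseIntegrationTest shares the same database, which is exactly what keeps the suite fast.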

Other materials not linked in this post