In hindsight this is a lesson that there’s no such thing as a free lunch and that it invariably pays to monitor services at the application level. A few months ago we started observing a peculiar behaviour with our MongoDB infrastructure that led us down a complicated path of extensive diagnosis at both the MongoDB and application levels, iterative testing, a MongoDB Meetup, and ultimately a trivial fix to an obscure bug.
a life less ordinary
For the longest time we’ve been very happy with MongoDB for our open source Atlas product. We use replicasets with a secondary and an arbiter to keep the voting clean. Then, rather out of the blue, one day we found that our secondary had fallen out of sync with the primary. Fortunately, this did not directly impact service availability, but it did impair performance and response times for some use cases. When this all began, we were not actively monitoring the replication lag or replication status, owing to some bugs in the Nagios-MongoDB plugin we were using at the time, so the cause was not entirely obvious. Our first clue that something was wrong was a knock-on impact on other services relying on our Atlas backend. Typically, this was the only aspect of our Mongo infrastructure that we weren’t monitoring; I swiftly hacked the Nagios plugin, and we were then able to monitor and alert on replication issues more closely. (The plugin has since been fixed, and we updated our version once things settled down.)
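For anyone in a similar spot, replicaset health and lag can also be checked directly from the mongo shell, without any plugin; the commands below are the standard shell helpers, run against a live replicaset:

```javascript
// Run in the mongo shell, connected to any member of the replicaset
rs.status()                      // state, health and optime of every member
db.printSlaveReplicationInfo()   // on the primary: how far behind each secondary is
```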
pointing the finger
Our initial suspicion fell on the global lock on the primary stealing cycles from the oplog replication. Atlas is a growing product and we regularly add new features, including large tasks which iterate over the majority of items in our 40GB database and write results throughout. We found that whenever we stopped this task, our replication lag would slowly catch up. Although lag should typically never be more than a couple of seconds, we suspected that our use case for Mongo might not be a conventional one. The first indication was our observation that the oplog for replication, at 10GB, would contain as little as 60 minutes of data during its busiest periods. Since this failed to meet our disaster recovery requirements, we first extended the oplog to 20GB (yes, 50% of our database size). This helped to a degree, but ultimately we still had sync issues and our replication was slow. We also looked into optimising the processing tasks that put strain on the DB, but this too yielded little gain.
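The replication window described above can be read straight off the oplog with the shell helpers; a quick sketch of the check:

```javascript
// Run on the primary in the mongo shell
db.printReplicationInfo()   // prints oplog size and the time span it covers
db.getReplicationInfo()     // the same data as an object (e.g. timeDiffHours)
```

If the reported window is shorter than the time it would take to rebuild a failed secondary, you are in the same disaster recovery trouble we were.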
phoning a friend
An extensive search of discussion forums regarding replication lag and secondaries going out of sync proved fruitless. To investigate further we turned our sights back to the MongoDB instances and enabled the 10gen MMS extended monitoring service so we could better visualise how the hosts were being used. While very insightful, it could not conclusively correlate our scheduled jobs with the moments the database fell out of sync. Having reached the point where we needed professional advice, we sought out the regular MongoDB Meetup in London to discuss our issues with 10gen. Sadly the meeting was not entirely helpful, as the engineer assigned to help us might have been the ‘new guy’ and knew little about the nature of replication or low-level operations. He did, however, casually mention Capped Collections, and this led me to research our next angle of diagnosis: bugs.
a glimmer of hope
After another round of searching discussion forums, now armed with my new keywords, I discovered a few threads discussing replication sync issues, with brief mentions of Capped Collections in replicasets. It turns out Capped Collections behave very oddly in replicasets, almost as though they weren’t designed with replication in mind. By default, Capped Collections do not create an index on _id, primarily, in the developers’ view, for performance reasons. The documentation stresses that _id collisions on capped collections can cause replication to stall or halt, but it falls short of recommending indexes for Capped Collections when using replication, and neglects to discuss the impact of replicating a Capped Collection without an index. We checked, and could not see an index on our Capped Collection(!).
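Checking for the missing index is a one-liner in the mongo shell; here `events` is just a stand-in for your actual capped collection name:

```javascript
db.events.isCapped()     // confirm the collection really is capped
db.events.getIndexes()   // without a default _id index, this list may well be empty
```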
wait, it was that easy?
We responded by manually adding an index on the primary. This was simple enough, and there’s handy documentation on the MongoDB website on how to do it. Unfortunately, the index does not get replicated to a new or existing secondary. This was particularly troublesome when recovering from a stale secondary: we’d have to perform the initial resync, then remove the secondary from the replicaset, add the index, and let it catch up. That an index on a Capped Collection is seemingly required for replicasets, combined with the conflicting behaviour around how Capped Collections are created and replicated, is in my view a poorly implemented feature at best and a bug at worst. It’s a better compromise to accept the performance impact of an index than to risk a secondary being unable to replay the oplog. Below is the moment we added our index and restarted the secondary. Yeah…
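For reference, the fix itself amounts to a single shell command (again using `events` as a stand-in collection name; on recent MongoDB versions `createIndex` supersedes `ensureIndex`):

```javascript
// On the primary, and manually on each (re)synced secondary
db.events.ensureIndex({ _id: 1 })   // build the missing _id index
db.events.getIndexes()              // verify it now appears
```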
The suggestion in MongoDB discussions is that, without an index, the secondary must scan the full collection in order to apply each change. This makes the process very slow, especially when the oplog contains a long series of writes to the collection. Given how the lag eventually recovered, this behaviour is consistent with what we saw.
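To make that cost concrete, here is a toy model in plain JavaScript (nothing to do with MongoDB’s actual internals) of replaying N oplog entries against a collection of N documents, with and without an _id index:

```javascript
// Toy model of oplog replay cost: without an _id index, applying each update
// means a linear scan of the collection; with one, it is a direct lookup.

function replayWithoutIndex(docs, ops) {
  let comparisons = 0;
  for (const op of ops) {
    for (const doc of docs) {            // full collection scan per operation
      comparisons++;
      if (doc._id === op._id) { doc.value = op.value; break; }
    }
  }
  return comparisons;
}

function replayWithIndex(docs, ops) {
  const index = new Map(docs.map(d => [d._id, d]));  // stands in for the _id index
  let lookups = 0;
  for (const op of ops) {
    lookups++;
    index.get(op._id).value = op.value;              // direct lookup, no scan
  }
  return lookups;
}

const N = 1000;
const docs = Array.from({ length: N }, (_, i) => ({ _id: i, value: 0 }));
const ops  = Array.from({ length: N }, (_, i) => ({ _id: i, value: 1 }));

console.log(replayWithoutIndex(docs.map(d => ({ ...d })), ops)); // 500500 comparisons
console.log(replayWithIndex(docs.map(d => ({ ...d })), ops));    // 1000 lookups
```

With 1000 documents and 1000 updates the unindexed replay does half a million comparisons versus a thousand lookups; scale that to our 20GB oplog of writes and the catastrophic lag makes sense.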
how to lose an index and alienate secondaries
The larger question is how this all started. We suspect it was as simple as a failure on our secondary database requiring a rebuild from scratch. We designed our infrastructure with resilience and redundancy in mind, so the loss of a secondary should be a routine incident: we would simply bring up a new secondary and sync over the data. That sync would not have replicated the index on the Capped Collection, which must originally have existed. Without it, replication on the Capped Collection would have slowed dramatically, and the rest is history. From our understanding, the Capped Collection index sync bug will not be fixed until the next release of Mongo, due within the next few months. In the meantime we have adjusted our disaster recovery documentation to require the manual creation of the index on the secondary.
That’s not to say our Atlas processing tasks can’t abuse MongoDB and make it fail to replicate in other weird and wonderful ways. What we resolved was a cunning issue of slow replication. We’ve since experienced a puzzling situation of catastrophic lag while one of our jobs runs, and then an almost immediate recovery if we kill the job. As it happens, we found that some dirty data had crept into our system causing a bit of a feedback loop of epic proportions. This significantly increased the volume of reads and writes, meaning that replication stood no chance of doing anything while the job ran.
the moral of the story
It’s not just bugs and ‘features’ that can give you headaches with MongoDB, something as simple as poorly formatted or inconsistent data can be terrifying. Since implementing indexes on our Capped Collection (we only had one), and cleaning up the data problem, replication lag on our secondary is now infinitely better.