Bad Analysis – How Data Migration Turned Zen Parable to Zeno’s Paradox

This will be my first addition to my #errlog diaries – a chronicling of different failures in my past. Before I begin, I’d like to assure the reader that this entire post isn’t just an excuse to use that title.  

This failure is localized to a single year-long project, and as such, I’m able to break it up into smaller, easy-to-understand parts.

  1. Preface
  2. Challenge
  3. Failure
  4. Consequence

Let’s dive right in.

PREFACE

The entirety of the firm’s data layer is based on Cassandra 3.0. The decision to use Cassandra stemmed from a few core characteristics of our system.

  • Required: Fast Insertions – The vast majority of inserts into our database are ‘time-series’, which is to say that inserts happen in the order of real-time events.
  • Required: Fast Seeks – We wanted constant-time fetching of data, given a sufficient set of query parameters.
  • Not Required: Immediate Consistency – All real-time relevant information in the middle of the trading day is communicated from one JVM to another via Java multicast. The data layer is for the vast ocean of analytics that consumes this data post hoc.
  • Not Required: Low Maintenance – We hosted the Cassandra installation on local NY4 data servers over in Secaucus, New Jersey. That’s light jogging distance from our New York headquarters. This means we could buy powerful machines close enough for us to service within an hour of any problem arising.

The above profile of requirements and non-requirements paints a pretty clear picture. We started off, as a company, trying to attack the US Equities market. So it made sense to have a local datacenter. It made sense to store a single piece of data several times in different tables so that we could have constant-time seek calls even if our queries were very different.
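
To make that concrete, here is a minimal sketch of the kind of query-first, denormalized modeling I’m describing, written against the DataStax Java driver. The keyspace, tables, and columns are hypothetical stand-ins for illustration, not Clearpool’s actual schema.

    // Hypothetical sketch: store the same fill twice, once per query path,
    // so that each lookup hits exactly one partition in constant time.
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class DenormalizedSchema {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
                Session session = cluster.connect();

                session.execute("CREATE KEYSPACE IF NOT EXISTS analytics WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

                // Copy #1: keyed by client + trade date, for "show me client X's day".
                session.execute("CREATE TABLE IF NOT EXISTS analytics.fills_by_client ("
                        + " client_id text, trade_date text, event_time timestamp,"
                        + " order_id text, payload blob,"
                        + " PRIMARY KEY ((client_id, trade_date), event_time, order_id))");

                // Copy #2: keyed by symbol + trade date, for "show me everything in MSFT today".
                session.execute("CREATE TABLE IF NOT EXISTS analytics.fills_by_symbol ("
                        + " symbol text, trade_date text, event_time timestamp,"
                        + " order_id text, payload blob,"
                        + " PRIMARY KEY ((symbol, trade_date), event_time, order_id))");

                // Every insert is written to both tables. Disk is cheap; seek time is not.
            }
        }
    }

Within a partition, rows are clustered by event_time, so a client’s trading day reads back in event order without any sorting on our side.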

I’m describing our legacy data layer in the ‘Preface’ section because the data layer is intricately tied to the application layer of all of Clearpool’s code.

Every single create, read, update, or delete statement in our codebase was written with our Cassandra setup in mind. But this becomes a problem when words like “global” and “scalable” come into view.

No one ever promised that our local Cassandra setup was going to be able to serve requests from Canada in an appropriate amount of time. Hell, Clearpool is even flirting with the idea of letting the clients ‘own’ the data in their own data silos.

And that doesn’t mesh at all with a local Cassandra database, right? How do we ‘give’ them their own data? How do we open our European Clearpool branch when each query has to travel across the Atlantic Ocean?

In comes the technical architect.

He’s got the answer to this problem, and his answer is magical. We’re going to migrate our entire data layer into the Amazon Cloud, and we’re going to set it up in such a way that any new customer can ‘spin up’ an instance of our analytics software and run with it.

CHALLENGE

Our current setup is a local Cassandra 3 cluster running a couple of miles away, in Secaucus, NJ. Now all we’ve got to do is migrate it all into the Amazon cloud. This can be our opportunity to assess inefficiencies in our data setup!

Thought 1: “We want to keep the ‘shape’ of our data generally the same.”
This means that we are looking for a NoSQL cloud database. Oh look, Amazon DynamoDB offers just what we want!

Objection! If we wanted to write the same volume of data into DynamoDB that we wrote into our local Cassandra, I’d probably have to start working pro bono, because the firm would be bleeding money into Amazon.

Thought 2: “We’ll port everything over from NoSQL to a relational database.”
I mean, this could be a really good thing! Our old data model involved serializing all objects using Google Protobuf before storing them into the Cassandra table. Serializing everything made inserts and reads super fast, but it came with a cost.

The cost: any query that didn’t hit a specific key needed to read entire chunks of data into the JVM, deserialize them into Java objects, and then apply filtering logic in application code. We had gotten used to it, but several of our developers would salivate at the thought of being able to run complex SQL statements against our data!
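
To illustrate that cost, here is a rough sketch of the read-deserialize-filter path against the hypothetical fills_by_client table from earlier. FillProto.Fill stands in for one of our generated Protobuf classes; its name and its getQuantity() accessor are invented for the example.

    // Sketch of filtering in the JVM: the partition key gets us to the right
    // chunk of data quickly, but any predicate that is not part of the key has
    // to be applied after deserializing every row in the partition.
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class ClientSideFilter {
        public static List<FillProto.Fill> largeFills(Session session, String clientId,
                                                      String tradeDate, long minQuantity)
                throws Exception {
            ResultSet rs = session.execute(
                    "SELECT payload FROM analytics.fills_by_client"
                    + " WHERE client_id = ? AND trade_date = ?", clientId, tradeDate);

            List<FillProto.Fill> matches = new ArrayList<>();
            for (Row row : rs) {
                // Pull the whole blob across the wire and deserialize it...
                ByteBuffer buf = row.getBytes("payload");
                byte[] bytes = new byte[buf.remaining()];
                buf.get(bytes);
                FillProto.Fill fill = FillProto.Fill.parseFrom(bytes);
                // ...just to decide whether we wanted the row at all.
                if (fill.getQuantity() >= minQuantity) {
                    matches.add(fill);
                }
            }
            return matches;
        }
    }

In SQL, that last if statement is a one-line WHERE clause the database evaluates before anything leaves the server, which is exactly what had those developers salivating.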

Objection! The producers and consumers of our data have become accustomed to Cassandra’s NoSQL features. A bait-and-switch under the covers that replaces Cassandra’s ‘column family’ with a Postgres ‘table’ is not even remotely close to smooth.

Example: the table definition is now forced to stay constantly up to date with the code that writes to it.

Let’s say an object has 3 fields in version 1, but gains a 4th field in version 2. Imagine, then, that Production is running version 1, whereas version 2 is still in Development. In Cassandra, you can still insert objects into the database using version 2 of the code, because Google Protobuf is forward and backward compatible as long as no one changes the semantic meaning of pre-existing fields.

In Postgres, however, inserting with version 2 will lead to a PSQLException complaining that the 4th column doesn’t exist.
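
Here is a hypothetical sketch of that version skew in plain JDBC. The orders table and its venue column are invented names: version 2 of the code knows about the fourth field, while the production table was created by version 1 and only has the original three columns.

    // Hypothetical illustration of the version-skew failure described above.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class SchemaSkew {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/analytics", "app", "secret")) {

                // With Cassandra, the version-2 object was simply serialized into the
                // same blob column version 1 used; the extra field rode along as an
                // unknown field that older readers silently ignored.

                // With Postgres, the INSERT itself names the new column, so the server
                // rejects the statement outright.
                String insert = "INSERT INTO orders (order_id, symbol, quantity, venue)"
                        + " VALUES (?, ?, ?, ?)";
                try (PreparedStatement ps = conn.prepareStatement(insert)) {
                    ps.setString(1, "ORD-1");
                    ps.setString(2, "MSFT");
                    ps.setLong(3, 100);
                    ps.setString(4, "NASDAQ");
                    ps.executeUpdate();
                } catch (SQLException e) {
                    // Roughly: ERROR: column "venue" of relation "orders" does not exist
                    System.err.println("Insert rejected: " + e.getMessage());
                }
            }
        }
    }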

I don’t mean to make the above challenges sound insurmountable. They’re not. Every project will face challenges, otherwise engineers wouldn’t be paid the kind of money we are. But the way we handled these challenges is what brings us to the next section.

FAILURE

Solving problems is our forte.

But we seem to be terrible at grasping how long the problem will take to solve. The classical approach to making an accurate estimate for a project is to break it down into smaller parts that more closely resemble previously completed projects. Then you estimate the smaller parts and add it all up.

Breaking up our data migration project into smaller parts wasn’t very difficult. Below is the general gist of how it came out. (Note: 1M = 1 man-month.)

Step 1: We went through this exercise and came to a 20 Man-Month estimate.

Step 2: The development team consisted of me and 2 other developers.

Step 3: So we did some very simple arithmetic: we took the 20M number above and divided it by 3, arriving at roughly 7 months for a 3-person team to finish this project.

Step 4: The architect of the project committed to finishing the Cloud project in 7 months.

Failure 1.a: I didn’t voice my immediate reservations about the timeline of this project. This was perhaps a subconsciously political move on my part because I was, after all, coming on to a new team and trying to make a good impression.

Failure 1.b: I didn’t bother going back through the assumptions under which this 20M estimate was made. The architect’s expertise in the Cloud space was a comfortable safety net that dulled my natural skepticism.

Failure 2: As new information arose that broke assumptions made during the estimation phase, both the architect and I chose to simply try to pick up the pace to meet deadlines, instead of formally publicizing that we had heavily underestimated the project.

“Measure twice, cut once.”

Who has time for all that measuring?

If you’re curious, here’s an example of an unexpected problem…
Postgres detects deadlock scenarios (transaction A waiting for a lock held by transaction B, while B is also waiting for A), but its way of handling them is to abort one of the transactions and hand the error back to the application to retry. This throws a wrench into the scalability of our database insertion infrastructure.
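
A minimal sketch of what that means for the insert path, with hypothetical table and connection details: Postgres reports the aborted side of a deadlock with SQLSTATE 40P01, and it is up to the application to catch that and replay the transaction.

    // Retry-on-deadlock sketch: Postgres picks a victim, rolls its transaction
    // back, and returns SQLSTATE 40P01; the application must run it again.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class DeadlockRetry {
        private static final String DEADLOCK_DETECTED = "40P01";
        private static final int MAX_ATTEMPTS = 3;

        public static void insertWithRetry(String orderId, long quantity) throws SQLException {
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:postgresql://localhost:5432/analytics", "app", "secret")) {
                    conn.setAutoCommit(false);
                    try (PreparedStatement ps = conn.prepareStatement(
                            "INSERT INTO orders (order_id, quantity) VALUES (?, ?)")) {
                        ps.setString(1, orderId);
                        ps.setLong(2, quantity);
                        ps.executeUpdate();
                    }
                    conn.commit();
                    return; // success
                } catch (SQLException e) {
                    if (!DEADLOCK_DETECTED.equals(e.getSQLState()) || attempt == MAX_ATTEMPTS) {
                        throw e; // not a deadlock, or out of retries
                    }
                    // We were the deadlock victim; our transaction is already rolled
                    // back, so loop around and try again.
                }
            }
        }
    }

Bolting retry loops like this onto every insert path, after the fact, is the kind of “unexpected problem” that quietly eats the schedule.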

CONSEQUENCE

Achilles races a tortoise, and you won’t believe what happens next!

Achilles and a tortoise set out on a race. And to even the odds, Achilles lets the tortoise get a head start.

Let’s assume that both Achilles and the tortoise are constantly moving towards the end goal.

This means that whenever Achilles runs to where the tortoise previously was, the tortoise has moved forward, because both of them are constantly moving. Which seems to imply that Achilles will never catch up to the tortoise, even though he’s obviously faster.

Every time he’s about to catch up, he’s just a tiny bit behind.

Several corners were cut during the design and analysis phase of the project. And because of this, the stakeholders were met with a string of excuses over a series of months explaining why the delivery was delayed yet again.

Predictably, this affected the confidence that the stakeholders had in the merit of the entire project. While the architect and I were finding meaningful bugs and uncovering genuine shortcomings in our architecture, people on the outside just saw missed deadlines.

Momentum and stakeholder confidence are everything.

And as with everything precious, they are hard to gain and easy to lose.
