Design by Kumkum Jain
Legacy code clean-up and migration is like spring cleaning your home: it eliminates clutter and creates a less chaotic, more streamlined environment with clearer boundaries. But it's not easy! It comes with many hidden challenges. To move a tangled codebase, you need to identify and break dependency cycles, work out which parts of the system talk to other systems, keep your dependencies aligned with your architectural vision and, most importantly, get to a point where the team can easily reason about the architecture, deliver features with fewer dependencies, and move fast independently.
We recently worked on separating the handling of grievances and refunds from our Order Management Service (OMS) into a new microservice. Grievances are basically any concerns raised by our customers, and refunds are monetary refunds for orders. Until now, these were coupled with OMS because the team managing OMS also owned grievances and refunds (Conway's Law). The objective of this refactor was to enable my team to work on grievance and refund handling independently. We wanted to rapidly run experiments on grievances to make the overall support experience better for our customers. Because we lacked control over the existing codebase, delivering those outcomes involved hand-offs with the team maintaining OMS (for activities like code review, deployment, etc.), which increased wait times and took away our independence to innovate and experiment quickly.
For starters, as a developer, you must know how to deep dive into unknown codebases. When you work on a new codebase, there is a lot you don't know: why some of the code is written a certain way, why one design pattern was chosen over another, whether all the critical business logic is documented, and so on. To understand how the codebase came to be the way it is today, you need to ask the right set of questions.
Keep notes, create diagrams, and understand the key workflows, naming conventions, call hierarchies, code structure, design patterns used, dependent services, etc. This will give you a basic idea of what the code is doing.
Being able to visualize the system helps a lot in having conversations and collaborating with the people in your team, other teams dependent on your project and stakeholders. We created a couple of diagrams to visually represent the architecture, design, and implementation of grievance and refunds in OMS and exhibit a high-level flow.
For example, we had to map out the entire logic for performing all the validations and checks before creating a grievance in OMS in a diagram (see below) to fully understand what was going on.
Tip: Try using the functionality you are moving from a customer’s point of view. It will help you navigate dependencies easily.
By now, you should have enough clarity on everything you need to know to plan the migration.
We divided the entire code separation activity into 6 phases with specific timelines and scope.
Define how modules translate to the new service. For example, Module X may have a structure or naming convention that you want to redefine in the new microservice.
Don't just copy-paste code from the old service to the new one. Look for parts of the current design you can improve: try reducing coupling and making it more cohesive. Read up on refactoring techniques you can apply within the provided time frame.
You should try to identify whether the existing logic needs refactoring and whether it can be done without changing the existing flow for other clients. For example, in our case, we had implemented a certain price-related grievance handling logic that would update the actual selling price in the database for a particular order without keeping a log of that change, which made auditing transactions very difficult for our support and finance teams. In our new microservice, we made sure that every change is recorded as a new, individual transaction instead of mutating an old one.
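The append-only approach above can be sketched roughly as follows. This is a minimal illustration, not the actual OMS code; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PriceAdjustment:
    """An immutable ledger entry; negative amounts represent refunds."""
    order_id: str
    amount: float
    reason: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class OrderLedger:
    def __init__(self, order_id: str, selling_price: float):
        self.order_id = order_id
        self._entries = [PriceAdjustment(order_id, selling_price, "initial sale")]

    def record_refund(self, amount: float, reason: str) -> None:
        # Append-only: existing entries are never mutated, so the full
        # history stays available for support and finance audits.
        self._entries.append(PriceAdjustment(self.order_id, -abs(amount), reason))

    @property
    def effective_price(self) -> float:
        # The current price is derived from the history, not stored.
        return sum(e.amount for e in self._entries)

    @property
    def history(self):
        return tuple(self._entries)
```

The design choice here is that the "selling price" is never overwritten; it is always computed from the transaction history, so every adjustment remains auditable.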
Tip: While refactoring, we added a #TODO whenever we hit an unresolved dependency, and kept doing so right up to the testing phase. This helped us revisit pending decisions and issues that had to be fixed before deployment.
Refactoring is the process of changing an application so that its internal structure improves without (for the most part) affecting its external behaviour. You must know what is expected of the code, and find where the new code does not meet those expectations.
We wanted to make sure that existing behaviours don't break. We did quick behaviour testing by comparing the API contracts and comparing the inputs and outputs of the new service against the old one. This helped us establish that, for end users, the application continues to work as before and its behaviour is preserved.
Allocate a good amount of time to testing, and approach the test cases from as many angles as possible. While we did some behaviour testing by writing cheap one-off scripts in the interest of time, you can try writing automated behaviour-driven tests and make that a practice.
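A one-off comparison script like the ones mentioned above might look roughly like this. The base URLs, endpoint paths, and the set of volatile fields are all placeholders; the idea is simply to fetch the same resource from both services and diff the normalized responses.

```python
import json
from urllib.request import urlopen

OLD_BASE = "http://oms.internal"         # hypothetical old-service URL
NEW_BASE = "http://grievances.internal"  # hypothetical new-service URL

def normalize(payload: dict) -> dict:
    # Drop fields that are expected to differ between services
    # (timestamps, hostnames, trace ids) before comparing.
    volatile = {"timestamp", "served_by", "trace_id"}
    return {k: v for k, v in payload.items() if k not in volatile}

def compare(path: str) -> bool:
    """Fetch the same path from both services and report mismatches."""
    with urlopen(OLD_BASE + path) as r1, urlopen(NEW_BASE + path) as r2:
        old = normalize(json.load(r1))
        new = normalize(json.load(r2))
    if old != new:
        print(f"MISMATCH {path}: {old} != {new}")
    return old == new
```

Running `compare()` over a recorded sample of production request paths gives a cheap check that the behaviour of the two services matches.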
We added unit tests to verify the correctness of each unit. Unit testing improves the overall quality of the code and surfaces edge cases before you write integration tests. After writing test cases, always check the code coverage and try to maximize it. There is no fixed lower or upper limit for code coverage; it depends on your code structure and the criticality of the functionality. We were able to achieve ~65% code coverage.
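To illustrate the kind of edge cases unit tests surface, here is a hedged sketch. `validate_refund_amount` is a hypothetical stand-in for real validation logic, not code from the actual service; the point is to cover zero, negative, boundary, and over-refund inputs explicitly.

```python
def validate_refund_amount(requested: float, order_total: float,
                           already_refunded: float) -> bool:
    """A refund is valid if positive and within what is still refundable."""
    if requested <= 0:
        return False
    return requested <= order_total - already_refunded

# Edge cases: zero, negative, exact boundary, exceeding the remainder.
assert validate_refund_amount(0, 100, 0) is False
assert validate_refund_amount(-5, 100, 0) is False
assert validate_refund_amount(100, 100, 0) is True   # exact boundary is allowed
assert validate_refund_amount(60, 100, 50) is False  # exceeds refundable remainder
assert validate_refund_amount(50, 100, 50) is True
```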
This is the part where you have to figure out how other internal services are dependent on the code that you’re separating. This can be one of the most painful parts of the process. You will have to go through a lot of documentation of other services. And if there is not enough documentation, this is where talking to service maintainers would help.
One strategy we used was segregating read and write endpoints. Segregating reads and writes simplified the deployment and also let us scale read throughput independently, and it shaped how we went about migrating everything.
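The read/write segregation can be sketched as a simple method-based router used during the migration window: reads (which are less critical) are cut over to the new service first, while writes stay on OMS until the reads are verified. The service names and flags below are illustrative, not from the original system.

```python
# Migration flags: flip READS_MIGRATED first, verify in production,
# then flip WRITES_MIGRATED once reads are stable.
READS_MIGRATED = True
WRITES_MIGRATED = False

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}  # read-only HTTP methods

def route(method: str) -> str:
    """Return the backend that should handle a request of this method."""
    is_read = method.upper() in SAFE_METHODS
    migrated = READS_MIGRATED if is_read else WRITES_MIGRATED
    return "grievance-service" if migrated else "oms"
```

With this shape, cutting over reads and writes becomes two independent, reversible steps rather than one big switch.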
To achieve full observability, logs should be well-structured and appropriately leveled. We followed a few logging tips while adding logs, and they really helped when we had to monitor the application afterwards. Without structured logging, it can be difficult to work out what's happening when something goes wrong, especially when debugging infrequent bugs.
Lastly, don’t log too much or too little.
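A minimal structured-logging sketch using only the Python standard library is shown below. The logger name and context fields (`order_id`, `grievance_id`) are illustrative; the point is to emit one JSON object per line with searchable fields rather than interpolating context into the message string.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge in any structured context attached via extra={"ctx": {...}}.
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("grievance-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Context goes in fields, not in the message text, so log queries can
# filter on it directly.
log.info("grievance created",
         extra={"ctx": {"order_id": "ORD-42", "grievance_id": "G-7"}})
```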
Deployment is the most critical phase of a migration project, and it raises questions like: how should we orchestrate the deployment of changes across multiple dependent services (are there any potential cyclic dependencies)? How will we handle the dependent services during deployment? Can we deploy all the services without any downtime? There were two deployment strategies we considered:
The parallel strategy meant keeping the old codebase running and deploying the new one alongside it, then gradually shifting traffic to the new service: first the read endpoints (less critical), then the write endpoints.
The hard-switch strategy meant switching all traffic to the new service at once. This would have required a solid rollback strategy in case something didn't work out.
Both strategies had their pros and cons. We went with the parallel strategy because it is not a make-or-break approach and it helped us reduce the blast radius. If something went wrong, we could easily move the traffic back to the old service.
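One common way to implement this kind of gradual shift at the edge is weighted routing in a reverse proxy. The fragment below is a hedged sketch using nginx's `split_clients` module (upstream names, hosts, and the traffic percentage are placeholders, not details from the original post):

```nginx
upstream oms_legacy        { server oms.internal:8080; }
upstream grievance_service { server grievances.internal:8080; }

# Send 10% of requests to the new service; key on $request_id so the
# split is random but stable per request. Raise the percentage as
# confidence grows, and drop it back to 0% to roll back instantly.
split_clients "${request_id}" $grievance_backend {
    10%     grievance_service;
    *       oms_legacy;
}

server {
    listen 80;
    location /grievances {
        proxy_pass http://$grievance_backend;
    }
}
```

Rolling back is then a one-line config change rather than a redeployment.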
Capacity planning: If you are adding the migrated code to an existing, more relevant microservice, that service will start handling more requests, and you might need to scale up the service, the database instances, etc. Look at current CPU and memory utilization, database connections, and network throughput, and based on how much each will increase after deployment, plan the scaling.
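A back-of-the-envelope estimate for the scaling described above might look like this. The formula and numbers are illustrative (not from the original post): scale the replica count so that projected CPU utilization stays under a target ceiling.

```python
import math

def required_replicas(current_replicas: int, current_cpu_util: float,
                      traffic_multiplier: float, target_util: float = 0.6) -> int:
    """Replicas needed so per-replica CPU stays under target_util
    after traffic grows by traffic_multiplier."""
    projected = current_cpu_util * traffic_multiplier
    return math.ceil(current_replicas * projected / target_util)

# e.g. 4 replicas at 45% CPU, expecting 1.5x traffic, targeting 60% CPU:
print(required_replicas(4, 0.45, 1.5))  # -> 5
```

The same shape of calculation applies to database connections and network throughput: project the post-migration load, divide by the per-instance headroom you want, and round up.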
Data migration: You might not need this step if there is no database migration but in our case, it was required.
We needed to start migrating data to the new database while still writing to the old one, so we had to establish a data-synchronization process between the two databases. We used AWS DMS to sync data in real time from the old database to the new one, and kept it running for a few days until the migration and production monitoring of the entire code path were complete.
Once all the data was synchronized in the new database, the parallel deployment strategy ensured a smooth rollout: first we incrementally cut over the read endpoints, then the write endpoints.
Monitor all the affected services constantly after the deployment. You need traceability for failures. We use Grafana for service health monitoring, Loki for logs, and New Relic for application performance monitoring.
Having a ready list of Loki queries helped with debugging production issues quickly whenever they happened (and they will happen):
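For illustration, a ready list of Loki queries for this kind of migration might look like the LogQL sketches below. The label and field names (`app`, `order_id`) are deployment-specific assumptions, not queries from the original post:

```logql
# All error lines from the new service in the selected time range
{app="grievance-service"} |= "ERROR"

# Parse JSON logs and filter on a structured field
{app="grievance-service"} | json | order_id="ORD-42"

# Error rate over 5-minute windows, useful for dashboards and alerts
sum(rate({app="grievance-service"} |= "ERROR" [5m]))
```

Keeping these pre-written (and pre-tested against real log lines) saves precious minutes during an incident.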
The decision to move away from legacy code is a hard one, but we took it to bring agility to our team, to own our problems end-to-end, and to be able to take product and technical decisions independently and execute changes fast. This refactor was painful, but it has truly enabled us to own our problems, with positive side effects such as the resolution of nasty bugs that had been in production for years. A great benefit is that we are not just able to meet our short-term goals more efficiently; we can also define the vision of the codebase along with the vision of the product, and we are better placed to maintain the health of the codebase in the long term.
This is our story of how we approached refactoring service boundaries in our microservices architecture. And most likely, there are many approaches to do this. If you have done this somewhere, we’d love to hear from you and learn about your approach.
Here are some of the tools we used in this project that we found useful for any kind of large technical change: