An approach to refactoring a legacy codebase

Design by Kumkum Jain

Cleaning up and moving legacy code is like spring cleaning your home. It eliminates clutter and creates a less chaotic, more streamlined environment with more logical boundaries. But it’s not easy, and it comes with many hidden challenges. To move a tangled codebase, you need to identify and break dependency cycles, identify which parts of the system talk to other systems, and keep your dependencies aligned with your architectural vision. Most importantly, you need to get to a point where the team can easily reason about the architecture, has fewer dependencies in the way of delivering features, and can move fast independently.

We recently worked on separating the handling of grievances and refunds from our Order Management Service (OMS) into a new microservice. Grievances are any concerns raised by our customers, and refunds are monetary refunds for orders. Until now, these had been coupled with OMS because the team managing OMS also owned grievances and refunds (Conway’s Law). The objective of this refactor was to enable my team to work on grievance and refund handling independently. We wanted to rapidly run experiments on grievances to make the overall support experience better for our customers. Because we did not control the existing codebase, delivering those outcomes involved hand-offs with the team maintaining OMS (for activities like code review, deployment, etc.), which increased wait times and took away our independence to innovate and experiment quickly.

For starters, as a developer, you must know how to deep dive into unknown codebases. When you work on a new codebase, there are a lot of specifics that you don’t know: why some of the code is written in a certain way, why one design pattern was chosen over another, whether all the critical business logic is documented, and so on. To understand how the codebase came to be what it is today, you need to ask the right set of questions.

Keep notes, create diagrams, understand the key workflows, naming conventions, call hierarchies, code structure, design patterns used, dependent services, etc. This will give a basic idea of what the code is doing.

Some questions that we found were useful to ask

  • What specific part of the service are you going to separate? You won’t be working with the entire codebase of your existing service (hopefully). There is most likely a specific part of it that you need to refactor. You should figure out those specifics first. Try to establish the boundary of the domain that you intend to work with and its impact on the system.
  • Why do you need to move the codebase? What is the impact of the migration? For example, our objective was to reduce our dependency on other teams as much as possible. We decided to measure how often we got blocked on other teams while delivering a user story in the previous quarter and compare it with the next quarter.
  • Who are the current maintainers of the project? Knowing the right set of people to whom you can reach out with your queries is a blessing. It also really helps to align them with your objectives so that your refactoring initiative gets more active support from the current maintainers. After all, it also reduces their overhead and responsibility.
  • Who are the stakeholders involved in this refactor? For example, for us it was the CRM business team, the OMS engineering team, the cart platform team and the payments platform team. Not all of them were included in every detail of the project. But knowing the difference helped us decide who to reach out to for help and who to keep informed about progress.
  • What are the parts of the current system which are dependent on the code you are going to move? This is the right question to ask once you have clarity about what code you are working with. You will need to understand which other parts of the current service are directly or indirectly coupled with the code that you need to move out. For example, the existing architecture for certain features may not be relevant any more (performance, reliability, etc.) after the service boundaries have changed. Figure out when and why the clients are calling the system you are dealing with and how you will avoid breaking those dependencies.
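
The last question can be partly answered mechanically once you have a dependency map. Here is a minimal sketch, assuming you have already extracted a module-level graph (by hand or with a static analysis tool). The module names and the `dependents_of` helper are hypothetical, not taken from our codebase.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: each key depends on every module in its list.
DEPENDS_ON = {
    "orders.api": ["orders.core", "grievances.api"],
    "grievances.api": ["grievances.core"],
    "grievances.core": ["orders.core", "refunds.core"],
    "refunds.core": ["payments.client"],
    "cart.sync": ["orders.core"],
}


def dependents_of(targets, depends_on):
    """Return modules outside `targets` that depend on them, directly or transitively."""
    # Invert the graph so we can walk from a module to whoever uses it.
    reverse = defaultdict(set)
    for module, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(module)

    seen = set()
    queue = deque(targets)
    while queue:
        current = queue.popleft()
        for parent in reverse[current]:
            if parent not in seen and parent not in targets:
                seen.add(parent)
                queue.append(parent)
    return seen


moving_out = {"grievances.core", "refunds.core"}
print(sorted(dependents_of(moving_out, DEPENDS_ON)))
# ['grievances.api', 'orders.api']
```

Every module in the result is a caller you have to keep working after the move, so it is a reasonable starting list for conversations with the maintainers.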

Create a lot of diagrams along the way

Being able to visualize the system helps a lot in having conversations and collaborating with the people in your team, other teams dependent on your project and stakeholders. We created a couple of diagrams to visually represent the architecture, design, and implementation of grievance and refunds in OMS and exhibit a high-level flow.

For example, we had to map out the entire logic for performing all the validations and checks before creating a grievance in OMS in a diagram (see below) to fully understand what was going on.

This is one of the many diagrams that we created. While most details are redacted for security reasons, this diagram can be used to understand the complexity of the system and its implications on the refactoring.

Tip: Try using the functionality you are moving from a customer’s point of view. It will help you navigate dependencies easily.

By now you should have enough clarity about everything you need to know before planning the migration.

Planning the refactor and unplugging of dependencies in phases

We divided the entire code separation activity into 6 phases with specific timelines and scope.

1. Refactoring the code

Define the translation of modules. For example, Module X may have a structure or naming convention that you want to redefine in the new microservice.

Don’t just copy-paste code from the old service to the new one. Look at which parts of the current service you can improve, and try to reduce coupling and increase cohesion. Read up on refactoring techniques that you can apply within the time frame you have.

You should try to identify whether the existing logic needs refactoring and whether that can be done without changing the existing flow for other clients. For example, in our case, we had implemented price-related grievance handling logic that would update the actual selling price in the database for a particular order without keeping a log of that change, which made auditing transactions very difficult for our support and finance teams. In our new microservice, we made sure that every change is recorded as an individual transaction instead of mutating an old transaction.
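
To make the difference concrete, here is a minimal sketch of the append-only idea. It illustrates the principle rather than our actual data model: the `PriceAdjustment` and `OrderLedger` names and their fields are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class PriceAdjustment:
    """One immutable record per change (hypothetical schema)."""
    order_id: str
    amount: Decimal  # negative values reduce what the customer pays
    reason: str
    created_at: datetime


class OrderLedger:
    def __init__(self, order_id, selling_price):
        self.order_id = order_id
        self.original_price = Decimal(selling_price)  # never mutated
        self._adjustments = []

    def record_adjustment(self, amount, reason):
        """Append a new transaction instead of overwriting the selling price."""
        entry = PriceAdjustment(
            self.order_id, Decimal(amount), reason, datetime.now(timezone.utc)
        )
        self._adjustments.append(entry)
        return entry

    @property
    def effective_price(self):
        """Derive the current price from the original price plus all adjustments."""
        return self.original_price + sum(
            (a.amount for a in self._adjustments), Decimal("0")
        )

    def history(self):
        """Read-only view of every adjustment, for support and finance audits."""
        return tuple(self._adjustments)


ledger = OrderLedger("ORD-42", "250.00")
ledger.record_adjustment("-30.00", "price mismatch grievance")
print(ledger.effective_price)  # 220.00
```

Because the original price and each adjustment are kept separately, anyone auditing the order can see what changed, when, and why.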

Tip: We kept adding #TODO while refactoring till the testing phase whenever there was some unresolved dependency. It helped us with revisiting some pending decisions and issues that had to be fixed before the deployment.

2. Writing test cases

You must know what is expected of the code and you have to find where the new code is not meeting those expectations. Refactoring is a process of changing the current application in such a way that it improves the internal structure without affecting its external behaviour (mostly).

We wanted to make sure that existing behaviours don’t break. We did quick behaviour testing by comparing the API contracts and comparing inputs and outputs of the new service. This helped us establish that for end users our application continues to work as earlier and the behaviour of the application is preserved.

Allocate a good amount of time to testing. Approach the test cases from as many angles as possible. While we did some behaviour testing by writing cheap one-off scripts in the interest of time, you can try writing automated behaviour-driven tests and make it a practice.
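
A one-off behaviour comparison script can be as small as the sketch below. It is not the script we ran: the ignored field names and the two stand-in service functions are assumptions, and in practice `call_old` and `call_new` would wrap HTTP calls to the respective endpoints.

```python
def normalise(payload, ignored_keys=("request_id", "timestamp")):
    """Drop fields that are expected to differ between the two services."""
    if isinstance(payload, dict):
        return {
            key: normalise(value, ignored_keys)
            for key, value in payload.items()
            if key not in ignored_keys
        }
    if isinstance(payload, list):
        return [normalise(item, ignored_keys) for item in payload]
    return payload


def compare_behaviour(cases, call_old, call_new):
    """Return one entry per input for which the two services disagree."""
    mismatches = []
    for case in cases:
        old = normalise(call_old(case))
        new = normalise(call_new(case))
        if old != new:
            mismatches.append({"input": case, "old": old, "new": new})
    return mismatches


# Stand-ins for the legacy and new endpoints (hypothetical responses).
def old_service(case):
    return {"order_id": case["order_id"], "refundable": True, "request_id": "a1"}


def new_service(case):
    return {"order_id": case["order_id"], "refundable": True, "request_id": "b2"}


print(compare_behaviour([{"order_id": "ORD-1"}], old_service, new_service))  # []
```

An empty list means the new service preserved the behaviour for every recorded input. Anything else is a concrete case to investigate.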

We added unit tests which verify the correctness of each unit. Unit testing improves the overall quality of the code and helps surface edge cases before you write integration tests. After writing test cases, always check the code coverage and try to maximise it. There is no fixed lower or upper limit for code coverage, as it depends on your code structure and the criticality of the functionality. We were able to achieve ~65% code coverage.

3. Validate dependent services

This is the part where you have to figure out how other internal services are dependent on the code that you’re separating. This can be one of the most painful parts of the process. You will have to go through a lot of documentation of other services. And if there is not enough documentation, this is where talking to service maintainers would help.

A strategy that we used was the segregation of read and write endpoints. Segregating reads and writes essentially helped with the deployment and also scaling throughput for reads. This is how we went about migrating everything:

  • We first started the data migration process to migrate the data from the existing database to a new database for the new microservice (we will talk about this later in detail)
  • Then we updated all the read endpoints to read from the new database
  • And finally, we updated the write endpoints to start writing to the new database.
Reads and writes on the existing legacy service
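
The three steps above can be sketched as a small router that decides which database an endpoint should use in each phase. This is an illustration under assumptions, not our production code: the `Phase` and `DatabaseRouter` names and the database labels are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    SYNC_ONLY = 1     # data is being copied; the old database serves everything
    READS_MOVED = 2   # reads use the new database, writes still go to the old one
    WRITES_MOVED = 3  # reads and writes both use the new database


class DatabaseRouter:
    def __init__(self, old_db, new_db, phase=Phase.SYNC_ONLY):
        self.old_db = old_db
        self.new_db = new_db
        self.phase = phase

    def for_read(self):
        """Database that read endpoints should query in the current phase."""
        return self.new_db if self.phase >= Phase.READS_MOVED else self.old_db

    def for_write(self):
        """Database that write endpoints should update in the current phase."""
        return self.new_db if self.phase >= Phase.WRITES_MOVED else self.old_db


router = DatabaseRouter(old_db="oms-db", new_db="grievance-db")
router.phase = Phase.READS_MOVED
print(router.for_read(), router.for_write())  # grievance-db oms-db
```

Note that in the middle phase, reads from the new database are only as fresh as the synchronisation between the two databases, so sync lag is worth watching before moving on.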

4. Logging

To achieve full observability, logs should be well-structured and appropriately leveled. There are a few tips for logging which we used while adding logs because it really helps when you have to monitor your application afterward. Without structured logging, it can be difficult to identify what’s happening when something goes wrong, especially debugging infrequent bugs.

  • Use JSON log messages because it’s a structured format. It is both readable and compact, and can be easily queried.
  • Categorise log levels (Trace, Debug, Info, Error, Fatal). With careful use of log levels, you can make it easier for existing tools to search through logs and find the most relevant information. Adding a trace ID to the logs also helps you trace individual flows.
  • Write meaningful log messages.
  • Add enough context to your log messages, otherwise, logs won’t add much value. For example, a log message with enough context looks like:
  • Include the stack trace when logging exceptions.

Lastly, don’t log too much or too little.

Here is an example of what we consider a useful log message. This is a screenshot from Loki, which parses the structured fields and makes them more readable in the UI.
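
In code, the logging tips above could look like the following minimal sketch, which uses only Python’s standard `logging` module. The service name, the `context` convention and the field names (`trace_id`, `order_id`, `refund_amount`) are assumptions for illustration, not the fields we actually log.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured context passed by the caller via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            # Include the stack trace when logging exceptions.
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name, stream=None):
    """Create a logger that emits JSON lines (stream=None means stderr)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("grievance-service")  # hypothetical service name
context = {"trace_id": "trace-123", "order_id": "ORD-42", "refund_amount": "30.00"}
logger.info("refund initiated", extra={"context": context})

try:
    raise ValueError("payment gateway rejected the refund")
except ValueError:
    logger.exception("refund failed", extra={"context": context})
```

Each line is one JSON object, so a log backend can index fields such as `trace_id` and `order_id` instead of searching free text.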

5. Deployment

Deployment is the most critical phase of a migration project. It raises questions like: how should we orchestrate the deployment of changes across multiple dependent services (are there any potential cyclic dependencies), how will we handle the dependent services during deployment, and can we deploy all the services without any downtime? There were two deployment strategies that we considered:

  1. Parallel (incremental deployment and testing)
  2. Hard Switch

The parallel strategy meant keeping the old codebase running and deploying the new one alongside it, then shifting the traffic gradually to the new service by first moving all the read endpoints (less critical) and then moving the write endpoints.

The hard switch strategy meant switching all the traffic to the new service at once. This would have required a solid rollback strategy in case something didn’t work out.

Both strategies had their pros and cons. We went with the parallel strategy because it is not a make-or-break approach and it helped us reduce the blast radius. If something didn’t work out, we could easily move the traffic back to the old service.
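
We shifted traffic endpoint by endpoint. If you also want to make the shift gradual within an endpoint, one common technique is a deterministic percentage rollout. The sketch below shows that technique as an assumption for illustration, not as something described in our rollout. The `routes_to_new_service` helper and the choice of an order ID as the routing key are hypothetical.

```python
import hashlib


def routes_to_new_service(request_key, rollout_percent):
    """Deterministically send a fixed share of keys to the new service.

    The same key always lands in the same bucket, so raising the percentage
    only adds traffic to the new service and never flips a key back.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent


order_ids = [f"ORD-{n}" for n in range(1000)]
moved = sum(routes_to_new_service(order_id, 10) for order_id in order_ids)
print(f"{moved} of {len(order_ids)} orders routed to the new service")
```

Setting the percentage back to 0 sends everything to the old service again, which is the quick rollback path the parallel strategy relies on.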

Steps involved in deployment

Capacity Planning: If you are adding the migrated code to an existing and more relevant microservice, then your service will start handling more requests, and you might need to scale up your service, the database instances, etc. Look at current CPU and memory utilization, database connections and network throughput, and plan the scaling based on how much each of them is expected to increase after deployment.
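
A back-of-the-envelope version of that estimate is sketched below. All the numbers are made up for illustration, and the `instances_needed` helper assumes that load scales roughly linearly with requests per second, which you should verify against your own metrics.

```python
import math


def instances_needed(current_rps, added_rps, rps_per_instance, target_utilisation):
    """Estimate the instance count for the combined load.

    `target_utilisation` is the fraction of an instance's measured capacity
    you are willing to use, leaving the rest as headroom for spikes.
    """
    usable_rps = rps_per_instance * target_utilisation
    return math.ceil((current_rps + added_rps) / usable_rps)


# Hypothetical figures: 400 rps today, 200 rps arriving with the migrated code,
# each instance measured at 100 rps, and we want to run at 50% utilisation.
print(instances_needed(400, 200, 100, 0.5))  # 12
```

The same arithmetic applies to database connections or network throughput if you substitute the relevant per-instance limit.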

Data migration: You might not need this step if there is no database migration but in our case, it was required.

We needed to start migrating the data to the new database while still writing to the old database, so we needed to establish a data synchronization process between the two databases. We used AWS DMS to sync data in real time from the old database to the new database, and kept it running for a few days until the migration and monitoring of the entire code were done in production.

After all the data was synchronized in the new database, the parallel deployment strategy ensured a smooth deployment by first incrementally updating the read endpoints and then the write endpoints.
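
Before switching reads over, it is worth checking that the two databases really agree. The sketch below shows one simple way to do that by comparing per-row fingerprints. It is an illustration, not the check we ran: the `diff_tables` helper and the sample rows are hypothetical, and real rows would come from queries against both databases.

```python
import hashlib


def row_fingerprint(row):
    """Hash a row's columns in a stable order so equal rows hash equally."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tables(old_rows, new_rows, key="id"):
    """Report rows missing from the new database or differing from the old one."""
    old_index = {row[key]: row_fingerprint(row) for row in old_rows}
    new_index = {row[key]: row_fingerprint(row) for row in new_rows}
    missing = sorted(old_index.keys() - new_index.keys())
    mismatched = sorted(
        k for k in old_index.keys() & new_index.keys()
        if old_index[k] != new_index[k]
    )
    return {"missing_in_new": missing, "mismatched": mismatched}


old_rows = [
    {"id": 1, "amount": "10.00"},
    {"id": 2, "amount": "20.00"},
    {"id": 3, "amount": "5.00"},
]
new_rows = [
    {"id": 1, "amount": "10.00"},
    {"id": 2, "amount": "25.00"},
]
print(diff_tables(old_rows, new_rows))
# {'missing_in_new': [3], 'mismatched': [2]}
```

For large tables you would run this per batch or per date range rather than loading everything into memory.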

6. Monitoring

Monitor all the affected services constantly after the deployment. You need to have traceability for failures. We use Grafana for service health monitoring, Loki for logs, and New Relic for application performance monitoring.

Having a ready list of Loki queries helped with debugging production issues quickly whenever they happened (and they will happen):

  • Query for listing all the 5xx responses
  • Query for listing all the 4xx responses
  • Queries for particular failing cases or based on log messages or tags you’ve added in your code.
  • Query for tracing individual requests using a Request ID, Transaction ID or a critical entity such as Order ID or User ID.
  • Our refactoring project was related to refunds, so we formed queries around refund amounts credited to the customers. For example, how much did we refund for a user or in a day? Was the refund correctly processed as we expected? We didn’t want to accidentally withhold refunds that a customer was eligible for, but at the same time we didn’t want to give out unnecessary refunds.
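
The refund questions in the last point can also be answered outside the log UI once logs are structured. Here is a minimal sketch that totals refunds per user and per day from JSON log lines. It is not one of our Loki queries: the `refund_credited` event name and the `user_id`, `amount` and `timestamp` fields are assumptions about the log schema.

```python
import json
from collections import defaultdict
from decimal import Decimal


def refund_totals(log_lines):
    """Sum credited refunds per user and per calendar day."""
    per_user = defaultdict(Decimal)  # Decimal() starts each total at 0
    per_day = defaultdict(Decimal)
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "refund_credited":
            continue  # skip unrelated log lines
        amount = Decimal(event["amount"])
        per_user[event["user_id"]] += amount
        per_day[event["timestamp"][:10]] += amount  # YYYY-MM-DD prefix
    return dict(per_user), dict(per_day)


sample_logs = [
    '{"event": "refund_credited", "user_id": "U1", "amount": "30.00", "timestamp": "2021-03-01T10:15:00Z"}',
    '{"event": "refund_credited", "user_id": "U1", "amount": "15.00", "timestamp": "2021-03-02T09:00:00Z"}',
    '{"event": "grievance_created", "user_id": "U2", "timestamp": "2021-03-02T09:05:00Z"}',
    '{"event": "refund_credited", "user_id": "U2", "amount": "5.50", "timestamp": "2021-03-02T11:30:00Z"}',
]
per_user, per_day = refund_totals(sample_logs)
print(per_user["U1"], per_day["2021-03-02"])  # 45.00 20.50
```

Comparing these totals with what the business rules say a customer was eligible for is a quick way to spot both missed and unnecessary refunds.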

Conclusion

The decision to move from legacy is a hard one but we took this decision to bring agility in our team, to own our problems end-to-end and to be able to take product and technical decisions independently and execute changes fast. This refactor was painful, but it has truly enabled us to own our problems along with positive side effects such as resolution of nasty bugs that have been in production for years. A great benefit has been that we are not just able to meet our short term goals more efficiently but we can also define the vision of the code base along with the vision of the product. We are better enabled to maintain the health of the code base in the long term.

This is our story of how we approached refactoring service boundaries in our microservices architecture. And most likely, there are many approaches to do this. If you have done this somewhere, we’d love to hear from you and learn about your approach.

Tools we used

Here are some of the tools we used in this project and found useful for any kind of large technical change:

  • Hackmd.io for notes.
  • Draw.io for diagrams.
  • Google Sheets for project management.
  • Confluence for documentation, recording details and architecture decisions.
  • This useful extension in VSCode helps visualize TODOs.

Ekta Garg is a Software Engineer at Grofers. Follow her on Twitter.

We are hiring across various roles! If you are interested in exploring working at Grofers, we’d love to hear from you.


An approach to refactoring a legacy codebase was originally published in Lambda on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Grofers