Every day, Addepar’s data pipeline consumes portfolio data from hundreds of different custodians and imports it into millions of nodes and edges in our Financial Graph. To ensure that the data can be trusted and used to correctly calculate performance, it’s crucial that we not only import it in a timely fashion, but also run verification checks against what already exists on our platform. The data is complex, and data integrity check failures can stem from many different issues: missing data, incorrect raw data from a custodian, incorrect transaction mapping, problematic validation logic, or incorrect security mapping.
As our platform grows, we’re continually improving and extending our methods. The following is an example of how we addressed one type of verification problem — node matching — by non-intrusively introducing a core change to a critical pipeline.
The Financial Graph represents the ownership structure of portfolio data, where a node represents an entity (such as a brokerage account, an asset owner, a legal entity, a stock, or a bond), and an edge (or a position, in financial terms) represents the ownership relationship between two nodes. The from node is often an account, and the to node refers to the portfolio’s holding, the actual security.
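The structure described above can be sketched as a small in-memory model. This is purely illustrative; the `Node` and `Edge` classes and their fields are assumptions, not Addepar’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """An entity in the graph: an account, stock, bond, legal entity, etc."""
    node_id: int
    node_type: str  # e.g. "account", "stock", "bond"
    name: str

@dataclass
class Edge:
    """A position: the ownership relationship between two nodes."""
    from_node: Node                                 # often an account
    to_node: Node                                   # the security actually held
    snapshots: dict = field(default_factory=dict)   # date -> units held

# An account holding a security, with one day of position data on the edge.
account = Node(1, "account", "Brokerage Account I")
security = Node(2, "stock", "FB")
position = Edge(account, security)
position.snapshots["2020-01-01"] = 100.0
```

Note that the position data lives on the edge, not on either node; this is why, as discussed below, matching incoming data to the correct edge matters more than the nodes themselves.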
When importing data, we need to identify the right security node for each holding and attach the incoming position data to the correct edge.
Ideally, the security nodes are unique and shared between the account nodes. After all, the edge is what references the client’s specific holding.
Addepar’s platform imports daily account portfolio data from hundreds of different data sources. Each source may represent each security slightly differently. When there isn’t enough information in the feed to identify the corresponding node for the security, the system treats it as a new security and automatically creates a new node. Clients may also edit the security attributes directly, which can also cause the system to create a new node. Over time, multiple nodes may represent the same security.
Having duplicate nodes isn’t great, but it’s not the end of the world. It has no impact on the accurate representation of the client’s portfolio. What matters here is the edge, where the position snapshots and transactions are stored; we need to be able to import data to the right edge every time. The real problem with duplicate nodes is that they increase the probability of the system matching the account’s holding to the wrong security node, breaking the time series. This in turn causes unit verification failures and performance calculation errors.
Above is an example in which Account I’s Facebook holding has time series data for three days. On Day 4, however, the holding shows only one day of position data after the import, because the wrong security node was identified and a new edge was created. The time series drops off, a failure mode we refer to internally as the Node Jumping problem.
Instead of searching for the from and to nodes first and then identifying the edge, what if the system could just match the edge directly? The benefit is that matching becomes very straightforward, and node matching is bypassed for the most part, minimizing the chance of node jumping. The challenge is that we’re introducing a new primary key that bridges two large internal systems. It’s a fundamental change to the data flow of Addepar’s daily portfolio data imports, and the adoption risk is very high. We took this challenge into account when deciding how to design and deliver this change.
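The edge-first matching idea can be sketched in a few lines. The `position_index` lookup table and the fallback callable here are hypothetical stand-ins for the real systems, assuming the index maps each Position ID to exactly one edge.

```python
def match_edge(position_id, position_index, fallback_node_match):
    """Return the edge for an incoming holding.

    Try the direct Position ID lookup first; only fall back to the
    older node-matching path when the ID is unknown.
    """
    edge = position_index.get(position_id)
    if edge is not None:
        return edge               # direct hit: node matching bypassed
    return fallback_node_match()  # legacy path for unmapped holdings

# Toy index: one known Position ID mapped to an existing edge.
index = {"CUST1:ACCT9:FB": "edge-42"}
hit = match_edge("CUST1:ACCT9:FB", index, lambda: "edge-new")
miss = match_edge("CUST1:ACCT9:GOOG", index, lambda: "edge-new")
```

Because the common case never touches node matching at all, a duplicate security node can no longer pull an existing holding onto the wrong edge.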
At a high level, the new workflow leverages the new Position ID: instead of matching nodes first, the system looks up the edge for each incoming holding directly by its Position ID, falling back to node matching only when no ID is available.
The Position ID is the unique identifier for each holding in a portfolio. Some portfolio data partners have this concept for us to leverage. Others require examining patterns in the feed data to come up with the best derived key. While we may know what combinations make up the best Position ID, the assumption still needs to be validated for the entire time series. The goal is to find the most stable combination in the time series. Position ID loses its meaning when it mutates too often because the process will just fall back to node matching. The combination that mutates the least over time is considered the strongest ID.
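Validating a derived key against the full time series can be sketched as a small scoring exercise: count how often each candidate field combination changes between consecutive days, and pick the one that mutates least. The field names and sample feed rows below are invented for illustration.

```python
from itertools import combinations

def mutation_count(rows, fields):
    """Count how often the derived key changes across consecutive days."""
    keys = [tuple(row.get(f) for f in fields) for row in rows]
    return sum(1 for a, b in zip(keys, keys[1:]) if a != b)

def strongest_id(rows, candidate_fields, size=2):
    """Pick the field combination with the fewest mutations over time."""
    candidates = combinations(candidate_fields, size)
    return min(candidates, key=lambda fields: mutation_count(rows, fields))

# One holding, three days of feed data; the symbol and description
# both churn, while the CUSIP stays stable.
feed = [
    {"cusip": "30303M102", "symbol": "FB",   "desc": "FACEBOOK INC A"},
    {"cusip": "30303M102", "symbol": "FB",   "desc": "FACEBOOK CL A"},
    {"cusip": "30303M102", "symbol": "META", "desc": "META PLATFORMS"},
]
best = strongest_id(feed, ["cusip", "symbol", "desc"])
```

A real validation would run over the whole history of every holding in a feed, but the principle is the same: the combination that mutates the least is the strongest ID.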
Ideally, Position ID would be a constant value, like a true primary key. In reality, Position ID mutates over time as security attributes change. This can happen for a variety of reasons, the primary drivers being data changes from the custodians and corporate actions. To accommodate this, Position ID persistence uses a temporal model in which each ID has a start and end date to track its validity. This is essential for historical imports and for replaying imports: as IDs change over time, the ability to hit the right edge each time is crucial. Each Position ID should be assigned to only one edge; there should never be a case where a Position ID points to more than one edge. To ensure data integrity, any conflicting ID assignment is invalidated with every insert.
The temporal model also provides a complete history of the Position ID (PID) assignment of an edge. This is particularly useful when debugging any mapping abnormalities.
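A minimal sketch of that temporal model is below: each assignment carries a validity window, inserting a new assignment closes out any open conflicting one, and an as-of lookup finds the edge that was valid on a given date. The `Assignment` and `PidStore` names and shapes are assumptions for illustration, not the production schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Assignment:
    pid: str
    edge_id: int
    start: date
    end: Optional[date] = None  # None means the assignment is still valid

class PidStore:
    def __init__(self):
        self.rows: list = []  # full assignment history, kept for debugging

    def assign(self, pid, edge_id, start):
        """Insert a new assignment, invalidating any open conflict.

        A PID must never point to two edges at once, so any open
        assignment of the same PID to a different edge is end-dated.
        """
        for row in self.rows:
            if row.pid == pid and row.end is None and row.edge_id != edge_id:
                row.end = start
        self.rows.append(Assignment(pid, edge_id, start))

    def edge_for(self, pid, as_of):
        """Return the edge the PID pointed to on a given date, if any."""
        for row in self.rows:
            if row.pid == pid and row.start <= as_of and (
                row.end is None or as_of < row.end
            ):
                return row.edge_id
        return None  # unknown PID: fall back to node matching

store = PidStore()
store.assign("CUST1:ACCT9:FB", 42, date(2020, 1, 1))
store.assign("CUST1:ACCT9:FB", 77, date(2020, 6, 1))  # PID moved to a new edge
```

Because closed-out rows are kept rather than deleted, `store.rows` doubles as the complete PID assignment history, which is what makes mapping abnormalities debuggable after the fact.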
There is a big difference between building a system from scratch and strategically evolving a running system. With the former, you have fewer constraints on the design, no concerns about backward compatibility, and, most importantly, no possibility of impacting a working solution. Often, though, we have to deal with the latter: making core changes to a critical pipeline, whether that’s migrating a database, adopting a new messaging infrastructure, or refactoring a large interface. When we do, we need to constantly think about the risk of each release and how we can leave the production system better than we found it.
For the Position ID release, the known and unknown risks were very high because the change could directly impact the integrity of the client’s portfolio. Before we could roll out this feature, we first needed to understand the worst-case scenarios.
Thorough unit test coverage and feature flagging go a long way toward ensuring the quality of a release. Another important aspect is test data coverage: how do we ensure the test data set is meaningful and covers all the edge cases? With those questions answered, we rolled this feature out in careful, deliberate steps.
Releasing this feature was a lengthy process, and it took a few iterations, but with these careful measures the release went out without any issues. We have been using Position IDs successfully to improve our data quality ever since.
Our data footprint is growing exponentially, so continually investing in, retrofitting, and upgrading our platform is a constant theme at Addepar. We balance these challenges with careful planning around known and unknown risks, always looking to learn from our mistakes and collaborating across teams of passionate engineers who enjoy solving tough problems. There will be more blog posts in this space about large-scale data migration, import, and performance challenges. Stay tuned!
Retrofitting Position IDs to Addepar Non-Intrusively was originally published in [email protected] on Medium.