By Francisco Santos and Sofia Gomes Fraud is a complicated subject. It takes on many shapes and is constantly evolving over time. While the most common fraud patterns are easily caught by our state-of-the-art models in a few milliseconds, the most complex fraud and anti-money laundering (AML) schemes involve many intertwined attacks and parties, and require […]
Fraud is a complicated subject. It takes on many shapes and is constantly evolving over time. While the most common fraud patterns are easily caught by our state-of-the-art models in a few milliseconds, the most complex fraud and anti-money laundering (AML) schemes involve many intertwined attacks and parties, and require complementary tools to defend our clients.
Genome is one of such tools available on the Feedzai stack. Feedzai Genome is an innovative new product that helps fraud analysts and investigators find and visualize complex financial crime patterns using link analysis.
Our customers are loving Genome as much as we love developing it. We recently launched Genome v2, which substantially improves upon Genome v1. Among its core improvements is the new graph storage system, where we extensively improved the performance of our graph queries.
With this complete overhaul, Genome V2 focuses on the ability to activate customers with larger data profiles, delivering massive leaps in performance and improving usability.
How did we do that? Spoiler alert: We built our own graph database! In this post we explain why and how we did it.
As with any Minimum Viable Product (MVP), to build Genome v1 fast and cheap, we reused some components of our existing Feedzai stack. As a result, Genome v1 storage was built upon another Feedzai’s component database system. Although not ideal, this was necessary to showcase Genome, demonstrate its value, and work with our clients to better understand the product that we needed to build.
Genome v1 was a major success in detecting complex fraud patterns and uncovering important fraud rings. The biggest challenge? It couldn’t scale, and some of the most important operations were significantly slow, even for small graphs.
From the beginning, we knew that we would need a different storage system to support Genome’s front-end. At Feedzai, we always try to avoid reinventing the wheel, and build upon the best solutions available.
We realized that some of our slowest queries were Graph Queries — e.g., expanding a node to find its neighbors (which can be conditional to some node or relation property) or calculating the degree of a node.
So the obvious step was to look into the landscape of the native Graph Databases (Graph DBs). We explored many solutions, including:
To test them, we start by defining what were the most common operations and queries we needed to support. They included the graph queries described above, but also other general queries like finding specific nodes (using ID or another specific property) or finding all properties of a given node (by ID).
Next, we used these queries to assess the systems, and a small dataset with around 9.2 million nodes, 8.6 million edges and 100 million properties. We were disappointed to learn that none of the solutions was suitable for our intended use case.
Neo4j is probably the most popular graph storage system today. It is built in Java and open source. It is a full-fledged Graph DBMS supporting ACID transactions, multiple users’ roles, authorization policies, and a robust query language.
Regardless of this limitation, we still decided to test it on a single node, but the results were disappointing. Even on a smallish dataset, conditionally expanding a node could take around 1.5–3s, and calculating the degree of a node could take over 3s. Less graph-like operations such as querying all properties of a given type of a node id could take around 9s.
After the disappointing performance of Neo4j, we tested RedisGraph, a module from RedisLabs based upon the amazing GraphBlas library, which treats most graph operations as linear algebra. We tested RedisGraph v1.2 which claims to be 600x faster than other solutions.
The results were indeed much better than Neo4j. Conditionally expanding would take 0.3–1.3s, and calculating the degree of a node around 0.6s. Our other query, finding all properties of a given type of a node would now take 0.5s.
Although the results were good, RedisGraph had a major limitation: it could not scale horizontally. Since RedisGraph, like Redis, requires loading all data to memory, this is a serious problem. Even with very efficient memory management, this would require very powerful and expensive machines, and even on those, we couldn’t fit several use cases.
On top of that, we felt that the solution still lacked maturity as we had multiple issues and instability problems. In the meantime, RedisGraph launched v2, which might solve some of these issues and make RedisGraph even faster. However, its core problem — scalability — remains unresolved.
We left the most promising solution to be tested last. JanusGraph is a distributed, open source, graph database built on Java. On the functional side, it ticked all the boxes, so we were very excited to try it out. There are multiple backends that can be plugged with JanusGraph like HBase, BigTable or Cassandra. We tested it using Cassandra.
Since we knew that JanusGraph was our most promising option, we spent some weeks learning how to optimize JanusGraph for our queries. JanusGraph outperformed RedisGraph v1.2 on several of them. Conditionally expanding a node would take from 20–50ms, and finding the degree of a node around 7ms.
However, we found out that the other less graph-like operations, such as simple searches, hampered our performance and would frequently cause timeouts. Search is a pretty big deal for us, since it is the main way fraud analysts begin a Genome session. We tried to overcome this, but the best solutions required us to use additional components, making our solution extremely costly to maintain and develop.
At this stage we were confused as to how to move forward. We could not accept that there was no graph storage solution ready for us to use. So we set out to learn how other companies were addressing this same problem. This is what we found:
As you can see, all these companies developed their own Graph DB solution, using well-known technologies. Each company knows its own use-cases inside and out, so it makes sense that each one could develop a great Graph DB for themselves. This observation guided us into the next step in our journey: building our own Graph DB solution.
To build our Graph DB, we looked into Genome’s two types of operations. The first is searches for specific nodes and edges. The second is graph operations for conditional expansions and node degree calculations. Fraud analysts often start a Genome session with a search then use graph operations to keep exploring.
As a result, we defined a data model optimized for these two operations. On one hand, it can answer queries regarding nodes (our entities and properties), and edges (how nodes are related to each other, and their properties). On the other hand, this data model with edges and nodes, enables us to perform graph operations like expansions very quickly.
Similar to LinkedIn, Twitter, or Facebook, we chose an RDBMS as the underlying engine for our Graph DB. In our case, we use PostgreSQL, since it is already used in the Feedzai stack. We also explored Cassandra as a possible alternative, but it provided much less flexibility for general search queries, and since our graph operations also require additional queries to retrieve properties, the performance for most graph operations was generally worse.
Our solution in PostgreSQL — Genome V2 — takes raw data and transforms it into aggregated data, stored into several tables containing nodes, and edges. In addition, we split nodes and edges per type, to support data sharding and scalability.
In summary, Genome V2 is faster than Genome V1 in all categories. It is up to 37 times faster than Genome V1 in terms of just search operations and up to 131 times faster than Genome V1 in terms of graph operations.
Native graph databases are nice general solutions that have gained a lot of popularity over the past few years. While these solutions have evolved tremendously, they still lack some maturity and performance over big datasets, which forced us to build our own graph storage, tailored specifically for the needs of our application.
With the development of new graph databases, this landscape is likely to further evolve in the future. Regardless of the tools, we will continue to develop products like Genome, and do what’s best to deliver the best performance and features to our clients.
In perspective, whereas in Genome V1 we could only handle only 1 week’s worth of data for a standard bank, we can now comfortably handle over 6 months of data, and several billion nodes and edges. This unleashes a new world of possibilities for Genome, empowering our clients to explore and understand better fraud patterns that were not visible before.
Empowering Fast Graph Visualizations for Fraud Detection (or Why We Built Our Own Graph Database) was originally published in Feedzai Techblog on Medium, where people are continuing the conversation by highlighting and responding to this story.