Problem overview
Jobandtalent manages the lifecycle of thousands of jobs: from the requirements, to the recruitment, to the management of the contract, to its end. We even try to accompany workers to their next opportunity.
One of the key metrics in this industry is the length of the contracts, as it is proportional to the revenue per contract, which is weighed against the fixed and variable costs of managing it.
However, contract length is not an easy KPI to measure, because we do not know in advance when a contract will end (contracts are often renewed many times, and they can also be shortened). Therefore, if we were interested in the average length of the contracts that started this month, we would have to wait until all of them have finished to finally know the true value, which can take a few years! Can we make a good prediction instead?
The data team was tasked with tackling this challenge.
In the following sections, we detail our approach that mixes elements of survival analysis and Bayesian inference. In the end, we implemented an algorithm that provides an accurate prediction, continuously updating it as more information becomes available.
Survival analysis is the branch of statistics interested in modeling the time to an event. In this case, we are measuring the time between the creation (“birth”) and the end of a contract (“death”). As previously mentioned, the difficulty of the task is due to the fact that for all ongoing contracts, this data is missing. Those observations are defined as “censored” because at this point in time we don’t know yet when they will end.
The most important tool in survival analysis is the survival function. It is a curve along a time axis that displays, for a given time, the proportion of the population that is expected to still be “alive”. This curve tells us all we need to know about the length of the “lives” of the population. In fact, it can easily be shown that this curve is simply S(t) = 1 − CDF(t), where CDF is the cumulative distribution function of the random variable T representing the lifetime. Our problem is thus solved if we can estimate the survival function of T.
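This relationship is easy to check numerically. A minimal sketch with SciPy; the shape k and scale λ values here are made up for illustration:

```python
import numpy as np
from scipy.stats import weibull_min

# Illustrative Weibull parameters: shape k and scale lam (not fitted to real data)
k, lam = 1.3, 90.0
t = np.array([30.0, 90.0, 180.0])  # days

sf = weibull_min.sf(t, k, scale=lam)               # survival function S(t)
one_minus_cdf = 1 - weibull_min.cdf(t, k, scale=lam)

print(np.allclose(sf, one_minus_cdf))  # → True: S(t) = 1 - CDF(t)
```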
We chose to create a parametric model, since we will need to predict the shape of the survival function well beyond the observed timeframe. However, this requires us to correctly specify a distribution.
The usual tool for selecting a distribution is a Q-Q plot, but this is generally not appropriate for censored data. Thankfully, the lifelines Python library contains routines that take censoring into account and still create Q-Q plots.
From the images above we can say that, among the distributions most used in survival analysis, the Weibull distribution fits best. It is an approximation though, and we will have to take this into account when interpreting the results. The Weibull distribution is characterized by two parameters, here called k and λ. λ is called the scale parameter because it controls how close to 0 the mass of the PDF (probability density function) is. k is called the shape parameter because it changes what the PDF looks like near 0.
We would like to see the evolution over time of the length of the contract. It was decided that a month was the right granularity level, so we grouped the contracts by their month of creation. For each calendar month, we want to compute the average lifetime of the contracts that started during that month.
The easiest approach is to consider all of those populations independent. For each month, we can provide the lifelines API with the censored and uncensored lengths and it will return the values for λ and k that are most likely to generate those observations. However, we found that this approach created very unstable historical predictions and a very unreliable prediction for recent months. We needed a model that would include our knowledge that one month cannot be that different from the next. We also needed a model that would tell us how much confidence we can have in its predictions. Probabilistic Programming met our needs very well.
Probabilistic programming languages (or in our case, PyMC3, a Python library) simplify the creation of probabilistic models and the inference of the parameters of the model. In simple terms, we can specify our model:
T ~ Weibull(λ, k)
Then, we need to tell our model the durations of the finished contracts we have observed:

finished contract durations (in days): [50, 110, 33,…]
In PyMC3, these two things can be done in a single line of code:
pm.Weibull('T_observed', alpha=k, beta=lambda_, observed=[50, 110, 33,…])
That was simple! Now, the idea is to let PyMC3 use its algorithm* to tell us what values of λ and k can explain our observations. It will not return a single value, but a long list of samples taken from the joint posterior distribution of λ and k. We can just run:

pm.sample()
This was the core of the model, but it is missing a few parts. First, we should specify a prior for the parameters λ and k. Second, we should also include the unfinished contracts! They contain essential information: we would vastly underestimate the average length of the contracts if we ignored them. For a contract that is 40 days old, we know that its total lifetime will be greater than 40 days. The probability of this happening is given by the survival function with parameters k and λ: sf(40, k, λ). For PyMC3 to take this into account, we have to pass the log of this probability to pm.Potential:
pm.Potential('T_censored', weibull_log_sf(censored_observations, alpha=k, beta=lambda_))
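The log survival function of a Weibull with shape α and scale β has a simple closed form, log S(t) = −(t/β)^α, which is easy to verify against SciPy (the parameter values here are made up):

```python
import numpy as np
from scipy.stats import weibull_min

def weibull_log_sf(t, alpha, beta):
    # log of the Weibull survival function: log S(t) = -(t / beta) ** alpha
    return -(np.asarray(t) / beta) ** alpha

alpha, beta = 1.3, 90.0             # illustrative shape and scale
t = np.array([40.0, 100.0, 250.0])  # ages of ongoing contracts, in days

print(np.allclose(weibull_log_sf(t, alpha, beta),
                  weibull_min.logsf(t, alpha, scale=beta)))  # → True
```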
Finally, we need to define all of the above within the context of a model (with pm.Model():). Here is a simple version of our model:
So far we have defined a model that can infer the Weibull distribution for a given period, in the form of a trace of samples for the pair (λ, k). Since we have samples for λ and k, we can easily compute the theoretical average for each pair and get a credible interval on the actual average. Next, we will see how we can make predictions of neighboring months dependent on each other.
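For a Weibull, the theoretical average is available in closed form as E[T] = λ·Γ(1 + 1/k), so each posterior sample of (λ, k) yields one candidate average, and percentiles over those give a credible interval. A sketch with stand-in posterior samples (the numbers are made up; in practice the samples come from the trace):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import weibull_min

# Stand-ins for posterior samples of (λ, k); in practice these come from pm.sample()
rng = np.random.default_rng(2)
lam_samples = rng.normal(90.0, 5.0, size=4000)
k_samples = rng.normal(1.3, 0.1, size=4000)

means = lam_samples * gamma(1 + 1 / k_samples)  # E[T] = λ·Γ(1 + 1/k), per sample
low, high = np.percentile(means, [2.5, 97.5])   # 95% credible interval on the average
print(low, high)

# Sanity check of the closed form against SciPy, for one (λ, k) pair
print(np.isclose(90.0 * gamma(1 + 1 / 1.3), weibull_min.mean(1.3, scale=90.0)))  # → True
```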
Extending the model is relatively straightforward. We use a modeling trick (explained in more detail by Thomas Wiecki in his blog post). Essentially, we take the series of log(λᵢ) (representing the scale parameter for Tᵢ, the lifetime of contracts of month i), and tell the model our prior assumption that they follow a random walk with a certain standard deviation sigma. With this prior, log(λᵢ) can be positive or negative, but λᵢ = exp(log(λᵢ)) will always be positive, which the Weibull distribution requires.
In effect, it penalizes the posterior where λᵢ is very different from λᵢ₊₁. “Very different” depends on sigma, which is itself a parameter learned from the data (with its own prior).
From the previous section, we get estimates for λᵢ and k. We can use them directly to compute the “underlying” mean lifetime of the contracts of each month. Alternatively, we can use them to simulate only the lifetimes of the contracts that are still ongoing (sampling from the conditional distribution P(T|T>time_until_today)). By averaging over many of those simulations we can get an estimate of the average length we will actually observe.
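Sampling from the conditional distribution P(T | T > t₀) is straightforward for a Weibull with the inverse-transform method, since P(T > t | T > t₀) = S(t)/S(t₀). A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_conditional_weibull(t0, k, lam, n):
    """Draw total lifetimes from P(T | T > t0) for Weibull(shape k, scale lam).

    Setting S(t)/S(t0) = u with u ~ Uniform(0, 1) and inverting gives
    t = lam * ((t0/lam)**k - ln u)**(1/k), which always exceeds t0.
    """
    u = rng.uniform(size=n)
    return lam * ((t0 / lam) ** k - np.log(u)) ** (1 / k)

# A hypothetical ongoing contract, 40 days old, with one illustrative (k, λ) draw
samples = sample_conditional_weibull(40.0, 1.3, 90.0, 100_000)
print(samples.min() > 40.0)  # → True: every simulated lifetime exceeds the current age
print(samples.mean())        # Monte Carlo estimate of E[T | T > 40]
```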
Since our approach is based on creating realistic simulations, we can generalize it to compute not only the expected average length but any KPI based on the length of the contracts (such as the average length without outliers).
We have worked in close collaboration with the Strategy team, one of the main stakeholders for this metric. We had to make sure we covered their requirements and that they understood our approach.
The flexibility of our approach continued to pay off while we quickly iterated to create the KPIs that would fit their needs (actionable and unbiased by outliers). The Bayesian approach offers many advantages in this regard.
*to read more about MCMC, a technique to get samples from the posterior distribution, a good starting point is: http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/
Predicting contract length with probabilistic programming was originally published in Jobandtalent Engineering on Medium.