Python Packaging: How we manage it at JobTeaser

Problem
Here at JobTeaser, we found ourselves duplicating a great amount of code between text cleaning operations for machine learning projects, AWS wrapper, and Kafka helper.
The solution we found to this trouble was to extract our reusable code into proper packages (correctly tested and documented), before pushing it to our own package server.
The current solution relies on an EC2 machine that runs a pypi server. The CI (here CircleCI) will build → test → push a new version of the package to the pypi server.
Once published, other projects can use the package in their CI process by requesting it from pypi server.
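One useful guard in such a build → test → push pipeline is checking whether a version is already published before pushing. The sketch below assumes the pypi server exposes a standard PEP 503 "simple" index page (as pypi servers typically do); the index URL and package name are placeholders, and `published_versions` is a hypothetical helper, not part of any tool mentioned here.

```python
# Sketch of a CI guard against a PEP 503 "simple" index page.
# Index URL and package name below are placeholders.
import re
from urllib.request import urlopen


def published_versions(simple_index_html: str, package: str) -> set[str]:
    """Extract version numbers for `package` from a simple-index HTML page.

    Filenames on a simple page look like `mypackage-1.2.0.tar.gz` or
    `mypackage-1.2.0-py3-none-any.whl`.
    """
    pattern = re.compile(
        rf">{re.escape(package)}-(\d+(?:\.\d+)*)[.-]", re.IGNORECASE
    )
    return set(pattern.findall(simple_index_html))


def is_already_published(index_url: str, package: str, version: str) -> bool:
    # e.g. index_url = "https://pypi.internal.example.com/simple"
    with urlopen(f"{index_url}/{package}/") as resp:
        html = resp.read().decode()
    return version in published_versions(html, package)
```

If the version already exists, the CI can fail fast instead of overwriting an artifact other projects may already depend on.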
Because our infrastructure already runs on AWS, we limited our search for a replacement to solutions that use S3 as a backend. After some time we found two solutions:
As you can see, the strongest point of the S3pypi solution is that it is straightforward.
It essentially uses CloudFront to expose S3 content in the same way as a static website.
But at JobTeaser, we manage all our AWS resources using Terraform, so we would have to take their templating and adapt it, not to mention that security relies solely on the CloudFront setup.
On the other hand, Pypicloud goes in the opposite direction with a more complex but modular solution.
Similarly to the current pypi solution, it requires a dedicated server, but it also needs external storage and a caching service. On the bright side, you can use any option you want for these components.
In the end the winner was Pypicloud: for our deployment, CloudFront's hidden complexity and the need to rely solely on AWS IAM were unacceptable.
First, we need to create an S3 bucket, with access rights granted to a single AWS user, in our case the Pypicloud server.
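To make the "single user" idea concrete, here is a minimal sketch of a bucket policy granting object access to one IAM user only. The account ID, user name, and bucket name are placeholders, and at JobTeaser this kind of resource would be managed through Terraform rather than hand-written JSON.

```python
# Sketch of an S3 bucket policy restricted to a single IAM user
# (the Pypicloud server). All identifiers are placeholders.
import json


def pypicloud_bucket_policy(bucket: str, account_id: str, user: str) -> str:
    principal = {"AWS": f"arn:aws:iam::{account_id}:user/{user}"}
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PypicloudObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "PypicloudList",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }
    return json.dumps(policy)
```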
One major change from the recommended AWS configuration is the use of a PostgreSQL RDS instance as the cache instead of DynamoDB.
We made this choice because, for internal use, DynamoDB's added complexity and cost were not worth it.
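For illustration, the relevant part of the Pypicloud server configuration could look like the sketch below, using Pypicloud's S3 storage backend and its SQL cache backend pointed at the RDS instance. Bucket name, region, host, and credentials are placeholders, and the exact key names should be checked against the Pypicloud documentation.

```ini
; server.ini (excerpt) -- placeholders throughout
[app:main]
pypi.storage = s3
storage.bucket = my-internal-packages
storage.region_name = eu-west-1

; SQL cache backed by the PostgreSQL RDS instead of DynamoDB
pypi.db = sql
db.url = postgresql://pypicloud:password@rds-host:5432/pypicloud
```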
Here’s the result:
In the meantime, we've actually migrated to Kubernetes, with our CI on Jenkins.
So here is the package publishing pipeline:
For package publication, we have a GitHub repository that contains the Pypicloud configuration, a Dockerfile, and the Kubernetes resources for the server container that will be deployed.
This repository also contains the other package projects, which are tested in the CI before being built and published using a dedicated user. As we saw before, each published package is sent to Pypicloud and stored on S3.
The Pypicloud URL is templated in the Dockerfile and set by Jenkins, so that the server can be reached during the container build phase.
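The templating step can be sketched as follows: a value injected at build time is substituted into the pip configuration baked into the image, so `pip install` falls back to the private index. The variable name, URL, and `/simple/` path are illustrative, not taken from the actual Dockerfile.

```python
# Sketch of rendering a pip.conf from a Pypicloud URL injected at build time.
# Template shape and variable names are illustrative placeholders.
from string import Template

PIP_CONF_TEMPLATE = Template(
    "[global]\n"
    "extra-index-url = ${pypicloud_url}/simple/\n"
)


def render_pip_conf(pypicloud_url: str) -> str:
    # Normalize a trailing slash so the rendered URL stays clean.
    return PIP_CONF_TEMPLATE.substitute(pypicloud_url=pypicloud_url.rstrip("/"))
```

Using `extra-index-url` rather than `index-url` keeps the public PyPI available for regular dependencies while the private packages resolve from Pypicloud.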
For the deployment of the Python projects, we have one repository per project; those repositories contain:
This pipeline doesn't require a Pypicloud user, because the Ingress configuration of the server simply rejects access from outside the Kubernetes cluster.
We needed a remote package service in Python, and we looked at three solutions:
In the JobTeaser context:
The winner was Pypicloud.
Python Packaging: How we manage it at JobTeaser was originally published in JobTeaser Tech on Medium.