Amazon recently released AWS Athena to allow querying large amounts of data stored at S3. This is built on top of Presto DB. Amazon releasing this service has greatly simplified a use of Presto I’ve been wanting to try for months: providing simple access to our CDN logs from Fastly to all metrics consumers at 500px. […]
Amazon recently released AWS Athena to allow querying large amounts of data stored at S3. This is built on top of Presto DB. Amazon releasing this service has greatly simplified a use of Presto I’ve been wanting to try for months: providing simple access to our CDN logs from Fastly to all metrics consumers at 500px.
Previously we investigated using Presto on an Elastic MapReduce (EMR) cluster to execute queries during our daily ETLs. We’d then load those queries’ outputs to Redshift for further analysis throughout the day. This was meant to avoid the cost of having an EMR cluster running all the time, or the latency of bringing up a cluster just for a single query. Athena allows us to avoid this additional cluster management as AWS is providing the always-on Presto cluster.
Some minor difficulties were encountered.
Query execution time at Athena can vary wildly. In my evening (UTC 0500) I found query times scanning around 15 GB of data of anywhere from 60 seconds to 2500 seconds (~40 minutes). During my morning tests I’ve seen the same queries timing out after only having scanned around 500 MB in 1800 seconds (~30 minutes). We’ll have to see if these become more stable over time.
Experimenting with table design brought up a couple issues as well. One of my tables with a RegexSerde took a little coaxing to nicely handle some minor log format changes we’ve done. This took awhile as the row format of a Presto table can’t be updated. Dropping the old table and creating a new one fails as it appears the table’s schema is cached for some time. This left me with creating the table with the new schema with a new name, generally just appending a number to it. Once I resolved issues with the schema I intended to simply rename the table to its final name. Again I ran into a minor difficulty: table renaming isn’t currently supported despite being documented. Oh well, I let the cached schema expire then created the table once again with its final name.
Schema resolved, I moved on to integrating with our metrics pipeline that we’ve built in Python using Luigi. This… posed some minor challenges of its own.
Athena currently only has two interfaces: the AWS web console and a JDBC driver. Making use of the JDBC driver from Python is possible with Baztian’s JayDeBeApi module, with a few minor tweaks. These were required as Athena’s JDBC driver doesn’t support passing some options as part of the URL but instead require they be passed as properties which were previously unsupported in JayDeBeApi. Along with this came a simple implementation of non-prepared statements — Athena does not support prepared statements yet. These tweaks are available at https://github.com/Melraidin/jaydebeapi
One further issue encountered while initially testing the JDBC connection was the seeming inability to query any tables outside the default DB. For now I’ve simply put our tables in the default DB and asked on the AWS dev forums; I’m sure this will be resolved soon (or I’m doing it wrong).
I received a response to the above issue in the linked thread. It seems that through the JDBC driver (possibly an artifact of how I’m using it from Python though) the DB name must be quoted, e.g.: "db"."table"
Querying then failed with an unexpected Java error seemingly related to type conversions. Trawling the dev forums again found a thread mentioning an issue with the decimal data type. Changing our decimal columns to doubles resolved the issue but isn’t ideal. I’m sure this will likely be fixed soon.
This brought us the ability to query Athena from Python. The last piece remaining were the base classes for Athena interactions with Luigi. These simply factor out boilerplate that would otherwise be required for Athena tasks. We have one to query Athena and store the results at S3:
One note: after querying Athena we remove any key at the result path at S3 ending with “.csv.metadata”. This is an object that is created by Athena that might be required for the Athena web console to properly display the results. It also causes problems when loading to Redshift as it will match our S3 path in the Redshift “copy” command causing a failure to load. See this discussion at the AWS forums.
After getting our results to S3 we only have to load them to Redshift. This base class makes use of an internal class to load data at S3 to Redshift but should give you an idea of the process:
What this all leads to is a task pipeline to load data such as logs from Fastly to Redshift that looks like this:
This pulls data from our partitioned Fastly table through Athena. We’re able to partition the Fastly table in a form suitable for use from Athena using some undocumented options for the Fastly log location. It supports at least some of the format characters from strftime() giving us hourly partitions to optimize our queries. Thanks to Aaron Peckham for this tip.
And so now we’re able to pull data from Fastly logs to Redshift for further analysis. This will let us take advantage of some of the more interesting data Fastly makes available in their logs, like geographic location.