In 2021, AWS Athena and QuickSight are actually good for analyzing your mountain of ELB logs

KP
3 min read · Feb 23, 2021

If you’ve looked at Athena in the past and ran away screaming due to the hacks required to manage date partitions (e.g. a Lambda running every day), I have good news.

Now it takes only one Athena statement to set up a table for your date-partitioned logs, and you are done. No more care and feeding afterward. And best of all, it doesn't require you to change your S3 paths to use strange syntax like "/year=YYYY/". See the example below for good-old-fashioned ELB logs (the TBLPROPERTIES block at the end shows the new partition projection feature):

CREATE EXTERNAL TABLE default.prod_elb_logs (
timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code int,
backend_response_code int,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
PARTITIONED BY (
datepart STRING
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (-|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://BUCKET/Logs/AWSLogs/ACCOUNT/elasticloadbalancing/us-west-2/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.datepart.type" = "date",
"projection.datepart.range" = "2020/01/01,NOW",
"projection.datepart.format" = "yyyy/MM/dd",
"projection.datepart.interval" = "1",
"projection.datepart.interval.unit" = "DAYS",
"storage.location.template" = "s3://BUCKET/Logs/AWSLogs/ACCOUNT/elasticloadbalancing/us-west-2/${datepart}"
)
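Under the hood, partition projection means Athena computes the candidate partition values itself, instead of reading them from a catalog you have to keep updated. A rough sketch in Python of the path expansion it performs for the table above (BUCKET and ACCOUNT are the same placeholders as in the DDL):

```python
from datetime import date, timedelta

def projected_locations(start: date, end: date, template: str):
    """Expand ${datepart} over each day in [start, end], the way
    daily partition projection resolves partitions to S3 prefixes."""
    day = start
    while day <= end:
        yield template.replace("${datepart}", day.strftime("%Y/%m/%d"))
        day += timedelta(days=1)

template = "s3://BUCKET/Logs/AWSLogs/ACCOUNT/elasticloadbalancing/us-west-2/${datepart}"
for loc in projected_locations(date(2021, 2, 21), date(2021, 2, 23), template):
    print(loc)
```

This is why the "projection.datepart.format" of yyyy/MM/dd matters: it must match the directory layout ELB already writes, so no S3 paths need to change.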

With just a small bit of awkward date_format syntax, we can dynamically query the last 7 days, and Athena will only scan the S3 files for that date range:

select datepart, count(*) from default.prod_elb_logs
where datepart >= date_format(current_date - interval '7' day, '%Y/%m/%d')
group by 1
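The date_format call just builds a string like 2021/02/16, which compares lexicographically against the projected datepart values; that works because yyyy/MM/dd is zero-padded. A quick sanity check of the equivalent computation in Python (a sketch, not something Athena needs):

```python
from datetime import date, timedelta

# Same cutoff string that date_format(current_date - interval '7' day, '%Y/%m/%d') builds
cutoff = (date.today() - timedelta(days=7)).strftime("%Y/%m/%d")
print(cutoff)
```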

I like to select just the ELB errors and load them into a QuickSight SPICE dataset (which has a max of 250 million rows) for easy analysis. QuickSight imports directly from Athena every night; no data warehouse is required. Tip: use yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z' as the date format in QuickSight so it can parse the timestamp string field.
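That format string matches the ISO-style timestamps classic ELB writes, with six fractional-second digits. A quick check of the pattern in Python (the sample timestamp is made up for illustration; %f covers the microseconds):

```python
from datetime import datetime

ts = "2021-02-23T12:34:56.123456Z"  # illustrative ELB-style timestamp
parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
print(parsed.isoformat())
```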

Did I mention our ELB logs are in a separate account from Athena? Not a problem. But you will probably run into a confusing S3 feature (object ownership and ACLs) that prevents your bucket policy from working. It can be fixed: change the bucket's Object Ownership setting to "Bucket owner preferred" to fix new objects, and fix the old objects like this:

# Change old ELB logs to be owned by our account
s3uri="s3://BUCKET/Logs/AWSLogs/ACCOUNT/elasticloadbalancing/us-west-2/2021/01/"
aws --profile=prod s3 cp "$s3uri" "$s3uri" \
--metadata x-amz-meta-updated=1 --recursive \
--acl bucket-owner-full-control

So there you have it. Cross-account, easy analysis of ELB logs with multiple AWS tools. Now all Amazon has to do is update their ALB and ELB documentation, because as of Feb 2021 they are still recommending the hacky Lambda approach for partitioning. Don’t they know their products actually work together now? :)

KP

Professional rider of the technology hype-cycle since 1999.