Sources Contact Advanced Search Tutorials

An Interest In:

Web News this Week

Search Archive

Some of Our Sources

View All Sources

Help Webnuz

Referal links:

October 4, 2021 05:44 am GMT

Best practices for AWS Athena

In this blog I will mention some of the best practices recommended by AWS for building queries in Athena based on my experience and the following resources:

General Recommendations

Always use WHERE on partition field

This mainly in order to speed up time and cost.
For example:
Avoid:

select * from table1 where cast(col_1 as integer) = cast('201912' as integer) - 1

and prefer:

select * from table1 where col_particion = '201911' //speedup ~85%, savings ~95%

Avoid using ORDER BY without LIMIT

It is extremely important to understand that the ORDER BY function must be done in a single node, since it is a slow and time-consuming function. Ideally it should be avoided, however, if within the use case you are implementing you must use it, I always recommend placing a LIMIT.
For example:
Avoid:

select * from table1 order by date;

and prefer:

select * from table1 order by date limit 1000; //speedup ~98%, avoid 'Query exhausted resources at this scale factor'

Select only the columns to retrieve the final result

This recommendation is very simple. In practical effect it is to avoid the SELECT * FROM.
For example:
Avoid:

create table tmp_tableas selectA.col_1A.col_2B.col_3from select * from table_1 Aleft join (select * from table_2) B on A.col_1 = B.col_1

Instead use:

create table tmp_tableas selectA.col_1A.col_2B.col_3from select col_1,col_2 from table_1 Aleft join (select col_1,col_3 from table_2) B on A.col_1 = B.col_1

Schedule data aggregation for small files

The numbers speak for themselves:

Query	Number of files	Run time
SELECT COUNT(*) FROM lineitem	5000	8.4 seg
SELECT COUNT(*) FROM lineitem	1	2.31 seg
Speedup		72% faster

Prefer the use of regular expressions over 'LIKE'

Query	Run time
SELECT COUNT(*) FROM lineitem WHERE text_column LIKE '%wake%' OR text_column LIKE '%some%' OR text_column LIKE '%express%' OR text_column LIKE '%hello%'	20.56 seg
SELECT COUNT(*) FROM lineitem WHERE regexp_like(text_column,'...')	15.87 seg
Speedup	17% faster

Note: The expression would be

regexp_like(text_column, 'wake|some|express|hello')

When using group by for multiple fields. order them from highest to lowest cardinality

This will avoid memory errors and reduce the time to deliver results.
For instance:
Avoid:

select * from people group by column_genre,department;

Instead use:

select * from people group by department,column_genre;

In case of using Crawlers to automatically obtain the structure of the data stored in S3, respect the data types supported for the source engine. Likewise, do not forget to run the crawler after a data update that may generate changes in the structure, this in order to update the structure in the glue catalog.

Use MSCK REPAIR TABLE only if the folders are created with the structure 'field1 = / field2 = /.../ fieldN = ' and only after creating the table, since it is' msck repair table 'is expensive operation and it is preferable to use' alter table add partition 'or glue api to add partitions.

and do you have any good practice that you recommend? Comment it in the comment box.

I hope this blog is useful for you. Greetings!

Original Link: https://dev.to/aws-builders/best-practices-for-aws-athena-12gb

Share this article:

View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To