An Interest In:
Web News this Week
- April 26, 2024
- April 25, 2024
- April 24, 2024
- April 23, 2024
- April 22, 2024
- April 21, 2024
- April 20, 2024
Best practices for AWS Athena
In this blog I will mention some of the best practices recommended by AWS for building queries in Athena based on my experience and the following resources:
- https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
General Recommendations
Always use WHERE on partition field
This mainly in order to speed up time and cost.
For example:
Avoid:
select * from table1 where cast(col_1 as integer) = cast('201912' as integer) - 1
and prefer:
select * from table1 where col_particion = '201911' //speedup ~85%, savings ~95%
Avoid using ORDER BY without LIMIT
It is extremely important to understand that the ORDER BY function must be done in a single node, since it is a slow and time-consuming function. Ideally it should be avoided, however, if within the use case you are implementing you must use it, I always recommend placing a LIMIT.
For example:
Avoid:
select * from table1 order by date;
and prefer:
select * from table1 order by date limit 1000; //speedup ~98%, avoid 'Query exhausted resources at this scale factor'
Select only the columns to retrieve the final result
This recommendation is very simple. In practical effect it is to avoid the SELECT * FROM.
For example:
Avoid:
create table tmp_tableas selectA.col_1A.col_2B.col_3from select * from table_1 Aleft join (select * from table_2) B on A.col_1 = B.col_1
Instead use:
create table tmp_tableas selectA.col_1A.col_2B.col_3from select col_1,col_2 from table_1 Aleft join (select col_1,col_3 from table_2) B on A.col_1 = B.col_1
Schedule data aggregation for small files
The numbers speak for themselves:
Query | Number of files | Run time |
---|---|---|
SELECT COUNT(*) FROM lineitem | 5000 | 8.4 seg |
SELECT COUNT(*) FROM lineitem | 1 | 2.31 seg |
Speedup | 72% faster |
Prefer the use of regular expressions over 'LIKE'
Query | Run time |
---|---|
SELECT COUNT(*) FROM lineitem WHERE text_column LIKE '%wake%' OR text_column LIKE '%some%' OR text_column LIKE '%express%' OR text_column LIKE '%hello%' | 20.56 seg |
SELECT COUNT(*) FROM lineitem WHERE regexp_like(text_column,'...') | 15.87 seg |
Speedup | 17% faster |
Note: The expression would be
regexp_like(text_column, 'wake|some|express|hello')
When using group by for multiple fields. order them from highest to lowest cardinality
This will avoid memory errors and reduce the time to deliver results.
For instance:
Avoid:
select * from people group by column_genre,department;
Instead use:
select * from people group by department,column_genre;
In case of using Crawlers to automatically obtain the structure of the data stored in S3, respect the data types supported for the source engine. Likewise, do not forget to run the crawler after a data update that may generate changes in the structure, this in order to update the structure in the glue catalog.
Use MSCK REPAIR TABLE only if the folders are created with the structure 'field1 = / field2 = /.../ fieldN = ' and only after creating the table, since it is' msck repair table 'is expensive operation and it is preferable to use' alter table add partition 'or glue api to add partitions.
and do you have any good practice that you recommend? Comment it in the comment box.
I hope this blog is useful for you. Greetings!
Original Link: https://dev.to/aws-builders/best-practices-for-aws-athena-12gb
Dev To
An online community for sharing and discovering great ideas, having debates, and making friendsMore About this Source Visit Dev To