Move read-only TrueMoney transactions from MySQL to S3 and query with Athena 🥚 🐣 🐥

To reduce database storage costs and make the application run faster!

Tanopwan
Ascend Developers

--

For any system running in production, data grows every second, so the database size keeps increasing. A typical relational database can only scale vertically, which of course gets more expensive 🤑 and slower.

Based on regulations and customer requirements, we classified transactions by how the application accesses and modifies them:

  • Read/Write transactions are still accessed and can still be updated by the application
  • Read-only transactions still need to be accessed but will never be updated again

The read-only transactions are the ones we would like to housekeep, but we still need to be able to query them.

Requirements are a pain in the abdomen ❣️ I know

🥚 There were many choices out there, but we chose to explore the field of Big Data a bit: storing ORC files in AWS S3 and querying them with AWS Athena.

🤤 A lot of new keywords, right? So let's start with some definitions.

A columnar file format stores data column by column instead of row by row (read more)

ORC is a type-aware columnar file format designed for Hadoop workloads

Given that definition, how would we use it without Hadoop? The official website states that there are many ways to adopt the ORC file format.

Since we did not have any Big Data framework or environment, we chose to work with the Core Java ORC library.

Since Apache ORC provides a Core Java library, perfect!! We can create a simple Java application, run on a schedule, that reads transactions from MySQL, converts them from row-based to column-based, and saves them as ORC files. There is also a tutorial for it: Get Started
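Glossing over the JDBC side, the conversion step might look roughly like this with the ORC Core Java API — a sketch only, with an illustrative schema and file name, not our production job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class ArchiveWriter {
    public static void main(String[] args) throws Exception {
        // Schema mirroring the MySQL table (illustrative columns)
        TypeDescription schema = TypeDescription.fromString(
                "struct<id:string,status:string,created:timestamp,updated:timestamp>");

        try (Writer writer = OrcFile.createWriter(new Path("transactions.orc"),
                OrcFile.writerOptions(new Configuration()).setSchema(schema))) {
            VectorizedRowBatch batch = schema.createRowBatch();
            BytesColumnVector id = (BytesColumnVector) batch.cols[0];
            BytesColumnVector status = (BytesColumnVector) batch.cols[1];

            // In the real job this loop would iterate a JDBC ResultSet
            // from MySQL (timestamp columns omitted here for brevity)
            for (int r = 0; r < 2; r++) {
                int row = batch.size++;
                id.setVal(row, ("txn-" + r).getBytes());
                status.setVal(row, "COMPLETED".getBytes());
                if (batch.size == batch.getMaxSize()) {
                    writer.addRowBatch(batch);
                    batch.reset();
                }
            }
            if (batch.size != 0) {
                writer.addRowBatch(batch);
            }
        }
    }
}
```

The row-to-column conversion happens in the `VectorizedRowBatch`: each MySQL row lands at the same index across all column vectors, and the writer flushes whole batches into the columnar file.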

Let's assume the example in the tutorial went well and we have ORC files uploaded to S3 🤞 🤞 fingers crossed

They can live in any folder structure, but a well-chosen structure gives us the benefit of partitioning, which I will explain later. The file names themselves are irrelevant and only there for human readability.
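For example, a tiny helper can derive the S3 key prefix for one day's files. This is a hypothetical helper (not part of the ORC tutorial); it just matches the year=/month=/day= layout that Athena partitioning expects:

```java
import java.time.LocalDate;

// Hypothetical helper: builds the S3 key prefix for one day's archive
// files in the year=/month=/day= layout used for Athena partitioning.
public class PartitionKey {
    public static String prefixFor(LocalDate date) {
        return String.format("path/to/orc/files/year=%d/month=%02d/day=%02d/",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth());
    }

    public static void main(String[] args) {
        System.out.println(prefixFor(LocalDate.of(2020, 12, 31)));
        // path/to/orc/files/year=2020/month=12/day=31/
    }
}
```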

🐣 AWS Athena

Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL.

In the Athena console:

Create a Database from S3

This tells Athena to create a database whose data lives in S3

CREATE DATABASE IF NOT EXISTS [DATABASE_NAME]
LOCATION 's3://path/to/orc/files'

Create a Table with Schema

This creates the table structure used when reading the ORC files; we also specify the format as OrcSerde and the base location in S3

CREATE EXTERNAL TABLE IF NOT EXISTS [DATABASE_NAME].[TABLE_NAME] (
`id` string,
`status` string,
...
`created` timestamp,
`updated` timestamp
) PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://path/to/orc/files/'
TBLPROPERTIES ('has_encrypted_data'='false');

Working with Partitions

In the table we just created, we also specified the partition columns in PARTITIONED BY. Athena understands partitioning from the folder structure names, which helps reduce the amount of data scanned and therefore the cost. 🤑

For example, /path/to/orc/files/year=2020/month=12/day=31/*.orc

Then all the files under this structure can be loaded using the command

MSCK REPAIR TABLE [TABLE_NAME];

After loading the partitions, Athena registers each partition folder in its metastore:

Repair: Added partition to metastore [TABLE_NAME]:year=2015/month=12/day=04
Repair: Added partition to metastore [TABLE_NAME]:year=2016/month=07/day=19
Repair: Added partition to metastore [TABLE_NAME]:year=2016/month=07/day=21
Repair: Added partition to metastore [TABLE_NAME]:year=2016/month=12/day=03
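MSCK REPAIR TABLE rescans the whole table location; when a scheduled job adds only one new day of files, a single partition can also be registered explicitly (placeholder names as above):

```sql
ALTER TABLE [DATABASE_NAME].[TABLE_NAME] ADD IF NOT EXISTS
PARTITION (year = '2020', month = '12', day = '31')
LOCATION 's3://path/to/orc/files/year=2020/month=12/day=31/';
```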

Now that we have the database and the table with partitions, we can query with normal SQL 😏

SELECT * FROM "[DATABASE_NAME]"."[TABLE_NAME]" WHERE year='2020' limit 10
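Athena charges per byte scanned, and ORC is columnar, so selecting only the columns we need and pinning every partition key keeps the scan (and the bill) small — for example:

```sql
SELECT id, status
FROM "[DATABASE_NAME]"."[TABLE_NAME]"
WHERE year = '2020' AND month = '12' AND day = '31'
LIMIT 10;
```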

So now we can archive data to S3 and still run SQL queries on the archived data using AWS Athena 👌👌👌

πŸ₯ What TrueMoney benefits from implementing this?

Cost Reduction ⭐️ ⭐️ ⭐️ ⭐️ ⭐️

During Covid-19 in 2020–2021, costs matter even more 💰. S3 storage is much cheaper than the SSD storage in RDS:

  • An estimated 20 TB on RDS costs about $2,300/mo, but only about $460/mo on the S3 standard tier. And 20 TB on RDS will not even be 20 TB on S3, since ORC is binary and compressed. Imagine how much we can save in a year!
  • That is not even counting the cost of database snapshots
  • Since the ORC format is designed for big data, it stores its metadata in protocol buffers and compresses the data. The resulting files are pretty d*rn small ❣️
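Using the estimated figures above (and ignoring compression, which only helps further), the yearly storage saving works out to:

```java
// Back-of-the-envelope yearly saving from the storage estimates above
// (compression not included, so the real saving is even larger).
public class Savings {
    public static void main(String[] args) {
        int rdsPerMonth = 2300; // USD/mo for ~20 TB on RDS (estimate)
        int s3PerMonth = 460;   // USD/mo for the same data on S3 standard
        int savedPerYear = (rdsPerMonth - s3PerMonth) * 12;
        System.out.println("Saved per year: $" + savedPerYear);
        // Saved per year: $22080
    }
}
```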

Application Performance and Maintenance ⭐️ ⭐️

  • A smaller database = easier to query and modify
  • The application works on the small set of data left in the main database
  • We had to add more disk to the database every 3–6 months; we hope that next month we won't have to. We developers may rest in peace 🛌

Easy Access 📅 and Security ⭐️ ⭐️ ⭐️ ⭐️

ORC files stored in S3 are binary and compressed, so unlike other formats (JSON, CSV) they are unreadable in a normal editor. That makes the files harder to access casually, but yes 😁 we do have a way out of this, with Athena of course ❗️ Athena recently added support for OrcSerde.

  • 🆎 It gives us the ability to run normal SQL queries on these archived files
  • 🙈 Reads can use the indexes kept in the ORC file footer
  • 🗄 We can use partitioning to skip scanning data we don't need
  • 🔒 We can use column encryption to make sure sensitive columns are not exposed ❗️ This feature is still new, and Athena did not support it yet at the time
  • Even though the ORC files are stored as many smaller files on S3, we can load all the partitions in Athena 🎉 So at the logical level, it is as if we had another MySQL sitting behind Athena
