ORC Core Java 1.6 and Column Encryption with AWS KMS 🔐

In Kotlin

Tanopwan
Ascend Developers

--

If you work with Hadoop, everything should work out of the box ☑️ but if you are using the ORC Java Core library on its own, this could be helpful.

This post continues from the previous article, Move read-only TrueMoney transactions from MySQL to S3 and query with Athena, where we covered how to query data in S3 using AWS Athena. In this post, we will dive deeper into one feature of the ORC specification.

With ORC 1.6, Apache introduced a feature called Column Encryption. It encrypts each chosen column with its own local key, which adds another level of security to your ORC files.

From the specification:

Column encryption provides fine-grain column level security even when many users have access to the file itself. The encryption is transparent to the user and the writer only needs to define which columns and encryption keys to use. When reading an ORC file, if the user has access to the keys, they will get the real data. If they do not have the keys, they will get the masked data.

Each encrypted column in each file will have a random local key generated for it. Thus, even though all of the decryption happens locally in the reader, a malicious user that stores the key only enables access that column in that file. The local keys are encrypted by the Hadoop or Ranger Key Management Server (KMS). The encrypted local keys are stored in the file footer's StripeInformation.

Credit: https://orc.apache.org/specification/ORCv1/

In short, the keys 🔑 that decrypt the column data are stored encrypted 🔐 in the file footer's StripeInformation. Repeat this 3 times 😂

It wouldn't make sense to keep the keys unencrypted and ready to use right beside our data, right? 😅😅😅😅😅😅

This means that to decrypt the column data, we first need to decrypt the local keys with another key 🗝 (the master key).

The specification says that the local keys are encrypted by the Hadoop or Ranger Key Management Server (KMS), and in our case we have neither of them (we do have the Java Core library, though). So the main purpose of this post is to implement the Java KeyProvider interface on top of AWS KMS.

Why so complicated?

Before starting the implementation, you should understand the concept of envelope encryption 📩

AWS KMS holds the customer master key (CMK), which has the following properties:

  • Never leaves AWS
  • Used to encrypt/decrypt data on AWS (the data can be at most 4 KB)

So the whole table cannot be sent to AWS KMS for encryption, but the local key can.

Let's define a term here: local key = data key

AWS KMS gives us developers a way to generate this data key: the GenerateDataKey operation. The result contains the same key in two forms.

The first is a plaintext data key (aka local key 🔑) for encrypting a column.
The second is an encrypted data key (aka local key 🔐) that will be stored in the file footer.

Credit: https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html
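To make the two-key idea concrete, here is a minimal sketch of envelope encryption using only the JDK. FakeKms and DataKeyPair are hypothetical names invented for this illustration; FakeKms is an in-memory stand-in and not the AWS SDK. With the real service the master key never leaves AWS, and GenerateDataKey returns the two forms as Plaintext and CiphertextBlob.

```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

// The same data key twice: once in the clear, once encrypted under the master key.
class DataKeyPair(val plaintext: ByteArray, val encrypted: ByteArray)

// Hypothetical in-memory stand-in for a KMS, for illustration only.
// masterKey must be 16, 24 or 32 bytes long (an AES key).
class FakeKms(private val masterKey: ByteArray) {
    private val random = SecureRandom()

    // Mimics the idea of GenerateDataKey: create a random 256-bit data key
    // and return it together with a copy encrypted under the master key.
    fun generateDataKey(): DataKeyPair {
        val dataKey = ByteArray(32).also { random.nextBytes(it) }
        return DataKeyPair(dataKey, encryptUnderMasterKey(dataKey))
    }

    // Mimics the idea of Decrypt: recover the plaintext data key.
    // Layout of `encrypted`: 12-byte IV followed by the AES-GCM ciphertext.
    fun decrypt(encrypted: ByteArray): ByteArray {
        val iv = encrypted.copyOfRange(0, 12)
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, SecretKeySpec(masterKey, "AES"), GCMParameterSpec(128, iv))
        return cipher.doFinal(encrypted, 12, encrypted.size - 12)
    }

    private fun encryptUnderMasterKey(dataKey: ByteArray): ByteArray {
        val iv = ByteArray(12).also { random.nextBytes(it) }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, SecretKeySpec(masterKey, "AES"), GCMParameterSpec(128, iv))
        return iv + cipher.doFinal(dataKey)
    }
}
```

Calling generateDataKey() gives a pair whose encrypted form is safe to store next to the data, and decrypt(pair.encrypted) gives back pair.plaintext only to someone who can use the master key.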

That is the specification of 1.6. How about the implementation? After a lot of googling and reading documentation, I found that the Java Core library provides an InMemoryKeystore class, which generates a local key 🔑 and an encrypted local key 🔐.

After the ORC file writer is done using the local key 🔑 to encrypt a column, it should delete this key for security reasons and save the encrypted local key 🔐 in the footer of the ORC file.
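The "use the key, then forget it" step can be sketched like this with plain JDK crypto. This is only an illustration of the idea, not how the ORC writer is implemented internally; aesCtr and encryptColumnAndForgetKey are hypothetical helper names, and AES in CTR mode is used here because the key in this post is registered with an AES_CTR_256 algorithm.

```kotlin
import javax.crypto.Cipher
import javax.crypto.spec.IvParameterSpec
import javax.crypto.spec.SecretKeySpec

// Runs AES in CTR mode. CTR is symmetric, so the same function encrypts and
// decrypts when it is given the same key and the same 16-byte IV.
fun aesCtr(mode: Int, localKey: ByteArray, iv: ByteArray, data: ByteArray): ByteArray {
    val cipher = Cipher.getInstance("AES/CTR/NoPadding")
    cipher.init(mode, SecretKeySpec(localKey, "AES"), IvParameterSpec(iv))
    return cipher.doFinal(data)
}

// Hypothetical helper: encrypt one column's bytes, then wipe the plaintext key.
fun encryptColumnAndForgetKey(localKey: ByteArray, iv: ByteArray, column: ByteArray): ByteArray {
    try {
        return aesCtr(Cipher.ENCRYPT_MODE, localKey, iv, column)
    } finally {
        // Best-effort wipe: SecretKeySpec keeps its own copy of the key
        // until it is garbage collected.
        localKey.fill(0)
    }
}
```

After the call, the caller's copy of the plaintext key is all zeros, so the only thing left to persist is the encrypted local key.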

This class is good for testing, but in a real implementation we cannot keep the master key 🗝 in memory. 😅 We need a safer place to store it, and since AWS is our tech stack, AWS KMS is what we prefer.

🥑🥑🥑 At the time I worked on this (it could be my bad googling skills), I could not find any AWS keystore implementation that fits the KeyProvider interface.

Fine!! Let's implement this. Looking at the KeyProvider interface and the specification, I found that the key provider enum already has an AWS kind.

The important functions here are createLocalKey and decryptLocalKey.

The createLocalKey function is called when the ORC file writer wants to encrypt a column. It calls AWS KMS to generate a data key and returns both the encrypted 🔐 and the decrypted 🔑 form.

The decryptLocalKey function is called when the ORC file reader wants to read an encrypted column. It calls AWS KMS to decrypt the encrypted data key.
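The shape of these two functions can be sketched as below. Everything here is a simplified, hypothetical stand-in: KmsLike, LocalKeyPair and SketchKeystore are names made up for this sketch, and the real ORC KeyProvider interface and the AWS SDK client have different, richer signatures. The point is only that both functions delegate to the KMS and that the master key itself is never held by the keystore.

```kotlin
// Hypothetical holder for the two forms of one local key.
class LocalKeyPair(val decryptedKey: ByteArray, val encryptedKey: ByteArray)

// Hypothetical abstraction of the two KMS calls the keystore needs.
interface KmsLike {
    fun generateDataKey(masterKeyId: String): LocalKeyPair
    fun decrypt(masterKeyId: String, encryptedKey: ByteArray): ByteArray
}

// Simplified keystore: it only knows the ID of the master key, never its bytes.
class SketchKeystore(private val kms: KmsLike, private val masterKeyId: String) {
    private val keyNames = mutableSetOf<String>()

    // Register a key name that writers and readers may refer to.
    fun addKey(keyName: String) {
        keyNames += keyName
    }

    // Writer path: ask the KMS for a fresh data key in both forms.
    fun createLocalKey(keyName: String): LocalKeyPair {
        require(keyName in keyNames) { "Unknown key: $keyName" }
        return kms.generateDataKey(masterKeyId)
    }

    // Reader path: ask the KMS to decrypt the key that was stored in the file.
    fun decryptLocalKey(keyName: String, encryptedKey: ByteArray): ByteArray {
        require(keyName in keyNames) { "Unknown key: $keyName" }
        return kms.decrypt(masterKeyId, encryptedKey)
    }
}
```

Keeping the KMS behind a small interface like this also makes the keystore easy to exercise with a fake in unit tests, without calling AWS.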

Now, to use the AwsKeystore.kt we just implemented, we need to add the key to the store.

val awsKeystore = AwsKeystore(kmsClient, "KEY-ID", "AES_256")
awsKeystore.addKey("key_name", EncryptionAlgorithm.AES_CTR_256)

and call setKeyProvider in the ORC writer options

// "key_name:column_name" encrypts column_name with the key added as key_name
OrcFile
    .writerOptions(Configuration())
    .setKeyProvider(awsKeystore)
    .encrypt("key_name:column_name")

When we want to read the file, we do the same thing by calling setKeyProvider in the ORC reader options

OrcFile
    .readerOptions(Configuration())
    .setKeyProvider(awsKeystore)

Now your local keys are protected by AWS KMS ✅
