Disaster Recovery for AWS CloudHSM
“Everything fails, all the time”
— AWS CTO, Werner Vogels
No matter how highly available your infrastructure is, having a disaster recovery (DR) plan for each critical infrastructure service is equally important and always rewarding. A well-tested DR plan helps an organisation recover from an event that has negatively affected its business operations.
Disaster recovery planning normally revolves around RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is the maximum time an organisation can afford to stay offline without adversely affecting the business, and RPO is the maximum amount of data, measured in time, that an organisation can afford to lose. For example, if a backup is taken every 4 hours, the organisation can lose at most 4 hours of data in a disaster, so the achievable RPO is 4 hours.
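The RPO arithmetic above can be sketched in a few lines of Python. This is only an illustration, not a planning tool: the function names and the optional copy_delay parameter (the time a backup needs to reach the recovery site) are assumptions introduced here.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         copy_delay: timedelta = timedelta(0)) -> timedelta:
    """Largest window of data that can be lost.

    The worst case is a disaster striking just before the next backup,
    so everything written since the previous backup is gone. copy_delay
    (hypothetical) adds the time a backup needs to reach the DR site.
    """
    return backup_interval + copy_delay


def meets_rpo(backup_interval: timedelta, target_rpo: timedelta,
              copy_delay: timedelta = timedelta(0)) -> bool:
    """True if the backup schedule keeps data loss within the target RPO."""
    return worst_case_data_loss(backup_interval, copy_delay) <= target_rpo


# A backup every 4 hours gives a worst-case loss of 4 hours of data.
print(worst_case_data_loss(timedelta(hours=4)))  # 4:00:00
```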
Based on the RTO and RPO, a recovery strategy is decided. There are four common recovery strategies to choose from: backup and restore, pilot light, warm standby, and multi-site active/active.
CloudHSM
AWS CloudHSM is a single-tenant, dedicated hardware security module (HSM) that sits within your VPC and can store both symmetric and asymmetric encryption keys. As of today, AWS does not provide native DR support for CloudHSM, so we need to perform a few manual steps to add DR for our CloudHSM cluster.
Note: You do not need a CloudHSM cluster in your account to follow along, but it is good to have one. If you don't have one and want to create one, you can refer to one of our articles that helps you create, initialise and activate a CloudHSM cluster. Before creating the cluster, make sure to check the pricing page, as the service does not offer any free tier usage.
Let’s visit the CloudHSM dashboard to add DR for our cluster.
Once you are on the dashboard page, click the Backups link in the left panel to see the automated backups of your CloudHSM cluster.
Then, select the latest backup, click on Actions and click on Copy backup to another region. Select the region and hit the Copy backup button to add DR for your CloudHSM cluster.
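The same copy can be requested through the API instead of the console. The following is a minimal sketch, assuming a boto3 "cloudhsmv2" client for the source region is passed in; the helper names are hypothetical, and only the first page of backup results is inspected.

```python
def latest_ready_backup(backups):
    """Return the newest backup whose state is READY, or None."""
    ready = [b for b in backups if b.get("BackupState") == "READY"]
    return max(ready, key=lambda b: b["CreateTimestamp"], default=None)


def copy_latest_backup(source_client, cluster_id, destination_region):
    """Copy the newest READY backup of a cluster to another region.

    source_client is a boto3 "cloudhsmv2" client for the source region.
    Returns the ID of the source backup that was submitted for copying.
    """
    response = source_client.describe_backups(
        Filters={"clusterIds": [cluster_id]}
    )
    backup = latest_ready_backup(response.get("Backups", []))
    if backup is None:
        raise RuntimeError(f"no READY backup found for {cluster_id}")
    source_client.copy_backup_to_region(
        DestinationRegion=destination_region,
        BackupId=backup["BackupId"],
    )
    return backup["BackupId"]


# Usage (needs boto3 and AWS credentials; IDs and regions are placeholders):
#   import boto3
#   client = boto3.client("cloudhsmv2", region_name="us-east-1")
#   copy_latest_backup(client, "cluster-EXAMPLE", "us-west-2")
```

Keeping the client as a parameter makes the function easy to exercise with a stub before pointing it at a real account.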
Switch to the destination region and verify that the backup was copied over.
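This check can also be scripted. The sketch below assumes a boto3 "cloudhsmv2" client created for the destination region and looks the copy up by the ID of the backup it was copied from; the function name is hypothetical.

```python
def copied_backup_state(destination_client, source_backup_id):
    """Return the state of the copied backup in the destination region.

    destination_client is a boto3 "cloudhsmv2" client for the destination
    region. Returns None when no copy of the source backup is listed yet.
    """
    response = destination_client.describe_backups(
        Filters={"sourceBackupIds": [source_backup_id]}
    )
    backups = response.get("Backups", [])
    return backups[0]["BackupState"] if backups else None


# Usage (placeholders; needs boto3 and AWS credentials):
#   import boto3
#   dest = boto3.client("cloudhsmv2", region_name="us-west-2")
#   copied_backup_state(dest, "backup-EXAMPLE")
```

A state of READY means the copy can be used to create a cluster; any other state, or None, means it is not usable yet.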
Once the backup is available in the destination region you can create or restore a CloudHSM cluster from it whenever your primary region goes down.
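Creating the cluster from the copied backup can be sketched as follows. This assumes a boto3 "cloudhsmv2" client for the destination region, the ID of the backup as it appears in that region, and subnets that already exist there. The default HSM type is an assumption and should match the type of the cluster the backup came from.

```python
def restore_cluster_from_backup(destination_client, backup_id, subnet_ids,
                                hsm_type="hsm1.medium"):
    """Create a new cluster in the destination region from a backup.

    backup_id is the ID of the copied backup in the destination region.
    hsm_type defaults to an assumed value; adjust it to your cluster.
    Returns the new cluster's ID and its initial state.
    """
    response = destination_client.create_cluster(
        HsmType=hsm_type,
        SourceBackupId=backup_id,
        SubnetIds=subnet_ids,
    )
    cluster = response["Cluster"]
    return cluster["ClusterId"], cluster["State"]


# Usage (placeholders; needs boto3 and AWS credentials):
#   import boto3
#   dest = boto3.client("cloudhsmv2", region_name="us-west-2")
#   restore_cluster_from_backup(dest, "backup-EXAMPLE", ["subnet-EXAMPLE"])
```

The restored cluster has no HSMs at this point; at least one HSM has to be added to it before it can serve requests.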
Note: Copying a single backup does not fully prepare you for a disaster. You need to periodically copy the latest backups from the source region to the destination region.
Caveat
The process described above is manual and requires us to periodically copy the latest backups to the destination region. This is not a very efficient way to add DR support to our CloudHSM cluster, because it carries the risk of forgetting to copy backups at regular intervals. To avoid this, it is highly recommended to automate the process.
To automate the process, you can either use the open-source module created by us or build your own automation by creating a CloudWatch Events rule that triggers a Lambda function at regular intervals to copy the CloudHSM backups from the source region to the destination region.
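If you build your own automation, the Lambda function could look roughly like the sketch below. It is not the open-source module mentioned above. The environment variable names and the helper function are assumptions made for illustration, and only the first page of backup results is inspected.

```python
import os


def copy_latest_if_missing(source_client, destination_client,
                           cluster_id, destination_region):
    """Copy the newest READY backup unless the destination already has it.

    Both clients are boto3 "cloudhsmv2" clients, one per region.
    Returns the copied backup ID, or None when there was nothing to do.
    """
    ready = source_client.describe_backups(
        Filters={"clusterIds": [cluster_id], "states": ["READY"]}
    ).get("Backups", [])
    if not ready:
        return None
    latest = max(ready, key=lambda b: b["CreateTimestamp"])

    # A copy in the destination region records the ID of its source backup.
    already_copied = destination_client.describe_backups(
        Filters={"sourceBackupIds": [latest["BackupId"]]}
    ).get("Backups", [])
    if already_copied:
        return None

    source_client.copy_backup_to_region(
        DestinationRegion=destination_region,
        BackupId=latest["BackupId"],
    )
    return latest["BackupId"]


def lambda_handler(event, context):
    """Entry point for the scheduled invocation."""
    import boto3  # imported here so the helper above can be tested without it

    # Hypothetical environment variable names, set on the function.
    source_region = os.environ["SOURCE_REGION"]
    destination_region = os.environ["DESTINATION_REGION"]
    cluster_id = os.environ["CLUSTER_ID"]

    source = boto3.client("cloudhsmv2", region_name=source_region)
    destination = boto3.client("cloudhsmv2", region_name=destination_region)
    copied = copy_latest_if_missing(source, destination,
                                    cluster_id, destination_region)
    return {"copied_backup_id": copied}
```

Checking the destination first keeps the function safe to run more often than backups are created, since a backup that has already been copied is skipped. The function's execution role needs permission to describe and copy CloudHSM backups.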
Covering the basics
- CloudHSM is a single-tenant hardware security module that is deployed in your own VPC. It can be used to store root CA keys, encryption keys (both symmetric keys and asymmetric key pairs), SSL certificates, etc.
- KMS is a multi-tenant service backed by hardware security modules that are fully managed by AWS, whereas CloudHSM is a single-tenant hardware security module whose users and keys are managed by the customer. At the time of writing, KMS was validated under FIPS 140-2 Level 2, whereas CloudHSM was validated under FIPS 140-2 Level 3.
- AWS CloudHSM can be deployed via the console, CLI or API. You start by creating a cluster, adding a first HSM node and initialising the cluster, after which you can attach more HSM nodes depending on the requirement. It is recommended to have at least 2 HSM nodes per cluster for high availability. Once the cluster is initialised, you need to activate it by activating the admin user with the CloudHSM CLI tool.
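As an illustration of the API route, adding HSM nodes in more than one Availability Zone could be sketched like this. It assumes a boto3 "cloudhsmv2" client and a cluster that already has a subnet in each listed zone. The function name is hypothetical, and the sketch does not wait for the cluster to settle between requests, which may be necessary in practice.

```python
def add_hsms(client, cluster_id, availability_zones):
    """Request one HSM in each listed Availability Zone.

    client is a boto3 "cloudhsmv2" client for the cluster's region.
    Returns the IDs of the HSMs that were requested.
    """
    hsm_ids = []
    for zone in availability_zones:
        response = client.create_hsm(
            ClusterId=cluster_id,
            AvailabilityZone=zone,
        )
        hsm_ids.append(response["Hsm"]["HsmId"])
    return hsm_ids


# Usage (placeholders; needs boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("cloudhsmv2", region_name="us-east-1")
#   add_hsms(client, "cluster-EXAMPLE", ["us-east-1a", "us-east-1b"])
```

Spreading the nodes across zones is what provides the high availability mentioned above; two nodes in the same zone would still share that zone's failure modes.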