SQL Server 2019 Failover Cluster on EC2 – Lessons Learned

Not as straightforward as hoped

Due to collation and framework restrictions, the recommended way to set up a SQL cluster on AWS—via RDS—was not an option. Instead, we opted for a SQL Failover Cluster running on EC2 Windows Server instances. The AWS reference setup is mostly accurate and helpful, but one crucial tweak was needed to get the cluster up and running.

TL;DR

Do not create secondary ENIs for the IPs required by Windows Failover Cluster and SQL Failover Cluster. Instead, predefine the secondary IPs yourself and attach them to the primary network interface before installing the clusters.

Prerequisites

  • AMI used: ami-0927dabc35e7e1948 - Windows Server 2019 with SQL 2019 Standard
  • EC2 Instance Type: t3.xlarge

The tweak: Secondary IPs instead of secondary ENIs

SQL Server Failover Cluster Instances (FCI) operate on a Windows Server Failover Cluster (WSFC). Each node in the cluster requires a dedicated IP address for network communication. The AWS guide suggests creating secondary Elastic Network Interfaces (ENIs) to accommodate these IPs. However, in our setup, secondary ENIs caused issues, preventing proper cluster communication and failover behavior. Either the Windows refused to use the IP Address for the cluster as it was already in use at a network adapter or the cluster broke when attaching the secondary ENI with a cluster IP.

What Worked

Instead of adding secondary ENIs, I manually assigned secondary IPs to the primary network interface (eth0) of each EC2 instance before installing the failover cluster. Here’s how:

  • Ensure Network Configuration is correct: The instance needs to be joined to an AD domain and the network adapter (in Windows)/primary network interface (in AWS) should be set to use Active Directory (AD) DNS servers.
  • Predefine Secondary IPs: Depending on the subnet where a cluster node will reside, define which available IP of the subnet will be used. You will need two IPs, one for the Windows Failover cluster and one for the SQL Failover cluster.
  • Assign Additional Private IPs: Edit the primary network interface settings of the running instance and assign the defined secondary IPs.

Try setting up the Windows Failover Cluster, like mentioned in the AWS reference setup, with powershell static addresses. If the creation of the cluster fails, you can go ahead create the cluster with standard settings, just ommit the block about static IP addresses in the powershell command or use the Failover Cluster Manager GUI. If the cluster was created without static IP addresses, manually remove the IP addresses created during standard installation and assign the secondary IP addresses you want to use for the cluster via powershell or, what worked well for me, through the Failover Cluster Manager.

Windows will then assign these secondary IP addresses to the primary network interface of the respective node and, as AWS already knows about the IP assignment, traffic will flow. The same will happen for the IP addresses used in the subsequent SQL Failover cluster installation. After the successful installation each nodes network adapter should show three IP addresses.

Additional Considerations

SQL Service Account

Ensure that the SQL Server Agent and SQL Server Database Engine service accounts have the necessary permissions in Active Directory. These accounts require:

  • Permissions to create and manage cluster objects in the AD
  • The ability to log on as a service.

Troubleshooting Installation Blocks

When adding a node to a deployed SQL failover cluster, I encountered an issue where credentials for service accounts have not been properly loaded. Thus the installation could not proceed. If you have fields in the Service Account that are misconfigured, you can manually override the installation using the following command (in the SQL setup directory):

...
setup.exe /Action=AddNode /INSTANCENAME=MSSQLSERVER /SQLSVCACCOUNT="aws\admin" /SQLSVCPASSWORD="YourPassword" /AGTSVCACCOUNT="aws\admin" /AGTSVCPASSWORD="YourPassword"
...

Performance Considerations

For optimal performance:

  • Consider using NVMe-based EC2 instances, such as the m5 or i3 instance families. These provide significantly better disk I/O performance, especially for temporary tables, compared to traditional EBS-backed instances.
  • Ensure that FSx for NetApp ONTAP is configured with sufficient throughput capacity to handle SQL Server workload demands.

Conclusion

Setting up a SQL Server Failover Cluster on EC2 is certainly possible, but not without its challenges. By manually assigning secondary IPs to the primary network interface before installation, you can avoid unnecessary complications and ensure a smoother deployment process.

Additionally, ensuring that:

  • The instance is AD-joined before configuring the cluster,
  • The primary network adapter uses AD DNS servers, and
  • SQL Server service accounts have appropriate permissions

will help prevent common pitfalls.

By following these best practices, your SQL Server Failover Cluster should be up and running efficiently on EC2 in no time. If you need support doing your transition seamlessly, use our experience and do not hesitate to contact tecRacer at https://www.tecracer.com/kontakt/. Happy clustering!

— Patrik

Similar Posts You Might Enjoy

Regaining Amazon QuickSight SPICE capacity

When using Business Intelligence solutions, it can quickly become difficult to manage all the resources created by numerous users. In this blog article, we will see how the Amazon QuickSight API was leveraged to determine which resources used up all the available SPICE capacity in our Amazon QuickSight account. - by Franck Awounang Nekdem

How I spent a few hours using advanced technology to save $2

Opportunity cost is an important economic concept, but sometimes we need to ignore it to learn something. Join me in using a variety of services and tools to figure out what is using my KMS keys and if I can safely delete them. - by Maurice Borgmeier

Embedded Embeddings Database: Building a low cost serverless RAG solution

Retrieval-Augmented Generation (RAG) solutions are an impressive way to talk to one’s data. One of the challenges of RAG solutions is the associated cost, often driven by the vector database. In a previous blog article I presented how to tackle this issue by using Athena with Locality Sensitive Hashing (LSH) as a knowledge database. One the of the main limitations with Athena is the latency and the low number of concurrent queries. In this new blog article, I present a new low-cost serverless solution that makes use of an embedded vector database, SQLite, to achieve a low cost while maintaining high concurrency. - by Franck Awounang Nekdem