Creating Recommended Alarms for Amazon OpenSearch Service with Terraform

Amazon OpenSearch Service is a powerful tool for search, log analytics, and RAG (Retrieval-Augmented Generation) systems. To ensure optimal performance and reliability, effective monitoring is crucial.

This article explores how to leverage CloudWatch metrics and alarms to monitor Amazon OpenSearch Service, enabling proactive identification and resolution of potential issues before they impact your users.

Prerequisites

Before diving into the monitoring setup, ensure you have:

  1. Basic knowledge of AWS, Terraform, and Amazon OpenSearch Service
  2. An active AWS account with console access and the permissions required to create CloudWatch resources and read your OpenSearch domain configuration
  3. A running Amazon OpenSearch Service cluster
  4. The following tools installed:
    • Terraform (local machine or CI/CD pipeline)
    • AWS CLI (configured with your AWS credentials)
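
If you are starting a new Terraform project for these alarms, a minimal provider setup like the following is enough. The version constraint and region below are assumptions; adjust them to your environment:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumption: any recent AWS provider release works
    }
  }
}

# Assumption: use the region in which your OpenSearch domain runs
provider "aws" {
  region = "eu-central-1"
}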

Understanding CloudWatch Metrics and Alarms for Amazon OpenSearch Service

Amazon OpenSearch Service generates a variety of metrics that provide insights into cluster performance, current load, and overall health. These metrics are accessible through the AWS Console:

  1. Navigate to CloudWatch -> Metrics -> All Metrics -> ES/OpenSearch Metrics
  2. Use the automated dashboard to track metric evolution over time

Monitoring these metrics helps you understand the current state of your OpenSearch clusters. However, to proactively manage your clusters, it’s crucial to set up CloudWatch alarms. These alarms trigger actions when specific metrics exceed defined thresholds for a certain period, enabling you to respond promptly to potential issues.

For example, you can configure AWS to send you an email if your cluster health status remains red for more than one minute. The following section provides recommended alarms for Amazon OpenSearch Service and guidance on how to respond to them.

Recommended CloudWatch Alarms for Amazon OpenSearch Service

  • Node Failures: Identify when a node goes down, affecting cluster performance and data availability.

    Example: NodesMinimum alarm notifies if any node has been unreachable for one day.

  • Indexing Issues: Detect slow or failing indexing, impacting search performance and data freshness.

    Example: ThreadpoolWriteQueue and ClusterIndexWritesBlocked alarms notify when indexing requests are blocked or queued excessively.

  • Query Performance: Monitor query latency and throughput to ensure optimal search performance.

    Example: ThreadpoolSearchQueueAverage and 5xxErrors alarms notify when query concurrency is high or query failures occur frequently.

  • Storage Capacity: Alert when storage capacity approaches its limits, preventing data loss and ensuring cluster stability.

    Example: FreeStorageSpace alarm notifies when a node’s free storage space is critically low (a Terraform sketch of this alarm follows this list).

  • Cluster Health: Monitor the overall health status of the cluster.

    Example: ClusterStatusRed notifies you when at least one primary shard and its replicas are not allocated to a node, a serious problem that needs immediate attention. ClusterStatusYellow notifies you when at least one replica shard is not allocated to a node, a potential risk but not an emergency.

  • JVM Memory Pressure: Detect high JVM memory usage, which could lead to out-of-memory errors.

    Example: JVMMemoryPressure and OldGenJVMMemoryPressure alarms notify when JVM memory pressure is critically high.

  • Snapshot Failures: Identify issues with automated snapshots, crucial for data backup and recovery.

    Example: AutomatedSnapshotFailure alarm notifies if an automated snapshot fails.

  • Anomaly Detection and Plugins Health: Monitor the health of various plugins and anomaly detection features.

    Example: ADPluginUnhealthy and SQLUnhealthy alarms notify if the anomaly detection or SQL plugins are not functioning correctly.
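
To illustrate how these recommendations translate into Terraform, here is a minimal sketch of the FreeStorageSpace alarm mentioned above. The threshold is an assumption (AWS suggests alerting at roughly 25 percent of each node's storage); adjust it to your domain:

# Sketch of a free storage alarm; the threshold is an assumption and should be
# set to roughly 25 percent of the storage available on each data node.
resource "aws_cloudwatch_metric_alarm" "free_storage_space" {
  alarm_name          = "FreeStorageSpace"
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/ES"
  period              = "60"
  statistic           = "Minimum"
  threshold           = "20480" # assumption: ~25% of a node with ~80 GiB of storage
  alarm_description   = "A node in the cluster is running low on free storage space"
  alarm_actions       = [aws_sns_topic.alarms.arn]
  ok_actions          = [aws_sns_topic.alarms.arn]
  dimensions = {
    DomainName = var.domain_name
    ClientId   = data.aws_caller_identity.current.account_id
  }
}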

Implementing CloudWatch alarms with Terraform

To streamline the process of setting up alarms, we’ve prepared Terraform code that creates all of the alarms recommended in the AWS documentation. Here’s how to use it:

  1. Clone the repository: git clone https://github.com/tecracer/opensearch.git
  2. Navigate to the alarms directory: cd opensearch/opensearch-alarms
  3. Review and modify the terraform.tfvars file to match your OpenSearch cluster setup (see the example below)
  4. Run terraform apply to create the alarms

Note: You don’t need to set up all alarms if you’re not using certain features (e.g., plugins, UltraWarm, or master nodes). Review the descriptions and comment out unnecessary alarms as needed.
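
For step 3, a terraform.tfvars file might look like the following. Except for domain_name, which the alarm resources reference, the variable names here are only assumptions; check variables.tf in the repository for the actual inputs:

# terraform.tfvars - illustrative values; variable names other than domain_name
# are assumptions, see variables.tf in the repository for the real inputs.
domain_name = "my-opensearch-domain"
aws_region  = "eu-central-1"    # assumption: deployment region variable
alarm_email = "ops@example.com" # assumption: e-mail address for SNS notifications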

For more details, see the GitHub repository.

Example alarm

resource "aws_cloudwatch_metric_alarm" "cluster_status_red" {
  alarm_name          = "ClusterStatusRed"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "ClusterStatus.red"
  namespace           = "AWS/ES"
  period              = "60"
  statistic           = "Maximum"
  threshold           = "1"
  alarm_description   = "Cluster status is red"
  alarm_actions       = [aws_sns_topic.alarms.arn]
  ok_actions          = [aws_sns_topic.alarms.arn]
  dimensions = {
    DomainName = var.domain_name
    ClientId   = data.aws_caller_identity.current.account_id
  }
}

This alarm triggers when the cluster status turns red, indicating that at least one primary shard and its replicas are not allocated to a node.

You can view the status of the alarms in the CloudWatch console and receive notifications through the SNS topic referenced in alarm_actions and ok_actions.
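
The aws_sns_topic.alarms used as the alarm target is not shown above. A minimal sketch, assuming notifications should be delivered by e-mail, could look like this:

resource "aws_sns_topic" "alarms" {
  name = "opensearch-alarms" # assumption: topic name
}

resource "aws_sns_topic_subscription" "alarms_email" {
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "email"
  endpoint  = "ops@example.com" # assumption: replace with your address
}

Note that e-mail subscriptions must be confirmed once by the recipient before notifications are delivered.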

Cost Considerations

Implementing these CloudWatch alarms incurs a small additional cost:

  • Approximately $0.10 per standard-resolution alarm per month
  • Total cost for all 38 alarms: approximately $3.80 per month

This minimal investment provides substantial value in terms of improved monitoring and reduced downtime risk.

Conclusion

Effective monitoring of Amazon OpenSearch Service is essential for maintaining high performance, reliability, and user satisfaction. By leveraging CloudWatch metrics and alarms, you can gain deep insights into your cluster’s health and performance, enabling proactive issue resolution.

The provided Terraform code simplifies the process of setting up comprehensive monitoring, allowing you to focus on deriving value from your OpenSearch cluster rather than managing its underlying infrastructure.

Remember to regularly review and adjust your monitoring setup as your usage patterns and requirements evolve, ensuring that your Amazon OpenSearch Service continues to meet your needs.
