Alibaba Cloud Elasticsearch Tutorial & Optimisation

6 minute read

Alibaba Cloud’s Elasticsearch Service has come a long way since it was first introduced, and I believe it is ready for most production workloads. It removes most of the pain of operating Elasticsearch yourself, letting you focus on your application or business.

That said, if you’ve never set up an Elasticsearch cluster before, Alibaba Cloud Elasticsearch can be quite unintuitive. In this tutorial, I’ll walk you through the steps to provision a cluster on Alibaba Cloud’s Elasticsearch Service.

What is Elasticsearch

Elasticsearch, as Elastic describes it, is the central component of the Elastic Stack: a distributed, free and open, Lucene-based search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. At its core, you can think of Elasticsearch as a server that processes JSON requests and gives you back JSON data.
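To make that JSON-in, JSON-out model concrete, here is a minimal, illustrative pair of requests (the index name and fields are invented for this example) as you would type them into Kibana’s Dev Tools console: the first indexes a document, the second searches for it.

```
POST /my_first_index/_doc/1
{
  "name": "Ada",
  "surname": "Lovelace"
}

GET /my_first_index/_search
{
  "query": {
    "match": { "name": "ada" }
  }
}
```

Both the requests and the responses are plain JSON, which is what makes Elasticsearch so easy to script against from any language.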

As I mentioned, Elasticsearch is part of the Elastic Stack, also known as ELK (named after its components Elasticsearch, Logstash, and Kibana, although it now also includes Beats). Going deep into these tools is out of scope for this article, but in short: Kibana is used for data visualisation, Logstash is a data processing tool, and Beats are single-purpose data shipping agents (I like to call them mini-Logstashes).

Why a search engine is necessary

Elastic likes to describe this engine as a “search everything, anywhere” tool. And they are right: it can be used for search within applications, websites, and enterprise data, as well as for log and infrastructure analytics. It basically helps you structure the data you have, making it easier to find what you are looking for.

Alibaba Cloud Elasticsearch: Your First Cluster

If you read me often, you know I’m a huge advocate of IaC (Infrastructure as Code), so instead of showing you how to click-click-clack this on the Alibaba Cloud web interface, I’ll run the deployment part using Terraform.

Terraform config to spin up an Alibaba Cloud Elasticsearch

provider "alicloud" {
  region = "cn-shanghai"
}

resource "alicloud_elasticsearch_instance" "main" {
  instance_charge_type = "PrePaid"
  period               = "1"
  data_node_amount     = "2"
  data_node_spec       = "elasticsearch.n4.small"
  data_node_disk_size  = "20"
  data_node_disk_type  = "cloud_efficiency"
  client_node_amount   = 2
  vswitch_id           = "vsw-uf6wwcr5oczmcam4riup4"
  password             = "5up3r54f3p455w0rd" # example only; load from a variable or secrets manager in real configs
  version              = "7.10.0_with_X-Pack"
  description          = "my-main-es-cluster"
  zone_count           = "1"

  # Elasticsearch opened to the whole VPC.
  # NOTE: 0.0.0.0/0 means "everyone"; tighten these CIDR ranges in production.
  private_whitelist = [
    "0.0.0.0/0",
  ]

  # Elasticsearch opened to the Internet (convenient for a tutorial,
  # but restrict or disable this for real workloads).
  enable_public    = true
  public_whitelist = [
    "0.0.0.0/0",
  ]

  # Kibana opened to the Internet (same caveat as above).
  enable_kibana_public_network = true
  kibana_whitelist             = [
    "0.0.0.0/0",
  ]
}

output "domain" {
  value = alicloud_elasticsearch_instance.main.domain
}

output "kibana_domain" {
  value = alicloud_elasticsearch_instance.main.kibana_domain
}

After this, you know, terraform init, terraform plan, and then terraform apply. Classic =)
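Spelled out, that workflow looks like this (the output names match the outputs declared in the config above):

```
terraform init    # download the alicloud provider plugin
terraform plan    # preview what will be created
terraform apply   # provision the cluster (PrePaid instances start billing immediately)

# Once the apply finishes, read back the endpoints we declared as outputs:
terraform output domain
terraform output kibana_domain
```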

Using the Elasticsearch engine

The time for IaC is over. Let’s play with our cluster. Now that we have our Alibaba Cloud Elasticsearch cluster up and running, I’ll show you some basic operations to get you started. In this case, we will set up a Logstash pipeline to sync data from a PostgreSQL database running on Alibaba Cloud RDS.

Logstash: Pipeline to sync an RDS

As mentioned before, Logstash is a data processing tool. I like to see it as a tool that converts almost any source of data into data Elasticsearch can understand. In its most basic mode, a pipeline runs on a schedule to re-index our information and delete the old indices.

An example of a Logstash config file that syncs all the data from an RDS instance to Elasticsearch looks like this:

input {
  jdbc {
    jdbc_driver_library => "/postgresql-42.2.19.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_page_size => 10000
    jdbc_paging_enabled => true
    jdbc_connection_string => "jdbc:postgresql://${JDBC_CONNECTION_STRING}"
    jdbc_user => "${DB_USER}"
    jdbc_password => "${DB_PASSWORD}"
    statement => "${SQL_STATEMENT}"
  }
}

output {
  elasticsearch {
    hosts => [
        "${ES_PROTO}://${ES_ENDPOINT}:${ES_PORT}"
    ]
    index => "${ES_INDEX}"
    document_id => "${ES_DOCUMENT_ID}"
    user => "${ES_USER}"
    password => "${ES_PASSWORD}"
  }
}

First, because Logstash can be used to sync data from almost anywhere, it logically doesn’t ship with drivers for every type of data source, which is why we need to load the JDBC driver ourselves. In this case, we use the “jar” file for connecting to a PostgreSQL database.

To make things easier, I prepared a Docker image called roura/logstash-postgresql-es (docs & source), so you can simply run it as a container, for example as a cronjob in a Kubernetes cluster, or however you prefer.
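For the Kubernetes route, a CronJob could be sketched like this (the namespace defaults, schedule, and secret name are placeholders I made up; the secret is expected to hold the environment variables the pipeline above references, such as JDBC_CONNECTION_STRING and DB_USER):

```
apiVersion: batch/v1
kind: CronJob
metadata:
  name: logstash-postgresql-es
spec:
  schedule: "0 * * * *"   # hourly; adjust to taste
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sync
              image: roura/logstash-postgresql-es
              envFrom:
                - secretRef:
                    name: es-sync-env   # holds JDBC_CONNECTION_STRING, DB_USER, etc.
```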

About the values: all of those shown as ${VALUE} are pulled from the environment variables available on the system Logstash is running on. Examples:

  • JDBC_CONNECTION_STRING: This string is made of “db-host/db-name“, like “example-db.pg.rds.aliyuncs.com/db_name”
  • SQL_STATEMENT: A standard SQL Query like “SELECT id, name, surname FROM table_name”
  • ES_INDEX: Your index name like “my_first_index”
  • ES_DOCUMENT_ID: The document ID, like “%{id}”, matching the “id” from the SQL Query.

The rest of the variables are self-explanatory, I think (let me know if not).
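To see how the pieces fit together, here is a small Python sketch (not part of the pipeline, purely an illustration) of how Logstash expands ${JDBC_CONNECTION_STRING} inside the jdbc_connection_string setting from the config above:

```python
import os

# Example value in the "db-host/db-name" format described above.
os.environ["JDBC_CONNECTION_STRING"] = "example-db.pg.rds.aliyuncs.com/db_name"

def jdbc_connection_string() -> str:
    # Logstash substitutes ${JDBC_CONNECTION_STRING} from the environment,
    # so the pipeline setting resolves to a full PostgreSQL JDBC URL.
    return "jdbc:postgresql://" + os.environ["JDBC_CONNECTION_STRING"]

print(jdbc_connection_string())
# jdbc:postgresql://example-db.pg.rds.aliyuncs.com/db_name
```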

Search Optimisation: Index Templates

Even if you didn’t follow the Logstash step, you can manually create indices and add some data using Kibana; you’ll find its URL in the outputs of the Terraform config we wrote earlier. Before anything else, I recommend reading the docs on this and learning how to play with it.

After you create your first index (using the Logstash pipeline above as an example), you’ll already be able to query data from the cluster, but the results may or may not come back in the order you wanted. This happens because Elasticsearch is not magic and doesn’t know exactly how you want your data to be returned; it just makes a good guess. Because of this, Index Templates are very important: they are a way to tell Elasticsearch how to configure an index when it is created.
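As a small example (the field names are invented; adapt them to the columns your SQL statement selects), an index template created from Kibana’s Dev Tools console could look like this:

```
PUT _index_template/my_first_template
{
  "index_patterns": ["my_first_index*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "properties": {
        "name":    { "type": "text" },
        "surname": { "type": "text" }
      }
    }
  }
}
```

Any index whose name matches my_first_index* will now be created with these settings and mappings automatically, instead of Elasticsearch guessing them.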

A good “first approach” to Index Templates is to create one with an n-gram tokeniser so you can search partial words. Another interesting way to set up a template is to recognise English (or any other language) stop words. Have a look at the “N-gram” tokeniser documentation and play with the amazing world of search engines. Have fun searching!
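To build intuition before touching the tokeniser settings, here is a rough Python sketch of the tokens an n-gram tokeniser with min_gram=2 and max_gram=3 would emit for a single word. It is a simplification: the real tokeniser is also aware of token classes and emits its grams in a slightly different order.

```python
def ngrams(text: str, min_gram: int = 2, max_gram: int = 3) -> list[str]:
    """Emit all substrings of length min_gram..max_gram, roughly
    mimicking Elasticsearch's n-gram tokeniser on a single word."""
    tokens = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            tokens.append(text[start:start + size])
    return tokens

print(ngrams("search"))
# ['se', 'ea', 'ar', 'rc', 'ch', 'sea', 'ear', 'arc', 'rch']
```

Because “sea” is one of the indexed tokens, a query for the partial word “sea” can now match documents containing “search”.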