Elasticsearch: blazing fast products per category

This post shows how to use Elasticsearch aggregations to count the products in each category and sub category, and how to cover that query with tests.

On our site every product has a category and, optionally, a sub category. For example, a product can be in the category shoes and the sub category sneakers. The list we want to build looks like this:

shirts (134)
shoes (254)
    - running (54)
    - sneakers (200)

Let’s get started. First we create a CategoryItem record with the fields main and sub, and a repository that stores these items in Elasticsearch.

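import static org.elasticsearch.client.RequestOptions.DEFAULT;
import static org.elasticsearch.common.xcontent.XContentType.JSON;

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RestHighLevelClient;

public class CategoryItemRepository {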
    public static final String INDEX_CAT_ITEM = "category-item";
    private final RestHighLevelClient client;
    private final ObjectMapper objectMapper;

    public static record CategoryItem(String id, String main, String sub) {}

    public CategoryItemRepository(RestHighLevelClient client, ObjectMapper objectMapper) {
        this.client = client;
        this.objectMapper = objectMapper;
    }

    public void save(CategoryItem categoryItem) throws IOException {
        var request = new IndexRequest(INDEX_CAT_ITEM)
            .id(categoryItem.id())
            .source(objectMapper.writeValueAsString(categoryItem), JSON);

        client.index(request, DEFAULT);
    }
}

To count the products per category we use the aggregations API. We add a method getCategoriesCount with a terms aggregation on the field main and a sub aggregation on the field sub. The search response returns buckets: the key of each bucket is the category and its doc count is the number of products in it. By default the buckets are ordered by doc count, highest first. To list the sub categories alphabetically we order them by bucket key.

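import static org.elasticsearch.search.aggregations.AggregationBuilders.terms;
import static org.elasticsearch.search.builder.SearchSourceBuilder.searchSource;

import java.util.ArrayList;
import java.util.List;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.search.aggregations.BucketOrder;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class CategoryItemRepository {
    // ... fields, constructor and save() as shown above

    // Result types; their shape is inferred from how they are used in this post.
    public record SubCategory(String category, long count) {}
    public record MainCategory(String category, long count, List<SubCategory> subCategories) {}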

    public List<MainCategory> getCategoriesCount() throws IOException {
        var source = searchSource()
            .size(0) // return no search hits, only the aggregation results
            .aggregation(
                terms("categories").field("main").size(50)
                    .subAggregation(
                        terms("subCategories").field("sub").size(100).order(BucketOrder.key(true))
                    )
            );

        var response = client.search(new SearchRequest(INDEX_CAT_ITEM).source(source), DEFAULT);

        List<MainCategory> results = new ArrayList<>();
        var categoriesTerms = response.getAggregations().<Terms>get("categories");
        categoriesTerms.getBuckets().forEach(categoryBucket -> {
            List<SubCategory> subCategories =
                categoryBucket.getAggregations().<Terms>get("subCategories")
                .getBuckets().stream()
                .map(subCategoryBucket ->
                    new SubCategory(subCategoryBucket.getKeyAsString(),
                        subCategoryBucket.getDocCount())
                ).toList();

            results.add(new MainCategory(categoryBucket.getKeyAsString(),
                categoryBucket.getDocCount(),
                subCategories));
        });

        return results;
    }
}

The result as JSON looks like this:

[ {
  "category" : "shirts",
  "count" : 134,
  "subCategories" : []
}, {
  "category" : "shoes",
  "count" : 254,
  "subCategories" : [ {
    "category" : "running",
    "count" : 54
  }, {
    "category" : "sneakers",
    "count" : 200
  } ]
} ]
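
To make the bucket semantics concrete, here is a minimal plain-Java sketch that computes the same kind of counts in memory, without Elasticsearch. It is only an illustration of what the terms aggregation and its sub aggregation return, not how Elasticsearch computes them. The names CategoryCountSketch, MainCount, SubCount and countInMemory are made up for this sketch. The ordering (highest count first, ties broken by category name, sub categories by name) is an assumption that mirrors the query above and the order the test in this post expects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CategoryCountSketch {

    // Same shape as the CategoryItem record used in this post.
    record CategoryItem(String id, String main, String sub) {}

    // Hypothetical result types, used by this sketch only.
    record SubCount(String category, long count) {}
    record MainCount(String category, long count, List<SubCount> subCategories) {}

    static List<MainCount> countInMemory(List<CategoryItem> items) {
        // main category -> number of items
        Map<String, Long> mainCounts = new TreeMap<>();
        // main category -> (sub category -> number of items); TreeMap keeps keys sorted
        Map<String, Map<String, Long>> subCounts = new TreeMap<>();

        for (CategoryItem item : items) {
            mainCounts.merge(item.main(), 1L, Long::sum);
            Map<String, Long> subs = subCounts.computeIfAbsent(item.main(), k -> new TreeMap<>());
            if (item.sub() != null) { // items without a sub category get no sub bucket
                subs.merge(item.sub(), 1L, Long::sum);
            }
        }

        List<MainCount> result = new ArrayList<>();
        mainCounts.forEach((main, count) -> {
            List<SubCount> subs = subCounts.get(main).entrySet().stream()
                .map(e -> new SubCount(e.getKey(), e.getValue()))
                .toList();
            result.add(new MainCount(main, count, subs));
        });

        // Highest count first, ties broken by category name.
        result.sort(Comparator.comparingLong(MainCount::count).reversed()
            .thenComparing(MainCount::category));
        return result;
    }

    public static void main(String[] args) {
        var items = List.of(
            new CategoryItem("1", "shoes", "sneakers"),
            new CategoryItem("2", "shoes", "running"),
            new CategoryItem("3", "shoes", "sneakers"),
            new CategoryItem("4", "shirts", null));
        countInMemory(items).forEach(System.out::println);
    }
}
```

The difference with the real thing is that this loop has to read every item for every request, while Elasticsearch computes the buckets inside the cluster and only returns the counts.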

Now we want to test whether the search works as expected, and this is where Testcontainers comes to the rescue. I really love it, because there is no more fiddling around to start Elasticsearch on different operating systems for testing. See also the blog about Testcontainers from Tom!

Step one is to start the container once for all tests, because starting it costs a couple of seconds. The Testcontainers JUnit 5 integration provides the annotations @Testcontainers and @Container for this; a static @Container field is started once and shared by all tests in the class. To avoid side effects between tests, the index is deleted and recreated before each test.

@Testcontainers
class CategoryItemRepositoryTest {

    private RestHighLevelClient client;
    private ObjectMapper objectMapper;
    private static final String MAPPING = "{}"; // added later

    @Container
    private static final ElasticsearchContainer container = new ElasticsearchContainer(
        DockerImageName.parse("docker.elastic.co/elasticsearch/elasticsearch-oss").withTag("7.10.2")
    );

    @BeforeEach
    void setUp() throws IOException {
        client = new RestHighLevelClient(RestClient.builder(HttpHost.create(container.getHttpHostAddress())));

        if (client.indices().exists(new GetIndexRequest(INDEX_CAT_ITEM), RequestOptions.DEFAULT)) {
            client.indices().delete(new DeleteIndexRequest(INDEX_CAT_ITEM), RequestOptions.DEFAULT);
        }
        client.indices().create(new CreateIndexRequest(INDEX_CAT_ITEM).source(MAPPING, XContentType.JSON), RequestOptions.DEFAULT);
        objectMapper = new ObjectMapper();
    }

    @Test
    void shouldReturnCategoriesCount() throws IOException {
        var repository = new CategoryItemRepository(client, objectMapper);

        repository.save(new CategoryItem("1", "main1", "sub1"));
        repository.save(new CategoryItem("2", "main1", "sub2"));
        repository.save(new CategoryItem("3", "main1", "sub2"));
        repository.save(new CategoryItem("4", "mainOnly", null));
        repository.save(new CategoryItem("5", "mainOnly", null));
        repository.save(new CategoryItem("6", "main2", "sub2"));
        repository.save(new CategoryItem("7", "main2", "sub2"));

        refreshIndex(); // force a refresh; by default new documents only become searchable after the next periodic refresh

        var categories = repository.getCategoriesCount();

        assertThat(categories.stream().map(MainCategory::category)).containsExactly("main1", "main2", "mainOnly");
        assertThat(categories.stream().map(MainCategory::count)).containsExactly(3L, 2L, 2L);
        var maybeMain1 = categories.stream().filter(cat -> cat.category().equals("main1")).findFirst();
        assertThat(maybeMain1).isPresent();
        assertThat(maybeMain1.get().subCategories().stream()
            .filter(sub -> sub.category().equals("sub2"))
            .map(CategoryItemRepository.SubCategory::count)).containsExactly(2L);
    }

    private void refreshIndex() throws IOException {
        client.indices().refresh(new RefreshRequest(INDEX_CAT_ITEM), RequestOptions.DEFAULT);
    }
}

Running this test throws an exception:

ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [main] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [main] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]];

To fix this problem, create the index with a mapping that uses keyword fields (this is the value of the MAPPING constant in the test):

{
    "mappings": {
        "properties": {
            "id": {
                "type": "keyword"
            },
            "main": {
                "type": "keyword"
            },
            "sub": {
                "type": "keyword"
            }
        }
    }
}

With this mapping in place the test passes.
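
The exception is about missing per-document field data, but there is a second reason why keyword is the right type for category names: a text field is analyzed into separate terms, while a keyword field keeps the whole value as one term. The sketch below imitates that difference in plain Java. It is a rough stand-in and not the real Elasticsearch analyzer, and the class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TextVsKeywordSketch {

    // Rough imitation of analyzing a text field: split on non-word characters
    // (ASCII only in this sketch) and lowercase every token.
    static List<String> textTerms(String value) {
        return Arrays.stream(value.split("\\W+"))
            .filter(token -> !token.isEmpty())
            .map(token -> token.toLowerCase(Locale.ROOT))
            .toList();
    }

    // A keyword field indexes the whole value as a single, unchanged term.
    static List<String> keywordTerms(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        String category = "Running Shoes";
        System.out.println("text terms:    " + textTerms(category));
        System.out.println("keyword terms: " + keywordTerms(category));
    }
}
```

Bucketing on the analyzed terms would split a category name with several words into one bucket per word, which is not what we want for a category list. With keyword fields every bucket key is exactly the category name that was stored.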

Conclusion

With Elasticsearch aggregations it is possible to get the number of products per category blazing fast, and the counts stay up to date as products are indexed. New categories and sub categories show up automatically as soon as there are products in them. Testcontainers makes it easy to write integration tests that check your assumptions about the not so intuitive Java APIs of Elasticsearch.

Code can be found on GitHub

Written with Elasticsearch 7.10.2, JUnit 5, Java 17

The examples use an Elasticsearch version that is still compatible with both Elasticsearch and OpenSearch (previously Open Distro).