This repository was archived by the owner on Apr 11, 2024. It is now read-only.

This repo contains a custom Elasticsearch plugin that aggregates stored binary representations of the HyperLogLogPlus data structure.



Elasticsearch HyperloglogPlus Aggregator

A simple aggregation that merges serialized HyperLogLogPlus objects which have been saved in a binary field in Elasticsearch.


The HyperLogLogPlus data structure lets us estimate the cardinality of a multiset (its number of distinct elements) with a small trade-off in accuracy. One useful property of HyperLogLog is that several sketches can be merged to compute the cardinality of the combined multiset.

This plugin uses the HyperLogLogPlus implementation from stream-lib. We first considered the Elasticsearch HLL implementation, but decided to go with stream-lib because the ES implementation may change and direct use of its HyperLogLogPlus class is not supported.
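The merge property can be illustrated with a toy sketch. The class below is a plain HyperLogLog written for illustration only; it is not HyperLogLogPlus, it is not stream-lib's implementation, and its serialized form is unrelated to what the plugin stores. The name ToyHLL, the SHA-1 based hash, and the precision p=12 are assumptions made for this example.

```python
import hashlib
import math


class ToyHLL:
    """Plain HyperLogLog with 2**p registers (illustration only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is a register-wise maximum; both sketches need the same precision.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = ToyHLL(self.p)
        merged.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Two overlapping audiences: 60,000 users each, 20,000 in common.
a, b = ToyHLL(), ToyHLL()
for u in range(0, 60000):
    a.add(f"user-{u}")
for u in range(40000, 100000):
    b.add(f"user-{u}")

union = a.merge(b)
print(round(a.cardinality()), round(b.cardinality()), round(union.cardinality()))
```

The merged estimate lands near 100,000 rather than 120,000, because shared users fall into the same registers in both sketches and are not counted twice.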

Sample Use Case

Unique audience count

Given a data set of social media posts with a few billion visits and 10 million posts (with hashtags), we must count the distinct users who have seen a combination of hashtags associated with pages. One approach would be to index each user as a document, with posts and their hashtags as properties of the document. This demands a huge index, and the view is very 'user-centric'.

Instead, the approach used here is to index each post as a document with a HyperLogLog binary field representing the unique visitors for that post. Then, using the hyperlogsum aggregation provided by this plugin, unique visitors can be computed by any post dimension (such as hashtags here).
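A minimal sketch of the post-centric model follows. To keep it runnable without the plugin, a Python set stands in for each post's HyperLogLog field, so the counts are exact rather than approximate. The post data and the function name unique_visitors_by_hashtag are hypothetical.

```python
from collections import defaultdict

# Hypothetical posts: each document carries its hashtags and a visitor "sketch".
# A set is used as an exact stand-in for the serialized HyperLogLog field.
posts = [
    {"hashtags": ["#es", "#hll"], "visitors": {"u1", "u2", "u3"}},
    {"hashtags": ["#es"], "visitors": {"u2", "u3", "u4"}},
    {"hashtags": ["#hll"], "visitors": {"u1", "u5"}},
]


def unique_visitors_by_hashtag(posts):
    merged = defaultdict(set)
    for post in posts:
        for tag in post["hashtags"]:
            # Merge the per-post sketches; adding per-post counts would double count.
            merged[tag] |= post["visitors"]
    return {tag: len(users) for tag, users in merged.items()}


result = unique_visitors_by_hashtag(posts)
print(result)  # {'#es': 4, '#hll': 4}
```

Summing the per-post visitor counts would instead give 6 for #es and 5 for #hll, which is why the aggregation merges sketches rather than adding numbers.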

The HyperLogLog binary field to index can be computed in one of these ways:

1 - As demonstrated in the integration tests, use stream-lib directly with the same precision the plugin uses internally (HyperUniqeSumAggregationBuilder.SERIALIZED_DENSE_PRECISION, HyperUniqeSumAggregationBuilder.SERIALIZED_SPARSE_PRECISION).

2 - If you use Spark to index data into Elasticsearch, use the provided example UDAF, which internally uses the same precision.

3 - Build a custom UDAF for your stack, modeled on the Spark version (feel free to contribute back!).


Elasticsearch Versions

Since this plugin might change with minor versions of ES, the source code is organized into ES-version-specific directories. Currently we support ES 5.4.3.

Please create an issue if a different ES-version-specific release needs to be made.


Switch to the ES-version-specific directory.

cd es-5.4.3

To install this plugin, first create a zip distribution by running

gradle clean assemble

This will produce a zip file in build/distributions.

After building the zip file, you can install it like this

bin/elasticsearch-plugin install file:///path/to/elasticsearch-hyperloglogplus/build/distribution/

Elasticsearch Mappings

Since we use a binary field in ES to store the HLL, doc_values should be set to true in the mapping declaration (binary fields do not enable doc_values by default).
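As a rough illustration, the mapping requirement can be checked before any documents are indexed. The helpers below are hypothetical (they are not part of the plugin) and assume the field name hll and the document type product.

```python
import json


def hll_template(field="hll", doc_type="product"):
    """Build an index-template body with a binary, doc_values-enabled field."""
    return {
        "order": 0,
        "template": "*",
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "binary", "doc_values": True},
                }
            }
        },
        "aliases": {},
    }


def is_valid_hll_field(template, field="hll", doc_type="product"):
    """Return True if the field is mapped as binary with doc_values enabled."""
    props = template["mappings"][doc_type]["properties"]
    spec = props.get(field, {})
    return spec.get("type") == "binary" and spec.get("doc_values") is True


body = hll_template()
print(json.dumps(body, indent=2))
print(is_valid_hll_field(body))  # True
```

The printed JSON can be sent as the body of a template request; a mapping that omits doc_values fails the check.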

Indexing Example

The hll field should be mapped with type = binary and doc_values = true, for example with an index template like the one below:

curl -XPUT http://localhost:9200/_template/template_1 -d '
{
  "order": 0,
  "template": "*",
  "mappings": {
    "product": {
      "properties": {
        "hll": {
          "type": "binary",
          "doc_values": true
        },
        "desc": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "aliases": {}
}'

Python sample code to populate the index

import json
import requests

# Sample document; "hll" holds a base64-encoded serialized HyperLogLogPlus object.
data = {
          "desc" : "bar",
          "hll" : "rO0ABXNyAFBjb20uY2xlYXJzcHJpbmcuYW5hbHl0aWNzLnN0cmVhbS5jYXJkaW5hbGl0eS5IeXBlckxvZ0xvZ1BsdXMkU2VyaWFsaXphdGlvbkhvbGRlct2sxediKNllDAAAeHB6AAACVf////4OGQHIAearF8aBOYiDAfAZlpgGrusBwO8M7ij6uSHKkgOA0RDW9Wuyig30qCvu9CTQ3yT62grKsRjumQeE9QG+vQ3c7ATc6TC0OsjSH/CaE6RY+PYD0I8WtvM1tMcBlswJ7vsQ0E/6/wLEvSWW/ibm4Q6YL9KADMyeCLq1JsDWKszxE7CzBIibCsrVCfS3CPCbG9TsA8TpVeaUCejZE9KlH4iPCNSNEczxB/pu+NUPutEC/KgVxpwBzKpI0poTwk6i6CO83APe/waKqRv+tg328CewmxLugwLqwA7GxwimiQWIswuC/i/WsxCo0Rrs2QbAywi2yATqkyz46SLO6wic/gXyGaTcAaiHMezAAvzfCrrwEsyUGbhHnKUDhLgj9IMSvogC7MMIrusBjPMD2L0FkvQPrKcfoJpM2r1K7rEGtP0S2rAr+tJM9oga3q8OivkJrPAF4uQojscWmMw62towntIEoN0DzowQjvAq7NYI4IQJ7sojsq0L5I8B+OoUptgohswMrNEa2Pw5iIMB5uEdurUmioM40M4CgrEj2OIKqIcPmosL7NkGvsMK5LIisOkH0PELuucEirAitKEH7NkG9JcBzPICnrAO8qEh0LkJkKoN0MIQ1pgLwu8XruUm3IQQ5rgVwt0RgJIwzK4DqMEw4IoZkO8KrqUP1uIIns4nyokO2kuorxPUr1OEvweWogL8qyDkpAyyig64shDqoS++zQvu4xbOuhL48grqohHg8QfSlQKO+wPmH6brQqSiCJSiAqSZVujUA9jDArqyC/T1Xng="
}

url = 'http://localhost:9200/newindex/product/a'

# Index 99 copies of the document under the ids a1 .. a99.
for i in range(1, 100):
    requests.post(url + str(i), json=data)
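The sample hll value above appears to be a base64-encoded Java serialization stream. As a quick sanity check, the sketch below decodes only the first 120 characters of that value and reads the stream header and class name; the variable names are illustrative, and nothing here deserializes or interprets the sketch itself.

```python
import base64
import struct

# First 120 base64 characters of the sample "hll" value shown above.
prefix = (
    "rO0ABXNyAFBjb20uY2xlYXJzcHJpbmcuYW5hbHl0aWNzLnN0cmVhbS5jYXJkaW5hbGl0"
    "eS5IeXBlckxvZ0xvZ1BsdXMkU2VyaWFsaXphdGlvbkhvbGRlct2s"
)
raw = base64.b64decode(prefix)

# Java serialization streams start with the magic 0xACED and version 5.
magic, version = struct.unpack(">HH", raw[:4])
# Then TC_OBJECT (0x73), TC_CLASSDESC (0x72), a 2-byte name length, and the class name.
tc_object, tc_classdesc, name_len = struct.unpack(">BBH", raw[4:8])
class_name = raw[8:8 + name_len].decode("ascii")

print(hex(magic), version)
print(class_name)
# com.clearspring.analytics.stream.cardinality.HyperLogLogPlus$SerializationHolder
```

A value produced by a different library or encoding would not start with this header, which is a cheap thing to verify before indexing.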

Querying with custom aggregation

curl -XPOST 'localhost:9200/_search?pretty=true&size=0' -d '{ "query" : {"match_all" : {}}, "aggs" : { "desc.keyword" : { "terms" : { "field" : "desc.keyword"}  ,  "aggs" : {  "uniq" : {"hyperlogsum" : { "field" : "hll" } } } } } }'

{
  "took" : 36,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 99,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "desc.keyword" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "bar",
          "doc_count" : 99,
          "uniq" : {
            "value" : 200.0
          }
        }
      ]
    }
  }
}

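For client code, the query body and the response handling might look roughly like the sketch below. The function names build_query and unique_counts are hypothetical, and the sample response is a trimmed copy of the one shown above, so no running cluster is needed to try it.

```python
import json


def build_query(group_field="desc.keyword", hll_field="hll"):
    """Terms aggregation with a nested hyperlogsum aggregation on the HLL field."""
    return {
        "query": {"match_all": {}},
        "aggs": {
            group_field: {
                "terms": {"field": group_field},
                "aggs": {"uniq": {"hyperlogsum": {"field": hll_field}}},
            }
        },
    }


def unique_counts(response, group_field="desc.keyword"):
    """Map each bucket key to its estimated unique count."""
    buckets = response["aggregations"][group_field]["buckets"]
    return {bucket["key"]: bucket["uniq"]["value"] for bucket in buckets}


# Trimmed copy of the sample response above.
sample = json.loads("""
{"aggregations": {"desc.keyword": {"buckets": [
  {"key": "bar", "doc_count": 99, "uniq": {"value": 200.0}}]}}}
""")

print(json.dumps(build_query()))
print(unique_counts(sample))  # {'bar': 200.0}
```

Note that the 99 documents in the bucket all carry the same sketch, so the merged estimate stays at 200 instead of growing with doc_count.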
Bugs & TODO

  • Use an encoding format for HLL (like Redis)
  • Consider LogLog Beta
  • HLL Mapping Type


Thanks to David Pilato for the gradle cookie cutter generator, which makes plugin projects easier.

