Applying DataFusion in HoraeDB

Practices and Challenges

Jiacai Liu
July 21, 2024

About ME

1. What is time series

  • A collection of time-based data points that can be connected into (time) lines.
  • Use tags to differentiate between lines

Characteristics

  • Vertical writes, horizontal(-ish) reads

Scenarios

  • IoT
  • APM (Application Performance Monitoring)
  • Weather Forecasting
  • Stock Market Analysis

2. Time series database

Specialized database that efficiently stores and retrieves time-stamped data

  • Prometheus
  • InfluxDB
  • Apache HoraeDB
  • GreptimeDB

Challenge

  • High write throughput
  • High-cardinality tags leading to BIG index
  • Real-time OLAP like query pattern

Data & Inverted Index

3. FDAP

How it works for time series

DataFusion

LLVM-like Infrastructure for Databases

4. HoraeDB's QueryEngine

SELECT
    time_bucket(`timestamp`, 'PT1M') AS `timestamp`,
    SUM(`value`) AS `value_sum`
FROM
    `table`
WHERE
    `timestamp` >= '2023-12-15 07:17:00'
    AND `timestamp` < '2023-12-14 08:17:00'
    AND ((`col2` IN ('T')))
GROUP BY
    time_bucket (`timestamp`, 'PT1M')

Local

LSM-like engine

Distributed

Partitioned tables to solve single-table bottlenecks:

  • Hash
  • Range
  • Round-robin

Open as a "normal" table

(Single point hotspot)

Open as a "virtual" table

pub enum AggregateMode {
    /// Partial aggregate that can be applied in parallel across input partitions
    Partial,
    /// Final aggregate that produces a single partition of output
    Final,
    ....
}

AggregateMode

Challenge

Thanks

https://horaedb.apache.org/