Applying DataFusion in HoraeDB

Practices and Challenges

Jiacai Liu

July 21, 2024

About ME

Senior Engineer at AntGroup
5 years of experience with Rust
Apache HoraeDB PPMC Member
❤️ Open source
- Podcast：RustTalk
- https://github.com/jiacai2050

1. What is time series

A collection of time-based data points that can be connected into (time) lines.
Use tags to differentiate between lines

Characteristics

Vertical writes, horizontal(-ish) reads

Scenarios

IoT
APM (Application Performance Monitoring)
Weather Forecasting
Stock Market Analysis

2. Time series database

Specialized database that efficiently stores and retrieves time-stamped data

Prometheus
InfluxDB
Apache HoraeDB
GreptimeDB
…

Challenge

High write throughput
High-cardinality tags leading to BIG index
Real-time OLAP like query pattern

Data & Inverted Index

3. FDAP

How it works for time series

DataFusion

LLVM-like Infrastructure for Databases

4. HoraeDB's QueryEngine

SELECT
    time_bucket(`timestamp`, 'PT1M') AS `timestamp`,
    SUM(`value`) AS `value_sum`
FROM
    `table`
WHERE
    `timestamp` >= '2023-12-15 07:17:00'
    AND `timestamp` < '2023-12-14 08:17:00'
    AND ((`col2` IN ('T')))
GROUP BY
    time_bucket (`timestamp`, 'PT1M')

Local

LSM-like engine

Distributed

Partitioned tables to solve single-table bottlenecks:

Hash
Range
Round-robin

Open as a "normal" table

(Single point hotspot)

Open as a "virtual" table

pub enum AggregateMode {
    /// Partial aggregate that can be applied in parallel across input partitions
    Partial,
    /// Final aggregate that produces a single partition of output
    Final,
    ....
}

AggregateMode

Challenge

Indexing Support in DataFusion? #9963
Teach query engine aware of data distribution patterns(TSID)
Upgrade pain, 1-3 weeks

Thanks

https://horaedb.apache.org/