Best Open Source Data Processing Tools

Open source data processing tools, ranked by score and analyzed honestly. Part of our Data & Storage coverage.

Updated weekly.

1

Flink

85

Stream processing framework

26,100 | Java | permissive
2

Spark

83

Unified analytics engine for large-scale data processing

43,487 | Scala | permissive
3

Zod

83

TypeScript-first schema validation with type inference

43,023 | TypeScript | permissive
4

Polars

83

Extremely fast DataFrame query engine

38,835 | Rust | permissive
5

Kafka

83

Distributed event streaming platform

32,893 | Java | permissive
6

Prefect

83

Workflow orchestration for resilient data pipelines

22,653 | Python | permissive
7

Trino

83

Distributed SQL query engine for big data

12,944 | Java | permissive
8

dbt

81

Data transformation using software engineering practices

13,037 | Python | permissive
9

airbyte

77

The leading data integration platform for ETL/ELT data pipelines from APIs, databases, and files to data warehouses, data lakes, and data lakehouses. Available both self-hosted and cloud-hosted.

21,499 | Python | unknown
10

lance

75

Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch, with more integrations coming.

6,693 | Rust | permissive
11

tilelang

75

Domain-specific language designed to streamline the development of high-performance GPU, CPU, and accelerator kernels

6,523 | Python | permissive
12

ceres-solver

75

A large-scale non-linear optimization library

4,499 | C++ | permissive
13

gravitino

73

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

3,017 | Java | permissive
14

quix-streams

70

Python Streaming DataFrames for Kafka

1,554 | Python | permissive
15

bruin

66

Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.

1,621 | Go | permissive
16

kreuzberg

63

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

8,511 | Rust | unknown
17

root

63

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

3,234 | C++ | unknown
18

arc

52

High-performance analytical database combining the DuckDB SQL engine, Parquet storage, and the Arrow format. 18M+ records/sec.

610 | Go | strong-copyleft
