Best Open Source Data Processing Tools

Open source data processing tools, ranked by score and analyzed honestly. Part of our Data & Storage coverage.

Updated weekly.

1

Flink

85

Stream processing framework

26,006 · Java · permissive
2

Spark

83

Unified analytics engine for large-scale data processing

43,269 · Scala · permissive
3

Zod

83

TypeScript-first schema validation with type inference

42,673 · TypeScript · permissive
4

Polars

83

Extremely fast DataFrame query engine

38,484 · Rust · permissive
5

Kafka

83

Distributed event streaming platform

32,593 · Java · permissive
6

Prefect

83

Workflow orchestration for resilient data pipelines

22,405 · Python · permissive
7

Trino

83

Distributed SQL query engine for big data

12,813 · Java · permissive
8

dbt

81

Data transformation using software engineering practices

12,783 · Python · permissive
9

airbyte

77

The leading data integration platform for ETL / ELT data pipelines from APIs, databases, and files to data warehouses, data lakes, and data lakehouses. Available both self-hosted and cloud-hosted.

21,260 · Python · unknown
10

tilelang

75

Domain-specific language designed to streamline the development of high-performance GPU, CPU, and accelerator kernels

6,201 · Python · permissive
11

ceres-solver

75

A large-scale non-linear optimization library

4,474 · C++ · permissive
12

quix-streams

70

Python Streaming DataFrames for Kafka

1,549 · Python · permissive
13

kreuzberg

63

A polyglot document intelligence framework with a Rust core. Extracts text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or via CLI, REST API, or MCP server.

8,311 · Rust · unknown
14

arc

52

High-performance analytical database combining the DuckDB SQL engine, Parquet storage, and the Arrow format. 18M+ records/sec.

594 · Go · strong-copyleft
