Open source data processing tools, ranked by score and analyzed honestly. Part of our Data & Storage coverage.
Updated weekly.
Stream processing framework
Unified analytics engine for large-scale data processing
TypeScript-first schema validation with type inference
Extremely fast DataFrame query engine
Distributed event streaming platform
Workflow orchestration for resilient data pipelines
Distributed SQL query engine for big data
Data transformation using software engineering practices
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Available both self-hosted and cloud-hosted.
Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch, with more integrations coming.
Domain-specific language designed to streamline the development of high-performance GPU, CPU, and accelerator kernels
A large scale non-linear optimization library
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
Python Streaming DataFrames for Kafka
Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
High-performance analytical database combining DuckDB SQL engine, Parquet storage, and Arrow format. 18M+ records/sec.