Best Open Source Data Processing Tools

The leading data integration platform for ETL/ELT data pipelines from APIs, databases, and files to data warehouses, data lakes, and data lakehouses. Available both self-hosted and cloud-hosted.

21,499 ★ · Python · unknown

lance

Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.

6,693 ★ · Rust · permissive

tilelang

Domain-specific language designed to streamline the development of high-performance GPU, CPU, and accelerator kernels

6,523 ★ · Python · permissive

ceres-solver

A large-scale non-linear optimization library

4,499 ★ · C++ · permissive

gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

3,017 ★ · Java · permissive

quix-streams

Python Streaming DataFrames for Kafka

1,554 ★ · Python · permissive

bruin

Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.

1,621 ★ · Go · permissive

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

8,511 ★ · Rust · unknown