
Spark
Unified analytics engine for large-scale data processing
Coldcast Lens
Spark is the distributed data processing engine that handles datasets too large for a single machine. When your data pipeline needs to crunch terabytes across a cluster, Spark's been the answer for a decade — batch processing, streaming, ML, and SQL all in one framework. 43k stars and the backbone of most enterprise data platforms.
DuckDB is the single-machine SQL engine that eliminates the need for Spark on datasets under ~100GB — roughly 10x the price-performance of Databricks when run on a single EC2 instance. Polars is the Rust-powered DataFrame library that's fastest for single-node operations. But neither scales horizontally like Spark when the data truly demands a cluster.
Use Spark if your data doesn't fit on one machine and you need distributed processing across a cluster. If it fits in RAM or on a single big VM, DuckDB or Polars are 10x simpler.
The catch: Spark's operational complexity is enormous — cluster management, shuffle tuning, memory configuration, and JVM garbage collection. Jobs that could be a DuckDB query running locally in seconds instead take minutes on a cluster. The Scala/Java ecosystem adds dependency-management overhead on top. Most indie hackers will never need Spark — DuckDB handles far more than you'd think.
About
- Stars: 43,038
- Forks: 29,134