Big Data
PySpark
Apache Spark
Hadoop
FP-Growth
Python

Scalable Market Basket Analysis using PySpark & FP-Growth

End-to-end distributed pipeline on 3.2M+ Instacart transactions, from raw CSV to association rules, built on Apache Spark. Includes benchmark results across dataset sizes to show how the pipeline scales.
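To make the "raw CSV to association rules" stages concrete, here is a minimal pure-Python sketch on toy data. It is not the project's code: the rows, thresholds, and function names are hypothetical, and brute-force counting stands in for FP-Growth, which in PySpark is provided by `pyspark.ml.fpm.FPGrowth` and runs distributed. The sketch only shows what each stage produces.

```python
from collections import defaultdict
from itertools import combinations


def build_baskets(rows):
    """Group (order_id, product) rows into one set of products per order."""
    baskets = defaultdict(set)
    for order_id, product in rows:
        baskets[order_id].add(product)
    return list(baskets.values())


def frequent_itemsets(baskets, min_support, max_size=2):
    """Count every itemset up to max_size; keep those meeting min_support.

    Brute force for illustration only. FP-Growth reaches the same result
    without enumerating every combination.
    """
    counts = defaultdict(int)
    for basket in baskets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    n = len(baskets)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def association_rules(supports, min_confidence):
    """Derive single-consequent rules from the frequent itemset supports."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # confidence = support(A and C together) / support(A)
            confidence = support / supports[antecedent]
            if confidence >= min_confidence:
                # lift = confidence / support(C)
                lift = confidence / supports[frozenset([consequent])]
                rules.append((set(antecedent), consequent, confidence, lift, support))
    return rules


# Hypothetical toy rows standing in for the order/product CSV.
rows = [
    (1, "banana"), (1, "avocado"),
    (2, "banana"), (2, "avocado"),
    (3, "banana"),
    (4, "yogurt"),
]
baskets = build_baskets(rows)
supports = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(supports, min_confidence=0.6)
for antecedent, consequent, confidence, lift, support in rules:
    print(antecedent, "->", consequent, round(confidence, 3), round(lift, 3), support)
```

The rule tuples carry the same five fields as the rules table on this page: antecedent, consequent, confidence, lift, and support.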

Transactions processed: 3.2M+
Frequent itemsets: 797
Association rules: 436
Max lift score: 73.8×

Top Frequent Itemsets
Top Association Rules by Confidence
| Antecedent | Consequent | Confidence | Lift | Support |
|---|---|---|---|---|
| Total 2% Lowfat Greek Yogurt (Blueberry) | Total 2% Greek Yogurt (Strawberry) | 45.8% | 48.8× | 0.003 |
| Non Fat Raspberry Yogurt | Icelandic Skyr Blueberry Yogurt | 44.2% | 73.8× | 0.0023 |
| Apple Honeycrisp Organic + Org. Hass Avocado | Bag of Organic Bananas | 44.2% | 3.8× | 0.0021 |
| Cucumber Kirby + Organic Avocado | Banana | 41.8% | 2.8× | 0.002 |
| Organic Raspberries + Org. Hass Avocado | Bag of Organic Bananas | 43.3% | 3.7× | 0.0034 |
| Boneless Skinless Chicken Breasts | Banana | 28.8% | 1.95× | 0.0045 |
| Green Bell Pepper | Organic Baby Spinach | 16.8% | 2.23× | 0.0029 |
| Limes + Banana | Organic Avocado | 23.1% | 4.2× | 0.0023 |
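The three metrics in the table are tied together by the standard definitions confidence = support(A and C) / support(A) and lift = confidence / support(C). The sketch below uses them to back out the implied antecedent and consequent supports for two rows. It assumes the Support column is the joint support of the whole rule (antecedent and consequent together); the function name is hypothetical, and because the table values are rounded the results are approximate.

```python
def implied_supports(confidence, lift, rule_support):
    """Back out antecedent and consequent support from a rule's metrics.

    confidence = rule_support / support(A)  ->  support(A) = rule_support / confidence
    lift       = confidence / support(C)    ->  support(C) = confidence / lift
    """
    antecedent_support = rule_support / confidence
    consequent_support = confidence / lift
    return antecedent_support, consequent_support


# Non Fat Raspberry Yogurt -> Icelandic Skyr Blueberry Yogurt (lift 73.8)
skyr_antecedent, skyr_consequent = implied_supports(0.442, 73.8, 0.0023)

# Apple Honeycrisp Organic + Org. Hass Avocado -> Bag of Organic Bananas (lift 3.8)
banana_antecedent, banana_consequent = implied_supports(0.442, 3.8, 0.0021)

print(f"skyr consequent support:   {skyr_consequent:.4f}")
print(f"banana consequent support: {banana_consequent:.4f}")
```

Both rules have the same confidence, 44.2%, yet their lifts differ by a factor of about 19. Under the assumption above, the skyr yogurt appears in roughly 0.6% of orders while the bag of bananas appears in roughly 11.6%, so the same confidence is far more surprising for the rarer product. That is why the highest-lift rules in the table pair niche yogurts rather than staple produce.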
Confidence vs. Lift — Association Rules