An end-to-end data analytics project analyzing vendor and product performance for a multi-store beverage and spirits retailer. The project covers the full pipeline from raw data ingestion to an interactive Power BI dashboard, with exploratory analysis, feature engineering, and statistical hypothesis testing in between.
A beverage and spirits retailer operating across multiple stores and cities needed answers to critical procurement and profitability questions:
- Which vendors generate the most revenue — and which consume budget without adequate return?
- Are there hidden high-margin products being undersold due to lack of visibility?
- Does purchasing in bulk actually reduce unit costs, and by how much?
- Is the underperformance of low-tier vendors statistically meaningful or just noise?
- Which vendors are tying up cash in unsold inventory?
| Metric | Result |
|---|---|
| Total Sales | $452M |
| Total Purchases | $322M |
| Gross Profit | $129.72M |
| Avg Profit Margin | 38.8% |
| Unsold Inventory | $9M |
| Top 10 vendor purchase concentration | 65.69% |
| Bulk order unit cost saving vs small orders | −72.4% ($39.07 → $10.78) |
| High-margin, low-sales brands identified | 198 |
| T-test result (top vs low vendors) | T = −17.67, p < 0.0001 ✅ Reject H₀ |
vendor-performance-analysis/
│
├── data/ # Raw CSV source files (not included — see Data section)
│ ├── begin_inventory.csv
│ ├── end_inventory.csv
│ ├── purchases.csv
│ ├── purchase_prices.csv
│ ├── sales.csv
│ └── vendor_invoice.csv
│
├── Ingestion.ipynb # Step 1: Load CSVs into DuckDB
├── Exploretey_Data_Analysis.ipynb # Step 2: EDA + feature engineering + vendor_sales_summary
├── Vendor_Performance_Analysis.ipynb # Step 3: Analysis, visualizations & statistical tests
│
├── vendor_sales_summary_final.xls # Exported summary table (Power BI source)
│
├── dashboard/
│ └── Vendor_Performance_Dashboard.png # Power BI dashboard screenshot
│
├── report/
│ └── Vendor_Performance_Report.docx # Full professional analysis report
│
└── README.md
Raw CSVs (6 files, ~15.6M rows)
│
▼
[ Ingestion.ipynb ]
→ DuckDB (inventory.db)
→ 6 tables created in 30 seconds
│
▼
[ Exploretey_Data_Analysis.ipynb ]
→ Multi-CTE SQL joins all 6 tables
→ Computes: GrossProfit, ProfitMargin, StockTurnover, SalesToPurchaseRatio
→ Cleans & exports: vendor_sales_summary (10,692 rows → 8,564 valid)
│
▼
[ Vendor_Performance_Analysis.ipynb ]
→ EDA: distributions, correlation heatmap, summary statistics
→ Vendor ranking & purchase concentration analysis
→ Bulk purchasing analysis
→ High-margin / low-sales brand discovery (198 brands)
→ Welch's T-test: top vs low vendor profit margins
│
▼
[ Power BI Dashboard ]
→ KPI cards, donut chart, bar charts, scatter plot, low performer highlight
The raw data consists of 6 CSV files representing one full year (Jan 1 – Dec 31, 2024) of retail operations:
| File | Rows | Description |
|---|---|---|
begin_inventory.csv |
206,529 | Opening stock by store, brand, city |
end_inventory.csv |
224,489 | Closing stock at year end |
purchases.csv |
2,372,474 | All purchase orders with vendor & pricing details |
purchase_prices.csv |
12,261 | Reference: brand-level pricing per vendor |
sales.csv |
12,825,363 | Daily sales transactions including excise tax |
vendor_invoice.csv |
5,543 | Freight/logistics cost per vendor |
⚠️ Raw data files are not included in this repository due to size. Place your CSVs in a/datafolder before running the notebooks.
The core analytical table is built via a multi-CTE SQL query and enriched with these KPIs:
| Feature | Formula |
|---|---|
GrossProfit |
Total_sales_dollars − TotalPurchaseDollars |
ProfitMargin |
(GrossProfit / Total_sales_dollars) × 100 |
StockTurnover |
Total_sales_quantity / TotalPurchaseQuantity |
SalesToPurchaseRatio |
Total_sales_dollars / TotalPurchaseDollars |
| Chart | Insight |
|---|---|
| Distribution Histograms | Most metrics are right-skewed — a few top performers dominate |
| Correlation Heatmap | PurchasePrice vs ActualPrice: 0.99 | GrossProfit vs TotalPurchaseDollars: 0.96 |
| Scatter Plot (Sales vs Margin) | 198 red-dot brands identified in the high-margin, low-sales quadrant |
| Donut Chart | DIAGEO leads at 16.3%; top 10 total = 65.69% of all purchases |
| CI Histogram | Top vendors (mean 31.18%) vs Low vendors (mean 41.57%) — clearly distinct distributions |
| Power BI Dashboard | Interactive: KPI cards, vendor ranking, low performer bar chart, scatter by target brand |
A two-sample Welch's T-test was conducted to validate whether the profit margin difference between top-performing and low-performing vendors is statistically significant.
- H₀: No significant difference in profit margins between the two groups
- H₁: A significant difference exists
| Top Vendors (top 25% by sales) | Low Vendors (bottom 25% by sales) | |
|---|---|---|
| Mean Profit Margin | 31.18% | 41.57% |
| 95% Confidence Interval | (30.74%, 31.61%) | (40.50%, 42.64%) |
Result: T = −17.6695, p = 0.0000 → Reject H₀
The two vendor groups operate in fundamentally different profitability regimes. This statistically validates using performance tiering as the basis for vendor management decisions.
- Diversify vendor portfolio — reduce 65.69% top-10 concentration to below 55% within 12 months
- Consolidate orders — shift volume to large-order tiers for 72% unit cost savings
- Promote 198 hidden-gem brands — high-margin products with very low current sales exposure
- Issue improvement plans to 5 vendors with stock turnover < 1.0 (TRINCHERO, MAJESTIC FINE WINES, CRUSH WINES, LAIRD & CO, SEA BREEZE CELLARS)
- Resolve $9M unsold inventory — urgent SKU-level audit across stores
- Automate the pipeline — monthly DuckDB refresh + Power BI Service for real-time procurement decisions
| Layer | Tool |
|---|---|
| Database | DuckDB |
| Data Wrangling | Python, Pandas, SQL (CTEs) |
| Statistical Testing | SciPy (ttest_ind) |
| Visualization | Matplotlib, Seaborn |
| Dashboard | Microsoft Power BI |
| Environment | Jupyter Notebook, Conda |
1. Clone the repository
git clone https://github.com/your-username/vendor-performance-analysis.git
cd vendor-performance-analysis2. Install dependencies
pip install duckdb pandas numpy matplotlib seaborn scipy openpyxl3. Add your data
Place the 6 CSV files inside a /data folder in the project root.
4. Run notebooks in order
# Step 1 — Ingest CSVs into DuckDB
jupyter notebook Ingestion.ipynb
# Step 2 — EDA + build vendor_sales_summary
jupyter notebook Exploretey_Data_Analysis.ipynb
# Step 3 — Performance analysis + statistical tests
jupyter notebook Vendor_Performance_Analysis.ipynb5. Open the dashboard
Import vendor_sales_summary_final.xls into Power BI Desktop to explore the dashboard.
A full written analysis report is available in /report/Vendor_Performance_Report.docx, covering:
- Business problem definition
- Data architecture & pipeline
- EDA findings with all embedded charts
- Vendor performance tables
- Statistical test results
- 6 prioritized recommendations
Imran Khan Student — Digital Technology and Management (DTM) Ostbayerisch Technisch Hochschule Amberg-Weiden Email: iamkhan7694@gmail.com Linkdin: (https://www.linkedin.com/in/imran-khan-7296681b5/)
This project is for educational and portfolio purposes.