Procter & Gamble
AI Engineering Intern • Demand Forecasting • Feature Selection at Scale
Worked as an AI Engineering Intern at P&G's Cincinnati headquarters, building ML pipelines that analyzed millions of rows of demand data to identify the key business drivers of consumer demand. Rebuilt an outsourced ML platform in-house, saving the business over $5 million in vendor costs.
Problem
P&G needed to understand what drives consumer demand across its product portfolio. With millions of rows of sales, pricing, promotion, and market data, manually identifying the key drivers was infeasible. The existing solution was outsourced to a third-party ML vendor at significant cost, and the team had no visibility into how the models worked.
The challenge: build an in-house ML platform that could process massive datasets, test multiple modeling approaches, and use advanced feature selection to isolate the variables that actually matter — while proving P&G's internal AI team could match or exceed the outsourced vendor's results.
Approach
Data Processing
Ingested and processed millions of rows of demand data spanning sales volumes, pricing history, promotional calendars, competitive activity, weather patterns, and macroeconomic indicators. Built full ETL pipelines in Azure ML Studio to clean, normalize, and feature-engineer this data into model-ready formats.
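A minimal sketch of the clean-and-normalize step with pandas. The column names (`units_sold`, `avg_price`, `on_promo`) are illustrative, not P&G's actual schema, and the real pipelines ran inside Azure ML Studio rather than a local script.

```python
# Minimal ETL sketch with hypothetical columns: fill gaps in the demand
# series and z-score the continuous drivers into a model-ready frame.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "week": pd.date_range("2022-01-03", periods=6, freq="W-MON"),
    "units_sold": [120.0, np.nan, 150.0, 140.0, 160.0, 155.0],
    "avg_price": [4.99, 4.99, 4.49, 4.49, 4.99, 4.99],
    "on_promo": [0, 0, 1, 1, 0, 0],
})

# Clean: interpolate missing volumes instead of dropping whole weeks
raw["units_sold"] = raw["units_sold"].interpolate()

# Normalize: z-score continuous drivers so models see comparable scales
for col in ["units_sold", "avg_price"]:
    raw[col + "_z"] = (raw[col] - raw[col].mean()) / raw[col].std()

print(raw[["week", "units_sold", "avg_price_z", "on_promo"]].head())
```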
Modeling Strategy
Tested multiple modeling techniques to find the best approach for demand driver identification:
Linear Regression
Baseline model for interpretable coefficient analysis
Ridge / Lasso
Regularized regression for feature importance with multicollinearity
XGBoost
Gradient-boosted trees for non-linear demand patterns
Random Forest
Ensemble method for robust feature importance rankings
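The bake-off above can be sketched with scikit-learn on synthetic data standing in for the demand table. `GradientBoostingRegressor` fills the XGBoost slot here so the example stays dependency-light; `xgboost.XGBRegressor` would drop in the same way.

```python
# Compare linear, regularized, and tree-based models under the same
# cross-validation protocol (synthetic stand-in for the demand data).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),       # interpretable coefficient baseline
    "ridge": Ridge(alpha=1.0),          # L2 handles multicollinearity
    "lasso": Lasso(alpha=0.1),          # L1 shrinks noise coefficients to 0
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

results = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
           for name, m in models.items()}
for name, r2 in results.items():
    print(f"{name:14s} mean CV R^2 = {r2:.3f}")
```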
Feature Selection with Boruta
Used the Boruta algorithm — a wrapper around Random Forest that creates "shadow features" (shuffled copies of each variable) and iteratively tests whether each real feature performs significantly better than its randomized counterpart. Because every feature must beat the best-performing shadow to survive, Boruta is a rigorous way to separate genuinely predictive variables from noise.
# Boruta feature selection workflow
1. Create shadow features (random permutations)
2. Train Random Forest on real + shadow features
3. Compare each feature's importance to max shadow importance
4. Mark features as Confirmed / Tentative / Rejected
5. Iterate until all features are classified
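A single iteration of the loop above can be hand-rolled in a few lines to show the core idea; the production workflow used the Boruta package (`BorutaPy`), which repeats this comparison many times and applies a statistical test before confirming or rejecting a feature.

```python
# One Boruta-style iteration, hand-rolled: permute each column to make
# shadow features, fit a Random Forest on real + shadow columns, and
# keep the real features that beat the best shadow's importance.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# 1. Shadow features: column-wise permutations destroy any real signal
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

# 2. Train Random Forest on real + shadow features
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_aug, y)

# 3. Compare each real feature's importance to the max shadow importance
n = X.shape[1]
real_imp = rf.feature_importances_[:n]
max_shadow = rf.feature_importances_[n:].max()

# 4. "Confirmed" here means clearly above the shadow ceiling; full Boruta
#    repeats steps 1-4 and classifies Confirmed / Tentative / Rejected
confirmed = np.where(real_imp > max_shadow)[0]
print("confirmed feature indices:", confirmed)
```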
Technical Stack
ML Platform
Azure ML Studio for end-to-end pipeline management — ETL, training, testing, simulation
Feature Selection
Boruta algorithm for rigorous feature importance — shadow features, statistical testing, iterative classification
Modeling
scikit-learn for regression and classification, XGBoost for gradient boosting, ensemble methods
Data Engineering
Azure Data Explorer clusters with Python SDK for large-scale data ingestion and querying
Languages
Python with pandas, NumPy, scikit-learn, XGBoost, Boruta, matplotlib for visualization
Scale
Millions of rows of demand data across sales, pricing, promotions, competitive and macro indicators
ML Pipeline
Data Ingestion
Millions of rows ingested from Azure Data Explorer into the ML pipeline
Feature Engineering
Created lag features, rolling averages, interaction terms, and temporal encodings
Feature Selection
Boruta + correlation analysis to reduce hundreds of features to the most significant drivers
Model Training
Multiple regression and tree-based models trained with cross-validation
Model Optimization
Improved XGBoost test accuracy by 20% and reduced training time by 13%
Simulation & Deployment
Scenario modeling for demand forecasting and business driver analysis
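The feature-engineering step in the pipeline above can be illustrated with pandas. The columns (`units`, `price`, `promo`) and lag windows are hypothetical examples, not the production schema.

```python
# Time-series feature engineering sketch: lags, rolling averages, an
# interaction term, and a cyclical temporal encoding of week-of-year.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "week": pd.date_range("2023-01-02", periods=10, freq="W-MON"),
    "units": [100, 110, 95, 130, 125, 140, 120, 150, 145, 160],
    "price": [5.0, 5.0, 4.5, 4.5, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0],
    "promo": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})

df["units_lag1"] = df["units"].shift(1)            # last week's demand
df["units_lag4"] = df["units"].shift(4)            # four weeks back
df["units_roll4"] = df["units"].rolling(4).mean()  # rolling average
df["price_x_promo"] = df["price"] * df["promo"]    # interaction term

# Cyclical encoding so week 52 sits next to week 1 in feature space
woy = df["week"].dt.isocalendar().week.astype(float)
df["woy_sin"] = np.sin(2 * np.pi * woy / 52)
df["woy_cos"] = np.cos(2 * np.pi * woy / 52)

model_ready = df.dropna()  # drop warm-up rows introduced by lags/rolling
print(model_ready.shape)
```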
Results
$5M+
Cost Savings
20%
Accuracy Improvement
13%
Faster Training
Millions
Rows Processed
Impact
- Reproduced an outsourced ML platform in-house, saving P&G over $5 million in vendor costs and proving internal AI capabilities
- Used Boruta feature selection to identify the key demand drivers from hundreds of candidate variables across millions of rows
- Improved existing XGBoost model accuracy by 20% through hyperparameter tuning and better feature engineering
- Reduced model training time by 13% through pipeline optimization and efficient data preprocessing
- Set a new standard for ML projects at P&G — proved the viability of in-house AI and established patterns for future teams
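A hedged sketch of the kind of hyperparameter search behind the accuracy gain. scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example is self-contained; `xgboost.XGBRegressor` plugs into `GridSearchCV` identically.

```python
# Cross-validated grid search over a small boosting hyperparameter grid
# (synthetic data; the grid values are illustrative, not the ones used).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=15, noise=8.0,
                       random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 200],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    cv=3,
    scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```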