Procter & Gamble

AI Engineering Intern • Demand Forecasting • Feature Selection at Scale

Worked as an AI Engineering Intern at P&G's Cincinnati headquarters, building ML pipelines that analyzed millions of rows of demand data to identify the key business drivers of consumer demand. Reproduced an outsourced ML platform in-house, saving the business over $5 million in vendor costs.

Problem

P&G needed to understand what drives consumer demand across its product portfolio. With millions of rows of sales, pricing, promotion, and market data, manually identifying the key drivers was impractical. The existing solution was outsourced to a third-party ML vendor at significant cost, and the team had no visibility into how the models worked.

The challenge: build an in-house ML platform that could process massive datasets, test multiple modeling approaches, and use advanced feature selection to isolate the variables that actually matter — while proving P&G's internal AI team could match or exceed the outsourced vendor's results.

Approach

Data Processing

Ingested and processed millions of rows of demand data spanning sales volumes, pricing history, promotional calendars, competitive activity, weather patterns, and macroeconomic indicators. Built full ETL pipelines in Azure ML Studio to clean, normalize, and feature-engineer this data into model-ready formats.
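A minimal sketch of this kind of feature preparation in pandas, assuming a hypothetical weekly DataFrame df with date, sku, units, price, and on_promo columns; the production pipeline ran inside Azure ML Studio:

import pandas as pd

# df: one row per SKU per week (hypothetical schema)
df = df.sort_values(["sku", "date"])

# Lag features: demand 1 and 4 weeks back, computed per SKU
for lag in (1, 4):
    df[f"units_lag_{lag}"] = df.groupby("sku")["units"].shift(lag)

# Rolling average: 8-week demand trend, shifted to avoid leaking the current week
df["units_roll_8"] = df.groupby("sku")["units"].transform(
    lambda s: s.shift(1).rolling(8).mean()
)

# Temporal encodings: seasonality signals for the models
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["month"] = df["date"].dt.month

# Interaction term: pricing combined with promotional activity
df["price_x_promo"] = df["price"] * df["on_promo"]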

Modeling Strategy

Tested multiple modeling techniques to find the best approach for demand driver identification (a comparison sketch follows the list):

  • Linear Regression: baseline model for interpretable coefficient analysis
  • Ridge / Lasso: regularized regression for feature importance under multicollinearity
  • XGBoost: gradient-boosted trees for non-linear demand patterns
  • Random Forest: ensemble method for robust feature importance rankings
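A minimal sketch of how this model bake-off could be scored with scikit-learn and XGBoost; the 5-fold CV setup, hyperparameters, and the X/y placeholders are illustrative assumptions:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "xgboost": XGBRegressor(n_estimators=300, learning_rate=0.1),
    "random_forest": RandomForestRegressor(n_estimators=300, n_jobs=-1),
}

# X, y: placeholder feature matrix and demand target
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.2f} (+/- {scores.std():.2f})")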

Feature Selection with Boruta

Used the Boruta algorithm — a wrapper around Random Forest that creates "shadow features" (shuffled copies of each variable) and iteratively tests whether each real feature performs significantly better than its random counterpart. This is one of the most rigorous methods for identifying truly important features vs. noise.
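The workflow below is shown as a runnable sketch using the open-source BorutaPy package (the Python Boruta implementation from the stack); X is a placeholder feature DataFrame and y the demand target.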

# Boruta feature selection workflow:
# 1. Create shadow features (random permutations of each real feature)
# 2. Train a Random Forest on real + shadow features
# 3. Compare each feature's importance to the max shadow importance
# 4. Mark features as Confirmed / Tentative / Rejected
# 5. Iterate until all features are classified
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_jobs=-1, max_depth=5)  # shallow trees, per BorutaPy guidance
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(X.values, y.values)  # BorutaPy expects NumPy arrays, not DataFrames

confirmed = X.columns[selector.support_]       # Confirmed drivers
tentative = X.columns[selector.support_weak_]  # Tentative, borderline features
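Confirmed features feed the downstream models; tentative ones sit near the shadow-importance threshold and can be kept or dropped depending on tolerance for missing a real driver.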

Technical Stack

  • ML Platform: Azure ML Studio for end-to-end pipeline management (ETL, training, testing, simulation)
  • Feature Selection: Boruta algorithm for rigorous feature importance (shadow features, statistical testing, iterative classification)
  • Modeling: scikit-learn for regression and classification, XGBoost for gradient boosting, ensemble methods
  • Data Engineering: Azure Data Explorer clusters with the Python SDK for large-scale data ingestion and querying (see the sketch below)
  • Languages & Libraries: Python with pandas, NumPy, scikit-learn, XGBoost, Boruta, and matplotlib for visualization
  • Scale: millions of rows of demand data across sales, pricing, promotions, and competitive and macro indicators
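A minimal sketch of pulling rows from an Azure Data Explorer (Kusto) cluster with the azure-kusto-data SDK; the cluster URL, database name, and KQL query are hypothetical placeholders:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.helpers import dataframe_from_result_table

# Hypothetical cluster and database names
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://demandcluster.kusto.windows.net"
)
client = KustoClient(kcsb)

# Illustrative KQL query pulling the last year of weekly demand rows
response = client.execute("DemandDB", "DemandData | where Week >= ago(365d)")
df = dataframe_from_result_table(response.primary_results[0])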

ML Pipeline

  1. Data Ingestion: millions of rows ingested from Azure Data Explorer into the ML pipeline
  2. Feature Engineering: created lag features, rolling averages, interaction terms, and temporal encodings
  3. Feature Selection: Boruta + correlation analysis to reduce hundreds of candidate features to the most significant drivers
  4. Model Training: multiple regression and tree-based models trained with cross-validation
  5. Model Optimization: improved XGBoost test accuracy by 20% and reduced training time by 13% (see the tuning sketch after this list)
  6. Simulation & Deployment: scenario modeling for demand forecasting and business driver analysis
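A minimal sketch of the kind of tuning behind step 5, using scikit-learn's RandomizedSearchCV over an XGBoost regressor; the search space, CV settings, and placeholders are illustrative assumptions, not the actual tuned values:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Illustrative search space; the real grid and winning values are not shown here
param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    XGBRegressor(tree_method="hist", n_jobs=-1),  # histogram trees also cut training time
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_selected, y)  # X_selected: Boruta-confirmed features (placeholder)
print(search.best_params_, -search.best_score_)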

Results

  • $5M+ in cost savings
  • 20% accuracy improvement
  • 13% faster model training
  • Millions of rows processed

Impact

  • Reproduced an outsourced ML platform in-house, saving P&G over $5 million in vendor costs and proving internal AI capabilities
  • Used Boruta feature selection to identify the key demand drivers from hundreds of candidate variables across millions of rows
  • Improved existing XGBoost model accuracy by 20% through hyperparameter tuning and better feature engineering
  • Reduced model training time by 13% through pipeline optimization and efficient data preprocessing
  • Set a new standard for ML projects at P&G — proved the viability of in-house AI and established patterns for future teams