I build multimodal ML systems that run at internet scale.
Founding member of the team behind IAS's $150M trust & safety classification suite, detecting low-prevalence unsafe content across 3T impressions/day. Patented.
Real systems running in production at scale.
Built the core Trust & Safety classification product from zero. Detects hard-to-catch, low-prevalence unsafe content across TikTok, YouTube, Meta, and Twitter. 3 trillion impressions/day at 34% YoY growth. Deployed across 23 countries. Patented.
Built an internal Python package for end-to-end multimodal labeling. Vector search finds relevant content, LLM+VLM classifies it, human A/B testing validates, and DSPy optimizes prompts in a continuous RLAIF loop. Cut labeling cost by 99.5%.
Led PEFT/LoRA adoption for deepfake detection models, slashing compute by ~80% and boosting experiment throughput 3x. No performance degradation.
Directed end-to-end development of multimodal and DistilBERT-based models extended to classify longer token sequences. Catches low-prevalence misinformation, the kind of content that evades simple filters, across 22M+ videos per day in production.
Built a pseudo-labeling pipeline using SigLIP and Vicuna that cut ML development costs by 90%+ across 49 video categories. Separately led synthetic data generation with GPT-Neo that boosted Brand Safety precision by 49%.
Built custom translation models with OpenNMT for 42 language pairs. Translates 75 trillion characters annually while reducing costs by 67% for the core contextual classification pipeline.
Built a causal inference pipeline ingesting 8 billion ad impressions per day to measure real cause-and-effect on conversion rates for ad campaigns using Bayesian methods.
Built predictive models and churn strategies that identified customer behavior patterns and converted 18% of one-time buyers into repeat customers.
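The PEFT/LoRA work above rests on a simple idea: instead of fine-tuning a full pretrained weight matrix, freeze it and train only a low-rank update alongside it. A minimal NumPy sketch of the adapter math, not the production code (real projects would use a library such as Hugging Face PEFT; all names here are hypothetical):

```python
import numpy as np

class LoRALinear:
    """Low-rank adapter around a frozen weight matrix.

    Rather than updating the full d_out x d_in matrix W, we train only two
    small matrices A (r x d_in) and B (d_out x r), so the effective weight
    is W + (alpha / r) * B @ A. With r << d, the trainable parameter count
    collapses, which is where the compute savings come from.
    """

    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                           # frozen pretrained weight
        self.A = rng.normal(0, 0.01, size=(r, W.shape[1]))   # trainable, small random init
        self.B = np.zeros((W.shape[0], r))                   # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        # Base path uses the frozen weight; adapter path adds the low-rank update.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_params(self):
        return self.A.size + self.B.size

# Because B starts at zero, the adapter is an exact no-op at initialization,
# so training begins from the pretrained model's behavior.
W = np.random.default_rng(1).normal(size=(768, 768))
layer = LoRALinear(W, r=8)
x = np.ones(768)
assert np.allclose(layer.forward(x), W @ x)
print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

At rank 8 on a 768x768 layer, only about 2% of the weights are trainable, which is the mechanism behind the large compute reduction the bullet describes.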
We classify extremely low-prevalence, nuanced categories of unsafe content in social media video. Previously, building a new classifier meant weeks of work: sampling the right balance of rare positives from production data that is overwhelmingly safe content, designing experiments to extract signal, then paying for expensive, time-consuming human annotation of subjective user-generated video. Each iteration was slow, costly, and hard to scale.
An internal Python package that closes the entire loop. It finds relevant content in production via vector search, accepts custom prompts (versioned in GitHub), and processes any modality. For video, it extracts keyframes, deduplicates frames, then sends everything to a cost-optimized LLM + VLM system for multimodal fusion. The output includes multilabel classifications, topic/subtopic stratification for comprehensive dataset sampling, and artifacts ready for model training or human review.
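The keyframe deduplication step can be illustrated with a perceptual-hash sketch. This is a simplified stand-in, not the package's actual implementation: downsample each frame, threshold at the mean to get a bit signature, and drop frames whose signatures are within a few bits of one already kept.

```python
import numpy as np

def average_hash(frame, hash_size=8):
    """Perceptual hash: box-downsample a grayscale frame to hash_size x hash_size,
    then threshold each cell at the overall mean to get a bit vector."""
    h, w = frame.shape
    # Crude box downsampling (assumes dimensions divide evenly by hash_size).
    small = frame.reshape(hash_size, h // hash_size,
                          hash_size, w // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def dedupe_keyframes(frames, max_distance=5):
    """Keep a frame only if its hash differs from every kept frame's hash
    by more than max_distance bits (Hamming distance)."""
    kept, hashes = [], []
    for i, frame in enumerate(frames):
        sig = average_hash(frame)
        if all(np.count_nonzero(sig != k) > max_distance for k in hashes):
            kept.append(i)
            hashes.append(sig)
    return kept

# Two near-identical frames plus one unrelated frame.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(64, 64)).astype(float)
b = a + rng.normal(0, 0.1, size=(64, 64))              # near-duplicate of a
c = rng.integers(0, 256, size=(64, 64)).astype(float)  # unrelated frame
print(dedupe_keyframes([a, b, c]))
```

Dropping near-duplicate frames before the LLM + VLM stage matters because inference cost scales with the number of frames sent, and adjacent video keyframes are often visually redundant.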
Human reviewers validate AI labels through a custom A/B testing UI. Their responses feed back into the package, where DSPy and GEPA optimize the original prompts automatically. This is essentially an RLAIF pipeline. The LLM-as-judge acts as a reward model, with engineered reward functions calibrated against human baselines to prevent reward hacking. The loop runs continuously: label, validate, optimize, retrain.
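The anti-reward-hacking gate in that loop can be sketched in a few lines. Everything below is a hypothetical simplification: the `label_fn` and `propose_fn` stubs stand in for the real LLM calls and the DSPy/GEPA optimizer, and the key idea is that a candidate prompt is adopted only if it improves agreement with human labels, not merely the judge's own reward.

```python
def human_agreement(ai_labels, human_labels):
    """Fraction of items where the AI label matches the human A/B review."""
    return sum(a == h for a, h in zip(ai_labels, human_labels)) / len(human_labels)

def rlaif_step(prompt, items, human_labels, label_fn, propose_fn, margin=0.02):
    """One turn of the label -> validate -> optimize loop.

    The optimizer's candidate prompt is accepted only if it beats the current
    prompt on human-validated agreement by at least `margin`. Calibrating
    against the human baseline is what keeps the LLM-as-judge reward from
    being gamed.
    """
    baseline = human_agreement([label_fn(prompt, x) for x in items], human_labels)
    candidate = propose_fn(prompt)
    score = human_agreement([label_fn(candidate, x) for x in items], human_labels)
    if score >= baseline + margin:
        return candidate, score    # adopt the improved prompt
    return prompt, baseline        # keep the current prompt

# Toy stubs: prompt "v2" labels every item correctly, "v1" always answers 1.
human = [1, 0, 1, 1]
def label_fn(prompt, x):
    return x if prompt == "v2" else 1
def propose_fn(prompt):
    return "v2"

prompt, score = rlaif_step("v1", [1, 0, 1, 1], human, label_fn, propose_fn)
```

Gating on human agreement rather than judge score is the design choice that makes a continuous, mostly-automated loop safe to run: the judge can drift, but a drifting judge cannot promote a prompt past the human-calibrated bar.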
Not a laundry list. Every skill tied to something I shipped.
Open to conversations about ML, data science leadership, and impactful opportunities.