Senior Data Engineer · 6+ Years · Bengaluru, India
Saket Kumar
I build open, cost-efficient data platforms on Snowflake, Databricks, and BigQuery, with a focus on cutting compute cost, query latency, and CI/CD release cycle times. Currently New Relic's founding data engineer on the Product Rating platform.
Open-source projects I ship & maintain.
- 01
dbt-polyglot
Compile-time SQL-dialect transpiler for dbt. Run models authored in Snowflake, BigQuery, Redshift, T-SQL, or DuckDB on Spark / Databricks unchanged — via sqlglot transpilation with a Spark correctness fix-up layer. Battle-tested inside my New Relic Snowflake → lakehouse migration.
- Python
- sqlglot
- dbt-core
- Spark
- PyPI
- 02
lakehouse-lab
A self-authored, laptop-safe Data Engineering production-challenges curriculum. 58 modules across Spark performance (skew, OOM, AQE), Apache Iceberg & Delta Lake correctness, Kafka + Structured Streaming, Debezium CDC, dbt quality with Great Expectations, and Airflow — with a full incident-simulator capstone.
- Spark
- Iceberg
- Delta Lake
- Kafka
- Debezium
- dbt
- Airflow
Data platforms should be fast, cheap, and boring — in that order.
I'm a Senior Data Engineer with 6+ years designing and scaling cloud-native data platforms. I've delivered $45K+ in annual Snowflake credit savings on the flagship ELT pipeline alone and made dbt models 50% faster, and I'm currently leading the migration of 440+ dbt models from Snowflake to a fully open-source lakehouse: Iceberg on S3, Spark Thrift compute, dbt-spark, Project Nessie catalog, and Airflow 3.
I gravitate toward the "unsexy" wins: cost curves, dialect gaps, correctness fixtures, migration safety. I open-source what I'd want to find, not what I'd want to ship. Multi-cloud (AWS · GCP · Azure), GDPR/PII-aware by design.
Off-hours: cosmology, open-source data tooling, and slowly writing more than I read.
Where I've built things.
Senior Data Engineer · New Relic
Founding Member — Product Rating Data (India)
- Open Lakehouse Migration (Snowflake → Iceberg): Leading the migration of the ELT pipeline (440+ dbt models) to a fully open-source lakehouse: Iceberg on S3, Spark Thrift compute, dbt-spark, Project Nessie catalog, Airflow 3. Built automated model-parity validation and a mismatch-spike dashboard for zero-downtime cutover.
- Snowflake Cost & Performance: Refactored critical dbt models — $45.2K annual credit savings, 50% faster queries, 100% data parity.
- CI/CD for Data: Jenkins + GitHub framework (linting, BDD, dbt Cloud triggers, quality gates, Slack) — 85% cut in deployment lead time.
- Zero-Downtime Migration & Monetization Modeling: Migrated a 10 GB/hour billing pipeline from Airflow 1.0 to dbt Cloud + Snowflake. Rating engine supported 15+ new product SKUs (Feb 2025).
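To make the model-parity validation idea above concrete, here is a minimal, stdlib-only sketch of one way to compare a model's output across two platforms: fingerprint each row by key and report missing, extra, and mismatched keys. The function names, column names, and sample rows are all hypothetical, and this is not the validation framework used in the migration.

```python
import hashlib
from typing import Dict, Iterable, List

Row = Dict[str, object]


def row_fingerprint(row: Row, columns: List[str]) -> str:
    """Hash the chosen columns of a row in a fixed column order."""
    payload = "\x1f".join(repr(row.get(col)) for col in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def parity_report(
    source_rows: Iterable[Row],
    target_rows: Iterable[Row],
    key: str,
    columns: List[str],
) -> Dict[str, list]:
    """Compare two row sets by key and list the keys that disagree."""
    source = {r[key]: row_fingerprint(r, columns) for r in source_rows}
    target = {r[key]: row_fingerprint(r, columns) for r in target_rows}
    return {
        "missing_in_target": sorted(k for k in source if k not in target),
        "extra_in_target": sorted(k for k in target if k not in source),
        "mismatched": sorted(
            k for k in source if k in target and source[k] != target[k]
        ),
    }


# Made-up sample rows standing in for the same model on each platform.
snowflake_rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.0},
]
lakehouse_rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.0},
    {"id": 4, "amount": 40.0},
]

report = parity_report(snowflake_rows, lakehouse_rows, key="id", columns=["amount"])
print(report)
```

At real table sizes the same comparison would more likely run as SQL aggregates inside each engine; the in-memory version only shows the shape of the check.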
Data Engineer · Falabella
LATAM e-commerce — Falabella.cl third-party marketplace
- Fast Shipping Tags: Data product on BigQuery/DataProc/Pub-Sub for 4M+ SKUs at 97% accuracy; drove a 50% lift in platform conversion.
- Serverless Ingestion Framework: Cloud Functions + Federated Queries — 80% faster pipeline setup, killed Compute Engine overhead.
- Airflow Observability at Scale: Custom monitoring for 2,000+ DAGs — automated alerts, root-cause analytics, 60% MTTR improvement, 99.9% availability.
Specialist Programmer · Infosys
Clients: Walmart & Five Below
- Enterprise ETL on Databricks & Spark (Walmart): Spark-Scala + PySpark ETL on Databricks and GCP DataProc, GCS → BigQuery with automated data-quality gates. Spark Structured Streaming for near-real-time.
- PII Encryption Framework (Five Below): Cross-org AES-256/PGP framework in Python/PySpark/Scala/Java/PGPy, processing 80 GB+ files for end-to-end PII compliance.
The stack I actually use.
Warehouses & Lakehouses
- Databricks
- Snowflake
- BigQuery
- Delta Lake
- Apache Iceberg
- dbt (Core & Cloud)
Big Data, Streaming & Orchestration
- Apache Spark
- PySpark
- Spark Structured Streaming
- Kafka
- Debezium (CDC)
- Airflow
- ETL / ELT
- Great Expectations
Cloud & Infrastructure
- AWS (Glue, S3, Athena, EMR)
- GCP (BigQuery, DataProc, Pub/Sub, Cloud Functions)
- Azure
- Kubernetes
- Docker
- Terraform
- MongoDB
- Linux
Languages & Frameworks
- Python
- SQL
- NoSQL
- Scala
- Shell
- FastAPI
- Pandas
- Pytest
- REST APIs
DevOps, BI & Governance
- Jenkins
- GitHub Actions
- Prometheus / Grafana
- Looker Studio
- Tableau
- GDPR / PII
- AES-256 / PGP
- Data Quality Gates
Certifications
- dbt Fundamentals · dbt Labs (2024)
- Azure Data Fundamentals DP-900 · Microsoft (2021)
- Deep Learning Specialization · DeepLearning.AI (2020)
- Machine Learning · Andrew Ng, Stanford / Coursera (2019, 95%)
- Python Advanced · Cutshort (2023)
- AWS AI & ML Scholarship · Udacity (2026)
Notes from the field.
- Jul 5, 2026 · 58 broken pipelines you can fix on your laptop · A production-challenges Data Engineering curriculum. Break Spark, Iceberg, Kafka, Debezium, dbt, and Airflow at small scale, watch them fail in the UI, and fix them: 58 modules, laptop-safe.…
- Jul 5, 2026 · A dbt project. Snowflake to Spark. Zero rewrites. · How I built dbt-polyglot, a compile-time SQL-dialect transpiler that lets you migrate a dbt project from Snowflake to Spark without editing a single .sql file.…
The best inbox is a short one.
I'm always up for talking about lakehouses, dbt patterns, migration battle-scars, or interesting job problems. Fastest reach — email or LinkedIn.
- Email kumar.saket0021@gmail.com
- LinkedIn linkedin.com/in/saketkr21
- GitHub github.com/Saketkr21
- Résumé Download PDF