Reliability-focused SRE who believes in moving fast on stable infrastructure. Built production infrastructure from scratch as sole engineer. Scaled systems to 300K+ RPS while maintaining 99.99% uptime. Developed collectors and alerting that reduced MTTD from 2 hours to 1 minute. Delivered over half a million dollars in cost savings through automation and cloud optimization. Experienced with both cloud and bare-metal infrastructure, including GPU clusters. When something breaks, I fix it, whether that means debugging application code, redesigning architecture, or implementing security. Comfortable with early-stage ambiguity and wearing multiple hats to deliver what the business needs.
♦ Migrated observability signals (logs, metrics, traces) to unified Grafana Cloud, reducing costs by 40%
♦ Built Kubernetes controller to auto-inject StatsD proxy for seamless Datadog migration
• Established AWS-GCP cloud interconnectivity for cross-cloud workloads
• Developed custom OpenTelemetry Collector using Native Histograms to improve precision and reduce cardinality
• Migrated autoscaling to KEDA-based solution
• Implemented eBPF-based auto-monitoring
♦ Achieved 30% YoY incident reduction by identifying patterns in post-mortems and implementing permanent solutions
♦ Built metrics collectors (Go/Rust/Python), reducing MongoDB MTTD from 2 hours to 1 minute
• Scaled infrastructure from thousands to 300K+ RPS using NGINX, Linkerd, and APISIX
• Deployed Temporal clusters for workflow orchestration, replacing Databricks and significantly reducing costs
• Implemented distributed tracing with OpenTelemetry across multiple K8s clusters (EKS/AKS)
♦ As sole infrastructure engineer, delivered production, staging, and dev environments in 3 months
• Built CI/CD pipeline with Tekton and a self-service ML platform on Kubeflow
• Reduced cluster provisioning from 4 hours to 30 minutes using RHACM
• Architected observability stack with Jaeger/OpenTelemetry and Istio service mesh
♦ Built bare-metal GPU clusters with Nomad scheduling for protein research teams
♦ Implemented analytics platform with PrestoDB and Metabase, reducing query time by 75%
• Cut developer onboarding time by 50% through internal tooling
• Achieved 99.9% PostgreSQL uptime with HA and custom monitoring
• Maintained HIPAA compliance for healthcare data