SRE focused on velocity without sacrificing reliability. Built production infrastructure from scratch as sole engineer, scaling systems to 300K+ RPS with 99.9% uptime. Developed observability collectors and alerting that reduced MTTD from 2 hours to 1 minute. Delivered $500K+ in cost savings through automation and cloud optimization. Experienced across cloud and bare-metal infrastructure, including GPU clusters for ML workloads.
• Architected in-house observability solution to replace all tools with a unified store
• Migrated observability stack (logs, metrics, traces) to Grafana Cloud, reducing costs 40%; built custom OTEL collector with Native Histograms to reduce cardinality
• Built K8s controller to auto-inject StatsD proxy for Datadog migration; implemented eBPF-based auto-monitoring
• Established AWS-GCP cloud interconnectivity for cross-cloud workloads
• Improved infrastructure resilience: migrated autoscaling to KEDA; introduced Spegel for P2P image distribution; led disaster recovery exercises
• Defined SLOs/SLIs for critical services; managed error budgets and led incident response and resolution
• Mentored engineers through spec reviews; conducted interviews to build the SRE team
• Defined SLOs/SLIs and managed error budgets, achieving 30% YoY incident reduction through pattern analysis and permanent fixes
• Built metrics collectors (Go/Rust/Python), reducing MongoDB MTTD from 2 hours to 1 minute
• Scaled infrastructure from thousands to 300K+ RPS using NGINX, Linkerd, and APISIX
• Partnered with Experimentation team to deploy Temporal clusters for workflow orchestration, replacing Databricks with significant cost savings
• Implemented distributed tracing with OpenTelemetry across multiple K8s clusters (EKS/AKS); designed on-call rotations and escalation policies
• Sole infrastructure engineer supporting 5 developers: delivered production, staging, and dev environments in 3 months
• Built CI/CD pipeline with Tekton; implemented self-service ML platform on Kubeflow enabling Data Science team to run Python notebooks independently
• Reduced cluster provisioning from 4 hours to 30 minutes using RHACM
• Architected observability stack with Jaeger/OpenTelemetry and Istio service mesh
• Built bare-metal GPU clusters with Nomad scheduling; created self-service workflows enabling scientists to run protein research workloads without deep technical knowledge
• Implemented analytics platform with PrestoDB and Metabase, reducing query time by 75%
• Cut developer onboarding time by 50% through internal tooling
• Achieved 99.9% PostgreSQL uptime with HA and custom monitoring
• Maintained HIPAA compliance for healthcare data