Pablo Opazo

Site Reliability Engineer

Bachelor of Engineering (B.E.) Computer Science

Summary

SRE focused on velocity without sacrificing reliability. Built production infrastructure from scratch as the sole engineer and scaled systems to 300K+ RPS with 99.9% uptime. Developed observability collectors and alerting that reduced MTTD from 2 hours to 1 minute. Delivered $500K+ in cost savings through automation and cloud optimization. Experienced across cloud and bare-metal infrastructure, including GPU clusters for ML workloads.

Areas of Expertise

  • Platform Engineering
  • Infrastructure Automation
  • Database Administration
  • Security & Compliance
  • Incident Management
  • Root Cause Analysis
  • Observability
  • Capacity Planning
  • Technical Leadership
  • FinOps & Cost Management
  • Technical Documentation
  • System Design

Professional Experience

Harness.io • United States (Remote)

Modern software delivery platform that enables CI, CD, feature flags, and cloud cost management at enterprise scale.

Principal Site Reliability Engineer

2024 - Present

Led infrastructure integration following Harness's acquisition of Split, consolidating observability and cutting costs by 40%. Senior technical leader within a 7-engineer SRE team enabling 30+ engineers to ship reliably.
Key Accomplishments:

• Architected in-house observability solution to replace all existing observability tools with a unified store
• Migrated observability stack (logs, metrics, traces) to Grafana Cloud, reducing costs by 40%; built custom OTel collector with Native Histograms to reduce cardinality
• Built K8s controller to auto-inject StatsD proxy for Datadog migration; implemented eBPF-based auto-monitoring
• Established AWS-GCP cloud interconnectivity for cross-cloud workloads
• Improved infrastructure resilience: migrated autoscaling to KEDA; introduced Spegel for P2P image distribution; led disaster recovery exercises
• Defined SLOs/SLIs for critical services; managed error budgets and led incident response and resolution
• Mentored engineers through spec reviews; conducted interviews to build the SRE team

Split.io (Acquired by Harness.io) • United States (Remote)

Split is a feature delivery platform that powers feature flag management, software experimentation, and continuous delivery.

Staff Site Reliability Engineer

2021 - 2024

Led reliability and observability initiatives across tracing, metrics, and performance analysis. Supported 50+ engineers across product and platform teams within a 5-engineer SRE team.
Key Accomplishments:

• Defined SLOs/SLIs and managed error budgets, achieving 30% YoY incident reduction through pattern analysis and permanent fixes
• Built metrics collectors (Go/Rust/Python), reducing MongoDB MTTD from 2 hours to 1 minute
• Scaled infrastructure from thousands to 300K+ RPS using NGINX, Linkerd, and APISIX
• Partnered with Experimentation team to deploy Temporal clusters for workflow orchestration, replacing Databricks with significant cost savings
• Implemented distributed tracing with OpenTelemetry across multiple K8s clusters (EKS/AKS)
• Designed on-call rotations and escalation policies

Science (Sequoia Capital) • United States (Remote)

A healthcare startup based in Miami that focused on creating AI tools to help doctors and clinics reduce operational complexity.

Lead DevOps Engineer

2020 - 2021

Built the entire infrastructure from scratch. Integrated Kafka, CockroachDB, and Elasticsearch on Kubernetes/OpenShift using the operator pattern.
Key Accomplishments:

• Sole infrastructure engineer supporting 5 developers: delivered production, staging, and dev environments in 3 months
• Built CI/CD pipeline with Tekton; implemented self-service ML platform on Kubeflow enabling Data Science team to run Python notebooks independently
• Reduced cluster provisioning from 4 hours to 30 minutes using RHACM
• Architected observability stack with Jaeger/OpenTelemetry and Istio service mesh

uBiome (Y Combinator S14) • United States (Remote)

A biotechnology company based in San Francisco that developed technology to sequence the human microbiome.

Production Engineer - Technical Lead

2016 - 2019

Led a 4-engineer team building an SDLC platform using Kubernetes, Drone CI, and Spinnaker. Implemented cost automation across AWS, GCP, and bare metal.
Key Accomplishments:

• Built bare-metal GPU clusters with Nomad scheduling; created self-service workflows enabling scientists to run protein research workloads without deep technical knowledge
• Implemented analytics platform with PrestoDB and Metabase, reducing query time by 75%
• Cut developer onboarding time by 50% through internal tooling
• Achieved 99.9% PostgreSQL uptime with HA and custom monitoring
• Maintained HIPAA compliance for healthcare data

Education

Computer Science - Undergraduate Studies, 2019

Pontificia Universidad Católica de Chile (PUC)

Bachelor of Engineering (B.E.) Computer Science, 2012

Universidad Tecnológica de Chile INACAP

Technical Stack

Cloud
  • AWS
  • GCP
  • Azure
  • OCI
Orchestration
  • Nomad
  • Kubernetes
  • Rancher
  • OpenShift
OS
  • SmartOS
  • RHEL
  • CoreOS
  • Ubuntu
Databases
  • PostgreSQL
  • MongoDB
  • CockroachDB
  • ClickHouse
Messaging
  • RabbitMQ
  • Redis
  • Kafka
  • AutoMQ
Observability
  • Prometheus
  • CloudWatch
  • VictoriaMetrics
  • Loki
IaC
  • Salt
  • Ansible
  • Terraform
  • Crossplane
CI/CD
  • ArgoCD
  • GitHub Actions
  • Tekton
  • Harness
Secrets
  • Vault
  • AWS Secrets Manager
  • Azure Key Vault
  • External Secrets Operator
Languages
  • Bash
  • Python
  • Go
  • Rust