Infrastructure Development Engineer (24043) Austin, Texas

Salary: USD60 - USD70 per hour

Infrastructure Development Engineer

W2 Contract

Salary Range: $124,800 - $145,600 per year

Location: Austin, TX - Remote Role

Duties and Responsibilities:

Platform Reliability & Operations

  • Own end-to-end reliability for our AI Agent Platform across all environments (Dev, Staging, Production).
  • Maintain and optimize EKS clusters, databases, and LangGraph/LangSmith environments.
  • Implement and manage proactive monitoring, alerting, and tracing systems across platform components.
  • Drive root-cause analysis (RCA) and implement incident prevention automations.

Observability & Tooling

  • Deliver a unified observability strategy across services using logging, metrics, and distributed tracing.
  • Lead the migration from Datadog to Mosaic for dashboards and alerting.
  • Develop self-healing automation and smoke tests to validate post-deployment system health.
  • Ensure visibility into latency, availability, and error budgets (SLOs/SLIs).

Support & Incident Management

  • Own the AI platform support channel: triage issues, answer platform questions, and guide onboarding.
  • Provide L1/L2 triage during business hours; coordinate after-hours escalation with the core team.
  • Establish structured runbooks, escalation policies, and post-incident review processes.

Deployment & Environment Consistency

  • Standardize infrastructure and CI/CD practices across environments.
  • Partner with platform and ML engineers to streamline release pipelines, security policies, and service configurations.
  • Ensure consistent rollout of new features and agent services with minimal downtime.

Automation & Continuous Improvement

  • Develop Python or Go utilities to automate deployment, monitoring, and maintenance tasks.
  • Build tooling for alert correlation, system diagnostics, and capacity forecasting.
  • Continuously evaluate new tools and frameworks to improve operational efficiency.

Requirements and Qualifications:

  • 4+ years of experience as an SRE, DevOps Engineer, or Platform Engineer in cloud environments.
  • Deep expertise with Kubernetes (EKS/GKE), CI/CD pipelines, and infrastructure automation.
  • Proficiency with observability tools such as Grafana, Prometheus, Datadog, Splunk, or OpenTelemetry.
  • Experience in at least one modern programming language (Python, Go, or Rust).
  • Strong understanding of incident management, SLAs/SLOs, and post-mortem practices.
  • Excellent communication and collaboration skills; ability to work across platform, AI, and data teams.

Preferred Qualifications:

  • Experience operating AI/ML workloads (LangGraph, LangChain, or distributed compute systems like Ray).
  • Familiarity with LLM-based infrastructure and AI observability tooling.
  • Prior experience in managed service transitions or vendor-to-product operating model shifts.
  • Exposure to Azure or AWS cloud ecosystems, Terraform, and GitOps workflows (ArgoCD/Flux).

 

Bayside Solutions, Inc. is not able to sponsor any candidates at this time. Additionally, candidates for this position must qualify as W2 candidates.

Bayside Solutions, Inc. may collect your personal information during the position application process. Please reference Bayside Solutions, Inc.'s CCPA Privacy Policy at www.baysidesolutions.com.
