Infrastructure Development Engineer (24043) Austin, Texas
Salary: USD 60 - USD 70 per hour
W2 Contract
Salary Range: $124,800 - $145,600 per year
Location: Austin, TX - Remote Role
Duties and Responsibilities:
Platform Reliability & Operations
- Own end-to-end reliability for our AI Agent Platform across all environments (Dev, Staging, Production).
- Maintain and optimize EKS clusters, databases, and LangGraph/LangSmith environments.
- Implement and manage proactive monitoring, alerting, and tracing systems across platform components.
- Drive root-cause analysis (RCA) and implement automation that prevents incidents from recurring.
Observability & Tooling
- Deliver a unified observability strategy across services using logging, metrics, and distributed tracing.
- Lead the migration from DataDog to Mosaic for dashboards and alerting.
- Develop self-healing automation and smoke tests to validate post-deployment system health.
- Ensure visibility into latency, availability, and error budgets (SLOs/SLIs).
Support & Incident Management
- Own the AI platform Support Channel: triage issues, answer platform questions, and guide onboarding.
- Provide L1/L2 triage during business hours; coordinate after-hours escalation with the core team.
- Establish structured runbooks, escalation policies, and post-incident review processes.
Deployment & Environment Consistency
- Standardize infrastructure and CI/CD practices across environments.
- Partner with platform and ML engineers to streamline release pipelines, security policies, and service configurations.
- Ensure consistent rollout of new features and agent services with minimal downtime.
Automation & Continuous Improvement
- Develop Python or Go utilities to automate deployment, monitoring, and maintenance tasks.
- Build tooling for alert correlation, system diagnostics, and capacity forecasting.
- Continuously evaluate new tools and frameworks to improve operational efficiency.
Requirements and Qualifications:
- 4+ years of experience as an SRE, DevOps Engineer, or Platform Engineer in cloud environments.
- Deep expertise with Kubernetes (EKS/GKE), CI/CD pipelines, and infrastructure automation.
- Proficiency with observability tools such as Grafana, Prometheus, DataDog, Splunk, or OpenTelemetry.
- Experience in at least one modern programming language (Python, Go, or Rust).
- Strong understanding of incident management, SLAs/SLOs, and post-mortem practices.
- Excellent communication and collaboration skills; ability to work across platform, AI, and data teams.
Preferred Qualifications:
- Experience operating AI/ML workloads (LangGraph, LangChain, or distributed compute systems like Ray).
- Familiarity with LLM-based infrastructure and AI observability tooling.
- Prior experience with managed-service transitions or vendor-to-product operating model shifts.
- Exposure to Azure or AWS cloud ecosystems, Terraform, and GitOps workflows (ArgoCD/Flux).
Bayside Solutions, Inc. is not able to sponsor any candidates at this time. Additionally, candidates for this position must qualify as a W2 candidate.
Bayside Solutions, Inc. may collect your personal information during the position application process. Please reference Bayside Solutions, Inc.'s CCPA Privacy Policy at www.baysidesolutions.com.