AI Infrastructure Engineer

kadence
United States
Full-time
AI tools: AWS, Terraform

Senior AI Platform / LLM Agent Infrastructure Engineer

Location: Remote or Hybrid (US-friendly)

Employment Type: Full-time

Experience Level: 5+ years (strong production ownership required)

About the Company

We’re partnering with a well-funded, early-stage technology company building production-grade, agent-driven AI systems that automate complex, real-world workflows. The team is highly technical, execution-focused, and building software where AI agents operate reliably at scale, not just in demos.

This is an opportunity to join a small, senior team solving hard infrastructure and reliability problems at the intersection of cloud systems, DevOps, and applied AI.

The Role

We are seeking a Senior AI Platform / LLM Agent Infrastructure Engineer to own the deployment, scalability, and operational reliability of AI/LLM agents in production environments.

This role is focused on getting AI agents into production and keeping them there. You will be responsible for designing and operating the infrastructure that supports agent-based systems handling real customer workloads at scale.

You’ll work closely with AI engineers and product leadership to ensure agent systems are observable, resilient, cost-efficient, and performant across evolving use cases.

What You’ll Do

* Own the production deployment lifecycle of AI/LLM agents from launch through long-term operation

* Design and implement cloud-native infrastructure to support agent-based systems at scale

* Deploy, operate, and monitor AI agents handling real-world, customer-facing workloads

* Build and maintain CI/CD pipelines, deployment workflows, and infrastructure-as-code

* Implement monitoring, logging, alerting, and observability for agent behavior, failures, latency, and cost

* Optimize systems for reliability, performance, and cost efficiency across LLM providers

* Partner closely with AI engineers to harden agent architectures for production usage

* Troubleshoot complex production issues across infrastructure, models, and system integrations

* Establish best practices for secure, scalable, and maintainable AI systems

What We’re Looking For

* Strong experience deploying and operating production cloud systems

* Proven background in DevOps, platform engineering, or infrastructure roles, ideally supporting ML or AI workloads

* Hands-on experience deploying AI/LLM-powered applications or agents into production

* Experience running production systems at scale in a startup or large technology environment

* Deep familiarity with AWS infrastructure (or equivalent cloud platforms)

* Strong programming skills in Python and/or TypeScript

* Experience with CI/CD, infrastructure-as-code (Terraform, CDK, Pulumi, etc.), and cloud automation

* Comfort owning ambiguous systems and making pragmatic tradeoffs in production

Nice to Have

* Experience supporting LLM agents or ML systems in production

* Familiarity with agent frameworks, orchestration, or distributed systems

* Background in MLOps, SRE, or platform teams

* Experience working in early-stage or fast-scaling environments

Applications go to the hiring team directly