AI Infrastructure Engineer
Senior AI Platform / LLM Agent Infrastructure Engineer
Location: Remote or Hybrid (US-friendly)
Employment Type: Full-time
Experience Level: 5+ years (strong production ownership required)
About the Company
We’re partnering with a well-funded, early-stage technology company building production-grade, agent-driven AI systems that automate complex, real-world workflows. The team is highly technical and execution-focused, building software where AI agents operate reliably at scale, not just in demos.
This is an opportunity to join a small, senior team solving hard infrastructure and reliability problems at the intersection of cloud systems, DevOps, and applied AI.
The Role
We are seeking a Senior AI Platform / LLM Agent Infrastructure Engineer to own the deployment, scalability, and operational reliability of AI/LLM agents in production environments.
This role is focused on getting AI agents into production and keeping them there. You will be responsible for designing and operating the infrastructure that supports agent-based systems handling real customer workloads at scale.
You’ll work closely with AI engineers and product leadership to ensure agent systems are observable, resilient, cost-efficient, and performant across evolving use cases.
What You’ll Do
* Own the production deployment lifecycle of AI/LLM agents from launch through long-term operation
* Design and implement cloud-native infrastructure to support agent-based systems at scale
* Deploy, operate, and monitor AI agents handling real-world, customer-facing workloads
* Build and maintain CI/CD pipelines, deployment workflows, and infrastructure-as-code
* Implement monitoring, logging, alerting, and observability for agent behavior, failures, latency, and cost
* Optimize systems for reliability, performance, and cost efficiency across LLM providers
* Partner closely with AI engineers to harden agent architectures for production usage
* Troubleshoot complex production issues across infrastructure, models, and system integrations
* Establish best practices for secure, scalable, and maintainable AI systems
What We’re Looking For
* Strong experience deploying and operating production cloud systems
* Proven background in DevOps, platform engineering, or infrastructure roles, ideally supporting ML or AI workloads
* Hands-on experience deploying AI/LLM-powered applications or agents into production
* Experience running systems at scale in either a startup or a large technology environment
* Deep familiarity with AWS infrastructure (or equivalent cloud platforms)
* Strong programming skills in Python and/or TypeScript
* Experience with CI/CD, infrastructure-as-code (Terraform, CDK, Pulumi, etc.), and cloud automation
* Comfortable owning ambiguous systems and making pragmatic tradeoffs in production
Nice to Have
* Experience supporting LLM agents or ML systems in production
* Familiarity with agent frameworks, orchestration, or distributed systems
* Background in MLOps, SRE, or platform teams
* Experience working in early-stage or fast-scaling environments