
MLOps Platform Engineer

Cliff Services Inc
Reston, VA
Full-time
5,200 – 6,200 / year
AI tools: AWS

Reston, VA (Onsite)

W2

Description

The Data Modeling Analytics & AI Engineering team is seeking an experienced MLOps Platform Engineer to design, build, and support enterprise-grade machine learning operations capabilities. This role will play a key part in enabling scalable, reliable, and secure ML model development and deployment across our cloud and container platforms.

This is a hands-on engineering role requiring strong expertise in AWS, Kubernetes (EKS), CI/CD automation, containerization, and ML platform operations. The ideal candidate will have solid engineering fundamentals combined with practical knowledge of ML workflows, deployment patterns, and platform reliability.

Key Responsibilities

Platform Engineering & Operations

* Engineer, manage, and support MLOps platform components across AWS and EKS-based environments.

* Oversee deployment, configuration, and operation of infrastructure used for ML training, batch inference, and real-time model serving.

* Ensure platform availability, resilience, and performance across dev, test, and production environments.

* Implement role-based access controls (RBAC), network policies, and scalable namespace designs within EKS.

Model Deployment & CI/CD Automation

* Build and support CI/CD pipelines (GitLab) for model packaging, container image builds, vulnerability scanning, and automated deployment flows.

* Enable standardized model release processes including environment promotion, versioning, and rollback workflows.

* Integrate CI/CD with ML frameworks, model repositories, artifacts, and runtime environments.

Container & Kubernetes Workloads

* Design and manage EKS workloads supporting containerized ML jobs and microservices.

* Implement auto-scaling, resource quotas, cluster optimization, and multi-tenant workload isolation.

* Support GPU- and CPU-based training and inference workloads.

Monitoring, Observability & Optimization

* Implement logging, monitoring, and alerting for ML pipelines, model endpoints, batch jobs, and platform components.

* Analyze compute, storage, and data transfer usage to optimize cost efficiency across ML workloads.

* Perform incident response, root cause analysis, and long-term remediation planning.

Collaboration & Enablement

* Partner with Data Scientists, ML Engineers, and application teams to operationalize end-to-end machine learning solutions.

* Provide technical guidance on best practices for ML model lifecycle management, deployment patterns, and scalable architectures.

* Contribute to documentation, runbooks, onboarding materials, and internal knowledge bases.

Required Qualifications

* 3+ years of hands-on experience with AWS services, including EKS, EC2, S3, IAM, CloudWatch, and ECR.

* Strong experience operating and troubleshooting Kubernetes (preferably AWS EKS).

* Proficiency in containerization (Docker) and orchestration concepts.

* Strong programming/scripting experience in Python and Bash.

* Experience building and managing CI/CD pipelines (GitLab or equivalent).

* Familiarity with machine learning workflows, including training, inference, and model monitoring.

* Experience with infrastructure-as-code (Terraform or CloudFormation).

* Experience supporting production platforms, including incident management and root cause analysis.

Preferred Qualifications

* Experience managing data analytics platforms and tools (e.g., Domino, SageMaker).

* Experience with ML lifecycle tools such as MLflow or similar.

* Experience supporting GPU-based workloads or distributed training environments.

* Familiarity with enterprise MLOps architectures and patterns (batch, real-time, microservices).

* Understanding of data processing frameworks and feature pipelines.

Other Competencies

* Strong analytical, troubleshooting, and problem-solving skills.

* Effective communication and documentation abilities.

* Ability to collaborate across engineering, analytics, and product teams.

* Self-motivated with the ability to drive initiatives independently.

* Ability to work in a complex, regulated enterprise environment.

Thanks & Regards

[email protected]
