MLOps Architect
Quantiphi
About Quantiphi:
Quantiphi is an award-winning, AI-First global digital engineering company that helps the world’s leading Fortune 1000 organizations transform bold ideas into measurable business impact. We go beyond building innovative AI technologies—we solve the problems that matter most to our clients.
Since our founding in 2013, Quantiphi has built a proven track record of turning complex challenges into meaningful outcomes across industries.
Headquartered in Boston, with more than 4,000 professionals worldwide, we partner with global enterprises to deliver large-scale digital, cloud, and AI-driven transformation. #SolvingWhatMatters
We are an Elite and Premier partner to Google Cloud, AWS, NVIDIA, Snowflake, and other leading technology platforms, and our work has been recognized across the industry, including:
* 21 Google Cloud Partner of the Year awards in the past 10 years
* 3 AWS AI/ML Partner of the Year awards
* 3 NVIDIA Partner of the Year awards
* 3 Snowflake Partner of the Year awards
* Rated a Leader by Gartner, Forrester, IDC, ISG, Everest Group, and other leading analyst firms
Quantiphi delivers first-in-class AI solutions across industries including Life Sciences, Healthcare, Banking, Financial Services, CPG, Manufacturing, Energy, High-Tech, and Telecommunications, powered by cutting-edge Generative AI and Agentic AI accelerators.
We are also proud to be certified as a Great Place to Work—reflecting our commitment to our people and our culture.
For more details, visit: Website or LinkedIn Page
Job Description:
The Partner Consultant for AI Infrastructure & MLOps is a specialized technical expert focused on designing and building the scalable, automated, and resilient platforms required for large-scale machine learning. This role supports customers moving beyond experimentation to productionize AI/ML, focusing on the underlying infrastructure for distributed training and low-latency inference. The consultant bridges Data Science teams and Cloud Platform teams, leveraging Google Kubernetes Engine (GKE), Vertex AI, and specialized hardware (GPUs and TPUs) to create robust MLOps "factories." The role requires a "platform engineering" mindset in which all infrastructure is provisioned as code.
Key Responsibilities
* Platform Architecture: Design the foundational infrastructure for AI workloads, including secure and scalable Google Kubernetes Engine (GKE) clusters, network configurations, and IAM policies.
* Infrastructure Automation (IaC): Lead the automation of all AI infrastructure provisioning using Terraform to ensure repeatable, scalable, and secure environments.
* MLOps Pipeline Design: Architect end-to-end MLOps automation using Vertex AI Pipelines (or Kubeflow Pipelines) to cover the full lifecycle: data ingestion, validation, model training, registration, and automated deployment.
* Training & Inference Optimization: Design solutions for large-scale distributed training and scalable, low-latency serving (e.g., Vertex AI Endpoints, GKE autoscaling).
* Production Monitoring & Governance: Implement robust monitoring for model performance, data drift, and system health. Ensure all solutions adhere to security and governance standards.
* Hardware Advisory: Advise customers on the optimal hardware selection (cost vs. performance), including the provisioning and utilization of Google Cloud GPUs (A2, G2) and TPUs (v4, v5e).
* Technical Advisory & Collaboration: Act as the subject matter expert for customers and internal teams, providing guidance and hands-on support to streamline the entire ML lifecycle.
Required Credentials & Skills (Mandatory)
Google Cloud Certifications:
Google Cloud Certified - Professional Cloud Architect
Google Cloud Certified - Professional Machine Learning Engineer
HashiCorp Certification:
HashiCorp Certified: Terraform Associate
Cloud & AI Skills:
* Deep expertise with Google Kubernetes Engine (GKE), including cluster design, node pools, and security (Workload Identity).
* Hands-on, production-level experience with Terraform for automating GCP infrastructure.
* Demonstrable expertise across the Vertex AI Platform (Training, Pipelines, Endpoints).
* Strong Python programming and scripting skills.
* Concepts: Strong understanding of the complete MLOps lifecycle, CI/CD principles, and container-based workflows (Docker).
* Consulting Skills: 3-5+ years in a customer-facing technical role (DevOps, MLOps, or Cloud Engineering).
Preferred Credentials & Skills (Nice-to-Have)
Google Cloud Certification:
Google Cloud Certified - Professional Cloud DevOps Engineer
Industry Certifications (CNCF):
Certified Kubernetes Administrator (CKA)
HashiCorp Certification:
HashiCorp Certified: Terraform Authoring and Operations Professional
Technical Skills:
* Direct hands-on experience provisioning and managing Cloud TPUs.
* Data Engineering integration experience (BigQuery, Dataflow, Pub/Sub).
* Familiarity with monitoring tools like Prometheus and Grafana.
What’s in it for YOU at Quantiphi?
* Join one of the world’s fastest-growing AI-first digital engineering companies and make a real impact at scale.
* Lead and collaborate with a high-energy team of talented, driven individuals solving complex, meaningful challenges.
* Work with Fortune 500 companies and disruptive innovators in a research-driven environment with 60+ patents.
* Stay ahead of the curve by gaining hands-on experience with cutting-edge AI, ML, data, and cloud technologies while continuously upskilling.