AI Data Platform Engineer
Akoncagua AI
Overview
We are seeking an AI Data Platform Engineer with 4+ years of experience designing and building scalable data systems for AI/ML applications. The role focuses on transforming large volumes of structured and unstructured data into high-quality, model-ready datasets for large language model (LLM) training, fine-tuning, and production systems.
You will design and evolve the data infrastructure across hybrid environments (cloud + on-prem GPU clusters), enabling efficient data pipelines for LLM training, retrieval-augmented generation (RAG) systems, and analytics. The role requires strong systems thinking, deep experience with data pipelines, and a focus on reliability, scalability, and cost efficiency.
Key Responsibilities
- Design, build, and maintain scalable data pipelines for structured and unstructured data across hybrid (cloud + on-prem GPU) environments
- Architect and evolve the data infrastructure supporting large-scale AI/ML systems
- Develop automated ETL/ELT workflows to transform raw data into validated, structured, model-ready datasets
- Build document processing and understanding pipelines (PDFs, images, OCR) to extract structured data into standardized formats (e.g., JSON); see the first sketch after this list
- Design and implement pipelines for LLM dataset preparation, including instruction tuning, evaluation datasets, and synthetic data generation; see the second sketch after this list
- Implement data cleaning, deduplication, validation, and quality checks to produce “gold-standard” datasets for continued pre-training (CPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (LoRA, QLoRA); see the third sketch after this list
- Integrate vector databases and support data pipelines for RAG systems
- Implement data versioning, lineage tracking, and reproducible data workflows for ML systems
- Optimize pipeline performance, scalability, and cost efficiency across distributed systems
- Design and implement observability frameworks, including logging, monitoring, data quality checks, and alerting
- Collaborate closely with ML engineers, platform engineers, and product teams to deliver reliable data systems
- Operate and troubleshoot Linux-based systems and GPU-enabled environments
- Stay current with emerging tools and best practices in data engineering and AI infrastructure
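To make the document-processing responsibility concrete, here is a minimal sketch of a PDF-to-JSON extraction step. It assumes the pypdf library, and the (doc_id, page, text) output schema is a hypothetical example, not a company standard; a production pipeline would add OCR fallback for scanned pages, layout parsing, and schema validation.

```python
import json
from pathlib import Path

from pypdf import PdfReader  # assumed dependency: pip install pypdf


def pdf_to_records(path: Path) -> list[dict]:
    """Extract per-page text from a PDF into standardized JSON records."""
    reader = PdfReader(str(path))
    records = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # scanned pages would need OCR instead
        records.append({"doc_id": path.stem, "page": page_num, "text": text.strip()})
    return records


if __name__ == "__main__":
    for record in pdf_to_records(Path("sample.pdf")):
        print(json.dumps(record, ensure_ascii=False))
```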
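For the dataset-preparation bullet, this second sketch converts raw Q&A records into the prompt/response JSONL layout commonly used for instruction tuning. The input record shape and the output keys (instruction, input, output) follow one popular convention and are illustrative assumptions, not a fixed standard.

```python
import json
from typing import Iterable


def to_instruction_jsonl(records: Iterable[dict], out_path: str) -> int:
    """Write raw Q&A records as instruction-tuning JSONL; return rows written.

    Expects records shaped like {"question": ..., "context": ..., "answer": ...};
    this shape is an assumption for illustration.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            answer = (rec.get("answer") or "").strip()
            if not answer:  # drop unanswerable rows during validation
                continue
            row = {
                "instruction": rec["question"].strip(),
                "input": (rec.get("context") or "").strip(),
                "output": answer,
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```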
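And for the cleaning and deduplication bullet, a third sketch: an exact-deduplication and validation pass over a JSONL stream. Real pipelines typically add near-duplicate detection (e.g., MinHash); this shows only the hash-based exact case, and the min_chars threshold is an illustrative assumption.

```python
import hashlib
import json
from typing import Iterable, Iterator


def dedupe_and_validate(lines: Iterable[str], min_chars: int = 20) -> Iterator[dict]:
    """Yield unique, minimally valid records from a JSONL stream."""
    seen: set[str] = set()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed rows
        text = (rec.get("text") or "").strip()
        if len(text) < min_chars:
            continue  # drop rows failing the basic quality check
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates by content hash
        seen.add(digest)
        yield rec


if __name__ == "__main__":
    with open("raw.jsonl", encoding="utf-8") as f:
        for rec in dedupe_and_validate(f):
            print(json.dumps(rec, ensure_ascii=False))
```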
Required Qualifications
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, AI, or a related field
- 4+ years of experience building and operating data pipelines in production environments
- Strong proficiency in Python and experience with data pipeline development
- Experience with workflow orchestration tools (e.g., Airflow, Prefect, or Dagster); see the sketch after this list
- Experience with cloud platforms (AWS, GCP, or Azure) and scalable data infrastructure
- Hands-on experience with containerization (Docker) and orchestration (Kubernetes)
- Experience working with distributed data processing systems (e.g., Spark, Ray)
- Strong understanding of data modeling, schema design, and data validation techniques
- Experience working in Linux environments and scripting (e.g., Bash)
- Familiarity with GPU-based environments and hybrid infrastructure
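As a reference point for the orchestration requirement, here is a minimal Prefect 2.x flow (one of the listed tools; an Airflow DAG or Dagster job would be structured differently). The task names, file paths, and retry settings are illustrative assumptions, not a prescribed setup.

```python
from prefect import flow, task  # assumed dependency: pip install prefect


@task(retries=2, retry_delay_seconds=60)
def extract(path: str) -> list[str]:
    # Placeholder extract step; a real task would read from object storage.
    with open(path, encoding="utf-8") as f:
        return f.readlines()


@task
def transform(lines: list[str]) -> list[str]:
    # Placeholder transform step: trim whitespace and drop empty lines.
    return [line.strip() for line in lines if line.strip()]


@task
def load(lines: list[str], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))


@flow(name="daily-ingest")
def daily_ingest(path: str = "raw/events.txt", out_path: str = "clean/events.txt"):
    load(transform(extract(path)), out_path)


if __name__ == "__main__":
    daily_ingest()
```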
Nice to Have
- Experience with LLM data workflows, including CPT, SFT, reinforcement learning from human feedback (RLHF), or direct preference optimization (DPO)
- Experience with document AI, OCR systems, and unstructured data processing
- Experience with vector databases (e.g., pgvector, Chroma, Pinecone); see the sketch after this list
- Familiarity with data lakehouse technologies (e.g., Delta Lake, Apache Iceberg)
- Experience with experiment tracking or data versioning tools
- Relevant certifications (AWS, GCP, Azure, or NVIDIA)
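To illustrate the retrieval pattern behind the vector-database item, here is a dependency-light sketch of cosine-similarity search with numpy. A real deployment would use one of the listed stores (pgvector, Chroma, Pinecone) and a proper embedding model; the random vectors below are a stand-in for actual embeddings.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k rows of doc_vecs most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity against every document
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_vecs = rng.normal(size=(100, 384))  # stand-in for real embeddings
    query_vec = rng.normal(size=384)
    print(top_k(query_vec, doc_vecs))  # indices of the 3 nearest documents
```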
What Success Looks Like
- Build scalable, reliable data pipelines that support LLM training and production systems
- Deliver high-quality, validated datasets that improve model performance and reduce iteration cycles
- Enable efficient hybrid infrastructure utilization across cloud and on-prem GPU environments
- Improve data pipeline reliability, observability, and cost efficiency over time
Why Join Us
- Work on cutting-edge AI systems with real-world impact
- Build and operate critical data infrastructure powering advanced AI/ML models
- Collaborate with a high-performing, fast-moving team
- Opportunity to shape the foundation of next-generation AI platforms