AI Data Platform Engineer
Akoncagua AI
Overview
We are seeking an AI Data Platform Engineer with 4+ years of experience designing and building scalable data systems for AI/ML applications. The role focuses on transforming large volumes of structured and unstructured data into high-quality, model-ready datasets for large language model (LLM) training, fine-tuning, and production systems.
You will design and evolve the data infrastructure across hybrid environments (cloud + on-prem GPU clusters), enabling efficient data pipelines for LLM training, retrieval-augmented generation (RAG) systems, and analytics. The role requires strong systems thinking, deep experience with data pipelines, and a focus on reliability, scalability, and cost efficiency.
Key Responsibilities
- Design, build, and maintain scalable data pipelines for structured and unstructured data across hybrid (cloud + on-prem GPU) environments
- Architect and evolve the data infrastructure supporting large-scale AI/ML systems
- Develop automated ETL/ELT workflows to transform raw data into validated, structured, model-ready datasets
- Build document processing and understanding pipelines (PDFs, images, OCR) to extract structured data into standardized formats (e.g., JSON); see the first sketch after this list
- Design and implement pipelines for LLM dataset preparation, including instruction tuning, evaluation datasets, and synthetic data generation; see the second sketch after this list
- Implement data cleaning, deduplication, validation, and quality checks to produce “gold-standard” datasets for continued pre-training (CPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (LoRA, QLoRA); see the third sketch after this list
- Integrate vector databases and support data pipelines for RAG systems
- Implement data versioning, lineage tracking, and reproducible data workflows for ML systems
- Optimize pipeline performance, scalability, and cost efficiency across distributed systems
- Design and implement observability frameworks, including logging, monitoring, data quality checks, and alerting
- Collaborate closely with ML engineers, platform engineers, and product teams to deliver reliable data systems
- Operate and troubleshoot Linux-based systems and GPU-enabled environments
- Stay current with emerging tools and best practices in data engineering and AI infrastructure
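To make the document-processing responsibility concrete, here is a minimal sketch of a PDF-to-JSON extraction step. It assumes the pypdf library, and the (doc_id, page, text) output schema is a hypothetical example, not a company standard; a production pipeline would add OCR fallback for scanned pages, layout parsing, and schema validation.

```python
import json
from pathlib import Path

from pypdf import PdfReader  # assumed dependency: pip install pypdf


def pdf_to_records(path: Path) -> list[dict]:
    """Extract per-page text from a PDF into standardized JSON records."""
    reader = PdfReader(str(path))
    records = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # scanned pages would need OCR instead
        records.append({"doc_id": path.stem, "page": page_num, "text": text.strip()})
    return records


if __name__ == "__main__":
    for record in pdf_to_records(Path("sample.pdf")):
        print(json.dumps(record, ensure_ascii=False))
```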
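For the dataset-preparation bullet, this second sketch converts raw Q&A records into the prompt/response JSONL layout commonly used for instruction tuning. The input record shape and the output keys (instruction, input, output) follow one popular convention and are illustrative assumptions, not a fixed standard.

```python
import json
from typing import Iterable


def to_instruction_jsonl(records: Iterable[dict], out_path: str) -> int:
    """Write raw Q&A records as instruction-tuning JSONL; return rows written.

    Expects records shaped like {"question": ..., "context": ..., "answer": ...};
    this shape is an assumption for illustration.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            answer = (rec.get("answer") or "").strip()
            if not answer:  # drop unanswerable rows during validation
                continue
            row = {
                "instruction": rec["question"].strip(),
                "input": (rec.get("context") or "").strip(),
                "output": answer,
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```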
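And for the cleaning and deduplication bullet, a third sketch: an exact-deduplication and validation pass over a JSONL stream. Real pipelines typically add near-duplicate detection (e.g., MinHash); this shows only the hash-based exact case, and the min_chars threshold is an illustrative assumption.

```python
import hashlib
import json
from typing import Iterable, Iterator


def dedupe_and_validate(lines: Iterable[str], min_chars: int = 20) -> Iterator[dict]:
    """Yield unique, minimally valid records from a JSONL stream."""
    seen: set[str] = set()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed rows
        text = (rec.get("text") or "").strip()
        if len(text) < min_chars:
            continue  # drop rows failing the basic quality check
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates by content hash
        seen.add(digest)
        yield rec


if __name__ == "__main__":
    with open("raw.jsonl", encoding="utf-8") as f:
        for rec in dedupe_and_validate(f):
            print(json.dumps(rec, ensure_ascii=False))
```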
Required Qualifications
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, AI, or a related field
- 4+ years of experience building and operating data pipelines in production environments
- Strong proficiency in Python and experience with data pipeline development
- Experience with workflow orchestration tools (e.g., Airflow, Prefect, or Dagster); see the sketch after this list
- Experience with cloud platforms (AWS, GCP, or Azure) and scalable data infrastructure
- Hands-on experience with containerization (Docker) and orchestration (Kubernetes)
- Experience working with distributed data processing systems (e.g., Spark, Ray)
- Strong understanding of data modeling, schema design, and data validation techniques
- Experience working in Linux environments and scripting (e.g., Bash)
- Familiarity with GPU-based environments and hybrid infrastructure
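As a reference point for the orchestration requirement, here is a minimal Prefect 2.x flow (one of the listed tools; an Airflow DAG or Dagster job would be structured differently). The task names, file paths, and retry settings are illustrative assumptions, not a prescribed setup.

```python
from prefect import flow, task  # assumed dependency: pip install prefect


@task(retries=2, retry_delay_seconds=60)
def extract(path: str) -> list[str]:
    # Placeholder extract step; a real task would read from object storage.
    with open(path, encoding="utf-8") as f:
        return f.readlines()


@task
def transform(lines: list[str]) -> list[str]:
    # Placeholder transform step: trim whitespace and drop empty lines.
    return [line.strip() for line in lines if line.strip()]


@task
def load(lines: list[str], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))


@flow(name="daily-ingest")
def daily_ingest(path: str = "raw/events.txt", out_path: str = "clean/events.txt"):
    load(transform(extract(path)), out_path)


if __name__ == "__main__":
    daily_ingest()
```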
Nice to Have
- Experience with LLM data workflows, including CPT, SFT, reinforcement learning from human feedback (RLHF), or direct preference optimization (DPO)
- Experience with document AI, OCR systems, and unstructured data processing
- Experience with vector databases (e.g., pgvector, Chroma, Pinecone); see the sketch after this list
- Familiarity with data lakehouse technologies (e.g., Delta Lake, Apache Iceberg)
- Experience with experiment tracking or data versioning tools
- Relevant certifications (AWS, GCP, Azure, or NVIDIA)
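To illustrate the retrieval pattern behind the vector-database item, here is a dependency-light sketch of cosine-similarity search with numpy. A real deployment would use one of the listed stores (pgvector, Chroma, Pinecone) and a proper embedding model; the random vectors below are a stand-in for actual embeddings.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k rows of doc_vecs most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity against every document
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_vecs = rng.normal(size=(100, 384))  # stand-in for real embeddings
    query_vec = rng.normal(size=384)
    print(top_k(query_vec, doc_vecs))  # indices of the 3 nearest documents
```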
What Success Looks Like
- Build scalable, reliable data pipelines that support LLM training and production systems
- Deliver high-quality, validated datasets that improve model performance and reduce iteration cycles
- Enable efficient hybrid infrastructure utilization across cloud and on-prem GPU environments
- Improve data pipeline reliability, observability, and cost efficiency over time
Why Join Us
- Work on cutting-edge AI systems with real-world impact
- Build and operate critical data infrastructure powering advanced AI/ML models
- Collaborate with a high-performing, fast-moving team
- Opportunity to shape the foundation of next-generation AI platforms