QA/Validation Engineer with Agentic AI Experience -Toronto, CA Hybrid
TestingXpertsFull Description
QA / Validation Engineer – Agentic AI & Machine Learning (2 Positions)
Charles Street West, Toronto (Hybrid)
Duration: C2H or FTE
We are seeking a QA / Validation Engineer to assure the quality, safety, and reliability of Machine Learning and Agentic AI solutions (LLM/RAG/tool-using agents) from development through production. This is a hands-on engineering role focused on designing test strategies, building automated evaluation pipelines, and implementing quality gates for data, models, prompts, tools, and end-to-end agent workflows. You will work primarily in Python and leverage an open-source AI/ML stack, collaborating closely with ML/GenAI engineers, data engineering, and platform teams in environments that may include Databricks and Spark.
Key Objectives
* Define and execute an AI/ML quality strategy covering data, model, and agent behavior validation across offline evaluation and production monitoring.
* Build repeatable, automated evaluation and regression patterns for ML models and agentic workflows (including prompt, tool, and retrieval changes).
* Improve reliability, safety, and user trust by systematically reducing hallucinations, tool misuse, regressions, and unintended behaviors.
* Partner with engineering and platform teams to implement scalable, governed validation pipelines (including Databricks/Spark where applicable) while meeting security, privacy, and Responsible AI requirements.
Primary Responsibilities
* Own end-to-end QA/validation for ML and Agentic AI solutions: requirements-to-metrics, test planning, execution, defect triage, and release sign-off.
* Design and maintain evaluation frameworks for agentic systems: task success rate, tool-call correctness, grounding/citation quality (where used), latency/cost, and regression detection.
* Build automated test suites in Python for: data validation (schema, drift, anomalies), feature/label quality checks, model inference correctness, and agent workflow validation (multi-step, tool-using, and memory-based flows).
* Implement LLM/agent-specific quality checks: hallucination and factuality testing, prompt injection and jailbreak resistance testing, PII leakage checks, toxicity/safety filters, and policy conformance.
* Validate RAG systems end-to-end: document chunking/embedding quality, retrieval accuracy (precision/recall), reranking behavior, and answer faithfulness to retrieved context.
* Establish test data and “golden” datasets: curated evaluation sets, adversarial test cases, synthetic data generation (where appropriate), and clear acceptance criteria.
* Integrate quality gates into CI/CD: unit/integration tests, evaluation runs, reporting dashboards, and release-blocking thresholds.
* Partner with engineers to instrument observability: tracing, structured logs, metrics, error cohorts, and production monitoring for drift, degradation, bias, latency, and cost.
* Collaborate with platform teams to run validations at scale (Databricks jobs, Spark pipelines, scheduled workflows) and ensure governance over data/model access.
* Document validation approaches, test evidence, and risk assessments; support audits and compliance needs for regulated or high-impact use cases.
Required Skills & Experience
* Strong QA engineering experience with a focus on AI/ML systems, including validation strategies beyond traditional functional testing.
* Strong Python skills (test design, automation frameworks, packaging, code quality, and performance awareness).
* Experience validating ML models and pipelines: dataset splits, leakage checks, metric selection, thresholding, and regression testing.
* Hands-on familiarity with an open-source AI stack (examples: scikit-learn, PyTorch/TensorFlow, XGBoost/LightGBM, Hugging Face ecosystem).
* Experience testing GenAI/LLM/agentic systems: prompt/version management, evaluation harnesses, and quality metrics for non-deterministic outputs.
* Understanding of RAG concepts (embeddings, vector search, retrieval, reranking) and how to evaluate them.
* Working knowledge of MLOps/LLMOps practices: experiment tracking, model/prompt versioning, reproducibility, and monitoring (e.g., MLflow or equivalent).
* Experience with CI/CD, containerization (Docker), and test reporting; ability to integrate evaluations into automated pipelines.
* Strong data skills: SQL fundamentals and experience with data analysis/validation using pandas/NumPy.
* Clear communication and stakeholder management—able to translate quality risks into actionable engineering work.
Preferred / Nice to Have
* Awareness of Databricks concepts (workspaces, notebooks, jobs, clusters) and how QA/validation can be operationalized via Databricks workflows.
* Experience with Spark for large-scale data validation and distributed test execution.
* Familiarity with Databricks MLflow Model Registry and/or Unity Catalog (or similar governance) for managing models, features, prompts, and controlled data access.
* Experience with data quality frameworks (e.g., Great Expectations, Deequ) and drift/monitoring practices.
* Experience with agent frameworks and tooling (e.g., LangChain/LangGraph, LlamaIndex, OpenAI-compatible tool calling patterns) and related evaluation approaches.
* Experience with LLM observability/evaluation tools (e.g., OpenTelemetry tracing, prompt/agent telemetry, evaluation dashboards) and red-teaming practices.
* Familiarity with API testing for AI services (e.g., FastAPI-based services) and performance testing (latency, throughput, cost profiling).
* Exposure to Responsible AI practices (privacy, explainability, robustness, security controls) and validation evidence for audits.