
QA/Validation Engineer with Agentic AI Experience – Toronto, CA (Hybrid)

TestingXperts
Toronto, Ontario, Canada
Contract
AI tools: PyTorch, TensorFlow
Applications go directly to the hiring team

Full Description

QA / Validation Engineer – Agentic AI & Machine Learning (2 Positions)

Location: Charles Street West, Toronto (Hybrid)

Engagement Type: Contract-to-hire (C2H) or full-time (FTE)

We are seeking a QA / Validation Engineer to assure the quality, safety, and reliability of Machine Learning and Agentic AI solutions (LLM/RAG/tool-using agents) from development through production. This is a hands-on engineering role focused on designing test strategies, building automated evaluation pipelines, and implementing quality gates for data, models, prompts, tools, and end-to-end agent workflows. You will work primarily in Python and leverage an open-source AI/ML stack, collaborating closely with ML/GenAI engineers, data engineers, and platform teams in environments that may include Databricks and Spark.

Key Objectives

* Define and execute an AI/ML quality strategy covering data, model, and agent behavior validation across offline evaluation and production monitoring.

* Build repeatable, automated evaluation and regression patterns for ML models and agentic workflows (including prompt, tool, and retrieval changes).

* Improve reliability, safety, and user trust by systematically reducing hallucinations, tool misuse, regressions, and unintended behaviors.

* Partner with engineering and platform teams to implement scalable, governed validation pipelines (including Databricks/Spark where applicable) while meeting security, privacy, and Responsible AI requirements.

Primary Responsibilities

* Own end-to-end QA/validation for ML and Agentic AI solutions: requirements-to-metrics traceability, test planning, execution, defect triage, and release sign-off.

* Design and maintain evaluation frameworks for agentic systems: task success rate, tool-call correctness, grounding/citation quality (where used), latency/cost, and regression detection (an illustrative sketch follows this list).

* Build automated test suites in Python for: data validation (schema, drift, anomalies), feature/label quality checks, model inference correctness, and agent workflow validation (multi-step, tool-using, and memory-based flows); see the data-validation sketch after this list.

* Implement LLM/agent-specific quality checks: hallucination and factuality testing, prompt injection and jailbreak resistance testing, PII leakage checks, toxicity/safety filters, and policy conformance (see the PII-check sketch below).

* Validate RAG systems end-to-end: document chunking/embedding quality, retrieval accuracy (precision/recall), reranking behavior, and answer faithfulness to retrieved context (see the retrieval-metrics sketch below).

* Establish test data and “golden” datasets: curated evaluation sets, adversarial test cases, synthetic data generation (where appropriate), and clear acceptance criteria.

* Integrate quality gates into CI/CD: unit/integration tests, evaluation runs, reporting dashboards, and release-blocking thresholds (see the release-gate sketch below).

* Partner with engineers to instrument observability: tracing, structured logs, metrics, error cohorts, and production monitoring for drift, degradation, bias, latency, and cost.

* Collaborate with platform teams to run validations at scale (Databricks jobs, Spark pipelines, scheduled workflows) and ensure governance over data/model access.

* Document validation approaches, test evidence, and risk assessments; support audits and compliance needs for regulated or high-impact use cases.
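To make the evaluation-framework bullet concrete, here is a minimal sketch of two of the named metrics, task success rate and tool-call correctness, computed over a hypothetical logged-trace format; the AgentTrace fields and the grading scheme are assumptions for illustration, not a prescribed standard.

```python
# Minimal sketch only: the AgentTrace fields and grading scheme are assumptions.
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    task_id: str
    succeeded: bool  # graded offline against a golden answer
    tool_calls: list = field(default_factory=list)           # observed (tool_name, args) pairs
    expected_tool_calls: list = field(default_factory=list)  # labelled (tool_name, args) pairs


def task_success_rate(traces):
    """Fraction of evaluation tasks the agent completed successfully."""
    return sum(t.succeeded for t in traces) / len(traces)


def tool_call_correctness(traces):
    """Share of expected tool calls that the agent actually made."""
    expected = sum(len(t.expected_tool_calls) for t in traces)
    matched = sum(
        sum(call in t.expected_tool_calls for call in t.tool_calls) for t in traces
    )
    return matched / expected if expected else 1.0


traces = [
    AgentTrace("t1", True, [("search", "weather Toronto")], [("search", "weather Toronto")]),
    AgentTrace("t2", False, [("search", "wrong query")], [("calculator", "2+2")]),
]
print(task_success_rate(traces))      # 0.5
print(tool_call_correctness(traces))  # 0.5
```

Tracking these per release candidate is one simple way to surface regressions when prompts, tools, or retrieval change.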
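For the Python test-suite bullet, a minimal data-validation sketch using pandas and pytest; the column names, dtypes, and valid score range are illustrative assumptions rather than a real schema.

```python
# Minimal sketch only: column names, dtypes, and the valid score range are assumptions.
import pandas as pd
import pytest

EXPECTED_DTYPES = {"user_id": "int64", "score": "float64", "label": "int64"}


@pytest.fixture
def batch() -> pd.DataFrame:
    # In practice this would load the latest scoring batch; a tiny in-memory
    # frame keeps the example self-contained.
    return pd.DataFrame(
        {"user_id": [1, 2, 3], "score": [0.2, 0.7, 0.9], "label": [0, 1, 1]}
    )


def test_schema_matches_contract(batch):
    assert {col: str(dtype) for col, dtype in batch.dtypes.items()} == EXPECTED_DTYPES


def test_required_columns_have_no_nulls(batch):
    assert not batch[["user_id", "score"]].isnull().values.any()


def test_scores_fall_in_valid_range(batch):
    assert batch["score"].between(0.0, 1.0).all()
```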
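For the LLM/agent-specific quality-check bullet, a minimal PII-leakage sketch; the regex patterns are illustrative only, and a production check would rely on a vetted PII/DLP library and a much broader pattern set.

```python
# Minimal sketch only: illustrative regex patterns, not a production PII scanner.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}


def find_pii(text: str) -> dict:
    """Return any PII-like substrings found in a model or agent response."""
    findings = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[name] = matches
    return findings


def test_agent_response_contains_no_pii():
    response = "You can reach our support team during business hours."
    assert find_pii(response) == {}
```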
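For the RAG validation bullet, a minimal sketch of retrieval precision/recall against a labelled golden set of chunk IDs; the golden-set format (question mapped to the chunk IDs a correct answer needs) is an assumption.

```python
# Minimal sketch only: assumes a labelled golden set of relevant chunk IDs per question.
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Precision/recall of the retriever's chunk IDs against the golden set."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Example: retriever returned 4 chunks, 2 of which are in the labelled set of 3.
print(retrieval_precision_recall(["c1", "c2", "c7", "c9"], ["c1", "c2", "c3"]))  # (0.5, 0.667)
```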
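For the CI/CD quality-gate bullet, a minimal sketch of a release-blocking threshold check; the metric names, thresholds, and JSON report path are assumptions about how an evaluation job might publish its results.

```python
# Minimal sketch only: metric names, thresholds, and report format are assumptions.
import json
import sys

THRESHOLDS = {"task_success_rate": 0.90, "tool_call_correctness": 0.95}


def check_gate(report_path: str) -> int:
    with open(report_path) as fh:
        metrics = json.load(fh)
    failures = [
        f"{name}={metrics.get(name, 0.0):.3f} is below the minimum {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    for failure in failures:
        print(f"RELEASE GATE FAILED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(check_gate(sys.argv[1]))
```

Run as the final step of an evaluation pipeline (for example, a CI job or scheduled workflow), a non-zero exit code blocks the release.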

Required Skills & Experience

* Strong QA engineering experience with a focus on AI/ML systems, including validation strategies beyond traditional functional testing.

* Strong Python skills (test design, automation frameworks, packaging, code quality, and performance awareness).

* Experience validating ML models and pipelines: dataset splits, leakage checks, metric selection, thresholding, and regression testing.

* Hands-on familiarity with an open-source AI stack (examples: scikit-learn, PyTorch/TensorFlow, XGBoost/LightGBM, Hugging Face ecosystem).

* Experience testing GenAI/LLM/agentic systems: prompt/version management, evaluation harnesses, and quality metrics for non-deterministic outputs.

* Understanding of RAG concepts (embeddings, vector search, retrieval, reranking) and how to evaluate them.

* Working knowledge of MLOps/LLMOps practices: experiment tracking, model/prompt versioning, reproducibility, and monitoring (e.g., MLflow or equivalent).

* Experience with CI/CD, containerization (Docker), and test reporting; ability to integrate evaluations into automated pipelines.

* Strong data skills: SQL fundamentals and experience with data analysis/validation using pandas/NumPy.

* Clear communication and stakeholder management—able to translate quality risks into actionable engineering work.

Preferred / Nice to Have

* Awareness of Databricks concepts (workspaces, notebooks, jobs, clusters) and how QA/validation can be operationalized via Databricks workflows.

* Experience with Spark for large-scale data validation and distributed test execution.

* Familiarity with Databricks MLflow Model Registry and/or Unity Catalog (or similar governance) for managing models, features, prompts, and controlled data access.

* Experience with data quality frameworks (e.g., Great Expectations, Deequ) and drift/monitoring practices.

* Experience with agent frameworks and tooling (e.g., LangChain/LangGraph, LlamaIndex, OpenAI-compatible tool calling patterns) and related evaluation approaches.

* Experience with LLM observability/evaluation tools (e.g., OpenTelemetry tracing, prompt/agent telemetry, evaluation dashboards) and red-teaming practices.

* Familiarity with API testing for AI services (e.g., FastAPI-based services) and performance testing (latency, throughput, cost profiling).

* Exposure to Responsible AI practices (privacy, explainability, robustness, security controls) and validation evidence for audits.
