Sr. Machine Learning Ops Engineer (Toronto - Hybrid) (Full-time / Contract to Hire)

TestingXperts
Toronto, Ontario, Canada
Full-time
AI tools: PyTorch, TensorFlow
Applications go directly to the hiring team

Full Description

Title: Sr. Machine Learning Ops Engineer

Location: Charles Street West, Toronto (Hybrid)

Type: FTE or Contract to Hire

Role Overview

We are seeking a Senior Machine Learning Ops Engineer to design, build, and deploy ML solutions that turn data into measurable business impact. This is a hands-on engineering role focused on developing end-to-end ML pipelines (data preparation, feature engineering, model training, evaluation, and production deployment) using Python and an open-source AI/ML stack. You will collaborate with data engineering and platform teams and work in environments that may include Databricks and Spark for scalable data processing and model operations.

Key Objectives

* Deliver production-grade ML models and data products from discovery through deployment.

* Build repeatable, maintainable ML engineering patterns for training, evaluation, and inference.

* Improve model quality, reliability, and performance through robust testing, monitoring, and iteration.

* Partner with data and platform teams to leverage scalable compute and data platforms (including Databricks/Spark) while meeting security and governance requirements.

Primary Responsibilities

* Design, develop, and iterate on machine learning models for classification, regression, clustering, recommendation, forecasting, and/or NLP use cases as needed.

* Build end-to-end ML pipelines in Python: data ingestion and preparation, feature engineering, training, evaluation, and batch/real-time inference.

* Apply sound experimentation practices: baselines, ablation studies, cross-validation (as applicable), and clear success metrics aligned to business outcomes.

* Develop and maintain reusable ML code (packages, utilities, pipelines) with strong software engineering practices (tests, code review, documentation, CI/CD).

* Implement model evaluation and testing: offline benchmarks, data/label quality checks, reproducible training runs, and regression tests to prevent performance degradation.

* Operationalize MLOps: model versioning, experiment tracking, model registry, automated deployments, and monitoring for drift, bias, latency, and cost.

* Integrate ML services with product systems via APIs and event-driven patterns; collaborate on feature stores, data contracts, and production SLAs.

* Leverage open-source AI/ML components (e.g., scikit-learn, PyTorch/TensorFlow, XGBoost/LightGBM, Hugging Face ecosystem) and choose the right tool for accuracy, latency, and maintainability.

* Collaborate with data engineering and platform teams to use Databricks/Spark for large-scale ETL, feature computation, distributed training (where relevant), and scheduled jobs.

* Ensure solutions follow security, privacy, and responsible AI practices, including safe handling of sensitive data and auditability of model decisions.

Required Skills & Experience

* Strong software engineering experience in Python (clean architecture, API design, testing, packaging, performance tuning).

* Hands-on experience building and deploying machine learning models in production environments.

* Proficiency with common ML libraries and frameworks (e.g., scikit-learn, PyTorch or TensorFlow; XGBoost/LightGBM as applicable).

* Experience with data processing in Python (e.g., pandas, NumPy) and strong SQL fundamentals.

* Understanding of ML concepts (bias/variance, regularization, feature leakage, evaluation metrics, calibration) and ability to select appropriate metrics for the use case.

* Experience with MLOps practices and tooling (e.g., MLflow or equivalent), including experiment tracking, model versioning, and reproducible training.

* Experience deploying services (Docker, CI/CD) and operating them with monitoring/observability practices.

* Ability to communicate tradeoffs clearly, balancing accuracy, latency, cost, reliability, and risk.

Preferred / Nice to Have

* Awareness of Databricks concepts (workspaces, notebooks, jobs, clusters) and practical experience with Spark for large-scale data processing.

* Experience with Databricks MLflow Model Registry and/or Unity Catalog (or similar governance) for managing models, features, and controlled data access.

* Experience with feature stores, data versioning, and data quality frameworks.

* Experience with model serving and optimization (e.g., FastAPI, TorchServe, ONNX, quantization, batching, caching).

* Familiarity with the modern open-source LLM and embeddings ecosystem (e.g., Hugging Face Transformers, sentence-transformers) and experience applying it to NLP tasks when relevant.

* Experience with cloud ML services and distributed training patterns (Ray, Spark ML, Horovod, or similar).

* Experience implementing responsible AI practices (privacy, explainability, robustness, and security controls).
