Sr. Machine Learning Ops Engineer (Toronto - Hybrid) (Full-time / Contract to Hire)
TestingXperts
Title: Sr. Machine Learning Ops Engineer
Location: Charles Street West, Toronto (Hybrid)
Type: FTE or Contract to Hire
Role Overview
We are seeking a Senior Machine Learning Ops Engineer to design, build, and deploy ML solutions that turn data into measurable business impact. This is a hands-on engineering role focused on developing end-to-end ML pipelines (data preparation, feature engineering, model training, evaluation, and production deployment) using Python and an open-source AI/ML stack. You will collaborate with data engineering and platform teams and work in environments that may include Databricks and Spark for scalable data processing and model operations.
Key Objectives
* Deliver production-grade ML models and data products from discovery through deployment.
* Build repeatable, maintainable ML engineering patterns for training, evaluation, and inference.
* Improve model quality, reliability, and performance through robust testing, monitoring, and iteration.
* Partner with data and platform teams to leverage scalable compute and data platforms (including Databricks/Spark) while meeting security and governance requirements.
Primary Responsibilities
* Design, develop, and iterate on machine learning models for classification, regression, clustering, recommendation, forecasting, and/or NLP use cases as needed.
* Build end-to-end ML pipelines in Python: data ingestion and preparation, feature engineering, training, evaluation, and batch/real-time inference.
* Apply sound experimentation practices: baselines, ablation studies, cross-validation (as applicable), and clear success metrics aligned to business outcomes.
* Develop and maintain reusable ML code (packages, utilities, pipelines) with strong software engineering practices (tests, code review, documentation, CI/CD).
* Implement model evaluation and testing: offline benchmarks, data/label quality checks, reproducible training runs, and regression tests to prevent performance degradation.
* Operationalize MLOps: model versioning, experiment tracking, model registry, automated deployments, and monitoring for drift, bias, latency, and cost.
* Integrate ML services with product systems via APIs and event-driven patterns; collaborate on feature stores, data contracts, and production SLAs.
* Leverage open-source AI/ML components (e.g., scikit-learn, PyTorch/TensorFlow, XGBoost/LightGBM, Hugging Face ecosystem) and choose the right tool for accuracy, latency, and maintainability.
* Collaborate with data engineering and platform teams to use Databricks/Spark for large-scale ETL, feature computation, distributed training (where relevant), and scheduled jobs.
* Ensure solutions follow security, privacy, and responsible AI practices, including safe handling of sensitive data and auditability of model decisions.
Required Skills & Experience
* Strong software engineering experience in Python (clean architecture, API design, testing, packaging, performance tuning).
* Hands-on experience building and deploying machine learning models in production environments.
* Proficiency with common ML libraries and frameworks (e.g., scikit-learn, PyTorch or TensorFlow; XGBoost/LightGBM as applicable).
* Experience with data processing in Python (e.g., pandas, NumPy) and strong SQL fundamentals.
* Understanding of ML concepts (bias/variance, regularization, feature leakage, evaluation metrics, calibration) and ability to select appropriate metrics for the use case.
* Experience with MLOps practices and tooling (e.g., MLflow or equivalent), including experiment tracking, model versioning, and reproducible training.
* Experience deploying services (Docker, CI/CD) and operating them with monitoring/observability practices.
* Ability to communicate tradeoffs clearly—balancing accuracy, latency, cost, reliability, and risk.
Preferred / Nice to Have
* Awareness of Databricks concepts (workspaces, notebooks, jobs, clusters) and practical experience with Spark for large-scale data processing.
* Experience with Databricks MLflow Model Registry and/or Unity Catalog (or similar governance) for managing models, features, and controlled data access.
* Experience with feature stores, data versioning, and data quality frameworks.
* Experience with model serving and optimization (e.g., FastAPI, TorchServe, ONNX, quantization, batching, caching).
* Familiarity with the modern open-source LLM and embeddings ecosystem (e.g., Hugging Face Transformers, sentence-transformers) and experience applying it to NLP tasks when relevant.
* Experience with cloud ML services and distributed training patterns (Ray, Spark ML, Horovod, or similar).
* Experience implementing responsible AI practices (privacy, explainability, robustness, and security controls).