Senior Site Reliability Engineer

Galent
Toronto, Ontario, Canada
Contract
AI tools:
Dynatrace
Splunk
Moogsoft
PagerDuty
Ansible
GitHub Actions
Python
Applications go directly to the hiring team

Full Description

Job Description: SRE / AI Ops Engineer

Location: Toronto, ON, Canada (onsite from Day 1)

Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions across our production environments. This role blends deep operational expertise with modern AI‑driven observability, monitoring, and automation practices. You will work with industry‑leading tools—Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python—to create proactive, self‑healing, AI‑enhanced workflows that elevate system reliability and reduce manual toil.

This is a hands‑on engineering role for someone who thrives at the intersection of SRE, automation, and AI‑powered operations.

Key Responsibilities

AI‑Driven Observability & Monitoring

* Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft, leveraging their AI/ML capabilities (e.g., Davis AI, Splunk ITSI, Moogsoft AIOps) to:

  * Detect anomalies

  * Predict incidents

  * Correlate events across distributed systems

  * Reduce alert noise through intelligent clustering

AI Ops Workflow Engineering

* Design and build AI‑powered operational workflows that automate:

  * Incident detection

  * Root cause analysis

  * Remediation actions

  * Post‑incident insights

* Integrate AI insights from observability platforms into automated pipelines and runbooks.

Incident Response & Automation

* Configure and manage PagerDuty for intelligent alerting, escalation policies, and automated incident response.

* Build self‑healing automation using Ansible, Python, and GitHub Actions.

* Develop automated remediation playbooks triggered by AI‑driven events.

Platform Reliability & SRE Practices

* Apply SRE principles such as SLOs, SLIs, error budgets, and chaos testing.

* Improve system reliability through automation, performance tuning, and proactive engineering.

* Reduce operational toil by designing scalable, automated solutions.

DevOps & CI/CD Integration

* Use Git and GitHub Actions to build automated pipelines that integrate:

  * Observability signals

  * AI‑driven quality gates

  * Automated rollback and recovery workflows

Python Scripting & Tooling

* Develop Python‑based automation, data processing, and AI‑enhanced operational tooling.

* Build integrations between monitoring platforms, ticketing systems, and automation engines.

Required Skills & Experience

Core Technical Skills

* Hands‑on experience with:

  * Dynatrace (including Davis AI)

  * Splunk (ITSI, Machine Learning Toolkit preferred)

  * Moogsoft AIOps

  * PagerDuty

  * Ansible

  * Git & GitHub Actions

  * Python scripting

AI Ops & Automation

* Experience leveraging AI/ML features within observability and incident‑management tools.

* Ability to design automated workflows that use AI insights for:

  * Event correlation

  * Predictive alerting

  * Automated remediation

  * Intelligent routing

SRE Expertise

* Strong understanding of distributed systems, cloud infrastructure, and reliability engineering.

* Experience with SLO/SLI design, error budgets, and performance optimization.

* Familiarity with containerized environments (Kubernetes, Docker) is a plus.

Soft Skills

* Strong analytical mindset with a passion for automation and continuous improvement.

* Excellent communication and cross‑team collaboration abilities.

* Ability to translate operational challenges into scalable engineering solutions.

Preferred Qualifications

* Experience with the Red Hat OpenShift cloud platform.

* Exposure to LLM‑based automation or generative AI for operational workflows.

* Background in building or integrating with ChatOps frameworks.

* Knowledge of event‑driven architectures and message queues.

What You’ll Achieve

In this role, you will help transform traditional application and infrastructure operations into a modern, AI‑enhanced reliability ecosystem. You’ll build systems that not only detect and respond to issues but learn from them—driving a future where operations are predictive, automated, and intelligent.
