1) Architecture & Solution Design

* Define reference architectures for GenAI systems: RAG, agentic orchestration, tool/function calling, multi-step reasoning workflows, memory patterns, and context strategies.

* Design multi-tenant and enterprise-scale GenAI platforms with clear separation of concerns: UI, orchestration, retrieval, inference, evaluation, and observability.

* Select model strategies: hosted LLMs vs. open-weight models, fine-tuning vs. prompting/RAG, latency and cost tradeoffs, and deployment patterns.

2) Agentic AI Orchestration & Tooling

* Architect agent systems (single/multi-agent), including:

  * Task decomposition, planners/executors, reflection/verification loops

  * Tool use patterns (APIs, databases, search, workflow engines)

  * Guardrails to prevent unsafe tool actions and hallucinated commands

* Build reliable flows for “human-in-the-loop” decision points and approvals (e.g., procurement, customer communications, incident triage).
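As an illustration of the guardrail and approval patterns listed above, here is a minimal sketch, not a prescribed implementation. It assumes a hypothetical tool registry (`REGISTRY`), two made-up tools (`search_docs`, `create_purchase_order`), and an `approve` callback standing in for a human reviewer. Tool names the model invents are rejected because they are not in the registry, and high-impact tools run only after approval.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolSpec:
    """Describes one tool the agent is allowed to call (hypothetical schema)."""
    name: str
    handler: Callable[..., Any]
    needs_approval: bool = False  # True for high-impact actions


class ToolCallRejected(Exception):
    """Raised when a requested tool call is blocked by a guardrail."""


def execute_tool_call(
    registry: Dict[str, ToolSpec],
    name: str,
    args: Dict[str, Any],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    # Guardrail 1: only registered tools can run, so a hallucinated
    # tool name is rejected instead of being executed.
    spec = registry.get(name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {name}")
    # Guardrail 2: high-impact tools need an explicit human decision.
    if spec.needs_approval and not approve(name, args):
        raise ToolCallRejected(f"approval denied: {name}")
    return spec.handler(**args)


# Hypothetical tools, for illustration only.
def search_docs(query: str) -> str:
    return f"results for {query}"


def create_purchase_order(vendor: str, amount: float) -> str:
    return f"PO created for {vendor}: {amount:.2f}"


REGISTRY = {
    "search_docs": ToolSpec("search_docs", search_docs),
    "create_purchase_order": ToolSpec(
        "create_purchase_order", create_purchase_order, needs_approval=True
    ),
}
```

In a real system the `approve` callback would typically pause the workflow and wait for a reviewer's decision rather than return immediately.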

3) Retrieval, Knowledge Systems & Data Design

* Lead design of knowledge ingestion pipelines:

  * document parsing, chunking strategies, embeddings, metadata, lineage, freshness SLAs

* Architect vector search and hybrid retrieval:

  * semantic + keyword, reranking, filtering, ACL-aware retrieval

* Ensure retrieval respects access control, PII handling, data residency, and auditability.
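To make "ACL-aware hybrid retrieval" concrete, the sketch below filters chunks by the caller's groups before any ranking happens, then blends a semantic score with a simple keyword-overlap score. Everything here is an assumption for illustration: the `Chunk` fields, the `alpha` weighting, and the idea that `semantic_score` was already produced by a vector search step.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]  # groups permitted to read this chunk
    semantic_score: float           # assumed output of a vector search step


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the chunk text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def retrieve(
    query: str,
    chunks: List[Chunk],
    user_groups: Set[str],
    alpha: float = 0.5,
    top_k: int = 3,
) -> List[Chunk]:
    # Apply access control first, so restricted text never reaches
    # the ranking step or the prompt.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    # Hybrid score: weighted blend of semantic and keyword relevance.
    ranked = sorted(
        visible,
        key=lambda c: alpha * c.semantic_score
        + (1 - alpha) * keyword_score(query, c.text),
        reverse=True,
    )
    return ranked[:top_k]
```

Filtering before ranking, rather than after, is a deliberate choice in this sketch: it keeps restricted content out of the candidate set entirely, which also simplifies auditing.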

4) Production Engineering, Reliability & Cost

* Set non-functional requirements for GenAI workloads:

  * SLOs, latency budgets, fallback models, caching, rate limiting

* Design cost controls: prompt/token optimization, model routing, batching, and usage governance.

* Implement resiliency patterns: circuit breakers, retries, queue-based orchestration, idempotency.
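The circuit-breaker and fallback-model ideas above can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: `CircuitBreaker`, `complete_with_fallback`, and the `primary`/`fallback` callables are hypothetical names, and the thresholds are arbitrary.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure during the trial reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def complete_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    breaker: CircuitBreaker,
) -> str:
    """Try the primary model through the breaker; on any failure use the fallback."""
    try:
        return breaker.call(primary, prompt)
    except Exception:
        return fallback(prompt)
```

Once the breaker opens, the primary model is no longer called until the cool-down passes, which protects both latency budgets and spend during an outage.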

5) Security, Risk & Responsible AI

* Establish AI security posture:

  * prompt injection defenses, data exfiltration controls, tool sandboxing

* Define policies and controls for:

  * sensitive data, logging, redaction, encryption, secret management, and auditing

* Collaborate with risk/compliance to drive:

  * model governance, content safety, bias/quality monitoring, and regulatory alignment

6) Evaluation, Observability & Continuous Improvement

* Create evaluation frameworks:

  * offline evals (golden sets), automated regression, and scenario-based testing

* Instrument systems for observability:

  * traces, prompt versioning, retrieval diagnostics, tool-call logs, and outcome metrics

* Run A/B tests and iterate on prompts, retrieval, and agent policies based on measurable outcomes.
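A golden-set regression check of the kind mentioned above might look like the sketch below. It is deliberately simple and rests on assumptions: the case format (`question`, `must_contain`), the keyword-based pass criterion, and the `min_pass_rate` threshold are all hypothetical, and `generate` stands in for whatever system is under test.

```python
from typing import Callable, Dict, List, Tuple


def run_golden_set(
    generate: Callable[[str], str],
    golden_cases: List[Dict],
    min_pass_rate: float = 0.9,
) -> Tuple[bool, float, List[Tuple[str, List[str]]]]:
    """Run every golden case and report whether the pass rate meets the bar."""
    if not golden_cases:
        raise ValueError("golden set is empty")
    failures = []
    for case in golden_cases:
        answer = generate(case["question"]).lower()
        # A case passes only if every required phrase appears in the answer.
        missing = [kw for kw in case["must_contain"] if kw.lower() not in answer]
        if missing:
            failures.append((case["question"], missing))
    pass_rate = 1 - len(failures) / len(golden_cases)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

Run in CI, a check like this can block a prompt or retrieval change that lowers the pass rate; phrase matching is only a starting point, and rubric or model-graded scoring could replace it.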

7) Leadership & Stakeholder Management

* Partner with product leaders to identify high-value use cases and define the roadmap.

* Mentor engineers and data scientists on best practices for LLM apps.

* Produce architecture artifacts: ADRs, threat models, system diagrams, runbooks.

Required Skills & Experience

Core Technical Skills (Must Have)

* 8+ years in software/solution architecture, including 2+ years delivering GenAI/LLM solutions in production.

* Strong knowledge of LLMs: prompting patterns, context windows, tool/function calling, model limitations, and safety risks.

* Agentic AI design experience:

  * orchestrators, workflows, multi-step reasoning, tool usage, HITL patterns

* RAG expertise:

  * embeddings, vector DBs, hybrid retrieval, reranking, chunking strategies, evaluation

* Cloud architecture (Azure/AWS/GCP) with production engineering rigor:

  * microservices, containers (Docker/K8s), serverless, CI/CD

* Solid programming skills (one or more):

  * Python, TypeScript/JavaScript, Java, C#

* Experience with APIs and integration patterns:

  * REST/gRPC, event-driven systems, queues, workflow engines

Security & Governance (Must Have)

* Understanding of GenAI-specific threats:

  * prompt injection, data leakage, jailbreaks, insecure tool calling

* Familiarity with enterprise controls:

  * IAM, key management, encryption, network isolation, audit logging

* Responsible AI practices:

  * evaluation, content moderation, privacy, and compliance-by-design

Architecture & Systems Skills (Must Have)

* Distributed system design:

  * scalability, fault tolerance, caching, performance tuning

* Observability:

  * logging/metrics/tracing, prompt version tracking, monitoring SLIs/SLOs

* Cost management and performance optimization:

  * model selection/routing, token reduction, caching, batching

Preferred / Nice-to-Have Skills

* Fine-tuning approaches:

  * LoRA/QLoRA, instruction tuning, adapters, distillation (when appropriate)

* Experience with:

  * knowledge graphs, semantic layers, enterprise search

* Advanced evaluation:

  * LLM-as-judge with safeguards, rubric scoring, adversarial testing

* MLOps/LLMOps toolchains:

  * experiment tracking, feature stores, model registries, data quality tools

* Domain experience:

  * customer support automation, developer productivity copilots, IT ops agents, finance or healthcare compliance

* Experience building platforms:

  * reusable agent frameworks, reusable RAG components, multi-team enablement

Artificial Intelligence Specialist