AI Architect
Centific
Location: Hybrid / Remote
Role Type: Full-time | Principal Level
Role Overview
Centific’s DAC (Digital Architecture & Cognitive) Command is expanding its global architecture unit to build and operationalize agentic, AI-driven business automation at production scale.
In this role, you will act as the end-to-end design authority for agentic inference solutions—owning outcomes from blueprint to live operations. You will architect multi-agent systems, runtime orchestration, and operational guardrails that meet demanding non-functional requirements (latency, reliability, cost, and security).
This is a hands-on role. You will prototype reference implementations, tune runtime behavior, and partner with engineering, platform, security, and product stakeholders to deliver production-first agentic systems.
Key Responsibilities
1. Agentic System Architecture & Orchestration
* Design multi-agent architectures (planner–executor, supervisor loops, routing/dispatch, delegation, reflection/verification patterns) aligned to business workflows.
* Define orchestration mechanisms for state/session handling, memory (short/long-term), tool invocation, retrieval/RAG, and structured I/O.
* Establish standards for prompt/agent templates, tool/skill contracts, agent-to-agent messaging, and deterministic fallbacks.
* Create reference implementations that teams can extend safely (agent frameworks, orchestration services, reusable libraries).
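The orchestration patterns above can be sketched as a minimal planner–executor loop with a reflection/verification gate. Everything here is illustrative: the task names, the deterministic stubs standing in for LLM calls, and the `AgentState` memory structure are assumptions, not a prescribed framework.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A unit of work produced by the planner."""
    name: str
    payload: dict

@dataclass
class AgentState:
    """Short-term memory shared across the loop."""
    history: list = field(default_factory=list)

def plan(goal: str) -> list[Task]:
    # In production this would be an LLM call returning structured tasks;
    # stubbed here with a fixed decomposition for illustration.
    return [Task("research", {"goal": goal}), Task("summarize", {"goal": goal})]

def execute(task: Task, state: AgentState) -> str:
    # The executor would dispatch to a registered tool/skill; stubbed here.
    result = f"done:{task.name}"
    state.history.append((task.name, result))
    return result

def verify(result: str) -> bool:
    # Reflection/verification gate before a result is accepted downstream.
    return result.startswith("done:")

def run(goal: str) -> list[str]:
    state = AgentState()
    results = []
    for task in plan(goal):
        out = execute(task, state)
        if not verify(out):
            raise RuntimeError(f"verification failed for {task.name}")
        results.append(out)
    return results
```

The same skeleton extends to supervisor loops (wrap `run` in an outer agent that re-plans on verification failure) or routing/dispatch (have `execute` select among specialist agents).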
2. NFR-Driven Design for Production Inference
* Own non-functional design (latency, throughput, scalability, reliability, availability, cost) as first-class requirements.
* Design for performance and cost: token budgeting, caching strategies, batching, streaming responses, concurrency controls, and adaptive routing.
* Define resilience patterns: timeouts, retries, circuit breakers, idempotency, queue back-pressure, graceful degradation, and safe-mode behavior.
* Drive architecture decisions that balance quality, cost, and speed, documenting trade-offs and the expected SLOs/SLAs they imply.
3. Solution Blueprint Ownership & End-to-End Delivery
* Own the end-to-end solution blueprint from concept through production rollout (architecture, integration, testing, operations).
* Translate business intent into system decomposition (services, agents, tools, data flows) with clear ownership boundaries and contracts.
* Collaborate with Solution Blueprint Architects, Platform Architects, Data/Governance, and Security/Compliance to align constraints early.
* Deliver architecture artifacts: sequence diagrams, decision records (ADRs), integration specs, runbooks, acceptance criteria, and launch checklists.
4. Integration Governance & Platform Compatibility
* Set integration standards for APIs/events (versioning, compatibility contracts, error semantics, schema governance).
* Define interfaces for tool invocation (capabilities registry, permissions, rate limits, safe parameterization).
* Ensure agentic systems integrate cleanly with enterprise platforms (IAM, logging, monitoring, workflow engines, data platforms).
* Partner with enterprise architecture to ensure interoperability across domains and prevent fragmentation.
5. Operational Readiness & Reliability
* Design and enforce operational guardrails: monitoring, alerting, evaluation hooks, rollback plans, and safety kill-switches.
* Establish runbooks for incident response, model/agent degradation, and dependency failures (tools, data sources, external APIs).
* Define observability standards for agent traces, tool calls, prompts/responses, evaluation scores, and cost telemetry.
* Lead postmortems and reliability improvements; ensure corrective actions are implemented and verified.
6. Technical Leadership & Enablement
* Act as a principal technical leader—aligning cross-functional teams on architecture, roadmap, and delivery priorities.
* Mentor engineers/architects on agentic design patterns, evaluation, and production hardening.
* Drive reuse: shared components, gold-standard reference flows, and platform primitives that accelerate delivery.
* Contribute to architecture councils/design reviews; influence standards and best practices across DAC Command.
Required Experience & Skills
Core Experience
* 10–15+ years in software/platform engineering with 5+ years in solution/AI/platform architecture roles.
* Proven delivery of production-grade AI/LLM systems (not just prototypes), including ownership of their ongoing operation.
* Strong background in distributed systems, API/event-driven integration, and reliability engineering.
Agentic AI & LLM Runtime Expertise (Hands-On)
* Deep experience with agentic patterns: multi-agent coordination, planning, tool calling, routing, memory, and state management.
* Experience optimizing LLM inference: caching, batching, token/latency management, throughput tuning, and quality-cost trade-offs.
* Strong understanding of evaluation strategies (offline/online), prompt/agent regression testing, and release gates.
* Familiarity with common orchestration frameworks and patterns (e.g., graph-based agent flows, tool registries, function calling).
Platform & Operations
* Strong cloud-native architecture experience (AWS/Azure/GCP), microservices, event streaming, and container/Kubernetes ecosystems.
* Hands-on with observability stacks (logs/metrics/traces), SLO/error budgets, incident response practices, and postmortems.
* Ability to design secure-by-default tool access patterns (least privilege, scoped tokens, auditability).
Soft Skills & Ways of Working
* Production-first mindset: design for operability, safety, and reliability from day one.
* Strong systems thinking: can reason across product, platform, data, security, and cost dimensions.
* Clear communicator: able to explain architecture trade-offs to engineers, product, and executive stakeholders.
* Bias for action: prototypes quickly, then codifies reusable standards and reference implementations.
* Collaborative leadership: aligns teams without relying on formal authority.
Nice-to-Have / Preferred
* Experience with large-scale workflow orchestration and automation platforms (BPM/workflow engines, event-driven pipelines).
* Experience implementing agent observability and evaluation harnesses at scale.
* Background in regulated environments (SOC2, HIPAA, PCI, CJIS) and designing AI systems with audit-ready traces.
* Open-source contributions, talks, or published work in agentic systems, LLM infrastructure, or reliability engineering.
What Success Looks Like (First 12–18 Months)
* Agentic reference architectures and runtime standards are adopted across DAC Command deliveries.
* Production deployments meet defined SLOs for latency, availability, and cost; incident rates decline over time through reliability improvements.
* Reusable orchestration primitives (routing, memory, tool registry, evaluation hooks) accelerate new use cases and reduce duplication.
* Integration governance prevents fragmentation—APIs/events are versioned, compatible, and observable.
* Teams trust the platform: safe rollouts, clear runbooks, and measurable quality/cost improvements are in place.