
AI Architect

Centific
Hyderabad, Telangana, India
Full-time
AI tools: LLMs
Applications go directly to the hiring team

Full Description

Location

Hybrid / Remote

Role Type

Full-time | Principal Level

Role Overview

Centific’s DAC (Digital Architecture & Cognitive) Command is expanding its global architecture unit to build and operationalize agentic, AI-driven business automation at production scale.

In this role, you will act as the end-to-end design authority for agentic inference solutions—owning outcomes from blueprint to live operations. You will architect multi-agent systems, runtime orchestration, and operational guardrails that meet demanding non-functional requirements (latency, reliability, cost, and security).

This is a hands-on role. You will prototype reference implementations, tune runtime behavior, and partner with engineering, platform, security, and product stakeholders to deliver production-first agentic systems.

Key Responsibilities

1. Agentic System Architecture & Orchestration

* Design multi-agent architectures (planner–executor, supervisor loops, routing/dispatch, delegation, reflection/verification patterns) aligned to business workflows.

* Define orchestration mechanisms for state/session handling, memory (short/long-term), tool invocation, retrieval/RAG, and structured I/O.

* Establish standards for prompt/agent templates, tool/skill contracts, agent-to-agent messaging, and deterministic fallbacks.

* Create reference implementations that teams can extend safely (agent frameworks, orchestration services, reusable libraries).

2. NFR-Driven Design for Production Inference

* Own non-functional design (latency, throughput, scalability, reliability, availability, cost) as first-class requirements.

* Design for performance and cost: token budgeting, caching strategies, batching, streaming responses, concurrency controls, and adaptive routing.

* Define resilience patterns: timeouts, retries, circuit breakers, idempotency, queue back-pressure, graceful degradation, and safe-mode behavior.

* Drive architecture decisions that balance quality vs. cost vs. speed—documenting trade-offs and expected SLOs/SLAs.

3. Solution Blueprint Ownership & End-to-End Delivery

* Own the end-to-end solution blueprint from concept through production rollout (architecture, integration, testing, operations).

* Translate business intent into system decomposition (services, agents, tools, data flows) with clear ownership boundaries and contracts.

* Collaborate with Solution Blueprint Architects, Platform Architects, Data/Governance, and Security/Compliance to align constraints early.

* Deliver architecture artifacts: sequence diagrams, decision records (ADRs), integration specs, runbooks, acceptance criteria, and launch checklists.

4. Integration Governance & Platform Compatibility

* Set integration standards for APIs/events (versioning, compatibility contracts, error semantics, schema governance).

* Define interfaces for tool invocation (capabilities registry, permissions, rate limits, safe parameterization).

* Ensure agentic systems integrate cleanly with enterprise platforms (IAM, logging, monitoring, workflow engines, data platforms).

* Partner with enterprise architecture to ensure interoperability across domains and prevent fragmentation.

5. Operational Readiness & Reliability

* Design and enforce operational guardrails: monitoring, alerting, evaluation hooks, rollback plans, and safety kill-switches.

* Establish runbooks for incident response, model/agent degradation, and dependency failures (tools, data sources, external APIs).

* Define observability standards for agent traces, tool calls, prompts/responses, evaluation scores, and cost telemetry.

* Lead postmortems and reliability improvements; ensure corrective actions are implemented and verified.

6. Technical Leadership & Enablement

* Act as a principal technical leader—aligning cross-functional teams on architecture, roadmap, and delivery priorities.

* Mentor engineers/architects on agentic design patterns, evaluation, and production hardening.

* Drive reuse: shared components, gold-standard reference flows, and platform primitives that accelerate delivery.

* Contribute to architecture councils/design reviews; influence standards and best practices across DAC Command.

Required Experience & Skills

Core Experience

* 10–15+ years in software/platform engineering, including 5+ years in solution/AI/platform architecture roles.

* Proven delivery of production-grade AI/LLM systems (not just prototypes), including operational ownership after launch.

* Strong background in distributed systems, API/event-driven integration, and reliability engineering.

Agentic AI & LLM Runtime Expertise (Hands-On)

* Deep experience with agentic patterns: multi-agent coordination, planning, tool calling, routing, memory, and state management.

* Experience optimizing LLM inference: caching, batching, token/latency management, throughput tuning, and quality-cost trade-offs.

* Strong understanding of evaluation strategies (offline/online), prompt/agent regression testing, and release gates.

* Familiarity with common orchestration frameworks and patterns (e.g., graph-based agent flows, tool registries, function calling).

Platform & Operations

* Strong cloud-native architecture experience (AWS/Azure/GCP), microservices, event streaming, and container/Kubernetes ecosystems.

* Hands-on experience with observability stacks (logs/metrics/traces), SLOs and error budgets, incident response practices, and postmortems.

* Ability to design secure-by-default tool access patterns (least privilege, scoped tokens, auditability).

Soft Skills & Ways of Working

* Production-first mindset: design for operability, safety, and reliability from day one.

* Strong systems thinking: can reason across product, platform, data, security, and cost dimensions.

* Clear communicator: able to explain architecture trade-offs to engineers, product, and executive stakeholders.

* Bias for action: prototypes quickly, then codifies reusable standards and reference implementations.

* Collaborative leadership: aligns teams without relying on formal authority.

Nice-to-Have / Preferred

* Experience with large-scale workflow orchestration and automation platforms (BPM/workflow engines, event-driven pipelines).

* Experience implementing agent observability and evaluation harnesses at scale.

* Background in regulated environments (SOC 2, HIPAA, PCI, CJIS) and designing AI systems with audit-ready traces.

* Open-source contributions, talks, or published work in agentic systems, LLM infrastructure, or reliability engineering.

What Success Looks Like (First 12–18 Months)

* Agentic reference architectures and runtime standards are adopted across DAC Command deliveries.

* Production deployments meet defined SLOs for latency, availability, and cost; incident rates decline over time through reliability improvements.

* Reusable orchestration primitives (routing, memory, tool registry, evaluation hooks) accelerate new use cases and reduce duplication.

* Integration governance prevents fragmentation—APIs/events are versioned, compatible, and observable.

* Teams trust the platform: safe rollouts, clear runbooks, and measurable quality/cost improvements are in place.
