Artificial Intelligence Engineer
Company Description
Company.ai is building a network of category-defining AI products in stealth mode.
Examples of what we are creating (and are actively expanding):
* Email.com
* How.com
* Face.com
* Notebook.com
* Queen.com
If you see leverage where others see complexity, you will feel at home here.
Each product stands alone, and together they compound through shared identity, shared context, shared distribution, and shared learning loops.
Role Description
You will design and ship agentic capabilities that power our vertical products. Your work will move from hypothesis to production quickly, because we optimize for shipping, measurement, and iteration.
This is applied research with consequences.
Reliability, evaluation, safety, and product feel matter as much as model quality. You will focus on:
* Reasoning and planning that survives real world messiness
* Memory that is useful, minimal, and correct (not creepy, not noisy)
* Tool use across real interfaces (web, email, calendars, docs, code)
* Personalization that improves outcomes without compromising trust
* Evaluation pipelines that predict failures before users find them
* Guardrails that keep behavior safe, aligned, and explainable
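To make the structured-output and guardrail themes above concrete, here is a minimal sketch (the tool registry, tool names, and fallback behavior are invented for illustration, not Company.ai's actual stack) of validating a model-emitted tool call before executing it:

```python
import json

# Hypothetical tool registry: tool name -> set of required argument keys.
TOOLS = {
    "send_email": {"to", "subject", "body"},
    "search_web": {"query"},
}

def validate_tool_call(raw: str):
    """Parse a model-emitted tool call and check it against the schema.

    Returns (tool_name, args) on success, or None so the caller can
    retry or fall back to a safe default instead of executing garbage.
    """
    try:
        call = json.loads(raw)
        name, args = call["tool"], call["args"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output: trigger a retry or fallback
    if name not in TOOLS or set(args) != TOOLS[name]:
        return None  # unknown tool or wrong arguments
    return name, args

# A well-formed call passes; a hallucinated tool is rejected.
ok = validate_tool_call('{"tool": "search_web", "args": {"query": "status"}}')
bad = validate_tool_call('{"tool": "delete_disk", "args": {}}')
```

Rejecting the call rather than best-effort executing it is the point: predictable failure plus a fallback beats unpredictable success.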
You will also work directly inside products, because the best agent research comes from real tasks, real users, and real constraints. In this role, success looks like:
* You ship one major capability that measurably improves completion rate on a core workflow in one product
* You build an evaluation harness that catches regressions and exposes the top failure modes with clear metrics
* You define a reliability bar the whole team uses, and the product gets better every week because of it
Nice to Have
* Experience with agent benchmarks, tool use, computer use, or multi step workflow execution
* Experience with reliability research: verification, calibration, interpretability, safety testing
* Experience with retrieval, memory, and personalization systems at scale
Day to Day Responsibilities
* Invent and iterate on agent methods: tool planning, long horizon execution, retrieval and memory, preference learning, workflow decomposition, verification, and self correction
* Build fast experiment loops: dataset creation, training runs, ablations, analysis, and next hypotheses
* Own evaluation: automatic benchmarks, adversarial tests, human in the loop grading, reliability scoring, regression tracking
* Ship to production: partner with product engineers, instrument behavior, run A/B tests, improve real user success rates
* Harden the system: failure mode discovery, mitigation, monitoring, and safe defaults
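As an illustration of the regression-tracking idea above (case names and data are invented for the sketch), a tiny harness that compares an eval run against a stored baseline and surfaces exactly which cases went from pass to fail:

```python
# Minimal regression check: compare an eval run against a stored baseline.
def pass_rate(results):
    """results: list of (case_id, passed) tuples."""
    return sum(1 for _, ok in results if ok) / len(results)

def find_regressions(baseline, current):
    """Cases that passed in the baseline run but fail in the current run."""
    base = dict(baseline)
    return [cid for cid, ok in current if not ok and base.get(cid)]

# Hypothetical eval cases for an email-assistant workflow.
baseline = [("draft_reply", True), ("find_slot", True), ("summarize", False)]
current  = [("draft_reply", True), ("find_slot", False), ("summarize", False)]

regressions = find_regressions(baseline, current)  # ["find_slot"]
```

The aggregate pass rate alone hides the story here (it only drops from 2/3 to 1/3); listing the specific regressed cases is what makes the failure actionable.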
What We Are Looking For
* You want to build the next defaults in major AI verticals, not a one size fits all assistant
* You have shipped LLM powered features to real users and you care about measurable outcomes: completion rate, quality, retention, cost, latency
* You can own an end to end loop: problem framing, data, tuning or training, evaluation, deployment, monitoring, iteration
* You are strong on agent building blocks: planning, tool use, memory, retrieval, orchestration, long horizon execution
* You can make LLM behavior predictable: constraints, structured outputs, verification, tool reliability, fallbacks, and clear failure handling
* You think in failure modes and counterexamples, and you build evals that catch regressions before users do
* You are comfortable crossing layers: model behavior, prompts, tools, product UX, telemetry, infra, and systems debugging
* You build for trust: privacy aware personalization, safe defaults, transparent behavior under uncertainty, and reliability as a first class goal
* You have a builder mindset and you like messy real world workflows: browser automation, HTML selectors, APIs, edge cases, and recovering gracefully
* You raise the bar around you: mentoring, clear standards, strong reviews, and multiplying the team’s output
* You thrive in a fast, high ownership, in person environment in San Francisco
Qualifications
* Strong ML foundation with a real track record: shipped systems, meaningful open source, or research output that moved something forward
* Deep working knowledge of modern LLMs and transformers: tokenization, context management, KV cache behavior, decoding tradeoffs, scaling behavior
* Hands on experience with tuning and alignment methods such as SFT, DPO, RLHF, reward modeling, RLAIF, preference data pipelines
* Practical tool use expertise: function calling, schema design, structured outputs, validation, tool routing, retries, and tool error recovery
* Reliability chops: hallucination reduction, calibration, self checking, verification, constraint driven generation, and safe fallbacks
* Retrieval and memory experience: embeddings, RAG, reranking, chunking strategies, long context tradeoffs, memory that stays useful over time
* Evaluation ownership: automated benchmarks, adversarial tests, human in the loop scoring, regression suites, and metrics that predict real user outcomes
* Data craftsmanship: cleaning, deduplication, contamination checks, eval hygiene, synthetic data when appropriate, strong dataset discipline
* Production readiness: instrumentation, A/B testing, monitoring, on call quality, latency and throughput tradeoffs, cost aware iteration
* Systems and infra competence: data pipelines, training jobs, experiment tracking, reproducibility, and debugging under real constraints
* Strong product taste: you can simplify without losing power, and you care about the user experience as much as the model curve
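To ground the retrieval and reranking bullet, a minimal sketch (toy two-dimensional vectors stand in for learned embeddings; a real system would use an embedding model and an index) of ordering candidate chunks by cosine similarity to a query vector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rerank(query_vec, chunks):
    """chunks: list of (chunk_id, embedding). Most similar first."""
    return sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)

# Toy data: chunk "b" points almost exactly along the query direction.
query = [1.0, 0.0]
chunks = [("a", [0.0, 1.0]), ("b", [1.0, 0.1]), ("c", [0.5, 0.5])]
order = [cid for cid, _ in rerank(query, chunks)]  # ["b", "c", "a"]
```

In practice the interesting tradeoffs sit around this core: chunking strategy, reranking cost versus recall, and how long-context models change what needs retrieving at all.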
How We Work
* Hybrid, fast feedback loops, high trust
* Ship by default, measure everything that matters
* Direct communication, no politics, clean ownership and accountability
* High craft standards across research, engineering, and product
Compensation & Benefits
* Competitive salary and meaningful equity
* Top tier tooling and compute for serious iteration
* Relocation and immigration support when needed
* Build on brands that already have distribution and intent, because the domain is the category
How to Apply
* Your resume or LinkedIn
* Links to work you have shipped or built (GitHub, Hugging Face, demos, papers, products, writeups)
* One proposed agent flow for any Company.ai vertical you are excited about (How.com, Email.com, Face.com, Notebook.com, Queen.com). Keep it short, and include:
* The user goal and the trigger (what starts the agent)
* The steps the agent takes, including tools and actions
* What memory it needs, what it must not store, and how it stays trustworthy
* Failure modes and how the system recovers safely
* How you would evaluate it (success metrics plus reliability tests)
Optional but helpful:
* Which vertical you chose, and why you think it can become inevitable at scale