Perception Algorithm Engineer
Black Sesame Technologies Inc.
Autonomous Driving Multimodal Model Algorithm Engineer
VLM / VLA / World Model
Black Sesame Technologies builds high-performance AI algorithms and in-house chips for intelligent driving and beyond. As an Autonomous Driving Multimodal Model Algorithm Engineer, you will work on next-generation multimodal AI models for autonomous driving, including Vision-Language Models (VLM), Vision-Language-Action (VLA) models, and World Models.
You will collaborate with perception, prediction, planning, data, simulation, and deployment teams to integrate multimodal models with existing bird's-eye-view (BEV) perception and with two-stage and one-stage end-to-end (E2E) autonomous driving systems.
We are looking for candidates with hands-on experience in one or more of the following areas: Vision-Language Models, Vision-Language-Action Models, World Models.
Responsibilities
Multimodal Model Development for Autonomous Driving
* Work on one or more multimodal modeling directions for autonomous driving, including VLM-based scene understanding, VLA-style planning-oriented modeling, and World Model-based future prediction.
* Develop and optimize models that reason over multi-camera images, BEV features, map elements, object/lane instances, occupancy, trajectories, ego-motion, and driving context.
* Explore model architectures that connect perception, prediction, planning, and decision-making in two-stage and one-stage E2E autonomous driving systems.
* Collaborate with BEV perception and planning teams to improve representation quality, temporal consistency, long-tail robustness, and planning relevance.
Vision-Language and Vision-Language-Action Modeling
* Develop VLM-based methods for driving scene understanding, open-vocabulary perception, risk reasoning, corner-case analysis, and interpretable autonomy.
* Adapt and extend open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar models for autonomous driving scenarios.
* Research VLA-style models that map multimodal driving context, navigation intent, and high-level instructions to trajectories, actions, or planning representations.
* Align visual, BEV, map, object, lane, occupancy, trajectory, and language representations for driving-specific tasks.
* Build supervised fine-tuning, instruction-tuning, and efficient adaptation pipelines for driving-relevant multimodal tasks.
World Model and Future Prediction
* Build world-model-based approaches for future BEV, occupancy, object motion, lane evolution, traffic interaction, and ego-conditioned scene rollout.
* Explore generative and predictive modeling methods such as diffusion models, autoregressive transformers, latent dynamics models, video prediction, and BEV prediction.
* Use learned world models for scenario generation, counterfactual reasoning, long-tail case mining, planning evaluation, and closed-loop analysis.
* Work with simulation and data teams to improve safety-critical scenario discovery and model-based evaluation.
Efficient Adaptation and Deployment
* Apply efficient fine-tuning and adaptation methods such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, or other parameter-efficient fine-tuning (PEFT) techniques.
* Develop multimodal feature alignment modules, including projection heads, query adapters, cross-attention modules, tokenization strategies, and representation converters.
* Optimize model architecture, latency, memory footprint, and compute cost for automotive deployment.
* Apply distillation, quantization, pruning, sparse computation, and efficient attention methods where appropriate.
* Collaborate with chip, compiler, runtime, and deployment teams to adapt multimodal models to in-house automotive AI hardware.
Research, Evaluation, and Iteration
* Track the latest research in VLM, VLA, World Models, BEV perception, E2E driving, robotics foundation models, generative simulation, and multimodal learning.
* Design evaluation metrics for reasoning quality, grounding accuracy, temporal consistency, prediction quality, planning relevance, and safety-critical scenarios.
* Perform systematic failure analysis and drive data/model iteration based on real-world autonomous driving cases.
* Contribute to patents, technical reports, internal research platforms, and conference or journal publications.
Qualifications
* MS or PhD in Computer Science, Electrical Engineering, Robotics, Artificial Intelligence, or a related field.
* Strong background in deep learning, computer vision, multimodal learning, robotics, or autonomous driving.
* Hands-on experience in one or more of the following areas:
  * Vision-Language Models, multimodal large models, or open-source VLM adaptation
  * Vision-Language-Action models, robotics foundation models, or action-conditioned modeling
  * World models, generative prediction, latent dynamics modeling, or future scene simulation
  * BEV perception, multi-view 3D perception, or end-to-end autonomous driving
  * Motion prediction, planning, trajectory generation, or closed-loop evaluation
* Practical experience with open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, BLIP-style models, Flamingo-style models, or similar systems.
* Solid understanding of multimodal feature alignment, including vision-language alignment, cross-modal attention, visual tokenization, projection layers, query-based fusion, or embedding-space alignment.
* Experience with efficient fine-tuning or adaptation methods, such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, supervised fine-tuning, or instruction tuning.
* Proficient in PyTorch and capable of modifying, training, debugging, and evaluating deep learning models.
* Familiar with transformer architectures, attention mechanisms, temporal modeling, and large-scale training.
* Experience with multimodal data, such as camera, radar, LiDAR, IMU, map, trajectory, language, or structured driving data.
* Strong engineering ability in Python; C++/CUDA/TensorRT experience is a plus.
* Comfortable with Git, Docker, Linux, distributed training, and collaborative development workflows.
* Strong communication skills and ability to work across perception, planning, data, simulation, and deployment teams.
Preferred Qualifications
* Experience adapting or fine-tuning VLM/VLA models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar architectures.
* Experience with Hugging Face Transformers, PEFT, DeepSpeed, FSDP, vLLM, SGLang, TensorRT-LLM, or similar training/inference frameworks.
* Experience building multimodal instruction datasets, driving-scene QA datasets, grounding datasets, scene-reasoning datasets, or planner-oriented supervision signals.
* Experience aligning multimodal model representations with BEV features, object queries, lane instances, occupancy grids, map vectors, trajectories, or planner inputs.
* Experience with autonomous driving architectures such as BEVFormer, DETR/DINO, MapTR/MapQR, occupancy networks, diffusion planners, trajectory transformers, or similar models.
* Experience with world models, generative models, video prediction, future BEV prediction, occupancy forecasting, learned simulation, or closed-loop evaluation.
* Experience with efficient adaptation of large models, including LoRA/QLoRA, distillation, quantization, pruning, sparse attention, or lightweight adapter design.
* Experience deploying deep learning models on automotive SoCs, ASICs, GPUs, or edge AI accelerators.
* Publications in CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, CoRL, ICRA, IROS, RSS, or related autonomous driving and robotics venues, or equivalently strong project experience.
* Strong ability to convert research ideas into robust production systems.
* Experience with AI agent tools and basic harness engineering, including building evaluation scripts, task runners, automated workflows, tool-use pipelines, and reproducible testing environments for model or agent development.