Research Scientist

Velvet
San Francisco, CA
Full-time
25,000,000 – 30,000,000 / year
AI tools: PyTorch
Applications go directly to the hiring team

Join Velvet as a Research Scientist and develop advanced models for audio and video enhancement in multimodal AI. You'll work with a team dedicated to making AI more human by improving the quality of training datasets, in the fast-paced environment of an early-stage company in San Francisco.

Skills & Expertise

Deep Learning
PyTorch
Audio Processing
Video Processing
Signal Processing
Experiment Design
Model Fine-tuning
Multimodal Learning

Key Responsibilities

Develop and fine-tune models for audio and video enhancement tasks.

Experiment with novel architectures to improve model performance on real-world data.

Build evaluation frameworks to measure enhancement quality and guide model improvements.

Full Description

About Us

Velvet is a data research company building the datasets that power the next generation of multimodal AI. The company was founded by Lucas Mantovani (ex-Meta FAIR) and Lucas Tucker (ex-Adobe Infra), and our mission is to make AI more human by producing high-quality audiovisual training data for frontier labs.

We're hiring a Research Scientist to develop and fine-tune models for video and audio data processing and enhancement, as well as to conduct data-oriented research that pushes the boundaries of multimodal quality.

What You'll Do

* Research, develop, and fine-tune models for audio and video enhancement — including denoising, super-resolution, speech restoration, and perceptual quality improvement — ensuring outputs meet the standards required for frontier model training.

* Experiment with novel architectures, training objectives, and data augmentation strategies to improve model performance across diverse and noisy real-world audiovisual data.

* Build evaluation frameworks and benchmarks to rigorously measure enhancement quality, guiding iterative model improvement.

* Collaborate with infrastructure and data pipeline engineers to integrate trained models into large-scale processing workflows that handle wide variation in speech, visual quality, and format.

What We're Looking For

* Strong research background in deep learning, with hands-on experience training and fine-tuning models for audio processing, video processing, or related domains.

* Proficiency in PyTorch and experience designing and running experiments at scale.

* Solid understanding of signal processing fundamentals and how they inform model design for enhancement tasks.

* A publication track record or demonstrated research output in relevant areas (audio/speech enhancement, video restoration, generative models, multimodal learning).

* Ability to work effectively in an early-stage environment where scope is broad and priorities shift fast.

Even Better

* Prior work at a frontier AI lab or data company focused on multimodal data.

* Experience fine-tuning large pretrained models (diffusion models, autoencoders, or transformer-based architectures) for perceptual quality tasks.

* Familiarity with perceptual quality metrics and human evaluation methodologies for audio and video.

* A track record of working with datasets spanning tens of thousands of hours of audio or video.

You'll Thrive Here If

* You're excited by applied research with immediate, visible impact on data quality and downstream model performance.

* You move fluidly between reading papers, writing training loops, and analyzing failure cases.

* You hold yourself to a high bar for rigor, because you understand that model quality directly determines the value of the data we produce.
