
PySpark Data Engineer with Databricks

Capgemini
New York, United States
Full-time
9,000,000 – 10,000,000 / year
AI tools: Databricks, MLflow
Applications go directly to the hiring team

Full Description

Position Title: PySpark Data Engineer with Databricks

Location: New York, NY (Onsite/Hybrid)

Experience: 8+ Years

Employee Type: Full Time with Benefits

Note: Candidates must be able to attend an in-person interview at the New York location.

Job Description

We are looking for a hands-on, mid-to-senior-level PySpark Data Engineer with Databricks experience who can design, build, and own production-grade data pipelines and platform components. This role requires strong expertise in Python/PySpark, Databricks, and Snowflake, with a focus on building scalable, cost-efficient, and reliable data systems that support both analytics and machine learning use cases.

Key Responsibilities

* Design, develop, and maintain end‑to‑end ETL/ELT pipelines using Python and PySpark on Databricks.

* Optimize Spark jobs for performance, scalability, and cost-efficiency in production environments.

* Implement data quality frameworks including validation, reconciliation, and anomaly detection.

* Build and manage orchestration workflows (Airflow / Databricks Workflows / equivalent).

* Implement pipeline monitoring, logging, alerting, and observability for reliable operations.

* Develop and operationalize ML workflows using MLflow (experiment tracking, model registry, packaging, deployment).

* Build scalable data ingestion and data modeling solutions for analytics and ML use cases.

* Collaborate with data scientists, platform teams, engineering stakeholders, and business partners.

Required Skills & Qualifications

* 8+ years of experience in data engineering with strong hands‑on work in PySpark and Python.

* Deep experience with Databricks, Spark optimization, cluster tuning, and performance troubleshooting.

* Strong experience working with Snowflake or similar cloud data warehouses.

* Practical knowledge of workflow orchestration tools and dependency management.

* Solid understanding of data modeling, ingestion frameworks, and distributed systems architecture.

* Hands‑on experience implementing CI/CD for data and ML pipelines.

* Strong experience with MLflow for managing the ML lifecycle.

* Excellent communication skills with the ability to work across engineering and business teams.

Nice-to-Have Skills

* Exposure to AI/LLM use cases, vector search, or RAG pipelines.

* Familiarity with Java-based services or microservices architecture.

* Knowledge of data governance, cataloging, and security practices.
