PySpark Data Engineer with Databricks
Capgemini
Join Capgemini as a PySpark Data Engineer and leverage your expertise to design and build data pipelines in a collaborative environment. You'll have the opportunity to work with cutting-edge technologies like Databricks and Snowflake, impacting analytics and machine learning projects. This is a full-time role based in New York with a hybrid work option.
Key Responsibilities
Design and maintain end-to-end ETL/ELT pipelines using Python and PySpark.
Optimize Spark jobs for performance, scalability, and cost-efficiency.
Implement data quality frameworks including validation and anomaly detection.
Full Description
Position Title: PySpark Data Engineer with Databricks
Location: New York, NY (Onsite/Hybrid)
Experience: 8+ Years
Employee Type: Full-Time with Benefits
Note: Candidates must be comfortable attending an in-person interview at the New York location.
Job Description
We are looking for a hands-on, mid-to-senior-level PySpark Data Engineer with Databricks experience who can design, build, and own production-grade data pipelines and platform components. The role requires strong expertise in Python/PySpark, Databricks, and Snowflake, with a focus on building scalable, cost-efficient, and reliable data systems that support both analytics and machine learning use cases.
Key Responsibilities
* Design, develop, and maintain end‑to‑end ETL/ELT pipelines using Python and PySpark on Databricks.
* Optimize Spark jobs for performance, scalability, and cost-efficiency in production environments.
* Implement data quality frameworks including validation, reconciliation, and anomaly detection.
* Build and manage orchestration workflows (Airflow / Databricks Workflows / equivalent).
* Implement pipeline monitoring, logging, alerting, and observability for reliable operations.
* Develop and operationalize ML workflows using MLflow (experiment tracking, model registry, packaging, deployment).
* Build scalable data ingestion and data modeling solutions for analytics and ML use cases.
* Collaborate with data scientists, platform teams, engineering stakeholders, and business partners.
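To give candidates a flavor of the pipeline work described above, here is a minimal, illustrative PySpark sketch of an extract-transform-load step with a simple data-quality gate that writes to a Delta table. It assumes a Databricks environment; the paths, table, and column names (for example `/mnt/raw/orders/` and `analytics.orders_clean`) are hypothetical and not part of any actual Capgemini or client system.

```python
# Illustrative sketch only: paths, table names, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Extract: read raw order events from a (hypothetical) cloud storage location.
raw = spark.read.format("json").load("/mnt/raw/orders/")

# Transform: basic typing, cleansing, and de-duplication.
orders = (
    raw
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("amount", F.col("amount").cast("double"))
    .dropDuplicates(["order_id"])
)

# Simple data-quality gate: fail fast if required fields are missing.
bad_rows = orders.filter(F.col("order_id").isNull() | F.col("amount").isNull()).count()
if bad_rows > 0:
    raise ValueError(f"Data quality check failed: {bad_rows} rows with null keys or amounts")

# Load: write a Delta table for downstream analytics and ML consumers.
orders.write.format("delta").mode("overwrite").saveAsTable("analytics.orders_clean")
```

In practice, a job like this would be scheduled and monitored through Databricks Workflows, Airflow, or an equivalent orchestrator, in line with the orchestration and observability responsibilities listed above.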
Required Skills & Qualifications
* 8+ years of experience in data engineering with strong hands‑on work in PySpark and Python.
* Deep experience with Databricks, Spark optimization, cluster tuning, and performance troubleshooting.
* Strong experience working with Snowflake or similar cloud data warehouses.
* Practical knowledge of workflow orchestration tools and dependency management.
* Solid understanding of data modeling, ingestion frameworks, and distributed systems architecture.
* Hands‑on experience implementing CI/CD for data and ML pipelines.
* Strong experience with MLflow for managing the ML lifecycle.
* Excellent communication skills with the ability to work across engineering and business teams.
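As a small illustration of the MLflow lifecycle skills listed above, the following hedged sketch logs parameters, a metric, and a trained model from a run, and registers the model. The experiment path and registered model name (`/Shared/demand-forecast-demo`, `demand_forecast_demo`) are hypothetical, and registering a model assumes a tracking server with a Model Registry, such as a Databricks workspace.

```python
# Illustrative sketch only: experiment, metric, and model names are hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

mlflow.set_experiment("/Shared/demand-forecast-demo")  # hypothetical experiment path

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the fitted model; registered_model_name also adds it to the Model Registry.
    mlflow.sklearn.log_model(model, "model", registered_model_name="demand_forecast_demo")
```

Deployment from the registry (batch scoring or model serving) would then follow the team's CI/CD conventions for data and ML pipelines.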
Nice-to-Have Skills
* Exposure to AI/LLM use cases, vector search, or RAG pipelines.
* Familiarity with Java-based services or microservices architecture.
* Knowledge of data governance, cataloging, and security practices.