PySpark Data Engineer with Databricks
Capgemini
Position Title: PySpark Data Engineer with Databricks
Location: New York, NY (Onsite/Hybrid)
Experience: 8+ Years
Employee Type: Full Time with Benefits
Note: Candidates must be comfortable attending an in-person interview at the New York location.
Job Description
We are looking for a hands-on, mid-to-senior-level PySpark Data Engineer with Databricks who can design, build, and own production-grade data pipelines and platform components. This role requires strong expertise in Python/PySpark, Databricks, and Snowflake, with a focus on building scalable, cost-efficient, and reliable data systems that support both analytics and machine learning use cases.
Key Responsibilities
* Design, develop, and maintain end‑to‑end ETL/ELT pipelines using Python and PySpark on Databricks.
* Optimize Spark jobs for performance, scalability, and cost-efficiency in production environments.
* Implement data quality frameworks including validation, reconciliation, and anomaly detection.
* Build and manage orchestration workflows (Airflow / Databricks Workflows / equivalent).
* Implement pipeline monitoring, logging, alerting, and observability for reliable operations.
* Develop and operationalize ML workflows using MLflow (experiment tracking, model registry, packaging, deployment).
* Build scalable data ingestion and data modeling solutions for analytics and ML use cases.
* Collaborate with data scientists, platform teams, engineering stakeholders, and business partners.
Required Skills & Qualifications
* 8+ years of experience in data engineering with strong hands‑on work in PySpark and Python.
* Deep experience with Databricks, Spark optimization, cluster tuning, and performance troubleshooting.
* Strong experience working with Snowflake or similar cloud data warehouses.
* Practical knowledge of workflow orchestration tools and dependency management.
* Solid understanding of data modeling, ingestion frameworks, and distributed systems architecture.
* Hands‑on experience implementing CI/CD for data and ML pipelines.
* Strong experience with MLflow for managing the ML lifecycle.
* Excellent communication skills with the ability to work across engineering and business teams.
Nice-to-Have Skills
* Exposure to AI/LLM use cases, vector search, or RAG pipelines.
* Familiarity with Java-based services or microservices architecture.
* Knowledge of data governance, cataloging, and security practices.