As a Data Engineer, you will be responsible for developing and optimizing data environments that enable the smooth extraction, transformation, and loading (ETL) of large and complex datasets. You will work with protein data, implementing efficient pipelines for storing, manipulating, and cross-matching data to enable similarity searching and data-driven decision-making for model training. The ideal candidate will have extensive experience with data engineering in cloud environments, particularly with Databricks and Azure technologies.
Key Responsibilities:
- Design, develop, and maintain efficient ETL pipelines that move large-scale, customized datasets from diverse sources to their destinations, ensuring data integrity and accessibility.
- Work closely with bioinformatics and data science teams to create data structures optimized for model training, enabling quick access and cross-matching for similarity searches.
- Implement, organize, and optimize protein data sets within Databricks or Microsoft Fabric, ensuring compatibility with AI/ML workflows.
- Manage data lakes and data warehouses, ensuring data consistency, accuracy, and optimal performance in a cloud-based environment.
- Collaborate with stakeholders to understand data needs, defining data architecture and structures that meet both current and future requirements.
- Ensure secure handling and transfer of sensitive research data, complying with company policies and regulatory standards.
- Build automated data pipelines for cross-matching datasets and generating insights using similarity searches and other bioinformatics techniques (a simplified sketch of this kind of pipeline follows this list).
- Maintain documentation and best practices for data handling, processing, and storage to ensure smooth operations and reproducibility.
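For a flavor of the day-to-day work, the sketch below shows one simplified way a cross-matching step might look on Databricks. The table names, the embedding column, and the similarity threshold are illustrative placeholders, not a description of our actual pipelines.

```python
# Minimal sketch of a protein cross-matching step on Databricks.
# Table names, the embedding column, and the 0.9 threshold are
# hypothetical placeholders for illustration only.
from pyspark.sql import SparkSession, functions as F

# On Databricks a session already exists; getOrCreate reuses it.
spark = SparkSession.builder.appName("protein-cross-match").getOrCreate()

# Two curated protein tables, each with an id and a unit-normalized
# embedding stored as an array<float> column.
query = spark.table("curated.protein_query")
reference = spark.table("curated.protein_reference")

# Cosine similarity of unit-normalized vectors reduces to a dot product:
# multiply the vectors element-wise, then sum the products.
dot = F.aggregate(
    F.zip_with("q.embedding", "r.embedding", lambda a, b: a * b),
    F.lit(0.0),
    lambda acc, x: acc + x,
)

matches = (
    query.alias("q")
    .crossJoin(reference.alias("r"))    # brute force; fine for small sets
    .withColumn("similarity", dot)
    .filter(F.col("similarity") > 0.9)  # keep only close matches
    .select(
        F.col("q.id").alias("query_id"),
        F.col("r.id").alias("reference_id"),
        "similarity",
    )
)

matches.write.mode("overwrite").saveAsTable("curated.protein_matches")
```

The brute-force cross join shown here is only sensible for modest reference sets; deciding when to replace it with locality-sensitive hashing or an approximate nearest-neighbor index is exactly the kind of design decision this role owns.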
Required Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Engineering, Bioinformatics, or a related field.
- Proven experience with creating and managing customized data environments, including ETL processes, data manipulation, and data transformation.
- Strong expertise with Databricks or Microsoft Fabric (Azure Synapse Analytics, Azure Data Lake, etc.), particularly in the context of large-scale data processing.
- Hands-on experience in setting up, optimizing, and maintaining data pipelines in cloud environments, specifically on Microsoft Azure.
- Expertise in protein data processing, cross-matching, and similarity searching in bioinformatics.
- Proficiency in SQL and Python for data manipulation and pipeline development (see the parsing sketch after this list).
- Strong understanding of data modeling, data architecture, and database optimization techniques.
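As a small, concrete example of the data manipulation involved, the sketch below normalizes FASTA-formatted protein records into rows ready for loading. FASTA is a standard bioinformatics format; the function and variable names here are purely illustrative.

```python
# Illustrative sketch: normalize FASTA protein records into
# (record_id, sequence) rows ready to load into a table.
# Function and field names are hypothetical examples.
import io
from typing import Iterator, TextIO, Tuple

def parse_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a FASTA stream."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):          # a header line starts a new record
            if record_id is not None:
                yield record_id, "".join(chunks)
            record_id, chunks = line[1:].split()[0], []
        elif line:                        # sequence lines may be wrapped
            chunks.append(line)
    if record_id is not None:             # flush the final record
        yield record_id, "".join(chunks)

# Example usage against an in-memory file:
sample = io.StringIO(">sp|P69905|HBA_HUMAN\nMVLSPADKTN\nVKAAWGKVGA\n")
print(list(parse_fasta(sample)))
# [('sp|P69905|HBA_HUMAN', 'MVLSPADKTNVKAAWGKVGA')]
```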
Preferred Qualifications:
- Familiarity with big data frameworks and processing tools such as Apache Spark and Apache Kafka.
- Experience with machine learning workflows and supporting model training with efficient data pipelines.
- Knowledge of bioinformatics tools and libraries related to protein data analysis.
- Understanding of regulatory requirements related to research data (e.g., HIPAA, GxP).
- Familiarity with Docker, Kubernetes, or other containerization tools for deployment.