Ideal Candidate:
- An undergraduate or Master’s degree in Computer Science or equivalent engineering experience
- 6+ years of professional software engineering and programming experience (Java, Python) with a focus on designing and developing complex data-intensive applications
- 3+ years of architecture and design (patterns, reliability, scalability, quality) of complex systems
- Advanced coding skills and practices (concurrency, distributed systems, functional principles, performance optimization)
- Professional experience working in an agile environment
- Strong analytical and problem-solving ability
- Strong written and verbal communication skills
- Experience in operating and maintaining production-grade software
- Comfortable tackling loosely defined problems; thrives on a team that has autonomy in its day-to-day decisions
Preferred Skills:
- In-depth knowledge of software and data engineering best practices
- Experience in mentoring and leading junior engineers
- Experience in serving as the technical lead for complex software development projects
- Experience with large-scale distributed data technologies and tools
- Strong experience with multiple database models (relational, document, in-memory, search, etc.)
- Strong experience with data streaming architecture (Kafka, Spark, Airflow, SQL, NoSQL, CDC, etc.)
- Strong knowledge of cloud data platforms and technologies such as GCS, BigQuery, Cloud Composer, Pub/Sub, Dataflow, Dataproc, Looker, and other cloud-native offerings
- Strong knowledge of Infrastructure as Code (IaC) and associated tools (Terraform, Ansible, etc.)
- Experience pulling data from a variety of source types, including mainframe (EBCDIC), fixed-length and delimited files, and databases (SQL, NoSQL, time-series)
- Strong coding skills for analytics and data engineering (Java, Python, and Scala)
- Experience performing analysis with large datasets in a cloud-based environment, preferably with an understanding of Google’s Cloud Platform (GCP)
- Understands how to translate business requirements to technical architectures and designs
- Comfortable communicating with various stakeholders (technical and non-technical)
- Experience with Airflow and Spark:
- Airflow: Proven experience using Apache Airflow to orchestrate and schedule workflows. Ability to design, implement, and manage complex data pipelines. Understanding of DAGs (including how to create them dynamically), task dependencies, and error handling within Airflow.
- Spark: Hands-on experience with Apache Spark for large-scale data processing and analytics. Proficiency in writing Spark jobs in Java (PySpark is also fine, as we're moving in that direction), along with the ability to optimize performance and handle data transformations and aggregations at scale.
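As a minimal, Airflow-free sketch of the DAG concepts above (dynamic construction, task dependencies, per-task error handling), using only the standard library; the task names and pipeline shape are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline built dynamically from a list of tables
# (the "dynamically create DAGs" idea): task -> upstream dependencies.
tables = ["orders", "customers"]
deps = {f"extract_{t}": set() for t in tables}
deps["transform"] = {f"extract_{t}" for t in tables}
deps["load"] = {"transform"}

def run(task: str) -> None:
    # Stand-in for real work; an orchestrator would retry/alert on failure.
    print(f"running {task}")

# Resolve a valid execution order from the dependency graph.
order = list(TopologicalSorter(deps).static_order())
for task in order:
    try:
        run(task)
    except Exception:
        # Airflow would mark the task failed here and apply its retry policy.
        raise
```

In Airflow itself, the same dependencies would be declared with operators and `>>`; the point of the sketch is only the graph-of-tasks model.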
- Familiarity with GCP Services:
- BigQuery: Experience with Google BigQuery for running SQL queries on large datasets, optimizing queries for performance, and managing data warehousing solutions.
- Composer: Knowledge of Google Cloud Composer for managing and orchestrating workflows.
- Dataproc: Experience with Dataproc for managing and scaling Spark clusters, including configuring clusters, running jobs, and integrating with other GCP services.
- Proficiency in Python, Java, and SQL:
- Python: Strong foundation in Python, with experience in writing clean, efficient code and utilizing libraries such as Pandas and NumPy for data manipulation. Proficient in debugging, testing, and using Python for API interactions and external service integration.
- Java: Proficiency in Java, especially for integrating with data processing frameworks. Experience with Java-based libraries and tools relevant to data engineering is a plus.
- SQL: Experience in writing and optimizing complex SQL queries for data extraction, transformation, and analysis.
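One small illustration of the kind of SQL meant above (a CTE plus a window function), runnable against an in-memory SQLite database; the table and sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 100), ('east', 300), ('west', 200), ('west', 50);
""")

# CTE + window function: rank sales within each region, keep the top one.
query = """
    WITH ranked AS (
        SELECT region, amount,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales
    )
    SELECT region, amount FROM ranked WHERE rnk = 1 ORDER BY region
"""
top_per_region = conn.execute(query).fetchall()
print(top_per_region)  # [('east', 300), ('west', 200)]
```

The same pattern (CTEs, `PARTITION BY` windows) carries over directly to BigQuery's SQL dialect.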
- Knowledge of Terraform (optional but preferred):
- Terraform: Familiarity with Terraform to automate the provisioning and management of cloud resources. Ability to write and maintain Terraform scripts to define and deploy GCP resources, ensuring infrastructure consistency and scalability.
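The mainframe bullet above (EBCDIC, fixed-length records) can be sketched in Python, which ships an EBCDIC codec (`cp037`); the record layout here is invented for illustration:

```python
# Hypothetical fixed-length layout: 6-char id, 10-char name, 5-char amount,
# encoded in EBCDIC (code page 037) as on many mainframe extracts.
record = "000042JANE DOE  00150".encode("cp037")

def parse_record(raw: bytes) -> dict:
    text = raw.decode("cp037")        # EBCDIC -> str
    return {
        "id": int(text[0:6]),
        "name": text[6:16].rstrip(),  # fixed-width fields are space-padded
        "amount": int(text[16:21]),
    }

row = parse_record(record)
print(row)  # {'id': 42, 'name': 'JANE DOE', 'amount': 150}
```

Real mainframe feeds add complications (packed decimal, copybook layouts) that dedicated tooling handles, but the decode-then-slice pattern is the core idea.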
Nice-to-Have Skills (not required):
- Exposure to data science or machine learning packages (Pandas, PyTorch, Keras, TensorFlow, etc.)
- Contributions to open-source software (code, docs, or mailing list posts)
- GCP Professional Data Engineer Certification