SRIVIDYA
Email: ************@*****.***
PH: 913-***-****
Senior Data Engineer
LinkedIn URL: https://www.linkedin.com/in/srividya-chekuri/
PROFESSIONAL SUMMARY
Seasoned Senior Data Engineer with over 10 years of experience, specializing in the Hadoop ecosystem, cloud platforms, and ETL processes, and leveraging diverse technical expertise to drive complex data engineering projects that emphasize security, scalability, and efficiency.
Extensive hands-on experience with cloud platforms such as AWS (EMR, EC2, RDS, S3, Lambda, Glue, Redshift), Azure (Azure Data Lake, Azure Storage, Azure SQL, Azure Databricks), and Google Cloud Platform, as well as Kubernetes.
Design, develop, and maintain Spark pipelines using Scala, adhering to the existing ETL pipeline framework.
Implement best practices in Spark programming to ensure high-performance data processing and transformation.
Utilize GitHub for version control, managing code repositories, and collaborating with other team members.
Skilled in data ingestion, pipeline design, PySpark, Hadoop information architecture, data modeling, data mining and optimizing ETL workflows.
Extensive knowledge of the Hadoop ecosystem, including HDFS, MapReduce, Hive, Pig, Oozie, Flume, Cassandra, and Spark with Scala and PySpark (RDD, DataFrame, Spark SQL, Spark MLlib, Spark GraphX), as well as Kubernetes.
Strong background in ETL methods utilizing tools such as Microsoft Integration Services, Informatica Power Center, SnowSQL, OLAP and OLTP.
Leveraging Spring Boot to develop microservices for real-time data processing and integration, streamlining application startup and minimizing overhead through its convention-over-configuration approach.
Experienced in designing logical and physical data models, implementing both Star Schema and Snowflake Schema concepts.
Proficient in SQL Server, NoSQL databases like DynamoDB, MongoDB, complex Oracle queries with PL/SQL and utilizing SSIS for data extraction and enhancing SSRS reporting.
Experienced in visualization tools like Tableau and Power BI and leveraging Talend's big data capabilities to create scalable data processing pipelines.
Creating and maintaining Golang-based data processing pipelines for handling large data volumes, including ingestion, transformation, and loading. Configuring and managing Snowflake data warehouses within Azure and utilizing Terraform across AWS, Azure, and Google Cloud.
Designing secure API endpoints with authentication and authorization mechanisms like JWT, OAuth2, and API keys (see the sketch following this summary). Expertise in writing complex programs for various file formats, including Text, Sequence, XML, and JSON.
Solid understanding of Agile and Scrum methodologies, emphasizing iterative and collaborative approaches; proficient in Test-Driven Development (TDD) and tools such as Jenkins, Docker, and CI/CD pipelines like Concourse and Bitbucket. Familiar with version control systems such as Git, SVN, and Bamboo.
Proficient in using industry-leading testing tools such as Apache JMeter, QuerySurge, and Talend Data Quality to validate data transformation and ETL processes.
Solid understanding of networking protocols, including DNS, TCP/IP, and VPN, with expertise in configuring and troubleshooting them to ensure secure and seamless data communication.
Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions (Cloudera, Amazon EMR, and Hortonworks). Hands-on experience in working in an Agile environment and following release management and golden rules.
Experience with version control tools such as Git and UrbanCode Deploy (UCD). Adept in Python, Scala, Java, and shell scripting, with significant experience in UNIX and Linux environments.
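To illustrate the JWT-secured API endpoints mentioned above, the following is a minimal, hypothetical sketch using the PyJWT library; the secret, claims, and helper names are illustrative assumptions rather than code from any of the roles below.

```python
# Hypothetical example of JWT-based request authentication (secret and claims are placeholders).
import datetime
import jwt  # PyJWT

SECRET = "example-secret"  # placeholder; a real deployment would use a managed secret store

def issue_token(user_id: str) -> str:
    # Issue a short-lived token for the given user (placeholder claim structure).
    payload = {
        "sub": user_id,
        "exp": datetime.datetime.utcnow() + datetime.timedelta(hours=1),
    }
    return jwt.encode(payload, SECRET, algorithm="HS256")

def verify_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on bad tokens.
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```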
Technical Skills:
Cloud Computing: Amazon Web Services (EMR, EC2, RDS, S3, Lambda, Glue, Redshift), Azure (Azure Data Lake, Azure Storage, Azure SQL, Azure Databricks), Google Cloud Platform
Big Data Technologies: Hadoop ecosystem (HDFS, MapReduce, Hive, Pig, Oozie, Flume, Cassandra, Spark with Scala, PySpark, RDD, DataFrame, Spark SQL, Spark MLlib, Spark GraphX)
ETL Processes: Microsoft Integration Services, Informatica Power Center, SnowSQL, OLAP, OLTP, Talend
Data Modeling & Databases: SQL Server, NoSQL (DynamoDB, MongoDB), Oracle (PL/SQL), Star Schema, Snowflake Schema
Programming Languages: Python, Scala, Java, Shell scripting
Real-Time Data Processing: Spring Boot, Golang-based pipelines
Visualization Tools: Tableau, Power BI
Networking Protocols: DNS, TCP/IP, VPN
DevOps & CI/CD: Terraform, Jenkins, Docker, Concourse, Bitbucket
Version Control Systems: Git, SVN, Bamboo
Testing Tools: Apache JMeter, QuerySurge, Talend Data Quality
Methodologies: Agile, Scrum, Test-Driven Development (TDD)
Education: Bachelor of Technology in Computer Science, ANU, Guntur, India
PROFESSIONAL EXPERIENCE
Senior Data Engineer
Genesis, Beaverton, OR May 2023 to Present
Responsibilities:
Led the full lifecycle of data engineering projects, from requirement analysis and planning to deployment and maintenance, aligning with both Agile and Waterfall methodologies.
Installed and configured multi-node clusters utilizing EC2 instances, managed AWS monitoring tools like CloudWatch and CloudTrail, created alarms for EBS, EC2, ELB, RDS, S3 and SNS and implemented secure data storage practices in S3 buckets.
Migrated on-premises databases and Informatica ETL processes to AWS, Redshift, and Snowflake, using asynchronous task execution with Celery, RabbitMQ, and Redis.
Integrated AWS DynamoDB with Lambda and developed Spark code for AWS Glue jobs and EMR, automating tasks using Python (see the sketch at the end of this role).
Managed both relational (SQL Server, MySQL, PostgreSQL, Oracle) and NoSQL (DynamoDB, MongoDB, Cassandra) databases, designing, optimizing, and implementing scalable data models, complex queries, and performance optimization.
Utilized Sqoop for data import/export from Snowflake, Oracle and DB2. Generated SQL and PL/SQL scripts to manage database objects and gained expertise in Snowflake Database.
Utilized BigQuery's data integration capabilities to efficiently load and transform data from various sources, including Google Cloud Storage, Cloud Pub/Sub, and external databases.
Utilized Spark, PySpark, Hive, Hadoop, Golang, and Scala for data analysis, ingestion, and integrity checks, handling various data formats such as JSON, CSV, Parquet, and Avro.
Designed and implemented Apache Kafka-based data streaming solutions, enabling real-time data ingestion and processing.
Automated data ingestion from diverse sources like APIs, AWS S3, Teradata and Redshift using Pyspark and Scala. Utilized Oozie for job scheduling within the SDLC.
Created interactive reports and visualizations using Tableau and Power BI.
Designed and tested dimensional data models using Star and Snowflake schemas, following Ralph Kimball and Bill Inmon methodologies.
Implemented logging, monitoring and error handling mechanisms within REST APIs.
Implemented microservices on Kubernetes clusters, using Jenkins for CI/CD and Jira for ticketing and issue tracking.
Leveraged deep expertise in version control systems such as Git, SVN and Bitbucket to efficiently manage code repositories, ensuring a consistent and well-documented development process.
Utilized advanced testing tools and frameworks such as Apache JMeter, QuerySurge and Talend Data Quality to ensure the accuracy and integrity of ETL processes and data migrations.
Environment: AWS, EBS, EC2, CloudWatch, CloudTrail, S3, SNS, Redshift, Snowflake, Celery, RabbitMQ, Redis, DynamoDB, Lambda, Glue, EMR, SQL Server, MySQL, PostgreSQL, Oracle, MongoDB, Cassandra, Sqoop, Spark, PySpark, Hive, Hadoop, Golang, Scala, JSON, CSV, Parquet, Avro, Teradata, Oozie, Tableau, Power BI, Star Schema, Snowflake Schema, Ralph Kimball, Bill Inmon, REST APIs, Kubernetes, Jenkins, Jira, Git, SVN, Bitbucket, Apache JMeter, QuerySurge, Talend Data Quality.
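To give a concrete flavor of the AWS Glue automation referenced in this role, below is a minimal, hypothetical sketch of a PySpark-based Glue job; the catalog database, table, bucket, and column names are illustrative placeholders, not the actual production objects.

```python
# Hypothetical AWS Glue job: read a raw catalog table, clean it, write partitioned Parquet to S3.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw events from the Glue Data Catalog (placeholder database/table names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# Drop malformed rows and land the result as partitioned Parquet (placeholder columns/bucket).
df = dyf.toDF().dropna(subset=["event_id", "event_date"])
(df.write.mode("overwrite")
   .partitionBy("event_date")
   .parquet("s3://example-curated-bucket/events/"))

job.commit()
```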
Senior Azure Data Engineer
Mayo Clinic, Rochester, MN July 2021 to April 2023
Responsibilities:
Played a pivotal role across the complete SDLC process, from dissecting requirements to comprehending intricate workflows spanning source to destination systems.
Orchestrated end-to-end data solutions within the healthcare domain using Azure's cutting-edge PaaS services, including Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics, ensuring efficient data extraction, transformation, and loading into Azure Data Storage.
Seamlessly managed data ingestion and processing across Azure components such as Data Lake, Azure Storage, Azure SQL and Azure DW, leveraging the power of Azure Databricks for streamlined data processing. Utilized Azure SQL Data Warehouse's external tables for insightful data visualization and reporting, empowering business users.
Designed, implemented, and maintained high-performance relational and NoSQL databases, such as SQL Server, Oracle, MySQL, PostgreSQL, MongoDB, and Cassandra. Ensured optimal database structure, query performance and data integrity.
Collaborated with BI teams to develop intricate queries that seamlessly integrate with business intelligence tools like Tableau, Power BI, and Looker, supporting advanced visualizations and reporting.
Designed and maintained large data warehouses using technologies like Amazon Redshift, Google BigQuery, and Azure SQL Data Warehouse to enable complex querying and analytics on structured and unstructured data.
Crafted intricate ETL pipelines with Azure Data Factory's Linked Services, Datasets and Pipelines, effectively orchestrating data movement and transformations across diverse sources like Azure SQL, Blob storage and Azure SQL Data Warehouse.
Applied the star-schema methodology to design data warehouses, skillfully converting heterogeneous source data into coherent SQL tables. Implemented optimized star and snowflake schemas using defined fact and dimension tables to enhance data storage and retrieval efficiency.
Developed Spark applications using PySpark and Spark SQL, facilitating data extraction, transformation, and aggregation from various formats to derive actionable insights. Fine-tuned Spark applications for optimal performance, enhancing batch intervals and parallelism.
Utilized Apache Airflow to create and manage complex data pipelines, scheduling tasks and dependencies to ensure smooth ETL workflows (see the sketch following this role).
Employed Apache NiFi to facilitate data ingestion, transformation, and routing, enabling efficient data movement across various systems and platforms.
Engineered custom user-defined functions (UDFs) in Scala and PySpark, enriching data processing workflows to cater to specific business requirements.
Managed SSIS packages for seamless data import and export from diverse sources, integrating automation tools like Jenkins, Artifactory, SonarQube, Chef and Puppet for efficient continuous integration and delivery (CI/CD) operations.
Designed and executed robust audit, balance, and control frameworks to ensure data integrity and security during the entire data lifecycle. Leveraged SQL DB audit tables to maintain the accuracy of data throughout ingestion, transformation, and load processes.
Developed advanced multi-dimensional constructs, including cubes and dimensions, through SQL Server Analysis Services (SSAS), paving the way for sophisticated data analysis and insights.
Environment: Azure Data Factory, T-SQL, Spark SQL, U-SQL, Azure Data Lake Analytics, Azure Data Storage, Azure Data Lake, Azure Storage, Azure SQL, Azure DW, Azure Databricks, Azure SQL Data Warehouse, SQL Server, Oracle, MySQL, PostgreSQL, MongoDB, Cassandra, Tableau, Power BI, Looker, Amazon Redshift, Google BigQuery, Apache Airflow, Apache NiFi, Scala, Pyspark, SSIS, Jenkins, Artifactory, SonarQube, Chef, Puppet, SQL Server Analysis Services (SSAS).
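The Airflow orchestration mentioned in this role can be illustrated with a minimal, hypothetical DAG skeleton; the DAG id, schedule, and task callables are placeholders rather than the actual pipelines built here.

```python
# Hypothetical Airflow DAG: extract -> transform -> load, run daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull source data (placeholder)

def transform():
    ...  # clean and enrich (placeholder)

def load():
    ...  # write to the warehouse (placeholder)

with DAG(
    dag_id="example_etl",          # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Declare task dependencies so transform waits for extract, load waits for transform.
    t_extract >> t_transform >> t_load
```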
Senior ETL Developer/Data Engineer
Homesite Insurance, Boston, MA November 2018 to June 2021
Responsibilities:
Developed and managed automated ETL pipelines using Python, Spark and PySpark, utilizing Airflow within the Google Cloud Platform (GCP) for seamless data ingestion and database updates.
Led the migration of on-premises Hadoop systems to GCP, harnessing the capabilities of Cloud Storage, Dataproc, Dataflow, and BigQuery (see the sketch following this role). Performed proof-of-concept evaluations comparing self-hosted Hadoop with GCP Dataproc and explored Bigtable performance improvements within the GCP ecosystem.
Architected large-scale data warehousing and integration solutions across platforms like Snowflake Cloud, AWS Redshift and Informatica Intelligent Cloud Services (IICS). Devised workflows and mappings through Informatica ETL tools, integrating various relational databases and leveraged Power BI for reporting and data visualization following Oracle to BigQuery migrations.
Designed data models within Neptune, efficiently loading data into the Neptune database and employing the Gremlin query language for complex data querying. Utilized Grafana to create dashboards for real-time monitoring of metrics from the Cassandra database.
Created custom User-Defined Functions (UDFs) in Pig and Hive, enriching Pig Latin and HiveQL with Python functionality and deployed applications on servers such as Glassfish, Tomcat and CGI for enhanced interoperability.
Designed, implemented, and maintained high-performance and scalable database systems across various platforms including SQL Server, Oracle, MySQL, PostgreSQL, MongoDB, and Cassandra. Established database structures that reduce redundancy and increase query efficiency.
Utilized Terraform Cloud and Terraform Enterprise for collaboration, state management and execution of Terraform configurations in a secure and centralized manner.
Implemented RESTful APIs using Golang to expose data processing functionalities to other applications and services.
Designed and implemented robust Kafka clusters to facilitate real-time data streaming, ensuring high availability, fault tolerance and optimal performance across various data sources and applications.
Coordinated and performed integration testing across different data platforms and applications, ensuring seamless interaction and data flow between various systems, including databases, APIs, and third-party tools.
Environment: Python, Spark, PySpark, Airflow, Google Cloud Platform (GCP), Hadoop, Cloud Storage, Dataproc, Dataflow, BigQuery, BigTable, Snowflake Cloud, AWS Redshift, Informatica Intelligent Cloud Services (IICS), Power BI, Neptune, Gremlin, Grafana, Cassandra, Pig, Hive, Glassfish, Tomcat, CGI, SQL Server, Oracle, MySQL, PostgreSQL, MongoDB, Terraform Cloud, Terraform Enterprise, Golang, Kafka.
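As an illustration of the BigQuery loading referenced in this role, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, bucket, and table names are hypothetical.

```python
# Hypothetical example: load newline-delimited JSON from Cloud Storage into BigQuery.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/events/*.json",   # placeholder bucket/path
    "example-project.analytics.events",    # placeholder destination table
    job_config=job_config,
)
load_job.result()  # block until the load completes
```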
Data Engineer
Travelport, Englewood, CO July 2015 to August 2018
Responsibilities:
Designed SQL structures with the Django framework and worked with databases such as Oracle SQL, MongoDB, MySQL, and MS SQL to ensure efficient storage and retrieval.
Deployed multi-tier applications using services like EC2, Route 53, S3, RDS and DynamoDB on AWS. Conducted stress testing on EC2 instances and configured Spark jobs on Elastic Map Reduce (EMR).
Conducted transformations using Spark Data Frames, SQL, file formats and RDDs. Developed custom Python functions for data transformations, including JSON parsing and data ingestion from various sources.
Utilized Apache Kafka and AWS Kinesis for data collection, aggregation, and automation in transferring data to AWS data lakes such as S3 and Redshift. Employed NiFi flows for data ingestion from diverse sources.
Collaborated with administration teams to optimize SQL and assisted in configuring MongoDB clusters on AWS. Ensured production performance through proper monitoring and tuning.
Worked with various databases, such as Cassandra, and wrote SQL queries for stored procedures and other database objects.
Spearheaded testing efforts in big data environments like Hadoop and Spark, verifying data processing, data migration and data analytics operations. Employed testing tools specific to big data like QuerySurge to validate large datasets.
Integrated Kafka with stream processing frameworks such as Apache Flink and Apache Storm, enabling real-time analytics and complex event processing at scale.
Worked with Hadoop components like Hive, Pig, Sqoop and Oozie to manage data distribution, ETL procedures and automate workflows. Utilized Cloudera Manager for workload and performance monitoring.
Partnered with the Data Science team to build machine learning models on Spark EMR clusters, applying feature engineering, normalization, and encoding techniques using Python Scikit-learn (see the sketch following this role).
Monitored Spark applications, identified issues through Spark UI and implemented CI/CD pipelines using GitHub and Jenkins. Experience in containerization with Docker.
Worked in an agile environment, utilizing weekly SCRUMs for project and enhancement implementations.
Environment: Python, Django, Flask, Apache Kafka, AWS (EC2, Route 53, S3, RDS, DynamoDB, EMR, Kinesis, Redshift), Apache Spark (DataFrames, RDDs), Apache Hadoop (Hive, Pig, Sqoop, Oozie), Cloudera Manager, Apache Flink, Apache Storm, NiFi, QuerySurge, Oracle SQL, MongoDB, MySQL, MS SQL, Cassandra, GitHub, Jenkins, Docker, Agile (SCRUM), Scikit-learn.
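The Scikit-learn feature engineering mentioned in this role can be sketched as follows; the feature names, sample data, and model choice are purely illustrative assumptions, not the actual project data.

```python
# Hypothetical feature-engineering pipeline: scale numeric features, one-hot encode categoricals.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Placeholder data frame standing in for the real training set.
df = pd.DataFrame({
    "fare": [120.0, 89.5, 250.0],        # placeholder numeric feature
    "cabin_class": ["Y", "J", "Y"],      # placeholder categorical feature
    "booked": [1, 0, 1],                 # placeholder target
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["fare"]),                            # normalization
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cabin_class"]),  # encoding
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[["fare", "cabin_class"]], df["booked"])
```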
Software Engineer
IBing Software Solutions Private Limited, Hyderabad, India June 2014 to March 2015
Responsibilities:
Collaborated with Business and IT groups to address requirements for user stories, process flows, outcomes from UAT, and status updates.
Oversaw and directed both on-site and offshore teams.
Developed test strategies, plans, and cases in various environments based on business requirements.
Involved in gathering requirements, collaborating with offshore testing and development teams to identify automation scenarios, and evaluating test automation outcomes with test teams, in addition to providing oversight.
Performed System Integration, UAT, Regression, Accessibility, and API testing using Postman.
Expertise in the reporting tool Tableau, test case creation and execution tools such as JIRA and ALM, and defect tracking and bug reporting tools such as JIRA, ALM, and HP Quality Center.
Experienced in SQL for validating test results and pulling data based on requirements.
Identified automation scenarios to enhance efficiency and effectiveness in the testing process.
Experienced in SDLC process and worked on both Agile/Scrum and Waterfall methodologies.
Actively attended daily Scrum meetings and participated in PI Planning sessions.
Expert in test case creation, execution, and bug tracking; improved product quality by developing root cause resolution summaries for all defects.
Led, participated in, and supported post-implementation reviews, and documented testing process lessons learned after major releases.
Tools: JIRA, ALM, Tableau, Postman, SDLC process, TOSCA, SQL/Oracle Developer, Agile methodology.