
ETL Developer / Data Engineer

Location:
Charlotte, NC
Posted:
March 28, 2024


Name: Susmitha

Title: ETL Developer

Email: ad4mvi@r.postjobfree.com

Phone: +1-980-***-****

Professional Summary:

Proficient analytical Data Engineer and ETL Developer with over 9 years of hands-on experience specializing in Big Data technologies.

Skilled in all stages of the data management lifecycle, including ingestion, data modeling, querying, processing, analysis, and implementation of enterprise-level systems.

Experienced in working with leading Hadoop distribution platforms such as IBM Big Insights, Hortonworks, and Cloudera, as well as cloud platforms including GCP, AWS, and Azure.

Expertise in Big Data technologies and Hadoop ecosystem components such as PySpark, Spark-Scala, HDFS, GPFS, Hive, Sqoop, Pig, Spark SQL, Kafka, Hue, YARN, Trifacta, and EPIC data sources.

Experienced in building robust Extract, Transform, Load (ETL) pipelines using Azure Data Factory to ingest, transform, and load data from various sources into data warehouses or data lakes.

Hands-on experience building data pipelines and data marts using the Hadoop stack.

Hands-on experience with Apache Spark, creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames.
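A minimal PySpark sketch of that RDD-to-DataFrame pattern (illustrative only; the input path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-dataframe-demo").getOrCreate()

# Build an RDD, apply a transformation (filter) and an action (count)
rdd = spark.sparkContext.textFile("/data/input/events.txt")   # hypothetical path
clean_rdd = rdd.filter(lambda line: line.strip() != "")        # transformation (lazy)
print("non-empty lines:", clean_rdd.count())                   # action (triggers execution)

# Convert the RDD to a DataFrame by mapping rows to tuples and naming the columns
pairs = clean_rdd.map(lambda line: (line, len(line)))
df = pairs.toDF(["raw_line", "line_length"])                   # hypothetical column names
df.show(5)
```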

Experienced in data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.

Experience in writing REST APIs in Python for large-scale applications.
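For illustration, a minimal REST endpoint sketch in Python using Flask (the resource name and payload fields are hypothetical, not taken from any specific project):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory store standing in for a real backend
_jobs = {}

@app.route("/jobs/<job_id>", methods=["GET"])
def get_job(job_id):
    job = _jobs.get(job_id)
    if job is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(job)

@app.route("/jobs", methods=["POST"])
def create_job():
    payload = request.get_json(force=True)
    job_id = str(len(_jobs) + 1)
    _jobs[job_id] = {"id": job_id, "status": "queued", "params": payload}
    return jsonify(_jobs[job_id]), 201

if __name__ == "__main__":
    app.run(port=8080)
```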

Proficient in leveraging Azure data services such as Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse Analytics, and Azure HDInsight to design, implement, and manage scalable data solutions.

Developed a data pipeline using Kafka and Spark Streaming to store data in HDFS and performed real-time analytics on the incoming data.
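A hedged sketch of such a Kafka-to-HDFS pipeline using Spark Structured Streaming (the broker address, topic name, and HDFS paths are placeholders):

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka connector package on the classpath
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (broker/topic are placeholders)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

# Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/events")
         .option("checkpointLocation", "hdfs:///data/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```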

In-depth understanding of Apache Spark job execution components such as the DAG, executors, the task scheduler, stages, and Spark Streaming.

Experience in creating and executing data pipelines on GCP and AWS platforms.

Hands-on experience with GCP services including BigQuery, Cloud Functions, and Dataproc.

Strong experience with the Control-M job scheduler, Apache Airflow, ESP, and D-Series; monitored jobs on an on-call basis to close incident tickets.
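For the Airflow portion, a minimal scheduled-DAG sketch (DAG id, schedule, and task callables are hypothetical):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract():
    # placeholder for a real extraction step
    print("extracting source data")


def load():
    # placeholder for a real load step
    print("loading into the warehouse")


with DAG(
    dag_id="daily_etl_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",       # run daily at 02:00
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task
```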

Hands-on experience with Amazon EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, DynamoDB and other services of the AWS family.

Expertise in using CI/CD Jenkins pipelines to deploy code to production.

Designed and developed the program paradigm to support data collection and filtering processes in the data warehouse and Hadoop data mart.

Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR, and Hortonworks.

Deep understanding of cybersecurity and penetration testing, and experience working with security teams to obtain approvals to deploy code to production.

Hands-on experience working in Agile environments and following release management processes and golden rules.

Experience with version control tools such as Git and UrbanCode Deploy (UCD).

Technical Skills

Big Data Tools

Hadoop, Hive, Apache Spark, PySpark, HBase, Kafka, YARN, Sqoop, Impala, Oozie, Pig, MapReduce, Zookeeper and Flume

Hadoop Distributions

EMR, Cloudera, Hortonworks.

Cloud Services

AWS – EC2, S3, EMR, RDS, Glue, Presto, Lambda, Redshift; Azure – Data Lakes, Blob Storage

GCP – Cloud Storage, BigQuery, Compute Engine, Cloud Composer, Dataproc, Dataflow, Pub/Sub

ETL & Data Visualization

Informatica, SSIS, Talend, Tableau and Power BI

Relational Databases

Oracle, SQL Server, Teradata, MySQL, PostgreSQL and Netezza

NoSQL Databases

Cassandra, MongoDB and HBase

Programming Languages

Scala, Python and R

Scripting

Python and Shell scripting

Build Tools

Apache Maven and SBT, Jenkins, Bitbucket

Version Control

GIT and SVN

Operating Systems

Unix, Linux, Mac OS, CentOS, Ubuntu and Windows

Tools

PuTTY, PuTTYgen, Eclipse, IntelliJ and Toad

Professional Experience

Client: Experian, Costa Mesa, CA April 2022 - Present

Role: ETL Developer

Responsibilities:

Responsible for maintaining and managing the existing Enterprise Data Warehouse (EDW) and Operational Data Store (ODS) builds, ensuring the reliability and performance of the data infrastructure.

Extends the data models of the existing Data Warehouse to accommodate new requirements and business needs, ensuring scalability and flexibility.

Develops ETL components according to specifications, focusing on performance optimization, dependency management, auditing, error handling, and data quality assurance.

Creates REST API integrations with vendors and partners to facilitate seamless data exchange and integration between internal and external systems.

Supports post-deployment activities by debugging, fixing issues, and participating in ongoing maintenance to ensure the stability and efficiency of the data infrastructure.

Generates reports to fulfill business requirements and address data analysis needs, leveraging ETL processes and data warehouse resources.

Designs and implements dashboard pages and metrics within a custom dashboard solution tailored to the needs of the school district, ensuring data integrity and quality in the presentation layer.

Participates in code reviews to ensure adherence to architectural specifications, troubleshoots code-related issues, and provides feedback for continuous improvement.

Translates business requirements into functional specifications, develops detailed technical designs, and implements ETL code through build, unit, and system testing phases.

Environments Used: Enterprise Data Warehouse (EDW), Operational Data Store (ODS), REST API integrations with vendors and partners, custom dashboard solution, SQL Server environments (2012/2019), compliance reporting environments (State and DOE), Aspen SIS database, Salesforce environment, job scheduling environment, ISBE WCF SOAP Protocol environment, Azure cloud environment, Azure Active Directory (Azure AD), Windows Server environments (2019/2016/2012/2008), Azure Site Recovery and Azure Backup, scripting (PowerShell, JSON, Bicep, ARM templates).

Client: AgFirst, Columbia, SC August 2020 - March 2022

Role: Sr. AWS Data Engineer

Responsibilities:

Developed and maintained scalable data pipelines using AWS Data Pipeline, AWS Glue, and Sqoop for efficient data extraction, transformation, and loading (ETL) processes, enhancing data reliability and accessibility.

Implemented AWS Redshift for data warehousing solutions, optimizing query performance and data storage to support analytics and business intelligence (BI) applications.

Utilized AWS S3 as a data lake storage solution, ensuring data availability and security for large-scale data analytics projects.
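As a small illustration of that data-lake usage, a boto3 sketch for landing and listing objects in S3 (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local extract into a raw-zone prefix (bucket/key are placeholders)
s3.upload_file(
    "daily_extract.csv",
    "example-data-lake",
    "raw/sales/2024-03-28/daily_extract.csv",
)

# List objects under the same prefix to verify the landing zone
response = s3.list_objects_v2(Bucket="example-data-lake", Prefix="raw/sales/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```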

Engineered real-time data processing solutions using Spark Streaming and Kinesis, enabling timely insights and decision-making capabilities.

Designed, implemented, and maintained CI/CD pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI to automate software delivery processes.

Orchestrated containerized applications using Docker and Kubernetes within CI/CD pipelines for consistent and scalable deployments.

Automated data workflows and job scheduling with Oozie and Control-M, improving operational efficiency and data processing reliability.

Developed complex SQL queries on SQL Server and Oracle 12c databases for data analysis and reporting, reducing data retrieval times and improving data quality.

Programmed sophisticated data processing scripts in Python and Scala, enhancing data manipulation capabilities and supporting advanced data analytics needs.

Configured and managed Hadoop YARN clusters to efficiently process large datasets, ensuring high availability and scalability.

Implemented Spark jobs for batch processing of big data, significantly improving processing speed and capacity.

Leveraged Hive for data warehouse queries, facilitating easy access to large datasets for analysis and reporting.

Designed and deployed EMR clusters for processing vast amounts of data, optimizing computing resources and reducing operational costs.

Utilized EC2 and RDS for flexible, scalable, and cost-effective cloud computing and relational database solutions.

Employed DynamoDB for NoSQL database solutions, ensuring fast and predictable performance for large-scale applications.

Enhanced data integration capabilities using Informatica and Talend, supporting diverse data sources and formats for comprehensive data analytics.

Developed Cassandra database models for high scalability and fault tolerance, addressing the needs of high-volume data management.

Orchestrated data pipeline automation with Fivetran, streamlining data integration and ensuring up-to-date data availability for analytics.

Leveraged Linux operating system for server management, scripting, and automation, ensuring system stability and efficiency.

Configured Tableau dashboards for interactive data visualization, enabling business users to derive insights and make informed decisions.

Conducted data quality checks and performance tuning on databases and data processing jobs, ensuring accuracy and efficiency in data handling.

Managed cloud resources and services such as AWS, EMR, EC2, and RDS, optimizing costs and ensuring scalable infrastructure for data processing.

Environments Used: AWS Redshift, AWS S3, AWS Data Pipeline, AWS Glue, Hadoop YARN, SQL Server, Spark, Spark Streaming, Scala, Kinesis, Python, Hive, Linux, Sqoop, Informatica, Tableau, Talend, Cassandra, Oozie, Control-M, Fivetran, EMR, EC2, RDS, DynamoDB, Oracle 12c.

Client: Chevron Corporation, Santa Rosa, NM April 2018 - July 2020

Role: Sr. AWS Data Engineer

Responsibilities:

Designed and implemented scalable data processing pipelines using GCP Dataflow to handle Chevron's extensive datasets, improving data availability and analysis efficiency.

Managed and optimized data storage and architecture on GCP, GCS, and BigQuery, ensuring secure and efficient data handling for all operational and analytical purposes.

Developed and executed data integration processes using GCP Dataprep and GCP Dataproc, enabling seamless data aggregation from diverse sources, enhancing the data quality and reliability.

Automated data workflows with Cloud Composer, significantly reducing manual intervention and accelerating data processing tasks for Chevron's strategic projects.

Conducted regular evaluations and optimizations of CI/CD pipelines to improve performance, reliability, and scalability.

Integrated automated tests (unit tests, integration tests, and end-to-end tests) into CI/CD pipelines to ensure code quality and reliability.

Enabled real-time data ingestion and streaming through Cloud Pub/Sub and Cloud Storage Transfer Service, facilitating immediate data availability for analysis and decision-making.

Managed database services using Cloud Spanner and Cloud SQL, ensuring high availability, scalability, and maintenance of data integrity for Chevron's critical applications.

Implemented data governance and metadata management with Data Catalog, improving data discoverability and compliance across Chevron's data landscape.

Leveraged GCP Databricks and PySpark for advanced data analytics and machine learning projects, driving insights and innovations in Chevron's exploration and production activities.

Utilized SAS for statistical analysis and data processing, supporting Chevron's data-driven decision-making processes in operational efficiency and risk management.

Implemented infrastructure automation using Terraform to provision and manage cloud resources on AWS improving infrastructure scalability and reducing manual intervention.

Collaborated with development teams to integrate DevOps practices into the software development lifecycle (SDLC), promoting a culture of continuous integration, delivery, and improvement.

Employed Hive and Sqoop for data warehousing and ETL operations, ensuring robust data storage and efficient data transfer between Hadoop clusters and relational databases.

Managed Chevron's data ecosystem on Teradata, enhancing data warehousing capabilities and supporting complex analytical queries for business intelligence.

Optimized GCP Dataproc and BigQuery implementations for high-performance analytics, enabling faster insights into Chevron's vast datasets.
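A brief sketch of the kind of BigQuery query behind such analytics, using the google-cloud-bigquery client (project, dataset, table, and column names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Aggregate a hypothetical production table over the last 30 days
sql = """
    SELECT well_id, AVG(output_bbl) AS avg_output
    FROM `example_project.production.daily_output`
    WHERE report_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY well_id
    ORDER BY avg_output DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.well_id, round(row.avg_output, 2))
```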

Integrated Hadoop ecosystem technologies for distributed data processing, improving Chevron's data handling capabilities for large-scale datasets.

Developed and maintained data models in GCS and Python, supporting Chevron's data science initiatives by providing structured, analysis-ready datasets.

Implemented Snowflake for cloud data warehousing, enhancing data sharing and collaboration across Chevron's global operations.

Developed dashboards and reports using Power BI, providing Chevron's management with actionable insights and data-driven decision support.

Streamlined data processing workflows with Data Flow and SQL Database technologies, enhancing Chevron's operational data management and reporting accuracy.

Deployed Databricks for collaborative data science and analytics projects, fostering innovation and leveraging Chevron's data assets for competitive advantage.

Designed and executed SQL queries and stored procedures for data manipulation and analysis, ensuring data accuracy and accessibility for Chevron's analysts.

Ensured data security and compliance with Chevron's policies and industry regulations by implementing best practices in data handling and storage.

Collaborated with cross-functional teams to identify data needs and deliver comprehensive data solutions, aligning with Chevron's strategic objectives.

Conducted data quality checks and validations using Python and SQL, maintaining Chevron's data integrity and reliability for critical decision-making.

Environments Used: GCP, GCS, BigQuery, GCP Dataprep, GCP Dataflow, GCP Dataproc, Cloud Composer, Cloud Pub/Sub, Cloud Storage Transfer Service, Cloud Spanner, Cloud SQL, Data Catalog, PySpark, SAS, Hive, Sqoop, Teradata, Hadoop, Python, Snowflake, Power BI, SQL Database.

Client: WalkinTree Technologies, India October 2016 - January 2018

Role: Azure Data Engineer

Responsibilities:

Developed and maintained data pipelines using Azure Data Factory and Azure Data Lake, ensuring efficient data integration and processing for agricultural and financial analysis.

Implemented robust data storage solutions with Azure Storage, optimizing data accessibility and security for AgFirst's network of local farm credit associations.

Managed big data ecosystems utilizing Hadoop, HDFS, and Hive, facilitating the processing and analysis of large datasets related to farm credit and agricultural services.

Enhanced data analytics capabilities by integrating Pig, HBase, and Big Data technologies, supporting advanced analytical models for credit assessment and risk management.

Automated data workflows using Oozie and Sqoop, streamlining data transfer and transformation processes to improve operational efficiency.

Configured and maintained Zookeeper for cluster management, ensuring high availability and reliability of AgFirst's data processing infrastructure.

Configured and managed containerized applications using Docker and container orchestration platforms like Kubernetes, ensuring high availability, scalability, and resilience of microservices-based architectures.

Leveraged MapReduce and Cassandra for distributed data processing, enabling scalable analysis of agricultural data for insights and forecasting.

Utilized Eclipse and Oracle 10g for database management and application development, supporting AgFirst's financial services and reporting requirements.

Designed and executed SQL queries within Azure Databricks for data manipulation and reporting, ensuring data integrity and supporting strategic decisions.

Implemented performance testing frameworks using JMeter, optimizing system performance and reliability for critical data-driven applications.

Integrated Kafka for real-time data streaming, facilitating immediate data processing and enhancing AgFirst's operational responsiveness.

Developed and maintained scripting solutions in Python and Shell Scripting, automating routine data operations and enhancing system efficiency.

Utilized Golang for backend service development, improving the performance and scalability of AgFirst's data processing platforms.

Configured and monitored WebSphere and Tomcat servers, ensuring optimal application deployment and runtime environments for data services.

Implemented logging and monitoring with Splunk, providing real-time insights into system performance and operational health.

Enhanced API testing and service integration using SoapUI, ensuring robustness and reliability of AgFirst's data interfaces.

Conducted data quality checks and validations, employing rigorous methodologies to maintain high standards of data integrity and accuracy.

Environments Used: Azure Data Factory, Azure Data Lake, Azure Storage, Hadoop, HDFS, Hive, Pig, HBase, Big Data, Oozie, Sqoop, Zookeeper, MapReduce, Cassandra, Scala, Linux, Workbench, Eclipse, Oracle 10g, JMeter, Kafka, Python, Shell Scripting, Golang, WebSphere, Splunk, Tomcat, SoapUI.

Client: TechMojo Solutions, Hyderabad, India October 2014 - September 2016

Role: ETL Developer

Responsibilities:

Oversaw the installation, configuration, and maintenance of Apache Hadoop clusters, managing essential Hadoop Ecosystem components like Hive, Pig, HBase, Sqoop, Flume, Oozie, and Zookeeper to facilitate smooth application development.

Set up a high-performance six-node CDH4 Hadoop cluster on CentOS.

Enabled the efficient transfer of data between HDFS, Hive, and various RDBMS sources using Sqoop, enhancing data integration processes.

Orchestrated complex data processing workflows by defining job flows that automated the execution of multiple MapReduce and Pig jobs with Oozie.

Streamlined data analysis by importing log files into HDFS via Flume and organizing them into Hive tables for effective querying.

Ensured the Hadoop cluster's operational integrity by monitoring active MapReduce tasks, maintaining seamless performance.

Safeguarded data accuracy and integrity during the transfer of information from UNIX file systems to HDFS.

Leveraged the HBase-Hive integration to construct and query multiple Hive UDFs for detailed data analysis.

Engaged in API development for HBase to facilitate data cleansing and inter-table data transfers.

Enhanced data storage and retrieval by creating various Hive tables, applying techniques like Partitioning, Dynamic Partitioning, and Buckets.
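An illustrative sketch of a partitioned, bucketed Hive table with dynamic partitioning, issued here through spark.sql (table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-ddl-demo")
         .enableHiveSupport().getOrCreate())

# Partitioned and bucketed Hive table (names are placeholders)
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (order_id) INTO 8 BUCKETS
    STORED AS ORC
""")

# Enable dynamic partitioning and load from a hypothetical staging table
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("""
    INSERT INTO TABLE sales_part PARTITION (order_date)
    SELECT order_id, amount, order_date FROM sales_staging
""")
```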

Simplified batch processing through the development of Pig scripts and custom Pig UDFs, tailored to meet specific business needs.

Utilized the HBase Client API for effective database communication, ensuring streamlined operations with the HBase database.

Facilitated data uploads to HBase via multiple channels, including the HBase Shell, Client API, Pig, and Sqoop.

Demonstrated proficiency in the design, optimization, and maintenance of NoSQL databases, applying best practices in database management.

Crafted MapReduce programs in Python, making use of the Hadoop streaming API for enhanced data processing.
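A classic Hadoop Streaming word-count pair in Python of the kind referenced above (a generic illustration, not a specific production job):

```python
#!/usr/bin/env python3
# mapper.py - emits <word, 1> pairs; submitted with the hadoop-streaming jar,
# e.g. -mapper mapper.py -reducer reducer.py -input <in> -output <out>
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sums counts per word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```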

Created comprehensive unit tests for Hadoop MapReduce jobs with MRUnit, confirming the reliability and efficiency of the code.

Specialized in ETL processes, including analysis, design, development, testing, and implementation, focusing on query optimization and performance tuning.

Utilized tools like Cloudera Manager and Web UI for Hadoop cluster monitoring, maintaining optimal system performance and health.

Collaborated closely with application teams to manage operating system installations, Hadoop updates, patches, and version upgrades.

Adopted Maven for build processes and SVN for code versioning, promoting efficient development workflows.

Developed RESTful web services to enhance application functionalities and user experiences.

Implemented rigorous testing scripts to support a culture of test-driven development and continuous integration, ensuring high standards of code quality and reliability.

Environments Used: Hadoop, MapReduce, HDFS, HBase, Hive, Impala, Pig, SQL, Ganglia, Sqoop, Flume, Oozie, Unix, Maven, Eclipse.


