NIKHIL EATLPACKA Email: **********@*****.******
Phone: 469-***-**** Portfolio: LinkedIn
PROFESSIONAL SUMMARY:
Accomplished Sr. Data Engineer with 9+ years of hands-on experience in designing and implementing robust data processing solutions with Azure and AWS, certified as an Azure Data Engineer.
Skilled in creating ETL pipelines using tools such as Azure Data Factory, Databricks, Azure Synapse Analytics, Azure SQL Database, AWS Glue, AWS Lambda, and Amazon Redshift for both real-time & batch data processing.
Experienced in designing, developing, documenting, and integrating applications using Spark, Hadoop, Hive, Pig, and Apache Airflow across big data platforms such as MapR, Snowflake, and Databricks.
Proficient in various programming languages, including Python, Java, PowerShell, and T-SQL.
Certified Spark Developer, proficient in Spark Core, pair RDDs, Spark on YARN, Spark SQL, and Spark Streaming for processing complex data, leveraging Spark's in-memory computing capabilities with Scala.
Expert in Big Data solutions with Python, Hadoop, Spark, Hive, Azure, AWS, and the Hortonworks and Cloudera Hadoop distributions.
Strong database management skills with hands-on experience in Aurora, DynamoDB, PostgreSQL, MongoDB, MS SQL, and MySQL, optimizing data storage and retrieval processes.
Proficient in containerization and orchestration tools such as Docker, ECS, EKS, OpenShift, and Kubernetes.
Well-versed in ETL processes using tools like Apache NiFi, Microsoft SSIS, Azure Data Factory, and Erwin Data Modeler for smooth data extraction, transformation, and loading.
Proficient in version control and CI/CD practices, utilizing Git, GitLab, Bitbucket, GitHub, AWS CodePipeline, Azure DevOps, and Jenkins for streamlined development workflows.
Strong security and access management background, utilizing AWS IAM, AWS KMS, Azure AD, and OAuth to implement robust security measures.
Proficient in leveraging data visualization and reporting tools, including Tableau, Looker, and Power BI, to facilitate clear and compelling communication of intricate data insights.
In-depth knowledge of real-time data streaming technologies, including Kinesis and Kafka.
Skilled in logging and monitoring tools like CloudWatch, ELK Stack, and Nagios.
Proficient in various data formats, including Avro, JSON, XML, and XSLT.
Expertise in data processing and analysis using tools such as Pandas, NumPy, TensorFlow, PyTorch, and Mahout for efficient data manipulation and extracting valuable insights.
Well-versed in project management and Software Development Lifecycle methodologies, including Agile, Scrum, traditional Waterfall, and Kanban, using tools such as Jira.
Exceptional problem-solving skills demonstrated through successfully resolving complex challenges in data engineering projects.
Strong communication and collaboration abilities, fostering positive relationships with cross-functional teams to achieve project goals effectively.
Proven adaptability and flexibility in dynamic work environments, ensuring successful integration of new technologies and methodologies.
Technical Skills:
Programming Languages: Java, Python, R, Spark, PowerShell, and T-SQL.
Cloud Platforms: AWS and Azure.
Data Warehousing: Azure Synapse Analytics, Azure SQL Data Warehouse, and Amazon Redshift.
ETL Tools: Azure Data Factory, Azure Databricks, Informatica, AWS Glue, AWS Data Pipeline.
Big Data Technologies: Hadoop, Hive, Impala, Sqoop, Pig, Oozie, HBase, Spark, PySpark, Kafka.
Containerization and Orchestration: Docker, ECS, EKS, OpenShift, Kubernetes.
Database Management: Aurora, DynamoDB, PostgreSQL, Oracle, MongoDB, MS SQL, MySQL.
Version Control and CI/CD: Git, Bitbucket, GitHub, AWS CodePipeline, Azure DevOps, Jenkins.
Data Processing and Analysis: Pandas, NumPy, TensorFlow, PyTorch, Mahout.
Data Visualization and Reporting: Power BI, Tableau, Looker.
Microsoft Tools: SSIS, SSAS, SSRS and Microsoft Excel.
Security and Access Management: AWS IAM, AWS KMS, Azure AD, OAuth.
Logging and Monitoring: CloudWatch, ELK Stack, Nagios.
Platforms: Windows, Linux.
CERTIFICATIONS:
MICROSOFT: Azure Data Engineering Certified
SNOWFLAKE: Snowflake SnowPro Core certified
DATABRICKS: Apache Spark Developer Certified
MICROSOFT: Power BI Data Analyst certified
SALESFORCE: Tableau Business Intelligence Analyst Professional Certificate
GOOGLE: Google Data Analytics Professional Certificate
PROFESSIONAL EXPERIENCE:
Client: TransAmerica Corp, Cedar Rapids, Iowa Jul 2023 – Present
Role: Sr Data Engineer
Project: Data Nexus (Platform for Modernizing and Integrating Data) – Modernized outdated systems for accessing a diverse range of applications and created new capabilities for defining and enhancing data models, leveraging open-source tools and the Big Data stack. Deployed versatile data integration solutions for corporate data warehousing. Managed data pipelines, integration strategies, audits, data platforms, data governance, and PCI compliance to deliver resilient data products and dashboards that supported the success of business and corporate clients.
Created Hive/Spark external tables for each source table in the Data Lake and wrote Hive SQL and Spark SQL to parse the logs and structure them in a tabular format, facilitating effective querying of the log data.
Developed Spark applications using PySpark and Spark SQL to extract, transform, and aggregate data from multiple file formats for analysis (see the illustrative sketch following this role's Environment line).
Designed and developed ETL/ELT frameworks using Azure Data Factory and Azure Databricks notebooks.
Created generic Databricks notebooks for performing data cleansing.
Developed high-performance data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.
Developed Spark applications using Scala for smooth Hadoop transitions; hands-on experience writing Spark jobs and Spark Streaming applications using Scala and Python.
Created ETL processes using SSIS to transfer data from heterogeneous data sources into data warehouse systems through multiple transformation steps.
Refactored on-premises SSIS packages into Azure Data Factory pipelines.
Created CI/CD pipelines using Azure DevOps.
Ingested and transformed source data using Azure Data flows and Azure HDInsight.
Built data models and aggregated tables that are consumed by reports.
Created Azure Functions to ingest data at regular intervals.
Created Databricks notebooks for performing complex transformations and integrated them as activities in ADF pipelines.
Extracted, transformed, and created data marts for the BI solution using Azure Data Factory, Matillion, AWS Glue, and Informatica Cloud as ETL tools for Redshift, Snowflake, and Azure Synapse.
Used Databricks notebooks to develop, test, and analyze Spark jobs before scheduling customized Spark jobs.
Estimated cluster sizes and handled monitoring and troubleshooting of the Azure Databricks cluster.
Wrote complex SQL queries for data analysis and extraction of data in the required format.
Created data visualizations using Power BI for ad hoc reporting.
Built Talend Cloud generic framework solutions using Java and SQL as needed.
Worked on internal and external aggregated pipelines to make them available to end users.
Migrated data from SQL Server and Oracle to Delta Lake using ADF.
Enhanced the functionality of existing ADF pipelines by adding new data transformation logic.
Worked on Spark jobs for data pre-processing, validation, normalization, and transmission.
Optimized code and configurations for performance tuning of Spark jobs.
Worked with highly unstructured and semi-structured large data sets to aggregate and report on them.
Acted as a liaison between the team and data owners, gathering requirements and responding to queries.
Environment: Azure Synapse Analytics, Azure Hyperscale, Azure Data Factory, Azure Databricks, Azure Synapse Studio, SQL Server, Informatica PowerCenter 9.6/9.5/9.1, Oracle 12c/11g, Toad/SQL Developer, SQL, Unix shell scripting/tuning, PL/SQL, JSON, Jira, Slack, Confluence.
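Illustrative only – a minimal PySpark sketch of the multi-format extraction, transformation, and aggregation pattern referenced above; all paths, schemas, and column names are hypothetical placeholders, not the client's actual pipeline.

```python
# Hypothetical PySpark extraction/transformation/aggregation job.
# Paths, schemas, and column names are placeholders for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi_format_aggregation").getOrCreate()

# Extract from multiple file formats (hypothetical ADLS paths).
orders_json = spark.read.json("abfss://raw@account.dfs.core.windows.net/orders/json/")
orders_parquet = spark.read.parquet("abfss://raw@account.dfs.core.windows.net/orders/parquet/")

# Transform: align schemas, union, cleanse, and derive a date column.
orders = (
    orders_json.select("order_id", "customer_id", "amount", "order_ts")
    .unionByName(orders_parquet.select("order_id", "customer_id", "amount", "order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
    .dropDuplicates(["order_id"])
)

# Aggregate: daily totals per customer, written back as Parquet for reporting.
daily_totals = orders.groupBy("customer_id", "order_date").agg(
    F.sum("amount").alias("total_amount"),
    F.count("order_id").alias("order_count"),
)
daily_totals.write.mode("overwrite").partitionBy("order_date").parquet(
    "abfss://curated@account.dfs.core.windows.net/orders/daily_totals/"
)
```

In practice, logic like this would typically live in a parameterized Databricks notebook and be invoked as an activity within an ADF pipeline.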
Client: Truist Bank Sep 2021 to Jun 2023
Role: Sr Data Engineer
Project: Data Migration and System Integration – The primary objective of this project is to migrate the bank's cost profitability management system to a new Hadoop-based platform, known as the Enterprise Performance Management System. This new system will provide data for performance reporting (MIS reporting). The project encompasses various financial processes such as offshoring, tax provisions, and specific entries needed to compute financial statements like balance sheets and P&L statements. Additionally, it aims to generate performance reports at the downstream level.
Designed and implemented static and dynamic partitioning and bucketing in Hive.
Created Hive tables on top of Parquet files for the execution of Spark jobs per business requirements.
Developed various data processing pipelines using the Spark Dataset API, DataFrame API, and Spark SQL.
Implemented Spark jobs that handle data transformations using analytical/window functions.
Analyzed and investigated data variances/issues in month-end and ad hoc reports.
Worked on Airflow DAGs that monitor and alert on scheduled jobs and data pipelines, ensuring they operate as expected and taking corrective action when necessary so that data is delivered on time and at high quality (see the illustrative sketch following this role's Environment line).
Built business knowledge by collaborating with teams and training people as and when the need arose.
Gained solid working experience in data governance practices.
Environment: Cloudera, Spark, PySpark, Hadoop, HDFS, Hive, Impala, YARN, Java, Python, MySQL, Drools, Maven, JUnit, Jenkins, Bitbucket, Git, SonarQube, Jira, MariaDB, Airflow, Tableau, Alluxio, Unravel, Collibra, Parquet, AWS S3, Azure Databricks, Data Factory, ADLS Gen2, PyCharm.
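Illustrative only – a minimal Airflow DAG sketch of the scheduled-job monitoring and alerting pattern referenced above; the DAG id, schedule, alert address, and task commands are hypothetical placeholders rather than the actual production DAGs.

```python
# Hypothetical Airflow DAG illustrating scheduled pipeline runs with SLA/failure alerting.
# DAG id, schedule, alert address, and task commands are placeholders for illustration only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email": ["data-alerts@example.com"],   # hypothetical alert address
    "email_on_failure": True,
    "sla": timedelta(hours=2),              # alert if a task exceeds its SLA
}

with DAG(
    dag_id="epm_month_end_pipeline",        # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",          # daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="ingest_source_extracts",
        bash_command="spark-submit /opt/jobs/ingest_extracts.py",
    )
    transform = BashOperator(
        task_id="run_spark_transformations",
        bash_command="spark-submit /opt/jobs/transform_epm.py",
    )
    validate = BashOperator(
        task_id="validate_outputs",
        bash_command="python /opt/jobs/validate_outputs.py",
    )

    # Linear dependency chain: ingest, then transform, then validate.
    ingest >> transform >> validate
```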
Client: MetLife, New York City, NY Nov 2020 to Aug 2021
Role: Data Engineer
Project: Implementation of a Revenue Management System – The revenue management system is a distribution model that allocates revenue among stakeholders and other contributors. The total revenue model accounts for manufacturing costs and marketing expenses. The system ensures continuous support and builds a long-term relationship between the bank and its stakeholders.
Created external tables and added dynamic partitions within Hive.
Handled Schema creation and maintenance.
Wrote, debugged, and optimized SQL queries and ETL jobs to reduce execution windows and resource utilization.
Transformed stored data by writing Spark/PySpark jobs based on business requirements.
Worked with custom methodologies.
Tuned applications for high performance and throughput.
Implemented unit test cases using JUnit & performed quality checks using SonarQube.
Wrote Bash scripts to automate the execution of Spark jobs.
Collaborated with stakeholders, including the Product, Data Engineering, and Agile Delivery teams, in an agile environment to resolve data-related issues and support data delivery needs.
Performed root cause analysis on internal and external data and processes to answer business questions and identify opportunities for improvement.
Environment: Cloudera, Spark, Hadoop, Hive, Impala, Parquet, Java, MySQL, Maven, JUnit, Jenkins, SonarQube, IBM Tivoli, Jira, MariaDB.
Client: Raymond James, West Haven, CT Dec 2019 to Oct 2020
Role: Data Engineer
Project: BigTapp Analytics – The project aims to build a scalable & efficient data processing pipeline that ingests streaming data, processes it in real-time, and loads it into a data warehouse for analytics and business intelligence. The data pipeline will use Kafka for data ingestion, AWS EMR for processing with Hadoop and Spark, and Snowflake for data storage and analytics.
Used Kafka for building real-time streaming data pipelines that reliably move data between systems and applications (see the illustrative sketch following this role's Environment line).
Used AWS EMR to process large data volumes with Apache Hadoop and Apache Spark.
Used the Snowflake data warehousing platform for the storage and analysis of data.
Engaged with stakeholders to define both functional and non-functional requirements.
Designed and implemented a Star Schema that simplified complex sales data analysis.
Designed and implemented data models in Google BigQuery and orchestrated ETL workflows using Google Dataflow.
Developed sentiment analysis models leveraging TensorFlow and BERT (Bidirectional Encoder Representations from Transformers) for enhanced natural language processing.
Processed and analyzed over 80,000 customer reviews using the Databricks Unified Analytics Platform for machine learning.
Implemented a comprehensive data quality framework using Apache Atlas for metadata management and governance, combined with Apache Kafka for real-time data integration.
Enabled advanced analytics through the data warehouse, providing insights into sales trends and product performance, which supported informed decision-making using Power BI.
Developed a Sales Data Warehouse, designing a Star Schema data model and automating the ETL process with Python to streamline sales analytics and reporting in Power BI.
Environment: Spark, MySQL, SQL scripting, Python, Power BI, automated scheduling with cron jobs.
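Illustrative only – a minimal Spark Structured Streaming sketch of the Kafka ingestion pattern described in this project; the broker, topic, schema, and sink paths are hypothetical placeholders. It assumes the spark-sql-kafka connector is available on the cluster, and the Snowflake load is treated as a separate downstream step.

```python
# Hypothetical Spark Structured Streaming job reading events from Kafka.
# Broker addresses, topic, schema, and sink paths are placeholders for illustration only.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw Kafka stream and parse the JSON payload.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "market-events")                # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)
events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Write micro-batches to Parquet in S3; a downstream COPY/Snowpipe step
# would then load these files into Snowflake for analytics.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://analytics-bucket/events/")                # hypothetical path
    .option("checkpointLocation", "s3://analytics-bucket/checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```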
Client: NEOJORA SOLUTION – Bangalore, India Jun 2013 to Jul 2018
Role: Data Engineer/Data Analyst
Project: Amogh E-Commerce Analytics Streaming Pipeline – A comprehensive data infrastructure project designed to empower an e-commerce platform with robust analytics capabilities. By leveraging cutting-edge technologies and methodologies, the project enhances data quality, processing efficiency, and scalability across various data-related operations.
Applied Pandas & NumPy for data manipulation and analysis, contributing to data quality & insights generation.
Utilized Terraform for infrastructure as code (IaC) to automate the provisioning and deployment of data-related resources.
Implemented containerization and orchestration using OpenShift and Kubernetes for scalable and portable deployment of data applications.
Managed and optimized relational databases, including PostgreSQL and NoSQL databases such as MongoDB, for diverse data storage requirements.
Worked with Hortonworks and Snowflake to design and implement scalable and high-performance data storage solutions.
Implemented Azure Data Factory (ADF) to orchestrate and manage workflows efficiently for seamless data movement across various platforms.
Collaborated on Azure DevOps projects for version control, continuous integration, and continuous deployment (CI/CD).
Developed and maintained Python and PowerShell scripts for automation, optimizing data processing tasks and enhancing workflow efficiency.
Ensured secure and efficient management of credentials using Azure Key Vault and Azure Active Directory.
Applied Agile and Kanban methodologies for project management, adaptability, and efficient delivery of data solutions.
Integrated Kafka for real-time data streaming, enabling timely processing and analysis of streaming data.
Collaborated with cross-functional teams to establish and enforce data quality and governance standards and created interactive reports and dashboards using Power BI for data-driven decision-making.
Optimized data processing workflows using Databricks and HDInsight to utilize Hadoop and Spark technologies efficiently.
Utilized TensorFlow and PyTorch for machine learning model development and integration into data pipelines.
Managed project tracking and tasks using JIRA, facilitating effective collaboration & project progress monitoring.
Automated deployment and configuration management using Ansible, streamlining data engineering operations.
Worked with JSON for data interchange between systems, ensuring compatibility and seamless data transfer.
Utilized GIT for version control, ensuring organized and collaborative development practices.
Designed and optimized data storage and retrieval with Azure Data Lake Storage for efficient data organization and accessibility.
Implemented ELK (Elasticsearch, Logstash, Kibana) stack for centralized logging & monitoring of data pipelines.
Environment: Azure, Python, PowerShell, Pandas, NumPy, TensorFlow, PostgreSQL, MongoDB, Snowflake, Spark, Hadoop, Kafka, Azure AD, PyTorch, Agile, Kanban, Terraform, Azure DevOps, OpenShift, Kubernetes, JSON, JIRA, GIT, ELK, Power BI.
EDUCATION
Master of Science in Business Analytics.
Bachelor of Science in Computer Science, Statistics & Mathematics.