Senior Data Engineer
Bhuvana Pakkiru
Phone: +1-631-***-****
Email: pakkirubhuvana25@gmail.com
LinkedIn: https://www.linkedin.com/in/bhuvana-pakkiru-650a2a218
PROFESSIONAL SUMMARY:
Over 10 years of professional IT experience, with a strong focus on the Big Data ecosystem for the past six years. Proficient in Hadoop and its ecosystem components, including HDFS, Spark, Python, Scala, Zookeeper, YARN, Sqoop, Hive, Flume, Kafka, Apache Storm, and Spark Streaming. Additionally, possess 4 years of expertise in AWS Cloud platform services.
Solid working knowledge of the Amazon Web Services (AWS) Cloud Platform, with hands-on experience in services such as EC2, S3, VPC, IAM, DynamoDB, Redshift, AWS Glue, Lambda, EventBridge, CloudWatch, Databricks, Auto Scaling, Security Groups, CloudFormation, Kinesis, SQS, and SNS.
Extensive experience in Spark technology, including 3 years of writing Spark applications using PySpark and Scala and developing data pipelines with Kafka and Spark Streaming to store data in HDFS.
Skilled in cloud technologies, including AWS (EC2, S3, Redshift, CloudFormation, Step Functions, Lambda, Glue, EMR, CloudFront) and Azure (ADF, Azure DevOps), and in building efficient data processing and storage solutions.
Spearheaded the design and implementation of comprehensive CI/CD pipelines, ensuring the seamless delivery of high-quality software through automated testing, code analysis, and deployment automation.
Developed and optimized SQL queries for various data manipulation tasks, including extraction, transformation, and loading (ETL) processes.
Conducted regular training sessions for development teams on DevOps and CI/CD practices, contributing to the establishment of a DevOps-centric culture within the organization.
Extensively worked on Hadoop components, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, Spark, and MapReduce programming.
Experienced in implementing cluster coordination services through Zookeeper.
Enabled cross-cloud analytics by connecting Vantage Cloud with AWS data stores, providing a unified platform for querying and analyzing data across different cloud environments.
Specialized in Spark Streaming (Lambda Architecture) and Spark SQL, and adept at tuning and debugging Spark applications for optimal performance.
Skilled in building and optimizing data warehouses (Snowflake, Redshift) and data lakes (AWS S3, Azure Data Lake Storage), utilizing SQL databases (MySQL, SQL Server, PostgreSQL, Oracle, Teradata) and NoSQL databases (MongoDB, Cassandra) for efficient data retrieval and storage.
Skilled in Python programming, focusing on libraries such as Pandas, NumPy, and Matplotlib. Well-versed in machine learning algorithms and concepts, utilizing MLlib, scikit-learn, and Python for data preprocessing, regression, classification, and model selection.
Designed and developed data sourcing routines with a focus on data quality functions, including standardization, transformation, rationalization, linking, and matching.
Extensive systems integration experience, including designing and developing APIs, Adapters, and Connectors and integrating with Hadoop/HDFS, Real-Time Systems, Data Warehouses, and Analytics solutions.
Developed complex tools and solutions for data scientists and analysts implementing machine learning solutions, including orchestration, data pipelines, and infrastructure as code solutions for the Data Engineering team.
Quick learner of modern technologies with the ability to collaborate and communicate effectively with various teams throughout the project lifecycle, from design to production.
Strong experience in using MS Excel and MS Access to dump the data and analyze based on business needs.
Excellent working experience with Scrum/Agile and Waterfall project execution methodologies. Strong background in data analysis, proficient in gathering business requirements and handling requirements management.
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Impala, YARN, HUE, Oozie, Zookeeper, Apache Spark, Apache Storm, Apache Kafka, Sqoop, Flume.
Programming Languages: C, C++, Java, Python, Scala
Scripting Languages: Shell Scripting, JavaScript
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, SQL, PL/SQL, Teradata
NoSQL Databases: HBase, Cassandra, MongoDB
Hadoop Distributions: Cloudera, Hortonworks, EMR
Build Tools: Ant, Maven, Jenkins
Cloud: AWS, Azure, EMR, EC2, Data Catalog, Lambda, DynamoDB, Databricks, Redshift, Glue, Athena, S3, SNS, CloudWatch, SQS.
Version Control Tools: SVN, Git, GitHub, JIRA
PROFESSIONAL EXPERIENCE:
Broadridge Financials - Newark, NJ | April 2022 to Present
Senior Data Engineer
Responsibilities:
Automated data processing tasks using PowerShell scripting, ensuring efficient and repeatable workflows.
Designed and orchestrated data workflows using Azure Data Factory (ADF), ensuring seamless data integration.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back to source systems.
Created aggregation and partition logic using PySpark in Databricks to optimize query performance.
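A minimal sketch of this kind of PySpark aggregation and partition logic; the Databricks mount paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trade-aggregation").getOrCreate()

# Hypothetical input: one row per trade, columns are illustrative only.
trades = spark.read.parquet("/mnt/raw/trades")

daily_totals = (
    trades
    .withColumn("trade_date", F.to_date("trade_ts"))
    .groupBy("trade_date", "account_id")
    .agg(
        F.sum("notional").alias("total_notional"),
        F.count("*").alias("trade_count"),
    )
)

# Partition the output by date so downstream queries can prune files by trade_date.
(
    daily_totals
    .repartition("trade_date")
    .write.mode("overwrite")
    .partitionBy("trade_date")
    .parquet("/mnt/curated/daily_trade_totals")
)
```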
Utilized Azure Databricks to orchestrate data processing tasks, integrating with Azure services like Azure Data Lake Storage, Azure SQL Database, and Azure Synapse Analytics.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using SQL activities.
Designed and developed data integration pipelines in Azure Data Factory to ingest 50 TB of data daily from on-prem SQL servers to Azure SQL Data Warehouse.
Developed and optimized complex T-SQL queries and stored procedures within Azure Synapse, enhancing query performance and enabling efficient data retrieval and manipulation.
Developed scalable data analytics solutions with Databricks, optimizing performance and collaboration.
Utilized Databricks for big data processing and analytics, including Hadoop, CDH, Apache Spark, MapReduce, and Sqoop integrations.
Integrated Python applications with database connectors (e.g., SQLAlchemy, psycopg2) to ensure efficient data manipulation and interaction with relational and non-relational databases.
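A brief sketch of such a connector integration, assuming a hypothetical PostgreSQL database; the connection string, table, and column names are illustrative only:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical PostgreSQL connection string; psycopg2 serves as the driver.
engine = create_engine("postgresql+psycopg2://user:password@db-host:5432/analytics")

with engine.connect() as conn:
    # Parameterized query keeps the interaction safe and efficient.
    df = pd.read_sql(
        text("SELECT policy_id, premium FROM policies WHERE region = :region"),
        conn,
        params={"region": "NE"},
    )

print(df.head())
```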
Implemented NoSQL solutions using HBase and Pig for efficient data storage and querying.
Integrated data from various sources using Informatica, ensuring comprehensive data management.
Developed data pipelines and workflows using PySpark for data processing and transformation on distributed clusters.
Administered SQL Server databases, ensuring data integrity and high availability.
Worked with NoSQL databases, including Cassandra, to handle large-scale data storage.
Built and maintained Docker containers for Python applications, streamlining the deployment process and reducing environment-related issues in production.
Implemented real-time data streaming solutions with Apache Kafka, ensuring reliable data ingestion.
Deployed infrastructure as code with Terraform, streamlining cloud resource management.
Managed version control and collaborative development using GitHub, ensuring code integrity.
Implemented CI/CD pipelines with Azure DevOps, enabling continuous integration and deployment.
Developed and optimized data processing applications using Apache Spark, enhancing data analytics.
Utilized Python, Pandas, and NumPy for data analysis and manipulation, ensuring accurate data insights.
Managed incident and service requests with ServiceNow, ensuring timely resolution of issues.
Developed and optimized data processing algorithms using Hadoop MapReduce, ensuring efficient data processing.
Implemented machine learning models and algorithms using Spark MLlib and SparkSQL for scalable and distributed computing.
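A condensed sketch of a Spark MLlib pipeline of this kind, with a tiny in-memory table standing in for the real feature data queried through Spark SQL:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Tiny in-memory stand-in for the real feature table.
spark.createDataFrame(
    [(34, 1200.0, 2, 0), (52, 300.0, 1, 1), (41, 8700.0, 3, 0), (29, 150.0, 1, 1)],
    ["age", "balance", "num_products", "label"],
).createOrReplaceTempView("customer_features")

# Spark SQL selects the training set; MLlib assembles features and fits the model.
data = spark.sql("SELECT age, balance, num_products, label FROM customer_features")
assembler = VectorAssembler(inputCols=["age", "balance", "num_products"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(data)
model.transform(data).select("label", "prediction", "probability").show()
```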
Built and deployed machine learning models with TensorFlow and Kubeflow, enabling advanced analytics.
Implemented data security policies with Apache Ranger, ensuring data governance and compliance.
Monitored and visualized system performance using Grafana, ensuring proactive system management.
Created data visualizations with Matplotlib, enhancing data reporting and insights.
Managed data storage solutions with Azure Data Lake Storage (ADLS Gen2), ensuring scalable and secure data storage.
Implemented secure storage solutions using Azure Key Vault to manage sensitive information such as API keys, passwords, certificates, and cryptographic keys.
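A minimal sketch of retrieving a secret with the Azure Key Vault Python SDK; the vault URL and secret name are hypothetical:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical vault name; DefaultAzureCredential resolves a managed identity,
# environment credentials, or a developer login, so no secrets live in code.
vault_url = "https://example-data-vault.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Retrieve a database password at runtime instead of storing it in config files.
db_password = client.get_secret("sql-db-password").value
```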
Documented Python code using Sphinx and Markdown, ensuring that code was maintainable, well-organized, and easily understandable by other developers and stakeholders.
Containerized applications using Docker and orchestrated with Kubernetes, ensuring scalable deployments.
Developed data warehousing solutions with Snowflake, ensuring efficient data storage and retrieval.
Implemented the ELK Stack (Elasticsearch, Logstash, Kibana) for logging, monitoring, and analytics.
Configured and managed Azure Active Directory (AAD) for secure authentication and access control.
Created interactive dashboards and reports with Power BI, providing actionable business insights.
Used various Teradata load utilities for data loads and Unix shell scripting for file validation.
Applied Agile methodologies (Scrum) for project management, ensuring timely delivery and team collaboration.
MetLife - New York City, NY | July 2020 to March 2022
Senior Data Engineer
Responsibilities:
Leveraged Azure Functions to develop serverless applications, integrating HTTP Triggers and Application Insights for enhanced monitoring and load testing via Azure DevOps Services.
Established CI/CD pipelines using Docker, Jenkins, TFS, GitHub, and Azure Container Services, achieving streamlined deployments and operational efficiency.
Automated Azure infrastructure provisioning using Terraform, optimizing resource management for virtual machine scale sets in production environments.
Utilized Ansible for comprehensive configuration management, including infrastructure setup and application deployments. Integrated monitoring solutions using Nagios and ELK stack for real-time operational insights.
Migrated services from on-premises to Azure cloud environments and collaborated with development and QA teams to maintain high-quality deployments.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back to source systems.
Developed Spark applications using Spark SQL, PySpark, and Delta Lake in Databricks for data extraction.
Developed data pipelines using Python and Apache Airflow, automating data workflows and improving the accuracy and speed of reporting.
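A simplified Airflow 2.x DAG sketch illustrating this pattern; the DAG name and task bodies are placeholders, not the production pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder extract step; in practice this pulled data from source systems.
    print("extracting source data")


def transform():
    # Placeholder transform step; in practice this ran the reporting calculations.
    print("transforming data for reporting")


with DAG(
    dag_id="daily_reporting_pipeline",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```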
Utilized Databricks for collaborative and scalable data analytics and machine learning.
Designed and implemented large-scale ETL pipelines using Databricks on Azure, processing terabytes of data for advanced analytics and reporting.
Deployed the initial Azure components like Azure Virtual Networks, Azure Application Gateway, Azure Storage and Affinity groups.
Developed data processing pipelines using Dataflow, ensuring real-time and batch data processing capabilities.
Automated workflows with Apache Airflow and Luigi, enhancing data pipeline reliability and efficiency.
Implemented stream processing applications using Apache Flink, enabling real-time data analytics.
Implemented and optimized classification, regression, and clustering models in Python, using feature engineering and hyperparameter tuning techniques to increase model performance.
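A compact scikit-learn sketch of the tuning workflow, using synthetic data in place of the real feature-engineered dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the real feature-engineered dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Grid search over a small hyperparameter space with cross-validation.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```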
Designed and managed data flows with Apache NiFi, ensuring efficient data ingestion and transformation.
Developed scalable data processing applications using Apache Spark and Scala, optimizing performance.
Administered Cloudera Distribution for Hadoop (CDH), ensuring seamless data management and processing.
Leveraged Databricks for collaborative data analytics and machine learning, enhancing team productivity.
Developed and consumed REST APIs for data integration and exchange, enabling seamless data communication.
Worked with Hadoop ecosystems to process large datasets, ensuring efficient data processing.
Applied Agile methodologies (Kanban) for project management, ensuring timely delivery and team collaboration.
Documented processes and collaborated with teams using Confluence, ensuring transparent project tracking.
Implemented OAuth for secure authentication and authorization, ensuring data security.
Molina Healthcare - Bothell, WA | Jan 2018 to June 2020
Data Engineer
Responsibilities:
Architected and implemented an AWS-based data infrastructure seamlessly integrated with Snowflake, a cloud-based data warehouse platform.
Utilized various AWS services, including S3, EC2, Glue, and Lambda, in conjunction with Snowflake, establishing a comprehensive end-to-end data ecosystem.
Led the data model design and database migration deployment, ensuring the successful implementation of database objects and metadata across production platform environments (Dev, Qual, and Prod) on the AWS Cloud (Snowflake).
Developed data ingestion pipelines using Apache NiFi to extract, transform, and load data from diverse sources into Snowflake.
Implemented efficient data movement and synchronization between Snowflake and other AWS services using AWS Data Pipeline and Snowpipe.
Automated data transformations and orchestrated data flows between Snowflake and AWS services using AWS Lambda functions.
Configured Snowflake external tables to access data stored in Amazon S3, facilitating seamless data integration and analysis.
Created external and permanent tables in Snowflake on AWS, managing S3 bucket policies for storage and backup.
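A sketch of creating an S3-backed stage and external table through the Snowflake Python connector; connection parameters, the storage integration, and object names are hypothetical:

```python
import snowflake.connector

# Hypothetical connection parameters; in practice these came from a secrets store.
conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)

cur = conn.cursor()
# Stage pointing at the S3 bucket (assumes a pre-created storage integration).
cur.execute("""
    CREATE STAGE IF NOT EXISTS claims_stage
      URL = 's3://example-bucket/claims/'
      STORAGE_INTEGRATION = s3_integration
""")
# External table exposing the Parquet files for querying without loading them.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE ext_claims
      WITH LOCATION = @claims_stage
      FILE_FORMAT = (TYPE = PARQUET)
""")
cur.close()
conn.close()
```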
Implemented security measures, including AWS Identity and Access Management (IAM) roles and Snowflake's native security features, ensuring data privacy and compliance.
Monitored and optimized Snowflake and AWS services' performance in the data architecture using AWS CloudWatch.
Collaborated with stakeholders to define data models, schemas, and analytics requirements in Snowflake, aligning them with business objectives.
Designed and implemented data warehousing and data lake strategies using Snowflake and AWS services for scalable and cost-effective storage and analysis of large datasets.
Implemented data governance and data cataloging solutions using AWS Glue and Snowflake's metadata capabilities to enhance data discoverability and lineage.
Created Snowflake Resource monitoring and redesigned views in Snowflake to improve performance.
Utilized Amazon QuickSight, a business intelligence tool, to generate interactive visualizations and reports on data stored in Snowflake, facilitating data-driven decision-making.
Implemented disaster recovery and high availability measures for Snowflake, leveraging AWS services such as AWS Backup.
Collaborated with Snowflake support and AWS support teams to address technical issues and optimize performance within the AWS-Snowflake ecosystem.
Stayed abreast of the latest AWS and Snowflake technologies and best practices to ensure the data architecture remained current and aligned with industry standards.
Info Logitech Systems - Hyderabad, India | May 2014 to Sept 2017
Data Engineer
Responsibilities:
Worked on designing and implementing large-scale, highly scalable, and fault-tolerant Big Data systems using a variety of technologies such as Apache Spark (PySpark), Hadoop, Kafka, Sqoop, HDFS, Hue, Hive, HBase, and other NoSQL databases.
Worked with AWS services, including S3, EC2, EMR, Kinesis, Glue Data Catalog, Redshift, DynamoDB, Lambda Functions, Cloud Formation, and CloudWatch.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources in S3 (Parquet/text files) into AWS Redshift.
Performed data extraction, aggregations, and consolidation of Manufacturer/Retailer data within AWS Glue Data Catalog using Zeppelin Notebook and PySpark.
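An abbreviated AWS Glue PySpark sketch of this extraction-and-aggregation pattern; the catalog database, table, and S3 paths are hypothetical:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read a cataloged table (names are hypothetical) as a DynamicFrame, then switch
# to the Spark DataFrame API for the aggregation.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="campaign_db", table_name="retailer_sales"
)
df = dyf.toDF()

summary = df.groupBy("manufacturer", "retailer").agg(
    F.sum("sales_amount").alias("total_sales")
)
summary.write.mode("overwrite").parquet("s3://example-bucket/curated/retailer_summary/")
```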
Effectively utilized Amazon S3 as a robust and cost-effective storage solution for ETL processes, ensuring secure and scalable data storage.
Implemented and designed external tables with partitions using various tools, including Hive for on-premises solutions and AWS Athena and Redshift for cloud-based architectures.
Developed and deployed distributed computing Big Data applications using Spark, Hive, Sqoop, and Kafka on AWS Cloud.
Demonstrated proficiency in AWS Glue for serverless ETL, encompassing tasks such as data cataloging, data transformation, and orchestrating ETL workflows in a scalable manner.
Implemented an on-demand EMR launcher with custom spark-submit steps using S3 Event, SNS, KMS, Databricks, and Lambda functions.
Utilized CloudWatch logs to move application logs to S3 and create alarms based on exceptions raised by applications.
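A boto3 sketch of the log-export and alarm setup, with hypothetical log group, bucket, metric, and SNS topic names:

```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Export a window of application logs to S3 (bucket and log group are hypothetical).
logs.create_export_task(
    logGroupName="/app/etl-pipeline",
    fromTime=1672531200000,   # epoch millis of window start
    to=1672617600000,         # epoch millis of window end
    destination="example-log-archive-bucket",
    destinationPrefix="etl-pipeline",
)

# Alarm on a custom metric the application publishes when it raises exceptions.
cloudwatch.put_metric_alarm(
    AlarmName="etl-pipeline-exceptions",
    Namespace="ETL/Pipeline",
    MetricName="ExceptionCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:etl-alerts"],
)
```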
Demonstrated expertise in using AWS Glue for ETL processes, where Python scripts played a crucial role in orchestrating data transformations and data cataloging.
Utilized Python scripts for event-driven processing within AWS Lambda functions, responding to triggers and executing tasks seamlessly.
Conducted data scraping projects to gather and process external data, contributing to the enrichment of datasets used in AWS Data Engineering solutions.
Developed multiple Kafka Producers and Consumers tailored to precise requirements, ensuring seamless integration within the Kafka ecosystem.
Used Kafka for log aggregation, collecting physical log files from servers and placing them in a central location such as HDFS for processing.
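A minimal sketch of a matching producer/consumer pair using the kafka-python client; broker addresses, the topic, and the record payload are hypothetical:

```python
import json

from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

BROKERS = "broker1:9092"   # hypothetical broker address
TOPIC = "server-logs"      # hypothetical topic name

# Producer: ship log records from the servers into Kafka as JSON.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)
producer.send(TOPIC, {"host": "app01", "level": "ERROR", "msg": "timeout"})
producer.flush()

# Consumer: read the same topic and hand records to the HDFS landing process.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="log-landing",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # placeholder for writing the record to HDFS
    break
```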
Developed and maintained DBT models to transform raw data into structured and analytically valuable formats.
Integrated Teradata Vantage Cloud with AWS services like S3, Redshift, and RDS, enabling seamless data transfer and integration across cloud environments.
Extracted a real-time feed using Kafka and Spark Streaming, converting it to RDDs, processing the data as DataFrames, and saving it in Parquet format in HDFS/S3.
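A sketch of this Kafka-to-Parquet flow, shown with the Structured Streaming API (which requires the spark-sql-kafka connector package) rather than the original DStream/RDD approach; the broker, topic, schema, and S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Hypothetical schema for the JSON feed.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "market-feed")
    .load()
)

# Parse the Kafka value bytes into typed columns.
events = raw.select(
    F.from_json(F.col("value").cast("string"), schema).alias("event")
).select("event.*")

# Land the parsed feed as Parquet; checkpointing makes the stream restartable.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/landing/market-feed/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/market-feed/")
    .start()
)
```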
Employed Spark and Spark-SQL to read Parquet data and create tables in Hive using the Scala API.
Spearheaded the design and implementation of comprehensive CI/CD pipelines, ensuring the seamless delivery of high-quality software through automated testing, code analysis, and deployment automation.
Conducted regular training sessions for development teams on DevOps and CI/CD practices, contributing to the establishment of a DevOps-centric culture within the organization.
Implemented Spark with Scala, leveraging the power of DataFrames and the Spark SQL API to achieve accelerated data processing. This approach facilitated efficient data manipulation, analysis, and retrieval, enhancing overall performance in large-scale data solutions and applications.
Designed and implemented ETL pipelines using Vantage Cloud, ensuring efficient data ingestion, transformation, and loading from AWS data lakes and warehouses.
EDUCATION:
Bachelor of Technology in Electronics and Communication Engineering, Sathyabama Institute of Science and Technology, 2014.