
Vinod Yadav Kannaboina | Data Engineer

ad5en0@r.postjobfree.com | +1-913-***-**** | LinkedIn

SUMMARY

• Seasoned Data Engineer with over 5 years of experience designing, implementing, and managing end-to-end ETL pipelines for big data processing using both on-premises Hadoop frameworks and Azure and AWS cloud environments. Proficient in optimizing Hadoop clusters for maximum performance, ensuring data security compliance, and improving data storage performance in Azure and AWS environments. Skilled at troubleshooting and resolving issues with Azure and AWS monitoring tools. Excels at working with cross-functional teams and is fluent in a variety of programming languages and big data processing frameworks.

EXPERIENCE

Cloversoft - Raleigh, NC January 2024 - Present

Data Engineer

• Experienced IT professional proficient in Data Engineering and analytical programming with Java, Python, SQL, R, Scala, and HQL. Developed UNIX Shell scripts to automate repetitive database processes and to validate data files.

• Utilized AWS and Azure services such as EMR, S3, RDS, Lambda, Redshift, Glue, Athena, Databricks, Data Factory, Azure Functions and Airflow, along with Jupyter notebooks, to address diverse data engineering challenges.

• Developed serverless functions using AWS Lambda in Python and Node.js to handle various tasks including data processing, file manipulation, and API integrations.
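
As an illustration of the Lambda work described above, here is a minimal Python handler sketch for an S3 object-created event; the bucket layout, the processed/ output prefix, and the simple filtering step are hypothetical placeholders rather than the original project's logic.

    import json
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Iterate over the S3 records delivered by the event notification.
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            rows = json.loads(body)
            # Hypothetical cleanup step: drop rows without an id.
            cleaned = [row for row in rows if row.get("id") is not None]
            s3.put_object(
                Bucket=bucket,
                Key=f"processed/{key}",
                Body=json.dumps(cleaned).encode("utf-8"),
            )
        return {"statusCode": 200, "body": f"processed {len(records)} objects"}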

• Proven expertise in Agile development methodologies, contributing actively to sprint planning, retrospective reviews, and daily stand-ups. Proficient in utilizing Agile tools like JIRA and Confluence to ensure efficient collaboration and prioritize tasks within cross-functional teams.

• Leveraged core Python and Django middleware to build a web application, and utilized PySpark and Python to create core engines for data validation and analysis.
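
A minimal sketch of the kind of PySpark data-validation engine mentioned above; the input path, column names, and rules are hypothetical examples, not the original schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("data-validation").getOrCreate()

    # Hypothetical input for illustration only.
    df = spark.read.option("header", True).csv("s3://example-bucket/input/orders.csv")

    # Rule-based checks: nulls in the key column and non-positive amounts.
    checks = {
        "null_order_id": F.col("order_id").isNull(),
        "bad_amount": F.col("amount").cast("double") <= 0,
    }

    # Count how many rows fail each rule.
    summary = df.select(
        *[F.sum(F.when(cond, 1).otherwise(0)).alias(name) for name, cond in checks.items()]
    )
    summary.show()

    # Quarantine failing rows so downstream loads stay clean.
    failed = df.filter(checks["null_order_id"] | checks["bad_amount"])
    failed.write.mode("overwrite").parquet("s3://example-bucket/quarantine/orders/")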

• Implemented data transformation solutions using IBM DataStage, including designing, developing, and deploying ETL (Extract, Transform, Load) processes to integrate and transform data from heterogeneous sources.

• Developed and maintained PostgreSQL, MySQL, SQL Server, and Oracle databases to store and manage large volumes of structured data efficiently. Designed database schemas, optimized queries, and ensured data integrity to meet business requirements.

• Experienced in developing stored procedures, functions, triggers, and packages using PL/SQL in SQL-based environments to implement business logic and enhance database functionality.

• Implemented and optimized real-time data processing workflows on Snowflake, leveraging its unique architecture for seamless integration with streaming data sources.

• Proven proficiency in ETL development using SSIS, excelling in designing and optimizing data workflows for efficiency and data integrity.

• Implemented event-driven architectures by integrating AWS Lambda with various AWS services such as S3, DynamoDB, SNS, and SQS.

• Developed and maintained Kafka producer and consumer applications using Kafka client libraries (Java, Python, or Scala) to publish and consume messages from Kafka topics, implementing message serialization, error handling, and batching mechanisms for optimal performance.
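
Since the bullet above covers Kafka client work in several languages, here is a hedged Python sketch using the kafka-python library; the broker address, topic, and consumer group are placeholders.

    import json

    from kafka import KafkaConsumer, KafkaProducer

    BROKERS = ["localhost:9092"]  # placeholder bootstrap servers
    TOPIC = "orders"              # placeholder topic

    # Producer with JSON serialization, retries, and light batching via linger_ms.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=50,
        retries=3,
    )
    producer.send(TOPIC, {"order_id": 123, "amount": 42.5})
    producer.flush()

    # Consumer with JSON deserialization and per-message error handling.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="orders-validator",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        try:
            print(message.value)
        except Exception as exc:
            # Real code would route bad records to a dead-letter topic.
            print(f"failed at offset {message.offset}: {exc}")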

• Implemented and optimized AWS infrastructure using services such as Amazon EC2, Amazon S3, Amazon RDS, Amazon VPC, and Amazon Route 53, aligning with best practices for security, reliability, and performance.

• Developed and maintained ETL pipelines in Snowflake using Snowpipe, SnowSQL, or third-party integration tools, ensuring efficient data ingestion, transformation, and loading processes from various source systems.
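
A hedged sketch of the Snowflake loading pattern referenced above, using the snowflake-connector-python package to issue a SnowSQL-style COPY INTO from an external stage; the account, stage, and table names are placeholders, and an actual Snowpipe would instead be defined server-side with CREATE PIPE.

    import snowflake.connector

    # Placeholder connection parameters; real values would come from a secrets store.
    conn = snowflake.connector.connect(
        account="xy12345",
        user="ETL_USER",
        password="***",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )

    try:
        cur = conn.cursor()
        # Load staged files into the target table; @RAW_STAGE is a hypothetical stage.
        cur.execute("""
            COPY INTO RAW.ORDERS
            FROM @RAW_STAGE/orders/
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            ON_ERROR = 'CONTINUE'
        """)
        print(cur.fetchall())  # per-file load results
    finally:
        conn.close()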

• Leveraged Glue Data Catalog for metadata management, enabling seamless integration with other AWS services such as Amazon Redshift, Athena, and S3.

• Developed interactive dashboards and reports using Power BI to visualize key performance metrics and trends, enabling stakeholders to make data-driven decisions.

• Designed and implemented end-to-end ETL pipelines on both on-premises Hadoop clusters and Azure cloud, utilizing Apache Hadoop tools (HDFS, MapReduce, Hive, Sqoop) for data processing and storage, alongside Azure Data Lake Storage and Azure Blob Storage for scalable and reliable cloud-based storage solutions.

• Developed and deployed data pipelines in Azure Data Factory to orchestrate and automate data workflows, ensuring seamless data integration across diverse sources.

• Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into RDBMS through Sqoop.

• Developed Python and PySpark applications for data analysis, including PySpark code for AWS Glue jobs and EMR.
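
A minimal skeleton of the kind of AWS Glue PySpark job mentioned above, following the standard awsglue boilerplate; the catalog database, table, and output path are hypothetical.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Glue Data Catalog (hypothetical database/table names).
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders"
    )

    # Convert to a Spark DataFrame for SQL-style transformations.
    df = dyf.toDF().dropDuplicates(["order_id"]).filter("amount > 0")

    # Write the cleaned data back to S3 as partitioned Parquet.
    df.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3://example-bucket/curated/orders/"
    )

    job.commit()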

• Proficient in utilizing Python libraries including Pandas, NumPy, SQLAlchemy, and PySpark for data manipulation, analysis, and integration.

Accenture - Hyderabad, India June 2022 - December 2022

Data Engineer

• Experienced IT professional proficient in Data Engineering and analytical programming with Java, Python, SQL, R, Scala, and HQL. Developed UNIX Shell scripts to automate repetitive database processes and to validate data files.

• Designed and implemented end-to-end ETL pipelines on both on-premises Hadoop clusters and Azure cloud, utilizing Apache Hadoop tools (HDFS, MapReduce, Hive, Sqoop) for data processing and storage, alongside Azure Data Lake Storage and Azure Blob Storage for scalable and reliable cloud-based storage solutions.

• Expertise in designing and maintaining data warehousing solutions, including leveraging Azure Synapse, Azure Data Factory, Azure SQL Database and Azure Blob Storage.

• Managed and maintained large-scale Hadoop clusters, overseeing all aspects of cluster operations, performance optimization, and resource utilization to ensure seamless data processing.

• Used Azure Data Factory extensively for ingesting data from different relational and non-relational source systems to meet business functional requirements, implementing data pipelines in Azure Data Factory.

• Employed data masking, tokenization, and anonymization techniques to protect sensitive data elements within databases, ensuring compliance with regulatory requirements and minimizing exposure to unauthorized access.

• Implemented CI/CD pipelines integrating Apache Kafka, Jenkins, Splunk, Maven, Docker, Kubernetes, Terraform and Gradle for seamless deployment and monitoring of applications. Leveraged DataStage ETL for efficient data processing within the pipelines, ensuring reliable and scalable deployment workflows.

• Implemented and configured Hadoop ecosystem components including HDFS, YARN, MapReduce, Apache Hive, Apache Spark, and HBase, ensuring proper integration and interoperability within the cluster environment.

• Developed Spark code using Scala and Spark SQL for faster testing and data processing. Wrote programs in Spark using Scala and Python for data quality checks.

• Implemented Unix shell scripts to copy/move data from the local file system to HDFS and Azure Blob Storage, and to perform Hadoop ETL functions such as running Sqoop jobs, creating external/internal Hive tables, and initiating HQL scripts.

• Proficient in implementing data security measures and compliance controls to meet regulatory requirements such as GDPR, CCPA, and PCI DSS. Experience with encryption technologies including TLS/SSL, AES, and RSA for securing data in transit and at rest. Strong understanding of data security policies, procedures, and documentation.

• Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into the HDFS.

• Used Hive Queries to import data into Microsoft Azure cloud and analyzed the data using Hive Scripts.

• Performed ETL jobs to integrate the data to HDFS using Informatica. Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.

• Stored structured and semi-structured data in tabular formats within the Hadoop ecosystem using Hive tables and Hive SerDes (Serializer/Deserializer).
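
A small, hedged PySpark illustration of the Hive table plus SerDe pattern described above; the database, table, and location are placeholders, and it assumes the Hive JSON SerDe jar (hive-hcatalog-core) is available on the cluster.

    from pyspark.sql import SparkSession

    # Hive-enabled Spark session; all names below are illustrative.
    spark = (
        SparkSession.builder.appName("hive-serde-demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    # External Hive table over JSON files using the Hive JSON SerDe.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_db.click_events (
            user_id STRING,
            event_type STRING,
            event_ts TIMESTAMP
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        STORED AS TEXTFILE
        LOCATION '/data/raw/click_events/'
    """)

    spark.sql(
        "SELECT event_type, COUNT(*) AS events FROM raw_db.click_events GROUP BY event_type"
    ).show()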

• Utilized Informatica PowerCenter to extract data from various sources, transform it according to business rules and requirements, and load it into target systems, ensuring data quality, consistency, and reliability throughout the process.

• Integrated Informatica PowerExchange with enterprise databases, mainframe systems, and external data sources to extract and ingest data in real-time/batch mode, enabling timely and accurate data synchronization and replication.

• Leveraged workflow management tools like Apache Airflow to design and manage complex data pipelines, orchestrating the execution of ETL tasks and data loading processes in a scalable and fault-tolerant manner.
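
A minimal Airflow DAG sketch in the spirit of the bullet above; the DAG id, schedule, and the extract/load callables are hypothetical.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(**context):
        # Hypothetical extraction step; real code would pull from a source system.
        print("extracting source data")

    def load(**context):
        # Hypothetical load step; real code would write to the warehouse.
        print("loading into the warehouse")

    default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # Retries plus the explicit dependency below provide basic fault tolerance.
        extract_task >> load_task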

• Proficient in performing data extraction, ingestion, and processing of large datasets, as well as data modeling and schema design, leveraging appropriate tools such as Apache Spark, Flink, Apache Kafka, and Apache Airflow (DAGs) for efficient and scalable data engineering workflows.

• Implemented and configured Azure Monitor and Azure Log Analytics to monitor and analyze the performance, availability, and security of cloud-based infrastructure and services within the Azure environment.

• Wrote complex SQL scripts and PL/SQL packages to extract data from various source tables of the data warehouse.

• Developed and optimized HQL (Hive Query Language) and SQL queries to extract, transform, and analyze data stored in Azure environments, ensuring efficient data processing and retrieval for analytical and reporting purposes.

• Good understanding and hands-on experience in setting up and maintaining NoSQL databases like MongoDB, Cassandra, Elasticsearch, DynamoDB, and HBase, and SQL databases like MySQL, PostgreSQL, SQL Server, Oracle, DB2, Amazon RDS, Google Cloud SQL, and Snowflake.

• Worked with highly structured and semi-structured data sets of 45 TB in size (135 TB with replication factor of 3).

• Expertise in designing and maintaining data warehousing solutions leveraging AWS Redshift, AWS Glue, Amazon RDS (Relational Database Service), and Amazon S3 for storage.

• Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics. Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

• Implemented best practices for Terraform code organization, versioning, and documentation to facilitate collaboration and maintainability.

• Designed, developed, and maintained DAGs (Directed Acyclic Graphs) to schedule and monitor data workflows, ensuring data integrity and timely execution using Airflow.

AirFi.in - Hyderabad, India January 2021 - May 2022

Data Engineer

• Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics. Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW, Delta Lake) and processed the data in Azure Databricks.

• Proficient in performing data extraction, ingestion, and processing of large datasets, as well as data modeling and schema design, leveraging appropriate tools such as Apache Spark, Apache Kafka, and Apache Airflow (DAGs) for efficient and scalable data engineering workflows.

• Managed and administered Hadoop clusters using Cloudera and Hortonworks distributions, ensuring optimal performance, reliability, and scalability of big data processing environments.

• Optimized cluster performance by fine-tuning Hadoop ecosystem components including HDFS, YARN, MapReduce, Hive, and Spark, adjusting parameters such as block size, memory allocation, and parallelism to meet requirements.

• Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into RDBMS through Sqoop.

• Utilized advanced performance tuning techniques such as adjusting JVM settings, optimizing memory allocation, and tuning MapReduce parameters to enhance cluster performance and job execution efficiency.

• Conducted cluster scaling activities for Hadoop clusters to accommodate growing data volumes and user demands, leveraging tools like Apache Ambari, Apache ZooKeeper, and Cloudera Manager for cluster management and monitoring.

• Implemented robust security measures including Kerberos authentication, role-based access control (RBAC), and encryption mechanisms to safeguard sensitive data stored within the Hadoop cluster.

• Led Cloud Migration initiatives for bank and financial data, successfully transitioning legacy systems to cloud-based solutions while ensuring compliance and security. Utilized Agile Scrum for efficient collaboration and timely delivery.

• Experience in Waterfall SDLC (Software Development Life Cycle) and Agile/Scrum methodologies and processes.

• Managed and administered a diverse range of databases including MongoDB, Cassandra, DB2, Oracle, and SQL Server, ensuring optimal performance, security, and availability.

• Designed and implemented end-to-end ETL pipelines on both on-premises and AWS cloud environments, utilizing Big Data tools for data processing and storage, alongside Amazon S3 and EFS for scalable and reliable cloud-based storage solutions.

• Expertise in designing and maintaining data warehousing solutions leveraging AWS Redshift, AWS Glue, Amazon RDS (Relational Database Service), Amazon S3, and Aurora for storage.

• Implemented AWS Organizations for centralized management and governance of multiple AWS accounts. Established Service Control Policies (SCPs) to enforce security and compliance requirements, while utilizing AWS Config for real-time monitoring and automated remediation of policy violations.

• Demonstrated advanced-level expertise in cloud security, utilizing WAF, security groups, and NACLs to safeguard cloud environments. Ensured compliance with industry standards and proactively mitigated security risks through continuous monitoring and threat detection.

• Employed data masking, tokenization, and anonymization techniques to protect sensitive data elements within databases, ensuring compliance with regulatory requirements and minimizing exposure to unauthorized access.

• Integrated with AWS CloudWatch for monitoring and logging, ensuring reliable performance and operational visibility. Employed AWS SAM (Serverless Application Model) for streamlined deployment and management of serverless applications, optimizing resource utilization and minimizing operational overhead.

• Deployed Production Applications in Amazon Web Services (AWS) ensuring seamless integration with existing infrastructure and meeting performance and scalability requirements.

• Assumed full-stack ownership, consistently delivering production-ready, testable code. Led the end-to-end product lifecycle, including design, development, testing, deployment, and maintenance. Conducted code reviews to enforce best practices and development standards in the AWS Cloud.

• Led the adoption of Terraform and CloudFormation for automating infrastructure provisioning, configuration, and management. Designed and constructed intricate modules and stacks to deploy and manage complex AWS infrastructure, ensuring scalability, reliability, and repeatability.

• Implemented robust security measures, including IAM roles, fine-grained access control policies, and encryption mechanisms such as AWS KMS, to safeguard sensitive data stored within the AWS environment.

• Hands-on experience with Git, GitLab, AWS serverless technologies (Lambda, API Gateway, Step Functions, S3, SQS, SNS), containerized workloads (Kubernetes), and Jenkins orchestration, contributing to streamlined development workflows and enhanced deployment processes in real-world projects.

• Collaborated with cross-functional teams including data engineers, developers, and business stakeholders to understand requirements, troubleshoot issues, and optimize AWS Cloud configurations for optimal performance and reliability.

• Designed, developed, and implemented ETL solutions using IBM DataStage to facilitate data integration between various enterprise systems, including CRM, ERP, and data warehouses.

• Developed and maintained DataStage jobs for extract, transform, and load (ETL) processes, ensuring data accuracy, consistency, and timeliness.

PRAVEEN Technologies - Hyderabad, India January 2018 - December 2020

Hadoop Developer/Data Engineer

• Extensive hands-on experience leveraging Python within AWS environments to architect, develop, and optimize various cloud solutions. Proficient in utilizing Python for automation, infrastructure provisioning, serverless application development, data processing, and integrating AWS services. Demonstrated ability to design and implement scalable and efficient solutions using Python in real-world AWS projects.
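
A small, hedged boto3 snippet in the spirit of the automation described above; the bucket name and region are placeholders, and error handling is kept minimal.

    import boto3
    from botocore.exceptions import ClientError

    REGION = "us-east-1"                  # placeholder region
    BUCKET = "example-data-lake-landing"  # placeholder bucket name

    s3 = boto3.client("s3", region_name=REGION)

    def ensure_bucket(bucket: str) -> None:
        """Create the bucket if it does not exist and enable versioning."""
        try:
            s3.head_bucket(Bucket=bucket)
            print(f"{bucket} already exists")
        except ClientError:
            s3.create_bucket(Bucket=bucket)  # us-east-1 needs no location constraint
            print(f"created {bucket}")
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    if __name__ == "__main__":
        ensure_bucket(BUCKET)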

• Drove AWS CloudFormation adoption to automate infrastructure, facilitating rapid cloud program acceleration. Built cloud-native services with AWS Lambda, API Gateway, DynamoDB, and Aurora, empowering teams. Provided expert support for AWS product utilization, enhancing internal customer experiences.

• Managed end-to-end data processing tasks proficiently using Python, utilizing key libraries such as pandas, NumPy, and Matplotlib, applying machine learning techniques, and conducting PCA analysis.
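
A brief sketch of the pandas/NumPy/scikit-learn workflow named above, including a PCA step; the synthetic data stands in for the real datasets.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Synthetic data used purely for illustration.
    rng = np.random.default_rng(42)
    df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])

    # Standardize the features, then project onto two principal components.
    scaled = StandardScaler().fit_transform(df)
    pca = PCA(n_components=2)
    components = pca.fit_transform(scaled)

    print("explained variance ratio:", pca.explained_variance_ratio_)
    print(pd.DataFrame(components, columns=["pc1", "pc2"]).describe())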

• Set up a Jenkins server and build jobs to provide continuous automated builds based on polling the Git source control system during the day and periodic scheduled builds overnight to support development needs, using DevOps tools such as Jenkins, Git, JUnit, and Maven.

• Collaborated with cross-functional teams including data engineers, developers, and business stakeholders to understand requirements, troubleshoot issues, and optimize Hadoop cluster configurations for optimal performance and reliability. Worked on Hortonworks-HDP distribution of Hadoop.

• Worked with an Agile development team to deliver an end-to-end continuous integration/continuous delivery (CI/CD) product in an open-source environment using Jenkins.

• Led Python-based web application development using frameworks like Django, demonstrating expertise in backend development, database integration, and RESTful API design for efficient and responsive solutions.

• Developed and maintained Kafka producer and consumer applications using Kafka client libraries (Java, Python, or Scala) to publish and consume messages from Kafka topics, implementing message serialization, error handling, and batching mechanisms for optimal performance.

• Developed and maintained Glue jobs, crawlers, and workflows to automate data ingestion and processing pipelines, reducing manual effort, and improving data reliability.

• Integrated Amazon Aurora with other AWS services such as AWS Lambda, AWS Glue, and Amazon S3 to build end-to-end data pipelines and analytical solutions.

• Managed and maintained a diverse range of Linux servers (Red Hat and Ubuntu), ensuring 99.9% uptime for critical business applications.

• Designed, implemented, and maintained batch scheduling and job automation solutions using Autosys, ensuring timely and accurate execution of critical business processes. Developed and managed complex Autosys job definitions and dependencies.

• Designed, implemented, and maintained data pipelines using Apache Hudi for real-time and batch data ingestion, transformation, and processing in a Hadoop ecosystem.
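
A hedged PySpark sketch of the Hudi upsert pattern described above; the table name, key fields, and paths are placeholders, and it assumes the Hudi Spark bundle is available on the cluster.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

    # Placeholder input location with incoming changes.
    df = spark.read.parquet("/data/staging/orders/")

    hudi_options = {
        "hoodie.table.name": "orders",
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.partitionpath.field": "order_date",
        "hoodie.datasource.write.operation": "upsert",
    }

    # Upsert into the Hudi table stored on HDFS (placeholder base path).
    (
        df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("/data/lake/orders_hudi/")
    )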

• Utilized AWS and Azure services such as Amazon EMR, ECS, Glue, Athena, Databricks, Data Factory, Azure Functions and Airflow, along with Jupyter notebooks, to address diverse data engineering challenges.

CERTIFICATIONS

• Microsoft Certified: Azure Data Engineer Associate

• AWS Cloud Technical Essentials

• PYTHON-DATA STRUCTURES-MACHINE LEARNING

• Databases and SQL with Python

SKILLS

• Programming Languages: Python, Java, C++, SQL, HQL, R, Go, PySpark, Scala.

• Hadoop: HDFS, MapReduce, Hive, Zookeeper, YARN, HBase, Sqoop, Hortonworks, Cloudera.

• Cloud: Microsoft Azure, Amazon Web Services.

• Web Technologies: HTML, CSS, JavaScript, Bootstrap, jQuery, JSON.

• Databases: MySQL, DB2, PostgreSQL, SQL Server, Oracle, Snowflake; NoSQL: HBase, MongoDB, Cassandra.

• Tools: Git, Bitbucket, JIRA, Confluence, Postman, Kafka, Tableau, Power BI, Informatica, Flink, Terraform, Docker, Kubernetes, Airflow, Elasticsearch, Jupyter Notebooks, Jenkins, Maven, Gradle, GitLab, Selenium.

• Other: NumPy, Pandas, scikit-learn, Matplotlib, Machine Learning, Agile, Scrum, Unix Shell Scripting, Databricks, DataStage, Linux.

EDUCATION

UNIVERSITY OF CENTRAL MISSOURI Warrensburg, Missouri

Master’s in Computer Science CGPA: 3.5

LOVELY PROFESSIONAL UNIVERSITY Hyderabad, India

Bachelor’s in Technology (Computer Science And Engineering) CGPA: 3.5


