Sr Data Engineer

Location:
Kitchener, ON, Canada
Posted:
September 07, 2023

Senior Data Engineer

Name: Akhil Galla Cell No: 226-***-**** Email: adzjgv@r.postjobfree.com

PROFESSIONAL SUMMARY:

* ***** ** ********** ** a data engineer with expertise in Big Data technologies such as Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, and Zookeeper.

Experience working with Hadoop distributions such as Cloudera CDH, Apache, AWS, and Hortonworks HDP.

Proficient in programming languages such as SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, and shell scripting, as well as regular expressions. Well-versed in Spark components such as RDD, Spark SQL (DataFrames and Datasets), and Spark Streaming.

Experience working with cloud infrastructure such as AWS, Azure, and GCP. Experience working with databases such as Oracle, Teradata, MySQL, and SQL Server, and NoSQL databases such as HBase and MongoDB.

Experience working with version control tools such as CVS, SVN, ClearCase, and Git. Experience working with build tools such as Maven and Jenkins pipelines. Skilled in scripting and query languages such as shell scripting and SQL.

Experience working with containerization tools such as Kubernetes, Docker, and Docker Swarm.

Proficient in development and reporting tools such as JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD tooling, Linux/UNIX, Google Cloud Shell, Power BI, SAS, and Tableau. Extensive knowledge of working with Amazon EC2 to provide solutions for computing, query processing, and storage across a wide range of applications.

Expertise in using AWS S3 to stage data and to support data transfer and data archival. Experience in using AWS Redshift for large-scale data migrations using AWS DMS and implementing CDC (change data capture).
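
As a rough illustration of the S3 staging and archival pattern described above, a minimal boto3 sketch (bucket, prefix, and file names are hypothetical):

# Minimal sketch: stage a local extract into an S3 bucket, then archive it.
# Bucket, prefix, and file names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # credentials resolved from the environment/instance profile

local_file = "/tmp/orders_extract.csv"                      # hypothetical local extract
bucket = "my-staging-bucket"                                # hypothetical bucket
key = "staging/orders/2023/09/orders_extract.csv"

s3.upload_file(local_file, bucket, key)

# Archive: copy the staged object into an archive prefix, then remove the original.
s3.copy_object(Bucket=bucket,
               CopySource={"Bucket": bucket, "Key": key},
               Key="archive/orders/orders_extract.csv")
s3.delete_object(Bucket=bucket, Key=key)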

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as SQL databases, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
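
A minimal PySpark sketch of the ingest-and-process step, assuming an Azure Databricks-style environment with ADLS Gen2 access already configured (storage account, container, and column names are hypothetical):

# Minimal sketch: read raw data from ADLS Gen2, aggregate, and write a curated layer.
# Paths and column names are hypothetical; authentication is assumed to be configured.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls-ingest").getOrCreate()

raw_path = "abfss://raw@mystorageacct.dfs.core.windows.net/sales/2023/"
curated_path = "abfss://curated@mystorageacct.dfs.core.windows.net/sales_daily/"

raw = spark.read.parquet(raw_path)

daily = (raw
         .withColumn("order_date", F.to_date("order_ts"))
         .groupBy("order_date", "region")
         .agg(F.sum("amount").alias("total_amount")))

daily.write.mode("overwrite").partitionBy("order_date").parquet(curated_path)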

Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.

Extensive working experience with big data ecosystem - Hadoop (HDFS, MapReduce, Yarn), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume, NiFi.

Working knowledge on Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).

Experienced in working with various IDEs such as PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, and Sublime Text.

Worked with Cloudera and Hortonworks distributions. Hands on experience on Spark with Scala, PySpark.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, including controlling and granting database access and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Experience in cloud computing (Azure and AWS) and Big Data analytics tools such as Hadoop, HDFS, MapReduce, Hive, HBase, Spark, Spark Streaming, Azure Cloud, Amazon EC2, DynamoDB, Amazon S3, Kafka, Flume, Avro, Sqoop, and PySpark.

Experience building data pipelines for real-time streaming data and data analytics using Azure cloud components such as Azure Data Factory, HDInsight (Spark cluster), Azure ML Studio, Azure Stream Analytics, Azure Blob Storage, and Microsoft SQL DB.

Experience working with SQL Server and MySQL databases. Good experience working with Parquet files and parsing and validating JSON format files. Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.

Sound knowledge in developing highly scalable and resilient Restful APIs, ETL solutions, and third-party platform integrations as part of Enterprise Site platform.

Experience in using various IDEs like PyCharm and IntelliJ, and version control systems such as SVN and Git.

TECHNICAL SKILLS:

Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, Zookeeper

GCP Native Services: BigQuery, Dataflow, Composer, Cloud Dataproc, Cloud Pub/Sub, Cloud SQL (Postgres)

Hadoop Distributions: Cloudera CDH, Apache, AWS, Hortonworks HDP

Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell scripting, Regular Expressions

Spark Components: RDD, Spark SQL (DataFrames and Datasets), Spark Streaming

Cloud Infrastructure: AWS, Azure, GCP

Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL (HBase, MongoDB)

Scripting & Query Languages: Shell scripting, SQL

Version Control: CVS, SVN, ClearCase, Git

Build Tools: Maven, SBT

Containerization Tools: Kubernetes, Docker, Docker Swarm

Reporting/Development Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux/UNIX, Google Cloud Shell, Power BI, SAS, Tableau

PROFESSIONAL EXPERIENCE

Client: Capital One - Toronto Feb 2022 – Present

Role: Senior Data Engineer

Responsibilities:

Participated in Agile delivery, SAFe, and DevOps frameworks, and used DevOps tools for planning, building, testing, releasing, and monitoring.

Developed, implemented, and maintained CI/CD pipelines for Big Data solutions using Azure DevOps. Worked closely with development teams to ensure that code was deployed successfully to the production environment.

Designed, developed, and deployed high-volume REST APIs using Python, GraphQL, SQL, JUnit, Spring Boot, OpenAPI, Spark, Flink, and Kafka.

Worked with cloud data platforms such as Azure, Snowflake, Yellowbrick, SingleStore, and GBQ, and hosted/managed platforms such as Hadoop, Spark, Flink, Kafka, Spring Boot, Tableau, Alteryx, Collibra, Soda, and Amazon Deequ.

Followed the Twelve-Factor App methodology while designing and developing the APIs. Implemented API-layer requirements such as security, throttling, OAuth 2.0, TLS, certificates, Azure Key Vault, caching, logging, and request and response modifications using an API management platform.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Optimized workflows by building DAGs in Apache Airflow to schedule the ETL jobs, and implemented additional Apache Airflow components such as Pools, Executors, and multi-node functionality.
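
A minimal Airflow sketch of a daily ETL DAG of the kind described above (DAG id, pool name, and task callables are hypothetical placeholders):

# Minimal sketch: a daily ETL DAG with extract -> transform -> load tasks.
# Task names, the pool, and the callables are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():    # placeholder extract step
    pass


def transform():  # placeholder transform step
    pass


def load():       # placeholder load step
    pass


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract,
                               pool="etl_pool")  # named pool caps concurrency
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load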

Created custom policies in XML, Python scripts, and Node.js in the API management platform. Implemented test-driven development and API testing automation.

Installed and configured the automation tool Puppet, including installation and configuration of the Puppet master, agent nodes, and an admin control node; involved in using Chef and Puppet for deployment on multiple platforms.

Virtualized servers using Docker for test and dev environment needs and performed configuration automation using Docker containers.

Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch on both Linux and Windows servers; container clustering with Docker Swarm and Mesos/Kubernetes.

Created Amazon EC2 instances using command-line calls, troubleshot the most common problems with instances, and monitored the health of Amazon EC2 instances and other AWS services.

Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java/JDBC, RDBMS, shell scripting, spreadsheets, and text files.
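
A minimal Flask sketch of such a REST endpoint over a relational source (sqlite3 stands in for the real database here; table and column names are hypothetical):

# Minimal sketch: a Flask REST endpoint reading one row from a relational store.
# sqlite3 stands in for the real RDBMS; table and column names are hypothetical.
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "customers.db"  # hypothetical database file


@app.route("/customers/<int:customer_id>", methods=["GET"])
def get_customer(customer_id):
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute(
        "SELECT id, name, email FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    conn.close()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": row[0], "name": row[1], "email": row[2]})


if __name__ == "__main__":
    app.run(port=5000)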

Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.

Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
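
A minimal Spark ML sketch of the kind of pipeline involved (input path, feature columns, and label are hypothetical):

# Minimal sketch: feature assembly plus logistic regression in a Spark ML pipeline.
# Input path, feature columns, and the label column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/curated/customers/")  # hypothetical path

assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)

predictions = model.transform(test)
predictions.select("churned", "prediction", "probability").show(5)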

Involved in supporting cloud instances running Linux and Windows on AWS, experience with Elastic IP, Security Groups and Virtual Private Cloud in AWS.

Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.

Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data and implemented data quality metrics using the necessary queries or Python scripts based on the source.
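
A minimal PySpark sketch of source-driven data-quality checks (null counts and duplicate keys); the input path and key column are hypothetical:

# Minimal sketch: simple data-quality metrics computed per source dataset.
# Input path and the business key column are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-metrics").getOrCreate()
df = spark.read.parquet("/data/raw/customers/")  # hypothetical source path

total = df.count()

# Null count per column
null_counts = df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls") for c in df.columns
])
null_counts.show(truncate=False)

# Duplicate count on the business key
dupes = df.groupBy("customer_id").count().filter("count > 1").count()
print(f"rows: {total}, duplicate customer_id values: {dupes}")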

Generated SQL and PL/SQL scripts to install, create and drop database objects, including tables, views, primary keys, indexes, sequences.

Extensive experience configuring Amazon EC2, Amazon S3, Elastic Load Balancing, IAM, and Security Groups in public and private subnets in a VPC, along with other AWS services; managed network security using load balancers, Auto Scaling, security groups, and NACLs.

Utilized AWS CLI to automate backups of ephemeral data-stores to S3 buckets, EBS and create nightly AMIs for mission critical production servers as backups.
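
A minimal boto3 sketch of such a nightly backup step (instance ID, bucket, and paths are hypothetical):

# Minimal sketch: create a nightly AMI of a production instance and push an
# ephemeral data dump to S3. Instance ID, bucket, and paths are hypothetical.
from datetime import datetime

import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

stamp = datetime.utcnow().strftime("%Y-%m-%d")

# Nightly AMI of a mission-critical instance (NoReboot keeps the server running)
ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # hypothetical instance
    Name=f"prod-app-backup-{stamp}",
    NoReboot=True,
)

# Push an ephemeral data-store dump to the backup bucket
s3.upload_file("/mnt/ephemeral/cache_dump.gz",
               "my-backup-bucket",       # hypothetical bucket
               f"backups/{stamp}/cache_dump.gz")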

Worked with Amazon AWS/EC2, and Google's Docker based cluster management environment Kubernetes.

Created Jenkins jobs and distributed load on the Jenkins server by configuring Jenkins nodes to enable parallel builds.

Extensively worked on Jenkins CI/CD pipeline jobs for end-to-end automation to build, test, and deliver artifacts, and troubleshot build issues during the Jenkins build process.

Managed Jenkins artifacts in a Nexus repository, versioned artifacts with timestamps, and deployed artifacts onto servers in the AWS cloud with Ansible and Jenkins.

Created a continuous integration system using Ant, Jenkins, and Puppet for full automation, continuous integration, and faster, more reliable deployments.

Installed and administered the Git source control tool, ensured the reliability of the application, and designed branching strategies for Git.

Environment: Python, HDFS, PySpark, Yarn, Pandas, NumPy, Spark, EMR, Spectrum, Glue, Netezza, Active Batch, Hadoop 2.7.7, HDFS 2.7.7, Apache Hive 2.3, Apache Kafka 0.8.2.x, Apache Spark 2.3, Spark SQL, Spark Streaming, Zookeeper, Pig, Oozie, Java 8, Python 3, S3, EMR, EC2, Redshift, Cassandra, NiFi, Talend, HBase, Cloudera (CDH 5.x), Snowflake, Power BI, Tableau.

Client: Thomson Reuters - Toronto Feb 2021 – Jan 2022

Role: Data Engineer

Responsibilities:

Participated in Agile delivery, SAFe, and DevOps frameworks, and used DevOps tools for planning, building, testing, releasing, and monitoring.

Developed, implemented, and maintained CI/CD pipelines for Big Data solutions using DevOps tools. Worked closely with development teams to ensure that code was deployed successfully to the production environment.

Configured Spark Streaming to receive real-time data from Apache Flume and stored the stream data in Azure Table storage using Scala, with Azure Data Lake used to store the data and perform all types of processing and analytics. Created DataFrames using the Spark DataFrame API.
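
For illustration only, a minimal Python Structured Streaming sketch; the project itself used Spark Streaming with Flume and Scala, so Kafka is a swapped-in example source here, and the broker, topic, paths, and required spark-sql-kafka package are assumptions:

# Minimal sketch: read a stream (Kafka shown as a stand-in source) and land
# micro-batches in a Data Lake path. Requires the spark-sql-kafka package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "events")                      # hypothetical topic
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

query = (events.writeStream
         .format("parquet")
         .option("path", "abfss://raw@mystorageacct.dfs.core.windows.net/events/")
         .option("checkpointLocation",
                 "abfss://raw@mystorageacct.dfs.core.windows.net/_chk/events/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()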

Designed cloud architecture and implementation plans for hosting complex app workloads on MS Azure.

Performed operations on the transformation layer using Apache Drill, Spark RDDs, the DataFrame API, and Spark SQL, and applied various aggregations provided by the Spark framework.
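
A minimal PySpark sketch showing the same aggregation expressed with the DataFrame API and with Spark SQL (input path and column names are hypothetical):

# Minimal sketch: one aggregation written with the DataFrame API and with Spark SQL.
# Input path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-layer").getOrCreate()
txns = spark.read.parquet("/curated/transactions/")  # hypothetical input

# DataFrame API aggregation
summary = (txns.groupBy("account_id")
               .agg(F.count("*").alias("txn_count"),
                    F.sum("amount").alias("total_amount"),
                    F.avg("amount").alias("avg_amount")))

# The same aggregation expressed in Spark SQL
txns.createOrReplaceTempView("transactions")
summary_sql = spark.sql("""
    SELECT account_id,
           COUNT(*)    AS txn_count,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM transactions
    GROUP BY account_id
""")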

Provided real-time insights and reports by mining data using Spark Scala functions. Optimized existing Scala code and improved the cluster performance. Processed huge datasets by leveraging Spark Context, Spark SQL, and Spark Streaming.

Enhanced reliability of Spark cluster by continuous monitoring using Log Analytics and Ambari WEB UI.

Improved query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse.

Implemented custom-built input adapters using Spark, Hive, and Sqoop to ingest data for analytics from various sources (Snowflake, MS SQL, MongoDB) into HDFS. Imported data from web servers and Teradata using Sqoop, Flume, and Spark Streaming API.

Improved efficiency of large datasets processing using Scala for concurrency support and parallel processing.

Developed MapReduce jobs using Scala, compiling program code into JVM bytecode for data processing. Ensured faster data processing by developing Spark jobs using Python in a test environment and used Spark SQL for querying.

Improved processing time and efficiency by tuning Spark application settings such as batch interval time, level of parallelism, and memory. Monitored workflows for daily incremental loads from databases (MongoDB, MS SQL, MySQL).

Created reusable YAML pipelines for data processing and data storage using Azure Data Factory, Data Lake, and Databricks. Used Git flow branching strategy to manage code changes and code deployments.

Worked with PowerShell scripting, Bash, YAML, JSON, Git, REST APIs, and Azure Resource Manager (ARM) templates to build and manage CI/CD pipelines.

Implemented standards and best practices for the CI/CD framework, including version control, code review, and testing.

Used security scanning/monitoring tools to ensure that code is free from vulnerabilities and integrated those tools with the CI/CD pipeline.

Collaborated with development teams to troubleshoot issues and debug code on Windows environments. Provided guidance to junior engineers on best practices for CI/CD pipelines and cloud-native architectures.

Developed solutions in Databricks for Data Extraction, transformation, and aggregation from multiple data sources. Designed and implemented highly performant data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.

Built SCD Type 2 Dimensions and Facts using Delta Lake and Databricks.
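
A minimal sketch of an SCD Type 2 merge with Delta Lake, assuming a Databricks/Delta environment and that the incoming batch contains only new or changed records (paths, keys, and columns are hypothetical):

# Minimal sketch: SCD Type 2 load into a Delta dimension table.
# Assumes the batch holds only new/changed records; names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dim_path = "/mnt/gold/dim_customer"                            # hypothetical Delta path
dim = DeltaTable.forPath(spark, dim_path)
updates = spark.read.parquet("/mnt/silver/customer_updates/")  # hypothetical batch

# Step 1: close out current rows whose tracked attribute changed
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the incoming records as the new current versions
new_rows = (updates
            .withColumn("is_current", F.lit(True))
            .withColumn("start_date", F.current_date())
            .withColumn("end_date", F.lit(None).cast("date")))
new_rows.write.format("delta").mode("append").save(dim_path)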

Developed custom-built ETL solution, batch processing, and real-time data ingestion pipeline to move data in and out of the Hadoop cluster using PySpark and Shell Scripting.

Created Azure Databricks (Spark) notebooks to extract data from Data Lake storage accounts and Blob storage accounts and load it into an on-premises SQL Server database.
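
A minimal PySpark sketch of the lake-to-SQL-Server load, assuming the SQL Server JDBC driver is available on the cluster (paths, server, table, and credentials are hypothetical placeholders):

# Minimal sketch: read a curated lake dataset and write it to SQL Server over JDBC.
# Paths, server, table, and credentials are hypothetical; in practice the password
# would come from a secret scope / Key Vault rather than being inlined.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("abfss://curated@mystorageacct.dfs.core.windows.net/orders/")

jdbc_url = "jdbc:sqlserver://onprem-sql.corp.local:1433;databaseName=Reporting"

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.Orders")
   .option("user", "etl_user")
   .option("password", "<from-secret-scope>")
   .mode("overwrite")
   .save())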

Performed statistical analysis using SQL, Python, R, and Excel. Worked extensively with Excel VBA macros and Microsoft Access forms.

Used Python and SAS to extract, transform, and load source data from transaction systems and generated reports, insights, and key conclusions. Created Python functions to transform data from Azure Storage to Azure SQL on the Azure Databricks platform.

Involved in automating the scheduling of Azure Databricks jobs and built SSIS packages to push data from Azure SQL to an on-premises server.

Built ETL solutions using Databricks by executing code in Notebooks against data in Data Lake and Delta Lake and loading data into Azure DW following the bronze, silver, and gold layer architecture.

Used Azure Data Factory to orchestrate Databricks data preparation and load the results into the SQL Data Warehouse. Responsible for ingesting data from the Data Lake to the Data Warehouse using Azure services such as Azure Data Factory and Azure Databricks.

Integrated on-premises data (MySQL, HBase) with cloud (Blob Storage, Azure SQL DB) and applied transformations to load back to Azure Synapse using Azure Data Factory.

Built and published Docker container images using Azure Container Registry and deployed them into Azure Kubernetes Service (AKS).

Imported metadata into Hive and migrated existing tables and applications to work on Hive and Azure, created complex data transformations and manipulations using ADF and Python.

Configured Azure Data Factory (ADF) to ingest data from different sources like relational and non-relational databases to meet business functional requirements. Improved performance of Airflow by exploring and implementing the most suitable configurations.

Optimized workflows by building DAGs in Apache Airflow to schedule the ETL jobs and implemented additional components in Apache Airflow like Pool, Executors, and multi-node functionality.

Implemented indexing for data ingestion using a Flume sink to write directly to indexers deployed on a cluster.

Delivered data for analytics and Business intelligence needs by managing workloads using Azure Synapse.

Improved security by using Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory, and Apache Ranger for authentication. Managed resources and scheduling across the cluster using Azure Kubernetes Service.

Environment: Hadoop, Spark, Hive, Sqoop, HBase, Flume, Ambari, Scala, MS SQL, MySQL, Snowflake, MongoDB, Git, Data Storage Explorer, Python, Azure (Data Storage Explorer, ADF, AKS, Blob Storage)

Client: Cisco – Hyderabad Jan 2018 – Jan 2021

Role: Data Engineer

Responsibilities:

Participated in the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Worked in an Agile environment with sprint planning meetings, scrum calls, and retrospective meetings for every sprint; used JIRA for project management and GitHub for version control.

Created Amazon EC2 instances using command-line calls, troubleshot the most common problems with instances, and monitored the health of Amazon EC2 instances and other AWS services.

Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java/JDBC, RDBMS, shell scripting, spreadsheets, and text files.

Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.

Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Involved in implementing a nine-node CDH4 Hadoop cluster on Red Hat Linux. Imported data from RDBMS to HDFS and Hive using Sqoop on a regular basis.

Created Hive tables and worked on them using HiveQL, which automatically invokes and runs MapReduce jobs in the backend. Responsible for developing Pig Latin scripts. Managed and scheduled batch jobs on a Hadoop cluster using Oozie and reviewed Hadoop log files.

Involved in supporting cloud instances running Linux and Windows on AWS, experience with Elastic IP, Security Groups and Virtual Private Cloud in AWS.

Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.

Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data and implemented data quality metrics using the necessary queries or Python scripts based on the source.

Generated SQL and PL/SQL scripts to install, create, and drop database objects, including tables, views, primary keys, indexes, and sequences.

Extensive experience configuring Amazon EC2, Amazon S3, Elastic Load Balancing, IAM, and Security Groups in public and private subnets in a VPC, along with other AWS services; managed network security using load balancers, Auto Scaling, security groups, and NACLs.

Utilized AWS CLI to automate backups of ephemeral data-stores to S3 buckets, EBS and create nightly AMIs for mission critical production servers as backups.

Installed and configured the automation tool Puppet, including installation and configuration of the Puppet master, agent nodes, and an admin control node; involved in using Chef and Puppet for deployment on multiple platforms.

Worked on OpenShift PaaS product architecture and created OpenShift namespaces for on-prem applications migrating to the cloud.

Virtualized servers using Docker for test and dev environment needs and performed configuration automation using Docker containers.

Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch on both Linux and Windows servers; container clustering with Docker Swarm and Mesos/Kubernetes.

Worked with Amazon AWS/EC2, and Google's Docker based cluster management environment Kubernetes.

Created Jenkins jobs and distributed load on the Jenkins server by configuring Jenkins nodes to enable parallel builds.

Extensively worked on Jenkins CI/CD pipeline jobs for end-to-end automation to build, test, and deliver artifacts, and troubleshot build issues during the Jenkins build process.

Managed Jenkins artifacts in a Nexus repository, versioned artifacts with timestamps, and deployed artifacts onto servers in the AWS cloud with Ansible and Jenkins.

Created a continuous integration system using Ant, Jenkins, and Puppet for full automation, continuous integration, and faster, more reliable deployments.

Installed and administered the Git source control tool, ensured the reliability of the application, and designed branching strategies for Git.

Installed, configured, and managed Nexus Repository Manager and all repositories, and created the release process for the artifacts.

Experience working with on-premises network, application, and server monitoring tools such as Nagios, Splunk, and AppDynamics, and with the CloudWatch monitoring tool on AWS.

Involved in setting up JIRA as defect tracking system and configured various workflows, customizations, and plugins for the JIRA bug/issue tracker.

Environment: AWS, Ansible, ANT, MAVEN, Jenkins, Bamboo, Splunk, Confluence, Bitbucket, GIT, JIRA, Python, SSH, Shell Scripting, Docker, JSON, JAVA/J2EE, Kubernetes, Nagios, Red Hat Enterprise Linux, Terraform, Kibana.


