
Machine Learning Data Engineer

Location:
Covington, GA
Posted:
February 25, 2025

Contact this candidate

Resume:

Senior Data Engineer

Anilkumar Sunkara

Phone: +* (404) 913-5126

Email: ******************@*****.***

PROFESSIONAL SUMMARY:

Overall 9+ years of experience in designing and developing data engineering solutions, Big Data analytics and development, and administering database projects that include installing, upgrading, and configuring databases, performing deployments, working on capacity planning, and tuning databases to optimize application performance.

Proficient in machine learning with large data sets of structured and unstructured data, including data acquisition, data validation, predictive modeling, and data visualization.

Expertise working with varied forms of data infrastructure, including relational databases such as MySQL, distributed processing frameworks such as Hadoop and Spark, and column-oriented data stores.

In-depth knowledge of building Spark applications in Python in cluster/client mode and monitoring application health using the Spark UI.

Good understanding of and hands-on experience with Spark SQL, RDDs, DataFrames, and Datasets.

Excellent knowledge of and hands-on mastery with Hadoop ecosystem components such as MapReduce, Spark, Pig, Hive, HBase, Sqoop, YARN, Kafka, ZooKeeper, and HDFS.

Proficient in AWS cloud services such as EC2, S3, Glue, Athena, DynamoDB, and Redshift, with hands-on experience with Hadoop ecosystem tools.

Skilled in installing Hadoop clusters and configuring their components, both on-premises and on AWS EC2 using Cloudera's distribution. Good understanding of data mining and machine learning techniques.

Proficient in developing Spark applications using Spark SQL in Azure Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Capable of working with multi-cluster and virtual warehouses in Snowflake.

Proficient in data warehousing, including dimensional modeling concepts, and in scripting languages such as Python, Scala, and JavaScript.

Worked on implementing CRUD operations using NoSQL REST APIs.

Expertise in GCP data services including Cloud Composer for orchestrating data tasks and profound ETL experience using GCP services like Dataflow.

Expert in GCP services such as BigQuery, Google Cloud Storage (GCS) buckets, Cloud Functions, and Dataflow for seamless data analytics solutions.

Mastery in importing and exporting data using Sqoop from HDFS to Relational Database Systems.

Substantial experience working with big data infrastructure tools such as Python and Redshift; also proficient in Scala, Spark, Spark Streaming, and C++.

Possess an ability to perform complex data analyses with large data volumes. Also an expert in SQL, with a keen understanding of data warehouse concepts.

Strong knowledge in Linux, OS tools, and file-system level troubleshooting.

Work with data and analytics experts to strive for greater functionality in our data systems. Expertise in developing Shell scripts and Python Scripts for system management.

Very good understanding and working knowledge in cloud technology and big data, specifically Azure Database and Azure Data Warehouse.

ETL expertise using Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics, along with ingestion and processing within Azure Databricks.

Expertise in designing and deploying SSIS packages for data extraction, transformation, and loading into Azure SQL Database and Data Lake Storage.

Exceptional skills in SQL server reporting services, analysis services, Tableau, and data visualization tools.

Designed and developed shell scripts and Sqoop scripts to migrate data in and out of HDFS.

Designed and developed Oozie workflows to execute MapReduce jobs, Hive scripts, and shell scripts, and to send email notifications.

Proficient in scripting and debugging within Windows environments, with familiarity in container orchestration using Kubernetes, Docker, and Azure Kubernetes Service (AKS).

Deployed different partitioning methods like Hash by field, Round Robin, Entire, Modulus, and Range for bulk data loading.

TECHNICAL SKILLS:

Operating Systems: UNIX, Linux, Windows

MS Azure Services: Azure Databricks, Azure Data Factory (ADF), Machine Learning, Azure SQL, Azure Data Lake, Azure Synapse Analytics

SDLC: Agile, Scrum, Waterfall

Application/Web Servers: Oracle WebLogic Server 11g, Apache Tomcat, Oracle Application Server 10g, BEA WebLogic 8.1/9.2, WebSphere, JBoss, IIS

AWS Services: EC2, VPN, Elastic Load Balancer, Auto Scaling, Glacier, Elastic Beanstalk, CloudFront, Relational Database Service (RDS), DynamoDB, Virtual Private Cloud (VPC), Route 53, CloudWatch, Identity and Access Management (IAM), EMR, SNS, SQS, CloudFormation, Lambda

Scripting Languages: Bash, Perl, Ruby, PowerShell, Python, HTML, Groovy

Build Tools: Docker, Maven

Cloud Platforms: AWS, Azure, GCP, Rackspace, OpenStack, Kubernetes

Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, NiFi, Airflow, Flume, Snowflake

Databases: Oracle, MongoDB, MySQL, RDS, DB2, HBase, Cassandra

Version Controls: Subversion, Git, GitHub, TFS, Bitbucket

BI Tools: Tableau, SSRS, Power BI

Issue Tracking Tools: Jira, Remedy, ServiceNow

PROFESSIONAL EXPERIENCE:

Ally Bank, Sandy, Utah Sep 2022 - Present

Role: Senior Data Engineer

Responsibilities:

Created pipelines in Azure Data Factory using linked services, datasets, and pipelines to extract, transform, and load (ETL) data in Azure Databricks from different sources such as Azure SQL, Data Lake Storage, and Azure SQL Data Warehouse, and to write data back.

Worked on Azure Blob and Data Lake Storage and loaded data into Azure Synapse Analytics.

Developed ETL workflow which pushes webserver logs to an Azure Blob Storage.

Developed Azure functions to automate workflows, reducing manual intervention, and improving the efficiency of data processing tasks.

Successfully integrated Azure Data Factory (ADF) with various Azure services, including Azure Databricks, Azure Synapse Analytics, and Azure SQL Data Warehouse, creating seamless end-to-end data solutions.

Deployed Azure Purview to manage and govern enterprise data estate, improving data discovery and sensitivity classification across diverse data sources.

Developed Spark applications using Spark SQL in Databricks to move data into Cassandra tables from various sources such as relational databases and Hive.
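
A minimal PySpark sketch of the kind of Hive-to-Cassandra movement described above, assuming the spark-cassandra-connector is attached to the cluster; the database, keyspace, table, and column names are illustrative placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive_to_cassandra")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical curated Hive table used as the source
usage = spark.sql("SELECT customer_id, plan, usage_gb FROM analytics.customer_usage")

# Write into Cassandra through the connector's DataFrame API
(usage.write
      .format("org.apache.spark.sql.cassandra")
      .options(table="customer_usage", keyspace="analytics")
      .mode("append")
      .save())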

Built a feature data model using PySpark that served as input for the data science team to build a chatbot.

Developed ML models to predict data trends and anomalies, integrating with Azure Machine Learning for automated decision-making.

Developed Python scripts and UDFs using both DataFrames/Spark SQL and RDDs/MapReduce in Spark for data aggregation and queries, and wrote data back into the RDBMS through Sqoop.
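
A short sketch of that DataFrame/UDF style of aggregation; the table names, columns, thresholds, and staging path are assumptions, and the final export to the RDBMS would be a separate Sqoop step:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("usage_aggregation").getOrCreate()

events = spark.table("staging.usage_events")  # hypothetical staged input

# Simple UDF bucketing usage into tiers (thresholds are illustrative)
tier = F.udf(lambda gb: "high" if gb is not None and gb > 100 else "low", StringType())

daily = (events
         .withColumn("tier", tier(F.col("usage_gb")))
         .groupBy("customer_id", "event_date", "tier")
         .agg(F.sum("usage_gb").alias("total_gb"), F.count("*").alias("events")))

# Land the aggregate in HDFS; a Sqoop export job would push it to the RDBMS
daily.write.mode("overwrite").parquet("/warehouse/staging/daily_usage")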

Implemented effective data integration patterns, including star schema and snowflake schema, optimizing data flow and storage structures within Azure Data Factory (ADF) pipelines.

Implemented comprehensive monitoring and logging using Azure Monitor, enabling proactive identification of issues and ensuring system reliability.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Implemented Azure Log Analytics for real-time performance monitoring and predictive maintenance of data pipelines, significantly reducing downtime.

Leveraged machine learning algorithms to enhance data quality and integrity, resulting in a 20% improvement in data accuracy.

Loaded data into Spark RDDs and performed in-memory data computation to generate the output response.

Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

Utilized log data to optimize query performance and system reliability, applying advanced analytics to streamline operations.

Implemented Storm builder topologies to perform cleansing operations before moving data into Cassandra.

Developed Spark scripts by using Scala shell commands as per the requirement.

Worked on Spark Streaming, which collects data from Kafka in near real time and performs the necessary transformations and aggregations on the fly to build the common learner data model and persist the data in Cassandra.
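
A condensed Structured Streaming sketch of that Kafka-to-Cassandra flow, again assuming the spark-cassandra-connector; the brokers, topic, schema, checkpoint path, and table names are placeholders:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("learner_stream").getOrCreate()

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("score", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "learner-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch into the learner model table
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="learner_model", keyspace="learning")
        .mode("append")
        .save())

(events.writeStream
       .foreachBatch(write_to_cassandra)
       .option("checkpointLocation", "/checkpoints/learner_model")
       .start())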

Employed testing frameworks such as PyTest and JUnit for code validation.
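
An illustrative PyTest case of the sort used for code validation; the helper under test is a hypothetical stand-in:

# test_transforms.py
import pytest

def usage_tier(gb):
    # Hypothetical transformation helper under test
    if gb is None:
        return "unknown"
    return "high" if gb > 100 else "low"

@pytest.mark.parametrize("gb,expected", [(150, "high"), (10, "low"), (None, "unknown")])
def test_usage_tier(gb, expected):
    assert usage_tier(gb) == expected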

Worked extensively on the design, development, and deployment of Talend jobs to extract, filter, and load data into the Data Lake.

Secured sensitive data across cloud services by implementing Azure Key Vault, enabling controlled access to tokens, passwords, and encryption keys, enhancing overall security posture.
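
A minimal Python sketch of pulling a secret from Azure Key Vault with the azure-identity and azure-keyvault-secrets libraries; the vault URL and secret name are placeholders:

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # resolves managed identity, CLI login, etc.
client = SecretClient(vault_url="https://my-vault.vault.azure.net", credential=credential)

# Fetch a connection string at runtime instead of storing it in pipeline config
secret = client.get_secret("storage-connection-string")
connection_string = secret.value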

Enhanced data governance frameworks by integrating Azure Purview, which facilitated data lineage and metadata management at scale, ensuring regulatory compliance.

Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Created multiple interactive filters, parameters, and calculations for dashboards in Power BI. Built Power BI Reports and dashboards.

Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).

Worked with highly unstructured and semi-structured data of 70 TB in size (210 TB with replication).

Maintained open communication channels to update the team on progress and address challenges promptly.

Environment: Azure, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, Python, Azure Data Factory, Hadoop, Spark, Spark Streaming, Spark SQL, HDFS, Pig, Hive, Apache Kafka, Sqoop, Power BI.

Liberty Mutual Insurance, Boston, MA Aug 2020 - Jun 2022

Role: Senior Data Engineer

Responsibilities:

Developed CI/CD pipelines for Azure Big Data solutions, focusing on seamless code deployment to production.

Utilized PowerShell scripting, Bash, YAML, JSON, Git, Rest API, and Azure Resource Management (ARM) templates for robust CI/CD pipeline construction, following best practices.

Implemented security monitoring using Azure Security Center and Log Analytics.

Implemented SAP-specific test cases, validating data consistency and ensuring synchronization between the application and SAP ECC.

Explored machine learning integration for predictive analytics in log monitoring.

Developed Terraform modules to define and provision infrastructure components for AKS and HDInsight.

Designed custom dashboards in Azure Monitor for real-time visualization of critical metrics.

Developed REST APIs using Python with Flask and Django, integrating diverse data sources.
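
A minimal Flask sketch in the spirit of those REST APIs; the routes and payload shapes are illustrative only:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/v1/records/<record_id>", methods=["GET"])
def get_record(record_id):
    # In the real service this would look up one of the integrated data sources
    return jsonify({"id": record_id, "status": "ok"})

@app.route("/api/v1/records", methods=["POST"])
def create_record():
    payload = request.get_json(force=True)
    return jsonify({"created": payload}), 201

if __name__ == "__main__":
    app.run(port=5000)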

Implemented SQL queries and scripts for data validation and verification, ensuring the accuracy and reliability of data stored in HANA.

Implemented Azure Monitor Logs for centralized log management and analysis.

Optimized PySpark jobs for enhanced data processing speed and near real-time insights.
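
A small sketch of the kind of PySpark tuning implied above; the shuffle-partition setting, storage path, and column names are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("optimized_job")
         .config("spark.sql.shuffle.partitions", "200")  # sized to the cluster
         .getOrCreate())

events = spark.read.parquet("abfss://raw@account.dfs.core.windows.net/events/")

# Filter early and cache the reduced set that feeds several aggregations
recent = events.filter(F.col("event_date") >= "2022-01-01").cache()

by_type = recent.groupBy("event_type").count()
by_user = recent.groupBy("user_id").agg(F.sum("duration_ms").alias("total_ms"))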

Demonstrated proficiency in Azure services such as Azure DevOps, contributing to streamlined development and deployment processes.

Designed and deployed SSIS packages for Azure service data extraction, transformation, and loading.

Automated the management of cryptographic keys and secrets used in cloud applications and services with Azure Key Vault to ensure compliance with data protection standards.

Implemented event-driven architecture using Azure Event Grid for real-time event processing.

Established monitoring mechanisms to track the performance of ETL processes and Cognos reports.

Leveraged Azure Databricks and HDInsight for large-scale data processing and analytics using PySpark.

Utilized Docker for server virtualization in testing and development environments.

Developed robust data preprocessing pipelines to clean and transform raw data into a format compatible with Facets.

Orchestrated data pipelines utilizing Azure Data Factory, T-SQL, Spark SQL, and U-SQL Azure Data Lake Analytics.

Designed ADF Pipelines for data extraction, transformation, and loading.

Specialized in Data Migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.

Effectively employed Arcadia to interface with Impala for interactive dashboards and reports.

Expertly utilized PowerShell and UNIX scripts for file management and transfer tasks.

Environment: Hadoop, ETL operations, Data Warehousing, Data Modelling, Cassandra, Advanced SQL methods, NiFi, Python, Linux, Apache Spark, Scala, Spark-SQL, HBase.

Humana, Louisville, Kentucky March 2018 - July 2020

Role: Data Engineer

Responsibilities:

Involved in extracting data from various sources such as Oracle Database, XML, and CSV files, and loading it into the target warehouse.

Utilized the clinical data to generate features describing the different illnesses using LDA topic modeling.
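
An illustrative scikit-learn sketch of deriving LDA topic features from clinical text; the corpus and parameter choices are assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder notes standing in for the full clinical corpus
notes = ["patient reports intermittent chest pain", "follow-up visit for type 2 diabetes"]

vectorizer = CountVectorizer(max_features=5000, stop_words="english")
dtm = vectorizer.fit_transform(notes)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_features = lda.fit_transform(dtm)  # one topic-distribution row per note, usable as model features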

Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze the CT scan images to identify disease.
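
A brief PCA sketch over flattened scan arrays showing where the eigenvalues and eigenvectors surface in scikit-learn; the shapes and component count are illustrative:

import numpy as np
from sklearn.decomposition import PCA

scans = np.random.rand(200, 64 * 64)  # placeholder: 200 flattened 64x64 images

pca = PCA(n_components=20)
reduced = pca.fit_transform(scans)    # lower-dimensional features for the downstream model

eigenvalues = pca.explained_variance_   # eigenvalues of the covariance matrix
eigenvectors = pca.components_          # corresponding principal directions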

Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.

Developed and supported the extraction, transformation, and load (ETL) process for a data warehouse.

Performed data management projects and fulfilled ad-hoc requests according to user specifications, utilizing data management software programs and tools such as TOAD, MS Access, Excel, XLS, and SQL Server.

Worked with requirements management, workflow analysis, source data analysis, data mapping, metadata management, data quality, testing strategy, and maintenance of the model.

Used DVO to validate the data moving from source to target.

Assisted in producing OLAP cubes and wrote queries to produce reports using SQL Server Analysis Services (SSAS) and Reporting Services (SSRS).

Edited, upgraded, and maintained an ASP.NET website and IIS server.

Used SQL Profiler for troubleshooting, monitoring, and optimization of SQL Server and non-production database code as well as T-SQL code from developers and QA.

Designed the ER diagrams, logical model (relationship, cardinality, attributes, and candidate keys) and physical database (capacity planning, object creation and aggregation strategies) for Oracle and Teradata as per business requirements using Erwin.

Designed Power View and Power Pivot reports and designed and developed the Reports using SSRS.

Designed and created MDX queries to retrieve data from cubes using SSIS.

Created SSIS Packages using SSIS Designer for exporting heterogeneous data from OLE DB Source, Excel Spreadsheets to SQL Server.

Environment: Erwin 9.1, Netezza, Oracle 8.x, SQL, PL/SQL, SQL*Plus, SQL*Loader, Informatica, CSV, Teradata 13, T-SQL, SQL Server, SharePoint, Pivot tables, Power View, DB2, SSIS, DVO, Linux, MDM, ETL, Excel, SAS, SSAS, SPSS, SSRS.

EDS Group, Karnataka, India Aug 2014 - Nov 2017

Role: Data Engineer

Responsibilities:

Worked on development of data ingestion pipelines using the ETL tool Talend and bash scripting, with big data technologies including but not limited to Hive, Impala, Spark, and Kafka.

Developed stored procedures/views in DB2 and used them in Talend for loading dimensions and facts.

Expert in developing scalable & secure data pipelines for large datasets.

Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.

Enhanced the data ingestion framework by creating more robust and secure data pipelines.

Implemented data streaming capability using Kafka and Talend for multiple data sources.

Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala).
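
A small PySpark sketch of converting between those storage formats; the paths and partition column are placeholders, and the Avro read assumes the spark-avro package is available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format_conversion").getOrCreate()

# Avro landing zone -> curated Parquet, partitioned for Hive/Impala queries
raw = spark.read.format("avro").load("/landing/events_avro/")
(raw.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("/curated/events_parquet/"))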

Managed the S3 data lake; responsible for maintaining and handling inbound and outbound data requests through the big data platform.

Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS.

Knowledge of implementing JILs to automate jobs in the production cluster.

Troubleshot users' analysis bugs (JIRA and IRIS tickets).

Worked with SCRUM team in delivering agreed user stories on time for every Sprint.

Worked on analyzing and resolving the production job failures in several scenarios.

Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.

Environment: Hadoop, Hive, Impala, Kafka, Talend, Avro, Parquet, Kerberos, Sentry, SSL, JILs, Jira, UNIX.


