Data Engineer Machine Learning

Location:
Hyderabad, Telangana, India
Posted:
September 10, 2025

Resume:

Pranay Ammula Data Engineer

Email: ******.**********@*****.***

Mobile: +1-443-***-****

SUMMARY

•Around 5 years of experience across the Software Development Life Cycle (SDLC), including database design, data modeling, data ingestion, data cleansing, transformation, and data warehousing for applications in multiple domains.

•Expertise in cloud-based technologies such as AWS, GCP, and Azure.

•Experience with the Hadoop ecosystem: HDFS (storage), Spark and MapReduce (processing), Hive, Pig, Sqoop, and YARN, as well as AWS.

•Hands-on experience with unified data analytics on Databricks, including the Databricks workspace UI, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.

•Set up Databricks on AWS, GCP, and Microsoft Azure, configured Databricks workspaces for business analytics, managed clusters, and managed the machine learning lifecycle.

•Skilled in high-level design of ETL and SSIS packages that integrate data over OLE DB connections from heterogeneous sources (Oracle, Excel, CSV, flat files, and text data) using the transformations provided by SSIS.

•Solid understanding of relational database systems (RDBMS), normalization, OLTP/OLAP systems, E-R modeling, and multidimensional (star) and snowflake schemas.

•Experience architecting, designing, installing, configuring, and managing Apache Hadoop clusters and MapReduce on the Hortonworks and Cloudera Hadoop distributions.

•Adept at managing large structured, semi-structured, and unstructured datasets in support of big data and machine learning applications.

•Experience with HBase, Cassandra, and MongoDB NoSQL databases, and with creating Sqoop scripts to transfer data from Teradata and Oracle into the big data environment.

•Proficient in Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming with PySpark, as well as the pandas library.

•Experience with Microsoft BI tools (SSIS, SSRS, Azure Data Factory, Power BI), using Integration Services (SSIS) and Data Factory for ETL (extraction, transformation, and loading), and Reporting Services (SSRS) and Power BI for reporting.

•Good command of version control systems (CVS, SVN, Git/GitHub, Bitbucket) and issue-tracking tools such as Jira and Bugzilla.

•Expert in creating SSIS packages and Data Factory Pipelines for extracting data from various sources like DB2, Oracle, Excel, Flat File, CSV, MS Access, and other OLE DB data sources.

•Experience with migration projects from on-premises data warehouses to the cloud.

•Worked with various file formats, including delimited text, JSON, and XML, and with columnar formats such as RCFile, ORC, and Parquet.

•Working knowledge of AWS cloud services: EC2, EBS, VPC, RDS, EMR, Redshift, DynamoDB, SES, ELB, Auto Scaling, CloudFront, CloudFormation, ElastiCache, API Gateway, Route 53, CloudWatch, and SNS.

•Hands-on experience converting existing AWS infrastructure to a serverless architecture using AWS Lambda, API Gateway, Route 53, S3, and Kinesis.

•Hands-on expertise importing and exporting data between relational databases and HDFS, Hive, and HBase using Sqoop.

•Proficient in transporting and processing real-time event streams using Kafka and Spark Streaming.

•Experience with Kafka 0.10.1 producers and stream processors for real-time data processing.

•In-depth knowledge of and practical expertise with the Python-based TensorFlow and scikit-learn frameworks for machine learning and AI.

•Good experience designing and deploying ML models as RESTful APIs using Flask and Docker on GCP AI Platform, integrating PySpark pipelines and Kafka streams for real-time data processing and inference (a minimal sketch follows this summary).

•Implemented a variety of analytics methods using Cassandra with Spark and Scala.

•Experienced in configuring and administering Hadoop clusters using major distributions such as Apache Hadoop and Cloudera.

•Proficient with big data technologies such as Hadoop 2, HDFS, MapReduce, and Spark, as well as statistical programming languages such as R and Python.

•Proficient in SQL query writing and optimization across Oracle, SQL Server, and Teradata platforms.

•Worked extensively with version control systems such as Git and Bitbucket and with build management using Maven.

•Experienced in Agile methodologies, Scrum stories, and sprints, using tools such as Jira and Confluence.
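
A minimal sketch of the Flask model-serving pattern mentioned above. The model artifact path, feature names, and port are hypothetical placeholders, and the pickled scikit-learn model is assumed to already exist; in practice the app would run behind gunicorn inside a Docker container on GCP AI Platform.

```python
# Illustrative Flask API that serves a pickled scikit-learn model for real-time predictions.
# The model path, feature list, and port are placeholders, not details from a specific project.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup (hypothetical artifact path).
with open("model/transaction_model.pkl", "rb") as fh:
    model = pickle.load(fh)

FEATURES = ["amount", "merchant_risk_score", "hour_of_day"]  # assumed feature order

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    # Build the feature vector in the order the model expects.
    row = [[payload[name] for name in FEATURES]]
    prediction = model.predict(row)[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```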

TECHNICAL SKILLS

Programming Languages: Python, SQL, Java, R, Scala, Shell Script, Perl, JavaScript, HTML, CSS

Databases: MS SQL Server, Oracle, Snowflake, Exadata, Cosmos DB, PostgreSQL, Cassandra, Redis, S3, MongoDB, MariaDB, AlloyDB, HBase, MySQL, SQLite, T-SQL

Big Data Ecosystem: Apache Kafka, Apache NiFi, Apache Spark, Spark Streaming, Hive, Impala, Hadoop, YARN, Flume, Sqoop, Oozie, Talend, MapReduce, ZooKeeper, Delta Lake, Iceberg, Apache Airflow

Hadoop Distributions: Cloudera, Databricks, Hortonworks, Presto

Data Warehousing: Redshift, BigQuery, Snowflake, Synapse Analytics, Teradata

Frameworks: Django, Spring, Flask, Tornado, Pyramid, Kubernetes, Airflow

Python Libraries: NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, PySpark, Dask, BeautifulSoup, Pickle, HTTPLib2, PyPI, ReportLab, Seaborn, Statsmodels

Cloud Technologies: AWS (S3, Glue, Athena, Redshift, EMR, Lambda, Kinesis, DynamoDB, SNS, SQS, CloudWatch, Beanstalk, EC2, ECS), Azure (Data Factory, Data Lake, ADO, Synapse, NSG, ANF), GCP (BigQuery, Cloud Storage, Pub/Sub, Cloud Composer, Looker)

ETL Tools: SSIS, ODI, Talend, Informatica, Apache NiFi, DataStage, Fivetran, dbt, Alteryx

Operating Systems: Linux (Red Hat, Ubuntu, CentOS), UNIX, Windows, macOS

Servers & APIs: Apache, Tomcat, SOAP, REST API, FastAPI, RabbitMQ, WebSphere, JBoss

Development Process: Agile, Scrum, Waterfall, CI/CD

Version Control & DevOps: Git, GitHub, GitLab, Bitbucket, SVN, Flyway, Terraform, Docker, Jenkins, Kubernetes, Helm

Business Intelligence & Visualization: Power BI, Tableau, Looker, SSRS, SSAS, OBIEE, Kibana, Google Data Studio, Excel (Power Query, Power Pivot), MS Visio, JIRA, Confluence

Statistical & Data Analysis: A/B Testing, Regression Analysis, Hypothesis Testing, Time Series Analysis, Predictive Modeling, Data Wrangling, Data Cleaning

RDBMS & NoSQL Databases: Oracle (9i/10g/11g/12c), DB2, SQL Server, MySQL, HBase, CouchDB, RBAC, SCIM

Data Mining & Machine Learning: Scikit-learn, TensorFlow, Keras, NLP, Clustering, Classification, Sentiment Analysis

RELATED EXPERIENCE

Client: Scotiabank, New York, NY Feb 2025 to Present

Role: Data Engineer

Project Overview: Developed a cloud-based platform that prioritizes 1,000+ high-value transactions daily, using GCP, Kafka, and Spark for real-time ETL, integrating Python Cloud Functions with SQL Server for data storage, and automating anomaly detection.

Responsibilities:

•Built ETL pipelines using Airflow on GCP, writing Spark applications in Scala and Python, and integrating Kafka for real-time data ingestion, processing, and storage in HDFS and BigQuery.

•Utilized BigQuery, Dataproc, Cloud Functions, HDFS, Hive, HBase, and Spark to manage large datasets, writing Spark SQL and custom aggregate functions for transformation and analysis.

•Migrated an on-premises Hadoop MapR system to GCP using SSIS packages, Cloud Shell SDK, and PySpark, optimizing ETL processes and integrating diverse data formats like CSV and JSON.

•Developed Spark apps in Databricks to analyze customer usage patterns, monitor cluster performance, and secure data using Databricks encryption and Spark SQL for data transformation.

•Configured Dataproc clusters and GCP storage services, analyzed SQL scripts, and planned PySpark implementations for scalable data integration, automating daily ad hoc reports from BigQuery.

•Extracted and transformed data using Hive, Cassandra, and HBase with PySpark, loading CSV, JSON, and flat files into BigQuery, Teradata, and Oracle for efficient querying and analytics.

•Wrote Pig scripts for ETL jobs, migrated Teradata objects to Snowflake, and used Sqoop to export processed data into Teradata for BI reports and business visualizations.

•Automated nightly builds using Jenkins and Maven, deployed microservices to Docker registries, and processed message streams using Kafka to populate Hive and HDFS with large datasets.

•Created Java data formatting scripts for Hadoop MapReduce, converted HQL-based code to PySpark, and developed Kafka models to populate external Hive tables for large-scale data analytics.

•Developed RESTful APIs using Flask to serve ML models, enabling real-time predictions; deployed models via Docker on GCP AI Platform with secure access and monitoring.

•Integrated ML pipeline with PySpark on Dataproc, automating model training and deployment; consumed streaming data using Scala Kafka consumers to support real-time inference APIs.

•Orchestrated Airflow DAGs to automate multi-source ETL workflows, incorporating Cloud Storage, Dataproc, and Pub/Sub to manage real-time streaming and batch processing for high-value transactions (a simplified DAG sketch follows this section).

•Enhanced data governance by configuring IAM roles, VPC Service Controls, and Data Loss Prevention (DLP) on GCP.

•Implemented schema evolution for JSON, Avro, and Parquet formats in BigQuery and automated anomaly forecasting workflows using Spark MLlib models, leveraging Kafka streams and BigQuery ML to predict transaction anomalies.

•Built hybrid cloud data synchronization pipelines between on-prem SQL Server and GCP, leveraging Cloud Dataflow and Cloud Functions.

•Developed a Kafka stream deduplication mechanism using PySpark and HBase, ensuring accurate real-time ingestion and consistent BigQuery and HDFS storage for high-frequency transaction data streams (a simplified PySpark sketch follows this section).

•Imported Teradata files into HDFS and loaded data into Hive and Impala, managing data integration from Oracle and using Scala Kafka consumers for real-time data processing.

•Facilitated cross-team code reviews for Spark, PySpark, and SQL scripts and maintained Jira boards and project documentation, tracking progress on data engineering tasks.

•Designed CI/CD pipelines with Jenkins, GitLab, and Docker to minimize human interaction in the automatic deployment of Spark applications.

•Managed source code repositories in Bitbucket and Git, enforced version control practices, and adopted unit-testing frameworks and agile processes to ensure iterative development and delivery of high-quality data solutions.

•Collaborated with the Scrum Master and product owners to break down data engineering tasks into user stories, refine backlog items, and ensure sprint deliverables for real-time ETL pipelines in agile environments.

Environment: Python, Scala, Apache Spark, PySpark, Apache Kafka, Hive, HBase, Hadoop, HDFS, Dataproc, BigQuery, Databricks, GCP, Cloud Functions, Snowflake, Teradata, Oracle, Cassandra, Impala, Apache Pig, SSIS, Cloud Shell SDK, Jenkins, Maven, Docker, Kafka Consumer API, Sqoop, Pig Scripts, Spark SQL, Hive UDF, MapReduce, Data Lake, Git, Tableau, Agile, JIRA, Scrum.
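
The Airflow orchestration described above, reduced to a minimal sketch: one Dataproc PySpark transform followed by a BigQuery refresh. The project, region, cluster, bucket, table names, and schedule are hypothetical placeholders, and the real DAGs would also include Pub/Sub sensors, retries, and alerting.

```python
# Sketch of an Airflow DAG that runs a PySpark job on Dataproc and then refreshes
# a BigQuery reporting table. All identifiers below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "example-project"   # placeholder
REGION = "us-east1"              # placeholder
CLUSTER = "etl-cluster"          # placeholder

with DAG(
    dag_id="transaction_etl",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Run the PySpark job that cleans and enriches raw transaction files.
    transform = DataprocSubmitJobOperator(
        task_id="spark_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://example-bucket/jobs/transform.py"},
        },
    )

    # Rebuild the downstream reporting table from the staged results.
    publish = BigQueryInsertJobOperator(
        task_id="refresh_reporting_table",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE analytics.high_value_txn AS "
                    "SELECT * FROM staging.transactions WHERE amount > 10000"
                ),
                "useLegacySql": False,
            }
        },
    )

    transform >> publish
```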
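
The stream-deduplication idea from this section, sketched with Spark Structured Streaming alone; the HBase lookup used in the actual pipeline is replaced here by Spark's built-in watermark deduplication for brevity. The broker, topic, schema, and sink paths are hypothetical placeholders.

```python
# Read transaction events from Kafka, drop duplicates within a watermark window,
# and persist the deduplicated stream as Parquet. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", StringType()),
    StructField("event_time", TimestampType()),
])

spark = SparkSession.builder.appName("txn-dedup").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "transactions")               # placeholder topic
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Duplicate deliveries are assumed to repeat the same transaction_id and event_time;
# the watermark bounds how long Spark keeps deduplication state.
deduped = (
    events.withWatermark("event_time", "10 minutes")
    .dropDuplicates(["transaction_id", "event_time"])
)

query = (
    deduped.writeStream.format("parquet")
    .option("path", "gs://example-bucket/clean/transactions")             # placeholder sink
    .option("checkpointLocation", "gs://example-bucket/checkpoints/dedup")
    .start()
)
query.awaitTermination()
```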

Client: BCBS, Chicago, IL Mar 2024 to Jan 2025

Role: Data Engineer

Project Overview: Migrated on-prem financial data to Azure and Snowflake using ADF pipelines and Databricks ETL, ensuring compliance and data quality, and automated resource deployments with ARM templates to accelerate environment setup by 40%.

Responsibilities:

•Designed ARM templates and custom PowerShell scripts in Azure to automate resource creation and deployment, saving roughly 140 hours of effort for each new environment.

•Designed and implemented a big data analytics architecture, transferring data from Oracle.

•Created pipelines in Azure Data Factory using linked services and datasets to extract, transform, and load data between storage systems such as Azure SQL, Blob Storage, Azure DW, and Azure Data Lake.

•Managed user provisioning requests and migrated on-prem SQL databases to Azure SQL using SSIS.

•Created Azure HDInsight clusters using PowerShell scripts to automate the process.

•Drafted scripts to transfer data from the FTP server to the ingestion layer using Azure CLI commands.

•Developed reusable SSIS packages for multi-format data and Databricks ETL workflows, staging 500GB of business data.

•Landed diverse datasets in ADLS as Parquet files and implemented Agile standards to improve collaboration by 15%.

•Performed regression testing for golden (end-to-end) test cases from the State and automated the process using Python scripts.

•Developed Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats (a minimal PySpark sketch follows this section).

•Created DDLs for tables and executed them to create warehouse tables for ETL data loads.

•Created fact and dimension tables in the OLAP database, built cubes using MS SQL Server Analysis Services (SSAS), and used Azure Data Lake as a source, pulling data with Azure PolyBase.

•Developed business intelligence solutions using SQL Server Data Tools 2015 and 2017 and loaded data into SQL Server and Azure cloud databases.

•Extracted, transformed, and loaded data from source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).

•Used Azure Data Lake Storage Gen2 to store Excel and Parquet files and retrieved user data using the Blob API.

•Added new users and user groups per client requests.

•Performed Logistic Regression and Linear Regression using Python to determine the accuracy rate of each model.

•Managed data lake data movements involving Hadoop and NoSQL databases such as HBase and Cassandra.

•As part of the data migration, wrote SQL scripts to reconcile data mismatches and loaded historical data from Teradata into Snowflake.

•Cleaned and transformed financial data using SQL in Snowflake worksheets, improving data accuracy by 25%.

•Built Snowflake analytical warehouses and performed data quality analysis, ensuring consistency across reporting datasets.

•Loaded data from Azure Blob and Data Lake into Azure Synapse, automating retrieval via Spark.

•Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from numerous file formats, including XML, JSON, CSV, and other compressed formats.

•Created Snowflake metric tables and views for Tableau, importing 1TB of legacy data to S3.

•Participated actively in Scrum meetings, managed daily tasks, and ensured iteration cycles were followed to deliver the project on time.

•Worked in an Agile environment, collaborating with cross-functional teams to develop, test, and deploy data pipelines.

Environment: Azure, ARM Templates, PowerShell, SSIS, Azure Data Factory, Azure SQL, Azure Blob Storage, Azure Data Lake, Azure Synapse, SQL Server, Snowflake, Teradata, HBase, Cassandra, Python, Spark, Parquet, Blob API, Tableau, SQL, Regression Testing, Data Migration, Logistic Regression, Linear Regression, Hadoop, SQL Scripts, Data Warehousing, Cloud Databases, S3, Data Lake Management, Big Data Analytics, Agile, Scrum, Sprints.
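
A minimal Databricks-style PySpark sketch of the ADLS staging pattern described above: read a raw delimited extract from Azure Data Lake Storage Gen2, apply light cleansing, and land it as Parquet. The storage account, containers, and column names are hypothetical placeholders.

```python
# Read a raw CSV extract from ADLS Gen2, clean it, and land it as partitioned Parquet
# in a staging zone for downstream warehouse loads. All paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-staging").getOrCreate()

SOURCE = "abfss://raw@examplestorage.dfs.core.windows.net/finance/transactions/"
TARGET = "abfss://staging@examplestorage.dfs.core.windows.net/finance/transactions_parquet/"

raw = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv(SOURCE)
)

cleaned = (
    raw.dropDuplicates()
    .withColumn("load_date", F.current_date())                    # audit column
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))  # assumed numeric column
)

# Partition by load date so downstream Synapse/Snowflake loads can prune files.
cleaned.write.mode("overwrite").partitionBy("load_date").parquet(TARGET)
```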

Client: Softic Digitech, India April 2020 to July 2022

Role: Data and Analytics Engineer

Project Overview: Developed ETL pipelines with Apache Spark, Python, and SQL to process healthcare data, integrated data from APIs and Kafka into AWS (Redshift, S3) using Airflow and Glue, ensuring HIPAA compliance and scalability.

Responsibilities:

•Developed ETL pipelines using Apache Spark, Python, and SQL for data ingestion, transformation, and loading into cloud-based data warehouses (AWS Redshift, Snowflake); a minimal sketch follows this section.

•Designed and optimized relational and NoSQL databases (PostgreSQL, MongoDB) with indexing, partitioning, and query performance tuning for large-scale analytics workloads.

•Created scalable ETL pipelines in Databricks with Apache Spark and connected them to big data processing capabilities offered by AWS, including Redshift and S3.

•Implemented data pipelines using Apache Airflow for workflow orchestration and scheduling, integrating data from APIs, Kafka, and batch processes into centralized storage with AWS Glue, Lambda, and Step Functions.

•Developed and deployed machine learning data pipelines using PySpark and TensorFlow, ensuring scalable feature engineering while automating data quality checks with DBT for integrity and schema validation.

•Managed cloud-based data infrastructure on AWS, utilizing Terraform for Infrastructure as Code (IaC) and deploying containerized data processing workloads using Kubernetes.

•Created Snowflake data models tuned for fast query execution and integrated them with AWS S3 for efficient data archiving and retrieval.

•Used libraries such as pandas and NumPy to manipulate and analyze large datasets and performed exploratory data analysis in Python.

•Built interactive dashboards and reports using Power BI, while implementing CI/CD pipelines with Git, Docker, and Jenkins to automate testing and deployment of ETL jobs.

•Created and tuned complex PL/SQL queries in MySQL, improving database efficiency and ensuring data consistency for transactional applications.

•Created physical data models for enterprise data warehousing and business intelligence using tools such as Microsoft Visio and ER/Studio.

•Optimized big data processing workflows by leveraging Hadoop and Spark for distributed computing.

Environment: Apache Spark, Python, SQL, Snowflake, PostgreSQL, MongoDB, Apache Airflow, Kafka, AWS Glue, AWS Lambda, TensorFlow, PySpark, DBT, Terraform, Kubernetes, Tableau, Power BI, Git, Docker, Jenkins, Hadoop
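
A sketch of the S3-to-S3 ETL pattern described in this section: read raw JSON healthcare events, mask direct identifiers, and write partitioned Parquet that Redshift Spectrum or Glue/Athena can query. Bucket names and fields are hypothetical placeholders, and the hashing step only illustrates the de-identification approach implied by the HIPAA-compliance requirement.

```python
# Read raw JSON claims from S3, hash the member identifier, and write curated,
# partitioned Parquet back to S3. All bucket names and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-etl").getOrCreate()

RAW_PATH = "s3://example-raw-bucket/claims/"          # placeholder
CURATED_PATH = "s3://example-curated-bucket/claims/"  # placeholder

claims = spark.read.json(RAW_PATH)

curated = (
    claims
    # Hash the member identifier so downstream analytics never sees the raw value.
    .withColumn("member_key", F.sha2(F.col("member_id"), 256))
    .drop("member_id", "member_name")
    .withColumn("service_date", F.to_date("service_date"))
    .withColumn("claim_amount", F.col("claim_amount").cast("double"))
    .withColumn("service_year", F.year("service_date"))
)

# Partition by year so external tables in the warehouse can prune partitions.
curated.write.mode("append").partitionBy("service_year").parquet(CURATED_PATH)
```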


