Chaitanya
Sr. Data Engineer
Email: **************@*****.***
Phone: 203-***-****

PROFESSIONAL SUMMARY
Overall, 8+ years of strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies. Expertise includes optimizing Data Flows, ensuring high performance in Cloud Environments, and implementing complex Data Transformations.
Core skills: SQL, Python, Spark, Data Modeling, Orchestration Tools, AWS, GCP, Azure
Highly skilled and results-driven Data Engineer with 8+ years of expertise in Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
Designed and implemented Groovy-based test automation frameworks.
Conducted extensive REST API testing in AWS, ensuring SLA adherence.
Developed Spark Scala jobs in Hadoop, optimizing data processing efficiency, and engaged in CI/CD practices, collaborating with cross-functional teams for quality assurance throughout the Software Development lifecycle.
Created, managed, and optimized automated test scripts using Groovy for enhanced performance and accuracy.
Implemented automated test frameworks and conducted comprehensive testing, including unit, integration, and regression.
Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data.
Hands-on use of the Spark and Scala APIs to compare Spark performance against Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive in Python.
Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
Experience in developing web applications using Python, Django, C++, XML, CSS, HTML, JavaScript, and jQuery.
Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
Proficient in handling complex processes using SAS/Base, SAS/SQL, SAS/STAT, SAS/Graph, and SAS/ODS.
Experience in working with Flume and NiFi for loading log files into Hadoop.
Utilized Agile and Scrum methodology for team and project management.
Extensive experience in ETL tools like Teradata Utilities, Informatica, and Oracle.
Experience in working with NoSQL databases like HBase and Cassandra.
Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
Education Details:
Completed my bachelor's degree at Koneru Lakshmaiah University, Vijayawada, Andhra Pradesh, India, in 2015.
TECHNICAL SKILLS

Languages/Utilities
C, C++, Python, PL/SQL, XML.
Big Data Technologies
Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, HDFS, MapReduce, Hive, BDM, Sqoop, Oozie, Zookeeper.
Hadoop Distribution
Cloudera CDH, Apache, AWS, Hortonworks HDP.
Programming Languages
SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions.
Spark components
RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming.
Databases
MySQL, Oracle, Teradata, SQL Server, PostgreSQL, NoSQL databases (HBase, MongoDB)
IDEs and Tools
Eclipse, NetBeans, TextPad, Maven, UML, Log4j, Ant.
Version Control Tools
Subversion, GitHub, CVS.
Web/Application Servers
Tomcat, IBM WebSphere, JBoss, Apache.
Methodologies
Waterfall, Agile.
Operating Systems
Windows 7/8/XP, Linux, UNIX.
Cloud Technologies
AWS Cloud components (S3, EC2, Lambda, RDS), Azure, GCP.
Containerization Tools
Kubernetes, Docker, Docker Swarm
Reporting Tools
JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, UNIX, CI/CD, Google Cloud Shell, Power BI, SAS, and Tableau
PROFESSIONAL EXPERIENCE

State of Texas, Austin, TX Apr 2024 to Present
Sr. Data Engineer
Responsibilities:
Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and Data Warehouses.
Developed Scala-based Spark applications for performing Data Cleansing, Event Enrichment, Data Aggregation, De-Normalization, and Data Preparation needed for machine learning and reporting teams to consume.
Worked on troubleshooting Spark applications to make them more error tolerant.
Worked on Docker Hub, Docker Swarm, and Docker container networking, creating image files primarily for middleware installation and domain configurations. Evaluated Kubernetes for Docker container orchestration.
Developed REST APIs using Python with the Flask and Django frameworks, and integrated various data sources including Java, JDBC, RDBMS, Shell Scripting, spreadsheets, and text files.
Analyzed SQL scripts and designed the solutions to implement using PySpark.
Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the sketch at the end of this section).
Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other capabilities.
Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
Worked extensively with Sqoop for importing data from Oracle.
Created batch scripts to retrieve data from AWS S3 storage and apply the appropriate transformations in Scala using the Spark framework.
Involved in creating Hive Tables, Loading, and analyzing data using Hive scripts. Implemented Partitioning, Dynamic Partitions, and Buckets in HIVE.
Developed a Python script to transfer and extract data from on-premises systems to AWS S3 via REST APIs. Implemented a microservices-based cloud architecture using Spring Boot.
Good experience with continuous Integration of applications using Bamboo.
Designed and documented operational problems by following standards and procedures using JIRA.
Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
Developed Oozie workflows for scheduling and orchestrating the ETL process. Involved in writing Python scripts to automate the extraction of weblogs using Airflow DAGs.
Developed an ETL pipeline in Python programmed to collect data from the Redshift data warehouse.
Used MongoDB to store data in JSON format and developed and tested many features of a dashboard using Python, Bootstrap, CSS, and JavaScript.
Worked on SSIS, creating all the interfaces between the front-end application and the SQL Server database, and between the legacy database and the SQL Server database in both directions.
Hands-on development and modification of SQL stored procedures, functions, views, indexes, and triggers.
Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
Used Databricks for encrypting data using server-side encryption.
Used Delta Lake, an open-source data storage layer that delivers reliability to data lakes.
Experience with Snowflake Virtual Warehouses.
Responsible for ingesting large volumes of IoT data into Kafka.
Conducted REST API testing in AWS using tools like Postman and REST Assured, ensuring high reliability and performance.
Verified data integrity and consistency across diverse Data Services, conducting ETL testing to ensure quality and accuracy of transformations.
Designed and developed data pipelines using AWS Glue, S3, and Redshift for data extraction, transformation and loading (ETL) of large-scale data sets.
Implemented AWS Lambda functions for serverless compute to automate data processing tasks, reducing infrastructure costs and improving processing efficiency.
Optimized ETL workflows in AWS by leveraging Glue and EMR to handle batch and real-time streaming data from various cloud services and on-prem databases.
Managed data security using AWS IAM roles and Policies, ensuring that only authorized users can access sensitive data in S3 and Redshift.
Configured AWS Kinesis for real-time Data Streaming, creating data ingestion pipelines to process and store streaming data in HBase and Redshift.
Automated infrastructure provisioning and deployment of AWS-based solutions using AWS CloudFormation templates for scalable and reliable architectures.
Built and deployed custom Data Processing applications using Python, AWS SDK, and SQL to transform and validate data from multiple data sources.
Utilized AWS CloudWatch for monitoring data pipeline performance, troubleshooting issues, and ensuring system reliability through automated alerts.
Collaborated with cross-functional teams to create and deploy CI/CD pipelines with AWS CodePipeline, CodeDeploy, and CodeBuild to enable continuous delivery of data solutions.
Integrated AWS Athena with AWS S3 for interactive querying of large datasets to support business intelligence and data analytics teams.
Implemented data modeling practices, creating efficient schemas for Data Lakes and Data Warehouses, ensuring data availability and usability for analytics.
Enhanced system performance by optimizing SQL queries and Data Transformations, improving the speed of data extraction, transformation, and loading in Redshift.
Developed automated Data Validation processes to ensure the integrity of data flows and the quality of Data Migrations between AWS services.
Supported business intelligence by integrating Redshift and Athena with tools like Tableau and Power BI for reporting and dashboard creation.
Environment: AWS (S3, EC2, Lambda, Glue, Redshift, Redshift Spectrum, Athena, EMR, RDS, Aurora, DynamoDB, Kinesis, Step Functions, SQS, SNS, Data Pipeline, IAM, CloudFormation, CloudWatch, CloudTrail, CodePipeline, Elastic Beanstalk, QuickSight, Amazon MSK), Azure, Agile, Jenkins, CI/CD, Apache Spark, Spark Streaming, PySpark, Scala, Python, SQL, Hive, Sqoop, Apache Kafka, HBase, ETL, Pig, Oozie, Docker, Kubernetes, Hue, Apache NiFi, Git, Microservices, Snowflake.
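Illustrative sketch for the REST-to-Kafka ingestion described in this role: a minimal Python producer that polls an external REST endpoint and publishes the returned records to a Kafka topic. It assumes the kafka-python and requests libraries and that the endpoint returns a JSON array; the endpoint URL, topic name, broker address, and polling interval are hypothetical placeholders, not values from the original pipeline.

```python
# Minimal sketch (hypothetical endpoint, topic, and broker) of polling an
# external REST API and streaming the JSON records to a Kafka topic.
import json
import time

import requests
from kafka import KafkaProducer  # kafka-python

API_URL = "https://api.example.com/events"  # hypothetical REST endpoint
TOPIC = "external-events"                   # hypothetical topic name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",     # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    for record in response.json():          # assumes a JSON array of records
        producer.send(TOPIC, value=record)  # async send; batched by the client
    producer.flush()                        # block until buffered records are delivered
    time.sleep(60)                          # poll once a minute
```

In practice the producer would add retries, schema validation, and delivery callbacks; the sketch only shows the core poll/send/flush loop.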
Humana, Brooklyn, NY Aug 2021 to Apr 2024
Sr. Data Engineer/Big Data Engineer
Responsibilities:
Involved in working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB, and SQL DWH).
Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL, and Azure Blob storage.
Created ADF pipelines using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool.
Performed data cleansing and applied transformations using Databricks and Spark data analysis.
Extensively used Databricks notebooks for interactive analysis with Spark APIs.
Used Delta Lake, which supports Merge, Update, and Delete operations, to enable complex use cases.
Developed Spark Scala scripts for mining data and performed transformations on large datasets to support real-time insights and reports.
Implemented scalable microservices to handle concurrency and high traffic. Optimized existing Scala code and improved cluster execution.
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Reduced access time by refactoring data models, streamlining queries, and implementing a Redis cache to support Snowflake.
Extensive knowledge of data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting Hadoop clusters.
Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables (see the sketch at the end of this section).
Developed Python application for Google Analytics aggregation and reporting and used Django configuration to manage URLs and application parameters.
Created several types of data visualizations using Python and Tableau.
Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark.
Guide the development team working on PySpark as an ETL platform.
Involved in creating database components such as tables, views, and triggers using T-SQL to provide structure and maintain data effectively.
Conducted statistical analysis on healthcare data using Python and various tools.
Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
Worked with the data science team on preprocessing and feature engineering, and helped bring machine learning algorithms to production.
Developed and tested Spark Scala jobs in the Hadoop ecosystem, optimizing applications for enhanced performance and resource efficiency.
Designed and implemented scalable data pipelines using Azure Data Factory, Azure Databricks, and Azure Synapse Analytics, optimizing data movement and transformation processes.
Developed and optimized ETL processes for structured and unstructured data using Azure Data Lake and Azure SQL Database, enabling seamless integration with business systems.
Built real-time streaming solutions with Azure Stream Analytics, Event Hubs, and Apache Kafka, ensuring high throughput and low-latency data processing.
Managed data storage solutions in Azure Blob Storage and Azure Data Lake Storage Gen2, ensuring efficient and secure storage for large datasets.
Utilized Azure Databricks for big data analytics and machine learning workflows, leveraging Apache Spark for high-performance data processing.
Created and deployed customized data transformation jobs using PySpark, Scala, and Spark SQL in Azure Databricks, improving data processing speed and efficiency.
Developed data ingestion workflows from on-premise and cloud data sources into Azure Data Lake and Azure SQL Database using Azure Data Factory pipelines.
Optimized and monitored data pipelines in Azure using Azure Monitor, Log Analytics, and Application Insights, ensuring seamless performance and troubleshooting capabilities.
Implemented data validation checks and automated error-handling mechanisms to ensure high data quality and integrity during processing and transfer.
Collaborated with data scientists and business intelligence teams to design end-to-end Data Workflows, meeting the analytical needs of the organization.
Used Azure Blob Storage and Azure SQL Database for data warehousing and optimized SQL queries to ensure efficient data retrieval and processing.
Developed real-time data analytics dashboards and reports using Power BI and Azure Synapse Analytics, delivering actionable insights to stakeholders.
Automated deployment of data engineering solutions with Azure DevOps, ensuring CI/CD pipelines and version control for consistent and reliable deployment.
Integrated Azure Key Vault for secure management of sensitive credentials, ensuring data security and compliance across cloud data environments.
Migrated on-premises data systems to Azure Cloud, utilizing Azure Data Migration Service, ensuring minimal downtime and maintaining data integrity throughout the process.
Maintained and documented data engineering solutions, enabling efficient knowledge sharing and onboarding for cross-functional teams.
Environment: Azure (HDInsight, Databricks, Data Lake Storage Gen2, Blob Storage, Data Factory, Synapse Analytics, SQL Database, SQL DWH, AKS, Event Hubs, Stream Analytics, Monitor, Key Vault, Data Migration Service, DevOps), Scala, Python, PySpark, Hadoop 2.x, Apache Spark v2.0.2, Spark SQL, NLP, Redshift, Airflow v1.8.2, Hive v2.0.1, Sqoop v1.4.6, HBase, Oozie, Talend, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Ranger, Git, Microservices, K-Means, KNN, Apache Kafka, CI/CD.
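Illustrative sketch for the PySpark CSV-to-Hive ORC loads described in this role: a minimal job that reads a folder of CSV files and writes them to an ORC-backed Hive table. The paths and database.table name are hypothetical; in the real pipeline, files with differing schemas would first be grouped and mapped onto a common target schema before the write.

```python
# Minimal sketch (hypothetical paths and table name) of loading CSV files
# into a Hive ORC table with PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()   # required so saveAsTable registers in the Hive metastore
    .getOrCreate()
)

# Read one landing folder of CSV files; header/inferSchema let Spark derive
# column names and types for this batch.
df = spark.read.csv("/data/landing/csv/", header=True, inferSchema=True)

# Persist the batch as an ORC-backed Hive table, appending on repeated loads.
(df.write
   .format("orc")
   .mode("append")
   .saveAsTable("staging.events_orc"))   # hypothetical database.table
```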
Hewlett-Packard, Hyderabad, IND Jan 2020 to Aug 2021
Azure Data Engineer
Responsibilities:
Configured Spark Streaming to receive real-time data from Kafka and stored the streamed data in HDFS and HBase (see the sketch at the end of this section).
Worked on the development of data ingestion pipelines using the Talend ETL tool and Bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
Used Spark Streaming APIs to perform the necessary transformations and actions on the data from Kafka and persisted it into Cassandra.
Designed and developed data integration programs in a Hadoop environment with the HBase NoSQL data store for data access and analysis.
Experience in working with Flume and NiFi for loading log files into Hadoop.
Used various Spark Transformations and Actions for cleansing the input data.
Used Python and Django to create graphics, XML processing, data exchange, and business logic implementation.
Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that gets the data from Kafka in near real-time and persists it to HBase.
Processed the real-time streaming data using Kafka, and Flume integrating with Spark streaming API.
Consumed JSON messages using Kafka and processed the JSON files using Spark Streaming to capture UI updates.
Experienced in working with Spark ecosystem using Spark SQL and Scala queries on different formats like Text Files, and CSV Files.
Developed Scala functional programs for streaming data and gathered JSON and XML data and passed them to Flume
Involved in creating Hive scripts for performing data analysis required by the business teams.
Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
Wrote and executed various MySQL database queries from Python using the Python MySQL connector and the MySQLdb package.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
Built machine learning models to showcase big data capabilities using PySpark and MLlib.
Used Kafka streams and configured Spark Streaming to get information and then store it in HDFS.
Worked extensively with Sqoop for importing metadata from MySQL and assisted in exporting analyzed data to relational databases using Sqoop.
Involved in the migration from On-Premises to Azure Cloud.
Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena, and the Glue Metastore.
Involved in Continuous Integration of applications using Jenkins.
Worked on design and development of Informatica mappings, workflows to load data into the Staging Area, Data Warehouse, and Data Marts SQL Server and Oracle.
Involved in the development of Informatica mappings and preparation of Design Documents (DD), and technical design documents.
Worked on Hive optimization techniques using Joins, and Queries and used various functions to improve the performance of long-running jobs.
Troubleshot users' analysis bugs (JIRA and IRIS tickets).
Implemented UNIX scripts to define the use case workflow to process the data files and automate the jobs.
Optimized Hive QL by using an execution engine like Spark.
Environment: Azure HDInsight, Apache Spark, Apache Kafka, EMR, Scala, Talend, PySpark, HBase, Hive, Sqoop, Flume, Informatica, Glue, Hadoop, NiFi, HDFS, Oozie, MySQL, Oracle 10g, UNIX, Shell.
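Illustrative sketch of the Kafka-to-HDFS streaming described in this role, shown with Spark Structured Streaming rather than the DStream API referenced in the bullets. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, and HDFS paths are hypothetical placeholders.

```python
# Minimal sketch (hypothetical broker, topic, and HDFS paths) of consuming a
# Kafka topic with Spark Structured Streaming and persisting it to HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to the Kafka topic; records arrive with binary key/value columns.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "clickstream")                  # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Continuously append the decoded records to HDFS as Parquet files.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/clickstream/")           # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```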
Red Bus, India Jan 2017 to Dec 2019
Data & Reporting Analyst
Responsibilities:
Performed data transformations like Filtering, Sorting, and Aggregation using Pig.
Created Sqoop jobs to import data from SQL, Oracle, and Teradata into HDFS.
Created Hive tables to push the data to MongoDB.
Wrote complex aggregate queries in Mongo for report generation.
Developed scripts to run scheduled batch cycles using Oozie and present data for reports.
Worked on a POC to build a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
Performed data validation and transformation using Python and Hadoop Streaming.
Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and then into partitioned Hive tables (see the sketch at the end of this section).
Developed bash scripts to bring TLOG files from the FTP server and then process them for loading into Hive tables.
Automated workflows using Shell Scripts and Control-M jobs to pull data from various databases into Hadoop Data Lake.
Extensively used DB2 Database to support SQL.
Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
Insert-overwrote the Hive data with HBase data daily to get fresh data every day, and used Sqoop to load data from DB2 into the HBase environment.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, with good experience in using Spark Shell and Spark Streaming.
Created Hive, Phoenix, and HBase tables, as well as HBase-integrated Hive tables, as per the design using the ORC file format and Snappy compression.
Developed Oozie Workflow for daily incremental loads, which gets data from Teradata and then imported into hive tables.
Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
Created HBase tables to load large Sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
Developed pig scripts to transform the data into a structured format, which is automated through Oozie coordinators.
Environment: Hadoop, HDFS, Spark, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Oracle, SQL, Splunk, Unix, Shell Scripting.
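Illustrative sketch of loading Sqoop-landed files from HDFS into a partitioned Hive table, as described in this role. The original loads used Sqoop, Hive scripts, and Oozie; a PySpark equivalent is shown here to keep all examples in one language. Paths, columns, and the table name are hypothetical, and the header option is only for readability of the sketch (Sqoop text exports typically have no header row).

```python
# Minimal sketch (hypothetical paths, columns, and table name) of loading
# Sqoop-landed delimited files into a partitioned Hive table with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("load-partitioned-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Sqoop typically lands delimited text files in an HDFS target directory.
raw = spark.read.option("delimiter", ",").csv(
    "hdfs:///landing/teradata/orders/", header=True, inferSchema=True
)

# Tag each run's records so they land in their own daily partition.
raw = raw.withColumn("load_date", F.current_date())

# Append into a Hive table partitioned by load date; each daily run adds a partition.
(raw.write
    .mode("append")
    .partitionBy("load_date")
    .format("parquet")
    .saveAsTable("warehouse.orders"))  # hypothetical database.table
```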