
Ashok Macha

Email: ************@*****.***

Phone: 603-***-****

Sr. Data Engineer

PROFESSIONAL SUMMARY:

•Around 9 years of professional experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions with HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.

•Experience in working with different Hadoop distributions like CDH and Hortonworks

•Experience in working with SQL Server Integration Services (SSIS) packages.

•Expert in utilizing ETL tools including SSIS and the Import/Export Wizard. Worked with different types of control flow tasks and data flow tasks performing data management and data enrichment.

•Created SSIS packages for data movement and for managing historical data from various heterogeneous data sources.

•Integrated SSIS with SQL Server Agent jobs to automate data workflows and ensure scheduled data refreshes.

•Collaborated with stakeholders to gather requirements, design data models, and deliver customized ETL solutions that meet business objectives.

•Hands-on experience in deploying and troubleshooting SSIS packages in development, test, and production environments.

•Expertise in working with data warehousing concepts such as star schema, snowflake schema, and slowly changing dimensions (SCD).

•Extensive experience in working with AWS cloud platform (EC2, S3, EMR, Redshift, Lambda and Glue).

•Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage.

•Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

•Proficient in various Big Data technologies including Hadoop, Apache NiFi, Hive Query Language, the HBase NoSQL database, Sqoop, Spark, Scala, Oozie, and Pig, as well as Oracle Database and Unix shell scripting.

•Implemented enterprise data lakes using Apache NiFi.

•3+ years of experience as Azure Cloud Data Engineer in Microsoft Azure Cloud technologies including Azure Data Factory (ADF), Azure Data Lake Storage (ADLS).

•Developed and designed Microservices components for the business by using Spring Boot.

•Experience developing Pig Latin and HiveQL scripts for data analysis and ETL purposes, and extended the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing.

•Capable of using Amazon S3 to support data transfer over SSL, with data encrypted automatically once it is uploaded.

•Strong knowledge of the architecture of distributed systems and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

•Good experience in creating data ingestion Pipelines, Data Transformations, Data Management, Data Governance, and real time streaming at an enterprise level.

•Involved in data migration to Snowflake using AWS S3 buckets.

•Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).

•Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.

•Experience in using SDLC methodologies like Waterfall, Agile Scrum for design and development.

•Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
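
A minimal PySpark sketch of the partitioning and bucketing approach described above; the table, path, and column names (sales_db.orders, order_dt, customer_id, amount) are hypothetical placeholders rather than details from any specific engagement.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-bucketing")
    .enableHiveSupport()
    .getOrCreate()
)

# Assume a staged DataFrame of orders with order_dt, customer_id, and amount columns.
orders = spark.read.parquet("/data/staging/orders")

# Write a Hive table partitioned by date and bucketed by customer_id;
# bucketBy() only works together with saveAsTable().
(
    orders.write
    .mode("overwrite")
    .format("orc")
    .partitionBy("order_dt")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("sales_db.orders")
)

# Partition pruning limits this HiveQL-style query to a single date's files.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_db.orders
    WHERE order_dt = '2021-01-01'
    GROUP BY customer_id
""").show()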

•In depth understanding of Hadoop Architecture and its various components such as Resource Manager, Application Master, Name Node, Data Node, HBase design principles etc.

•Experience with various distributions, including Cloudera distributions (CDH4/CDH5).

•Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse and Azure PowerShell.

•Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.

•Experienced in building and optimizing AWS data pipelines, architectures and data sets.

•Experience with migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.

•Excellent Programming skills at a higher level of abstraction using Scala, Java, and Python.

•Experience in job workflow scheduling and monitoring tools like Oozie and good knowledge on Zookeeper.

•Profound understanding of Partitions and Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.

•Experience in writing real-time query processing using Cloudera Impala.

•Strong working experience in planning and carrying out Teradata system extraction using Informatica, loading processes, data warehousing, large-scale database management, and reengineering.

•Highly experienced in creating complex Informatica mappings and workflows working with major transformations.

•Utilized Python libraries such as Boto3 and NumPy for AWS work.
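
The sketch below illustrates the kind of Boto3 plus NumPy usage mentioned above; the bucket name, object keys, and local paths are assumptions made for the example, not actual project values.

import boto3
import numpy as np

s3 = boto3.client("s3")

# Save a small NumPy array locally, then push it to S3 (bucket name is a placeholder).
values = np.random.rand(100, 3)
np.savetxt("/tmp/values.csv", values, delimiter=",")
s3.upload_file("/tmp/values.csv", "example-data-bucket", "samples/values.csv")

# Pull the object back down and reload it with NumPy.
s3.download_file("example-data-bucket", "samples/values.csv", "/tmp/values_copy.csv")
restored = np.loadtxt("/tmp/values_copy.csv", delimiter=",")
print(restored.shape)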

•Worked on NoSQL databases including HBase and MongoDB.

•Experienced with performing CRUD operations using HBase Java Client API and Solr API.

•Good experience in working with cloud environment like Amazon Web Services (AWS) EC2 and S3.

•Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins, and AWS.

•Experience writing Shell scripts in Linux OS and integrating them with other solutions.

•Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.

•Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.

•Experience in automation and building CI/CD pipelines by using Jenkins and Chef.

•Experience with Agile methodologies, including Scrum.

TECHNICAL SKILLS:

Hadoop/Big Data Ecosystem

Apache Spark, HDFS, MapReduce, Hive, Kafka, Sqoop, YARN, Pig, HBase, Impala, ZooKeeper, Oozie, Cassandra, Flume, AWS, EC2

Programming & Scripting

Python, PySpark, SQL, Scala

NoSQL Databases

MongoDB, DynamoDB

SQL Databases

MS SQL Server, MySQL, Oracle, PostgreSQL

Cloud Computing

AWS, Azure, Google Cloud

Operating Systems

Ubuntu (Linux), macOS, Windows 10/8

Reporting

Power BI, Tableau

Version Control

Git, GitHub, SVN

Methodologies

Agile/ Scrum, Rational Unified Process and Waterfall

PROFESSIONAL EXPERIENCE:

Client: Gainwell Technologies- Irving, TX (remote) February 2022 to Present

Role: Sr. Big Data/Data Engineer

Responsibilities:

•Responsible for design, development and delivery of data from Operational systems and files into ODS, downstream Data Marts and Files.

•Developed Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and analysis. Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.

•Used CloudWatch for monitoring the servers' (AWS EC2 instances) CPU utilization and system memory.

•Involved in running Hadoop streaming jobs to process Terabytes of data.

•Integrated services such as Bitbucket, AWS CodePipeline, and AWS Elastic Beanstalk to create a deployment pipeline.

•Loaded tables from Azure Data Lake to Azure Blob Storage for pushing them to Snowflake.

•Built the code efficiently and worked with business analysts, end users, and architects.

•Developed and deployed stacks using AWS CloudFormation templates (CFT) and Terraform.

•Involved in scheduling Oozie workflow engine to run multiple Hive jobs.

•Developed new ETL process in Spark and converted the existing Hive scripts into Spark.

•Implemented custom UDFs using Spark RDDs, DataFrames, and Spark SQL.
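
A minimal sketch of a custom PySpark UDF registered for both the DataFrame API and Spark SQL; the masking logic and column name are hypothetical examples, not the actual business rules.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

def mask_id(value):
    # Keep only the last four characters of an identifier (illustrative logic).
    return None if value is None else "***-**-" + value[-4:]

mask_id_udf = udf(mask_id, StringType())

df = spark.createDataFrame([("123-45-6789",), ("987-65-4321",)], ["member_id"])
df.select(mask_id_udf(col("member_id")).alias("masked_id")).show()

# The same function exposed to Spark SQL queries.
spark.udf.register("mask_id", mask_id, StringType())
spark.sql("SELECT mask_id('123-45-6789') AS masked_id").show()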

•Implemented Azure data pipelines to migrate data from different sources to Azure Data Lake using ADF.

•Designed the optimal performance strategy and managed the technical metadata across all ETL jobs.

•Implemented scripts in PySpark to validate the data and automated the scripts.

•Wrote PySpark jobs in AWS Glue to merge data from multiple tables and utilized AWS Glue to run ETL jobs and aggregations in PySpark code.
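
A hedged sketch of an AWS Glue PySpark job that joins two catalog tables and writes the merged result; the database, table, column, and S3 path names are placeholders assumed for illustration.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source tables from the Glue Data Catalog (names are placeholders).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="orders").toDF()
customers = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="customers").toDF()

# Merge and aggregate with standard PySpark.
merged = (
    orders.join(customers, "customer_id", "left")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# Write the result back to S3 as Parquet via a DynamicFrame.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(merged, glue_context, "merged"),
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders_by_customer/"},
    format="parquet",
)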

•Involved in designing, developing, testing, tuning, and building a large-scale data processing system.

•Involved in data migration using NiFi.

•Scheduled jobs and Databricks workflows using Airflow.
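
A minimal Airflow sketch of the kind of scheduling described above; the DAG id, cron schedule, and spark-submit command are assumptions for illustration, not the actual production workflow.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="nightly_pyspark_validation",  # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",  # run nightly at 02:00
    catchup=False,
) as dag:
    run_validation = BashOperator(
        task_id="run_validation_job",
        bash_command="spark-submit /opt/jobs/validate_loads.py",  # placeholder job path
    )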

•Responsible for building solutions involving large data sets using SQL methodologies and data integration tools such as Informatica across databases.

•Used JIRA for bug tracking and GIT for version control.

•Followed agile methodology for the entire project.

Client: Premier Inc. - Albertsons – Charlotte, NC March 2021 to January 2022

Role: Sr. Data Engineer

Responsibilities:

•Replaced existing MapReduce programs and Hive queries with Spark applications using Scala.

•Designed and implemented complex SSIS packages for Premier, performing ETL processes, data cleansing, transformations, and integrations from multiple sources such as SQL Server databases, Excel, flat files, and external APIs.

•Developed and deployed optimized SSIS packages into production, ensuring robust error handling, logging, and monitoring capabilities, significantly enhancing Premier's data accuracy and reliability.

•Configured automated execution of SSIS Jobs via SQL Server Agent, managing scheduled batch processes and data refresh cycles critical to Premier's reporting and analytics.

•Provided end-to-end support, from development through deployment, maintaining high standards in data quality and compliance with Premier’s data governance policies.

•Developed data pipeline using Flume, Sqoop and Pig to extract the data from weblogs and store in HDFS.

•Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.

•Developed Shell scripts for scheduling and automating the job flow.

•Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage.

•Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR.

•Performed Hive tuning techniques such as partitioning, bucketing, and memory optimization.

•Hands-on experience with Sqoop import, export, and eval.

•Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).

•Performed DistCp while loading the historic data into Hive.

•Worked on defining virtual warehouse sizing in Snowflake for different types of workloads.

•Used Spark SQL to load data, created schema RDDs on top of it that load into Hive tables, and handled structured data using Spark SQL.

•Involved in converting HQL into Spark transformations using Spark RDDs with support of Python and Scala.

•Developed a Pig script that picks data from one HDFS path, performs aggregation, and loads it into another path, which later populates a domain table. Converted this script into a JAR and passed it as a parameter in an Oozie workflow.

•Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.

•Built an ETL process that uses a Spark JAR which executes the business analytical model.

•Hands-on experience with Git commands such as git pull to pull code from source and develop it per requirements, git add to stage files, git commit after the code build, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.

•Performed data validation through record-wise counts between the source and destination.

•Have knowledge on Apache Hue.

•Good hands-on experience with Git and GitHub.

•Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems by using different Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.

•Involved in the data support team in a role covering bug fixes, schedule changes, memory tuning, schema changes, and loading historic data.

•Worked on implementing checkpoints such as Hive count checks, Sqoop record checks, done-file creation checks, done-file checks, and touch-file lookups.

•Documented the workflow action process and bug fixes.

•Communicated and collaborated with the team to clear blockers and maintained good communication with stakeholders.

•Worked on both Agile and Kanban Methodologies.

Environment: Hadoop, MapReduce, HDFS, Hive, Impala, UNIX, Linux, Tableau, Teradata, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, Spark SQL, Spark Streaming, Scala, Java, Azure, Azure PowerShell, Azure Data Lake, Pig, NoSQL, Solr, Git.

Client: Furniture – Houston, TX November 2019 to February 2021

Role: Sr. Big Data/Data Engineer

Responsibilities:

•Developed ETL data pipelines using Spark and PySpark.

•Analyzed SQL scripts and designed the solutions to implement using PySpark.

•Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.

•Used Pandas, NumPy, and Spark in Python for developing data pipelines.

•Performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages in Python.
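
A small Pandas/NumPy sketch of the kind of cleaning, scaling, and feature engineering described above; the column names and values are invented for the example.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order_amount": [120.0, np.nan, 89.5, 430.0],
    "units": [2, 1, None, 5],
    "region": ["east", "west", "east", None],
})

# Basic cleaning: fill numeric gaps with the median, categorical gaps with a label.
df["order_amount"] = df["order_amount"].fillna(df["order_amount"].median())
df["units"] = df["units"].fillna(df["units"].median())
df["region"] = df["region"].fillna("unknown")

# Feature scaling: z-score standardization with NumPy.
for column in ["order_amount", "units"]:
    values = df[column].to_numpy(dtype=float)
    df[column + "_scaled"] = (values - values.mean()) / values.std()

# Simple feature engineering: a derived per-unit price.
df["price_per_unit"] = df["order_amount"] / df["units"]
print(df)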

•Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.

•Implemented the Kafka-to-Hive streaming process flow and batch loading of data into MariaDB using Apache NiFi.

•Implemented end-to-end data flow using Apache NiFi.

•Responsible for loading data pipelines from web servers using Kafka and the Spark Streaming API.
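
A minimal sketch of reading web-server events from Kafka with Spark, shown here with the Structured Streaming Kafka source; the broker address, topic name, and sink paths are assumptions, and the job would also need the spark-sql-kafka package available.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("weblog-stream").getOrCreate()

# Subscribe to the web-log topic (broker and topic names are placeholders).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "weblogs")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as binary; cast it to string for downstream parsing.
lines = events.select(col("value").cast("string").alias("raw_event"))

# Land micro-batches as Parquet with checkpointing so the job can recover.
query = (
    lines.writeStream
    .format("parquet")
    .option("path", "/data/streams/weblogs")
    .option("checkpointLocation", "/data/checkpoints/weblogs")
    .start()
)
query.awaitTermination()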

•Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.

•Developed batch scripts to fetch data from Google cloud S3 storage and perform the required transformations in Scala using the Spark framework.

•Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

•Data processing: processed data using MapReduce and YARN. Worked on Kafka as a proof of concept for log processing.

•Monitored the Hive Metastore and the cluster nodes with the help of Hue.

•Created Google cloud EC2 instances and used JIT Servers.

•Developed various UDFs in Map-Reduce and Python for Pig and Hive.

•Data Integrity checks have been handled using Hive queries, Hadoop, and Spark.

•Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.

•Implemented the Machine learning algorithms using Spark with Python.

•Defined job flows and developed simple to complex MapReduce jobs as per requirements.

•Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.

•Developed PIG UDFs for manipulating the data according to Business Requirements and worked on developing custom PIG Loaders.

•Responsible for handling streaming data from web server console logs.

•Installed Oozie workflow engine to run multiple Hive and Pig Jobs.

•Developed PIG Latin Scripts for the analysis of semi structured data.

•Used Hive and created Hive Tables and involved in data loading and writing Hive UDFs.

•Used Sqoop to import data into HDFS and Hive from other data systems.

•Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.

•Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.

•Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.

•Involved in NoSQL database design, integration, and implementation.

•Loaded data into NoSQL database HBase.

•Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along with components on HDFS, Hive.

•Very good understanding of Partitions, bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, AWS, ETL, Pig, UNIX, Linux, Tableau, Teradata, Sqoop, Hue, Oozie, Java, Scala, Python, Git.

Client: Aditya Birla Group, Chennai, India July 2017 to October 2019

Role: Hadoop Engineer/Developer

Responsibilities:

•Experienced in development using Cloudera Distribution System.

•Designed and developed ETL integration patterns using Python on Spark.

•Optimized PySpark jobs to run on secured clusters for faster data processing.

•Used Python for SQL/CRUD operations in DB, file extraction/transformation/generation.

•Developed Spark applications in Python (PySpark) in a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
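
A hedged PySpark sketch of loading CSV files into a Hive ORC table as described above; the input path, database, table, and partition column are illustrative assumptions.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a directory of CSV files, inferring the schema from the headers.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/transactions/")
)

# Append into a Hive-managed ORC table, partitioned by a load-date column
# assumed to exist in the source files.
(
    raw.write
    .mode("append")
    .format("orc")
    .partitionBy("load_dt")
    .saveAsTable("raw_zone.transactions")
)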

•Designing and Developing Apache NiFi jobs to get the files from transaction systems into data lake raw zone.

•Analyzed the user requirements and implemented the use cases using Apache NiFi.

•Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.

•As a Hadoop Developer, my role was to manage the data pipelines and data lake.

•Have experience working on the Snowflake data warehouse.

•Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.

•Designed custom Spark REPL application to handle similar datasets.

•Used Hadoop scripts for HDFS (Hadoop File System) data loading and manipulation.

•Performed Hive test queries on local sample files and HDFS files.

•Used AWS services like EC2 and S3 for small data sets.

•Developed the application on Eclipse IDE.

•Developed Hive queries to analyze data and generate results.

•Used Spark Streaming to divide streaming data into batches as an input to Spark Engine for batch processing.

•Worked on analyzing Hadoop cluster and different Big Data analytic tools including Pig, hive, HBase, Spark and Sqoop.

•Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.

•Used Scala to write code for all Spark use cases.

•Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.

•Assigned names to each of the columns using the case class option in Scala.

•Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).

•Involved in converting HQL queries into Spark transformations using Spark RDDs with support of Python and Scala.

•Developed multiple Spark SQL jobs for data cleaning.

•Created Hive tables and worked on them using Hive QL.

•Assisted in loading large sets of data (structured, semi-structured, and unstructured) to HDFS.

•Developed Spark SQL to load tables into HDFS to run select queries on top.

•Developed analytical components using Scala, Spark, and Spark Streaming.

•Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.

•Worked on the NoSQL databases HBase and MongoDB.

Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Apache NiFi, Oracle 11g/10g, Zookeeper, MySQL, Spark.

DCS Technologies - Bangalore, India January 2015 to June 2017

Role: Data Analyst

Responsibilities:

•Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.

•Migrated Existing MapReduce programs to Spark Models using Python.

•Migrated data from the data lake (Hive) into an S3 bucket.

•Performed data validation between the data present in the data lake and the S3 bucket.

•Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.

•Designed batch processing jobs using Apache Spark to increase speeds tenfold compared to MapReduce jobs.

•Used Kafka for real time data ingestion.

•Created different topics for reading data in Kafka.

•Read data from different topics in Kafka.

•Involved in converting HQL queries into Spark transformations using Spark RDDs with support of Python and Scala.

•Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.

•Wrote Hive queries for data analysis to meet the business requirements.
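
A hedged sketch of loading S3 data into Snowflake with the Snowflake Python connector; the account, warehouse, stage, and table names are placeholders, and credentials would come from a secrets store rather than literals.

import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER",                    # placeholder service account
    password="<from-secrets-manager>",  # never hard-code real credentials
    account="example_account",
    warehouse="REPORTING_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Assumes an external stage (@S3_LANDING) already points at the S3 bucket.
    cur.execute("""
        COPY INTO ANALYTICS.PUBLIC.DAILY_SALES
        FROM @S3_LANDING/daily_sales/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()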

•Migrated an existing on premises application to AWS.

•Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.

•Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

•Created many Spark UDFs and UDAFs in Hive for functions that were not pre-existing in Hive and Spark SQL.

•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

•Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and performing map-side joins.
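
As an illustration of the map-side join idea above, the sketch below uses Spark's broadcast join, which is the DataFrame-level analogue of a Hive map-side join; the paths and join key are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

# A large fact table and a small dimension table (paths are placeholders).
transactions = spark.read.parquet("/data/warehouse/transactions")
stores = spark.read.parquet("/data/warehouse/stores")

# broadcast() ships the small table to every executor, so the join runs
# without shuffling the large side, much like a Hive map-side join.
enriched = transactions.join(broadcast(stores), "store_id", "left")
enriched.show(5)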

•Good knowledge of Spark platform parameters such as memory, cores, and executors.

•Provided concurrent access to Hive tables with shared and exclusive locking by using the ZooKeeper implementation in the cluster.

Educational Details:

Bachelor's in Computer Science, Andhra University, AP, India May 2014


