Big Data Engineer

Location:

Dayton, OH

Posted:

October 27, 2023

Resume:

Manisha Goud

Data Engineer

Mail: *************@*****.*** Contact: 740-***-****

PROFESSIONAL SUMMARY:

•Over 4 years of experience in Information Technology, with expertise in data analytics and in the design, development, implementation, testing, and deployment of software applications.

•Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and leverage new Hadoop features.

•Extensive familiarity with Hive and Spark User Defined Functions (UDFs). Used Apache Sqoop to import and export data between HDFS and RDBMS/NoSQL databases.

•Worked with NoSQL databases and connectors such as HBase, Cassandra, MongoDB, Azure Cosmos DB, and Spark-Redis.

•Experience in writing complex SQL queries, creating reports and dashboards.

•Used Informatica PowerCenter for extracting, transforming, and loading (ETL) data from heterogeneous source systems into the target database.

•Knowledge of automated deployments using Azure Resource Manager Templates, DevOps, and Git repositories for automation, as well as Continuous Integration and Continuous Delivery (CI/CD).

•Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.

•Experience developing Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats and uncover insights into customer usage patterns.

•Used Azure data platform capabilities such as Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure Machine Learning, and Power BI to build large Lambda-architecture systems.

•Experience implementing frameworks to import and export data between Hadoop and RDBMS.

•Strong experience with ETL and orchestration tools.

•Experience developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.

•Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.

•Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.

•Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool, in both directions.

•Database design, modelling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions in MySQL, MS SQL Server, DB2, and Oracle.

•Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.

•Expertise in working with Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.

•Experience building, deploying, and troubleshooting Azure Data Factory (ADF) pipelines and extracting large volumes of records with them.

•Good knowledge of Azure Synapse Analytics architecture and component integrations.

•Experienced in using Agile methodologies, including Extreme Programming, Scrum, and Test-Driven Development (TDD).

TECHNICAL SKILLS:

Big Data /Hadoop

MapReduce, Spark, Spark Streaming, Kafka, Pig, Hive, HBase, Oozie, Zookeeper

Programming Languages

Python (NumPy, SciPy, Pandas, Gensim, Keras), Java, PySpark

NOSQL DB

Cassandra, HBase, MongoDB

Dev Tools

Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans

Cloud

Azure, AWS

Build Tools

Jenkins, SQL Loader, PostgreSQL, Talend, Maven, Oozie

Reporting Tools

MS Office, Power BI, Tableau, SSRS

Databases

MS SQL Server, MySQL, Oracle, DB2, Teradata, PostgreSQL

Operating System

Windows, UNIX, LINUX

WORK EXPERIENCE:

Food Lion- Charlotte, NC July 2022 – Present

Data Engineer

Responsibilities:

•Implemented Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/result to the UI team.

•Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.

•Tuned Spark application performance by setting the right batch interval, the correct level of parallelism, and memory configuration.

•Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.

•Developed logical and physical data flow models for Informatica ETL applications.

•Worked on creating custom Docker container images and on tagging and pushing the images.

•Wrote Hive queries for data analysis to meet business requirements.

•Analyzed and resolved SQL query performance issues in databases.

•Responsible for the design and development of Spark SQL scripts based on functional requirements and specifications.

•Loaded data from the UNIX file system to HDFS.

•Loaded and transformed large sets of structured, semi-structured, and unstructured data from HBase through Sqoop and placed them in HDFS for further processing.

•Managed and scheduled jobs on a Hadoop cluster using Oozie.

•Created Hive tables, loaded data, and ran Hive queries on that data.

•Worked on AWS services such as EMR clusters, S3, and Glue, and completed a proof of concept (POC) on AWS DMS (Database Migration Service).

•Monitored AWS Redshift and applied performance tuning techniques to identify bottlenecks.

•Created AWS Glue scripts that pull data from upstream structured streams and apply business logic and transformations to meet the requirements.

•Developed Spark applications using Spark SQL in AWS Glue to extract, transform, and aggregate data from multiple file formats and uncover insights into customer usage patterns.

•Extracted, transformed, and loaded data from source systems to AWS data storage services using a combination of AWS Glue, SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more AWS services (S3, RDS, Redshift) and processed it in AWS Glue.

•Created pipelines in AWS Glue using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as AWS RDS, S3, AWS Redshift, and a write-back tool, in both directions.

•Developed JSON scripts for deploying the AWS Glue pipeline that processes the data using the SQL Activity.

•Designed end-to-end scalable architecture to solve business problems using various AWS components such as EMR, Glue, S3, RDS, and Machine Learning Studio.

•Worked independently on a Google Cloud migration project from AWS PaaS, converting the ETLs through AWS Glue and AWS Step Functions. Scheduled the jobs using AWS Glue jobs and AWS Lambda functions.

Environment: AWS, Glue, Redshift, Apache Hadoop, MapReduce, HDFS, CentOS 6.4, Spark, Kafka, Apache Airflow, HBase, Hive, Pig, Oozie, Flume, Java (JDK 1.6), Eclipse, ELK.

Xchain Technologies- Hyderabad, India May 2018 – Aug 2021

Data Engineer

Responsibilities:

•Worked on data ingestion, data storage, data analysis, and visualization using Azure cloud technologies, and built data pipelines.

•Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing, and delivery of structured and unstructured data, both batch and real-time streaming.

•Performed real-time data processing with Azure Synapse Analytics.

•Migrated several databases, applications, and web servers from on-premises to the Azure cloud.

•Installed and configured Apache Airflow for the Azure Blob storage data warehouse and created DAGs to run on it.

•Managed Azure Data Lake Storage and Databricks, with an understanding of how to integrate them with other Azure services.

•Used Apache Spark pools to clean, transform, and analyze streaming data, and combined it with structured data from operational databases.

•Created Python notebooks on Azure Databricks for processing the datasets and loading them into Azure SQL databases.

•Used Apache Spark pools and Synapse Pipelines in Azure Synapse Analytics to access and transfer data at scale.

•Built analytics dashboards and embedded reports in a dedicated SQL pool to share business insights with internal teams, and used Azure Analysis Services to perform business analysis.

•Worked on Data Factory and Synapse Analytics for large-scale data transformation.

•Used Azure Data Factory for data integration of large datasets.

•Integrated SQL Server with Azure Data Factory to create ETL/ELT pipelines.

•Evaluated and used Azure Data Factory as an ETL tool to process business-critical data into aggregated Hive tables in the cloud. Developed and deployed big data applications such as Spark and Hive in the Azure cloud.

•Designed and developed Flink pipelines to consume streaming data from data lakes and applied business logic to massage, transform, and serialize raw data.

•Processed large amounts of data using Azure Data Lake Analytics and stored it in the Data Lake store.

•Connected several data sources, simplified data, drove business analysis insights, and produced reports using Power BI.

•Used SQL to structure and process data for storage in databases and data warehouses for further analysis.

•Used SQL and Python in Jupyter notebooks to find data patterns and anomalies.

•Worked with different file formats such as text files, CSV, and Parquet.

•Wrote custom compute functions using Spark SQL and performed interactive querying.

•Worked in an Agile development environment.

Environment: Data Ingestion, Data Storage, Data Analysis, Data Visualization, HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, Azure Functions, Flink, Spark, Hive, SQL API, Pipelines, Spark DataFrame.

Education Details:

Master’s in Computer Science - University of Dayton, Ohio

Bachelor’s in Computer Science - Jawaharlal Nehru Technological University, Hyderabad
