
Data Analyst Engineer

Location: Hyderabad, Telangana, India
Posted: June 26, 2023



AMULYA

Email: adxxh5@r.postjobfree.com PH: 405-***-****

Sr. Data Engineer

Professional Summary:

8+ years of overall experience as a Data Engineer, Data Analyst, and ETL developer, with expertise in designing, developing, and implementing data models for enterprise-level applications using Big Data tools and technologies such as Hadoop, Sqoop, Hive, Spark, and Flume.

Experienced in working with Azure cloud platforms (HDInsight, DataLake, Databricks, Blob Storage, Data Factory, Azure Functions, Azure SQL data warehouse, and Synapse).

Extensive professional experience across the full Software Development Life Cycle (SDLC) using Agile methodology: analysis, design, development, testing, implementation, and maintenance in Spark, Hadoop, Data Warehousing, Linux, and Java/Scala.

Extensive experience in Oracle PL/SQL, PostgreSQL PL/pgSQL, and Microsoft T-SQL.

Good experience in BI using TIBCO Spotfire Professional and TIBCO Spotfire Web Player.

Proficient in migrating on-premises data sources to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse using Azure Data Factory and granting access to the users.

Experienced in Developing Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into customer usage patterns.
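
Below is a minimal PySpark sketch of the kind of multi-format extraction, transformation, and aggregation described above; the mount paths, column names, and output table are illustrative assumptions, not the actual project code.

```python
# Minimal PySpark sketch: read multiple file formats, transform, and aggregate.
# Paths, schemas, and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Extract: pull the same logical feed from different formats.
events_json = spark.read.json("/mnt/raw/events/json/")
events_parquet = spark.read.parquet("/mnt/raw/events/parquet/")
events = events_json.unionByName(events_parquet, allowMissingColumns=True)

# Transform: basic cleansing and typing.
events = (events
          .withColumn("event_ts", F.to_timestamp("event_ts"))
          .filter(F.col("customer_id").isNotNull()))

# Aggregate: daily usage per customer to surface usage patterns.
usage = (events
         .groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
         .agg(F.count("*").alias("event_count"),
              F.countDistinct("feature").alias("features_used")))

usage.write.mode("overwrite").saveAsTable("analytics.customer_daily_usage")
```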

Experience with Spotfire Web Player Server, Spotfire Professional Server, and Spotfire Automation Services Server.

Expertise in creating reports in the Power BI preview portal utilizing SSAS Tabular via the Analysis Services connector.

Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB; deployed AWS Lambda code from Amazon S3 buckets and created a Lambda deployment function configured to receive events from an S3 bucket.
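
The following is a hedged sketch of such a Lambda handler: it assumes an S3 event trigger and a hypothetical DynamoDB table named ingested_objects, and is not the deployed function itself.

```python
# Hypothetical sketch of a Lambda handler wired to S3 event notifications.
# It records each newly uploaded object in a DynamoDB table; the table and
# attribute names are illustrative assumptions.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested_objects")  # hypothetical table name

def lambda_handler(event, context):
    # S3 event notifications deliver one or more records per invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)

        # Persist a small metadata row so downstream APIs (behind API Gateway)
        # can query what has landed.
        table.put_item(Item={"object_key": key,
                             "bucket": bucket,
                             "size_bytes": size})

    return {"statusCode": 200, "body": "processed"}
```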

Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.

Experience in publishing Power BI Desktop reports created in Report view to the Power BI service.

Experience working on AWS RDS/Aurora for Oracle/PostgreSQL databases.

Strong experience in migrating other databases to Snowflake.

Experience installing Spotfire Server, Spotfire Web Player, and Spotfire Professional; performing user management, deployment, and migration.

Experience in writing distributed Scala code for efficient big data processing.

3 years of experience in Big Data using the Hadoop and Spark frameworks and related technologies such as HDFS, HBase, MapReduce, Hive, Pig, Flume, Oozie, Sqoop, and ZooKeeper.

Experienced with Azure Data Factory (ADF), Integration Run Time (IR), File System Data Ingestion, and Relational Data Ingestion.

Have strong knowledge of Spotfire Admin roles and responsibilities and migration procedures.

Experience in creating Power BI Dashboards (Power View, Power Query, Power Pivot, Power Maps).

Experience with Snowflake Multi-Cluster and Virtual Warehouses.

Experience with the Splunk reporting system.

Experience using Snowflake Clone and Time Travel.

In-depth understanding of Snowflake cloud technology, including multi-cluster warehouse sizing and credit usage.

Participated in the development, improvement, and maintenance of Snowflake database applications.

Experience in writing unit tests and smoke tests for code modules using the ScalaTest framework.

Good experience in writing Spark applications using Java and Scala.

Developed Scala applications on Hadoop and Spark SQL for high-volume and real-time data processing.

Experienced with the use of AWS services including S3, EC2, SQS, RDS, Neptune, EMR, Kinesis, Lambda, Step Functions, Terraform, Glue, Redshift, Athena, DynamoDB, Elasticsearch, Service Catalog, CloudWatch, IAM and administering AWS resources using Console and CLI.

Hands-on experience in building the infrastructure necessary for efficient data extraction, transformation, and loading from a range of data sources using NoSQL and SQL on AWS and Big Data technologies (DynamoDB, Kinesis, S3, Hive/Spark).

Developed and deployed a variety of Lambda functions using the built-in AWS Lambda Libraries and Lambda functions written in Scala and using custom libraries.

Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).

Strong knowledge in working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.

Experienced in configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS, with expertise in using Spark SQL with various data sources like JSON, Parquet, and Hive.
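
A minimal Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker address, topic name, and HDFS paths are assumptions, and the job also requires the spark-sql-kafka package on the classpath.

```python
# Illustrative Structured Streaming sketch: consume a Kafka topic and persist
# to HDFS as Parquet. Broker addresses, topic, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load())

# Kafka delivers key/value as binary; cast the value to string for parsing.
parsed = stream.select(F.col("value").cast("string").alias("raw_event"),
                       F.col("timestamp"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/clickstream/")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
         .outputMode("append")
         .start())

query.awaitTermination()
```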

Extensively used Spark Data Frames API over the Cloudera platform to perform analytics on Hive data and used Spark Data Frame Operations to perform required Validations in the data.

Expertise in developing production-ready Spark applications utilizing Spark-Core, Data Frames, Spark-SQL, Spark-ML, and Spark-Streaming API.

Expert in using Azure Databricks to compute large volumes of data to uncover insights into business goals.

Developed Python scripts to do file validations in Databricks and automated the process using Azure Data Factory.

Solid experience in and understanding of architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.

Experienced in integrating data from diverse sources, including loading nested JSON-formatted data into Snowflake tables, using the AWS S3 bucket and the Snowflake cloud data warehouse.

Configured Snowpipe to pull the data from S3 buckets into the Snowflake table and stored incoming data in the Snowflake staging area.

Expertise in developing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow.

Orchestration experience in scheduling Apache Airflow DAGs to run multiple Hive and spark jobs, which independently run with time and data availability.
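
A short illustrative Airflow DAG of the kind described above, chaining a Hive refresh and a Spark job on a daily schedule; the DAG id, connection ids, and script paths are hypothetical.

```python
# Hedged sketch of an Airflow DAG that chains a Hive step and a Spark job on a
# daily schedule. DAG id, connection ids, and script paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_usage_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    refresh_staging = HiveOperator(
        task_id="refresh_staging",
        hql="sql/refresh_staging.hql",            # hypothetical HQL file
        hive_cli_conn_id="hive_default",
    )

    aggregate_usage = SparkSubmitOperator(
        task_id="aggregate_usage",
        application="/jobs/aggregate_usage.py",   # hypothetical PySpark job
        conn_id="spark_default",
    )

    # The Spark aggregation runs only after the Hive staging tables refresh.
    refresh_staging >> aggregate_usage
```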

Strong experience working with Hadoop ecosystem components like HDFS, Map Reduce, Spark, HBase, Oozie, Hive, Sqoop, Pig, Flume, and Kafka.

Hands-on experience with Hadoop architecture and various components such as Hadoop File System HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and Hadoop MapReduce programming.

Practical understanding of Data modeling (Dimensional & Relational) concepts like Star-Schema Modelling, Fact, and Dimension tables.

Strong experience troubleshooting failures in spark applications and fine-tuning spark applications and hive queries for better performance.

Strong experience writing complex map-reduce jobs including the development of custom Input Formats and custom Record Readers.

Good exposure to NoSQL databases such as Cassandra (column-oriented) and MongoDB (document-based).

Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice versa.

Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for Data Mining, Data Cleansing, Data Munging, and Machine Learning.

Expertise in developing various reports and dashboards using Power BI and Tableau.

Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.

Professional Experience

Chewy, Dania Beach, FL March 2022 to Present

Sr. Data Engineer

Responsibilities:

Extensively worked with Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).

Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Hands-on development assisting users in creating and modifying Spotfire visualization dashboards and getting data into Spotfire from different data sources.

Involved in end-to-end migration of 800+ objects (4 TB) from SQL Server to Snowflake.

Experience building distributed high-performance systems using Spark and Scala.

Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS.

Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.

Made Power BI reports more interactive by using storytelling features such as bookmarks, selection panes, and drill-through filters, and created custom visualizations using the R programming language.

Utilized Power BI to create various analytical dashboards that help business users get quick insights into the data.

Installed and Configured the Spotfire Statistics Services on a standalone Installation.

Created, inserted, and extracted JSON data from columns in Postgres databases.

Moved data from SQL Server to Azure, staged it in the Snowflake internal stage, and loaded it into Snowflake with COPY options.

Developed data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.

Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.

Converted 230 view queries from SQL Server for Snowflake compatibility.

Developed robust ETL pipelines in Azure Data Factory (ADF) using Linked Services from different sources and loaded them into Azure SQL Data Warehouse.

Collected plant data in Excel sheets from different regions around the world and combined them in Power BI to determine product profit and loss.

Used Spark and Scala to develop machine learning algorithms that analyze clickstream data.

Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.

Developed Spark applications in Azure Databricks using Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Worked on data migration (full load and change data capture (CDC)) to AWS Amazon Aurora/ RDS with PostgreSQL compatibility using Data Migration Services (DMS).

Built ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
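
A minimal sketch of loading staged files into Snowflake from Python with the snowflake-connector-python package; the credentials, the @etl_stage stage, and the target table are placeholders rather than the project's actual SnowSQL.

```python
# Minimal sketch: run a COPY INTO and a follow-up query against Snowflake from
# Python. Account, credentials, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="etl_user",           # placeholder
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # COPY INTO pulls files already landed in a named stage.
    cur.execute("""
        COPY INTO staging.orders
        FROM @etl_stage/orders/
        FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    # A follow-up SQL query against Snowflake, as described above.
    cur.execute("SELECT COUNT(*) FROM staging.orders")
    print(cur.fetchone()[0])
finally:
    conn.close()
```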

Implemented security using TIBCO Spotfire section access technology that dynamically hides selected areas of the dashboard depending upon user privileges.

Designed and Developed Scala workflows for data pull from cloud based systems and applying transformations on it.

Proficient in performing ETL operations in Azure Databricks by connecting to different relational database source systems using JDBC connectors.

Migrated data from Azure Blob Storage to Azure Data Lake using Azure Data Factory (ADF).

Developed robust and scalable ETL applications from Azure Data Lake to the data warehouse for Medicaid and Medicare data using Azure Databricks.

Built and automated data engineering ETL pipeline over Snowflake DB using Apache Spark and integrated data from disparate sources with Python APIs like PySpark and consolidated them in a data mart (Star schema).

Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.

Worked on PowerShell scripts to automate the creation of Azure cloud resources such as resource groups, web applications, Azure Storage Blobs and Tables, and firewall rules.

Designed and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.

Experience tuning Spark applications (batch interval time, level of parallelism, and memory) to improve processing time and efficiency.

Worked on Machine Learning Algorithms Development for analyzing click stream data using Spark and Scala.

Orchestrated Airflow to migrate data from Hive external tables to Azure Blob Storage and optimized existing Hive jobs using concepts like partitioning and bucketing.

Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and Apache Ranger for authorization.

Used Scala for its strong concurrency support, which plays a key role in parallelizing processing of large data sets.

Worked on the Snowflake Cloud Data Warehouse and integrated an automated generic Python framework to process XML, CSV, JSON, TSV, and TXT files.

Used Enterprise GitHub and Azure DevOps Repos for version control.

Created branching strategies while collaborating with peer groups and other teams on shared repositories.

Developed various interactive reports using Power BI based on Client specifications with row-level security features.

Environment: Scala, Azure (Data Lake, HDInsight, Postgres, Spotfire Professional, Spotfire Web Player, Spotfire Automation Services, Power BI Desktop, Power BI Service Pro, Snowflake, SQL, Data Factory), Databricks, Cosmos DB, Git, Blob Storage, Power BI, Hadoop, Spark, PySpark, Airflow.

Nationwide, Columbus, OH December 2019 to February 2022

Data Engineer

Responsibilities:

Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.

Knowledge of IIS and Spotfire service restarts.

Created Python scripts for migration of data from an Oracle database to a Postgres database.
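
A hedged sketch of such a migration script using cx_Oracle and psycopg2; the connection strings, table, and column names are placeholders.

```python
# Illustrative table-level copy from Oracle to Postgres.
# Connection details and the customers table/columns are assumptions.
import cx_Oracle
import psycopg2
from psycopg2.extras import execute_values

src = cx_Oracle.connect("app_user", "app_password", "ora-host:1521/ORCLPDB1")   # placeholder DSN
dst = psycopg2.connect("host=pg-host dbname=app user=etl password=secret")      # placeholder

BATCH = 10_000
ora_cur = src.cursor()
pg_cur = dst.cursor()

ora_cur.execute("SELECT id, name, created_at FROM customers")  # hypothetical table
while True:
    rows = ora_cur.fetchmany(BATCH)
    if not rows:
        break
    # execute_values expands the batch into a single multi-row INSERT.
    execute_values(pg_cur,
                   "INSERT INTO customers (id, name, created_at) VALUES %s",
                   rows)
    dst.commit()

ora_cur.close()
pg_cur.close()
src.close()
dst.close()
```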

Responsible for building scalable distributed data solutions using Hadoop.

Developed AWS data pipelines from various data sources in AWS, including AWS API Gateway to receive responses from AWS Lambda, converted the responses into JSON format, and stored them in AWS Redshift.

Worked on Power BI reports using multiple types of visualizations including line charts, doughnut charts, tables, matrix, KPI, scatter plots, box plots, etc.

Evaluated Snowflake design considerations for any change in the application.

Made Power BI reports more interactive by using storytelling features such as bookmarks, selection panes, and drill-through filters.

Developed scalable AWS Lambda code in Python to process nested JSON files (converting, comparing, sorting, etc.).
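
An illustrative pure-Python helper of the kind used in those Lambdas to flatten nested JSON so payloads can be compared and sorted; the event shape and key names are assumptions.

```python
# Illustrative only: flatten nested JSON into dotted keys inside a Lambda.
import json

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts/lists into dotted keys."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for idx, value in enumerate(obj):
            items.update(flatten(value, f"{parent_key}[{idx}]", sep))
    else:
        items[parent_key] = obj
    return items

def lambda_handler(event, context):
    # Assume the incoming event body carries the nested JSON payload.
    payload = json.loads(event.get("body", "{}"))
    flat = flatten(payload)
    # Sort keys so two payloads can be compared deterministically downstream.
    return {"statusCode": 200, "body": json.dumps(dict(sorted(flat.items())))}
```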

Developed Spark Applications by using Scala and Implemented Apache Spark data processing project to handle data from various RDBMS and Streaming sources.

Performed function overloading in the Postgres DB based on different numbers of input parameters.

Optimized the performance and efficiency of existing Spark jobs and converted MapReduce scripts to Spark SQL.

Integrated Spotfire Server with an existing LDAP environment.

Built the logical and physical data models for Snowflake as per the required changes.

Utilized Power BI to create various analytical dashboards that help business users get quick insights into the data.

Developed a big data web application in Scala using Agile methodology, leveraging Scala's capability of combining functional and object-oriented programming.

Defined virtual warehouse sizing for Snowflake for different types of workloads.

Experienced in collecting data from an AWS S3 bucket in real time using Spark Streaming, doing the appropriate transformations and aggregations, and persisting the data in HDFS.

Used Spark to process the data before ingesting it into HBase; both batch and real-time Spark jobs were created using Scala.

Implemented the AWS Glue Data Catalog with a crawler to get the data from S3 and perform SQL query operations.

Developed robust and scalable data integration pipelines to transfer data from the S3 bucket to the RedShift database using Python and AWS Glue.
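
A minimal sketch of the load step, issuing a Redshift COPY from S3 via psycopg2 rather than through a Glue job; the cluster endpoint, IAM role, bucket, and table are placeholders.

```python
# Illustrative S3-to-Redshift load: issue a COPY against the cluster from
# Python. Connection details, IAM role, and table names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="redshift-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="etl_user", password="secret",
)

copy_sql = """
    COPY analytics.orders
    FROM 's3://my-data-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    # Redshift reads the staged files directly from S3 in parallel.
    cur.execute(copy_sql)

conn.close()
```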

Built and maintained the Hadoop cluster on AWS EMR and used AWS services like EC2 and S3 for small data set processing and storage.

Used HBase as the database to store application data, as HBase offers high scalability, a distributed column-oriented NoSQL model, and real-time data querying, to name a few features.

Developed Python code for different tasks, dependencies, and time sensors for each job for workflow management and automation using the Airflow tool.

Scheduling Spark/Scala jobs using Oozie workflow in Hadoop Cluster and generated detailed design documentation for the source-to-target transformations.

Used SBT to build the Scala project.

Designed the reports and dashboards to utilize data for interactive dashboards in Tableau based on business requirements.

Environment: AWS EMR, S3, Power BI Desktop, Power BI Service, Scala, Postgres 9.x/10.x, EC2, Lambda, Snowflake, Apache Spark, Spotfire Professional, Spotfire Web Player, Spark Streaming, Spark SQL, Python, Shell scripting, AWS Glue, Oracle, Git, Tableau.

Edward Jones, St. Louis, MO April 2017 to November 2019

Big Data Engineer

Responsibilities:

Extracted data from HDFS, including customer behaviour, sales and revenue data, supply chain, and logistics data.

Transferred the data to AWS S3 using Apache Nifi, which is an open-source data integration tool that enables powerful and scalable dataflows.

Validated and cleaned the data using Python scripts before storing it in S3.
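
A small illustrative validation and cleaning step of this kind before landing data in S3; the required columns, file names, and bucket are assumptions.

```python
# Illustrative pre-load validation/cleaning with pandas, then upload to S3.
# Column names, file paths, and the bucket are placeholders.
import boto3
import pandas as pd

REQUIRED_COLUMNS = {"order_id", "customer_id", "amount", "order_date"}

def clean(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Drop duplicates and keyless rows, coerce types, normalize dates.
    df = df.drop_duplicates().dropna(subset=["order_id", "customer_id"])
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    return df.dropna(subset=["amount", "order_date"])

if __name__ == "__main__":
    cleaned = clean("sales_extract.csv")               # hypothetical input file
    cleaned.to_csv("/tmp/sales_clean.csv", index=False)
    boto3.client("s3").upload_file("/tmp/sales_clean.csv",
                                   "my-raw-bucket", "sales/sales_clean.csv")
```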

Used PySpark to process and transform the data, which is a distributed computing framework for big data processing with Python API.

Loaded the transformed data into the AWS Redshift data warehouse to analyze the data.

Scheduled the pipeline using Apache Oozie, which is a workflow scheduler system to manage Apache Hadoop jobs.

Developed and maintained a library of custom Airflow DAG templates and operators, which improved consistency and code quality across the team.

Led a team of three data engineers in designing and implementing a complex data ingestion and processing pipeline for a new data source, which reduced time to insights by 50%.

Analyzed the data in HDFS using Apache Hive, which is a data warehouse software that facilitates querying and managing large datasets.

Converted Hive queries into PySpark transformations using PySpark RDDs and the DataFrame API.
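
A short sketch of that kind of conversion: the same aggregation expressed once as a HiveQL statement run through spark.sql() and once with the DataFrame API; the table and column names are illustrative.

```python
# Illustrative Hive-to-PySpark conversion of a single aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-pyspark")
         .enableHiveSupport()
         .getOrCreate())

# Original Hive-style query, runnable through spark.sql():
hive_version = spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales.orders
    GROUP BY region
""")

# Equivalent DataFrame API version:
df_version = (spark.table("sales.orders")
              .groupBy("region")
              .agg(F.sum("revenue").alias("total_revenue")))

df_version.show()
```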

Monitored the data pipeline and applications using Grafana.

Configured Zookeeper to support distributed applications.

Used functional programming concepts and the collection framework of Scala to store and process complex data.

Used GitHub as a version control system for managing code changes.

Developed visualizations and dashboards using Tableau for reporting and business intelligence purposes.

Environment: S3 buckets, Redshift, Apache Flume, PySpark, Oozie, Tableau, Scala, Spark RDDs, Hive, HiveQL, HDFS, HQL, Zookeeper, Grafana, MapReduce, Sqoop, GitHub

Avon Technologies Pvt Ltd, Hyderabad, India October 2015 to January 2017

ETL Developer

Responsibilities:

Extensively used Informatica Client tools Power Center Designer, Workflow Manager, Workflow Monitor, and Repository Manager.

Used Kafka for live streaming data and performed analytics on it; worked on Sqoop to transfer data between relational databases and Hadoop.

Built real-time data pipelines using Kafka for streaming data ingestion and Spark Streaming for real-time consumption and processing.

Loaded data from Web servers and Teradata using Sqoop, Flume, and Spark Streaming API.

Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.

Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.

Extracted data from various heterogeneous sources like Oracle, and Flat Files.

Developed complex mapping using the Informatica Power Center tool.

Extracted data from Oracle, flat files, and Excel files, and applied Joiner, Expression, Aggregator, Lookup, Stored Procedure, Filter, Router, and Update Strategy transformations to load data into the target systems.

Worked with Data modeler in developing STAR Schemas.

Involved in analyzing the existence of the source feed in the existing CSDR database.

Handled a high volume of day-to-day Informatica workflow migrations.

Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.

Worked on SQL queries to query the Repository DB to find the deviations from Company's ETL Standards for the objects created by users such as Sources, Targets, Transformations, Log Files, Mappings, Sessions, and Workflows.

Leveraged existing PL/SQL scripts for the daily ETL operation.

Experience in ensuring that all support requests are properly approved, documented, and communicated using the QMC tool; documented common issues and resolution procedures.

Extensively involved in enhancing and managing Unix Shell Scripts.

Converted business requirements into technical design documents.

Documented the macro logic and worked closely with Business Analysts to prepare the BRD.

Involved in setting up SFTP setup with the internal bank management.

Built UNIX scripts to clean up the source files.

Involved in loading all the sample source data using SQL loader and scripts.

Maisa Solutions Private Limited Hyderabad, India May 2014 to September 2015

Data Analyst- Python

Responsibilities:

Experience working on projects with machine learning, big data, data visualization, R and Python development, Unix, and SQL.

Replaced the existing MapReduce programs and Hive Queries into Spark application using Scala.

Performed exploratory data analysis using NumPy, matplotlib, and pandas.
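
A small exploratory-analysis sketch of this kind with pandas, NumPy, and matplotlib; the input file and column names are placeholders.

```python
# Illustrative exploratory data analysis; file and columns are assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("transactions.csv", parse_dates=["txn_date"])  # hypothetical file

# Quick structural and statistical overview.
df.info()
print(df.describe(include="all"))

# Distribution of transaction amounts (log scale to handle skew).
plt.hist(np.log1p(df["amount"].dropna()), bins=50)
plt.xlabel("log(1 + amount)")
plt.ylabel("frequency")
plt.title("Transaction amount distribution")
plt.savefig("amount_distribution.png")
```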

Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.

Wrote UDFs in Scala and stored procedures to meet specific business requirements.

Experience analyzing data with the help of Python libraries including Pandas, NumPy, SciPy, and Matplotlib.

Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear and concise specifications and queries.

Prepared high-level analysis reports with Excel and Tableau; provided feedback on data quality, including identification of billing patterns and outliers.

Worked on Tableau sorting and filtering, including basic sorting, basic filters, quick filters, context filters, condition filters, top filters, and filter operations.

Identified and documented data quality limitations that jeopardize the work of internal and external data analysts; wrote standard SQL queries to perform data validation; created Excel summary reports (pivot tables and charts); and gathered analytical data to develop functional requirements using data modeling and ETL tools.

Read data from different sources like CSV files, Excel, HTML pages, and SQL, performed data analysis, and wrote the results to targets such as CSV files, Excel, or a database.

Experience using lambda functions with filter, map, and reduce on pandas DataFrames and performing various operations.

Used the pandas API for analyzing time series; created a regression test framework for new code.
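
A brief sketch combining the two points above: lambda functions applied to a pandas DataFrame plus a simple time-series resample; the data below is synthetic.

```python
# Illustrative only: lambdas with map/apply on a pandas DataFrame, and a
# weekly time-series resample. The data is synthetic.
import pandas as pd

df = pd.DataFrame({
    "txn_date": pd.date_range("2015-01-01", periods=90, freq="D"),
    "amount": range(90),
})

# map with a lambda: flag large transactions.
df["is_large"] = df["amount"].map(lambda x: x > 60)

# filter-style selection with a lambda via a boolean mask.
large = df.loc[df["amount"].apply(lambda x: x > 60)]

# Time-series analysis: weekly totals using the datetime index.
weekly = df.set_index("txn_date")["amount"].resample("W").sum()
print(large.shape, weekly.head())
```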

Developed and handled business logic through backend Python code.

Environment: Python, SQL, Scala, UNIX, Linux, Oracle, NoSQL, PostgreSQL, and Python libraries such as PySpark and NumPy.


