SPANDANA R
CONTACT: +1-980-***-****
EMAIL: ************@*****.***
Professional Summary:
9+ years of IT experience in analysis, design, development, testing, delivery, and production support using Python, along with Big Data experience across the Hadoop ecosystem (Hive, Pig, Flume, Sqoop, HBase, Spark, Kafka) and the AWS and Azure cloud platforms.
Hands-on experience in Data Modeling, Dimensional Modeling, implementation, and support of various applications in OLTP and Data Warehousing.
Experience with Snowflake multi-cluster warehouses.
Strong experience in migrating from other databases to Snowflake.
Experience in building Snowpipe pipelines and using Snowflake cloning and Time Travel.
Strong Snowflake experience in SQL development and data analysis for building new, complex data warehouses.
Hands-on experience with AWS (S3, EMR, Glue) and working knowledge of Microsoft Azure (ADF, ADLS).
Experience working with the Hortonworks and Cloudera Hadoop distributions.
Experience in importing and exporting data using Sqoop between HDFS and relational database systems.
Worked on data processing, transformations, and actions in Spark using Python (PySpark).
Experience in creating real-time data streaming solutions using Spark Core, Spark SQL, Spark Streaming, Kafka, and Apache Storm.
Worked with several Python packages such as NumPy, SciPy, and PyTables.
Developed UDFs in Spark using DataFrames/SQL/Datasets and RDDs for data aggregation and queries, writing data back into OLTP systems through Sqoop.
Experience in extraction, transformation, and loading (ETL) of data from multiple sources such as flat files and databases, and in integrating popular NoSQL databases for large data volumes.
Expertise in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
Hands-on experience using Hive tables from Spark, performing transformations and creating DataFrames on Hive tables with Spark SQL.
Experience in converting Hive/SQL queries into RDD transformations using Spark, Scala, and PySpark.
Experience in developing applications on different platforms such as Windows and Linux.
Experienced with Complex Source to Target Data Mappings.
Strong understanding and hands-on experience in database design, development, and administration.
Proficient in writing complex SQL queries and stored procedures and in optimizing database performance; familiar with NoSQL databases such as MongoDB, Cassandra, and DynamoDB.
Transformed and analyzed data using PySpark and Hive based on ETL mappings.
Good knowledge of scripting, including Linux/Unix shell scripting.
Strong in data warehousing concepts, star schema and snowflake schema methodologies, and understanding business processes and requirements.
Good understanding of Spark architecture with Databricks and Structured Streaming; experienced in setting up Databricks on AWS and Microsoft Azure, using Databricks workspaces for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle.
Good exposure to CI/CD tools such as Jenkins, GitLab pipelines, Docker, and Kubernetes.
Experience in designing and developing POCs deployed on YARN clusters and comparing the performance of Spark with Hive and SQL/Oracle.
Experienced in using IDEs and Tools like Eclipse, GitHub, Jenkins, Maven and IntelliJ.
Well versed with different SDLC methodologies like Agile and waterfall models.
Experience with Amazon AWS services such as EMR, S3, EC2, Redshift, CloudWatch, and Athena.
Expertise in Oozie for configuring, scheduling, automating, and managing job workflows based on time-driven and data-driven triggers.
Strong organizational skills with the ability to work with individuals as well as teams from different backgrounds.
Technical Skills:
Big Data Technologies: Hadoop, MapReduce, HDFS, Hive, Sqoop, Apache Spark, YARN, Apache Kafka
Programming Languages: Python, Java, SQL, PL/SQL, UML
Web Servers: WebLogic, WebSphere, Apache Tomcat
Cloud Platforms: AWS (EMR, EC2, S3, AppFlow, CloudWatch, Redshift), Microsoft Azure (ADF, ADLS, ADB), Snowflake
Scripting Languages: Shell Scripting, JavaScript
Databases: Oracle, SQL Server, MySQL, Teradata, NoSQL, Snowflake, MarkLogic
Version Control Systems: SVN, GitHub
Data Modelling Tools: Erwin
ETL Tools: Informatica (IICS & PowerCenter), Glue, ADF
Reporting Tools: Tableau
Professional Experience:
Heartland Payment Systems April 2022 - Till Date
Role: Sr Data Engineer
Responsibilities:
·Involved in designing and developing applications using PySpark and Hadoop technologies to read CSV files and dynamically create Hive tables (see the illustrative sketch after this project's environment line).
·Extracted data from different RDBMS sources and source systems.
·Designed applications to source data from RDBMS systems dynamically, retrieving credentials and landing the data in HDFS.
·Using ER/Studio, developed logical and physical data models that capture current-state and future-state data elements and data flows.
·Used Erwin for reverse engineering to connect to existing database and ODS to create graphical representation in the form of entity relationships and elicit more information.
·Designed and implemented complex data workflows using Apache Airflow, creating efficient Directed Acyclic Graphs (DAGs) to automate ETL processes and data transformations.
·Used forward engineering to create a physical data model with DDL that best suits the requirements from the logical data model.
·Participated in daily standups and Scrum calls and collaborated with Product Owners, Developers and other cross functional teams about project dependencies and backlogs.
·Integrated Airflow with cloud platforms such as AWS, leveraging services like S3 and EC2 for scalable and automated data processing. Implemented horizontal and vertical scaling strategies to handle large datasets.
·Experienced in using Snowflake utilities such as SnowSQL and Snowpipe with Python.
·Implemented Spark using PySpark and Spark SQL for faster testing and processing of data.
·Implemented robust monitoring and error handling mechanisms within Airflow, ensuring fault tolerance and reliability. Optimized workflows for performance, resource utilization, and timely delivery of data insights.
·Developed Spark and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
·Experienced in using Python programming and SQL queries to extract, transform, and load data from diverse sources into CSV data files.
·Extensively worked on creating a data pipeline integrating Kafka with a Spark Streaming application, using Scala for writing the applications.
·Utilized Apache Spark with Python to develop and execute Big Data Analytics.
·Implemented error handling and data validation techniques to ensure data accuracy and completeness in AWS AppFlow pipelines.
·Collaborated with cross-functional teams to gather requirements and design data integration solutions using AWS AppFlow.
·Developed Python and shell scripts to run Spark jobs for batch processing of data from various sources.
·Working knowledge of Amazon’s Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) for object storage.
·Imported data from AWS S3 into Spark DataFrames.
·Optimized CloudWatch configurations to improve performance, reduce costs, and align with best practices.
·Performed transformations and actions on DataFrames.
·Implemented performance tuning strategies for on-premise Hadoop clusters, optimizing MapReduce jobs and Spark applications to achieve significant improvements in data processing speed.
·Conducted regular monitoring and analysis of cluster health, identifying and resolving performance bottlenecks to enhance overall system efficiency.
·Created AWS Lambda functions using Python for deployment management in AWS; designed, investigated, and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.
·Extracted data from AWS Redshift using Spark and performed data analysis.
·Created Data Quality Scripts using SQL and Hive to validate successful data loading and quality of the data.
·Strong experience in Data Migration from RDBMS to Snowflake Cloud data warehouse.
·Dimensional modeling experience using Snowflake schema methodologies of Data Warehouse and integration projects.
·Experienced with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables (see the illustrative sketch after this project's environment line).
·Wrote shell scripts to create wrappers for executing Sqoop commands dynamically.
·Successfully designed, deployed, and managed an on-premise Hadoop cluster infrastructure to support data processing and analytics needs.
·Configured and optimized the Hadoop ecosystem components, including HDFS, YARN, and MapReduce, ensuring high availability, scalability, and efficient resource allocation.
·Participated in data migration from RDBMS to NoSQL database, gaining a comprehensive view of data deployed across multiple data systems.
·Involved in MarkLogic XQuery unit tests, and MarkLogic Administration.
·Assisted the client in addressing daily problems/issues of any scope.
·Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
·Prepared test data and executed the detailed test plans. Completed any required debugging.
·Created dashboards with interactive views, trends, and drill-downs in Tableau.
Environment: AWS, Snowflake, Hadoop, Hive, HDFS, Spark, Spark-SQL, Sqoop, EC2, S3, IAM, VPC, Athena, Glue, Data Catalog, CloudTrail.
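A minimal PySpark sketch of the CSV-to-Hive pattern referenced above; the paths, database, and table names are hypothetical placeholders, not the project's actual objects.

# Illustrative sketch only: paths, database, and table names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv_to_hive")
    .enableHiveSupport()   # required so saveAsTable targets the Hive metastore
    .getOrCreate()
)

def load_csv_to_hive(csv_path: str, table_name: str, database: str = "staging"):
    """Read a CSV file with an inferred schema and register it as a Hive table."""
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(csv_path)
    )
    # Create the target database and (re)create the Hive table from the DataFrame schema.
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
    df.write.mode("overwrite").saveAsTable(f"{database}.{table_name}")
    return df

# Example usage with a hypothetical HDFS landing directory.
load_csv_to_hive("hdfs:///landing/payments/2022-04-01.csv", "payments_raw")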
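A hedged sketch of loading nested JSON from an S3 stage into a Snowflake table with the Snowflake Python connector, as referenced above; the account, credentials, stage, and table names are placeholders, not the project's actual objects.

# Illustrative sketch only: account, stage, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # placeholder account identifier
    user="ETL_USER",
    password="********",           # placeholder credential
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    # Target table holds the nested JSON in a VARIANT column.
    cur.execute("CREATE TABLE IF NOT EXISTS RAW_EVENTS (payload VARIANT)")
    # Copy JSON files from an external S3 stage; Snowflake parses each file as JSON.
    cur.execute("""
        COPY INTO RAW_EVENTS
        FROM @S3_EVENTS_STAGE/events/
        FILE_FORMAT = (TYPE = 'JSON')
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()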
Unify Technologies, India July 2017 - Feb 2022
Role: Data Engineer
Responsibilities:
Involved in data architecture and application design on Linux using cloud and Big Data solutions.
Worked on migrating a legacy system to a Microsoft Azure cloud-based solution, re-designing the legacy application with minimal changes to run on the cloud platform.
Designed and developed ETL pipelines using Databricks, automating data ingestion, transformation, and loading and reducing data processing time.
Built data pipelines using Azure services such as Data Factory to load data from the legacy SQL Server into Azure databases, using Data Factories, API Gateway services, SSIS packages, Talend jobs, and custom .NET and Python code.
Built Azure WebJobs for product management teams to connect to different APIs and sources, extracting the data and loading it into Azure Data Warehouse using Azure WebJobs and Functions.
Worked on setting up the Hadoop and Spark cluster for the various POCs, specifically to load the Cookie level data and real-time streaming.
Utilized Databricks Delta Lake to improve data quality and reliability, leading to a decrease in data-related issues and enhancing overall data governance (see the illustrative sketch after this project's skills summary).
Established a Spark cluster for processing over 2 TB of data and subsequently loaded it into SQL Server; in addition, built various Spark jobs to run data transformations and actions.
Wrote APIs to connect with different media data feeds such as Prisma, DoubleClick Management, Twitter, Facebook, and Instagram, retrieving the data using Azure WebJobs and Functions integrated with Cosmos DB.
Built a trigger-based mechanism to reduce the cost of resources such as WebJobs and Data Factories using Azure Logic Apps and Functions.
Extensively worked on relational databases such as PostgreSQL as well as MPP databases like Redshift.
Experience in custom process design of transformations via Azure Data Factory and automation pipelines.
Extensively used Azure services such as Azure Data Factory and Logic Apps for ETL, moving data between databases, Blob Storage, HDInsight HDFS, and Hive tables.
Developed JSON definitions for deploying pipelines in Azure Data Factory.
Hands-on experience on developing SQL Scripts for automation purposes.
Worked on installing the Applications Insights tool on the web services and configuring an Application Insight workspace in Azure. Configured Application Insights to perform web tests and alerts.
Configured continuous integration from source control, setting up build definition within Visual Studio Team Services (VSTS) and configured continuous delivery to automate the deployment of ASP .NET MVC applications to Azure web apps.
Worked on modifying certificates, passwords, and storage accounts on the cloud platform; set up and administered service accounts.
Tech Environment/Skills: Microsoft Azure, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Power BI, JavaScript, ASP.NET, C#.NET, Microsoft SQL Server.
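A minimal sketch of the Delta Lake write pattern mentioned above, assuming a Databricks cluster where Delta Lake is available; the storage paths and column names are hypothetical.

# Illustrative sketch only: assumes a Databricks runtime with Delta Lake;
# storage paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta_ingest").getOrCreate()

# Read raw JSON landed by an upstream ADF pipeline.
raw = spark.read.json("abfss://landing@storageacct.dfs.core.windows.net/media_feeds/")

# Basic cleansing before persisting to the curated zone.
curated = (
    raw.dropDuplicates(["event_id"])
       .withColumn("ingest_date", F.current_date())
)

# Write as a Delta table, partitioned by ingest date, so downstream jobs get
# ACID guarantees and schema enforcement.
(
    curated.write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_date")
    .save("abfss://curated@storageacct.dfs.core.windows.net/media_feeds_delta/")
)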
Mindtree, India Jun 2013 - July 2017
Role: Data Engineer
Responsibilities:
Created and implemented a test environment on Azure.
Migrated SQL Server and Oracle database to Microsoft Azure Cloud.
Analyzed, designed and built modern data solutions using Azure PaaS service to support visualization of data. Analyzed the existing production state of the application and assessed the potential impact of new implementations on current business processes.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Experienced in converting SSIS jobs to IICS jobs with the help of BRD documents and Visio flowcharts.
Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
Implemented REST APIs using Python and Django framework.
Experienced in utilizing Python libraries such as Scikit-Learn, Pandas, NumPy for carrying out preprocessing operations such as data cleaning, correlation analysis, visualization, feature scaling, dimension reduction techniques etc.
Knowledge in creating machine learning models with algorithms such as Linear Regression, Logistic Regression, Naïve Bayes, Support Vector Machines (SVM), Decision Trees, kNN, K-Means Clustering, and Ensemble Methods.
Developed web-based applications using Python, Django, XML, CSS, HTML.
Designed a Data Quality Framework to perform schema validation and data profiling on Spark (PySpark) (see the first sketch after this project's environment line).
Wrote Python scripts to parse JSON documents and load the data into a database (see the second sketch after this project's environment line).
Implemented web applications in the Flask framework following MVC architecture.
Leveraged Spark (PySpark) to manipulate unstructured data and applied text mining on users' table utilization data.
Successfully migrated the Django database from SQLite to MySQL to PostgreSQL with complete data integrity.
Implemented monitoring and established best practices around using Elasticsearch.
Followed AGILE development methodology to develop the application.
Used Test driven approach (TDD) for developing services required for the application.
Used Git for version control while resolving Python and portlet coding tasks.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Azure Blob Storage, SQL Server, Python, Django 1.9, PHP7, Perl, HTML5, CSS3, JavaScript, jQuery, MySQL, AWS, MongoDB, AngularJS, JIRA, RabbitMQ, Selenium, Web Services, Jenkins, Git, Linux.
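A minimal sketch of the schema-validation and profiling checks described in the Data Quality Framework bullet above; the expected schema and input path are hypothetical assumptions.

# Illustrative sketch only: the expected schema and input path are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("data_quality_checks").getOrCreate()

expected_schema = StructType([
    StructField("customer_id", StringType(), False),
    StructField("order_total", DoubleType(), True),
])

df = spark.read.parquet("abfss://raw@storageacct.dfs.core.windows.net/orders/")

# Schema validation: fail fast if required columns are missing.
missing = set(f.name for f in expected_schema.fields) - set(df.columns)
if missing:
    raise ValueError(f"Schema validation failed, missing columns: {missing}")

# Simple profiling: row count and null counts per expected column.
profile = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(f"{c}_nulls")
     for c in (f.name for f in expected_schema.fields)]
)
print("row count:", df.count())
profile.show()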
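A minimal sketch of parsing JSON documents and loading them into a database, as mentioned above; sqlite3 stands in for the project's actual database, and the file layout and field names are assumptions.

# Illustrative sketch only: sqlite3 stands in for the project's actual database;
# the file path and field names are hypothetical.
import json
import sqlite3

def load_json_documents(json_path: str, db_path: str = "staging.db") -> int:
    """Parse a file of JSON documents (one object per line) and load them into a table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (doc_id TEXT PRIMARY KEY, payload TEXT)"
    )
    rows = 0
    with open(json_path, encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            conn.execute(
                "INSERT OR REPLACE INTO documents (doc_id, payload) VALUES (?, ?)",
                (str(doc.get("id")), json.dumps(doc)),
            )
            rows += 1
    conn.commit()
    conn.close()
    return rows

# Example usage with a hypothetical export file.
print(load_json_documents("exports/users.jsonl"))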
CERTIFICATIONS:
SNOWPRO® CORE CERTIFIED.
Education:
Bachelor of Technology- Vignan Institute of Technology and Science (JNTUH), Hyderabad, India.
Completed and passed “Python for Data Science”, UCSanDiegoX.