
Senior Data Engineer

Location:
Cincinnati, OH
Salary:
75
Posted:
February 02, 2024

Resume:

ARCHANA

Email: ad3bb8@r.postjobfree.com PH: 513-***-****

Senior Data Engineer

Professional Summary:

·8+ years of professional experience as a Data Engineer, working with Python, Spark, SQL, AWS, and MicroStrategy in the design, development, testing, and implementation of business application systems for the healthcare and education sectors.

·Extensively worked on system analysis, design, development, testing, and implementation of projects across the full SDLC; capable of handling responsibilities independently as well as working as a proactive team member.

·Designed and implemented multiple ETL processes to extract, transform, and load data from various sources into Snowflake.

·Hands-on experience in designing and implementing data engineering pipelines and analyzing data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena, Redshift, Sqoop, and Hive.

·Hands-on experience in programming using Python, Scala, Java, and SQL.

·Sound knowledge of architecture of Distributed Systems and parallel processing frameworks.

·Developed and maintained data pipelines using Snowflake's cloud data warehousing platform.

·Designed and implemented end-to-end data pipelines to extract, cleanse, process, and analyze large volumes of behavioral and log data.

·Good experience working with various data analytics services in AWS Cloud, such as EMR, Redshift, S3, Athena, and Glue.

·Proficient in utilizing Matillion's pre-built data connectors and transformations for faster and more efficient ETL pipeline configurations.

·Experienced in developing production-ready Spark applications using Spark RDD APIs, DataFrames, Spark SQL, and Spark Streaming APIs (a brief sketch follows this summary).

·Hands-on experience working with GCP services such as BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc, and Operations Suite (Stackdriver).

·Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting failures in Spark applications.

·Strong experience in using Spark Streaming, Spark SQL, and other Spark components such as accumulators, broadcast variables, various levels of caching, and optimization techniques for Spark jobs.

·Proficient in importing/exporting data between RDBMS and HDFS using Sqoop.

·Experience in setting up Hadoop clusters on cloud platforms like AWS and GCP.

·Used Hive extensively to perform various data analytics required by business teams.

·Solid experience working with various data formats such as Parquet, ORC, Avro, and JSON.

·Experience automating end-to-end data pipelines with strong resilience and recoverability.

·Experience in creating Impala views on Hive tables for fast access to data.

·Experienced in using Waterfall, Agile, and Scrum software development process frameworks.

·Good knowledge in Oracle PL/SQL and shell scripting.

·Database/ETL performance tuning: broad experience in database development, including effective use of database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes, table partitions, sub-partitions, materialized views, global temporary tables, autonomous transactions, bulk binds, and Oracle built-in functions.

·Extensively worked on Spark using Scala on clusters for analytics; installed Spark on top of Hadoop and performed advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.

·Experienced, process-oriented Data Analyst with excellent analytical, quantitative, and problem-solving skills using SQL, MicroStrategy, Advanced Excel, and Python.

·Proficient in writing unit tests using unittest/PyTest and integrating the test code with the build process.

·Used Python scripts to parse XML and JSON reports and load the information into a database.

·Experienced with version control systems such as Git, GitHub, and Bitbucket to keep code versions and configurations organized.
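
The sketch below illustrates, in minimal form, the kind of production PySpark DataFrame/Spark SQL work referenced above; the bucket, table, and column names are hypothetical placeholders rather than code from any of the projects listed here.

# Minimal PySpark sketch: cleanse raw records and produce a daily aggregate.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-claims-aggregation").getOrCreate()

# Read raw Parquet data, drop duplicates, and filter out null amounts.
claims = spark.read.parquet("s3://example-bucket/raw/claims/")
clean = (claims
         .dropDuplicates(["claim_id"])
         .filter(F.col("claim_amount").isNotNull()))

# Aggregate per day and provider, then expose the result to Spark SQL and persist it.
daily_totals = (clean
                .groupBy("claim_date", "provider_id")
                .agg(F.sum("claim_amount").alias("total_amount"),
                     F.count("*").alias("claim_count")))
daily_totals.createOrReplaceTempView("daily_claim_totals")
daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_claim_totals/")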

Technical Skills:

Big Data Eco-System

HDFS, Spark, Kafka, GPFS, Hive, Sqoop, Snowflake, YARN, Pig.

Hadoop Distributions

Hortonworks, Cloudera, IBM Big Insights.

Operating Systems

Windows, Linux (Centos, Ubuntu).

Programming Languages

Python, Scala, Shell Scripting.

Databases

Hive, MySQL, SQL Server, Oracle, IBM DB2, MS Access, Teradata, Snowflake, Postgres, Hadoop, ANSI SQL, AS400, PL/SQL, T-SQL

IDE Tools & Utilities

IntelliJ IDEA, Eclipse, PyCharm, Aginity Workbench, Git.

ETL Tools

DataStage 9.1/11.5 (Designer/Monitor/Director), Informatica PowerCenter, IICS, Informatica Data Quality 10.2.2, Informatica BDM, Talend

Reporting tools

Tableau, Power BI, SSRS, Splunk, QuickSight, Looker, and Data Studio

Data Warehousing

Snowflake, BigQuery, Redshift

Scrum Methodologies

Agile, Asana, Jira.

Job Scheduler

Control-M, IBM Symphony Platform, Ambari, Apache Airflow.

Other tools

SQL Developer, SQL Plus, Query Analyzer, MS Office, RTC, ServiceNow, Optim, IGC (InfoSphere Governance Catalog), WinSCP.

Professional Experience:

JPMorgan Chase & Co – New York, NY May 2023 to Present

Senior Data Engineer

Responsibilities:

·Involved in writing Spark applications using Python to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.

·Developed multiple POCs using PySpark and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata, and developed code to read multiple data formats on HDFS using PySpark.

·Designed, developed, and created ETL (Extract, Transform, and Load) packages using Python to load data into the data warehouse (Teradata) from source databases such as Oracle and MS SQL Server.

·Designed and customized data models for Data warehouse supporting data from multiple sources in real time. Designed data model to meet the business requirements, created tables, views, anonymous blocks, materialized views, stored procedures, packages, and functions.

·Loaded data into Spark DataFrames and performed in-memory data computation to generate the output as per the requirements.

·Optimized Snowflake performance by tuning queries, optimizing warehouse configurations, and implementing proper indexing strategies

·Worked on AWS Cloud to migrate all existing on-premises processes and databases to AWS.

·Skilled in deploying and managing Matillion instances in cloud-based environments using Amazon Web Services (AWS).

·Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into Amazon Redshift.

·Developed a PySpark job to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket (see the sketch at the end of this section).

·Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

·Developed a daily process to do incremental import of data from DB2 and Teradata into Hive tables using Sqoop.

·Analyzed the SQL scripts and designed the solution to implement using Spark.

·Worked on importing metadata into Hive using Python and migrated existing tables and the data pipeline from Legacy to AWS cloud (S3) environment and wrote Lambda functions to run the data pipeline in the cloud.

·Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

·Extensively worked with Partitions, Dynamic Partitioning, bucketing tables in Hive, designed both Managed and External tables, also worked on optimization of Hive queries.

·Ability to automate ETL processes using Matillion's job scheduler and orchestration features to streamline data pipelines.

·Conducted data modeling and design to ensure efficient and scalable data storage within Snowflake's architecture

·Used Python's built-in json module to parse member data in JSON format with json.loads/json.dumps and load it into a database for reporting.

·Used the Pandas API to arrange data in time-series and tabular formats for timestamp-based data manipulation and retrieval during various loads into the DataMart.

·Used Python libraries and SQL queries/subqueries to create several datasets producing statistics, tables, figures, charts, and graphs; good experience with software development using IDEs such as PyCharm and Jupyter Notebook.

·Performed data extraction and manipulation over large relational datasets using SQL, Python, and other analytical tools.

·Extensively worked with Teradata utilities such as BTEQ, FastExport, FastLoad, and MultiLoad to export and load Claims & Callers data to/from different source systems, including flat files.
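
As referenced in the S3 loading bullet above, the following is a hedged sketch of a PySpark CSV load plus basic boto3 bucket management; the bucket names, prefixes, and partition column are hypothetical, and AWS credentials are assumed to come from the standard environment or instance profile.

# Hypothetical sketch: read landed CSV files with PySpark, write curated Parquet
# to S3, then inspect objects under a prefix with boto3.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-s3-load").getOrCreate()

# Read incoming CSV files with a header row and inferred types.
members = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("s3://example-landing-bucket/incoming/members/"))

# Write the curated output partitioned by a (hypothetical) load_date column.
(members.write
 .mode("append")
 .partitionBy("load_date")
 .parquet("s3://example-curated-bucket/members/"))

# Basic "folder" (prefix) management: list the objects written under the prefix.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="example-curated-bucket", Prefix="members/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])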

Environment: AWS EMR, AWS Glue, Redshift, Hadoop, HDFS, Matillion, Snowflake, Teradata, ETL, SQL, Oracle, Hive, Spark, Python, Sqoop, MicroStrategy, Excel.

Synchrony Financial - Stamford, CT December 2021 to April 2023

Data Engineer

Responsibilities:

·Developed Hive scripts, Hive UDFs, and Python scripts, and used Spark (Spark SQL, Spark shell) to process data on Hortonworks.

·Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of PySpark.

·Designed and Developed Scala code for data pull from cloud-based systems and applying transformations on it.

·Used Sqoop to import data into HDFS from the MySQL database and vice versa.

·Applied SQL scripts and various AWS resources such as Lambda, Step Function, SNS, and S3 to efficiently automate and streamline data migration.

·Implemented optimized joins to perform analysis on different data sets using PySpark programs.

·Experience in building real-time data pipelines using Matillion and other tools like Apache Kafka and Apache Nifi.

·Created continuous integration and continuous delivery (CI/CD) pipeline on AWS that helps to automate steps in software delivery process.

·Implemented Partitioning, Dynamic Partitions and Buckets in HIVE & Impala for efficient data access.

·Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.

·Extensively worked on HiveQL and join operations, wrote custom UDFs, and have good experience optimizing Hive queries.

·Experienced in running queries using Impala and used BI and reporting tools (Tableau) to run ad-hoc queries directly on Hadoop.

·Worked on Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, to run Hive jobs.

·Experience in using the Spark framework with Scala and Python; good exposure to performance tuning of Hive queries and MapReduce jobs in the Spark (Spark SQL) framework on Hortonworks.

·Proficient in designing and implementing ETL processes using Matillion ETL tool including data extraction, transformation, and loading into various data storage systems.

·Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark SQL for data aggregation and queries, and wrote data back into RDBMS through Sqoop.

·Configured Spark Streaming receivers to consume Kafka input streams and specified the exact block interval for processing data into HDFS using Scala (see the sketch at the end of this section).

·Collected data using Spark Streaming and loaded it into HBase and Cassandra.

·Used the Spark-Cassandra Connector to load data to and from Cassandra.

·Collected and aggregated large amounts of log data using Kafka and staged the data in an HDFS data lake for further analysis.

·Used Hive to analyze data ingested into HBase by using Hive-HBase integration and HBase Filters to compute various metrics for reporting on the dashboard.

·Created and defined job workflows in Oozie according to their dependencies, set up e-mail notifications on job completion for the teams that requested the data, and monitored jobs using Oozie on Hortonworks.

·Developed proprietary analytical tools and reports in Tableau; created various types of reports (charts, gauges, tables, matrix, parameterized, dashboards, linked, sub-reports, drill down/through).
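
The sketch below illustrates the Kafka-to-HDFS ingestion described above using PySpark Structured Streaming; the original work used Scala receiver-based DStreams, so this is only an analogous example, and the broker, topic, and paths are hypothetical.

# Hypothetical Structured Streaming sketch: consume a Kafka topic and land it in HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream-events")
          .load())

# Kafka delivers the payload as binary; cast it to string before parsing downstream.
parsed = events.select(F.col("value").cast("string").alias("raw_event"),
                       F.col("timestamp"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/clickstream/")
         .option("checkpointLocation", "hdfs:///data/checkpoints/clickstream/")
         .trigger(processingTime="30 seconds")
         .start())
query.awaitTermination()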

Environment: Hadoop (Cloudera), HDFS, PySpark, Hive, Matillion, Python, Pig, Sqoop, AWS, Azure, DB2, UNIX Shell Scripting, JDBC.

Pfizer - New York, NY July 2019 to November 2021

Data Engineer

Responsibilities:

·As a Data/Hadoop Developer, worked on Hadoop ecosystem components including Hive, MongoDB, Zookeeper, and Spark Streaming with the MapR distribution.

·Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.

·Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.

·Involved in designing the HBase row key to store text and JSON as key values in the HBase table, and designed the row key so that data can be retrieved/scanned in sorted order.

·Incorporated Google Cloud Platform (GCP) for enhanced data processing. Utilized GCP services like BigQuery, Cloud Storage, and Cloud Dataflow for efficient analytics and storage in addition to existing Hadoop ecosystems.

·Used Kibana, an open-source, browser-based analytics and search dashboard for Elasticsearch.

·Maintained Hadoop, Hadoop ecosystem components, and databases with updates/upgrades, performance tuning, and monitoring.

·Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

·Prepared data analytics processing and data egress to make analytics results available to visualization systems, applications, and external data stores.

·Built large-scale data processing systems for data warehousing solutions and worked with unstructured data mining on NoSQL databases.

·Responsible for design and development of Spark SQL Scripts based on Functional Specifications.

·Used AWS services like EC2 and S3 for small data sets processing and storage.

·Provisioned Cloudera Director AWS instances and added the Cloudera Manager repository to scale up the Hadoop cluster in AWS.

·Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (illustrated in the sketch below).
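
The sketch below illustrates the Hive-to-Spark conversion mentioned in the last bullet; it is a PySpark DataFrame analogue rather than the project's actual Scala/RDD code, and the table and column names are hypothetical.

# Hypothetical example: a HiveQL aggregation rewritten as Spark DataFrame transformations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# Hive version:
#   SELECT patient_id, COUNT(*) AS visit_count
#   FROM clinical.visits
#   WHERE visit_date >= '2020-01-01'
#   GROUP BY patient_id;
visits = spark.table("clinical.visits")
visit_counts = (visits
                .filter(F.col("visit_date") >= "2020-01-01")
                .groupBy("patient_id")
                .agg(F.count("*").alias("visit_count")))

visit_counts.write.mode("overwrite").saveAsTable("clinical.patient_visit_counts")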

Environment: Hadoop 3.0, Spark 2.3, MapReduce, Java, MongoDB, HBase 1.2, JSON, Hive 2.3, Zookeeper 3.4, AWS, MySQL

Verizon - Dallas, TX April 2017 to June 2019

Data Engineer

Responsibilities:

·Participated in designing database schemas to ensure that relationships between data are enforced by tightly bound key constraints.

·Wrote PL/SQL stored procedures, functions, packages, triggers, and views to implement business rules at the application level.

·Extensive experience with Data Definition, Data Manipulation, Data Query, and Transaction Control Language.

·Understood requirements by interacting with business users, mapped them to the design, and implemented them following the Agile development methodology.

·Experience in installing, upgrading, and configuring Microsoft SQL Server and migrating data from SQL Server 2008 to SQL Server 2012.

·Experience in designing and creating tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.

·Excellent in learning and adapting to new technologies.

·Developed warm-up programs to load recently logged-in user profile information into the MS SQL database.

·Performed manual testing and used logging tools such as Splunk and PuTTY to read application logs for Elasticsearch.

·Configured the data mapping between Oracle and SQL Server and tested performance accuracy related queries under SQL Server.

·Created connections to database using Hibernate Session Factory, used Hibernate APIs to retrieve and store data with Hibernate transaction control.

·Experienced in writing JUnit test cases for testing.

·Helped in creating Splunk dashboard to monitor MDB modified in the project.

Environment: SQL Developer, Hibernate, Restful Web Services, Agile Methodology, UNIX, Oracle, TOMCAT, Eclipse, Jenkins, CVS, JSON, Oracle PL/SQL

Zensar Technologies - Bangalore, India October 2014 to December 2016

ETL Developer

Responsibilities:

Project: Data Warehouse Migration and Machine Learning in Healthcare

·Migrated data to the Snowflake Data Warehouse, implemented machine learning models for healthcare analytics, and enhanced data processing capabilities.

·Migrated data from S3 bucket to Snowflake Data Warehouse.

·Converted database design and loaded scripts from Teradata to Vertica.

·Utilized Informatica Workflow Manager and Workflow Monitor for creating and monitoring workflows, worklets, and sessions.

·Built various machine learning models for healthcare analytics, including Linear Regression, Logistic Regression, Support Vector Machines, and Decision Trees.

·Applied Natural Language Processing (NLP) techniques and content classification using tokenization, POS tagging, and NLP toolkits such as NLTK, Gensim, and SpaCy.

·Implemented text mining, with expertise in NLP techniques (BOW, TF-IDF, Word2Vec).

·Explored deep learning concepts with Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks.

·Conducted model tuning using Grid Search, Random Search, and K-Fold cross-validation (see the sketch at the end of this project).

·Applied data science concepts for data visualization and machine learning using Python.
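
A minimal scikit-learn sketch of the model-tuning approach above (grid search with k-fold cross-validation); the dataset is synthetic and the pipeline, parameter grid, and scoring metric are illustrative assumptions, not project code.

# Hypothetical tuning sketch: logistic regression tuned with GridSearchCV and 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (confidential) healthcare dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)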

Project: Data Warehousing and ETL Automation for Healthcare Analytics

·Designed and implemented data warehousing solutions, automated ETL processes, and enhanced analytics capabilities in healthcare data processing.

·Managed data staging for seamless migration to the Snowflake Data Warehouse, involving extraction, cleansing, and transformation (see the loading sketch at the end of this project). Applied machine learning models (Linear Regression, Logistic Regression, SVM, Decision Trees) and NLP techniques (tokenization, POS tagging) for healthcare analytics.

·Created a database design and executed scripts for data loading from various sources into the data warehouse.

·Used Informatica for workflow automation and monitoring in ETL processes.

·Developed machine learning models using Python for analytics, including regression and classification models.

·Applied data science concepts such as data visualization and machine learning to derive insights from healthcare data.

·Worked with SQL Server Stored Procedures and loaded data into Data Warehouse/Data Marts using Informatica, SQL Loader, Export/Import utilities.

·Collaborated with healthcare analysts to understand data requirements and optimize data processing workflows.

·Ensured compliance with healthcare data regulations and security measures.

·Provided expertise in Linux, Apache Hadoop Framework, and other relevant technologies for effective data processing.
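
The sketch below illustrates the Snowflake staging/loading step referenced above using the snowflake-connector-python package; the account, credentials, stage, table, and file format are hypothetical placeholders, not the project's actual configuration.

# Hypothetical sketch: copy staged S3 files into a Snowflake staging table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="example_password",
    warehouse="LOAD_WH",
    database="HEALTHCARE_DW",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # COPY INTO pulls cleansed CSV files from an external S3 stage into the table.
    cur.execute("""
        COPY INTO STAGING.CLAIMS_RAW
        FROM @HEALTHCARE_S3_STAGE/claims/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    for row in cur.fetchall():
        print(row)  # per-file load status returned by COPY INTO
finally:
    conn.close()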

Technologies Used: Linux, Apache Hadoop Framework, HDFS, Hive, HBase, Informatica, SQL Server, Machine Learning, Data Science (Python)

Environment: Hadoop, HDFS, AWS, Vertica, Bash, Scala, Snowflake, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Python, Java, NiFi, HBase, MySQL, Kerberos, Maven, Shell Scripting, SQL.

Educational Summary:

Bachelor of Technology in Computer Science and Engineering, VNR Vignana Jyoti Institute of Engineering and Technology (2009-2013), Hyderabad, India.


