Data Engineer Scientist

Location:
Illinois, IL
Salary:
70
Posted:
January 27, 2023

Resume:

PRIYA N

Data Engineer

Email – aduy3c@r.postjobfree.com Contact – 469-***-****

SUMMARY:

7+ years of experience as a Data Engineer with a strong focus on analytical programming: data analysis and data visualization, ETL/ELT processes, AWS/Azure cloud technologies, and the Hadoop ecosystem; experienced in data migration projects while serving clients in domains such as IT Services, Pharmaceutical Manufacturing, Healthcare, and Airlines.

Experienced in development, design, analysis & implementation using SQL, Hadoop Ecosystem, PySpark and Python.

Results-oriented, proactive Data Engineer with leadership and problem-solving abilities; adapted to the constraints of tight release dates and delivered successfully.

Professionally qualified in Data Science and Analytics including AI, Machine Learning, Data Mining and Statistical Analysis.

Involved in the entire data science project life cycle, including data extraction, data cleaning, statistical modelling, and data visualization with large sets of structured and unstructured data.

Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed-systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

Experience migrating Teradata to AWS Redshift, SAP to AWS Redshift, and on-premises Hadoop databases to AWS Redshift using AWS DataSync.

Experience in developing Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats, analyzing the data to uncover insights into customer usage patterns.
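
Illustrative PySpark sketch of this kind of Spark SQL extraction and aggregation in Databricks; the paths, table, and column names (usage_events, customer_id, event_ts, duration_seconds) are assumptions for the example, not actual project objects.

    from pyspark.sql import SparkSession

    # On Databricks a SparkSession is already provided as `spark`; building one
    # here keeps the sketch self-contained.
    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Hypothetical raw extract landed in cloud storage.
    events = spark.read.parquet("/mnt/raw/usage_events/")
    events.createOrReplaceTempView("usage_events")

    # Spark SQL aggregation summarizing customer usage patterns.
    daily_usage = spark.sql("""
        SELECT customer_id,
               to_date(event_ts)     AS event_date,
               count(*)              AS event_count,
               sum(duration_seconds) AS total_duration
        FROM usage_events
        GROUP BY customer_id, to_date(event_ts)
    """)

    daily_usage.write.mode("overwrite").parquet("/mnt/curated/daily_usage/")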

Experience in developing web applications using Python, Django, C++, XML, CSS, HTML, JavaScript, and jQuery.

Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.

Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.

Implemented security requirements for Hadoop and integrated it with the Kerberos authentication infrastructure: KDC server setup, realm/domain creation, and ongoing administration.

Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create a SQL layer on HBase.

Experience in designing and creating RDBMS Tables, Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions.

Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle/Access/Excel sheets using SQL Server SSIS.

Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.

Created and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications. Worked with automation tools such as Git, Terraform, and Ansible.

Experienced with JSON-based RESTful web services and XML/QML-based SOAP web services; worked on various applications using Python-integrated IDEs such as Sublime Text and PyCharm.

Developed web-based applications using Python, DJANGO, QT, C++, XML, CSS3, HTML5, DHTML, JavaScript and jQuery.

Highly skilled in deployment, data security and troubleshooting of the applications using AWS services.

Proficient in Shell Scripting and Bash Scripting.

Strong experience in working with python editors like PyCharm, Spyder and Jupyter notebook.

Expertise in using functional programming tools and writing scripts across operating systems (Mac, Linux, and Windows) using Terminal, Bash, and PowerShell.

Created data stories spanning acquisition, processing, and visualization through to applying machine learning techniques to derive meaningful outcomes.

Set up clusters in Amazon EC2 and S3, including automation of cluster setup and scaling in AWS.

Experienced in real time analytics with Apache Spark.

TECHNICAL SKILLS:

Languages

Python, SQL, PL/SQL, Shell scripts, JAVA/J2EE, Scala

Cloud Tools

AWS Glue, S3, Redshift Spectrum, Kinesis, EC2, EMR, DynamoDB, Data Lake, Athena, AWS Data Pipeline, AWS Lambda, CloudWatch, SNS, SQS.

Web Frameworks

Django, Flask, web2py, Pyramid, Kedro, DBT, Hadoop MapReduce, Informatica PowerCenter 10.2, IICS

Databases

MySQL, PostgreSQL, SQL Server, DB2, AWS Redshift, NoSQL (MongoDB), HDFS, Hive

Web services Frameworks

Django REST Framework, Flask-RESTful, Django Tastypie, REST, SOAP

Python Libraries

NumPy, Pandas, Matplotlib, Urllib2

Testing frameworks

JUnit, pytest, unit testing

Methodologies

Agile, ETL/ELT

Version Controls

GIT, GitHub, SVN

Big Data Tools

Hadoop Ecosystem, Apache Spark, MapReduce, PySpark, Hive, YARN, Kafka, Flume, Oozie, Airflow, Zookeeper, Sqoop, HBase.

Operating systems used

Windows, Linux, Unix, Mac OS, Ubuntu

EDUCATION:

Master of Science in Computer Science and Engineering, University of South Florida, Tampa, FL

Bachelor of Technology in Computer Science and Engineering, Guru Ghasidas University, INDIA.

PROFESSIONAL EXPERIENCE:

Client : United Airlines

Location : Chicago, IL

Role : Data Engineer

Duration : Dec 2021 - Present

Responsibilities:

Designing and building a multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day.

Implementing and Managing ETL solutions and automating operational processes.

Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
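
As an illustration of the kind of Redshift tuning described above, a minimal sketch of applying distribution and sort keys through DDL from Python; the cluster endpoint, credentials, table, and columns are placeholders, not the actual Confidential schema.

    import psycopg2  # Redshift speaks the PostgreSQL wire protocol

    conn = psycopg2.connect(
        host="example-cluster.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="dw", user="etl_user", password="***",
    )

    ddl = """
    CREATE TABLE IF NOT EXISTS fact_bookings (
        booking_id  BIGINT,
        customer_id BIGINT,
        flight_date DATE,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)   -- co-locate rows that are joined on customer_id
    SORTKEY (flight_date);  -- range-restricted scans for date filters
    """

    with conn, conn.cursor() as cur:
        cur.execute(ddl)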

Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.

Wrote various data normalization jobs for new data ingested into Redshift.

Advanced knowledge of Confidential Redshift and MPP database concepts.

Migrated the on-premises database structure to the Confidential Redshift data warehouse.

Was responsible for ETL and data validation using SQL Server Integration Services.

Defined and deployed monitoring, metrics, and logging systems on AWS.

Experienced in Business Analysis and Data Analysis, User Requirement Gathering, User Requirement Analysis, Data Cleansing, Data Transformations, Data Relationships, Source Systems Analysis and Reporting Analysis.

Experienced in all aspects of the System Development Life Cycle (SDLC) from research to requirement analysis through design, implementation, testing, deployment, and enhancement using agile methodologies and Kanban.

Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.

Analysed, designed, and developed ETL strategies and processes; wrote ETL specifications; performed Informatica development and administration; and mentored other team members.

Developed mapping parameters and variables to support SQL override.

Led a Team and managed daily scrum meetings, sprint planning, sprint review, and sprint retrospective for RTB (Run the business) team.

Created user stories, use cases, and workflow diagrams and analysed them to prioritize business requirements. Worked on Business and Technical grooming of user stories.

Worked with business partners (product managers, lead analysts, scrum masters) to prioritize the user stories and analyse the gaps.

Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
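
One possible way to express such a WLM setup with boto3, shown as a hedged sketch: the parameter group name and queue layout are assumptions, and the same configuration can equally be applied from the Redshift console.

    import json
    import boto3

    redshift = boto3.client("redshift")

    # Hypothetical queues: a wide queue for dashboard query groups, a narrow
    # queue for ad hoc work, the default queue, and short query acceleration.
    wlm_config = [
        {"query_group": ["dashboard"], "query_concurrency": 10},
        {"query_group": ["adhoc"], "query_concurrency": 2},
        {"query_concurrency": 3},
        {"short_query_queue": True},
    ]

    redshift.modify_cluster_parameter_group(
        ParameterGroupName="analytics-wlm",  # assumed parameter group name
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )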

Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.

Responsible for designing logical and physical data models for various data sources on Confidential Redshift.

Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.

Environment: Redshift, AWS Data Pipeline, S3, SQL Server Integration Services, SQL Server 2014, AWS Data Migration Services, DQS, SAS Visual Analytics, SAS Forecast Server, and Tableau.

Client : Peloton Interactive

Location : New York City, NY

Role : AWS Data Engineer

Duration : Oct 2020 - Nov 2021

Responsibilities:

Worked with business users and business analysts to gather and analyze requirements.

Data Solutions is an ongoing incremental data load from various data sources.

Data Solutions feeds are transferred to Snowflake/MySQL/S3 through an ETL EMR process.

Extensively worked on moving data from Snowflake to Snowflake and to S3 for the TMCOMP/ESD feeds.

Extensively worked on moving data from Snowflake to Snowflake and to S3 for the LMA/LMA Search feeds.

Expertise in troubleshooting and resolving data pipeline related issues.

Used an enqueue/init Lambda function to trigger the EMR process.

EMR is used for all transformations in the ETL load process via SQL and Python scripts; Informatica is used for picking up source files from S3.
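
A minimal sketch of the Lambda-to-EMR trigger, assuming a cluster ID and script path that are placeholders rather than the project's actual values.

    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        """Hypothetical Lambda entry point that queues a spark-submit step on EMR."""
        cluster_id = event.get("cluster_id", "j-XXXXXXXXXXXXX")  # placeholder ID
        response = emr.add_job_flow_steps(
            JobFlowId=cluster_id,
            Steps=[{
                "Name": "transform-feed",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit", "--deploy-mode", "cluster",
                        "s3://example-bucket/scripts/transform_feed.py",  # assumed script
                    ],
                },
            }],
        )
        return response["StepIds"]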

Worked on the CDMR migration project from Oracle to Snowflake.

Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Python on EMR.

Processed data from different sources into Snowflake and MySQL.

Implemented Spark jobs using Python and Spark SQL for faster testing and processing of data.

Wrote and loaded data into the data lake environment (Snowflake) from AWS EMR, which was accessed by business users and data scientists using Tableau/OBIEE.

Copied data from S3 to Snowflake and connected with SQL Workbench for seamless importing and movement of data via S3.
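
A hedged sketch of the S3-to-Snowflake copy using the Snowflake Python connector; the account, stage, table, and file format shown are illustrative placeholders.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="xy12345", user="etl_user", password="***",  # placeholders
        warehouse="ETL_WH", database="ANALYTICS", schema="STAGING",
    )

    copy_sql = """
        COPY INTO staging.lma_feed
        FROM @staging.s3_landing/lma/   -- assumed external stage over the S3 bucket
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """

    cur = conn.cursor()
    try:
        cur.execute(copy_sql)
    finally:
        cur.close()
        conn.close()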

Collaborated with engineers to understand data structures and designed an ETL package on Azure Databricks that automated extraction of different types of data, such as PDF, image, and Excel files, from different sources.

Worked on AWS DMS to process data into SQL Server.

Met with business/user groups to understand the business process and fixed high-priority production support issues.

The Data Solutions team was involved in and supported all production support activities.

Served as a Subject Matter Expert on assigned projects.

POC for the LMA Search team's end-to-end production process.

Delivered MTTR analysis reports quarterly and WSR reports weekly.

Environment: AWS, EC2, S3, IAM, SQS, SNS, Snowflake, EMR-Spark, RDS, JSON, MySQL Workbench, ETL-Informatica, Oracle, Red Hat Linux, Tableau.

Client : Merck

Location : Rahway, NJ

Role : Data Engineer

Duration : Jan 2019 – Sept 2020

Responsibilities:

Worked on analysis and understanding of data from different domains in order to integrate it into the Data Marketplace.

Wrote complex SQL queries using inner join, left join, and temp table to retrieve data from the database for modelling purposes.

Used Spark SQL, PySpark, Python Pandas, and Numpy packages on the data cleaning process and data manipulation.

Performed EDA (exploratory data analysis) on large amounts of information using PySpark to view distributions, correlations, statistical values, trends, and patterns of the data in Databricks.

Implemented Python-based k-means clustering via PySpark to analyze the spending habits of different customer groups.
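
A compact sketch of that PySpark k-means approach; the input path, feature columns, and k are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("customer-segments").getOrCreate()

    # Hypothetical per-customer spend summary prepared by earlier ETL steps.
    spend = spark.read.parquet("/mnt/curated/customer_spend/")

    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["monthly_spend", "order_count", "avg_basket_size"],
                        outputCol="features_raw"),               # assumed features
        StandardScaler(inputCol="features_raw", outputCol="features"),
        KMeans(k=5, seed=42, featuresCol="features", predictionCol="segment"),
    ])

    model = pipeline.fit(spend)
    segments = model.transform(spend)
    segments.groupBy("segment").count().orderBy("segment").show()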

Used Pandas, Numpy, Seaborn, Matplotlib, and Scikit-learn in Python for developing various machine learning models and utilized algorithms such as Logistic regression and random forest.
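
In the same spirit, a brief scikit-learn sketch comparing logistic regression and random forest; the dataset, feature columns, and target are hypothetical.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical modelling dataset prepared upstream.
    df = pd.read_parquet("model_input.parquet")
    X = df[["age", "tenure_months", "monthly_spend"]]  # assumed features
    y = df["churned"]                                  # assumed binary target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    for name, model in [
        ("logistic_regression", LogisticRegression(max_iter=1000)),
        ("random_forest", RandomForestClassifier(n_estimators=200, random_state=42)),
    ]:
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")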

Worked with Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of big data.

Worked with Different business units to drive the design & development strategy.

Created functional specification and technical design documentation.

Coordinated with different teams (cloud security, Identity and Access Management, platform, network, DevOps) to obtain all the necessary accreditations and complete the intake process.

Wrote Terraform scripts to automate AWS services including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPC, subnets, security groups, and S3 buckets, and converted existing AWS infrastructure to AWS Lambda deployed via Terraform and AWS CloudFormation.

Implemented AWS Elastic Container Service (ECS) scheduler to automate application deployment in the cloud using Docker Automation techniques.

Explored large amounts of information using PySpark to view distributions, correlations, statistical values, trends, and patterns of the data.

Architected and configured a virtual data center in the AWS cloud to support Enterprise Data Warehouse hosting including Virtual Private Cloud (VPC), Public and Private Subnets, Security Groups and Route Tables.

Designed various Jenkins jobs to continuously integrate the processes and executed CI/CD pipeline using Jenkins.

Worked with API gateways, reports, dashboards, databases, security groups, data science models, and Lambda functions.

Migrated an existing on-premises application to AWS platform using various services and experienced in maintaining the Hadoop cluster on AWS EMR.

Extensive experience in working with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda and Glue).

Built a NiFi dataflow to consume data from Kafka, transform it, place it in HDFS, and expose a port to run a Spark Streaming job.
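
For the Kafka-to-HDFS leg, a hedged PySpark Structured Streaming sketch (the broker, topic, and paths are assumptions; the original flow was orchestrated through NiFi, and the Spark Kafka connector package must be on the classpath).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Raw stream from Kafka; broker and topic names are placeholders.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "claims_events")
           .load())

    # Kafka delivers the payload as binary; cast it and add a partition column.
    events = (raw.selectExpr("CAST(value AS STRING) AS json_value", "timestamp")
                 .withColumn("ingest_date", F.to_date("timestamp")))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/landing/claims_events/")  # assumed path
             .option("checkpointLocation", "hdfs:///checkpoints/claims_events/")
             .partitionBy("ingest_date")
             .start())

    query.awaitTermination()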

Used Agile methodology and Scrum process for project development.

Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, and Spark.

Expert in creating Hive UDFs using Java to analyze data sets for complex aggregate requirements.

Experience in developing ETL applications on large volumes of data using different tools: MapReduce, Spark-Scala, PySpark, and Spark-SQL.

Experience in using Sqoop to import and export data between RDBMS and HDFS/Hive.

Developed Oozie coordinators to schedule Hive scripts that create data pipelines on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.

Transformed business problems into big data solutions and defined big data strategy and roadmap; installed, configured, and maintained data pipelines.

Worked in an Agile/Scrum environment, regularly using JIRA trackers for the project; version-controlled analytics notebooks, data pipelines, and ADF dataflows with Bitbucket and Git.

Delivered and validated big data ETL processes with PySpark and SQL on Databricks; loaded data into Azure Delta Lake in CSV or Parquet format, partitioned by date, rank, etc.
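
A minimal Databricks-style sketch of that partitioned Delta load; the source table, date column, and storage path are illustrative (switching the format to "parquet" gives the plain-Parquet variant).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("claims-etl").getOrCreate()

    # Hypothetical curated claims dataframe produced by earlier ETL steps.
    claims = spark.table("staging.claims_curated")
    claims_with_date = claims.withColumn("load_date", F.to_date("service_date"))

    (claims_with_date.write
        .format("delta")
        .mode("append")
        .partitionBy("load_date")  # partition by date as described above
        .save("abfss://datalake@exampleacct.dfs.core.windows.net/curated/claims/"))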

De-identified claims data according to rules, applying parallel logic with PySpark to mask fields such as insurer ID, ZIP code, account number, and birth date across the database; maintained data quality as the input data changed.
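
A hedged sketch of that rule-driven masking: hashing direct identifiers and generalizing quasi-identifiers with PySpark column expressions; the column names and the salting approach are assumptions, not the project's actual rules.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("claims-deid").getOrCreate()

    claims = spark.table("staging.claims_curated")  # assumed source table
    SALT = "rotate-me"                              # placeholder salt

    masked = (claims
        # One-way hash of direct identifiers so downstream joins still work.
        .withColumn("insurer_id",
                    F.sha2(F.concat_ws("|", F.lit(SALT), "insurer_id"), 256))
        .withColumn("account_number",
                    F.sha2(F.concat_ws("|", F.lit(SALT), "account_number"), 256))
        # Generalize quasi-identifiers: 3-digit ZIP prefix, birth year only.
        .withColumn("zip", F.substring("zip", 1, 3))
        .withColumn("birth_date", F.trunc("birth_date", "year")))

    masked.write.mode("overwrite").saveAsTable("curated.claims_deidentified")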

Environment: AWS, Power BI, Tableau 2020.1, PostgreSQL, AngularJS, Python, PySpark, Snowflake, DynamoDB, Informatica EDC, AXON, Azure, Mainframe, DB2, IMS, MySQL, Salesforce, JIRA, Terraform, Jenkins

Client : IBM

Location : India

Role : Data Engineer

Duration : Aug 2015 – Dec 2017

Responsibilities:

Analysed, designed, and developed ETL strategies and processes; wrote ETL specifications; performed Informatica development and administration; and mentored other team members.

Led the team in designing and planning functional, regression, GUI and back-end testing.

Actively participated in writing and executing SQL queries for back-end testing.

Coordinated the smooth implementation of all the go-lives; extending post-implementation, and application maintenance support.

Communicated with the client on a regular basis to eliminate ambiguity in requirements.

Worked as a key contributor to people development by conducting multiple knowledge-sharing sessions.

Developed mapping parameters and variables to support SQL override.

Worked on Teradata and its utilities (TPump, FastLoad) through Informatica; also created complex Teradata macros.

Implemented Pushdown Optimization (PDO) to resolve performance issues in complex mappings whose numerous transformations were degrading session performance.

Worked on reusable code known as tie-outs to maintain data consistency: compared the source and target after ETL loading completed to validate that no data was lost during the ETL process.
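
A rough Python illustration of the tie-out idea, comparing a row count and a summed measure between source and target after a load; the connections, tables, and column are placeholders (the actual tie-outs were built in Informatica against Teradata).

    def tie_out(src_conn, tgt_conn, src_table, tgt_table, amount_col):
        """Compare row counts and a summed measure between source and target."""
        results = {}
        for label, conn, table in (("source", src_conn, src_table),
                                   ("target", tgt_conn, tgt_table)):
            cur = conn.cursor()
            # Any DB-API connection (e.g. teradatasql, pyodbc) works here.
            cur.execute(f"SELECT COUNT(*), SUM({amount_col}) FROM {table}")
            results[label] = cur.fetchone()
            cur.close()
        return results["source"] == results["target"], results

    # Usage (connections are placeholders):
    # matched, detail = tie_out(td_conn, sql_conn, "stg.orders", "dw.orders", "order_amount")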

Worked on the PowerExchange bulk data movement process using the PowerExchange Change Data Capture (CDC) method, PowerExchange Navigator, and PowerExchange Bulk Data Movement; PowerExchange CDC can retrieve updates at user-defined intervals or in near real time.

Worked independently on a critical project milestone for the interfaces, designing fully parameterized code to be used across the interfaces, and delivered it on time despite several hurdles such as requirement changes, business rule changes, source data issues, and complex business functionality.

Environment: Teradata, SQL Server, Shiny, Mahout, HBase, HDFS, flat files, UNIX-AIX, Windows XP, MS Word, Excel.


