
SR. DATA ENGINEER

PRIYANKA N

adzbct@r.postjobfree.com

925-***-****

PROFESSIONAL SUMMARY:

I bring 7+ years of dedicated Senior Data Engineering experience, with a focus on analytical programming, ETL/ELT techniques, and the AWS and Azure clouds. My expertise extends to the Hadoop ecosystem, and I have led successful data and cloud migration efforts across the IT Services, Healthcare, Finance, and Airlines domains.

●Hands-on experience in implementing various Data Engineering, Data Warehouse, Data Modeling, Data Mart, Data Visualization, Reporting, Data Quality, Data Virtualization, and Data Science solutions.

●Experience in data transformation, data mapping from source to target database schemas, and data cleansing procedures.

●Developing, designing, and analyzing using SQL, Hadoop Ecosystem, PySpark, and Python.

●Leading with strong leadership skills, results-driven, proactive, and adept at problem-solving within tight release timelines.

●Managing the entire Data Science project lifecycle, from extraction and cleansing to statistical modeling and visualization.

●Expertise in data ingestion and processing, with strong knowledge of distributed systems architecture, parallel processing, MapReduce, and the Spark execution model.

●Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations).

●Successfully migrating on-premises Teradata and Hadoop databases to AWS S3 and Redshift using tools such as AWS SCT and AWS DataSync.

●Designing and implementing on-demand Redshift tables via DynamoDB and AWS Glue; in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

●Developing Spark Applications in Databricks for data extraction, transformation, and aggregation.

●Enhancing Spark performance and optimizing algorithms using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs.

●Experience with Oozie and Airflow for managing Hadoop jobs and DAG workflows.

●Creating ETL solutions with AWS services like Glue, Lambda, EMR, Athena, S3, SNS, Kinesis, and PySpark (a minimal Glue job sketch appears at the end of this summary).

●Handling SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL.

●Designing RDBMS components, including Tables, Views, Indexes, Procedures, Cursors, Triggers, and Transactions.

●Leading Data Lake and Data Warehouse projects with Big Data technologies like Spark and Hadoop.

●Proficiency in AWS services, including EC2, S3, RDS, VPC, IAM, Load Balancing, Auto Scaling, CloudWatch, and more.

●Implementing data security pipelines and automating environments using GIT, Terraform, and Ansible.

●Ensuring secure and efficient data transfer to AWS using Alteryx. Deep understanding of AWS architecture, facilitating seamless integration of Alteryx workflows.

●Proficient in Shell and Bash Scripting, as well as Functional Programming in various OS environments.

●Designing and developing data models and data warehouses in Snowflake; developing and maintaining ETL processes to move data from source systems to Snowflake.

●Leveraging Azure Data Factory (ADF) for data orchestration and pipeline management.
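
The summary bullet on AWS-based ETL above refers to Glue-style PySpark jobs; the sketch below is a minimal, hypothetical example of that pattern. The catalog database, table, and S3 bucket names are placeholders, not values from any actual engagement.

```python
# Minimal, hypothetical AWS Glue job: read a cataloged source table, keep a few
# columns, and write partitioned Parquet to S3. All names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table registered in the Glue Data Catalog (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="bookings"
)

# Keep and rename only the columns needed downstream.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("booking_id", "string", "booking_id", "string"),
        ("flight_date", "string", "flight_date", "string"),
        ("fare_amount", "double", "fare_amount", "double"),
    ],
)

# Write Parquet to S3, partitioned by flight_date (placeholder bucket).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://example-curated/bookings/",
        "partitionKeys": ["flight_date"],
    },
    format="parquet",
)
job.commit()
```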

TECHNICAL SKILLS:

PROGRAMMING LANGUAGES: Python, SQL, PySpark, Node.js, Shell scripts, Java, Scala, NumPy, Pandas, Matplotlib

DATABASE DESIGN TOOLS: MS Visio, Fact and Dimension tables, Normalization and De-normalization techniques

DATA MODELLING TOOLS: Erwin Data Modeler and Manager, ER Studio v17, physical and logical data modeling

ETL/DATA WAREHOUSE TOOLS: Informatica Power Center, Talend, Tableau, Pentaho, SSIS, DataStage, Kedro, DBT

QUERYING LANGUAGES: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL, Spark SQL, Sqoop.

DATABASES: AWS RDS, Teradata, Hadoop FS, SQL Server, Oracle, Microsoft SQL, DB2.

NOSQL DATABASES: MongoDB, Hadoop HBase, Apache Cassandra

HADOOP ECOSYSTEM: Hadoop, MapReduce, Yarn, HDFS, Kafka, Storm, Pig, Oozie, Zookeeper

BIGDATA ECOSYSTEM: Spark, Spark SQL, Spark Streaming, PySpark, Hive, Impala

INTEGRATION TOOLS: Git, GitHub, Jenkins, AWS CodeCommit.

CLOUD TOOLS: AWS: Glue, S3, Redshift Spectrum, Kinesis, EC2, EMR, DynamoDB, Data Lake, Athena, Data Pipeline, Lambda, CloudWatch, SNS, SQS; Other: Databricks, Snowflake

OPERATING SYSTEMS: Windows, Linux, Unix, macOS, Ubuntu

METHODOLOGIES: Agile, Scrum, Waterfall

PROFESSIONAL CERTIFICATIONS

● AWS Certified Solutions Architect - Associate

Validation Number: EHQ7X6G24MB4QE99

Validate at: http://aws.amazon.com/verification.

● Snowflake Certified on Data Warehouse Hands-on Essentials

● Certified Scrum Product Owner (CSPO), Scrum Alliance

● Product Management Certification, Product School

PROFESSIONAL EXPERIENCE:

Client : United Airlines Location : Chicago, IL

Role : Senior Data Engineer Duration : Dec 2021 – Present

Responsibilities:

●Designing and building multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Amazon Redshift, handling millions of records every day.

●Implementing and Managing ETL solutions and automating operational processes.

●Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.

●Loaded data into Amazon Redshift and used AWS CloudWatch to collect metrics and monitor AWS RDS instances.

●Wrote various data normalization jobs for new data ingested into Redshift (a minimal COPY-based load sketch follows this section); advanced knowledge of Amazon Redshift and MPP database concepts.

●Migrated the on-premises database structure to the Amazon Redshift data warehouse.

●Was responsible for ETL and data validation using SQL Server Integration Services.

●Defined and deployed monitoring, metrics, and logging systems on AWS.

●Utilized ADO (Azure DevOps) for version control, continuous integration, and deployment of data engineering solutions.

●Used Python scripting to implement machine learning algorithms to predict and forecast data for better results.

●Hands-on ETL implementation using AWS services such as Glue, Lambda, EMR, Athena, S3, SNS, Kinesis, Data Pipeline, and PySpark.

●Experienced in Business Analysis and Data Analysis, User Requirement Gathering, User Requirement Analysis, Data Cleansing, Data Transformations, Data Relationships, Source Systems Analysis and Reporting Analysis.

●Experienced in all aspects of the System Development Life Cycle (SDLC) from research to requirement analysis through design, implementation, testing, deployment, and enhancement using agile methodologies and Kanban.

●Expertise in AWS resources such as EC2, S3, EBS, VPC, ELB, AMI, SNS, RDS, IAM, Route 53, Auto Scaling, CloudFormation, CloudWatch, and Security Groups.

●Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for processing and storage of small data sets, and maintained the Hadoop cluster on AWS EMR.

●Analyzing, designing, and developing ETL strategies and processes; writing ETL specifications; performing Informatica development and administration; and mentoring other team members.

●Experience in working with AWS Redshift, Glue, and DynamoDB.

●Generated and ingested wealth management data into the AWS platform using Python scripts.

●Developed custom APIs using Node.js and integrated them with third-party services to enhance application functionality and deliver seamless user experiences.

●Developed mapping parameters and variables to support SQL override.

●Led a Team and managed daily scrum meetings, sprint planning, sprint review, and sprint retrospective for RTB (Run the business) team.

●Used AWS services like Dynamo DB and S3 for small data sets processing and storage.

●Created user stories, use cases, and workflow diagrams and analyzed them to prioritize business requirements. Worked on Business and Technical grooming of user stories.

●Worked with business partners (product managers, lead analysts, Scrum masters) to prioritize the user stories and analyze the gaps.

●Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, allowing for a more reliable and faster reporting interface with sub-second query response for basic queries.

●Expertise in creating Apache Spark programs using Python (PySpark) to build topologies; well versed in using Datasets and DataFrames.

●Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.

●Responsible for designing logical and physical data models for various data sources on Amazon Redshift.

●Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.

Environment: Redshift, AWS Data Pipeline, S3, SQL Server Integration Services, SQL Server 2014, AWS Database Migration Service, DQS, SAS Visual Analytics, SAS Forecast Server, and Tableau.
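
As a companion to the Redshift loading and normalization bullets above, here is a minimal sketch of an S3-to-Redshift COPY issued from Python with psycopg2; the cluster endpoint, credentials, table, bucket, and IAM role are hypothetical placeholders rather than values from the actual environment.

```python
# Minimal, hypothetical sketch: load files staged in S3 into Redshift via COPY.
# Endpoint, credentials, table, bucket, and IAM role are all placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",  # in practice, fetch from Secrets Manager, never hard-code
)

copy_sql = """
    COPY staging.daily_bookings
    FROM 's3://example-landing/bookings/2023-08-29/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)                            # bulk load from S3
    cur.execute("ANALYZE staging.daily_bookings;")   # refresh planner statistics
conn.close()
```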

Client : Merck Location : New York City, NY

Role : Senior Data Engineer Duration : Oct 2020 - Nov 2021

Responsibilities:

●Worked with business users and business analysts on requirements gathering and analysis.

●Performed ongoing incremental data loads from various data sources; Data Solutions feeds were transferred to Snowflake, MySQL, and S3 through an EMR-based ETL process.

●Extensively worked on moving data from Snowflake to S3 for the TMCOMP/ESD feeds.

●Extensively worked on moving data from Snowflake to S3 for LMA/LMA Search.

●Expertise in troubleshooting and resolving data pipeline-related issues.

●Experienced in Maintaining the Hadoop cluster on AWS EMR.

●Used an enqueue/init Lambda function to trigger the EMR process.

●Used EMR for all SQL and Python script transformations and loads in the ETL process; Informatica was used to pick up source files from S3.

●Worked on the CDMR migration project from Oracle to Snowflake.

●Imported data from DynamoDB to Redshift in batches using AWS Batch, scheduled with TWS.

●Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Python on EMR.

●Applied data analysis, data mining, and data engineering to present data clearly.

●Processed data from different sources into Snowflake and MySQL.

●Implemented Spark jobs using PySpark and Spark SQL for faster testing and processing of data.

●Wrote and loaded data into the data lake environment (Snowflake) from AWS EMR, which was accessed by business users and data scientists using Tableau and OBIEE.

●Copied data from S3 to Snowflake and connected with SQL Workbench for seamless importing and movement of data via S3 (a minimal copy sketch follows this section).

●Collaborated with engineers to understand data structures and design an ETL package on Azure Databricks that automated extraction of different types of data, such as PDF, image, and Excel files, from different sources.

●Developed and maintained data pipelines using Azure SQL Database for efficient data storage and retrieval.

●Worked on AWS DMS to process data into SQL Server.

●Met with business/user groups to understand the business process and fixed high-priority production support issues.

●Designed, developed, and deployed serverless applications on AWS Lambda using Node.js; leveraged AWS API Gateway for seamless integration and an efficient microservices architecture.

●Designed and implemented data extraction, transformation, and loading (ETL) processes using SSIS, enabling seamless data integration between Hadoop and SQL Server databases.

●Created a PySpark framework to bring data from DB2 to AWS S3.

●Well versed in major Hadoop distributions: Cloudera and Hortonworks.

●Involved with the Data Solutions team in supporting all production support activities.

●Delivered MTTR analysis reports quarterly and WSR reports weekly.

Environment: AWS, EC2, S3, IAM, SQS, SNS, Snowflake, EMR (Spark), RDS, JSON, MySQL Workbench, SSIS, ETL (Informatica), Oracle, Red Hat Linux, Tableau.
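
To illustrate the S3-to-Snowflake copy bullet above, here is a minimal sketch using the Snowflake Python connector; the account, warehouse, stage, and table names are assumptions for illustration only, and real credentials would come from a secrets store.

```python
# Minimal, hypothetical sketch: copy staged S3 files into a Snowflake table.
# Account, warehouse, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="...",       # use key-pair auth or a secrets manager in practice
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # External stage assumed to point at the S3 landing bucket for this feed.
    cur.execute(
        """
        COPY INTO STAGING.TMCOMP_FEED
        FROM @LANDING_S3_STAGE/tmcomp/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
        """
    )
    print(cur.fetchall())  # COPY INTO returns per-file load results
finally:
    conn.close()
```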

Client : UnitedHealth Location : Rahway, NJ

Role : Data Engineer Duration : Jan 2019 – Sept 2020

Responsibilities:

●Worked on analysis and understanding of data from different domains in order to integrate it into the Data Marketplace.

●Wrote complex SQL queries using inner joins, left joins, and temp tables to retrieve data from the database for modeling purposes.

●Used Spark SQL, PySpark, Python Pandas, and NumPy packages on the data cleaning process and data manipulation.

●Used exploratory data analysis (EDA) with PySpark in Databricks to explore large amounts of data and get a general view of distributions, correlations, statistical values, trends, and patterns.

●Implemented Python-based k-means clustering via PySpark to analyze the spending habits of different customer groups (see the clustering sketch after this section).

●Used Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn in Python for developing various machine learning models and utilized algorithms such as Logistic regression and random forest.

●Worked on AWS services such as EMR and EC2, which provide fast and efficient processing of big data.

●Worked extensively with Azure Synapse for scalable data warehousing and analytics.

●Worked with Different business units to drive the design & development strategy.

●Created functional specification and technical design documentation.

●Coordinated with different teams (Cloud Security, Identity and Access Management, Platform, Network, DevOps) to obtain all the necessary accreditations and complete the intake process.

●Wrote Terraform scripts to automate AWS services including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPC, subnets, security groups, and S3 buckets, and converted existing AWS infrastructure to AWS Lambda deployed via Terraform and AWS CloudFormation.

●Implemented AWS Elastic Container Service (ECS) scheduler to automate application deployment in the cloud using Docker Automation techniques.

●Explored large amounts of information using PySpark to get a general view of distributions, correlations, statistical values, trends, and patterns of data.

●Architected and configured a virtual data center in the AWS cloud to support Enterprise Data Warehouse hosting including Virtual Private Cloud (VPC), Public and Private Subnets, Security Groups and Route Tables.

●Designed various Jenkins jobs to continuously integrate the processes and executed CI/CD pipeline using Jenkins.

●Worked with API gateways, Reports, Dashboards, Databases, Security groups, Data Science Models, Lambda functions.

●Migrated an existing on-premises application to AWS platform using various services and experienced in maintaining the Hadoop cluster on AWS EMR.

●Extensive experience in working with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda and Glue).

●Built a NiFi dataflow to consume data from Kafka, transform the data, place it in HDFS, and expose a port to run a Spark Streaming job.

●Used Agile methodology and Scrum process for project development.

●Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, and Spark.

●Expert in creating Hive UDFs in Java to analyze data sets with complex aggregation requirements.

●Experience in developing ETL applications on large volumes of data using different tools: MapReduce, Spark-Scala, PySpark, and Spark-SQL.

●Experience in using Sqoop for importing and exporting data between RDBMS, HDFS, and Hive.

●Developed Oozie coordinators to schedule Hive scripts and create data pipelines on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.

●Transformed business problems into big data solutions and defined big data strategy and roadmap; installed, configured, and maintained data pipelines.

●Worked in an Agile/Scrum environment, regularly using JIRA trackers for the project; controlled versions of analytics notebooks, data pipelines, and ADF dataflows with Bitbucket and Git.

●Delivered and validated big data ETL processes with PySpark and SQL on Databricks; loaded data into Azure Delta Lake in CSV or Parquet format, partitioned by date, rank, etc.

●De-identified claims data according to rules, applying parallel logic with PySpark to mask fields such as insurer ID, ZIP code, account number, and birth date across the database; maintained data quality as input data changed (a masking sketch follows this section).

Environment: AWS, Power BI, Tableau 2020.1, PostgreSQL, AngularJS, Python, PySpark, Snowflake, DynamoDB, Informatica EDC, Axon, Azure, Mainframe, DB2, IMS, MySQL, Salesforce, JIRA, Terraform, Jenkins
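
The k-means bullet in this section refers to customer segmentation in PySpark; the sketch below is a minimal, assumed version of that approach, where the input path and spend columns are placeholders rather than the actual schema.

```python
# Minimal, hypothetical sketch of k-means customer segmentation with PySpark ML.
# The input path and column names are placeholders.
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-segmentation").getOrCreate()

df = spark.read.parquet("s3://example-curated/customer_spend/")  # placeholder

# Assemble and scale numeric spend features before clustering.
assembler = VectorAssembler(
    inputCols=["grocery_spend", "travel_spend", "pharmacy_spend"],
    outputCol="features_raw",
)
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

assembled = assembler.transform(df)
features = scaler.fit(assembled).transform(assembled)

# Fit k clusters and attach a segment id to every customer.
kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="segment")
segmented = kmeans.fit(features).transform(features)
segmented.groupBy("segment").count().show()
```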
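
The de-identification and Delta Lake bullets in this section describe rule-based masking followed by a partitioned write; the sketch below is one assumed way to do that in PySpark, with masking rules, column names, and storage paths that are hypothetical.

```python
# Minimal, hypothetical sketch: rule-based masking of claims fields in PySpark,
# then a date-partitioned Delta Lake write. Columns and paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-deidentify").getOrCreate()

claims = spark.read.parquet("s3://example-raw/claims/")  # placeholder input

masked = (
    claims
    # One-way hash for direct identifiers so downstream joins still line up.
    .withColumn("insurer_id", F.sha2(F.col("insurer_id").cast("string"), 256))
    .withColumn("account_number", F.sha2(F.col("account_number").cast("string"), 256))
    # Generalize quasi-identifiers: 3-digit ZIP prefix, birth year only.
    .withColumn("zip", F.substring("zip", 1, 3))
    .withColumn("birth_year", F.year("birth_date"))
    .drop("birth_date")
)

(
    masked.write.format("delta")
    .mode("overwrite")
    .partitionBy("service_date")  # assumed partition column
    .save("abfss://lake@exampleaccount.dfs.core.windows.net/delta/claims_deid")
)
```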

Client : Oracle Financial Services Location : India

Role : Data Engineer Duration : Jan 2016 – Dec 2017

Responsibilities:

●Analyzing, designing, and developing ETL strategies and processes; writing ETL specifications; performing Informatica development and administration; and mentoring other team members.

●Led the team in designing and planning functional, regression, GUI and back-end testing.

●Actively participated in writing and executing SQL queries for back-end testing.

●Coordinated the smooth implementation of all the go-lives, extending post-implementation, and application maintenance support.

●Communicated with the client on a regular basis to eliminate ambiguity in requirements.

●Worked as a key contributor to people development by conducting multiple knowledge-sharing sessions.

●Developed mapping parameters and variables to support SQL override.

●Worked on Teradata and its utilities (TPump, FastLoad) through Informatica; also created complex Teradata macros.

●Implemented Pushdown Optimization (PDO) to address performance issues in complex mappings whose numerous transformations were degrading session performance.

●Worked on reusable code known as tie-outs to maintain data consistency: compared source and target after ETL loading completed to validate that no data was lost during the ETL process (see the tie-out sketch after this section).

●Worked on the PowerExchange bulk data movement process using the PowerExchange Change Data Capture (CDC) method, PowerExchange Navigator, and PowerExchange Bulk Data Movement; PowerExchange CDC can retrieve updates at user-defined intervals or in near real time.

●Worked independently on critical project interface milestones by designing completely parameterized code to be used across the interfaces, and delivered them on time despite several hurdles such as requirement changes, business rule changes, source data issues, and complex business functionality.

Environment: Teradata, SQL Server, Shiny, Mahout, HBase, HDFS, flat files, UNIX (AIX), Windows XP, MS Word, Excel, Python, PySpark, AWS
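
The tie-out bullet above describes comparing source and target after an ETL load; the sketch below is a minimal, assumed PySpark version that reconciles a row count and an amount total. The Teradata JDBC URL, driver, credentials, and table names are placeholders.

```python
# Minimal, hypothetical tie-out sketch: after an ETL load, reconcile row counts
# and a summed amount between source and target. All names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-tieout").getOrCreate()

# Source: Teradata via JDBC (driver jar assumed to be on the classpath).
source = (
    spark.read.format("jdbc")
    .option("url", "jdbc:teradata://example-host/DATABASE=SRC")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "SRC.TRANSACTIONS")
    .option("user", "etl_user")
    .option("password", "...")
    .load()
)
# Target: the warehouse table written by the ETL job (placeholder name).
target = spark.table("dw.transactions")

def summarize(df):
    # Row count plus a rounded amount total serve as a lightweight checksum.
    return df.agg(
        F.count(F.lit(1)).alias("row_count"),
        F.round(F.sum("amount"), 2).alias("amount_total"),
    ).collect()[0]

src, tgt = summarize(source), summarize(target)
assert src == tgt, f"Tie-out failed: source={src}, target={tgt}"
print("Tie-out passed:", src)
```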

EDUCATION:

●Master of Science in Computer Science and Engineering, University of South Florida, Tampa, FL

●Bachelor of Technology in Computer Science and Engineering, Guru Ghasidas University, India


