


SUNANDA POUDEL

Sr. Data Engineer/Cloud Developer

Phone: 781-***-****; Email: adtn8y@r.postjobfree.com

Profile Summary

•10+ years’ total IT experience and 7+ years in Big Data; expertise in Data Engineering, ETL Data Pipeline Design, Data Warehousing with large datasets of structured & unstructured data, data acquisition & validation, and predictive modeling.

•Expertise in designing, creating, and managing scalable ETL (extract, transform, load) systems and pipelines for various data sources.

•Optimize existing data quality and data governance processes to improve performance and stability.

•Work closely with Business Intelligence and Data Science teams and software developers to define strategic objectives and design data models.

•Experience building data pipelines, designing and implementing data warehouses, and developing real-time data processing solutions for a range of industries, including e-commerce, finance, and healthcare.

•Passionate about staying up to date with the latest technologies and best practices, and committed to driving innovation in the field of big data engineering.

•A versatile technocrat with hands-on experience in HDFS, YARN, Pig, Hive, Sqoop, HBase, Flume, Airflow, SQL, Zookeeper, Kafka, and Snowflake in both on-premises and cloud environments such as AWS, Azure, and GCP.

•Passionate about Big Data technologies and the delivery of effective solutions through creative problem solving with a track record of building large-scale systems using cutting-edge Big Data tools and systems.

•Strong computer programming skills and experience applying machine learning techniques, business intelligence, and statistical analysis tools.

•In-depth knowledge of Hadoop architecture and its components, including HDFS, YARN, NameNode, DataNode, and MapReduce.

•Hands-on experience with Hadoop ecosystem tools such as Hive, Pig, Sqoop, MapReduce, Flume, and Oozie; proficient in using Apache Hadoop to analyze large data sets efficiently.

•Experienced working with databases such as Cassandra, HBase, MongoDB, and DynamoDB.

•Efficient at writing complex SQL and PL/SQL for creating tables, views, indexes, stored procedures, and functions.

•Proficient with Microsoft Azure cloud and Amazon Web Services (AWS), with the ability to adapt to Google Cloud Platform.

•Perform data ingestion, extraction, and transformation using ETL processes built with Hive, Sqoop, Kafka, Firehose, Flume, and Kinesis. Experienced with AWS services such as S3, Lambda, EMR, Glue, SQS, SNS, Glue Data Catalog, Crawlers, and Athena (a small sketch follows below).
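
The bullet above mentions ETL built around Glue, Athena, and S3. The following is a minimal sketch, assuming boto3 and an existing Glue crawler and Athena database, of how such an ad hoc flow can look; the crawler name, database, table, and S3 result path are hypothetical placeholders, not details from this resume.

import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")
athena = boto3.client("athena", region_name="us-east-1")

# Refresh the Glue Data Catalog so Athena sees newly landed S3 data.
glue.start_crawler(Name="listings-crawler")  # hypothetical crawler name

# Run a serverless ad hoc query against the cataloged table.
query = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS n FROM listings GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Poll until the query finishes, then print the result rows.
qid = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])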

Technical Skills

Big Data Systems Architecture: Amazon AWS (EC2, S3, Kinesis), Azure Data Factory, Azure Data Lake, Google Cloud, Cloudera Hadoop, Hortonworks, Hadoop, Spark, PySpark, Hive, Kafka

Programming Languages: Python, Scala, Java, Bash, SQL, Shell scripts, HiveQL

Hadoop Components: Hive, Zookeeper, Sqoop, SBT, Yarn, Maven, Flume, HDFS, Airflow

Hadoop Administration: Zookeeper, Oozie, Cloudera Manager, Ambari, Yarn, Hortonworks

Data Management: Apache Cassandra, AWS Redshift, Amazon RDS, Apache HBase, SQL, NoSQL, Elastic Search, HDFS, Data Lake, Data Warehouse, Database, Teradata, SQL Server, Azure Data Lake, MongoDB

Big Data Frameworks: Spark and Kafka, Hadoop, Spark Streaming

Spark Framework: Spark API, Spark Streaming, Spark SQL, Spark Structured Streaming, PySpark

Visualization: Tableau, QlikView, PowerBI, AWS QuickSight, Kibana, Splunk

IDE/Web-Based Platforms: Jupyter Notebooks, Databricks, PyCharm, IntelliJ

Continuous Integration (CI/CD): Jenkins

Versioning: Git, GitHub

Project Methods: Agile Scrum, Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing

Containers: Docker

Professional Experience

Big Data Engineer / June 2021 – Present

Century 21 Real Estate LLC, Madison, NJ

Collaborated on the migration of our entire web platform from an on-premises data center to the cloud (AWS) to support the company's growth and take advantage of cloud scale to add new features and expand Century 21 as a whole, allowing our customers to launch campaigns and make changes to them in real time.

Our unique, proactive approach allows us to handle all IT needs seamlessly and in the background, freeing up more of our customers' time to focus on their business.

Some of my main responsibilities include, but are not limited to:

•Build a real-time data processing and analytics system for Century 21 Real Estate using AWS Kinesis, Lambda, and Glue.

•Gather real estate data from various sources, including working with data providers, extracting data from websites, and using APIs to access real-time data, with AWS services such as EC2, S3, and Kinesis Data Streams for ingestion and storage; maintain the Hadoop cluster on AWS EMR (with PySpark).

•The system handles data from various sources such as real estate listings, client inquiries, and sales records.

•Develop near real-time ETL pipelines with Amazon Kinesis and other AWS analytics services, to provide sophisticated applications that wouldn’t be economically feasible on premises.

•Responsible for cleaning and preprocessing the data to ensure its accuracy and completeness, creating AWS Glue jobs (as the ETL tool) with Spark and using DataFrames and the Spark SQL API for faster data processing.

•Collect and stream ad impressions through Amazon Kinesis Data Streams. The ad impressions are then consumed by the company's internal API billing system, which uses Amazon Kinesis Firehose to aggregate and send the data to an Amazon Redshift data warehouse for analysis.

•Clean data and store it in S3 buckets and Amazon Redshift as the data warehouse solution to meet the business requirements of Century 21.

•Ensure that the integrated data is accurate and consistent after all transformations are applied to it, then populate database tables via AWS Kinesis Firehose and AWS Redshift.

•Create the POC to use Amazon Elastic Compute Cloud (Amazon EC2) instances to run the business logic for the ad campaigns.

•Integrate different applications to read data from Amazon Kinesis Data Streams and use AWS Lambda (with Python and Boto3) to process each event based on S3 triggers (see the sketch after this list).

•Work closely with data scientists to ensure that analyses are accurate and reliable once all the data is processed and integrated, so it can be analyzed to identify trends, patterns, and insights.

•Use AWS services such as AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS) to ensure the security and compliance of real estate data.

•Develop Spark applications using Java and implement Apache Spark data processing projects to handle data from various RDBMS and streaming sources.

•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.

•Move data between Snowflake and S3 as an alternative to Redshift, with a plan to go cloud agnostic in the future.

•Using Lambda functions to perform real-time data processing and transformation. This enabled the system to enrich the incoming data, filter it based on predefined rules, and perform calculations on the data.

•Use AWS CloudFormation to automate the deployment and configuration of the AWS infrastructure, enabling the platform to be easily replicated across different environments in our internal Data Lake.

•Work on writing data to Snowflake from PySpark applications in different file formats such as Parquet, CSV, Avro, and ORC, depending on the module.

•Develop Spark code in Python to run on EMR clusters and load data into the Snowflake data warehouse (see the sketch after this list).

•Contribute to serverless architecture design using AWS APIs, Lambda, S3, and DynamoDB, with a design optimized for auto-scaling performance.
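
As referenced in the Lambda bullet above, here is a minimal sketch, assuming Python and boto3, of a Lambda handler that reacts to S3 object-created triggers and forwards filtered records downstream. The stream name, field names, and filtering rule are hypothetical placeholders.

import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    """Process each S3 object-created event and forward selected records downstream."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly landed object (assumed to be newline-delimited JSON).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        for line in filter(None, body.splitlines()):
            item = json.loads(line)
            # Simple filtering/enrichment step before forwarding (rule is illustrative).
            if item.get("status") == "active":
                kinesis.put_record(
                    StreamName="enriched-events",  # hypothetical stream name
                    Data=json.dumps(item),
                    PartitionKey=str(item.get("listing_id", "default")),
                )

    return {"statusCode": 200}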
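
Also as referenced above, a hedged sketch of loading a DataFrame from an EMR PySpark job into Snowflake via the Snowflake Spark connector (which must be available on the cluster). All connection options, paths, and table names below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-to-snowflake").getOrCreate()

# Read curated Parquet data from S3 (path is a placeholder).
df = spark.read.parquet("s3://example-curated-bucket/listings/")

# Connection options for the Snowflake Spark connector (all values are placeholders).
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Append the batch into the target table.
(df.write
   .format("net.snowflake.spark.snowflake")
   .options(**sf_options)
   .option("dbtable", "LISTINGS")
   .mode("append")
   .save())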

Data Engineer / April 2019 – May 2021

American Family Insurance, Remote

Collaborated with a Data Analytics team working with Spark in IT, coming from the healthcare industry, for our client American Family Insurance.

Throughout that time, we watched IT firms struggle to combine excellent service with attention to critical details and procedures, and we knew we could make a difference beyond healthcare. American Family Insurance, also abbreviated as AmFam, is an American private mutual company that focuses on property, casualty, and auto insurance, and also offers commercial insurance, life, health, and homeowners coverage, as well as investment and retirement-planning products.

•Created S3 bucket structure and Data Lake layout for optimal use of glue crawlers and S3 buckets.

•Created a new Redshift cluster for data science using Quick Sight for reporting and mobile visualization.

•Created training material and assisted others with using AWS boto3 and Lambda.

•Created metadata tables for Redshift Spectrum and Amazon Athena for serverless ad hoc querying.

•Created a new Data Lake ingesting data from on-prem and other clouds to S3, Redshift, and RDS.

•Developed Spark Applications by using Scala and Python, and Implemented Apache Spark data processing project to handle data from various RDBMS and streaming sources.

•Used Spark and PySpark for streaming and batch applications on many ETL jobs to and from data sources.

•Developed a new API Gateway for streaming to Kinesis and ingestion of event streaming data.

•Used Terraform Enterprise and GitLab to deploy IAC to various AWS accounts.

•Integrated Big Data Spark jobs with EMR and Glue to create ETL jobs for around 450 GB of data daily.

•Optimized EMR clusters with partitioning and parquet file format to increase speeds and efficiency in ETL.

•Implemented AWS Step Functions for orchestration and CloudWatch Events for pipeline automation.

•Used the Hive/Glue Data Catalog to obtain and validate data schemas, and Lake Formation for data governance.

•Worked on and maintained the GitLab repository using Git, Bash, and Ubuntu for project code.

•Tuned EMR cluster for big data with different GZIP and parquet data formats and compression types.

•Used Terraform to create various lambda functions for orchestrations and automation.

•Set up and configured AWS CLI and boto3 for interacting with a cloud environment.

•Used Python to connect over the MQTT protocol and receive data from sensors.

•Created DataFrames with pandas.

•Developed the automation and scheduling of the whole project using Apache Airflow, with one DAG per module (see the sketch after this list).

•Pulled data from an Oracle database into HDFS.

•Accessed MySQL to create text files from databases.

•Created CSV reports from sensor data using pandas.

•Designed Spark Python job to consume information from S3 Buckets using Boto3.

•Programmed Python classes to load data from Kafka into DynamoDB according to the desired model.

•Created automated Python scripts to convert data from different sources and to generate ETL pipelines.

•Applied open-source configuration management and deployment using Puppet and Python. Created and maintained Python scripts for automating build and deployment processes.
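
As referenced in the Airflow bullet above, a minimal sketch of one per-module DAG, assuming Airflow 2.x with PythonOperator tasks; the DAG id, schedule, and task callables are illustrative placeholders rather than the project's actual code.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Pull the module's source data (placeholder)."""
    print("extracting")

def transform():
    """Clean and reshape the extracted data (placeholder)."""
    print("transforming")

def load():
    """Write the result to the warehouse (placeholder)."""
    print("loading")

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="module_claims_etl",  # one DAG per module; the id is hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load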

Data Engineer / February 2018 – January 2019

mBankCard, Remote

mBankCard (merchant account) is a type of bank account that allows businesses to accept payments by credit cards or debit cards. Essentially, a merchant account is an agreement between a retailer, a merchant bank, and a payment processor for the settlement of credit card and/or debit card transactions. We created a solution for all business owners in the U.S. to accept credit card payments, with a custom merchant account for every business.

•Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.

•Collaborated with the Digital Data Engineering team to develop and implement different data pipelines in Microsoft Azure cloud using most common services such as Azure Data Lake Gen2, Blob Storage, Azure Data Factory, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS Power BI for reporting and visualization.

•Worked with the Data Science team to gather requirements for various data mining projects.

•Developed dynamic file configurations and environment variables to run jobs in different environments.

•Worked on Hive partitioning and bucketing and on different file formats: JSON, XML, CSV, and ORC (see the sketch after this list).

•Installed clusters, commissioned and decommissioned data nodes, configured slots and NameNode high availability, and performed capacity planning.

•Executed tasks for upgrading clusters on the staging platform before doing it on a production cluster.

•Used Cassandra to work on JSON-documented data.

•Used HBase to store the majority of data which needed to be divided based on region.

•Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster.

•Monitored background operations in Hortonworks Ambari.

•Designed and built highly scalable data pipelines for near real-time and batch data ingestion, processing, and data integration using Azure Functions, Azure Event Hub, and CosmosDB.

•Used the image files of an instance to create instances containing Hadoop installed and running.

•Responsible for managing data coming from different sources.

•Imported and exported data between environments such as MySQL and HDFS and deployed to production.

•Involved in SQL development, unit testing, and performance tuning to ensure testing issues were resolved on the basis of defect reports.

•Involved in preparing SQL and PL/SQL coding conventions and standards.

•Worked closely with data scientists on enterprise-level data gathering to predict consumer behavior, such as which products a user has bought, and made recommendations based on recognized patterns.

•Wrote Python scripts to analyze the data.

•Exported data from the HDFS environment into RDBMS using SQOOP for report generation and visualization purposes.

•Consulted with business clients, created Tableau dashboard reports, collaborated with developer teams, and used data visualization tools and the Tableau platform to create business intelligence visualizations.

•Analyzed large amounts of data sets to determine the optimal way to aggregate and report on them.

•Created and modified users and groups with root permissions.

•Coding knowledge of Python, the Django framework, and REST frameworks, as well as sound knowledge of web frameworks.

•Navigated and used SQL Server Management Studio.

•Writing SQL scripts and queries.
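
As referenced in the Hive partitioning and bucketing bullet above, a small PySpark sketch of writing a partitioned, bucketed table to the Hive metastore; the storage path, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-demo")
    .enableHiveSupport()  # needed to persist bucketed tables to the Hive metastore
    .getOrCreate()
)

# Source data in one of the formats mentioned above (path and schema are placeholders).
txns = spark.read.json("abfss://raw@exampleaccount.dfs.core.windows.net/transactions/")

# Partition by a low-cardinality column and bucket by a high-cardinality key:
# partitioning prunes data at query time, bucketing speeds up joins on merchant_id.
(txns.write
    .partitionBy("settlement_date")
    .bucketBy(16, "merchant_id")
    .sortBy("merchant_id")
    .format("orc")
    .mode("overwrite")
    .saveAsTable("payments.transactions_bucketed"))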

Hadoop Developer (Cloud) / August 2016 – January 2018

The Hershey Company, Derry Township, PA

The Hershey Company, commonly known as Hershey's, is a multinational company that manufactures chocolates, cookies, and cakes, and sells beverages such as milkshakes.

•Performed data profiling and transformation on the raw data using Python and Oracle.

•Designed and implemented a data ingestion framework to load data into the data lake for analytical purposes.

•Created Lambda to process the data from S3 to Spark for structured streaming to get structured data by the schema.

•Created a benchmark between MongoDB and HBase for fast ingestion.

•Configured, installed, and managed Hortonworks (HDP) Distributions.

•Created Hadoop clusters using HDFS, with Amazon Redshift for NoSQL alongside ArangoDB as a multi-model data warehouse solution.

•Implemented Spark using Scala and utilized DataFrames and Spark SQL API for faster processing of data.

•Implemented Amazon Virtual Private Cloud (Amazon VPC) and configured direct connection to the Nestle data centers.

•Applied Amazon Simple Storage Service (Amazon S3) for data backups, including HANA, and Amazon Elastic Block Store (Amazon EBS) provisioned IOPS (P-IOPS) volumes for storage.

•Used Scala to connect to EC2 and push files to AWS S3.

•Scheduled and executed workflows in Oozie to run Hive jobs.

•Implemented an Amazon Web Services (AWS) SAP HANA environment to achieve the speed, performance, and agility required without a significant investment in physical hardware.

•Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.

•Utilized PyCharm for the majority of project testing and coding.

•Used Amazon Elastic Compute Cloud (Amazon EC2) instances to process terabytes of sales data weekly from promotions, modeling dozens of data simulations a day.

•Worked on various real-time and batch processing applications using Spark/Scala, Kafka, and Cassandra (see the sketch after this list).

•Built Spark applications to perform data enrichments and transformations using Spark DataFrames with Cassandra lookups.

•Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster.
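
As referenced in the Spark/Kafka bullet above, a hedged PySpark Structured Streaming sketch of the kind of Kafka-to-storage pipeline described (the actual work used Spark/Scala with Cassandra lookups); the broker, topic, schema, and output paths are placeholders, and the spark-sql-kafka package is assumed to be on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("sales-stream").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("sku", StringType()),
    StructField("amount", DoubleType()),
])

# Read micro-batches of events from Kafka (broker and topic names are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "sales-events")
       .option("startingOffsets", "latest")
       .load())

# Parse the JSON payload and keep only the fields we need.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Write each micro-batch out as Parquet, with checkpointing for fault tolerance.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://example-processed-bucket/sales/")
         .option("checkpointLocation", "s3://example-processed-bucket/_checkpoints/sales/")
         .outputMode("append")
         .start())

query.awaitTermination()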

Business Intelligence (BI) Developer / Nov 2013 – Jun 2016

MetLife, Inc., NY

MetLife, through its subsidiaries and affiliates, is one of the world’s leading financial services companies, providing insurance, annuities, employee benefits, and asset management to help its individual and institutional customers navigate their changing world.

•Configured Hive for exposing data for further analysis and for generating and transforming files from different analytical formats to text files.

•Analyze and interpret financial, CMDB, and consumption feed data, and organize them into accessible formats with meaningful insights.

•Applied many SDLC methodologies to leverage End User Technology Product Management experience in business analytics.

•Analyze financial data to uncover industry, company, and customer trends using data science and web automation tools such as Python, Jupyter notebooks/PyCharm, and Selenium/BeautifulSoup/Requests (see the sketch after this list).

•Developed data warehouse solutions for Business Intelligence and Data normalization of tables to apply SQL operations and publish reports to Power BI.

•Programmed shell scripts to monitor the health of Apache Tomcat and daemon services and respond accordingly to any warning or failure conditions.

•Installed and configured MySQL Servers, and maintained and updated the servers.

•Worked on Data Warehousing, Hadoop HDFS, and pipelines.

•Configured system to pull data from various file formats from various sources into Hadoop HDFS with Hive, Sqoop, and Spark.

•Transformed and cleansed data for BI analysis.

•Engaged with business and technology stakeholders on day-to-day requests and long-term planning initiatives.

•Analyzed system failures, identified root causes, recommended corrective actions, and fixed the issues.

•Developed, tested, and implemented the financial services application to bring multiple clients into a standard database format.
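
As referenced in the web-automation bullet above, a small hedged sketch of pulling an HTML table into pandas with Requests and BeautifulSoup; the URL and table layout are hypothetical.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical page containing an HTML table of quarterly figures.
URL = "https://example.com/financials/quarterly"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")

# Collect header names and row values from the first table on the page.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

df = pd.DataFrame(rows, columns=headers)
print(df.head())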

Education

Bachelor's degree in Electronics & Communication Engineering

Pokhara Engineering College


