
Kamal Subedi

Data Engineer

Email: ad3aqj@r.postjobfree.com

Contact: +1-302-***-****

PROFESSIONAL SUMMARY:

Accomplished Sr. Data Engineer with 7+ years of experience in Data Engineering, specializing in Data Pipeline Design, Development, and Implementation.

Extensive knowledge and hands-on experience in Hadoop and Big Data ecosystems, including HDFS, MapReduce, Spark, Cassandra, Pig, Sqoop, Hive, Oozie, and Kafka.

Hands-on experience building and deploying Airflow DAGs.

Proficient in working with Hive data warehouse infrastructure, encompassing table creation, data distribution through Partitioning and Bucketing techniques, and developing and optimizing HQL queries.

Good knowledge and experience with NoSQL databases like HBase, Cassandra, and MongoDB, as well as SQL databases such as Teradata, Oracle, PostgreSQL, and SQL Server.

Strong expertise in performance tuning of Spark applications and Hive scripts to achieve optimal execution.

Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.

Possess strong working experience with Cassandra for data retrieval and modeling.

Proficient in understanding Cassandra's cluster mechanism, including replication strategies, snitch, gossip, consistent hashing, and consistency levels.

Utilized DataStax Spark-Cassandra connector to load data into Cassandra and analyze it using CQL for efficient searching, sorting, and grouping.

Appended the Data Frames into Cassandra Key Space Tables using DataStax Spark-Cassandra Connector.
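
A minimal PySpark sketch of that append pattern, assuming the DataStax spark-cassandra-connector package is on the Spark classpath; the contact point, keyspace, table, and source path below are illustrative placeholders:

from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector JAR is available and Cassandra is reachable
# at the configured contact point; keyspace/table names below are illustrative.
spark = (
    SparkSession.builder
    .appName("cassandra-append")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

events_df = spark.read.json("hdfs:///data/events/")  # hypothetical source path

(
    events_df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")   # illustrative keyspace/table
    .mode("append")                                   # append into the existing key space table
    .save()
)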

Experience with Cassandra configuration files, including cassandra.yaml, the rack/DC properties file (cassandra-rackdc.properties), and cassandra-env.sh for JMX configuration.

Developed Spark applications utilizing Spark RDD, Spark SQL, and DataFrame APIs for efficient data processing.

Demonstrated experience in building end-to-end data pipelines on the Hadoop platform.

Created both simple and complex MapReduce streaming jobs using Python, integrated with Hive and Pig.
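
A minimal sketch of a Python Hadoop Streaming job of the kind described above (word-count style); the mode switch and file name are illustrative, and the script would be submitted with the hadoop-streaming JAR via -mapper / -reducer:

import sys

def run_mapper():
    # Emit <word, 1> pairs for every token on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so counts
    # for the same word arrive contiguously.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # Invoked as "wordcount.py map" for the mapper and "wordcount.py reduce" for the reducer.
    run_mapper() if sys.argv[1] == "map" else run_reducer()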

Skilled in processing large volumes of structured, semi-structured, and unstructured data, supporting diverse systems and application architectures.

Proficient in creating separate Snowflake virtual warehouses of various size classes.

Demonstrated hands-on experience with Hortonworks tools such as Tez and Ambari.

Hands-on experience in bulk loading and unloading data into Snowflake tables using the COPY command.

Experience in data transformations leveraging SnowSQL in Snowflake.
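
A minimal sketch of that bulk-load-plus-transformation pattern using the snowflake-connector-python client; the account, credentials, stage, and table names are placeholders:

import snowflake.connector

# Connection parameters are placeholders; in practice they come from a vault or env vars.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Bulk load CSV files from an external stage into a staging table,
    # then apply a simple SnowSQL transformation into the target table.
    cur.execute("""
        COPY INTO STAGING.ORDERS_RAW
        FROM @ext_stage/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    cur.execute("""
        INSERT INTO ANALYTICS.ORDERS
        SELECT ORDER_ID, CUSTOMER_ID, TO_DATE(ORDER_DATE), AMOUNT::NUMBER(12,2)
        FROM STAGING.ORDERS_RAW
    """)
finally:
    conn.close()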

Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality.

Solid understanding of AWS services (Redshift, S3, EC2) and Apache Spark/Scala processes and concepts, including server configuration for auto scaling and Elastic Load Balancing.

Experienced in Importing and exporting data into HDFS and Hive using Sqoop.

Possess in-depth knowledge of Data Sharing in Snowflake, leveraging its capabilities for effective data management.

Proficient in machine learning, big data, data visualization, and R and Python development. Also skilled in Linux, SQL, and Git/GitHub.

Experienced in Python data manipulation for loading, extraction, and analysis using libraries such as NumPy, SciPy, and Pandas.

Developed comprehensive mapping spreadsheets for the ETL team, providing source-to-target data mapping with physical naming standards, data types, volume, domain definitions, and corporate meta-data definitions.

Proficient in data visualization, designing dashboards using Tableau, and generating complex reports, including charts, summaries, and graphs to effectively communicate findings to team members and stakeholders.

Strong experience in scripting using Python API, PySpark API, and Spark API for data analysis.

Extensive experience in Extraction, Transformation, and Loading (ETL) of data from various sources into Data Warehouses, as well as data processing, including collection, aggregation, and movement of data from multiple sources.

Skilled in designing star schema and Snowflake schema for Data Warehousing.

Proficient in Agile Methodologies, with involvement in designing and developing data pipeline processes for various modules within AWS.

TECHNICAL SKILLS:

Operating Systems: Unix, Linux, macOS, Windows, Ubuntu

Hadoop Ecosystem: HDFS, MapReduce, YARN, Oozie, Zookeeper, Job Tracker, Task Tracker, Name Node, Data Node, Cloudera

Big Data Tech: Hadoop, Spark, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Maven

Data Ingestion: Sqoop, Flume, Kafka

Cloud Computing Tools: Snowflake, AWS, Databricks, Azure Data Lake services, Amazon EC2, EMR, S3

NoSQL Databases: HBase, Cassandra, MongoDB, Apache Hadoop HBase

Programming Languages: Python (Jupyter, PyCharm IDEs), R, Java, Scala, SQL, PL/SQL, SAS

Scripting Languages: Bash, Python, R, YAML, Shell scripts

Databases: Snowflake Cloud DB, AWS Redshift, AWS Athena, Oracle, MySQL, Teradata 12/14, DB2 10.5, MS Access, SQL Server, PostgreSQL 9.3, Netezza, Amazon RDS

SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export and Import (DTS)

IDEs: IntelliJ, Eclipse, Visual Studio, IDLE

Web Services: RESTful, SOAP, Oracle 9iAS, Oracle Forms Server, WebLogic 8.1/10.3, WebSphere MQ 6.0

Methodologies: Agile, Scrum, Waterfall

ETL: PySpark, Control-M, DataStage 7.5, Informatica PowerCenter, Talend, Pentaho, Microsoft SSIS, Ab Initio

Reporting / BI Tools: MS Excel, Tableau, Tableau Server and Reader, Power BI, QlikView, Crystal Reports, SSRS, Splunk

PROFESSIONAL EXPERIENCE:

United Health Group, Falls Church, VA 02/2021 - Current

AWS Data Engineer

Responsibilities:

Successfully migrated on-premise ETL pipelines from IBM Netezza to AWS, automating the data migration process to AWS S3, running ETL processes using Spark on EC2, and delivering data to AWS S3, AWS Athena, and AWS Redshift.
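
A minimal PySpark sketch of one such migration step, reading exported files from S3 and writing partitioned Parquet back to S3 for Athena and Redshift consumption; the bucket names, paths, and columns are illustrative placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Bucket names and paths are illustrative; the EC2 cluster is assumed to have
# S3 credentials available (e.g. via an instance profile).
spark = SparkSession.builder.appName("netezza-offload").getOrCreate()

raw = spark.read.option("header", "true").csv("s3a://my-landing-bucket/netezza_export/orders/")

cleaned = (
    raw.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
       .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
       .filter(F.col("order_id").isNotNull())
)

# Partitioned Parquet on S3 is directly queryable from Athena and loadable
# into Redshift via COPY or Spectrum.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3a://my-curated-bucket/orders/")
)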

Played a crucial role in requirements gathering and in building a data lake on top of HDFS, using the GoCD CI/CD tool for application deployment and bringing significant experience with the big data testing framework.

Implementing Amazon DynamoDB as a fully managed NoSQL database solution, taking advantage of its scalability, low-latency performance, and seamless integration with other AWS services.

Implementing big data projects on Cloudera 5.6, 5.8, and 5.13, Hortonworks 2.7, and AWS EMR 5.6, 5.20, and 5.29.

Implementing MongoDB solution for handling unstructured data, providing flexibility in schema design and supporting dynamic data models.

Developing MongoDB queries and aggregation pipelines to retrieve, transform, and analyze data stored in MongoDB collections.
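
A minimal PyMongo sketch of such an aggregation pipeline; the connection string, database, collection, and field names are illustrative placeholders:

from pymongo import MongoClient

# Connection string, database, and collection names are illustrative.
client = MongoClient("mongodb://localhost:27017")
claims = client["healthcare"]["claims"]

# Aggregation pipeline: filter to approved claims, group by member,
# and keep the top spenders.
pipeline = [
    {"$match": {"status": "APPROVED"}},
    {"$group": {"_id": "$member_id",
                "total_paid": {"$sum": "$paid_amount"},
                "claim_count": {"$sum": 1}}},
    {"$sort": {"total_paid": -1}},
    {"$limit": 10},
]

for doc in claims.aggregate(pipeline):
    print(doc["_id"], doc["total_paid"], doc["claim_count"])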

Utilizing Hortonworks distribution for the Hadoop ecosystem to drive project success.

Creating Sqoop jobs to import data from relational database systems into HDFS and to export results back into those databases.

Utilizing Pig and Pig scripts extensively for data cleansing and manipulation tasks.

Implementing and scheduling with Oozie workflow engine to orchestrate the execution of multiple Hive and Pig jobs.

Developing Python scripts for data analysis and customer insights.

Designed and implemented partitioned tables in Hive, leveraged Hive external tables for data warehousing, and crafted Hive queries for data analysis.
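
A minimal sketch of the partitioned external Hive table pattern, issued as HiveQL through a Hive-enabled Spark session; the database, table, columns, and HDFS path are illustrative placeholders:

from pyspark.sql import SparkSession

# Requires a Spark session with Hive support enabled and access to the HDFS path below.
spark = SparkSession.builder.enableHiveSupport().appName("hive-ddl").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
        user_id STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/analytics/web_logs'
""")

# Register a new daily partition and run an analysis query against it.
spark.sql("ALTER TABLE analytics.web_logs ADD IF NOT EXISTS PARTITION (event_date='2021-06-01')")
spark.sql("""
    SELECT status, COUNT(*) AS hits
    FROM analytics.web_logs
    WHERE event_date = '2021-06-01'
    GROUP BY status
""").show()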

Leveraged Flume to capture data logs from web servers into HDFS for further analysis.

Utilizing Spark's in-memory computing capabilities to perform advanced procedures like text analytics and processing.

Migrating MapReduce jobs to Spark, utilizing Spark SQL and Data Frames API to load structured data into Spark clusters.

Designing and optimizing DynamoDB tables to accommodate various types of data models, ensuring efficient storage, retrieval, and querying of data.
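
A minimal Boto3 sketch of that kind of table design, pairing a partition key with a sort key and querying a date range against them; the table, key, and attribute names are illustrative placeholders:

import boto3

# Region, table, and key names are illustrative; credentials are assumed to come
# from the standard AWS configuration chain (profile, env vars, or instance role).
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="member_claims",
    AttributeDefinitions=[
        {"AttributeName": "member_id", "AttributeType": "S"},
        {"AttributeName": "claim_date", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "member_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "claim_date", "KeyType": "RANGE"},  # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity instead of provisioned throughput
)
dynamodb.get_waiter("table_exists").wait(TableName="member_claims")

# Efficient access pattern: all claims for one member within a date range.
response = dynamodb.query(
    TableName="member_claims",
    KeyConditionExpression="member_id = :m AND claim_date BETWEEN :d1 AND :d2",
    ExpressionAttributeValues={
        ":m": {"S": "M-1001"},
        ":d1": {"S": "2021-01-01"},
        ":d2": {"S": "2021-12-31"},
    },
)
print(response["Count"])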

Developing scripts and tools to automate MongoDB backup, restoration, and maintenance tasks, ensuring data integrity and availability.

Utilizing Data Frames for data transformations, taking advantage of Spark's RDD transformations and actions.

Designing and developing Spark workflows using Scala, extracting data from cloud-based systems and applying transformations.

Developing ETL processes using Informatica to load data into Snowflake from various sources.

Utilizing Spark Streaming to consume topics from distributed messaging source Event Hub and process real-time data batches.

Tuning Cassandra and MySQL for optimal data performance.

Implementing monitoring and establishing best practices for Elasticsearch.

Utilizing Spark API over Hortonworks Hadoop YARN for analytics on Hive data.

Implementing fine-grained access control using DynamoDB's built-in security features, such as AWS Identity and Access Management (IAM) roles and policies, to ensure secure data access.

Utilizing Apache NiFi as an ETL tool for batch and real-time processing.

Extracting data from Cassandra through Sqoop and processing it in HDFS.

Working on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop.

Conducting performance tuning and optimization of MongoDB databases, addressing query bottlenecks and improving overall system responsiveness.

Developing MapReduce jobs to power data for search and aggregation purposes.

Managing Hadoop jobs using Oozie workflow scheduler.

Developing code to write JSON records from multiple input sources to Kafka queues, leveraging Kafka producers and consumers to load data from Linux file systems, servers, and Java web services.

Environment: Cloudera Manager (CDH5), Hadoop, Hive, MapReduce, Sqoop, Spark, Eclipse, Maven, Agile methodologies, AWS, Tableau, Pig, Elasticsearch, Storm, Cassandra, DataStax Cassandra, Impala, Oozie, Python, Shell Scripting, Java Collections, MySQL, Apache Avro, Zookeeper, SVN, Jenkins, Snowflake, Windows AD, Windows KDC, Hortonworks distribution of Hadoop 2.3, YARN, Ambari

Principal Financial Group Inc., Des Moines, IA 03/2019 - 01/2021

AWS Data Engineer

Responsibilities:

Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

Designed AWS architecture, Cloud migration, AWS EMR, DynamoDB, Redshift and event processing using lambda function.

Utilized AWS services with a focus on big data analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, and flexibility.

Involved in building data pipelines to ingest and transform data using Spark and loading the output into multiple targets such as Hive and HBase.

Implemented and managed PostgreSQL databases, taking advantage of its open-source nature, extensibility, and support for advanced SQL features.

Worked as a Snowflake Developer on the Teradata-to-Snowflake migration, using a hybrid migration model (a combination of lift-and-shift and staged migration).

Migrated PHI data from the existing Teradata DWH to Snowflake: data was exported with TPT and landed in S3 buckets, then loaded from S3 into the canonical layer in Snowflake using Matillion.

Extracted and loaded CSV and JSON data from AWS S3 into the Snowflake Cloud Data Warehouse.

Monitored and optimized the usage of warehouses, automatic clustering, and Snowpipe based on business needs, reducing cost by 15%.

Used Snowpipe to consume Claims & Enrollment data from vendors (institutional, financial, and independent) for near-real-time reporting.

Designed and optimized database schemas in PostgreSQL, considering normalization and denormalization strategies based on specific application requirements.

Worked on creating SNS (Simple Notification Service) topics and SQS (Simple Queue Service) queues for Snowpipe.

Used Bash and Python (including Boto3) to supplement the automation provided by Ansible and Terraform for tasks such as encrypting EBS volumes backing AMIs.

Involved in using Terraform to migrate legacy and monolithic systems to Amazon Web Services.

Wrote Lambda function code and set a CloudWatch Events rule as the trigger with a cron expression.
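
A minimal sketch of such a scheduled Lambda handler (the cron expression itself lives on the CloudWatch Events rule, not in code); the bucket, prefix, and queue URL are illustrative placeholders:

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered on a schedule, e.g. cron(0 2 * * ? *) configured on the CloudWatch Events rule.
    # Bucket and prefix are illustrative placeholders.
    response = s3.list_objects_v2(Bucket="my-landing-bucket", Prefix="incoming/")
    keys = [obj["Key"] for obj in response.get("Contents", [])]

    # Hand the batch of new object keys to a downstream queue for processing.
    if keys:
        boto3.client("sqs").send_message(
            QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/etl-batches",
            MessageBody=json.dumps({"keys": keys}),
        )
    return {"new_objects": len(keys)}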

Validated Sqoop jobs and shell scripts and performed data validation to check that data was loaded correctly without discrepancies. Performed migration and testing of static and transactional data from one core system to another.

Worked on creating and running Docker images with multiple microservices, and on Docker container orchestration using ECS and Lambda.

Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on RDDs.

Conducted PostgreSQL database upgrades, applying patches and new releases while ensuring minimal downtime and data integrity.

Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.

Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.

Implemented Kafka producers to create custom partitions, configured brokers, and implemented high-level consumers for the data platform.

Developed best practices, processes, and standards for effectively carrying out data migration activities. Worked across multiple functional projects to understand data usage and implications for data migration.

Prepared data migration plans including migration risk, milestones, quality, and business sign-off details.

Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.

Worked on migrating MapReduce programs into Spark transformations using Scala.

Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.

Wrote Python modules to extract data from the MySQL source database.
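
A minimal sketch of such an extraction module using mysql-connector-python; the connection details, query, and output path are placeholders:

import csv
import mysql.connector

def extract_orders(output_path: str = "/tmp/orders.csv") -> int:
    # Connection details, table, and filter date are illustrative placeholders.
    conn = mysql.connector.connect(
        host="mysql-source.internal",
        user="etl_user",
        password="***",
        database="sales",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT order_id, customer_id, order_date, amount FROM orders WHERE order_date >= %s",
            ("2020-01-01",),
        )
        rows = cur.fetchall()
        # Write a header row taken from the cursor metadata, then the extracted rows.
        with open(output_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow([col[0] for col in cur.description])
            writer.writerows(rows)
        return len(rows)
    finally:
        conn.close()

if __name__ == "__main__":
    print(f"extracted {extract_orders()} rows")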

Deployed the project on Amazon EMR with S3 connectivity for setting a backup storage.

Created Jenkins jobs for CI/CD using Git, Maven, and Bash scripting.

Involved in collecting, aggregating and moving data from servers to HDFS using Flume.

Involved in creating Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.

Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.

Developed Pig Latin scripts to extract and filter relevant data from the web server output files to load into HDFS.

Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.

Environment: Amazon Redshift, DynamoDB, PySpark, Snowflake, EC2, EMR, Glue, S3, Kafka, IAM, PostgreSQL, Jenkins, Maven, AWS CLI, Shell Scripting, Git, Hadoop, Hive, MapReduce, Sqoop, Spark

Citi Bank, Irving, TX 08/2018 - 02/2019

Big Data / Hadoop Developer

Responsibilities:

Played a key role in the development and maintenance of the extraction, transformation, and load process (ETL) for data integration.

Developed and implemented robust data extraction processes from product platforms to generate analytics on product feature usage.

Streamlined job monitoring and validation procedures through automation, resulting in a 10% reduction in processing time.

Addressed real-time issues and mitigated business impact by creating ad-hoc complex SQL queries on a data warehouse with a massive volume of data.

Contributed to the development and structuring of the Real-Time Analysis module for the Analytic Dashboard, utilizing Cassandra, Kafka, and Spark Streaming.

Set up, configured, and optimized the Cassandra cluster and developed a real-time Spark-based application to work alongside the Cassandra database.

Proficiently utilized SQL, PL/SQL, SQL Plus, SQL Loader, and performed query performance tuning to optimize database operations.

Utilized ER/Studio's Compare and Merge Utility to convert the Logical Data Model into a Physical Data Model, adhering to naming standards.

Contributed to the implementation of the customer data loading process into Informatica Power Center and MDM from various source systems.

Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.

Utilized Oozie workflow engine to run multiple Hive and Pig jobs.

Set up and benchmarked Hadoop/HBase clusters for internal use.

Developed dynamic ETL solutions by leveraging mapping parameters, variables, and parameter files.

Designed, developed, and tested Extract Transform Load (ETL) applications with diverse data sources.

Leveraged Hive and HUE to create and optimize SQL queries and implemented MapReduce jobs for data processing.

Explored Spark to enhance performance and optimization of existing algorithms in Hadoop, utilizing Spark context, Spark-SQL, Data Frame, and pair RDDs.

Identified, documented, and refined detailed business rules and use cases based on thorough requirements analysis.

Actively participated in requirements meetings and data mapping sessions to gain a comprehensive understanding of business needs.

Analyzed the requirements of multiple transactional data source systems to design efficient data models.

Implemented reporting data marts for Sales and Marketing teams on AWS Redshift, contributing to data schema design, ETL pipelines in Python/MySQL Stored Procedures, and Jenkins automation.

Demonstrated expertise in query performance tuning in both relational and distributed systems.

Designed an ETL architecture for seamless data transfer from OLTP to OLAP systems.

Environment: Hadoop, HDFS, Hive, Pig, HBase, Big Data, Oozie, Sqoop, Zookeeper, MapReduce, Cassandra, Scala, Linux, AWS Redshift, NoSQL, MySQL Workbench, Eclipse, Oracle 10g, SQL.

Charter Communications, South San Francisco, CA 12/2016 - 07/2018

Hadoop Developer

Responsibilities:

Worked as a Data Analyst/Engineer to generate data models using Erwin and developed relational database systems.

Analyzed data in place by mounting Azure Data Lake and Blob Storage to Databricks.

Involved in extracting and mining data for analysis to aid in solving business problems.

Used Azure Data Lake as a source and pulled data using PolyBase.

Formulated SQL queries, Aggregate Functions, and database schema to automate information retrieval.

Involved in manipulating data to fulfill analytical and segmentation requests.

Managed data privacy and security in Power BI.

Written complex SQL queries for data analysis to meet business requirements.

Used data visualization tools and techniques to share data effectively with business partners.

Designed and implemented a Data Lake to consolidate data from multiple sources, using Hadoop stack technologies such as Sqoop and Hive.

Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.

Reviewed code and system interfaces and extract to handle the migration of data between systems/databases.

Developed a Conceptual model using Erwin based on requirements analysis.

Involved in ETL mapping documents in data warehouse projects.

Involved in loading data into Teradata from legacy systems and flat files using complex MultiLoad and FastLoad scripts.

Created Azure Data Factory (ADF) pipelines using PolyBase and Azure Blob Storage.

Developed Star and Snowflake schema-based dimensional models to build the data warehouse.

Developed T-SQL code, stored procedures, views, functions, and other database objects to supply data for downstream applications and fulfill business requirements.

Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources.

Implemented data ingestion and handling clusters in real time processing using Kafka.

Created and tuned PL/SQL procedures and SQL queries for data validation in the ETL process.

Environment: Azure Data Lake, Erwin, Power BI, Hadoop, HBase, Teradata, T-SQL, SSIS, PL/SQL.

EDUCATIONAL DETAILS:

Bachelor's in Computer Science, Vellore Institute of Technology, Vellore, TN, India, 2014

Master's in Computer Science, George Washington University, Washington, DC.


