Data Engineer

Location:
Houston, TX
Posted:
April 17, 2023

Manasa S

Email: yakanth(a)pinnaclesofts(dot)com

Contact: ***,***,****

PROFESSIONAL SUMMARY

Around 8 years of experience in Analysis, Design, Development, and Implementation as a Data Engineer. Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing across the full cycle in both Waterfall and Agile methodologies.

●Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data.

●Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.

●Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.

●Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.

●Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.

●Worked on migrating data from Oracle databases to GCP.

●Good working knowledge of the Amazon Web Services (AWS) Cloud Platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.

●Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.

●Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.

●Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).

●Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.

●Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.

●Worked with Cloudera and Hortonworks distributions.

●Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/ data marts from heterogeneous sources.

●Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.

●Experience in developing customized UDF’s in Python to extend Hive and Pig Latin functionality.

●Expertise in designing complex mappings, in performance tuning, and in handling slowly changing dimension tables and fact tables.

●Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.

●Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.

●Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Storage, Cloud SQL, Stackdriver monitoring, Dataproc, and BigQuery.

●Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.

●Experience in designing star and snowflake schemas for data warehouse and ODS architectures.

●Experience with AWS services such as EC2, VPC, CloudFront, Elastic Beanstalk, Route 53, RDS, and S3.

●Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.

●Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Schema modeling for fact and dimension tables) using Analysis Services.

●Experience on MS SQL Server, including SSRS, SSIS, and T-SQL.

●Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.

TOOLS AND TECHNOLOGIES

Databases

Snowflake, AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL.

NoSQL Databases

MongoDB, Hadoop HBase, and Apache Cassandra.

Programming Languages

Java, Python, SQL, Scala, MATLAB.

Cloud Technologies

AWS, GCP, Amazon S3, EMR, Redshift, Lambda

Data Formats

CSV, JSON

Querying Languages

SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server, PL/SQL

Integration Tools

Jenkins

Scalable Data Tools

Hadoop, Hive, Apache Spark, Pig, Map Reduce, Sqoop, PySpark.

Operating Systems

Red Hat Linux, Unix, Windows, macOS.

Reporting & Visualization

Tableau

PROFESSIONAL EXPERIENCE

Client: Driven Brands, Charlotte, NC Aug 2022 - present

Role: Sr. Data Engineer

Responsibilities:

●Worked on financial FRED macro data ingestion and developed Python scripts for ingesting data from GCS to BigQuery, covering 18 different sources.

●Worked on the data inventory and automation for FRED data, where the data in BigQuery needs to be updated on a daily basis.

●Also gathered a list of customer data sources for the Driven data proto.

●Worked on sprint tickets covering documentation, data ingestion for landing the data, creating the list of customer data sources in the specified projects, assessing the current table against team needs, developing the customer master launch, documenting the current end-to-end state (data, processes, etc.), validating the process, deciding whether the current table should be enhanced or recreated, launching into production, and developing quality monitoring.

●Prepared the necessary documentation and validated that the process fits architecture standards for the financial FRED macro data.

●Created BigQuery tables for Airflow to automate data loads in Google Cloud Platform.

●Also created BigQuery tables using SQL queries.

●Loaded data into BigQuery for on-arrival CSV files in the GCS bucket and worked with the Airflow server and Google Cloud Platform (a brief sketch of such a load follows at the end of this section).

●Performed data migration from GCS to BigQuery.

●Scheduled all jobs using Airflow scripts written in Python, adding different tasks to DAGs and defining dependencies between the tasks.

●Installed and configured Apache Airflow for workflow management and created workflows in Python.

●Created Airflow scheduling scripts in Python.

●Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.

●Worked on the entity relationship diagram for the transaction build DDM.

●Created data extracts in Tableau by connecting to the view using Tableau MSSQL connector.

●Created Tableau dashboards, datasets, data sources and worksheets.

●Worked on migrating data from the Oracle database to GCP.
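
The BigQuery loads described above can be captured in a small Airflow DAG. The following is a minimal sketch only, assuming Airflow 2.x with the Google provider installed and a default GCP connection configured; the DAG id, bucket, dataset, and table names are hypothetical placeholders rather than the project's actual resources.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    # Hypothetical daily DAG that appends newly arrived CSV files in GCS to a BigQuery table.
    with DAG(
        dag_id="fred_macro_daily_load",
        schedule_interval="@daily",
        start_date=datetime(2023, 1, 1),
        catchup=False,
    ) as dag:
        load_fred_csv = GCSToBigQueryOperator(
            task_id="load_fred_csv",
            bucket="example-landing-bucket",          # placeholder GCS bucket
            source_objects=["fred/*.csv"],            # placeholder object prefix
            destination_project_dataset_table="example-project.finance.fred_macro",
            source_format="CSV",
            skip_leading_rows=1,
            write_disposition="WRITE_APPEND",
            autodetect=True,
        )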

Client: Nestle, St Charles, MO Feb 2020 - Jul 2022

Role: Sr. Data Engineer

Responsibilities:

●Evaluated client needs and translated business requirements into functional specifications, thereby onboarding clients onto the Hadoop ecosystem.

●Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.

●Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

●Loaded file data from the ADLS server to Google Cloud Platform buckets and created the Hive tables for end users.

●Worked on migrating data from the Oracle database to GCP.

●Worked extensively with Sqoop for importing and exporting data between the HDFS data lake and relational database systems such as Oracle and MySQL.

●Built a real-time pipeline for streaming data using Kafka/Pub/Sub for data ingestion into Google Cloud Platform (GCP).

●Hands-on experience with GCP Dataproc, BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, and Stackdriver.

●Worked on processing, ad hoc reporting with the SSRS report server, and data mining in SSAS.

●Used Apache NiFi to copy data from local file system to HDP.

●Used Spark for interactive queries, processing of streaming data, and integration with the HBase database for large volumes of data.

●Generated ETL data flows for both batch and streaming data.

●Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.

●Developed generic data frameworks and data products using Apache Spark and Scala to maintain high availability and performance while striving for simplicity.

●Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.

●Used the Cloud SDK in GCP Cloud Shell to configure the Dataproc, Storage, and BigQuery services.

●Granted user permissions using IAM in GCP.

●Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python, as well as to NoSQL databases such as HBase and Cassandra (a brief sketch follows at the end of this section).

●Worked on auto scaling instances to design cost-effective, fault-tolerant, and highly reliable systems.
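
As a rough illustration of the Kafka-to-HDFS streaming path described above, the following PySpark sketch reads a topic with Structured Streaming and appends the stream to HDFS as Parquet. It is a sketch only: the broker address, topic name, and paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka_to_hdfs_stream").getOrCreate()

    # Subscribe to the Kafka topic; key and value arrive as bytes and are cast to strings.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")   # placeholder broker
        .option("subscribe", "ingest-topic")                   # placeholder topic
        .load()
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    )

    # Append each micro-batch to HDFS as Parquet, checkpointing progress for fault tolerance.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/landing/events")          # placeholder output path
        .option("checkpointLocation", "hdfs:///checkpoints/events")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()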

Environment: Hadoop (HDFS, MapReduce), Scala, Oracle, Databricks, YARN, IAM, PostgreSQL, Spark, Impala, Hive, MongoDB, GCP, Pig, HBase, Oozie, Hue, Sqoop, Flume, NiFi, Git.

Client: Qualcomm, San Diego, CA Aug 2019 – Jan 2020

Role: Data Engineer

Responsibilities:

●Worked on the development of data ingestion pipelines using the Talend ETL tool and bash scripting, with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.

●Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.

●Built end-to-end automation using shell scripting, AWS SNS, AROW, and PagerDuty.

●Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.

●Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations, performing read/write operations, and saving the results to an output directory in HDFS/AWS S3 (a brief sketch follows at the end of this section).

●Supported data quality management by implementing proper data quality checks in data pipelines.

●Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.

●Implemented the import and export of data using XML and SSIS

●Expertise in Spark, Kafka, AWS, SQL, Python, PySpark

●Used SSIS to build automated multi-dimensional cubes.

●Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

●Experience moving raw data between different systems using Apache NiFi.

●Built machine learning models to showcase big data capabilities using PySpark and MLlib.

●Enhanced the data ingestion framework by creating more robust and secure data pipelines.

●Implemented data streaming capability using Kafka and Talend for multiple data sources.

●Implemented AWS Elastic Container Service (ECS) scheduler to automate application deployment in the cloud using Docker Automation techniques.

●Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.

●Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.

●Knowledge of implementing JILs to automate jobs in the production cluster.

●Troubleshot users' analysis bugs (JIRA and IRIS tickets).

●Worked with SCRUM team in delivering agreed user stories on time for every Sprint.

●Worked on analyzing and resolving the production job failures in several scenarios.

●Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.

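A condensed sketch of the Spark SQL/DataFrame read-transform-write flow referenced in this list; the bucket paths, column names, and filter logic are purely hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3_transform_job").getOrCreate()

    # Read raw Parquet data from the landing bucket (placeholder path).
    raw = spark.read.parquet("s3a://example-landing/trades/")

    # Example transformations: de-duplicate, drop null amounts, derive a partition column.
    curated = (
        raw.dropDuplicates(["trade_id"])
           .filter(F.col("amount").isNotNull())
           .withColumn("trade_date", F.to_date("trade_ts"))
    )

    # Write the curated output back to S3, partitioned for downstream queries.
    (curated.write
            .mode("overwrite")
            .partitionBy("trade_date")
            .parquet("s3a://example-curated/trades/"))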

Environment: Spark, AWS, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, NiFi, Kafka, shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Azure, Agile methodology.

Client: Broadridge, New York, NY Aug 2018 - Jul 2019

Role: Data Engineer

Responsibilities:

●Responsibilities included gathering business requirements, developing a strategy for data cleansing and data migration, and writing functional and technical specifications.

●Worked on a Hadoop cluster that ranged from 4-8 nodes during the pre-production stage and was sometimes extended up to 24 nodes during production.

●Built APIs that will allow customer service representatives to access the data and answer queries.

●Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake (a brief sketch follows at the end of this section).

●Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

●Extended the functionality of Hive with custom UDFs and UDAFs.

●The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established a self-service reporting model in Cognos for business users.

●Implemented Bucketing and Partitioning using hive to assist the users with data analysis.

●Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.

●Implemented partitioning, dynamic partitions, and buckets in Hive.

●Developed database management systems for easy access, storage, and retrieval of data.

●Performed DB activities such as indexing, performance tuning, and backup and restore.

●Implemented AJAX, JSON, and JavaScript to create interactive web screens.

● Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.

●Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries.
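
One way such a loader script might look, as a minimal sketch using boto3 and pandas; the bucket, object key, DynamoDB table name, and the assumption that every record carries a usable primary key are hypothetical.

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("customer_lookup")        # placeholder DynamoDB table

    # Download one CSV object from S3 and parse it with pandas (placeholder bucket and key).
    obj = s3.get_object(Bucket="example-raw-bucket", Key="customers/2019-01.csv")
    df = pd.read_csv(obj["Body"])

    # Batch-write the rows into DynamoDB; values are stringified for simplicity.
    with table.batch_writer() as batch:
        for record in df.to_dict(orient="records"):
            batch.put_item(Item={k: str(v) for k, v in record.items()})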

Environment: Cloudera CDH 4.3, AWS, Hadoop, Pig, Hive, MapReduce, HDFS, Sqoop, Impala, Tableau, Flume, Oozie, Linux.

Snipe IT Solutions, Hyderabad, India Oct 2016 – Jun 2017

Role: Hadoop Developer

Responsibilities:

●Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.

●Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.

●Experienced in SQL programming and creation of relational database models

●Installed Oozie Workflow engine to run multiple Hive and Pig Jobs.

●Used SQL Azure extensively for database needs in various applications.

●Developed multiple MapReduce jobs in Java for data cleansing and pre-processing.

●Good experience in handling data manipulation using Python scripts (a brief sketch follows at the end of this section).

●Analyzed large amounts of data sets to determine optimal ways to aggregate and report on it.

●Responsible for building scalable distributed data solutions using Hadoop.

●Wrote SQL queries, stored procedures, triggers, and functions for MySQL databases.

●Migrated ETL processes from Oracle to Hive to test easier data manipulation.

●Optimized Pig scripts and Hive queries to increase efficiency and added new features to existing code.

●Worked on creating tabular models on Azure Analysis Services to meet business reporting requirements.

●Used Hive and created Hive tables and was involved in data loading and writing Hive UDFs.

●Used Sqoop to import data into HDFS and Hive from other data systems.

●Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.

●Developed Hive queries to process the data for visualization and reporting.
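
As an illustration of the Python-based data cleansing on the cluster, the following is a minimal Hadoop Streaming mapper sketch that drops malformed tab-separated log records and emits (log_level, 1) pairs for a downstream count; the field layout, file names, and job paths are hypothetical.

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper; a matching reducer would sum the counts
    # per key. Submitted roughly as:
    #   hadoop jar hadoop-streaming.jar -input /logs/raw -output /logs/clean \
    #       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # skip malformed records
        level = fields[1].strip().upper()
        print("%s\t1" % level)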

Environment: Apache Hadoop, Cloudera Manager, CDH2, Python, CentOS, Java, MapReduce, Pig, Hive, Sqoop, Oozie and SQL.

Client: Adaptech Systems Ltd, Hyderabad, India Jul 2014 – Sep 2016

Role: Java Developer

Responsibilities:

●Involved in Analysis, Design, Development, Integration, and Testing of application modules.

●Designed and developed applications using Java/J2EE technologies.

●Converted a monolithic app to a microservices architecture using Spring Boot and the 12-factor app methodology; deployed, scaled, configured, and wrote manifest files for various microservices in PCF.

●Developed components using Spring MVC.

●Used Spring Framework for Dependency injection and integration with Service objects, DAO etc.

●Developed server-side services using Java, Spring, and web services.

●Implemented Object Oriented Analysis and Design, Java Collections framework, design patterns, and multi-threading.

●Implemented REST microservices using Spring Boot; generated metrics with method-level granularity and persistence using Spring AOP and Spring Actuator.

●Developed Web Services using RESTful web services.

●Consumed RESTful web services using HttpClient for data coming from external systems.

●Involved in generating and configuring the JPA entities from the database.

●Involved in developing Triggers, Stored procedures in SQL, PL/SQL.

●Integrated a central logging system using Log4j to capture logs, including runtime exceptions and info-level messages, which helped in debugging issues.

●Implemented Web Services using SOA Architecture for data exchange across different Enterprise systems.

●Utilized Test Driven Development (TDD) for web application development using Agile methodology

●Used Spring Config Server for centralized configuration and Splunk for centralized logging; used Concourse and Jenkins for microservices deployment.

●Developed complex SQL scripts to compare all the records for every field and table at each phase of the data.

●Implemented testing using JUnit, and Mockito Framework.

●Used Subversion to commit the source code.

●Used Jenkins for builds and continuous integration.

Environment: Java, J2EE, Struts, design patterns, multi-threading, JSP, Spring, Hibernate JPA, JAX-RS, JUnit, Log4j, Ajax, JavaScript, Maven, Spring MVC, Spring IoC.


