
Data Engineer Software Development

Location: McKinney, TX
Posted: May 16, 2024

Resume:

Data Engineer

Sanjay Reddy

Email: ad5qww@r.postjobfree.com

Phone: +1-469-***-****

PROFESSIONAL SUMMARY:

Around 9 years of experience with emphasis on Analytics, Design, Development, Implementation, Testing and Deployment of Software Applications.

Led the implementation of DevOps methodologies to streamline software development processes, resulting in a 30% reduction in deployment time.

Implemented and maintained Docker containers for application deployment, reducing deployment time by 40%.

Managed version control and collaboration with Git, facilitating seamless code integration and team collaboration.

Configured NGINX as a reverse proxy for load balancing and SSL termination, improving application performance and security.

Utilized Amazon RDS for PostgreSQL and MongoDB in cloud environments, ensuring data integrity and scalability.

Deployed and managed applications on AWS Elastic Beanstalk, optimizing resource utilization and scalability.

Designed and maintained Virtual Private Cloud (VPC) environments, ensuring network isolation and security.

Orchestrated containerized workloads using Kubernetes, automating deployment and scaling processes.

Integrated Identity Provider (IDP) and Identity Access Management (IAM) solutions for centralized authentication and access control.

Automated infrastructure provisioning and configuration using scripting languages such as Ruby, Python, and Bash.

Implemented Microservice architecture, breaking down monolithic applications into modular and scalable components.

Collaborated with cross-functional teams using Agile software development practices, ensuring rapid iteration and delivery.

Adhered to Project Management Office (PMO) processes, policies, and procedures, ensuring project alignment with organizational goals.

Collaborated with cross-functional teams to automate infrastructure provisioning and configuration management using IaC tools such as Ansible, Puppet, Chef, and Terraform.

Managed CI/CD pipelines with Jenkins, GitLab CI, and Travis CI, improving deployment frequency and reliability.

Implemented monitoring and alerting solutions such as Prometheus, Grafana, and the ELK stack to ensure system stability and performance.

Worked closely with development teams to promote a culture of continuous integration and delivery, facilitating faster time-to-market for products.

Expertise with big data technologies (Hadoop and Spark) on-premises and on AWS cloud services such as EC2, S3, Auto Scaling, Glue, Lambda, CloudWatch, CloudFormation, Athena, DynamoDB, and Redshift.

Good working experience with Azure big data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Factory, and Azure Databricks; created a POC for moving data from flat files and SQL Server.

Hands-on experience working with the Databricks platform and Snowflake database.

Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe from the data lake (Confidential AWS S3 bucket).

Developed custom Kafka producers and consumers for publishing to and subscribing to different Kafka topics.

Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka; worked on reading multiple data formats on HDFS using Scala.

Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.

Experienced in running ad-hoc queries directly on Hadoop using Impala and BI tools.

Good experience with the Oozie framework and automating daily import jobs.

Working experience with data governance tools such as Collibra.

Hands-on experience working on building OLAP cubes.

Collected log data from various sources and integrated it into HDFS using Flume.

Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as HBase, DynamoDB, and MongoDB.

Utilized Azure Data Catalog to catalogue and document datasets, making it easier for stakeholders to discover, understand, and leverage available data assets.

Hands-on experience working with Azure Databricks.

Migrated data from an AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Python (a minimal sketch appears at the end of this summary).

Ability to work effectively in cross-functional team environments, with excellent communication and interpersonal skills.

Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.

Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.

Experience in developing custom UDFs for Pig and Hive.

Extensive knowledge of various reporting objects in Tableau, such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters.

Experience working with Flume and NiFi for loading log files into Hadoop.

Excellent interpersonal and communication skills, efficient time management and organization skills, and the ability to handle multiple tasks and work well in a team environment.
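
For illustration, below is a minimal sketch of the kind of S3-to-Snowflake load utility referenced above; the account, credentials, stage, and table names are placeholders rather than the actual environment.

    # Hypothetical S3 -> Snowflake load helper; all connection details,
    # stage and table names are placeholders.
    import snowflake.connector

    def copy_stage_to_table(table: str, stage: str, prefix: str) -> int:
        """Run COPY INTO from an external S3 stage and return rows affected."""
        conn = snowflake.connector.connect(
            account="example_account",   # placeholder
            user="etl_user",             # placeholder
            password="***",              # placeholder; prefer a secrets manager
            warehouse="ETL_WH",
            database="ANALYTICS",
            schema="RAW",
        )
        try:
            cur = conn.cursor()
            # The stage is assumed to already point at the S3 bucket
            # (CREATE STAGE ... URL = 's3://...').
            cur.execute(
                f"COPY INTO {table} FROM @{stage}/{prefix} "
                "FILE_FORMAT = (TYPE = PARQUET) ON_ERROR = 'ABORT_STATEMENT'"
            )
            return cur.rowcount
        finally:
            conn.close()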

TECHNICAL SKILLS:

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Airflow, HBase

Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++, Go

Methodologies: DevOps (Ansible, Puppet, Chef, Jenkins, GitLab CI, Travis CI)

Cluster Mgmt. & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: Snowflake, MySQL, SQL Server, Oracle

NoSQL Databases: DynamoDB, Cassandra, HBase

Workflow Mgmt. Tools: Oozie, Apache Airflow

Visualization & ETL Tools: Tableau, Informatica, Talend, Power BI

Cloud Technologies: Azure, AWS

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

PROFESSIONAL EXPERIENCE

Wells Fargo - San Francisco, California    July 2022 to Present

Data Engineer

Responsibilities:

●Engineered and Administered Informatica platforms for Cloud Services, Big Data Management, Master Data Management, Data Integration and Data Quality.

●Wrote Python scripts to parse XML documents and load the data into a database, and developed web-based applications using Python, CSS, and HTML.

●Worked on applications and developed them with XML, JSON, XSL (PHP, Django, Python, Rails).

●Experienced in developing Web Services with Python programming language.

●Implemented infrastructure as code (IaC) using Terraform and AWS CloudFormation, enabling reproducible and scalable deployments.

●Conducted regular code reviews and implemented best practices for version control using Git, branching strategies, and Git workflows.

●Participated in on-call rotation, providing support for production incidents and contributing to post-mortem analyses to prevent future occurrences.

●Mentored junior team members on DevOps principles, tools, and best practices.

●Experience in writing subqueries, stored procedures, triggers, cursors, and functions on MySQL and PostgreSQL databases.

●Designed and maintained Virtual Private Cloud (VPC) environments, ensuring network isolation and security.

●Orchestrated containerized workloads using Kubernetes, automating deployment and scaling processes.

●Integrated Identity Provider (IDP) and Identity Access Management (IAM) solutions for centralized authentication and access control.

●Automated infrastructure provisioning and configuration using scripting languages such as Ruby, Python, and Bash.

●Implemented Microservice architecture, breaking down monolithic applications into modular and scalable components.

●Collaborated with cross-functional teams using Agile software development practices, ensuring rapid iteration and delivery.

●Adhered to Project Management Office (PMO) processes, policies, and procedures, ensuring project alignment with organizational goals.

●Wrote code to build a software application that collects data from SQL Server and flat files through a complex ETL process using MuleSoft, and stores it in Azure SQL Data Warehouse, Amazon Redshift, and Azure SQL Database.

●Implemented a one-time data migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL.

●Created and managed bucket policies and lifecycle rules for S3 storage per organizational and compliance guidelines.

●Day-to-day responsibilities included developing ETL pipelines in and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.

●Used Snowpipe for continuous data ingestion from the S3 bucket.

●Cleaned and processed third-party spending data into manageable deliverables in specific formats using Excel macros and Python libraries.

●Analyzed and processed S3 data from the initial stage through the persistence stage using AWS Athena, Glue crawlers, and Glue jobs.

●Implemented AWS services such as EC2, SQS, SNS, IAM, S3, and DynamoDB to deploy multi-tier advertiser applications with fault tolerance, high availability, and auto scaling via AWS CloudFormation.

●Worked on analyzing data integration, data mapping, data profiling, and data warehouse access using SQL and ETL processes.

●Involved in preparing Logical Data Models/Physical Data Models.

●Developed use case diagrams, class diagrams, database tables, and mapping between relational database tables & loading to Hive tables.

●Created Hive-compatible table schemas on top of raw data in the data lake, partitioned by time dimension key and product dimension, and then analyzed and queried them ad hoc using AWS Athena.

●Worked in an Agile (Scrum) methodology, participated in daily scrum meetings, and was actively involved in sprint planning and product backlog creation.

●Worked closely with the SME to get an understanding of the business requirements.

●Responsible for performing cleansing and filtering, and for comparing existing data with the model and database using Excel and data comparison tools.

●Optimized existing data pipelines and maintained all domain-related data pipelines.

●Designed and built efficient and reliable data pipelines to move and transform data (both large and small volumes).

●Used PySpark and Spark to implement data quality checks, data transformation, and data validation processes (a minimal sketch appears at the end of this section).

●Developed and maintained Spark data pipelines for large-scale data processing, reducing processing time.

●Implemented optimizations in Spark jobs, enhancing overall performance and scalability of the data processing infrastructure.
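
As a rough illustration of the PySpark data-quality checks mentioned above, the sketch below validates and transforms a dataset; the paths and column names are hypothetical, not the actual datasets.

    # Illustrative PySpark data-quality check; paths and columns are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dq-checks").getOrCreate()

    df = spark.read.parquet("s3://example-bucket/raw/transactions/")  # placeholder

    # Validation: keep rows with a key and a non-negative amount.
    valid = df.filter(F.col("transaction_id").isNotNull() & (F.col("amount") >= 0))
    rejected = df.subtract(valid)

    # Transformation: normalise the event timestamp and stamp the load date.
    valid = (valid
             .withColumn("event_ts", F.to_timestamp("event_ts"))
             .withColumn("load_date", F.current_date()))

    valid.write.mode("overwrite").parquet("s3://example-bucket/clean/transactions/")
    rejected.write.mode("overwrite").parquet("s3://example-bucket/rejects/transactions/")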

Liberty Mutual - Boston, Massachusetts    Apr 2020 to June 2022

Data Engineer

Responsibilities:

Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services; knowledge of U-SQL.

Monitored cluster health by setting up alerts using Nagios and Ganglia.

Transformed business problems into big data solutions and defined big data strategy and roadmap; installed, configured, and maintained data pipelines.

Primarily involved in data migration using SQL, Azure SQL, Azure Storage, Azure Data Factory, SSIS, and PowerShell.

Created pipelines in ADF using linked services and datasets to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back in the reverse direction.

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Experience building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.

Implemented Copy activities and custom Azure Data Factory pipeline activities.

Hands-on experience with Kafka and Storm on the HDP platform for real-time analysis.

Developed Kafka producers and consumers that efficiently ingested data from various data sources (see the sketch at the end of this section).

Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in various storage formats such as text, JSON, and Parquet; involved in loading data from the Linux file system to HDFS.

Authored Python (PySpark) scripts and custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling, and all cleaning and conforming tasks.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.

Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python, Jupyter Notebook, Hive, and NoSQL.

Created several types of data visualizations using Python (Matplotlib) and Tableau; extracted data using SQL queries to create reports.

Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.

Extensively experienced in deploying, managing and developing MongoDB clusters.

Worked on Hortonworks-HDP distribution.

Used Hortonworks Apache Falcon for data management and pipeline process in the Hadoop cluster.
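
A minimal sketch of the Kafka producer/consumer pattern described above, using the kafka-python client; the broker address and topic name are hypothetical.

    # Hypothetical kafka-python producer/consumer; brokers and topic are placeholders.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    BROKERS = ["broker1:9092"]   # placeholder
    TOPIC = "server-logs"        # placeholder

    # Producer: publish a JSON-encoded event to the topic.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"host": "app01", "level": "INFO", "msg": "started"})
    producer.flush()

    # Consumer: read events back and hand them to the downstream pipeline.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="ingest-workers",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for record in consumer:
        print(record.value)      # replace with the real ingestion step
        break                    # illustration only: stop after one message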

Oracle Cerner - Kansas City, Missouri    Nov 2017 to Mar 2020

Data Engineer

Responsibilities:

Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.

Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.

Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.

Designed, implemented, and deployed a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models within a customer’s existing Hadoop/Cassandra cluster.

Estimated the software and hardware requirements for the NameNode and DataNodes and planned the cluster.

Wrote queries using DataStax Cassandra CQL to create, alter, insert, and delete elements.

Wrote MapReduce programs and Hive UDFs in Java.

Successfully generated consumer group lags from Kafka using its API.

Deployed an Apache Solr/Lucene search engine server to help speed up the search of financial documents.

Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3

Developed Lambda functions and assigned IAM roles to run Python scripts along with various triggers (SQS, SNS)

Developed automated regression scripts in Python for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).

Architected several DAGs (directed acyclic graphs) for automating ETL pipelines.

Created a Lambda deployment function and configured it to receive events from S3 buckets (a minimal handler sketch appears at the end of this section).

Experience with dimensional modelling (star schema, snowflake schema), transactional modelling, and SCDs (slowly changing dimensions).

Files extracted from Hadoop were dropped into S3 on a daily and hourly basis.

Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.

Developed Airflow DAGs in Python by importing the Airflow libraries (see the DAG sketch at the end of this section).

Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.

Defined job workflows as per their dependencies in Oozie.

Played a key role in productionizing the application after testing by BI analysts.

Delivered a POC of Flume to handle real-time log processing for attribution reports.

Maintained system integrity of all sub-components related to Hadoop.

Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.

Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom file system plugin for Hadoop so it can access files on the data platform.

Imported data using Sqoop to load data from Oracle to HDFS regularly.

Written Hive queries for data analysis to meet the business requirements.

Created Hive tables and worked on them using HiveQL; experienced in defining job flows.

Developed Java MapReduce programs for the analysis of sample log files stored in clusters.

Developed simple to complex MapReduce jobs using Hive and Pig.

The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.

Used Sqoop to import data into HDFS and Hive from other data systems.

Set up and benchmarked Hadoop/HBase clusters for internal use.
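
As a minimal sketch of a Lambda function receiving S3 events (as mentioned above), the handler below simply logs each object key; the downstream processing step is a placeholder.

    # Hypothetical AWS Lambda handler triggered by S3 events; the processing
    # step is a placeholder.
    import json
    import urllib.parse

    def lambda_handler(event, context):
        """Log each object created in the bucket that triggered the event."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            print(f"received s3://{bucket}/{key}")   # placeholder for real processing
        return {"statusCode": 200, "body": json.dumps("ok")}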
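
Similarly, a compact sketch of the kind of Airflow DAG referenced above, using Airflow 2-style imports; the dag_id, schedule, and task callables are placeholders.

    # Illustrative Airflow DAG skeleton; dag_id, schedule and callables are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        """Pull the day's files from the source system (stubbed out here)."""
        pass

    def load():
        """Load the extracted files into the warehouse (stubbed out here)."""
        pass

    with DAG(
        dag_id="daily_etl_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task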

IBM - India Sep 2014 to Aug 2017

Software Engineer

Responsibilities:

Utilized Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database to architect scalable and reliable data storage solutions, ensuring seamless data availability and accessibility for downstream analysis.

Analyzed the Hadoop cluster and different big data analytic tools, including Hive, HBase, and Sqoop.

Successfully loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.

Created Hive tables, loaded them with data, and wrote Hive queries.

Developed custom user-defined functions (UDFs) in Hive to transform large volumes of data in line with business requirements.

Designed and maintained data warehousing solutions using Azure Synapse Analytics (formerly SQL Data Warehouse), creating optimized environments for data storage, querying, and reporting.

Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).

Designed and tested disaster recovery strategies using Azure Site Recovery, ensuring data availability and business continuity in case of disruptions.

Responsible for the development of Spark Cassandra connector to load data from flat files to Cassandra for analysis.

Designed solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools: Hive, Pig, Sqoop, Kafka, Python, Spark, Scala, NoSQL, NiFi, and Druid.

Built machine learning models to showcase big data capabilities using PySpark and MLlib (a minimal sketch appears at the end of this section).

Developed Python scripts to find SQL injection vulnerabilities in SQL queries.

Involved in loading data from the edge node to HDFS using shell scripting.

Implemented scripts for loading data from UNIX file system to HDFS.

Provided cluster coordination services through ZooKeeper.

Implement ad-hoc analysis solutions using Azure Data Lake Analytics/Store, HDInsight.

Automated workflow using Shell Scripts.

Built a Kafka REST API to collect events from the front end.

Designed and implemented end-to-end ETL (Extract, Transform, Load) processes using Azure Data Factory, orchestrating data movement and transformation activities across different data sources and destinations.

Implemented real-time data processing solutions using Azure Stream Analytics, enabling the ingestion and analysis of streaming data for immediate insights and action.

Good experience in Hive partitioning, bucketing and performing different types of joins on Hive tables.

Developed Hive Scripts, and Hive UDFs to load data files.

Managed Hadoop jobs using the Oozie workflow scheduler system for MapReduce, Hive, and Sqoop actions.

Used the Oozie workflow engine to run multiple Hive and other Hadoop jobs.

Responsible for developing batch processes using Unix Shell Scripting.
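
For illustration, a compact PySpark MLlib sketch of the kind of model-building mentioned above; the input path, feature columns, and label are hypothetical.

    # Hypothetical PySpark MLlib example; data path, features and label are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    df = spark.read.parquet("hdfs:///data/events/")   # placeholder path

    # Assemble raw columns into a single feature vector.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

    # Fit a simple binary classifier and report area under the ROC curve.
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    predictions = model.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
    print(f"AUC = {auc:.3f}")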


