
Data Engineer

Location:
Dallas, TX
Posted:
February 14, 2024

Resume:

SAI KIRAN NANCHERLA (Senior Data Engineer)

PHONE: 303-***-****

EMAIL: ad3mo9@r.postjobfree.com

PROFESSIONAL SUMMARY:

** years of IT experience as a Data Engineer with a focus on Data Analysis and Data Engineering using the Hadoop and Spark frameworks.

Familiarity with Agile (Scrum) software development process and SDLC.

Expertise in Data Architecture, Data Analytics, and advanced Data processing.

Skilled in Python packages such as Pandas, NumPy, scikit-learn, SciPy, and Matplotlib.
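
As an illustration of this kind of Pandas/NumPy work, below is a minimal data-cleansing sketch; the file name, column names, and rules are hypothetical and not taken from any specific project.

```python
import numpy as np
import pandas as pd

# Hypothetical input file and columns, used only for illustration.
df = pd.read_csv("transactions.csv")

# Drop exact duplicates and rows missing the primary key.
df = df.drop_duplicates().dropna(subset=["transaction_id"])

# Coerce types and normalize text fields.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["customer_name"] = df["customer_name"].str.strip().str.title()

# Replace an assumed sentinel value with NaN, then impute with the column median.
df["amount"] = df["amount"].replace({-999: np.nan})
df["amount"] = df["amount"].fillna(df["amount"].median())

df.to_parquet("transactions_clean.parquet", index=False)
```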

Experienced with Big Data components such as Hadoop (HDFS, MapReduce, Yarn), Pig, Hive, Spark (Streaming, SQL, RDD), Sqoop, Oozie, Kafka, and ML Algorithms.

Hands-on experience with Cloudera CDH and Hortonworks HDP for Hadoop cluster management.

Knowledgeable in Spark development using Scala for performance comparison.

Worked with Azure cloud services, including Databricks, Data Factory, Machine Learning, SQL, and Data Lake.

Managed GCP's Kubernetes Engine to orchestrate containerized applications, improving scalability and resource utilization across multiple projects.

Expertise in managing Azure Data Lakes and integration with other Azure services.

Proficient in the Amazon Web Services (AWS) Cloud Platform, handling EC2, S3, Redshift, DynamoDB, and more.

Proficient in JIRA for project management and issue tracking.

Implemented and customized qTest configurations to meet specific project and organizational requirements.

Migrated data using Sqoop between HDFS and Relational Database Systems.

Proficient in branching and merging workflows within Bitbucket, ensuring a streamlined and organized development process.

Proficient in Hadoop ecosystem tools like MapReduce, HDFS, Pig, Hive, Kafka, Yarn, Sqoop, Storm, Oozie, HBase, and Zookeeper.

Experience in implementing scalable cloud-based web applications using AWS and GCP.

Solid understanding of Hadoop architecture and ecosystem components.

Familiar with Spark architecture, including Core, SQL, Data Frames, Streaming, MLLib, and RDD.

Experience with dbt and Airflow, including conditional tasks and trigger rules for joining branches.
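
By way of example, here is a minimal Airflow DAG sketch with a branch and a join task whose trigger rule lets it run even when one branch is skipped; the DAG, task names, and branching rule are assumptions (Airflow 2.x API), not the original pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch(**_):
    # Hypothetical rule: full load on the first of the month, otherwise incremental.
    return "full_load" if datetime.utcnow().day == 1 else "incremental_load"


with DAG(
    dag_id="elt_branching_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="choose_branch", python_callable=choose_branch)
    full_load = EmptyOperator(task_id="full_load")
    incremental_load = EmptyOperator(task_id="incremental_load")

    # Join point: runs when at least one upstream branch succeeded and none failed,
    # even though the branch not chosen is skipped.
    join = EmptyOperator(task_id="join", trigger_rule="none_failed_min_one_success")

    branch >> [full_load, incremental_load] >> join
```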

Designed and implemented applications using microservices architecture on Pivotal Cloud Foundry (PCF), enhancing scalability and maintainability.

Proficient in designing and implementing solutions on Azure Cloud infrastructure, with a focus on Azure DevOps, Azure Data Factory.

Hands on experience in creating, modifying, and maintaining Terraform scripts for provisioning and managing infrastructure on platforms such as AWS, Azure, or Google Cloud.

Knowledgeable in NoSQL databases like HBase, Cassandra, and MongoDB for writing applications.

Experience in data pipeline development using Flume, Sqoop, and Pig for weblogs extraction and storage in HDFS.

Implemented and maintained security roles and permissions in MS Purview CRM to safeguard sensitive customer data.

Proficient in data manipulation and cleansing using Python scripts.

Provisioned Azure/GCP resources for Data Engineering and Data Science projects.

Extensive experience in importing and exporting data using Flume and Kafka stream processing platforms.

Familiar with dimensional (fact/dimension) modeling, including Star and Snowflake schemas, and ETL processes from various sources.

Proficient in Python, UNIX, and shell scripting.

Utilized SonarQube for comprehensive code quality analysis, identifying and addressing code smells, bugs, and security vulnerabilities.

Expertise in Data Warehousing ETL concepts using Informatica PowerCenter, OLAP, OLTP, and AutoSys.

TECHNICAL Skills:

Languages

Bash Script, Scala, Python, shell scripting, SQL

Cloud Technologies

AWS, Azure, AWS EMR, Glue, RDS, Kinesis, DynamoDB, Redshift Cluster, GCP (Google cloud platform)

MS Azure Services

Azure Databricks, Azure Data Factory (ADF), Azure SQL, Azure Data Lake

AWS Services

EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES

Big Data Technologies

HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, NiFi, Airflow, Flume, Snowflake

Hadoop Frameworks

Cloudera CDH, Hortonworks HDP

Databases

Oracle, MySQL, MS SQL Server, DB2

NoSQL Databases

HBase, Cassandra, MongoDB

Modelling Schemas

Star Schema, Snowflake Schema

BI Tools

Tableau, Power BI

Operating System

Windows, Linux

Methodologies

Agile, Waterfall

Streaming Tools

Kafka, RabbitMQ

ETL Tools

Talend, Apache Spark, Azure DevOps, dbt, Airflow, Microsoft Dynamics, Microsoft SQL Server Integration Services (SSIS), Hadoop, AWS Data Pipeline

Senior Data Engineer (April 2021 to Present)

Bank of the West, San Francisco, CA

Responsibilities:

Developed and maintained data pipelines in AWS to ingest, transform, and load data from various sources into a data warehouse.

Wrote complex SnowSQL scripts in the Snowflake cloud data warehouse for business analysis and reporting.

Developed/maintained the data pipeline to ingest streaming and transactional data across different data sources using Spark, Kafka, Redshift, S3, Java and Python

Created graphs, which are representations of data transformation processes, with Ab Initio GDE.

Worked with data stewardship and analytics teams to migrate existing on-prem data pipelines to GCP using cloud-native tools such as GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Cloud Composer, PySpark, Python, Dataproc, and BigQuery.

Strong proficiency with Scala, primarily on Spark (Hadoop).

Implemented Agile methodologies, such as Scrum or Kanban, to enhance project efficiency and adaptability.

Proficient in using AWS services such as S3, Glue, EMR, Redshift, Lambda, and more to build scalable data solutions

Redesigned cloud-based data warehouse to enhance security and improve performance.

Designed and implemented ETL processes using AWS Glue, Python, and SQL to ensure data quality and consistency. Proficient in using qTest for test case management, execution, and tracking.
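
To illustrate the AWS Glue side of this work, a minimal Glue job sketch in Python is shown below; the catalog database, table, quality rules, and S3 path are placeholders rather than the actual job.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw records from the Glue Data Catalog (hypothetical database/table).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders")

# Simple quality rules: drop rows without a key and remove duplicates.
clean_df = orders.toDF().dropna(subset=["order_id"]).dropDuplicates(["order_id"])

# Write the curated output back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(clean_df, glue_context, "clean_orders"),
    connection_type="s3",
    connection_options={"path": "s3://curated-bucket/orders/"},
    format="parquet",
)
job.commit()
```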

Proficient in writing Infrastructure as Code (IaC) using Terraform to automate and manage cloud resources.

Experience in creating and managing JIRA projects, workflows, and boards.

Hands-on experience in deploying, managing, & scaling applications on the Pivotal Cloud Foundry (PCF) platform.

Monitored and optimized CI/CD pipelines for performance, reliability, and efficiency.

Experience in defining and managing infrastructure configurations using Terraform modules for increased reusability and maintainability.

Configured and customized SonarQube rules to align with specific project requirements and coding standards.

Designed and developed ThoughtSpot's mobile application, enhancing user engagement and experience through intuitive interface design and efficient navigation.

Used Erwin to design physical models for the subject areas.

Demonstrated expertise in organizing and tracking project requirements, user stories, and tasks within the ALM system. Actively contributed to the improvement of ALM processes by leveraging the capabilities of Client-ALM.

Strong programming skills in Python and SQL, with experience in big data technologies such as Apache Spark and Microsoft SQL Server.

Experienced in working with the Spark ecosystem using Scala and Hive queries on different data formats such as text files and Parquet.

Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to data scientists for further analysis.

Demonstrated proficiency in programming languages such as Python or Scala within the Databricks environment, enabling efficient data manipulation and analysis

Implemented Google Cloud Platform (GCP) services to optimize infrastructure, resulting in a 30% reduction in operational costs.

Developed Spark jobs in Python to process JSON data.

Developed and maintained data ingestion processes using AWS Lambda functions and Kinesis streams.
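
A minimal sketch of a Lambda handler consuming a Kinesis stream, in the spirit of the bullet above; the payload format and the downstream step are assumptions.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode Kinesis records and parse their JSON payloads."""
    records = []
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded under record["kinesis"]["data"].
        payload = base64.b64decode(record["kinesis"]["data"])
        records.append(json.loads(payload))

    # Downstream handling (e.g., writing to S3 or Firehose) would go here.
    print(f"Processed {len(records)} records")
    return {"processed": len(records)}
```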

Conducted performance tuning and troubleshooting of AWS services to minimize downtime and improve system reliability.

Designed and implemented data lake architecture in AWS S3 to store and manage raw data efficiently.

Experienced in Python, Scala, SQL, and other programming languages commonly used with Databricks.

Implemented Power BI gateways and established data refresh schedules to ensure real-time data availability for stakeholders.

Skilled in customizing JIRA configurations to align with project requirements.

Developed Snowflake stored procedures implementing branching and looping logic.

Designed, executed, and monitored data integration processes in a scalable and flexible manner with IICS.

Conducted data analysis, data profiling, and data cleansing to prepare data sources for Tableau.

Supported tasks such as data integration, data quality, data synchronization, and application integration in a cloud environment with IICS

Utilized the Spark SQL API in PySpark to extract and load data and run SQL queries, and used other Spark features such as accumulators, broadcast variables, and different levels of caching to optimize Spark jobs.
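
A short PySpark sketch of these techniques (an accumulator, a broadcast join, caching, and a Spark SQL aggregation); the table names and paths are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark_sql_optimizations").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
events = spark.read.parquet("s3://example-bucket/events/")
countries = spark.read.csv("s3://example-bucket/countries.csv", header=True)

# Accumulator counting rows that fail a simple validation rule.
bad_rows = spark.sparkContext.accumulator(0)
events.filter(F.col("country_code").isNull()).foreach(lambda _: bad_rows.add(1))

# Broadcast the small dimension to avoid a shuffle; cache the reused result.
enriched = events.join(F.broadcast(countries), "country_code").cache()
enriched.createOrReplaceTempView("enriched_events")

daily = spark.sql("""
    SELECT country_name, to_date(event_ts) AS event_date, COUNT(*) AS events
    FROM enriched_events
    GROUP BY country_name, to_date(event_ts)
""")
daily.write.mode("overwrite").parquet("s3://example-bucket/aggregates/daily_events/")
print("rows failing validation:", bad_rows.value)
```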

Used Snowpipe for continuous data ingestion from the S3 bucket.
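
A hedged sketch of how such a pipe could be declared through the Snowflake Python connector; the connection parameters, stage, storage integration, and table are assumptions, and auto-ingest additionally requires the S3 event notification to be wired to the pipe's queue.

```python
import snowflake.connector

# Placeholder connection parameters and object names.
conn = snowflake.connector.connect(
    account="xy12345", user="ETL_USER", password="***",
    warehouse="LOAD_WH", database="RAW", schema="SALES",
)
cur = conn.cursor()

# External stage over the S3 landing prefix (storage integration assumed to exist).
cur.execute("""
    CREATE STAGE IF NOT EXISTS sales_stage
      URL = 's3://landing-bucket/sales/'
      STORAGE_INTEGRATION = s3_int
      FILE_FORMAT = (TYPE = 'JSON')
""")

# Pipe that auto-ingests new files arriving on the stage into the raw table.
cur.execute("""
    CREATE PIPE IF NOT EXISTS sales_pipe AUTO_INGEST = TRUE AS
      COPY INTO raw_sales FROM @sales_stage
""")
```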

Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.

Proficient in building data pipelines, ETL processes, and data transformations using PySpark.

Created Tableau reports with complex calculations and worked on Ad-hoc reporting using Power BI

Coordinated with business users through JRD meetings to outline business requirements for OLAP system

Successfully orchestrated and managed multi-tier applications by designing and implementing Terraform configurations.

Hands on experience in building and architecting multiple data pipelines, end to end ETL and ELT process for Data ingestion, transformation in GCP.

Created and managed AWS EMR clusters for processing large-scale data using Apache Spark and Hadoop.

Optimized Tableau visualizations for performance, including data extracts and aggregation.

Used error-handling techniques such as TRY...CATCH blocks and error functions to create robust T-SQL code.

Utilized dynamic SQL for user customizable queries to be answered by the OLTP Server

Monitored and optimized AWS Redshift clusters for query performance and cost efficiency.

Performed Forward engineering, Reverse engineering, and applied naming standards in Erwin.

Developed MapReduce jobs for data cleaning and manipulation, and worked on migrating data from existing RDBMSs (Oracle and SQL Server) to Hadoop using Sqoop for processing.

Developed solutions for importing and exporting data from Teradata and Oracle to HDFS and S3, and from S3 to Snowflake.

Designed complex T-SQL queries and user defined functions in SQL server

Expertise in working with big data frameworks, such as Apache Hadoop and Apache Spark.

Developed solution tools and identified opportunities for process improvements using Informatica and Python.

Successfully executed data synchronization projects, maintaining data consistency and integrity across various platforms using Informatica Intelligent Cloud Services (IICS).

Handled data schema design and development, built ETL pipelines in Python and MySQL stored procedures, and automated them using Jenkins.

Tools & Environment: SQL Server, SSDT, T-SQL, SSIS, ETL, SQL, SSMS, Power BI, Databricks, Snowflake, AWS

Senior Data Engineer (January 2019 to February 2021)

Blue Cross Blue Shield, Chicago, Illinois

Responsibilities:

Gathered requirements and implemented a framework to transform customer-related data from existing Excel files to SQL.

Experience in Database Design and development with Business Intelligence using SQL Server 2014/2016 Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema and Snowflake Schema

Designed and built data processing applications with Ab Initio GDE.

Created graphs, which are representations of data transformation processes, with Ab Initio GDE.

Used BigQuery and Dataproc to extract and deliver meaningful insights to stakeholders.

Integrated Bitbucket with continuous integration and continuous deployment (CI/CD) pipelines, automating the software build and release processes.

Collaborated with cross-functional teams to ensure effective use of JIRA for project communication.

Used SonarQube to detect and eliminate code duplications, improving code maintainability and reducing redundancy.

Worked with Ab Initio for specialized data processing, data management, and analytics.

Implemented authentication and authorization mechanisms, such as OAuth, to ensure secure access to web APIs and protect sensitive data.

Utilized Checkmarx for static application security testing (SAST) to identify and remediate security vulnerabilities in the source code.

Extensively worked on Spark using Scala on a cluster for analytics; installed Spark on top of Hadoop and performed advanced analytical work using Spark with Hive and SQL.

Developed ETL processes using PySpark to clean, transform, and load data into data lakes and data warehouses.

Worked in the area of Google Cloud (GCP) strategy and operating model transformation, cloud development & integration, cloud migration, and cloud infrastructure & managed services.

Designed Star and Snowflake data models for the enterprise data warehouse using Erwin.

Created an enterprise data dictionary and maintained standards documentation.

Implemented ETL processes to streamline the import of data from various sources into the BigQuery warehouse.

Designed end-to-end ETL strategy for loading data from OLTP systems (XML, SQL OLTP) to OLAP system utilizing SSIS

Proactively identified and mitigated impediments to Agile delivery, ensuring a smooth and efficient development process.

Optimized Azure SQL Data Warehouse performance by implementing partitioning strategies, reducing query execution times

Implemented and optimized ETL processes using Databricks, resulting in a 30% reduction in data processing time.

Proficient in Teradata SQL, performance tuning, and query optimization.

Designed, executed, and monitored data integration processes in a scalable and flexible manner with IICS.

Developed custom Azure Functions in Python for data validation and transformation, reducing data errors.
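
As a sketch of this kind of validation function (Azure Functions Python v1 programming model with an HTTP trigger), the snippet below omits the function.json binding configuration, and the field names and rules are hypothetical.

```python
import json

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    """Validate an incoming record before it is forwarded downstream."""
    try:
        record = req.get_json()
    except ValueError:
        return func.HttpResponse("Request body must be JSON", status_code=400)

    errors = []
    if not record.get("member_id"):
        errors.append("member_id is required")
    if record.get("claim_amount", 0) < 0:
        errors.append("claim_amount must be non-negative")

    if errors:
        return func.HttpResponse(json.dumps({"errors": errors}),
                                 status_code=400, mimetype="application/json")
    return func.HttpResponse(json.dumps({"status": "ok"}),
                             status_code=200, mimetype="application/json")
```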

Implemented real-time data processing solutions using PySpark Streaming and structured streaming.
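
A minimal PySpark Structured Streaming sketch that reads from Kafka and appends Parquet to a lake path; the broker, topic, schema, and paths are placeholders rather than the production job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("claims_stream").getOrCreate()

# Assumed message schema for the hypothetical "claims" topic.
schema = StructType([
    StructField("claim_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "claims")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("c"))
          .select("c.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://example-lake/claims/")
         .option("checkpointLocation", "s3://example-lake/_checkpoints/claims/")
         .outputMode("append")
         .start())
query.awaitTermination()
```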

Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.

Supported tasks such as data integration, data quality, data synchronization, and application integration in a cloud environment with IICS

Designed and implemented Teradata database schemas, tables, views, and stored procedures.

Administered and maintained Teradata database systems, ensuring data integrity, availability, and security.

Used SQL to clean and aggregate data from relational database to generate status reports and dashboards

Designed a retry framework to restart SSIS jobs after a configurable time.

Implemented a real-time data ingestion system using MS Purview Event Hubs and Azure Databricks, enabling instant data analysis and visualization.

Designed SSRS reports utilizing diverse types of charts and graphs for trend analysis on periodic segments

Created Session Beans and controller Servlets for handling HTTP requests from Talend

Strong background in deploying and managing Azure Virtual Machines (VMs), Azure Storage, and Azure Networking components

Implemented data quality measures and error handling processes within IICS to ensure accurate and reliable data integration.

Implemented robust security measures on GCP, ensuring compliance with industry standards and safeguarding sensitive data against potential threats.

Performed Data Visualization and Designed Dashboards with Tableau and generated complex reports including charts, summaries, and graphs to interpret the findings to the team and stakeholders

Utilized the Spark SQL API in PySpark to extract and load data and run SQL queries, and used other Spark features such as accumulators, broadcast variables, and different levels of caching to optimize Spark jobs.

Assisted with testing and quality assurance activities with data extracts, research, and analysis

Handled data schema design and development, built ETL pipelines in Python and MySQL stored procedures, and automated them using Jenkins.

Skilled in integrating Terraform with CI/CD pipelines to achieve continuous delivery and deployment.

Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, Flume, Oozie, Zookeeper, and Sqoop.

Extensively worked with data source types such as SQL Server, Teradata, flat files, JSON, CSV, and GZIP.

Facilitated data requirement meetings with business and technical stakeholders and resolved conflicts to drive decisions

Used various sources to pull data into Power BI such as SQL Server, Excel, Oracle, SQL Azure, etc

Designed the ETL strategy from source to destination utilizing a staging area for data profiling and cleaning

Created SQL statements with joins, sub queries, and correlated sub queries as a part of business requirements

Created physical & logical models and used Erwin for dimensional data modeling

Wrote complex SQL queries utilizing scalar and table variables to hold user inputs and perform business logic.

Prepared various SSRS report templates adhering to reporting standards for report development.

Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in real time and persists it to Cassandra.

Deployed and maintained Azure Virtual Machines, including Windows and Linux instances, using best practices for security and performance.

Tools & Environment: T-SQL, ETL - SSIS, SQL Server, SSRS, SSMS, Excel, Snowflake, Dataproc, Databricks, BigQuery

Data Engineer (December 2016 to December 2018)

Alcon’s Lab, Fort Worth, TX

Responsibilities:

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala

Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming from various streaming sources such as Kafka, Flume, and JMS.

Proficient in using GCP services such as Cloud Storage, Dataflow, BigQuery, Dataprep, and more to develop scalable data solutions.

Implemented CI/CD pipelines on Google Cloud Platform (GCP) using tools such as Jenkins, GitLab CI, or Cloud Build.

Expertise in generating and analyzing JIRA reports and dashboards for project tracking and performance evaluation.

Wrote SQL views for UI and downstream consumers to send data (both normalized and de-normalized views).

Proficient in model versioning based on releases in the Erwin model mart.

Developed data mapping, data governance, and transformation and cleansing rules as part of transferring data from source to target.

Designed and implemented data pipelines on GCP using Dataflow and Apache Beam to extract, transform, and load data from various sources into BigQuery
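
For illustration, a minimal Apache Beam (Python SDK) pipeline of this shape; the project, subscription, and table names are assumptions, and running it on Dataflow would additionally require the DataflowRunner and a temp location.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, subscription, and table identifiers.
options = PipelineOptions(streaming=True, project="my-gcp-project", region="us-central1")

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-gcp-project/subscriptions/orders-sub")
     | "ParseJson" >> beam.Map(json.loads)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-gcp-project:analytics.orders",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```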

Proven track record of designing and optimizing ETL processes for large-scale data transformations.

Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.

Developed and executed comprehensive user feedback mechanisms, fostering a user-centric approach that significantly contributed to a 15% increase in customer satisfaction on ThoughtSpot.


Implemented data security measures and access controls to ensure data protection and compliance with industry standards.

Stayed informed about industry best practices and emerging technologies related to web APIs, ensuring the adoption of the latest advancements in API development.

Developed MapReduce jobs for data ingestion, transformation, and aggregation.
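
The original jobs are not reproduced here; as a stand-in, below is a small Hadoop Streaming style mapper/reducer in Python (field positions and the aggregation are assumptions), which would be wired up through the hadoop-streaming jar's -mapper and -reducer options.

```python
#!/usr/bin/env python3
"""Hadoop Streaming sketch: sum an amount per customer from tab-delimited input."""
import sys


def mapper():
    # Emit (customer_id, amount) pairs; field positions are assumed.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            print(f"{fields[0]}\t{fields[2]}")


def reducer():
    # Input arrives grouped and sorted by key; sum the amounts per customer.
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if current_key is not None and key != current_key:
            print(f"{current_key}\t{total}")
            total = 0.0
        current_key = key
        try:
            total += float(value)
        except ValueError:
            continue
    if current_key is not None:
        print(f"{current_key}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```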

Conducted performance tuning and troubleshooting of GCP services to minimize downtime and enhance system reliability.

Experienced in generating and documenting metadata while designing OLTP and OLAP system environments.

Designed tables and implemented naming conventions for logical and physical data models in Erwin.

Developed analytical components using Scala, Spark, and Spark Streaming.

Developed UDFs in Java for Hive and Pig, and worked on reading multiple data formats on HDFS using Scala.

Used the Oozie scheduler to automate pipeline workflows and orchestrate the MapReduce jobs that extract data.

Worked with the Hue GUI for easy job scheduling, file browsing, job browsing, and Metastore management.

Expertise in using Bitbucket for version control, managing source code repositories.

Assisted in data quality assessments and data validation using Tableau.

Implemented Terraform best practices, such as state management, variable usage, and efficient resource provisioning.

Created POC to store Server Log data in MongoDB to identify System Alert Metrics

Mentored and trained junior team members, contributing to their growth and expertise in GCP data engineering.

Collaborated with data engineers to ingest and prepare data for MapReduce jobs.

Developed PySpark jobs for data ingestion, transformation, and aggregation.

Developed and maintained data ingestion processes using GCP Dataflow and Pub/Sub for real-time data processing.

Worked with Hadoop Distributed File System (HDFS) and YARN for distributed data storage and processing

Conducted performance tuning and optimization of PySpark applications for efficiency.

Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop

Integrated and migrated data from external sources such as Snowpipe, Snowflake Data Loader, and other ETL tools like Talend, Informatica, and Alteryx.

Automated all the jobs, for pulling data from FTP server to load data into Hive tables, using Oozie workflows

Involved in creating Hive tables, working on them using HiveQL, and performing data analysis using Hive and Pig.

Provided ongoing support and troubleshooting for MS Purview CRM, addressing user issues and ensuring system stability.

Used QlikView and D3 for visualization of query required by BI team

Defined UDFs using PIG and Hive to capture customer behavior

Monitored and optimized GCP data services for query performance and resource utilization.

Created Hive external tables on the MapReduce output and applied partitioning and bucketing to them.
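
A sketch of how such a table could be declared, shown here through the PyHive client against HiveServer2; the library, host, table, columns, bucket count, and location are illustrative assumptions.

```python
from pyhive import hive  # assumes the PyHive client and a reachable HiveServer2

# Placeholder connection and object names.
conn = hive.Connection(host="hs2-host", port=10000, username="etl_user")
cur = conn.cursor()

# External table over MapReduce output already in HDFS, partitioned by load date
# and bucketed by customer id for cheaper joins and sampling.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.customer_activity (
        customer_id STRING,
        page        STRING,
        duration_s  INT
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
    LOCATION '/data/output/customer_activity'
""")

# Register a partition produced by the upstream job.
cur.execute("""
    ALTER TABLE analytics.customer_activity
    ADD IF NOT EXISTS PARTITION (load_date = '2018-06-01')
""")
```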

Configured HiveServer2 (HS2) to enable analytical tools such as Tableau, QlikView, and SAS to interact with Hive tables.

Tools & Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, Dataproc, Zookeeper, Teradata, PL/SQL, MySQL, HBase, DataStage, ETL (Informatica/SSIS)

Education Details:

Bachelor's: Anurag Group of Institutions (Electrical and Electronics Engineering)

Master's: University of North Texas (majoring in Data Science)


