Data Engineer Analyst

Location:
TX-114, TX
Salary:
$65
Posted:
May 12, 2023

Resume:

Ranga Gudisha

Data Engineer

Phone: 469-***-**** Email: *******.*****@*****.***

Professional Summary:

Over 8 years of total IT experience, including data analysis, data modeling, big data, Hadoop, and web application development.

Strong skills in developing applications involving big data technologies such as Hadoop, Spark, MapReduce, Hive, Pig, Kafka, and MongoDB.

Hands-on experience writing PySpark and Scala scripts and working with AWS cloud services (EC2, S3, RDS, Redshift, Data Pipeline, EMR, DynamoDB, Lambda, SNS, SQS).

Experienced working with large volumes of data using Teradata SQL and Base SAS programming.

Experienced in extracting data from mainframe flat files and loading it into Teradata tables using SAS PROC IMPORT, PROC SQL, etc.

Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.

Built an end-to-end real-time data pipeline consisting of four microservices on top of Apache Kafka for data processing.

Excellent understanding of and hands-on experience with NoSQL databases such as Cassandra, MongoDB, and HBase.

Analyzed, designed, developed, implemented, and maintained parallel jobs using IBM InfoSphere DataStage.

Developed best practices, standards, and methodologies to assist in the implementation and execution of data governance.

Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).

Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure, created Databricks workspaces for business analytics, managed Databricks clusters, and managed the machine learning lifecycle.

Involved in building the data warehouse, including the design of data marts using a star schema, and created the source and target definitions in SSIS.

3+ years of professional experience with Informatica Axon and EDC.

Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.

Responsible for ingesting large volumes of IoT data into Kafka.

Involved in building an application based primarily on Grails (Groovy), HTML, CSS, REST services, JavaScript, Spring, Maven, and Hibernate.

Ran Apache Hadoop, CDH, and MapR distributions as Elastic MapReduce (EMR) clusters on EC2.

Designed and built CI/CD pipelines for Google Cloud Platform (GCP) services: BigQuery, Dataflow, Pub/Sub, Data Fusion, and others.

Worked on Apache Flink to implement transformations on data streams for filtering, aggregating, and updating state.

Adept knowledge and experience in mapping source to target data using IBM DataStage 8.x.

Experience in Python, Scala, C, and SQL, with a thorough understanding of distributed systems architecture and parallel processing frameworks.

Experience in creating the methodologies and technologies that depict the flow of data within and between application systems and business functions/operations, including data flow diagrams.

Experience in integrating Talend Open Studio with Hadoop, Hive, Spark, and MySQL.

Strong experience with Groovy, Hibernate/GORM, Jenkins, and the Spring framework.

Constructed and manipulated large datasets of structured, semi-structured, and unstructured data and supported systems application architecture using tools such as SAS, SQL, Python, R, Minitab, and Power BI to extract multi-factor interactions and drive change.

Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.

Experience with container-based deployments using Docker.

Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.

Developed multiple MapReduce jobs in Java to clean datasets.

Distributed database design, data modeling, development, and support on the DataStax Cassandra distribution.

4+ years of experience in ETL development using tools such as SnapLogic, DataStage, and Informatica.

Created and implemented Azure Data Factory Data Flow ETL and pipelines.

Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.

Designed and developed a data pipeline to collect data from multiple sources and ingest it into a Hadoop/Hive data lake using Talend Big Data and Spark.

Strong experience in migrating other databases to Snowflake.

Provided expertise and hands-on experience working with Kafka Connect using the schema registry in a very high-volume environment (~900 million messages).

Provided expertise in Kafka brokers, ZooKeeper, KSQL, Kafka Streams, and Confluent Control Center.

Provided expertise and hands-on experience working with Avro, JSON, and String converters.

Experience in ETL development using tools such as DataStage, Informatica, and SnapLogic.

Supervised the development and implementation of comprehensive project and work plans for efforts within the data governance platform.

Provided expertise and hands-on experience working with Kafka connectors such as MQ, Elasticsearch, JDBC, FileStream, and JMS source connectors, as well as tasks, workers, converters, and transforms.

Good understanding of data warehousing (DWH) and ETL concepts, including ETL loading strategies.

Experience in creating PySpark scripts and Spark Scala JARs using the IntelliJ IDE and executing them.

Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (Content Delivery Network) to deliver content from GCP cache locations, drastically improving user experience and latency.

Developed ETL processes in support of the 340B pharmaceutical management product with SSIS, SQL stored procedures, and C# scripting.

Involved in creating SSIS jobs to automate report generation and cube refresh packages.

Experience in Big Data and Apache Hadoop ecosystem components such as MapReduce, HDFS, Sqoop, Flume, Spark, Spark Streaming, Pig, Hive, HBase, Impala, Oozie, Kafka, and ZooKeeper.

Strong knowledge of Spark Streaming and Kafka.

Implemented pipelines using PySpark and Talend Spark components.

Experience in troubleshooting Spark/MapReduce jobs.

Full life cycle of data lake and data warehouse development with big data technologies such as Spark, Hadoop, and Cassandra.

Built reports for monitoring data loads into GCP to drive reliability at the site level.

Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow

Experience with event-driven and scheduled AWS Custom Lambda (Python) functions to trigger various AWS resources.
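
A minimal sketch of the kind of event-driven Lambda function described above, assuming a hypothetical S3-triggered handler that kicks off a downstream Glue job; the job name, argument names, and event handling are illustrative rather than the actual production code.

    import boto3  # AWS SDK for Python

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        """Triggered by an S3 put event; starts a downstream Glue ETL job."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Hand the newly arrived object to the ETL job as a job argument
            glue.start_job_run(
                JobName="example-etl-job",  # hypothetical job name
                Arguments={"--source_path": f"s3://{bucket}/{key}"},
            )
        return {"status": "started"}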

Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.

Demonstrated expertise utilizing ETL tools such as Talend Data Integration and SQL Server Integration Services (SSIS); developed slowly changing dimension (SCD) mappings using the Type I, Type II, and Type III methods.
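
To illustrate the Type II pattern referenced above, here is a minimal sketch in PySpark rather than Talend/SSIS: the current dimension row is expired and a new row is opened when a tracked attribute changes. The table and column names (customer_dim, customer_stg, address, eff_start, eff_end, is_current) are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

    cur = spark.table("customer_dim").filter(F.col("is_current") == 1)  # current dimension rows
    stg = spark.table("customer_stg")                                   # incoming changes

    # Rows whose tracked attribute differs between the dimension and staging
    changed = cur.join(stg, cur["customer_id"] == stg["customer_id"]) \
                 .filter(cur["address"] != stg["address"])

    # Type II step 1: close out the existing version of each changed row
    expired = (changed.select(cur["*"])
               .withColumn("eff_end", F.current_date())
               .withColumn("is_current", F.lit(0)))

    # Type II step 2: open a new version carrying the changed attributes
    opened = (changed.select(stg["*"])
              .withColumn("eff_start", F.current_date())
              .withColumn("is_current", F.lit(1)))

    expired.unionByName(opened, allowMissingColumns=True).show()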

Experienced in Microsoft Business Intelligence tools, developing SSIS (Integration Services), SSAS (Analysis Services), SSRS (Reporting Services), and OLAP cubes.

Expertise in ingesting, processing, exporting, and analyzing terabytes of structured and unstructured data on Hadoop clusters in the banking, insurance, and technology domains.

Experience with integration platforms such as SnapLogic.

Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.

Experience in developing and scheduling ETL workflows in Hadoop using Oozie.

Experience in developing Python ETL jobs that run on AWS Glue and integrate with enterprise systems such as enterprise logging and alerting, configuration management, and build and versioning infrastructure.
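
A minimal sketch of a Python Glue job of the sort described above; the catalog database, table name, and output path parameter are hypothetical placeholders rather than the actual job.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Resolve job parameters passed in by the Glue runtime (names are illustrative)
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "output_path"])

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (hypothetical names)
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table")

    # Light transformation, then write the result back to S3 as Parquet
    df = source.toDF().dropDuplicates()
    df.write.mode("overwrite").parquet(args["output_path"])

    job.commit()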

Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).

Expertise in configuring monitoring and alerting tools such as AWS CloudWatch according to requirements.

Good experience with Apache Spark, Spark SQL, and Spark Streaming using the Scala and Python programming languages.

Ability to work with clients to understand requirements and envision data ingestion solutions.

Technical Skills:

Operating Systems: Windows, Mac OS

Hadoop Ecosystem: Hadoop, MapReduce, HDFS, ETL, Hive, Pig

Programming Languages: Python, C#, Scala, PySpark, R

Web Technologies: HTML, CSS, JavaScript, jQuery

Databases: Oracle PL/SQL, MySQL, SQL Server, Teradata, MongoDB

Scheduling Tools: Control-M, Apache Airflow, Autosys

Cloud: Snowflake, AWS (S3, EC2, Redshift, Glue, Lambda, SNS, SQS)

IDEs: Eclipse, Visual Studio, IDLE, IntelliJ, Spyder

Web Services: RESTful, SOAP

Analytics Tools: Power BI, Microsoft SSIS, SSAS, SSRS

Methodologies: Agile, Design Patterns

Solisystems Inc., Allen, TX Apr 2022 – Present

Sr. Data Engineer

Responsibilities:

Worked on AWS Redshift and RDS, implementing data models and loading data into both.

Worked on optimizing and tuning the Teradata views and SQL to improve the performance of batch and response time of data for the users.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Expertise in extracting, transforming, and loading data from Oracle, DB2, SQL Server, MS Access, Excel, flat files, and XML using Informatica and Talend.

Worked in coordination with the Customer Analytics team to improve the turnaround time of customer communication by implementing predictive modeling.

Provided expertise and hands-on experience building custom connectors using Kafka core concepts and APIs.

Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure, created Databricks workspaces for business analytics, managed Databricks clusters, and managed the machine learning lifecycle.

Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.

Collaborated with business stakeholders to develop and document policies and procedures surrounding data governance.

Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Involved in creating SSIS jobs to automate report generation and cube refresh packages.

Ran Apache Hadoop, CDH, and MapR distributions as Elastic MapReduce (EMR) clusters on EC2.

Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.

Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.

Strong experience architecting highly performant databases using PostgreSQL, PostGIS, MySQL, and Cassandra.

Designed and developed SSIS packages to move data from various sources into destination flat files and databases.

Good knowledge of other ETL tools such as Pentaho Kettle, SnapLogic, and Ab Initio.

Troubleshot previous SSIS packages to resolve issues and performance-tuned existing scripts and processes.

Used Flink streaming pipelines on the Flink engine to process data streams and deploy new APIs, including the definition of flexible windows.

Good working knowledge of integration tools such as SnapLogic and Oracle Warehouse Builder.

Adept knowledge and experience in mapping source to target data using IBM DataStage 8.x.

Created stubs for producers, consumers, and consumer groups to help onboard applications from different languages/platforms; leveraged Hadoop ecosystem knowledge to design and develop solution capabilities using Spark, Scala, Python, Hive, Kafka, and other components of the Hadoop ecosystem.

Automated and scheduled UNIX shell scripts and Informatica sessions and batches using Autosys.

Strong experience in migrating other databases to Snowflake.

Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.

Worked on Snowflake Multi-Cluster Warehouses.

Experience with event-driven and real-time architectures, patterns, messaging, and streaming technologies such as Apache Kafka, AWS Kinesis, Amazon SQS/SNS, Amazon MQ, and Amazon Managed Streaming for Apache Kafka (MSK).

Installed and configured Apache Airflow to work with S3 buckets and the Snowflake data warehouse, and created DAGs to run in Airflow.

Created ETL/Talend jobs both design and code to process data to target databases.

Extensively used Spark to read data from S3, preprocess it, and store it back in S3 for creating tables in Athena.
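
A minimal PySpark sketch of the read/preprocess/write-back pattern described above, assuming Parquet output that an Athena table is later defined over; the bucket names and columns are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("s3-preprocess").getOrCreate()

    # Read raw CSV files landed in S3 (hypothetical bucket/prefix)
    raw = spark.read.option("header", "true").csv("s3://example-raw-bucket/events/")

    # Basic preprocessing: drop incomplete rows, normalize types, derive a partition column
    clean = (raw.dropna(subset=["event_id"])
                .withColumn("event_ts", F.to_timestamp("event_ts"))
                .withColumn("event_date", F.to_date("event_ts")))

    # Write back to S3 as partitioned Parquet; an Athena table is then created over this path
    (clean.write.mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3://example-curated-bucket/events/"))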

Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.

Worked extensively on Hadoop eco-system components Map Reduce, Pig, Hive, HBase, Flume, Sqoop, Hue, Oozie, Spark and Kafka.

Good experience with UNIX shell scripts and Perl scripts for automating tasks such as file loading and job scheduling.

Built SnapLogic pipelines to ingest data from a variety of sources such as Oracle, S3, and SFTP.

Designed and developed ETL integration patterns using Python on Spark.

Designed, developed, and deployed a convergent mediation platform for data collection and billing processes using Talend ETL.

Wrote Python and shell scripts for various deployments and automation processes, and wrote MapReduce programs in Python with the Hadoop Streaming API.

Developed Python code for the tasks, dependencies, SLA watcher, and time sensor of each job for workflow management and automation using Airflow.
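
A minimal sketch of the kind of Airflow DAG described above, with task dependencies, a per-task SLA, and a time-delta sensor; the DAG id, schedule, and Python callables are hypothetical placeholders.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.time_delta import TimeDeltaSensor


    def extract(**_):
        print("pulling source data")       # placeholder for the real extract step


    def load(**_):
        print("loading to the warehouse")  # placeholder for the real load step


    with DAG(
        dag_id="example_daily_etl",                # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"sla": timedelta(hours=2)},  # SLA watcher: flag tasks that run past 2 hours
    ) as dag:
        wait = TimeDeltaSensor(task_id="wait_for_upstream", delta=timedelta(minutes=30))
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        wait >> extract_task >> load_task  # task dependencies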

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
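
A small illustration of rewriting a Hive query as Spark transformations, shown here in PySpark in both DataFrame and RDD styles; the table and column names are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().appName("hive-to-spark").getOrCreate()

    # Original Hive/SQL form:
    #   SELECT state, COUNT(*) AS cnt FROM customers GROUP BY state

    # Equivalent DataFrame transformation
    df_counts = (spark.table("customers")   # hypothetical Hive table
                      .groupBy("state")
                      .agg(F.count("*").alias("cnt")))

    # Equivalent RDD transformation for the same aggregation
    rdd_counts = (spark.table("customers").rdd
                       .map(lambda row: (row["state"], 1))
                       .reduceByKey(lambda a, b: a + b))

    df_counts.show()
    print(rdd_counts.take(5))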

Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.

Excellent experience with NoSQL databases such as MongoDB and Cassandra, and wrote Apache Spark Streaming API code on a big data distribution in an active cluster environment.

Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS services.

Developed microservices in Java using Spring Boot.

Developed data ingestion modules (both real-time and batch data loads) to load data into various layers in S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions.

Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, and executed machine learning use cases with Spark ML and MLlib.

Used the AWS Glue catalog with crawlers to get the data from S3 and perform SQL query operations.

Transformed and analyzed the data using PySpark and Hive based on ETL mappings.

Integrated Spark with the Hadoop ecosystem and stored the data in the Hadoop Distributed File System (HDFS).

US Bank, Dallas, TX Sept 2021 – Mar 2022

Data Engineer

Responsibilities:

Involved in building a scalable distributed data lake system for Confidential's real-time and batch analytical needs.

Involved in designing, reviewing, optimizing data transformation processes using Apache Storm.

Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
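
A small PySpark illustration of the UDF-plus-aggregation pattern mentioned above (the original work was in Scala; this sketch only shows the shape of the logic, and the table, columns, and business rule are hypothetical).

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-agg-sketch").getOrCreate()

    # A simple UDF that buckets transaction amounts (hypothetical business rule)
    @F.udf(returnType=StringType())
    def amount_band(amount):
        return "high" if amount is not None and float(amount) >= 1000 else "low"

    txns = spark.table("transactions")  # hypothetical source table

    # DataFrame aggregation using the UDF; the result was exported onward (via Sqoop in the original work)
    summary = (txns.withColumn("band", amount_band("amount"))
                   .groupBy("account_id", "band")
                   .agg(F.sum("amount").alias("total_amount")))

    summary.write.mode("overwrite").parquet("/tmp/txn_summary")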

Performed database architecture design, administration, system analysis, design, development, and support using MS SQL Server, MSBI ETL tools, Core Java, JSP, Servlets, JavaScript, XML, jQuery, Python, and Scala scripting.

Designed and developed SSIS packages to move data from various sources into destination flat files and databases.

Troubleshot previous SSIS packages to resolve issues and performance-tuned existing scripts and processes.

Worked with EDL and SnapLogic admins to set up environments in SnapLogic and EDL.

Used Flink streaming pipelines on the Flink engine to process data streams and deploy new APIs, including the definition of flexible windows.

Used the DataStage Director to schedule and run jobs, test and debug their components, and monitor performance statistics.

Implemented framework related to read data from excel using Groovy.

Involved in building an application based primarily on Grails (Groovy), HTML, CSS, REST services, JavaScript, Spring, Maven, and Hibernate.

Ran Apache Hadoop, CDH, and MapR distributions as Elastic MapReduce (EMR) clusters on EC2.

Excellent experience with NoSQL databases such as MongoDB and Cassandra, and wrote Apache Spark Streaming API code on a big data distribution in an active cluster environment.

SnapLogic architect for the Confidential & Confidential to EDL integration project.

Developed a common Flink module for serializing and deserializing Avro data by applying schemas.

Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.

Deep knowledge and strong deployment experience in the Hadoop and big data ecosystems: HDFS, MapReduce, Spark, Pig, Sqoop, Hive, Oozie, Kafka, ZooKeeper, and HBase.

Collaborated with business stakeholders to develop and document policies and procedures surrounding data governance.

Experience in Designing, Architecting and implementing scalable cloud-based web applications using AWS and GCP.

Developed SSIS Packages for migrating data from Staging Area of SQL Server 2008 to SQL Server 2012.

Designed and built CI/CD pipelines for Google Cloud Platform (GCP) services: BigQuery, Dataflow, Pub/Sub, Data Fusion, and others.

Loaded the data into Spark RDDs and performed in-memory data computation to generate the output response.

Maintained and worked with a data pipeline that transfers and processes several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive, and Impala.

Created packages in SSIS with error handling.

Imported data from Kafka consumers into HBase using Spark Streaming.
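
A minimal sketch of consuming from Kafka with Spark Structured Streaming; the broker, topic, and checkpoint paths are hypothetical, and where the original pipeline wrote to HBase through a connector, a file sink is used here to keep the sketch self-contained.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Subscribe to a Kafka topic (hypothetical broker and topic names)
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load())

    # Kafka delivers key/value as binary; cast the value to string for downstream parsing
    events = stream.select(F.col("value").cast("string").alias("payload"),
                           F.col("timestamp"))

    # Sink: the real pipeline targeted HBase via a connector; a file sink stands in for it here
    query = (events.writeStream
             .format("parquet")
             .option("path", "/tmp/events")
             .option("checkpointLocation", "/tmp/events_chk")
             .start())

    query.awaitTermination()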

Used AWS services like EC2 and S3 for small data sets processing and storage.

Implemented the Machine learning algorithms using Spark with Python.

Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop as well as system specific jobs.

Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, and transformations during the ingestion process itself.

Worked on migrating legacy Map Reduce programs into Spark transformations using Spark and Scala.

Worked on a POC to compare processing time for Impala with Apache Hive for batch applications to implement the former in the project.

Developed data transition programs from Dynamo DB to AWS Redshift (ETL Process) using AWS.

Worked extensively with Sqoop for importing metadata from SQL.

SS&C, Hyderabad, India Dec 2019 – Aug 2021

Data Engineer

Responsibilities:

Communicated initial findings through Exploratory Data Analysis for pricing solutions in business-to-business industries.

Extracted, interpreted, and analyzed data to identify KPI’s and transform raw data into meaningful, actionable information.

Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
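
A small sketch of the JSON encode/decode pattern in PySpark using from_json and to_json; the schema and sample payload are illustrative.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("json-sketch").getOrCreate()

    schema = StructType([
        StructField("product", StringType()),
        StructField("price", DoubleType()),
    ])

    raw = spark.createDataFrame([('{"product": "widget", "price": 9.99}',)], ["payload"])

    # Decode: JSON string column -> struct, then flatten to ordinary columns
    decoded = raw.withColumn("data", F.from_json("payload", schema)).select("data.*")

    # Encode: struct of columns back to a JSON string column
    encoded = decoded.withColumn("payload", F.to_json(F.struct("product", "price")))

    decoded.show()
    encoded.show(truncate=False)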

Worked on different machine learning models in AWS SageMaker to build a profit-maximizing pricing model by predicting the demand curve of each product for price optimization, which led to strategic pricing.

Responsible for ETL operations using SQL Server Integration Services (SSIS), worked with users on training and support, and gained experience in incremental data loading.

Created and implemented Azure Data Factory Data Flow ETL and pipelines.

Designed re-usable SnapLogic pipelines to ingest data from Oracle DB and SFTP file systems.

Good experience in Enterprise Data Lake (EDL) big data integration projects using SnapLogic.

Experience in creating the methodologies and technologies that depict the flow of data within and between application systems and business functions/operations & data flow diagrams.

Worked on big data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods technologies.

Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.

Evaluated the effectiveness of the current pricing architecture and developed recommendations for refinement or to capture new opportunities.

Designed and built CI/CD pipelines for Google Cloud Platform (GCP) services: BigQuery, Dataflow, Pub/Sub, Data Fusion, and others.

Used SnapLogic best practices to tune the performance of pipelines.

Used the DataStage Director to schedule and run jobs, test and debug their components, and monitor performance statistics.

Experienced in creating complex SSIS packages using proper control and data flow elements.

Created Talend jobs to copy the files from one server to another and utilized Talend FTP components

Migrated Existing MapReduce programs to Spark Models using Python.

Built reports for monitoring data loads into GCP to drive reliability at the site level.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Installed and configured Hadoop MapReduce, HDFS, developed multiple MapReduce jobs in Java and NiFi for data cleaning and preprocessing.

Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS using Python and in NoSQL databases such as HBase and Cassandra.

Developed a Java ETL crawler framework to extract data from the client's database and ingest it into HDFS and HBase for long-term storage.

Used Apache Airflow in a GCP Composer environment to build data pipelines, and used various Airflow operators such as the Bash operator, Hadoop operators, Python callables, and branching operators.

Developed complex Talend ETL jobs to migrate data from flat files to databases.

Helped the client design, build, stabilize and deploy a Microsoft Business Intelligence solution for a large 12 million record flat file using SQL Server, SSIS, SSAS, SSRS.

Built NiFi flows for data ingestion purposes. Ingested data from Kafka, Microservices, CSV files from edge nodes using NIFI flows.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDD and Pyspark concepts.

Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).

Processed ad-hoc business requests for loading data into the production database using Talend jobs.

Built out Data Lake ingestion and staging to corporate data sources for Analytics purposes using Azure Data Factory.

Worked closely with Power BI report creators, empowering them to map reports to standardized data models underpinned by the data warehouse solution.

Worked on Teradata stored procedures and functions to confirm the data and load it into the table.

Controlled and monitored the performance of SnapLogic orchestrations and administered the lifecycle of data and process flows.

Involved in designing Kafka for multi data center cluster and monitoring it.

Responsible for importing real-time data from source systems into Kafka clusters.

Implemented Error handling in Talend to validate the data integrity and data completeness for the data from the Flat File.

Installed and configured Apache airflow for workflow management and created workflows in python.

Worked on batch processing of data sources using Apache Spark and Elasticsearch.

Used AWS services like EC2 and S3 for small data sets processing and storage.

Implemented the Machine learning algorithms using Spark with Python.

Developed Spark scripts by using Scala Shell commands as per the requirement.

Visualized the model in tableau to derive dependent variables and gain further insights.

Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.

Conducted and participated in project team meetings to gather status and discuss issues and action items.

Provide support for research and resolution of testing issues.

Bank Of America, Hyderabad, India Apr 2015 – Dec 2019

Data Analyst

Responsibilities:

Installed, designed, and developed the SQL Server database.

Created a logical design of the central relational database using Erwin.

Configured the DTS packages to run in periodic intervals.

Extensively worked with DTS to load data from source systems and run it at periodic intervals.

Worked with data transformations in both normalized and de-normalized data environments.

Involved in data manipulation using stored procedures and Integration Services.

Worked on query optimization, stored procedures, views, and triggers.

Assisted in OLAP and Data Warehouse environment when assigned.

Created tables, views, triggers, stored procedures, and indexes.

Designed and developed SSIS packages, stored procedures, configuration files, tables, views, functions and implemented best practices to maintain optimal performance.

Designed and tested (unit, integration, and regression) packages to extract, transform, and load data using SQL Server Integration Services (SSIS); designed packages that utilized tasks and transformations such as Execute SQL Task, Data Flow Task, Sequence Container, Conditional Split, Data Conversion, Derived Column, and Multicast.

Scheduled the jobs to streamline and automate cumbersome repeated tasks using SSIS.

Designed and implemented database replication strategies for both internal and Disaster Recovery.

Created ftp connections, database connections for the sources and targets.

Maintained security and data integrity of the database.

Developed several forms & reports using Crystal Reports.

Analyzed business requirements, system requirements, data mapping requirement specifications, and responsible for documenting functional requirements and supplementary requirements in Quality Centre.

Designed, developed, and deployed a convergent mediation platform for data collection and billing processes using Talend ETL.

Set up environments to be used for testing and defined the range of functionality to be tested per technical specifications.

Tested Complex ETL Mapping and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.

Responsible for various data mapping activities from source systems to the EDW, ODS, and data marts.

Delivered files in various formats (e.g., Excel, tab-delimited text, comma-separated text, pipe-delimited text).

Performed ad hoc analyses as needed.

Involved in Teradata SQL development, unit testing, and performance tuning, and ensured testing issues were resolved using defect reports.

Tested the database to check field size validation, check constraints, and stored procedures, cross-verifying the field sizes defined within the application against the metadata.


