Sai Priyanka Annadasu
Sr Data Engineer
Contact no: 312-***-****
Email Id: ad29pw@r.postjobfree.com

PROFESSIONAL SUMMARY

9+ years of experience in developing, deploying, and managing big data applications.

Expertise in designing data-intensive applications using the Hadoop Ecosystem, Big Data Analytics, Cloud Data Engineering, Data Science, Data Warehouse/Data Mart, Data Visualization, Reporting, and Data Quality solutions.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure Blob Storage, and Azure SQL Data Warehouse, as well as controlling and granting database access.

Hands-on experience with S3, EC2, RDS, EMR, Redshift, SQS, Glue, and other services of AWS.

Developed ETL processes using PL/SQL, Informatica, and Unix scripts/jobs.

In-depth understanding of Snowflake multi-cluster warehouse sizing and credit usage.

Played a key role in migrating Teradata objects into the Snowflake environment.

Experience with Snowflake Multi-Cluster Warehouses and Snowflake Virtual Warehouses.

Expertise in writing Spark RDD transformations, actions, and DataFrame operations for a given input.
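
A minimal PySpark sketch of RDD transformations/actions and the equivalent DataFrame operations; the sample data and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dataframe-demo").getOrCreate()

# RDD transformations (map, filter) followed by actions (reduceByKey + collect)
events = spark.sparkContext.parallelize(["a,1", "b,2", "a,3"])
pairs = (events.map(lambda line: line.split(","))
               .map(lambda parts: (parts[0], int(parts[1])))
               .filter(lambda kv: kv[1] > 1))
print(pairs.reduceByKey(lambda x, y: x + y).collect())

# Equivalent DataFrame operations over the same input
df = pairs.toDF(["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```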

Expert in developing SSIS/DTS packages to extract, transform, and load (ETL) data into data warehouses/data marts from heterogeneous sources.

Good understanding of Spark Core, Spark SQL, Kinesis, and Kafka.

Improved the performance of SSIS packages by implementing parallel execution, removing unnecessary sorts, and using optimized queries and stored procedures.

Experience in design and development with Tableau.

Experience with Spark Streaming to receive real-time data from Kafka.
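
A minimal Spark Structured Streaming sketch for consuming a Kafka topic, assuming the spark-sql-kafka connector package is on the classpath; the broker address and topic name are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (broker/topic names are placeholders)
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers key/value as binary; cast the payload to string before use
messages = stream.select(col("value").cast("string").alias("payload"))

# Write the stream to the console for demonstration
query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```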

Strong knowledge of data warehousing implementation concepts in Redshift; completed a POC with Matillion and Redshift for DW implementation.

Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.

Experience in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.

Experience in cloud provisioning tools such as Terraform and CloudFormation.

Knowledge on NoSQL databases such as HBase, MongoDB, Cassandra.

Imported data from different sources such as HDFS and HBase into Spark RDDs.

Experienced with AWS serverless services such as Lambda, Step Functions, and Glue.

Skilled in streaming data using Apache Spark and migrating data from Oracle to Hadoop HDFS using Sqoop.

Good understanding of building consumption frameworks on Hadoop, AWS, Azure, or GCP (RESTful services, self-service BI, and analytics).

Experienced in processing large datasets of different forms including structured, semi-structured and unstructured data.

Expertise in usage of Hadoop and its ecosystem commands.

Expertise in designing tables in Hive and MySQL, and in importing and exporting data between databases and HDFS using Sqoop.

Hands-on experience setting up workflows with Airflow and Oozie for managing and scheduling Hadoop jobs.
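
A small Airflow DAG sketch of that scheduling pattern, assuming Airflow 2.x; the DAG id, schedule, and spark-submit script paths are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Daily DAG that runs an ingest step followed by a transform step
with DAG(
    dag_id="hadoop_daily_jobs",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="spark-submit /opt/jobs/ingest.py",  # placeholder path
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit /opt/jobs/transform.py",  # placeholder path
    )
    ingest >> transform  # transform runs only after ingest succeeds
```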

Experienced in handling various file formats like ASCII, XML, JSON.

Technical Skills

Big Data Ecosystem: Hadoop MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Airflow, Kafka, Spark
Programming Languages: Python, Scala, SAS, Java, SQL, HiveQL, PL/SQL, UNIX shell scripting
Machine Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, XGBoost, GBM, CatBoost, Naïve Bayes, PCA, LDA, K-Means, KNN
Deep Learning: PyTorch, TensorFlow, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), LSTM, GRU
Databases: Snowflake, MySQL, Teradata, Oracle, MS SQL, PostgreSQL, DB2, Cassandra, MongoDB, DynamoDB, Cosmos DB
DevOps Tools: Jenkins, Docker, Maven, Kubernetes (CI/CD)
Cloud: AWS (Amazon Web Services), Azure, Snowflake
Version Control: Git, GitHub, Bitbucket
ETL/BI: Informatica, SSIS, SSRS, SSAS, Tableau, Power BI
Operating Systems: macOS, Windows 7/8/10, Unix, Linux, Ubuntu
SDLC Methodologies: Jira, Confluence, Agile, Scrum

PROFESSIONAL WORK EXPERIENCE

Client: BCBS JUN 2022 – Present

Location: Chicago, IL

Role: Sr. Data Engineer

Responsibilities:

Worked closely with business stakeholders to obtain the required data and to analyze, design, and implement big data applications.

Created Big Data Solutions that allowed the business and technology teams to make data-driven decisions about how to best acquire customers and provide them with business solutions.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics), with data ingestion into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW).

Used Kubernetes to orchestrate and deploy multiple Docker containers and VMs to process the data pipelines.

Created Kafka streaming data pipelines to consume data from multiple sources and perform transformations using Scala.

Used AWS Database Migration Service (DMS) and the Schema Conversion Tool along with the Matillion ETL tool.

Developed ETL and ELT data pipelines for SQL and NoSQL sources using tools such as Matillion, Google Dataflow and Python.

Involved in Migrating Objects from Teradata to Snowflake.

Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
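
One possible shape for that setup, shown as a sketch: an S3-triggered Lambda that starts a Glue crawler so new files are cataloged as tables on demand. The crawler name is hypothetical and would need to exist already in the account.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 put event; starts a Glue crawler to (re)catalog the data."""
    # Log the objects that triggered the invocation
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")

    # Crawler name is a placeholder; the crawler defines the target database/tables
    glue.start_crawler(Name="raw-data-crawler")
    return {"status": "crawler started"}
```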

Gathered data from various input sources about users, their intents, and the context they are in to develop smart algorithms and software, including content-based search, collaborative filtering, behavioral clustering, and personalization algorithms for finding correlations relevant to each user.

Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.

Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.

Configured and managed infrastructure as code using tools such as Ansible and Terraform, enabling easy scaling and maintenance of the CI/CD pipeline.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
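
An illustrative PySpark extract/transform/aggregate pass over two file formats; the S3 paths and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Extract: usage events (JSON) and customer reference data (Parquet); paths are placeholders
events = spark.read.json("s3a://example-bucket/raw/usage_events/")
customers = spark.read.parquet("s3a://example-bucket/curated/customers/")

# Transform: keep valid events and join in customer attributes
valid = events.filter(F.col("duration_sec") > 0)
joined = valid.join(customers, on="customer_id", how="inner")

# Aggregate: usage per customer segment
usage_by_segment = (joined.groupBy("segment")
                    .agg(F.count("*").alias("events"),
                         F.avg("duration_sec").alias("avg_duration_sec")))

usage_by_segment.write.mode("overwrite").parquet("s3a://example-bucket/marts/usage_by_segment/")
```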

Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.

Performed transformations and aggregations using Python and Spark SQL.

Created and scheduled workflows using Oozie and Hive to run on AWS EC2.

Applied optimization techniques to enhance the performance of ETL jobs in the EMR cluster.

Imported data from multiple sources into Spark RDD.

Environment: HDFS, Alteryx 11, EMR, Glue, Spark, PySpark, ADF, Kafka, AWS, Pig, SBT, SSIS, Maven, Python, Spark SQL, Snowflake, Jenkins CI/CD

Client: State of NY AUG 2021 – MAY 2022

Location: New York

Role: Data Engineer

Responsibilities:

Performed Spark streaming and batch processing using Scala.

Used Hive in Spark for data cleansing and transformation.

Used Scala and Kafka to create data pipelines for structuring, processing, and transforming given data.

Responsible for building scalable distributed data solutions using the EMR cluster environment with Amazon EMR 5.6.1.

Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.

Implemented Performance testing using Apache JMeter and created a Dashboard using Grafana to view the Results.

Participated in creating state-of-the-art data- and analytics-driven solutions, developing and deploying cutting-edge scalable algorithms to take business analytics to a new level of predictive analytics while leveraging big data tools and technologies.

Developed real-time data feeds and microservices leveraging AWS Kinesis, Lambda, Kafka, Spark Streaming, etc., to enhance analytic opportunities and influence customer content and experiences.

Identified and utilized existing tools and algorithms from multiple sources to enhance confidence in the assessment of various targets.

Worked on ETL testing and used an SSIS tester automated tool for unit and integration testing.

Used the AWS Glue Data Catalog with crawlers to get the data from S3 and perform SQL query operations.

Developed and deployed the outcome using Spark and Scala code on Hadoop clusters running on AWS, using Docker, Ansible, and Terraform with CI/CD.

Collaborated with cross-functional teams to integrate security testing into the CI/CD pipeline, ensuring the security of the software throughout the development lifecycle.

Heavily involved in testing Snowflake to understand the best possible way to use cloud resources.

Wrote efficient load and transform Spark code using Python and Spark SQL.

Wrote UDFs in Scala and PySpark to meet specific business requirements.
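
A PySpark version of such a UDF, with a made-up business rule used purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Hypothetical business rule: bucket claim amounts into bands
def claim_band(amount):
    if amount is None:
        return "unknown"
    return "high" if amount >= 10000 else "standard"

claim_band_udf = udf(claim_band, StringType())

# Apply the UDF to a small example DataFrame
claims = spark.createDataFrame([(1, 2500.0), (2, 15000.0)], ["claim_id", "amount"])
claims.withColumn("band", claim_band_udf(col("amount"))).show()
```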

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

Used Oozie to create big data workflows for ingesting data from various sources to Hadoop.

Developed Spark jobs using Scala for better data processing and used Spark SQL for querying.

Environment: HDFS, Spark, Scala, PySpark, ADF, Kafka, AWS, Pig, SBT, Maven, CircleCI

Client: USAA AUG 2018 - JUN 2021

Location: Plano, TX

Role: Data Engineer

Responsibilities:

Worked on analyzing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.

Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

As a Big Data Developer, implemented solutions for ingesting data from various sources and processing the data at rest utilizing big data technologies such as Hadoop, MapReduce frameworks, MongoDB, Hive, Oozie, Flume, Sqoop, and Talend.

Developed a job server (REST API, Spring Boot, Oracle DB) and job shell for job submission, job profile storage, and job data (HDFS) querying and monitoring.

Worked with Spark on improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Deployed application to AWS and monitored the load balancing of different EC2 instances.

Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from SQL into HDFS using Sqoop.

Developed analytical components using Scala, Spark, Apache Mesos and Spark Stream.

Installed Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.

Worked on big data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.

Worked extensively in Python to build a custom ingest framework and developed REST APIs using Python.

Improved the monitoring and alerting system of the CI/CD pipeline using Jenkins, resulting in quicker detection and resolution of issues.

Developed Kafka producers and consumers, Spark jobs, and Hadoop MapReduce jobs.
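
A bare-bones producer/consumer pair sketched with the kafka-python client (one of several possible clients), assuming a locally reachable broker; the topic name is a placeholder.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "clickstream"  # placeholder topic

# Producer: serialize dicts to JSON bytes and publish
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "login"})
producer.flush()

# Consumer: read from the beginning of the topic and deserialize
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating if no messages arrive
)
for message in consumer:
    print(message.value)
```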

Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.

Imported data from different sources such as HDFS and HBase into Spark RDDs.

Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.

Environment: Hadoop, Python, HDFS, Spark, MapReduce, Pig, Hive, Sqoop, Kafka, HBase, Oozie, Flume, Scala, Java, Cassandra, Zookeeper, MongoDB, AWS EC2, EMR, S3.

Client: Bicon DEC 2016 - JUNE 2018

Location: Hyderabad, India

Role: Data Engineer

Responsibilities:

Designed and built a custom, generic ETL framework as a Spark application using Scala.

Handled data transformations based on the requirements.

Configured Spark jobs for weekly and monthly execution using AWS Data Pipeline.

Handled application log data by creating custom loggers.

Created error reprocessing framework to handle errors during subsequent loads.

Executed queries using Spark SQL for complex joins and data validation.

Developed complex transformations and mapplets using Informatica to extract, transform, and load data into data marts, the Enterprise Data Warehouse (EDW), and the Operational Data Store (ODS).

Created an SSIS package to get the dynamic source filename using a For Each Loop container.

Used Lookup, Merge, Data Conversion, Sort, and other data flow transformations in SSIS.

Created independent components for AWS S3 connections and extracted data into Redshift.
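
One common pattern for that load step, sketched here as a Redshift COPY from S3 issued over a psycopg2 connection; the connection details, table, bucket, and IAM role are all placeholders.

```python
import psycopg2

# All connection details and object names below are placeholders
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)

copy_sql = """
    COPY staging.orders
    FROM 's3://example-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

# Redshift pulls the files directly from S3; the transaction commits on success
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()
```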

Involved in writing Scala scripts for extracting data from Cassandra Operational Data Store tables to compare with legacy system data.

Worked on a data ingestion file validation component covering threshold levels, last-modified timestamps, and checksums.
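
A stdlib sketch of that kind of file check (size threshold, last-modified age, checksum); the thresholds, path, and expected digest are illustrative only.

```python
import hashlib
import os
import time

def validate_file(path, min_bytes=1, max_age_hours=24, expected_md5=None):
    """Return a dict of pass/fail checks for an incoming data file."""
    stat = os.stat(path)

    # Checksum the file in chunks so large files don't need to fit in memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    return {
        "size_ok": stat.st_size >= min_bytes,
        "fresh_enough": (time.time() - stat.st_mtime) <= max_age_hours * 3600,
        "checksum_ok": expected_md5 is None or digest.hexdigest() == expected_md5,
    }

# Example usage with an illustrative path:
# print(validate_file("/data/incoming/orders.csv", min_bytes=1024))
```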

Environment: Spark, Scala, AWS, DBeaver, Zeppelin, S3, Cassandra, Alteryx 11, Workspace, Shell scripting.

Client: Genpact JUL 2014 - NOV 2016

Location: Hyderabad, India

Role: Data Analyst

Responsibilities:

Analyzed the requirements provided by the client and developed a detailed design with the team.

Worked with the client team to confirm the design and modified based on the changes mentioned.

Involved in extracting and exporting data from DB2 into AWS for analysis, visualization, and report generation.

Created HBase tables and columns to store the user event data.

Used Hive and Impala to query the data in HBase.

Developed and implemented core API services using Scala and Spark.

Managed querying the data frames using Spark SQL.

Used Spark DataFrames to migrate data from AWS to MySQL.
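
A sketch of writing a Spark DataFrame out to MySQL over JDBC, which is one way such a migration can be done; all connection values are placeholders and the MySQL JDBC driver is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-mysql").getOrCreate()

# Source data on S3 (placeholder path)
df = spark.read.parquet("s3a://example-bucket/exports/users/")

# Write to MySQL over JDBC; host, database, table, and credentials are placeholders
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://db.example.com:3306/analytics")
   .option("dbtable", "users")
   .option("user", "etl_user")
   .option("password", "***")
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .mode("append")
   .save())
```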

Built a continuous ETL pipeline using Kafka, Spark Streaming, and HDFS.

Performed ETL on data from various file formats and sources (JSON, Parquet, and databases).

Performed complex data transformations using Scala in Spark.

Converted SQL queries to Spark transformations using Spark RDDs and Scala.

Worked on importing real-time data into Hadoop using Kafka and implemented Oozie jobs.

Collected log data from web servers and exported to HDFS.

Involved in defining job flows, management, and log files reviews.

Installed Oozie workflows to run Spark and Pig jobs simultaneously.

Created Hive tables to store the data in tabular format.

Environment: Spark, Scala, HDFS, SQL, Oozie, Sqoop, Zookeeper, MySQL, HBase.

EDUCATIONAL QUALIFICATIONS

Bachelor's in Computer Science, Guru Nanak Institute of Technology


