
Azure Data Lake

Location: Richardson, TX
Posted: August 07, 2023


SASHANK PERICHERLA

Email: adyrlv@r.postjobfree.com PH: (469) 666-4394

Sr. Data Engineer

PROFESSIONAL SUMMARY

• 9+ years of IT experience in analysis, design, and development with Big Data technologies such as Spark, Snowflake, Hive, Kafka, and HDFS, along with programming languages like Java and Python.

• Strong experience building data pipelines and performing large-scale data transformations.

• In-depth knowledge of distributed computing systems and parallel processing techniques for handling Big Data efficiently.

• Built data pipelines using Azure Data Factory and Azure Databricks.

• Loaded data into Azure Data Lake and Azure SQL Database.

• Used Azure SQL Data Warehouse to control and grant database access.

• Worked with Azure services such as HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB.

• Firm understanding of Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka, and Oozie.

• Strong experience building Spark applications using Scala and Python as programming languages.

• Good experience troubleshooting and fine-tuning long-running Spark applications.

• Strong experience using the Spark RDD API, Spark DataFrame/Dataset API, Spark SQL, and Spark ML frameworks for building end-to-end data pipelines.

• Extensive hands-on experience tuning Spark jobs.

• Experienced in working with structured data using HiveQL and optimizing Hive queries.

• Good experience working with real-time streaming pipelines using Kafka and Spark-Streaming.

• Strong experience working with Hive for performing various data analyses.

• Detailed exposure to various Hive concepts such as partitioning, bucketing, join optimizations, SerDes, built-in UDFs, and custom UDFs (a brief sketch follows this summary).

• Good experience in automating end-to-end data pipelines using the Oozie workflow orchestrator.

• Good experience working with Cloudera, Hortonworks, Snowflake, and AWS big data services.

• Worked with AWS services including IAM, EC2, VPC, AMI, SNS, RDS, SQS, EMR, Lambda, Glue, Athena, DynamoDB, Kinesis, CloudWatch, Auto Scaling, S3, and Route 53.

• Implemented Lambda to configure the DynamoDB auto-scaling feature and implemented a data access layer to access AWS DynamoDB data.

• Developed and deployed Lambda functions in AWS with built-in AWS Lambda libraries, and deployed Python Lambda functions with custom libraries.

• Experience developing and maintaining applications written for AWS S3, AWS EMR (Elastic MapReduce), and AWS CloudWatch.

• Experience in analyzing, designing, and developing ETL Strategies and processes, and writing ETL specifications.

• Excellent understanding of NoSQL databases like HBase, Cassandra, and MongoDB.

• Proficient knowledge of and hands-on experience in writing shell scripts in Linux.

• Experienced in requirement analysis, application development, application migration, and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.

• Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.

• Adequate knowledge and working experience in Agile and Waterfall Methodologies.

• Defining user stories and driving the agile board in JIRA during project execution, participating in sprint demos and retrospectives.

• Good interpersonal and communication skills, strong problem-solving skills, adapts to new technologies with ease, and a good team member.

• In-depth knowledge of Snowflake Database, Schema, and Table structures.

• Experience with gathering and analyzing system requirements.
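
The following is a minimal PySpark sketch of the custom-UDF and partitioned-write concepts mentioned in this summary; the table, column, and path names (orders_raw, analytics.orders_curated, and so on) are hypothetical and chosen only for illustration.

# Illustrative PySpark sketch: register a custom UDF and write a partitioned table.
# Table and column names here are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-partitioning-sketch").enableHiveSupport().getOrCreate()

# Custom UDF: normalize a free-text country code (a built-in function would normally
# be preferred; a UDF is used here only to illustrate the concept).
@F.udf(returnType=StringType())
def normalize_country(code):
    return code.strip().upper() if code else "UNKNOWN"

orders = spark.table("orders_raw")  # hypothetical source table
curated = (orders
           .withColumn("country", normalize_country(F.col("country")))
           .withColumn("order_date", F.to_date("order_ts")))

# Partitioning by order_date lets queries that filter on the date prune partitions.
(curated.write
        .mode("overwrite")
        .partitionBy("order_date")
        .format("parquet")
        .saveAsTable("analytics.orders_curated"))  # hypothetical target table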

TECHNICAL SKILLS:

Hadoop/Big Data Technologies

Hadoop, Apache Spark, HDFS, MapReduce, Sqoop, Hive, Oozie, ZooKeeper, Kafka, Flume

Programming & Scripting

Python, SQL, PySpark.

Databases

MySQL, Oracle, MS SQL Server, Teradata

NoSQL Databases

HBase, Cassandra, DynamoDB, MongoDB.

Big Data Distributions

Hortonworks, Cloudera, Spark

Version Control

Git, Bitbucket.

Operating Systems

Linux, Unix, Mac OS-X, CentOS, Windows 10, Windows 8, Windows 7

Cloud Computing

Azure Data Lake, Blob Storage, Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse Analytics; AWS S3, Amazon RDS, Athena, Glue, Kinesis, Redshift, DynamoDB.

Visualization Tools

Power BI, Tableau, Matplotlib, Seaborn, QuickSight.

PROFESSIONAL EXPERIENCE

Azure Data Engineer

UNUM, AZ Sept 2022 - Present

Responsibilities:

• Implemented large-scale Lambda architectures using Azure data platform capabilities such as Azure Data Lake, Azure Data Factory, Azure Data Catalog, and Azure SQL Server.

• Designed end-to-end scalable architecture to solve business problems using various Azure Components like Data Factory, Data Lake Storage, and Databricks.

• Created database objects in Snowflake and migrated data from MS SQL Server into Snowflake.

• Uploaded data files from a local instance into Amazon S3 buckets as a staging area.

• Created new streams for capturing all data changes based on DML operations.

• Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

• Created Databricks notebooks using SQL and Python, and automated the notebooks using jobs.

• Developed Spark applications using Python and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into customer usage patterns.

• Created Spark clusters and configured high-concurrency clusters in Azure Databricks to speed up the preparation of high-quality data.

• Implemented Kafka and Spark Structured Streaming for real-time data ingestion (see the sketch at the end of this list).

• Expertise in moving data between GCP and Azure using Data Factory.

• Used Delta Lake for its scalable metadata handling and unification of streaming and batch processing.

• Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.

• Wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules.

• Created Airflow scheduling scripts in Python.

• Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.

• Used Databricks to integrate easily with the whole Microsoft stack.

• Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

• Used Azure Data Catalog to organize data assets and get more value from existing investments.

• Used Azure Synapse to provide a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.

• Followed organization-defined naming conventions for flat file structures, Talend jobs, and the daily batches that execute the Talend jobs.

• Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines with various Airflow operators, such as the BashOperator.

• Exposed transformed data on the Azure Databricks Spark platform in Parquet format for efficient data storage.

• Experienced in performance tuning of Spark Applications for setting the right Batch Interval time, the correct level of Parallelism, and memory tuning.

• Ran Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.

• Collected JSON data from an HTTP source and developed Spark APIs to perform inserts and updates in Hive tables.

• Used Azure Data Factory, the SQL API, and the MongoDB API, and integrated data from MongoDB, MS SQL, and the cloud (Blob Storage, Azure SQL DB, Cosmos DB).

• Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.

• Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.

• Scheduled jobs using Airflow scripts written in Python, adding tasks to DAGs and Lambda.

• Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark.

• Provided guidance to the development team working on PySpark as an ETL platform.

• Performed all necessary day-to-day Git support for different projects; responsible for maintaining the Git repositories and access control strategies.
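
A minimal sketch of the Kafka-to-Delta Structured Streaming pattern referenced in this list, followed by a Delta time-travel read; the broker address, topic name, schema, and storage paths are assumptions for illustration only, and the Kafka and Delta Lake connectors are assumed to be available (as they are on Databricks).

# Illustrative PySpark Structured Streaming sketch: Kafka source -> Delta Lake sink,
# followed by a Delta time-travel read. Broker, topic, and path names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-delta-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "events")                       # hypothetical topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers the payload as bytes; cast to string and parse the JSON body.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

# Append the parsed events to a Delta table with checkpointing for fault tolerance.
query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events")  # hypothetical path
         .outputMode("append")
         .start("/mnt/delta/events"))                              # hypothetical path

# Delta time travel: read an earlier version of the table for audits or reproducible runs.
events_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")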

Environment: Spark 3, Databricks, Functions, Data Factory, Azure Data Grid, Azure Synapse Analytics, Azure Data Catalog, Delta Lake, Blob Storage, Cosmos DB, Python, PySpark, Java, SQL, Kafka, JIRA.

AWS Data Engineer

Markel, VA March 2022 – Sept 2022

Responsibilities:

• Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms.

• Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop, and Zookeeper.

• Developed multiple POCs using Spark and Scala and deployed them on the YARN cluster, comparing the performance of Spark with Hive and SQL.

• Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.

• Deployed applications using AWS EC2 standard deployment techniques and worked on AWS infrastructure and automation. Worked in a CI/CD environment, deploying applications in Docker containers.

• Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.

• Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark (see the sketch at the end of this list).

• Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.

• Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storing small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.

• Troubleshot and resolved data processing issues and proactively engaged in data modeling discussions.

• Worked on RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.

• Wrote Spark programs in Python (PySpark) for performance tuning, optimization, and data quality validation.

• Implemented data ingestion from various source systems using Sqoop and PySpark.

• Hands-on experience with performance tuning of Spark and Hive jobs.

• Developed Kafka producers and consumers for streaming millions of events per second.

• Implemented a distributed messaging queue using Apache Kafka to integrate with Cassandra.

• Hands-on experience fetching live stream data from UDB into HBase tables using PySpark Streaming and Apache Kafka.

• Evaluated Snowflake design considerations for any change in the application.
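
A minimal sketch of the kind of PySpark job described in this list: reading JSON from S3, applying simple data-quality validations, and writing Parquet back to S3. The bucket names, prefixes, column names, and validation rules are hypothetical.

# Illustrative PySpark sketch of an EMR-style job: read JSON from S3, apply simple
# data-quality validations, and write Parquet back to S3. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-json-to-parquet-sketch").getOrCreate()

src = "s3://example-raw-bucket/claims/2022/03/"          # hypothetical input prefix
dst = "s3://example-curated-bucket/claims_parquet/"      # hypothetical output prefix
rej = "s3://example-curated-bucket/claims_rejected/"     # hypothetical reject prefix

claims = spark.read.json(src)

# Simple data-quality rules: a claim id must be present and the amount non-negative.
valid = claims.filter(F.col("claim_id").isNotNull() & (F.col("claim_amount") >= 0))
rejected = claims.exceptAll(valid)

print(f"valid={valid.count()} rejected={rejected.count()}")  # basic run metrics

# Columnar Parquet output, partitioned so downstream scans stay selective.
valid.write.mode("overwrite").partitionBy("state").parquet(dst)
rejected.write.mode("overwrite").json(rej)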

Environment: HDFS, Python, SQL, Spark, Scala, Kafka, Hive, Yarn, Snowflake, Lambda, AWS Cloud, GitHub, Shell Scripting.

Data Engineer

Tenneco, MI Jan 2019 - Feb 2022

Responsibilities:

• Involved in a data warehouse implementation on Azure Synapse using Synapse pipelines, notebooks, serverless SQL, and a dedicated SQL pool.

• Developed Azure Synapse and Databricks notebooks to join, filter, pre-aggregate, and process files stored in Azure Data Lake Storage using PySpark.

• Used various types of activities: data movement activities, transformations, and control activities, including Copy Data, Data Flow, Get Metadata, Lookup, Stored Procedure, and Execute Pipeline.

• Worked extensively with Azure Storage, Azure Data Lake, Azure File Share, and Azure Blob Storage to store and retrieve data files.

• Used Azure Key Vault to securely store secrets for connection strings and database passwords and used them in configuring Linked services.

• Developed ETL pipelines using Spark and Hive for performing various business-specific transformations.

• Built applications and automated Spark pipelines for bulk as well as incremental loads of various datasets.

• Used a mix of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics to extract, transform, and load data from source systems to Azure data storage services.

• Processing data in Azure Databricks once it has been ingested into one or more Azure Services.

• Deployed Data Factory pipelines to orchestrate data into the SQL database.

• Implemented automated data pipelines using Azure Data Factory, reducing manual effort and improving data processing efficiency.

• Developed Python scripts to perform file validations in Databricks and automated the process using ADF (see the sketch at the end of this list).

• Developed an automated process in the Azure cloud to ingest data daily from a web service and load it into Azure SQL DB.

• Wrote Spark applications to perform operations such as data inspection, cleaning, loading, and transforming large sets of structured and semi-structured data.

• Developed Spark applications with Scala and Spark SQL for testing and processing data.

• Made Spark job statistics reporting, monitoring, and data quality checks available for each dataset.
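
A minimal sketch of the Databricks file-validation pattern mentioned in this list, combined with a Key Vault-backed secret scope and an incremental Delta append; the scope, secret, storage account, path, and table names are hypothetical, and the code assumes it runs inside a Databricks notebook where spark and dbutils are predefined.

# Illustrative Databricks notebook sketch: read a storage key from a Key Vault-backed
# secret scope, validate an incoming file, and append it to a Delta table.
# Scope, secret, account, path, and table names below are hypothetical.
from pyspark.sql import functions as F

storage_key = dbutils.secrets.get(scope="kv-backed-scope", key="adls-access-key")
spark.conf.set("fs.azure.account.key.exampleadls.dfs.core.windows.net", storage_key)

incoming_path = "abfss://landing@exampleadls.dfs.core.windows.net/sales/2021-06-01.csv"

df = spark.read.option("header", True).csv(incoming_path)

# File validation: required columns present and no null business keys.
expected_cols = {"sale_id", "store_id", "sale_date", "amount"}
missing = expected_cols - set(df.columns)
if missing:
    raise ValueError(f"File validation failed, missing columns: {missing}")
if df.filter(F.col("sale_id").isNull()).count() > 0:
    raise ValueError("File validation failed: null sale_id values found")

# Incremental load: append only this file's rows to the curated Delta table.
(df.withColumn("load_date", F.current_date())
   .write.format("delta").mode("append").saveAsTable("curated.sales"))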

Environment: AWS Cloud Services, Apache Spark, Spark-SQL, Snowflake, Unix, Kafka, Scala, SQL Server.

Azure Data Engineer

Inovalon, Bowie, MD Oct 2016 to Dec 2018

Responsibilities:

• Used Agile software development methodology in defining the problem, gathering requirements, development iterations, business modeling, and communicating with the technical team for the development of the system.

• Created pipelines in ADF for data extraction, transformation, and loading from different sources.

• Generated ETL scripts to transform, flatten, and enrich data from source to target using AWS Glue, and created event-driven ETL pipelines with AWS Glue.

• Developed Spark SQL applications in Databricks for data extraction, transformation, and aggregation from multiple file formats.

• Utilized Spark Scala API to implement batch processing of jobs and developed Spark Streaming applications to consume data from Kafka topics and insert processed streams to HBase.

• Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, as well as controlling and granting database access.

• Developed and maintained analytics data pipelines using SQL and/or Python in both on-prem and cloud (Azure Databricks) environments.

• Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse. Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.

• Performed monitoring and management of the Hadoop cluster by using Azure HDInsight.

• Generated PL/SQL scripts for data manipulation, validation, and materialized views for remote instances.

• Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and wrote Hive queries for analysis (see the sketch at the end of this list).

• Used Azure Data Lake as a source and pulled data using Azure Blob Storage.
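
A minimal sketch of the partitioned Hive external table design mentioned in this list; the database, table, columns, and storage location are hypothetical, and the HiveQL is issued through spark.sql so the example stays in PySpark.

# Illustrative sketch: create a partitioned Hive external table and run an analysis query.
# Database, table, and location names are hypothetical; HiveQL is submitted via spark.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-external-table-sketch").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS edw")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS edw.member_claims (
        member_id   STRING,
        claim_id    STRING,
        paid_amount DOUBLE
    )
    PARTITIONED BY (claim_year INT, claim_month INT)
    STORED AS ORC
    LOCATION 'abfss://warehouse@exampleadls.dfs.core.windows.net/edw/member_claims'
""")

# Register a newly landed partition, then analyze it; the partition filter keeps the scan small.
spark.sql("ALTER TABLE edw.member_claims ADD IF NOT EXISTS PARTITION (claim_year=2018, claim_month=11)")

monthly_totals = spark.sql("""
    SELECT member_id, SUM(paid_amount) AS total_paid
    FROM edw.member_claims
    WHERE claim_year = 2018 AND claim_month = 11
    GROUP BY member_id
""")
monthly_totals.show(20)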

Environment: Microsoft Azure, Databricks, Azure Data Lake, PySpark, Python, Azure Data Factory V2, Azure Synapse (DWH), Azure Analysis Services, Power BI, Azure

Data Warehouse / ETL Developer

Dhruvsoft Services Private Limited, Hyderabad, India Jan 2014 to July 2016

Responsibilities:

• Worked on PowerCenter client tools such as Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, and Transformation Developer.

• Implemented ETL using transformations such as Source Qualifier, Expression, Aggregator, connected and unconnected Lookups, Filter, Router, Sequence Generator, Sorter, Joiner, and Update Strategy.

• Developed shell scripts and PL/SQL procedures for creating and dropping tables and indexes for performance during pre- and post-session management (see the sketch at the end of this list).

• Imported data from various sources, then transformed and loaded it into data warehouse targets using Informatica.

• Developed Informatica mappings and tuned them when necessary; optimized query performance and session performance.

• Performed unit and system testing of developed mappings and was involved in documentation.
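
A minimal sketch of the pre/post-session index-management pattern referenced in this list, written in Python with the python-oracledb driver to keep these examples in one language; the DSN, credentials, table, and index names are hypothetical, and the original work used shell scripts and PL/SQL rather than Python.

# Illustrative sketch of pre/post-session index management: drop an index before a bulk
# load and rebuild it afterwards so the load runs faster. DSN, credentials, and object
# names are hypothetical (the original implementation used shell scripts and PL/SQL).
import oracledb

DSN = "dwhost:1521/DWHPDB"  # hypothetical connect string

def run_ddl(statement: str) -> None:
    with oracledb.connect(user="etl_user", password="etl_pwd", dsn=DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(statement)

def pre_session() -> None:
    # Dropping the index avoids per-row index maintenance during the bulk insert.
    run_ddl("DROP INDEX sales_fact_cust_idx")

def post_session() -> None:
    # Recreate the index after the load so reporting queries stay fast.
    run_ddl("CREATE INDEX sales_fact_cust_idx ON sales_fact (customer_key)")

if __name__ == "__main__":
    pre_session()
    # ... the Informatica session / bulk load would run here ...
    post_session()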

Environment: Oracle, Informatica Power Center, Sybase, Unix Scripting, Selenium, Maven, Eclipse, TOAD, SPSS

EDUCATIONAL QUALIFICATIONS

• Bachelor’s in Electronics and Electrical Engineering from GITAM University, India.


