AWS Data Pipeline

Location:
Atlanta, GA
Posted:
January 24, 2024

Rashmi

Sr. AWS Data Engineer

Contact Number: +1-901-***-****

Email: ad22ey@r.postjobfree.com

LinkedIn URL: www.linkedin.com/in/rashmi-alpati-1bb7a1264

Professional Summary:

Over 11 years of experience in the IT industry, spanning big data environments, the Hadoop ecosystem, Python, Java, and the design, development, and maintenance of various applications.

Extensive experience with AWS Data Pipeline and AWS Glue for ingesting data from various source systems, including relational and unstructured data, to meet business functional requirements.

Designed and developed batch processing and real-time processing solutions using AWS Data Pipeline, Amazon EMR clusters, and Amazon Kinesis Data Streams/Data Analytics.

Created optimal pipeline architectures on the AWS platform, ensuring scalability and performance. Implemented AWS Data Pipeline connections to facilitate data landing from different sources.

Proficient in managing Amazon S3 and Amazon Athena, with the ability to integrate with other AWS services. Experienced in IAM (Identity and Access Management) for access control.

Have hands-on experience in migrating on-premise ETLs to AWS using cloud-native tools such as AWS Glue, Amazon Redshift, and Amazon S3.

Experience in building efficient pipelines for moving data between AWS and GCP using AWS Data Pipeline.

Experience building Amazon QuickSight reports on Amazon Redshift, achieving better performance than direct queries against Google BigQuery.

Hands-on experience with AWS big data services such as Amazon Redshift, Amazon EMR, Amazon S3, AWS Glue, and AWS Step Functions for workflow orchestration.

Experience architecting and building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in AWS, while coordinating tasks among the team.

Good knowledge of Amazon Web Services (AWS) offerings such as EMR and EC2, which provide fast and efficient processing for Teradata big data analytics workloads.

Experience installing, configuring, supporting, and monitoring Hadoop clusters using Apache and Cloudera distributions as well as AWS.

Experience in AWS, implementing solutions using services like EC2, S3, RDS, Redshift, VPC.

Applied object-oriented programming, the Java Collections API, SOA, design patterns, multi-threading, and network programming techniques alongside AWS S3, SNS, SQS, and dbt (data build tool).

Experienced with AWS batch processing of data sources using Apache Spark.

Developed a predictive analytics model using Apache Spark's Scala APIs to process data stored in Amazon S3, and utilized AWS SNS (Simple Notification Service) and SQS (Simple Queue Service) for data ingestion and messaging within the analytics pipeline.
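
Illustrative sketch of the S3 + SNS/SQS pattern described above, written as a PySpark/boto3 analogue of the Scala-based pipeline (bucket names, paths, and the topic ARN are hypothetical placeholders, not from the original resume):

    import boto3
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-analytics-sketch").getOrCreate()

    # Read source data from S3 (hypothetical path) and compute a simple aggregate.
    events = spark.read.parquet("s3://example-analytics-bucket/events/")
    daily_counts = events.groupBy("event_date").agg(F.count("*").alias("event_count"))
    daily_counts.write.mode("overwrite").parquet("s3://example-analytics-bucket/output/daily_counts/")

    # Notify downstream consumers (e.g., an SQS queue subscribed to the topic) that the run finished.
    sns = boto3.client("sns")
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:example-pipeline-events",  # hypothetical ARN
        Message="daily_counts refresh complete",
    )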

Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
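
A minimal sketch of loading nested JSON from S3 into Snowflake using snowflake-connector-python and a pre-created external stage; the connection parameters, stage, and table names are hypothetical:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",
        user="example_user",
        password="example_password",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    cur = conn.cursor()
    try:
        # Land the raw JSON into a single VARIANT column; flatten it downstream.
        cur.execute("CREATE TABLE IF NOT EXISTS RAW_EVENTS (payload VARIANT)")
        cur.execute("""
            COPY INTO RAW_EVENTS
            FROM @RAW.S3_EVENTS_STAGE/events/
            FILE_FORMAT = (TYPE = 'JSON')
        """)
    finally:
        cur.close()
        conn.close()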

Expertise in big data architectures such as Hadoop distributions (Hortonworks, Cloudera) and NoSQL systems.

Experience in development of Big Data projects using Hadoop, Hive, HDP, Flume, Storm and MapReduce open-source tools.

Experience in developing custom UDFs for Hive to incorporate methods and functionality of Python/Java into HQL (HiveQL).

Created Hive external tables and loaded data from Cassandra into them. Utilized Hive Query Language (HQL) to query and analyze the data stored in these tables.

Hands-on experience with Hadoop and related big data technologies for the storage, querying, processing, and analysis of data.

Experience developing MapReduce programs with Apache Hadoop for working with big data.

Developed Spark-based applications to load streaming data with low latency using Kafka and PySpark.
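
A minimal PySpark Structured Streaming sketch of the Kafka ingestion described above; broker address, topic, schema, and output/checkpoint paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .load())

    # Kafka delivers the value as binary; parse the JSON payload into columns.
    parsed = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/stream-output/")
             .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
             .outputMode("append")
             .start())
    query.awaitTermination()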

Leveraged Spark's distributed computing capabilities to efficiently process data in parallel across a cluster, significantly improving processing speed and performance compared to traditional batch processing approaches.

Imported millions of structured data records from relational databases into Snowflake using Snowflake data loading services for data extraction and analyzed the data.

Imported the data from relational databases to HDFS using Sqoop and processed that imported data using Spark and stored it in HDFS (Hadoop Distributed File System) in CSV format.

Hands-on experience with Snowflake utilities, SnowSQL, Snowpipe, and big data modeling techniques using Java/Python.

Proficiently utilized Ab Initio, demonstrating strong skills in data integration, transformation, and ETL processes within the insurance domain.

Comfortable implementing Spark jobs end to end with both the PySpark API and the Spark Scala API.

Developed a Spark Streaming application to pull data from the cloud into Hive tables. Used Spark SQL to process large volumes of structured data.

Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.

Assigned star-schema names to columns using Scala case classes.

Led and managed data warehousing and data integration projects; responsible for data warehousing with the help of SSIS.

Good experience in Tableau for Data Visualization and analysis on large data sets, drawing various conclusions.

Architected an enterprise-wide consumer Analytics data warehousing to facilitate onboarding, cross-selling and consumer acquisition programs.

Experience with configuration management, ServiceNow administration, and incident and problem management.

Technical Skills:

Big Data

HDFS, MapReduce, Hive, Kafka, Sqoop, Flume, Oozie, Zookeeper, NiFi, YARN, Scala, Impala, Spark SQL

Programming Languages

C, Python, Java, J2EE, PL/SQL, HiveQL, Unix/Linux shell scripts

Operating Systems

Sun Solaris, HP-UNIX, RedHat Linux, Ubuntu Linux and Windows XP/Vista/7/8

Web Technologies

HTML, DHTML, XML, AJAX, WSDL, SOAP, JQuery, CSS3

Web/Application servers

Apache Tomcat, WebLogic, JBoss, WAS

Databases

Oracle 9i/10g/11g, DB2, SQL Server, MySQL, Teradata, HBase, Cassandra, MongoDB

Tools and IDE

Eclipse, NetBeans, Toad, Maven, ANT, Hudson, Sonar, JDeveloper, Assent PMD, DBVisualizer, Ab Initio, Informatica, Talend

Version Control & CI Tools

GIT, Jenkins, GitHub

Cloud

AWS - S3, Redshift, Athena, QuickSight, EMR, EC2, Glue, AWS Lambda, Kinesis, Data Pipeline, Databricks, DataSync, Aurora DB, Kubernetes, Terraform

Project Management Tools

JIRA, Rally, Confluence

Professional Experience:

Sr. AWS Data Engineer

Client: AmTrust Group, June 2022 – Present

Dallas, TX

Responsibilities:

Architected and deployed a scalable and fault-tolerant data processing pipeline on AWS for handling high-volume data streams from multiple sources in real-time.

Developed ETL workflows using AWS Glue to efficiently extract, transform, and load telecom data into Amazon S3 and Amazon Redshift, enabling near-real-time analytics and reporting for network performance monitoring and optimization.

Implemented data transformations and enrichment using AWS Lambda functions and AWS Glue ETL jobs, ensuring the data was processed accurately and efficiently to meet business requirements.

Designed and implemented data models in Amazon Redshift for optimal data organization and querying performance, enabling complex analytics and visualization for telecom-specific use cases such as call detail record analysis and customer churn prediction.

Leveraged AWS services such as AWS EMR and Apache Spark for large-scale data processing and analysis, enabling advanced analytics and machine learning algorithms on telecom datasets to derive actionable insights.

Implemented data quality checks and validation rules using AWS Glue and AWS Lambda functions, ensuring data integrity, consistency, and adherence to business rules throughout the data pipeline.

Implemented data governance and security measures in compliance with industry standards and regulations, including data encryption at rest and in transit, access controls, and data masking techniques for sensitive telecom data.

Utilized AWS CloudFormation and Infrastructure as Code (IaC) principles to automate the provisioning and configuration of AWS resources, ensuring consistent and repeatable deployments of the telecom data processing infrastructure.

Implemented cost optimization strategies by utilizing AWS Cost Explorer, analyzing usage patterns, and leveraging spot instances and reserved capacity to optimize data processing and storage costs for the telecommunications project.

Used Tableau for data visualization and analysis of large data sets, drawing various conclusions.

Leveraged and integrated Amazon S3 (Simple Storage Service) and Amazon Redshift, which connected to Tableau for end user web-based dashboards and reports.

Used the Oozie scheduler to automate pipeline workflows and orchestrate Spark jobs. Automated all jobs that pull data and load it into Hive tables using Oozie workflows.

Worked on loading and transforming large sets of structured, semi-structured, and unstructured data. Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.

Collected data from Flume agents deployed on various servers using multi-hop flows.

Created Hive UDFs and UDAFs using Python scripts based on the given requirements.
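
A minimal sketch of a Python script used as a Hive streaming "UDF" via TRANSFORM (e.g., SELECT TRANSFORM(col1, col2) USING 'python normalize.py' AS ...); the column layout and normalization rule are hypothetical:

    import sys

    for line in sys.stdin:
        # Hive streams rows to the script as tab-separated values on stdin.
        customer_id, raw_phone = line.rstrip("\n").split("\t")
        # Example transformation: keep digits only so phone numbers compare consistently.
        normalized = "".join(ch for ch in raw_phone if ch.isdigit())
        print(f"{customer_id}\t{normalized}")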

Analyzed the data by performing Hive queries to study customer behavior.

Experience with AWS EMR (Elastic MapReduce), AWS Glue, Amazon Kinesis, S3, Lambda functions, Amazon Redshift, IAM, and QuickSight for reporting.

Working knowledge of Kubernetes on AWS, including creating new monitoring techniques using CloudWatch logs and designing reports in Amazon QuickSight.

Worked with Kubernetes and Amazon EKS to orchestrate containerized applications, optimizing resource utilization and scalability.

Created and maintained technical documentation for launching Hadoop clusters. Developed Sqoop scripts to migrate data from Oracle to the big data environment.

Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
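
An illustrative boto3 sketch of the CSV-to-S3 loading described above; the bucket name, region, local directory, and key prefix are hypothetical placeholders:

    import os
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    bucket = "example-ingest-bucket"

    # Create the bucket if it does not already exist (kept simple for this sketch).
    existing = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    if bucket not in existing:
        s3.create_bucket(Bucket=bucket)

    # Upload every CSV in a local directory under a date-style "folder" prefix.
    local_dir = "/data/exports"
    for name in os.listdir(local_dir):
        if name.endswith(".csv"):
            s3.upload_file(
                Filename=os.path.join(local_dir, name),
                Bucket=bucket,
                Key=f"landing/2024-01-24/{name}",
            )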

Developed ELT processes using AWS services to process files from Ab Initio and Google Sheets. Data preparation and transformation were performed with AWS Glue (PySpark), and the processed data was loaded into Amazon Redshift or Amazon S3 for storage, with Amazon Athena or AWS EMR (Elastic MapReduce) available for running queries and analytics on the transformed data.

Used Apache Airflow in the AWS environment to build data pipelines, employing operators such as the Bash operator, AWS EMR operators, Python callables, and branching operators.
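
A minimal Airflow DAG sketch showing the operator mix mentioned above (Bash, Python callable, branching); the DAG id, schedule, and task logic are hypothetical, and an EMR step could be added with the Amazon provider's EMR operators:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator, BranchPythonOperator

    def _count_rows(**_):
        print("counting staged rows")

    def _choose_branch(**_):
        # Return the task_id to follow; the condition here is a stand-in.
        return "full_load" if datetime.utcnow().day == 1 else "incremental_load"

    with DAG(
        dag_id="example_ingest_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        stage_files = BashOperator(task_id="stage_files", bash_command="echo staging files")
        count_rows = PythonOperator(task_id="count_rows", python_callable=_count_rows)
        branch = BranchPythonOperator(task_id="branch", python_callable=_choose_branch)
        full_load = BashOperator(task_id="full_load", bash_command="echo full load")
        incremental_load = BashOperator(task_id="incremental_load", bash_command="echo incremental load")

        stage_files >> count_rows >> branch >> [full_load, incremental_load]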

Extensively worked with Avro and Parquet files and converted data between the two formats. Parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
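
A minimal PySpark sketch of the JSON-to-Parquet conversion described above; the input and output paths are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet-sketch").getOrCreate()

    # Spark infers the schema of semi-structured JSON (nested fields become structs).
    df = spark.read.json("s3://example-bucket/raw/events-json/")

    # The DataFrame can then be written straight back out in columnar Parquet format.
    df.write.mode("overwrite").parquet("s3://example-bucket/curated/events-parquet/")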

Transformed Kafka-loaded data using Spark Streaming with Scala and Python.

Utilized scikit-learn, Pandas, NumPy, and TensorFlow for data analysis and predictive modeling. Developed and trained an ML model to predict customer churn and identify potential high-value customers for targeted marketing campaigns. Developed and deployed a predictive model to analyze network traffic patterns, detect anomalies, and optimize network performance in real time.
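
A minimal scikit-learn sketch of a churn-prediction workflow like the one described above; the feature columns, label name, and CSV path are hypothetical:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customer_features.csv")  # hypothetical feature extract
    features = ["tenure_months", "monthly_charges", "support_calls", "data_usage_gb"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["churned"], test_size=0.2, random_state=42
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate with AUC, a common choice for imbalanced churn labels.
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))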

Developed the framework for the dashboard using Tableau and optimized its performance using open-source optimization tools compatible with AWS.

Created Airflow Scheduling scripts in Python to automate the process of Sqooping a wide range of data sets.

Involved in file movements between HDFS and AWS S3, and worked extensively with S3 buckets in AWS.

Converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.

Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.

Environment: AWS, S3, Redshift, Glue, ETL, AWS Lambda, AWS CloudFormation, Amazon Kinesis, AWS IAM, AWS QuickSight, AWS Athena, AWS Data Pipeline, Hadoop, Data Lake, HDFS, Hive, Kubernetes, Tableau, Spark, Scala, Kafka, Looker, AWS EC2 and EMR, Databricks, Airflow, Python.

AWS Data Engineer

Client: Ally Financial, Detroit, Michigan, Dec 2017 – Dec 2021

Responsibilities:

Designed database schema and wrote complex SQL queries for inserting, reading, updating and deleting data in Oracle 10g database.

Utilized Ab Initio to develop and implement batch jobs for extracting data from different upstream sources and pushing it into the centralized repository.

Successfully executed migration projects, transferring on-premise data (Oracle/Teradata) to Amazon S3 using AWS Glue and AWS DataSync.

Demonstrated expertise in designing infrastructure for efficient extraction, transformation, and loading of data from diverse data sources using AWS Glue and AWS Glue ETL jobs.

Automated jobs in AWS Data Pipeline using various triggers such as time-based schedules and event-based triggers.

Automated infrastructure provisioning and management using Terraform, resulting in improved efficiency and reduced manual errors.

Performed ETL on data from different formats like CSV, TXT, JSON, Parquet.

Created and maintained numerous data pipelines in AWS using AWS Glue and AWS Data Pipeline, utilizing activities such as S3 data movement, data transformation, and Lambda functions for optimal data flows, and performing complex transformations and manipulations with PySpark.
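
A skeleton AWS Glue PySpark job in the spirit of the pipelines described above; the Data Catalog database, table, column, and output path names are hypothetical:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Glue Data Catalog, transform with Spark, and write back to S3 as Parquet.
    dyf = glue_context.create_dynamic_frame.from_catalog(database="example_db", table_name="raw_orders")
    df = dyf.toDF().withColumn("order_date", F.to_date("order_ts"))
    df.write.mode("overwrite").parquet("s3://example-curated-bucket/orders/")

    job.commit()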

Provisioned and managed Amazon EMR clusters for both batch and continuous streaming data processing, including the installation of necessary libraries and dependencies.

Played a key role in optimizing data processing and storage costs by implementing data lifecycle management strategies, including data partitioning, columnar storage, and data archiving techniques in Amazon Redshift, resulting in significant cost savings for the project.

Led and executed large-scale migration projects, ensuring seamless data center to cloud transitions with minimal downtime.

Collaborated closely with cross-functional teams, including network engineers, data scientists, and business stakeholders, to understand their requirements and deliver data engineering solutions that meet their needs.

Performed the migration of large data sets to Amazon S3 and Amazon Redshift, created and administered clusters, configured data pipelines, and loaded data from S3 to Redshift using AWS Data Pipeline.

Created various data pipelines to load the data from Amazon S3 into Staging databases and then to Amazon RDS/Aurora databases.

Created Amazon EMR notebooks to streamline and curate the data for various business use cases and also mounted S3 buckets on EMR.

Utilized AWS Step Functions to build workflows to schedule and automate batch jobs by integrating Lambda functions, AWS Glue ETL jobs, and other services like SQS, SNS, etc.
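
A minimal boto3 sketch for kicking off the kind of Step Functions workflow described above; the state machine ARN, execution name, and input payload are hypothetical:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example-batch-etl",  # hypothetical
        name="batch-run-2024-01-24",
        input=json.dumps({"run_date": "2024-01-24", "glue_job": "example_transform_job"}),
    )
    print(response["executionArn"])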

Worked extensively on AWS Glue including data transformations, creating and managing AWS Glue ETL jobs, AWS Glue Crawlers, and migrating data to higher environments using CloudFormation templates.

Leveraged the Ab Initio tool for ETL processing of raw data collected from various sources.

Automated data cleansing using UNIX scripting, coupled with proficient management of Ab Initio jobs to effectively process multiple XML files.

Successfully completed the migration of the entire dataset from Oracle to Oracle Exadata Database.

Implemented optimization strategies in Spark code to improve job execution time efficiency.

Wrote SQL queries to retrieve data from the database using JDBC.

Environment: AWS, Ab Initio, Agile, AWS Glue, AWS DataSync, AWS Data Pipeline, Amazon S3, Amazon Redshift, Amazon EMR, AWS Step Functions, AWS Lambda, Amazon RDS/Aurora, CloudFormation, PySpark, JDBC, Python, Talend, Oracle SQL, Spark.

Data Engineer

Client: Accenture, Dec 2012 – Oct 2017

Hyderabad, India

Responsibilities:

Worked on database modeling, data mapping, and logical and physical database design using Erwin. Designed star and snowflake schemas.

Involved in analyzing, designing, building, and testing OLAP cubes with SSAS 2008 using MDX. Created various dimensional hierarchies, KPIs, measures, and calculations as part of cube design.

Built Complex Stored Procedures for writing SSIS Package Logs into Physical Database.

Took part in the Normalization and Denormalization of databases.

Successfully completed the migration of the entire dataset from Oracle to Oracle Exadata Database.

Developed Bash shell scripts to automate the scheduling of Ab Initio jobs for efficient file handling and FTP processes, resulting in reduced job run time.

Worked with a team in developing interfaces using Java.

Collaborated closely with cross-functional teams, including network engineers, data scientists, and business stakeholders, to understand their requirements and deliver data engineering solutions that meet their needs.

Worked with database administrators and developed high performing ETL and maintained consistent development functions.

Worked on Performance by tuning the SQL scripts and SSIS packages and solutions to improve the overall execution time.

Created and modified the indexes on the tables and views for the improvement in the performance of SQL Stored procedures.

Created Proxy accounts, Jobs and scheduled for SSIS packages on the UAT and production server.

Developed ETL solutions and SQL scripts with error handling and error logging.

Wrote SQL queries to retrieve data from the database using JDBC.

Involved in ETL architecture enhancements to increase performance using the query optimizer.

Developed T-SQL queries, stored procedures, cursors, triggers, and views, and added/changed tables for data loading, transformation, and extraction.

Environment: Ab Initio, MS SQL Server 2012, 2008 R2/2000, DTS (Data Transformation Services), Master Data Services, MDM, Windows, SQL, DATA TOOLS, T-SQL, VB.net, Java, Visual Studio 2013/2010/2008, XML, XSLT, MS Office and Visual source safe, Requisite Pro, TFS, MS Excel, and MS Visio.


