

Krishna

AWS Cloud Senior Big Data Engineer

AWS Engineer with 15 years of overall IT experience, including 9+ years in Big Data, Hadoop, and AWS Cloud; targeting assignments with an organization of repute.

Email: adznjg@r.postjobfree.com Phone: 470-***-****

Profile Summary

Experience working on Big Data systems, ETL pipelines, and real-time analytics systems, including slicing and dicing OLAP cubes and drilling into tabular models.

Achievement-driven professional skilled in databases, data management, analytics, data processing, data cleansing, data modeling, and data-driven projects.

Developed a technical plan and roadmap, wrote scalable CloudFormation scripts, and documented the system design.

Developed and deployed a CloudFormation script to provision a secure Virtual Private Cloud (VPC) that housed a multi-tier infrastructure.

Skilled at building and deploying CloudFormation scripts to provision a secure and reliable multi-tier infrastructure that included a Network Address Translation (NAT) gateway, Internet Gateway (IGW), Elastic Load Balancer (ELB), Auto Scaling Group (ASG), Security Groups (SG), DynamoDB, and Elastic Compute Cloud (EC2) instances.
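
For illustration only, a minimal boto3 sketch of how such a multi-tier stack can be launched from a template (file, stack, and parameter names are hypothetical, not the actual project scripts):

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Assumed template file describing the VPC, NAT/IGW, ELB, ASG, SGs, DynamoDB, and EC2 resources.
with open("multi_tier_vpc.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="multi-tier-vpc",                      # hypothetical stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],           # required when the template creates IAM resources
    Parameters=[{"ParameterKey": "EnvironmentName", "ParameterValue": "prod"}],
)
cfn.get_waiter("stack_create_complete").wait(StackName="multi-tier-vpc")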

Built and deployed Terraform configurations to provision a complete Virtual Private Cloud (VPC) that hosted a web application.

Expertise in troubleshooting and fixing issues with scripts and processes.

Extensive experience with Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.

Experienced with Amazon Web Services (AWS) such as Elastic Compute Cloud (EC2) instances, Identity and Access Management (IAM), DynamoDB, Lambda, Load Balancer, Auto Scaling Groups, CloudWatch, Systems Manager (SSM), and Virtual Private Cloud (VPC) with all of its components: subnets, route tables, Internet Gateway (IGW), Network Address Translation (NAT) gateway, Network Access Control Lists (NACLs), VPC flow logs, Security Groups (SG), VPC endpoints, and Application Load Balancer (ALB).

Worked with various file formats (Parquet, Avro, and JSON) and compression codecs (Snappy and Gzip).

Deployed large multi-node Hadoop and Spark clusters.

An avid communicator with strong analytical and interpersonal skills.

Education & Certifications

AWS Certified Practitioner.

Bachelor of Science in Business Administration from the University of the People.

Key Skills

AWS Cloud, Big Data Systems, CloudFormation Scripts, Terraform Scripts, Project Management, Data Visualization, SQL Server Database, Project Execution, Documentation.

Soft Skills

Adaptable, Communicator, Strategic Thinker, Collaborative, Team Player, Problem-Solving.

Technical Skills

APACHE: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache MAVEN, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapReduce.

SCRIPTING: HiveQL, MapReduce, XML, FTP, Python, UNIX, Shell scripting, LINUX.

OPERATING SYSTEMS: Unix/Linux, Windows 10, Ubuntu, Apple OS.

FILE FORMATS: Parquet, Avro & JSON, ORC, Text, CSV.

DISTRIBUTIONS: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK).

DATA PROCESSING (COMPUTE) ENGINES: Apache Spark, Spark Streaming, Apache Flink, Apache Storm, AWS EC2.

DATA VISUALIZATION TOOLS: Pentaho, QlikView, Tableau, Power BI, matplotlib.

DATABASE: Microsoft SQL Server Database (2005, 2008R2, 2012), Database & Data Structures, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB.

SOFTWARE: Microsoft Project, Primavera P6, VMWare, Microsoft Word, Excel, Outlook, PowerPoint; Technical Documentation Skills.

BIG DATA: Hadoop, Hive, Flume, Sqoop, Airflow, Nifi, Spark, Spark Streaming, HBase, Pig, Yarn, Kafka, Zookeeper.

Work Experience

Sr. Big Data Engineer

NFL, New York, NY, USA, April 2023 to Present

As a dedicated member of the ingestion team, I played a pivotal role as a Python Engineer and Developer.

Spearheading the development of end-to-end data ingestion pipelines from scratch, demonstrating expertise in Python coding and engineering practices.

Collaborating with a talented cross-functional group of professionals, including three developers, a project manager and delivery team lead, a QA specialist, and a Data Analyst, to ensure comprehensive project execution.

Designing, developing, and implementing large-scale, distributed data processing systems using Hadoop, Spark, and related technologies, enabling efficient data analysis and insights generation.

Rectifying issues encountered in existing pipelines and constructing new pipelines from scratch, employing Python as our primary programming language.

Managing seamless transfer of files from the Vendor Drop zone to the raw zone, where the original files were deposited without any transformations.

Orchestrating the movement of the data files to the refined zone after implementing necessary transformations and manipulations, rendering them ready for utilization by Data Analysts who would leverage the data for diverse analyses and report generation.

Gaining invaluable experience in problem-solving, Python development, and the intricate process of handling data pipelines; by collaborating closely with team members and actively contributing to the efficient movement and transformation of data, I worked towards enhancing our team's performance and overall project success.

Leveraging Python scripts to orchestrate the transfer of vendor files across various ingestion points and storage locations, facilitating accessibility for Data Analysts.

Implementing data transformations and manipulation techniques within Python scripts to ensure data cleanliness and suitability for storage in a data warehouse.

Using Git for robust source control, creating dedicated branches for tasks, and merging changes to the main branch.

Facilitating the production deployment process through pull requests.

Employing AWS Glue and crawler functionalities to create tables that seamlessly integrated with the data pipelines.

Developed multiple programs using Scala and deployed them on the Yarn cluster, comparing the performance of Spark with Hive and SQL.

Validating data accuracy and compliance with requirements by leveraging AWS Athena for querying and result verification.
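
As an illustration of this validation pattern, a minimal boto3 Athena sketch (the database, table, and S3 output location are hypothetical):

import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM refined_db.vendor_events WHERE load_date = DATE '2023-09-01'",
    QueryExecutionContext={"Database": "refined_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows for verification.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]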

Creating data process workflows using Spark, Hive, and Pig, ensuring efficient and accurate data analysis.

Automating, configuring, and deploying instances on AWS and Data centers.

Implementing secure storage of AWS credentials by utilizing parameter store, eliminating hardcoded values in Python scripts, and enhancing security practices.
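
A minimal sketch of this pattern, assuming a hypothetical parameter name stored as a SecureString in SSM Parameter Store:

import boto3

ssm = boto3.client("ssm")

# WithDecryption=True lets Parameter Store decrypt SecureString values, so no credential is hardcoded.
db_password = ssm.get_parameter(
    Name="/etl/redshift/password",      # hypothetical parameter name
    WithDecryption=True,
)["Parameter"]["Value"]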

Collaborating within an agile environment, utilizing Jira and Confluence to manage tasks.

Actively participating in two-week sprints, meticulously tracking progress, and providing comprehensive reports during daily stand-up meetings.

Facilitating communication and coordination with project stakeholders, attending meetings to gather requirements and align project details.

Involved in splitting JSON files into Data Frames to be processed in parallel for better performance and fault tolerance.

Developing Spark programs using PySpark and maintaining the data warehouse in AWS Redshift.
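
For illustration, a minimal PySpark sketch of this flow (S3 paths, key column, and Redshift connection details are hypothetical; a dedicated Spark-Redshift connector is another common option):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("vendor-json-ingest").getOrCreate()

# Read the vendor JSON drop and repartition so transformations run in parallel across executors.
raw = spark.read.json("s3://example-raw-zone/vendor/*.json").repartition(64)

refined = (raw.dropDuplicates(["record_id"])
              .withColumn("load_ts", F.current_timestamp()))

# Land the refined copy in S3 and append it to Redshift over a generic JDBC connection.
refined.write.mode("overwrite").parquet("s3://example-refined-zone/vendor/")
(refined.write.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")
    .option("dbtable", "public.vendor_events")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())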

Engaging with Senior Managers and top-level executives to provide progress updates and ensure alignment with organizational goals and expectations.

Sr. Big Data Engineer

Mediaocean, New York, NY, USA, February 2021 to March 2023

Managed complete Big Data flow from data ingestion to its destination in Redshift.

Designed and implemented a scalable, secure, and cost-effective cloud-based solution to process and analyze massive volumes of data on AWS using AWS Kinesis, AWS Glue, EMR, Spark, Redshift, and QuickSight.

Implemented end-to-end solutions using DevOps CI/CD tooling to deploy Big Data pipelines.

Deployed EC2 instances and performed auto scaling, snapshot backups, and template management.

Developed ETLs to pull data from various sources and transform it for reporting applications using PL/SQL.

Developed AWS CloudFormation templates to create a custom infrastructure for our pipeline.

Performed performance tuning of Spark programs for different source system domains and loaded the output into a harmonized layer.

Performed data engineering with AWS cloud services, including AWS cloud services planning and design and DevOps support such as IAM user, group, role, and policy management.

Created modules for Apache Airflow to call different services in the cloud, including EMR, S3, Athena, Glue crawlers, Lambda functions, and Glue jobs.
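
An illustrative Airflow DAG showing this pattern with boto3 calls wrapped in PythonOperators (crawler and job names are hypothetical, not the production modules):

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

def start_crawler():
    boto3.client("glue").start_crawler(Name="refined_zone_crawler")      # hypothetical crawler

def start_glue_job():
    boto3.client("glue").start_job_run(JobName="vendor_transform_job")   # hypothetical Glue job

with DAG(
    dag_id="aws_services_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    crawl = PythonOperator(task_id="start_crawler", python_callable=start_crawler)
    transform = PythonOperator(task_id="start_glue_job", python_callable=start_glue_job)
    crawl >> transform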

Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.

Contributed to loading data from the UNIX file system to HDFS.

Involved in running Hadoop jobs for processing millions of records and data which were updated daily/weekly.

Monitored resources, such as Amazon DB Services, CPU, Memory, and EBS volumes.

Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift as storage.

Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3.
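
A minimal sketch of an S3-triggered Lambda handler (the bucket layout and copy action are hypothetical):

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Each record in an S3 event notification identifies one created or updated object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Example action: stage the object under a processing prefix for downstream jobs.
        s3.copy_object(
            Bucket=bucket,
            Key=f"processing/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )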

Decoded raw data and loaded it into JSON before sending the batched streaming file over to the Kafka producer.

Specified nodes and performed data analysis queries on Amazon Redshift Clusters on AWS.

Mirrored live stream data from DB2 to HBase tables using Spark Streaming and Apache Kafka.

Optimized data transfer over the network using compression codecs such as Snappy.

Used ETL to transfer the data from the target database to HDFS to send it to the MicroStrategy reporting tool.

Sr Cloud/Sr Data Engineer

Alcon, Fort Worth, TX, June 2019 to January 2021

Worked extensively with AWS including doing Docker deployments on ECS.

Executed Hadoop/Spark jobs on AWS EMR using programs, data stored in S3 Buckets, and ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.

Transformed and analyzed the data using PySpark and Hive based on ETL mappings.

Worked with Amazon Web Services (AWS) and was involved in ETL, Data Integration, and Migration.

Documented the requirements including the available code which should be implemented using Spark, Amazon DynamoDB, Redshift, and Elastic Search.

Implemented Spark using Python and Spark SQL for faster testing and processing of data.

Imported and exported data into HDFS and Hive using Sqoop.

Used Spark SQL for creating and populating the HBase warehouse.

Worked with SparkContext, Spark SQL, DataFrames, and pair RDDs.

Extracted data from different databases and scheduled Apache Airflow workflows to execute the task daily.

Developed PySpark applications to read data from various file system sources, apply transformations, and write to a NoSQL database.

Worked on Hadoop, Spark, and similar frameworks, NoSQL, and RDBMS databases including Redis and MongoDB.

Attended meetings with managers to determine the company's Big Data needs and developed Hadoop systems.

Finalized the scope of the system and delivered Big Data solutions.

Collaborated with the software research and development teams and built cloud platforms for the development of company applications.

Trained staff in data resource management.

Collected data using REST APIs, built HTTPS connections with client servers, sent GET requests, and collected responses in a Kafka producer.
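
A rough sketch of that collection flow, assuming the kafka-python client, a hypothetical endpoint, and a hypothetical topic name:

import json

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Fetch the payload over HTTPS and forward each record to Kafka.
resp = requests.get("https://api.example.com/v1/events", timeout=30)
resp.raise_for_status()

for record in resp.json():
    producer.send("raw-events", value=record)

producer.flush()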

Imported data from web services into HDFS and transformed data using Spark.

Worked on AWS Kinesis for processing huge amounts of real-time data.

Sr Big Data Engineer

J.D. Power, Westlake Village, California, December 2017 to May 2019

Created Spark Scala Datasets in Azure Databricks, defining the schema with Scala case classes.

Executed long-running jobs for pre-processing product and warehouse data in Azure Synapse Analytics to cleanse and prepare the data before consumption.

Used Azure Stream Analytics to divide streaming data into batches as input to Azure Databricks engine for batch processing.

Performed extensive data cleaning and preparation for ML modeling, as some observations were censored without any clear notification.

Set up Azure Logic Apps for pipeline automation.

Repartitioned datasets after loading Gzip files into Data Frames and improved the processing time.

Used Azure Event Hubs to process data as a Kafka Consumer.

Implemented Azure Databricks using Python and utilized Data Frames and Azure Spark SQL API for faster processing of data.

Involved in the implementation of analytics solutions through Agile/Scrum processes for development and quality assurance.

Interacted with data residing in Azure Data Lake Storage using Azure Databricks to process the data.

Populated data frames inside Azure Databricks jobs, Azure Spark SQL, and Data Frames API to load structured data into Azure Databricks clusters.

Monitored background operations in Azure HDInsight.

Captured and transformed real-time data into a suitable format for scalable analytics using Azure Stream Analytics.

Forwarded requests to source REST Based API from a Scala script via Azure Event Hubs Producer.

Sr Big Data Engineer

Merrill Edge, Atlanta, GA, July 2015 to November 2017

Implemented Spark streaming for real-time data processing with Kafka and handled large amounts of data with Spark.

Wrote streaming applications with Spark Streaming/Kafka.

Used SQL to perform transformations and actions on data residing in HDFS.

Configured various Big Data ecosystems tools such as Elastic Search, Logstash, Kibana, Kafka, and Cassandra.

Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming, and Cassandra.

Participated in various phases of data processing (collecting, aggregating, and moving from various sources) using Apache Spark.

Managed structured data via Spark SQL, then stored it into Hive tables for downstream consumption.

Analyzed and tuned the Cassandra data model for multiple internal projects/customers.

Interacted with data residing in HDFS using Spark to process the data.

Coordinated Kafka operation and monitoring with DevOps personnel.

Worked on Elasticsearch and Logstash (ELK) performance and configuration tuning; imported data from web services into HDFS using Apache Flume and Apache Kafka, and transformed the data using Spark.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.

Developed ETL pipeline to process log data from Kafka/HDFS sequence file and output to Hive tables in ORC format.
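
A batch-style sketch of that log ETL, assuming Text-valued sequence files on HDFS, a tab-delimited log layout, and a Hive-enabled SparkSession (all hypothetical):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("log-etl")
         .enableHiveSupport()
         .getOrCreate())

# Sequence files arrive as (key, value) pairs; the value carries the raw log line.
raw = spark.sparkContext.sequenceFile("hdfs:///data/landing/logs/")

def parse(line):
    ts, level, message = line.split("\t", 2)   # assumed layout: timestamp, level, message
    return (ts, level, message)

logs = raw.map(lambda kv: parse(kv[1])).toDF(["event_ts", "level", "message"])

# Persist in ORC format as a Hive table for downstream queries.
logs.write.format("orc").mode("append").saveAsTable("logs_db.app_logs")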

Responsible for designing and deploying new ELK clusters.

Defined the Spark/Python (PySpark) ETL framework and best practices for development and wrote Python code that tracks Kafka message delivery.

Used Cloudera Manager for installation and management of a multi-node Hadoop cluster.

Big Data Engineer

Publix Super Market Inc., Lakeland, FL, September 2013 to June 2015

Sourced data from APIs in JSON format and converted it to Parquet and Avro formats.

Used Kafka to ingest Data and create topics for data streaming.

Utilized Spark Streaming for data processing and creating DStreams from data received from Kafka.

Stored results of processed data in Hive.

Worked on large data warehouse Analysis Services servers and developed various analytical reports from those servers.

Wrote Hive scripts to process HDFS data and wrote shell scripts to automate workflows to pull data from various databases into the Hadoop framework for users to access the data through Hive views.

Implemented Kafka streaming to send data streams from the company APIs to HDFS, processing the data with Spark Structured Streaming.
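
A minimal Structured Streaming sketch of that ingestion path (the broker, topic, and HDFS paths are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("api-stream-ingest").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "api-events")
          .load())

# Kafka delivers bytes; cast the value to a string before persisting it.
events = stream.select(F.col("value").cast("string").alias("payload"),
                       F.col("timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/api_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/api_events/")
         .outputMode("append")
         .start())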

Developed DAG data pipelines to onboard datasets and manage dataset changes.

Scheduled jobs using Apache Airflow.

Used Kafka for publish-subscribe messaging as a distributed commit log.

Collected log data from various sources and integrated it into HDFS using Flume and staged data in HDFS for further analysis.

Developed SQL queries to insert, update, and delete data in a data warehouse.

Previous Experience

Technical Consultant

System One, Fremont, CA October 2008 to August 2013

Led a 24x5 operations team for Incidents, Major Incidents, Service Requests, and Change Management, and managed the team's performance, ensuring high availability.

Managed bridge calls with internal and external teams to resolve Incidents by being the first point of contact.

Identified Problems through evaluation of reports and problem control using RCA methods and workaround documentation.

Spearheaded governance calls with all stakeholders to report metrics measuring project status and to review process documents and updates.

Contributed towards resolving customers' technical product queries through chat and e-mail.


