

Igor Lavin

Big Data Engineer

Phone:

619-***-****

Email:

adum0b@r.postjobfree.com

Profile Summary

8+ years of expertise with data analytics, Hadoop, AWS, cloud systems, and big data tools and frameworks.

Experienced Data Engineer who mentors team members and acts as liaison between the team and stakeholders, business units, and data scientists/analysts, ensuring all teams collaborate smoothly.

Skillful in major vendor Hadoop distributions such as Cloudera and Hortonworks.

Creation of UDFs in Python and Scala.

Contribute to Agile/Scrum processes such as Sprint Planning, Backlog management, Sprint Retrospectives, and Requirements Gathering, and provide planning and documentation for projects.

Perform incremental loads of data from databases, with partitioning and bucketing in Hive and Spark SQL as required for optimization.

Expertise in collecting log data from numerous sources and integrating it into HDFS.

Expertise in working with multi-node Hadoop and Spark clusters.

Familiar with developing Oozie workflows for scheduling and orchestrating ETL processes.

Develop Python code with Spark SQL to perform transformations and actions on data residing in Hive (a minimal sketch follows this summary).

Used Spark Streaming to divide streaming data into micro-batches for processing.

Robust hands-on expertise in the Hadoop framework and ecosystem (e.g., HDFS design, Hive, Sqoop, HBase, MongoDB, Cassandra, Oozie, Spark RDDs, Spark DataFrames, Spark Datasets, Spark MLlib).

Worked on disaster recovery for Hadoop clusters.

Extensively used Apache Kafka to gather server log data, processing it with a Spark consumer to find error messages across the cluster.

Skilled with BI tools like Tableau and Power BI, data interpretation, modeling, data analysis, and reporting, with the ability to help direct planning based on insights.
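
As an illustration of the Spark SQL work on Hive data noted above, the following is a minimal PySpark sketch: it reads a Hive table, aggregates it, and writes the result back partitioned for query-time pruning. The database, table, and column names are hypothetical.

# Minimal sketch: Spark SQL transformation on data residing in Hive,
# written back partitioned by date. Names (raw_db.transactions,
# event_date, customer_id, amount) are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-transform-sketch")
    .enableHiveSupport()   # lets Spark SQL see the Hive metastore
    .getOrCreate()
)

# Aggregate the raw Hive table with Spark SQL
daily_totals = spark.sql("""
    SELECT event_date, customer_id, SUM(amount) AS total_amount
    FROM raw_db.transactions
    GROUP BY event_date, customer_id
""")

# Write back to Hive, partitioned by event_date so later queries can prune partitions
(
    daily_totals.write
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("curated_db.daily_customer_totals")
)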

Technical Skills

●Big Data:

oApache: Hadoop, YARN, Hive, Kafka, Oozie, Spark, Spark Streaming, Tez, Zookeeper, HiveQL, Cassandra, Nifi.

oHDFS, Hortonworks, MapReduce, HBase, Sqoop, Airflow.

●Cloud: AWS – EC2, EMR, RDS Aurora, Redshift, CloudWatch, CloudFormation, Lambda, S3 & Glacier, DynamoDB.

●Languages: Java, Python, Shell scripting, Scala.

●Operating Systems: Unix/Linux, Windows 10, Ubuntu, Mac OS.

●Databases: MySQL, PostgreSQL, Oracle.

●Data Stores: Data Lake, Data Warehouse, RDBMS/SQL databases, NoSQL databases, Amazon Redshift, Apache Cassandra, MongoDB, MySQL, Oracle

●Data Pipelines/ETL: Flume, Airflow, Apache Spark, Nifi, Apache Kafka

●Hadoop Distributions & Cloud Platforms: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS)

●Programming & Scripting: Spark, Python, Java, Scala, Hive, Kafka, SQL

●Databases: Cassandra, HBase, Redshift, DynamoDB, MongoDB, MS Access, MySQL, Oracle, PL/SQL, SQL, RDBMS

●Database Skills: Database partitioning, database optimization, building communication channels between structured and unstructured databases.

Work Experience

Big Data Engineer

Choice Hotel

San Diego, CA (Remote)

March 2022 – Present

Environment: AWS

Technologies: Redshift, Glue, Airflow, Tableau

Project Synopsis: Migration to AWS while retiring legacy systems

Responsibilities:

Refactored Airflow DAGs and Glue jobs.

Developed tables and views in Redshift to fit the client’s needs.

Imported data into Redshift from Vertica and the EDW.

Worked with clients to remedy malformed data.

Used Tableau for BI.

Developed Glue jobs and orchestrated them via Airflow for nightly processing (a minimal sketch follows this role’s responsibilities).

Hands-on with AWS data migration between database platforms, moving local SQL Server databases to Amazon RDS and EMR Hive.

Optimized Hive analytics SQL queries, created tables and views, and wrote custom queries and Hive-based exception-handling processes.

Designed and developed data pipelines in an AWS environment using S3 storage buckets.

Applied transformations using AWS Lambda functions triggered by events in an event-driven architecture.

Used AWS RDS Aurora to store historical relational data.

Used AWS Athena for data profiling and infrequent queries.

Created AWS Glue jobs from scratch to transform large volumes of data on a schedule.

Managed the ETL process of the pipeline using tools such as Alteryx and Informatica.

Utilized AWS Redshift to store large volumes of data in the cloud.

Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.

Used Airflow with Bash and Python operators to automate pipeline processes.

Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.

Worked on AWS to create and manage EC2 instances and Hadoop clusters.

Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

Finalized the data pipeline using Redshift as the AWS data storage option.
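
A minimal sketch of the nightly Airflow orchestration of Glue jobs described in this role, assuming the apache-airflow-providers-amazon package is installed; the DAG id, Glue job names, schedule, and region are illustrative only.

# Minimal sketch: an Airflow DAG that triggers two existing AWS Glue jobs
# nightly, staging data first and then loading Redshift marts.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="nightly_glue_processing",
    start_date=datetime(2022, 3, 1),
    schedule_interval="0 2 * * *",   # nightly at 02:00
    catchup=False,
) as dag:

    stage_raw = GlueJobOperator(
        task_id="stage_raw_data",
        job_name="stage_raw_data_job",        # existing Glue job (hypothetical name)
        region_name="us-west-2",
        wait_for_completion=True,
    )

    load_marts = GlueJobOperator(
        task_id="load_redshift_marts",
        job_name="load_redshift_marts_job",   # existing Glue job (hypothetical name)
        region_name="us-west-2",
        wait_for_completion=True,
    )

    stage_raw >> load_marts                   # staging runs before the Redshift load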

Big Data Developer

State Street Global

Boston, MA

October 2021 – March 2022

●Helped map a project data pipeline architecture.

●Used AWS Step Functions to run a data pipeline.

●Developed Airflow DAGs to track errors when a script is run, capturing them and storing them in a database (see the sketch after this role’s bullets).

●Configured an AWS CloudFormation template architecture for Redshift event notification to SNS.

●Performed SQL table view conversion to Splunk Queries.

●Developed an Airflow DAG to data-mine logs.

●Set up an Airflow server on an EC2 instance, configured airflow.cfg, and added Postgres credentials to use a Postgres database for Airflow metadata.

●Used Jira for project task tracking.

●Handled versioning with Git and set up a Jenkins CI to manage CI/CD practices.

●Customized Kibana for dashboards and reporting to provide visualization of log data and streaming data.

●Developed AWS CloudFormation templates to create a custom infrastructure of the pipeline.

●Applied AWS Secrets Manager for secure and convenient storage and retrieval of API keys, passwords, certificates, and other sensitive data across the cloud.
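
A minimal sketch of the error-tracking pattern described in this role: an Airflow DAG whose failure callback records the failed task and exception in a Postgres table. The connection id, table name, and failing script are hypothetical, and the apache-airflow-providers-postgres package is assumed.

# Minimal sketch: capture task errors from a DAG run and store them in Postgres.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def log_failure_to_db(context):
    """Insert the failed dag/task ids and exception text into an audit table."""
    hook = PostgresHook(postgres_conn_id="error_audit_db")  # hypothetical connection id
    hook.run(
        "INSERT INTO task_errors (dag_id, task_id, error, logged_at) "
        "VALUES (%s, %s, %s, NOW())",
        parameters=(
            context["dag"].dag_id,
            context["task_instance"].task_id,
            str(context.get("exception")),
        ),
    )


def run_script():
    # Placeholder for the real processing script; raising here exercises the callback.
    raise RuntimeError("script failed")


with DAG(
    dag_id="script_error_tracking",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_script",
        python_callable=run_script,
        on_failure_callback=log_failure_to_db,
    )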

Cloud Data Engineer

Allstate Insurance

San Jose, CA

March 2019 – October 2021

●Populated a Data Lake using AWS Kinesis from various data sources such as S3.

●Used Amazon EMR for processing Big Data implementing tools like Hadoop, Spark, and Hive.

●Authored AWS Lambda functions to run Python scripts in response to events in S3 (a minimal sketch follows this role’s bullets).

●Created AWS CloudFormation templates to provision infrastructure in the cloud.

●Implemented optimizations in Spark nodes and improved the performance of the Spark Cluster.

●Processed multiple terabytes of data stored in S3 using AWS Redshift and AWS Athena.

●Designed, built, and maintained a database to analyze the life cycle for checking transactions.

●Developed ETL jobs in AWS Glue to extract data from S3 buckets and load it into the data mart in Amazon Redshift.

●Implemented and maintained EMR, Redshift pipeline for data warehousing.

●Used Spark, Spark SQL, and Spark Streaming for data analysis and processing.

●Analyzed existing SQL scripts and designed the solution to implement them using PySpark.

●Hands-on application with Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.

●Implemented Spark using Scala and SparkSQL for faster testing and processing of data.

●Worked under an Agile methodology and contributed to the creation of user stories.

●Worked with AWS Lambda functions for event-driven processing using AWS boto3 module in Python.

●Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets.

●Performed exploratory data analysis in Python using Pandas.

●Developed AWS CloudFormation templates to create custom infrastructure of our pipeline.
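
A minimal sketch of the S3-triggered Lambda pattern noted in this role: a Python handler invoked by an S3 ObjectCreated event reads each new object’s metadata and tags it. The bucket contents and the tagging step are illustrative; the handler relies only on the boto3 client bundled with the Lambda runtime.

# Minimal sketch: Lambda handler driven by S3 ObjectCreated events.
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Log and tag every newly created S3 object in the triggering event."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Example processing step: inspect the object and mark it as ingested
        head = s3.head_object(Bucket=bucket, Key=key)
        print(f"New object s3://{bucket}/{key}, size={head['ContentLength']} bytes")

        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "status", "Value": "ingested"}]},
        )

    return {"processed": len(records)}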

AWS Big Data Engineer

Cirrus Logic

Austin, TX

November 2016 – March 2019

●Configured inbound and outbound traffic access for RDS database services, DynamoDB tables, and EBS volumes, and set alarms for notifications or automated actions on AWS.

●Developed AWS CloudFormation templates to create a custom infrastructure of our pipeline.

●Implemented AWS IAM user roles and policies to authenticate and control access.

●Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.

●Worked on AWS Kinesis for processing huge amounts of real-time data.

●Developed multiple Spark Streaming and batch Spark jobs using Scala on AWS EMR, with RDS, CloudFormation, AWS IAM, and Security Groups in public and private subnets in a VPC.

●Worked on streaming the analyzed data into Hive tables using Spark to make it available for visualization and report generation by the BI team.

●Worked with AWS Lambda functions for event-driven processing to various AWS resources.

●Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

●Automated AWS components like EC2 instances, Security Groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.

●Installed, configured, and managed tools such as ELK and AWS CloudWatch for resource monitoring.

●Implemented security measures which AWS provides, employing key concepts of AWS Identity and Access Management (IAM).

●Developed Python scripts for data ingestion and performed ETL and processing phases using Apache Hive, Spark (PySpark), and Spark SQL.

●Created multiple batch Spark jobs in Scala, migrated from Spark Java.

●Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.

●Ingested data through AWS Kinesis Data Streams and Firehose from various sources to S3 (a sketch follows this role’s bullets).

●Responsible for Designing Logical and Physical data modeling for various data sources on AWS Redshift.

●Configured and managed Google Cloud Platform (GCP).

●Utilized automated SQL queries to pull data from an Azure SQL database.
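
A minimal sketch of the Kinesis ingestion noted above: records are pushed to a Kinesis Data Stream with boto3, and a Firehose delivery stream attached to it would then land the data in S3. The stream name, partition key, and payload shape are hypothetical.

# Minimal sketch: write JSON events to a Kinesis Data Stream.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def send_event(event: dict, stream_name: str = "ingest-stream") -> None:
    """Put a single JSON event onto the (hypothetical) Kinesis stream."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("device_id", "default")),
    )


if __name__ == "__main__":
    send_event({"device_id": 42, "metric": "temperature", "value": 21.7})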

Hadoop Data Engineer

Tata

New York, NY

June 2013 – November 2016

●Involved in a project to build out managed cloud infrastructure with improved systems and analytics capability. The project involved numerous POCs and research to design an innovative big data system for advanced data management and analytics for customers.

●Involved in meetings with a cross-functional team of key stakeholders to derive functional specifications, requirements, use cases, and project plans.

●Involved in researching various available technologies, industry trends and cutting-edge applications.

●Designed and set-up POCs to test various tools, technologies, and configurations, along with custom applications.

●Used instance image files to create new instances with Hadoop installed and running.

●Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.

●Connected various data centers and transferred data between them using Sqoop and various ETL tools.

●Extracted data from RDBMSs (Oracle, MySQL) into Hive tables via Sqoop incremental loads using Sqoop jobs.

●Used the Hive tables to verify the data stored in the Hadoop cluster.

●Worked with the client to reduce churn rate by reading and translating data from social media websites, using Flume to ingest data from Twitter.

●Collected the business requirements from subject matter experts like data scientists and business partners.

●Involved in the Design and Development of technical specifications using Hadoop technologies.

●Loaded and transformed large sets of structured, semi-structured, and unstructured data.

●Used different file formats such as text files, SequenceFiles, and Avro.

●Loaded data from various data sources into HDFS using Kafka (see the sketch at the end of this role).

●Tuned and operated Spark and related technologies such as Spark SQL.

●Used shell scripts to dump the data from MySQL to HDFS.

●Used NoSQL databases like MongoDB for implementation and integration.

●Used Oozie to automate and schedule business workflows that invoke Sqoop and Hive jobs.
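
A minimal sketch of landing a Kafka topic into HDFS, here shown with Spark Structured Streaming rather than a hand-rolled consumer. Broker addresses, the topic, and the HDFS paths are illustrative, and the spark-sql-kafka connector is assumed to be on the classpath.

# Minimal sketch: stream a Kafka topic into HDFS as Parquet files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Read the raw Kafka stream (key/value arrive as binary columns)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "app_logs")            # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
    .select(
        F.col("value").cast("string").alias("raw_event"),
        F.col("timestamp"),
    )
)

# Append to HDFS, with a checkpoint directory for fault tolerance
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/app_logs")
    .option("checkpointLocation", "hdfs:///checkpoints/app_logs")
    .outputMode("append")
    .start()
)

query.awaitTermination()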

Linux System Administrator

S.C. Johnson

Racine, WI

February 2011 – June 2013

●Performed systems administration tasks on Linux-based systems.

●Configured DNS and DHCP on clients’ networks.

●Built, installed, and configured servers from scratch with Red Hat Linux.

●Performed Red Hat Linux Kickstart installations on RHEL 4.x/5.x, along with kernel tuning and memory upgrades.

●Installed, configured, and troubleshot Solaris, RHEL Linux, HP-UX, and AIX operating systems.

●Provided technical support services across the enterprise.

●Created database tables with various constraints for clients accessing FTP.

●Applied OS patches and upgrades on a regular basis and upgraded administrative tools and utilities and configured and added new services.

●Installed and configured Apache, Tomcat, WebLogic, and WebSphere applications.

●Delivered remote system administration using tools such as Telnet, SSH, and Rlogin.

Education

Bachelor of Science - Applied Mathematics and Statistics - Stony Brook University

Certification

AWS Certified Cloud Practitioner (Issue Date: April 23, 2021; Expiration Date: April 23, 2024)


