
Sr. Big Data Engineer

Location:
North Philadelphia, PA, 19122
Posted:
July 05, 2023


KEVIN MONTANO

Big Data / Cloud / Hadoop Developer

Email: adx347@r.postjobfree.com; Phone: 215-***-****

PROFILE SUMMARY

•8+ years of combined experience in database/IT, Big Data, Cloud, and Hadoop

•Expertise in designing, implementing, and managing big data solutions using Hadoop-based technologies such as HDFS, Hive, HBase, and Spark

•Hands-on experience with Amazon Web Services (AWS) and cloud services such as EMR (Elastic MapReduce), EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), EBS (Elastic Block Store), and IAM (Identity and Access Management) entities and roles for building highly available, scalable, and fault-tolerant systems

•Insightful experience in data concepts and technologies, including AWS data pipelines (data flow and processing in AWS) and cloud repositories such as Amazon AWS and MapR

•Responsible for managing ETL processes from databases such as SQL Server, MySQL, Oracle into Hadoop-based storage systems (HDFS, Hive)

•Processed data in Azure Databricks for ingestion into an Azure Synapse data warehouse

•Integrated Azure SQL Database, Azure Synapse Analytics, or Azure Data Lake Storage as data storage and warehousing solutions within the data engineering pipeline.

•Implemented data partitioning, sharding, and indexing techniques for optimized data storage and retrieval, improving query performance and reducing latency.

•Data Import/Export: Experience in importing/exporting terabytes of data between HDFS/S3/Storage and Relational Database Systems using Kafka and Sqoop (ingesting tools)

•SQL and Database Development: Proficient in writing SQL queries, Stored Procedures, Triggers, Cursors, and Packages for data manipulation and processing and experience in various SQL databases such as SQL Server, Oracle, PostgreSQL, and MySQL

•Demonstrated experience in developing complex Snowflake queries, stored procedures, views, and functions.

•Demonstrated analytical insights through data visualization using Python libraries and tools such as Matplotlib, Tableau, and Power BI

•Converted Hive/SQL queries into Spark transformations using Spark RDDs (Resilient Distributed Datasets) and DataFrames (distributed collections of data organized into named columns); a minimal PySpark sketch follows this summary

•Skilled in using Spark SQL to perform data processing on data residing in Hive/Redshift Spectrum (data warehousing service)

•Wrote Hive/HiveQL scripts to extract, transform, and load data into databases, and configured Kafka clusters with ZooKeeper for real-time streaming

•Used Oozie and Airflow for orchestrating pipelines
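
Below is a minimal PySpark sketch of the Hive-to-Spark conversion mentioned in the summary above; the table and column names (sales, region, amount) are illustrative placeholders rather than details from an actual engagement, and a configured Hive metastore is assumed.

```python
# Minimal sketch: rewriting a HiveQL aggregation as Spark DataFrame transformations.
# Table and column names (sales, region, amount) are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-spark-example")
    .enableHiveSupport()  # assumes a Hive metastore is configured
    .getOrCreate()
)

# Equivalent HiveQL:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales WHERE amount > 0 GROUP BY region
sales_df = spark.table("sales")

totals_df = (
    sales_df
    .filter(F.col("amount") > 0)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

totals_df.show()
```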

TECHNICAL SKILLS

IDE:

Jupyter Notebooks, Eclipse, IntelliJ, PyCharm

PROJECT METHODS:

Agile, Test-Driven Development

BIG DATA PROCESSING:

Spark, Spark Streaming, Kafka, and Hadoop (MapReduce)

CLOUD PLATFORMS:

Amazon AWS, Microsoft Azure

DATABASE & NOSQL TOOLS:

Apache HBase, SQL, MongoDB, DynamoDB, Oracle

PROGRAMMING LANGUAGES:

Java, Python, Scala, PySpark

SCRIPTING:

Hive, MapReduce, SQL, Spark SQL, Shell Scripting

CONTINUOUS INTEGRATION (CI-CD):

Jenkins, GitLab, GitHub

FILE FORMAT AND COMPRESSION:

CSV, JSON, Avro, Parquet, ORC

ORCHESTRATION:

Oozie, Airflow

ETL TOOLS:

Apache Flume, Kafka, Sqoop, Spark, AWS Glue, Azure Data Factory

DATA VISUALIZATION TOOLS:

Tableau, Power BI

SECURITY:

Kerberos, Apache Ranger

AWS:

Glue, DynamoDB, Redshift, Lambda, S3, EC2, Kinesis, EMR, QuickSight, Athena, VPC, IAM

AZURE:

Data Factory, Databricks, SQL Database, Synapse Analytics, HDInsight, Data Lake Storage, Cosmos DB

PROFESSIONAL EXPERIENCE

Sr. Big Data Engineer

Allstate Insurance, Northfield Township, IL Jan 2022 - Present

The Allstate Corporation is an American insurance company, headquartered in Northfield Township, Illinois.

•Develop multiple Spark Streaming and batch Spark jobs in Scala and Python on AWS to extract, transform, and consolidate data from diverse record formats

•Design logical and physical data models for various data sources on AWS Redshift to ensure efficient data storage and retrieval

•Implement a schema for a custom HBase database to optimize data organization and retrieval

•Create Apache Airflow DAGs using Python for streamlined workflow management and automation of data processing tasks (a minimal DAG sketch follows this role's bullet list)

•Utilize AWS Lambda functions with AWS boto3 module in Python for event-driven processing, enabling efficient handling of real-time data

•Execute Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 Buckets, enabling scalable data processing on the cloud platform

•Work on POCs for ETL with S3, EMR (Spark), and Snowflake

•Configure access for inbound and outbound traffic for RDS DB services, DynamoDB tables, and EBS volumes, and set up alarms for notifications or automated actions on AWS, ensuring secure and efficient data flow

•Develop AWS CloudFormation templates to create custom infrastructure for the pipeline, enabling smooth deployment and management of resources

•Implement AWS IAM user roles and policies to authenticate and control access, ensuring secure data handling and authorization

•Specify nodes and perform data analysis queries on Amazon Redshift clusters on AWS, enabling efficient data retrieval and analysis

•Develop Spark programs using PySpark and create User Defined Functions (UDFs) in Python, enabling customized data processing tasks

•Work with AWS Kinesis for processing huge amounts of real-time data, enabling real-time insights and decision-making

•Develop scripts for collecting high-frequency log data from various sources and integrating it into AWS using Kinesis, staging data in the Data Lake for further analysis, enabling efficient data ingestion and processing

•Load data into Snowflake on AWS (via Snowpipe) as the data warehouse

•Develop, design, and test Spark SQL jobs with Scala and Python Spark consumers, ensuring accurate data processing and analysis

•Work on CI/CD pipeline for code deployment using different tools (Git, Jenkins, CodePipeline) from developer code check-in to production deployment, ensuring a smooth and efficient code deployment process

•Create and maintain ETL pipelines in AWS along with Snowflake using Glue, Lambda, and EMR for data extraction, transformation, and loading tasks, ensuring efficient and automated data processing workflows
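
The Airflow work noted above can be illustrated with a minimal DAG sketch; the DAG id, task ids, and the extract/load callables (extract_to_s3, load_to_redshift) are hypothetical placeholders, not project code.

```python
# Minimal Airflow 2.x DAG sketch; DAG id, task ids, and callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3(**context):
    # Placeholder: pull a batch of source records and stage them in S3.
    print("extracting batch to the S3 staging bucket")


def load_to_redshift(**context):
    # Placeholder: COPY the staged files into Redshift.
    print("loading staged files into Redshift")


with DAG(
    dag_id="daily_ingest_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    extract >> load  # run the load only after the extract succeeds
```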

Sr. Cloud Engineer

Alteryx, Irvine, CA Feb 2020 – Dec 2021

American computer software company based in Irvine, California, with a development center in Broomfield, Colorado and offices worldwide. The company's products are used for data science and analytics. The software is designed to make advanced analytics automation accessible to any data worker.

•Ingested data from diverse sources, including flat files and APIs, utilizing Kafka from Salesforce and other systems and ensured seamless and efficient data flow into the system for further processing

•Well experienced in using Azure to process and analyze large datasets.

•Utilized Azure Data Factory to schedule and manage the movement and transformation of data from a variety of sources, such as on-premises databases and SaaS applications, into Azure Data Lake Storage.

•Utilized Azure Databricks to process and cleanse the data before loading it into an Azure Synapse Analytics (formerly SQL Data Warehouse) for reporting and visualization.

•Integrated Azure Data Lake Service with other Azure services, such as Azure HDInsight and Azure Synapse Analytics, to create a comprehensive data processing and analysis platform

•Mapped the ingested data to the appropriate Databricks schema and efficiently loaded it into Delta Lake, a robust data lake solution (a short ADLS-to-Delta sketch follows this role's bullet list)

•Followed Agile Scrum processes for Software Development Lifecycle (SDLC), working in 2-week Sprints and participating in daily 30-minute standups (Scrums)

•ETL Workflow and Data Pipeline Development: Designed and developed ETL workflows and data pipelines using tools such as Apache NiFi, Apache Airflow, Talend, and Informatica

•Coordinated with team members, tracked progress, and adapted to changing requirements, ensuring efficient project management and timely delivery

•Utilized Jira Agile development software to track tasks, sprints, stories, and backlog management, ensuring effective project tracking and coordination

•Contributed to the design, code, configuration, and documentation of components managing data ingestion, real-time streaming, batch processing, data extraction, and transformation

•Played a critical role in ensuring smooth and efficient data processing throughout the project

•Created and managed Kafka topics and consumers for streaming data, ensuring smooth and efficient data ingestion and processing

•Extracted data from various sources, transformed it into a usable format, and loaded it into the target system or data store, ensuring data quality and consistency.

•Worked with various data formats such as CSV, JSON, Avro, Parquet, and ORC, applying data transformation and manipulation techniques to extract and transform data into the required format, enabling seamless data processing and analysis

•Created comprehensive documentation of design, code, configuration, and user guides, recording all relevant information on Confluence pages.

•Performed the migration of large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines

•Created various pipelines to load data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB

•Created Databricks notebooks to streamline and curate the data for various business use cases, and mounted Blob Storage on Databricks

•Utilized Azure Logic Apps to build workflows that schedule and automate batch jobs by integrating apps, ADF pipelines, and other services such as HTTP requests and email triggers
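
A short Databricks-style PySpark sketch of the ADLS-to-Delta load referenced above; the abfss:// path, the order_id key, and the curated.sales_orders table name are assumptions for illustration, and a Databricks/Delta Lake runtime with ADLS Gen2 access already configured is assumed.

```python
# Minimal sketch of loading raw files from ADLS Gen2 into a Delta table.
# The storage account, container, key column, and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-to-delta-example").getOrCreate()

# Raw files landed in ADLS Gen2 (placeholder abfss path).
raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/sales/"
raw_df = spark.read.format("json").load(raw_path)

curated_df = (
    raw_df
    .withColumn("ingest_date", F.current_date())
    .dropDuplicates(["order_id"])  # assumes order_id is the business key
)

# Write to a Delta table for downstream Synapse / reporting loads.
(
    curated_df.write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_date")
    .saveAsTable("curated.sales_orders")
)
```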

Sr. Cloud Engineer

Salesforce, San Francisco, CA May 2018 – Jan 2020

Cloud-based software company headquartered in San Francisco, California. It provides customer relationship management software and applications focused on sales, customer service, marketing automation, e-commerce, analytics, and application development.

•Implemented and managed large-scale big data solutions on cloud platforms such as Amazon Web Services (AWS) and Microsoft Azure, utilizing services such as Amazon S3, Amazon EMR, Amazon Redshift, Amazon Kinesis, Azure Blob Storage, Azure Data Factory, Azure Databricks, and Azure HDInsight

•Designed and implemented end-to-end big data pipelines, including data extraction, transformation, and loading (ETL), using technologies such as AWS Lambda, AWS Glue, AWS Step Functions, Azure Data Factory, and Azure Functions, ensuring secure and scalable data processing

•Utilized Spark and Python to develop custom data processing scripts, such as word frequency analysis from multiple files, generating comparative tables, and calculating interchanges between different credit cards, enhancing data analysis capabilities (a word-frequency sketch follows this role's bullet list)

•Implemented real-time data processing and analytics using Azure Stream Analytics, Azure Functions, AWS Kinesis Data Streams, and AWS Lambda, enabling real-time insights and decision-making

•Configured and optimized Spark and Hive environments, including Spark config files, environment path, Spark home, external libraries, and Hive Query Language (HQL), ensuring efficient data processing and analysis workflows

•Developed and maintained data ingestion workflows using Flume, Kafka, and shell scripts, automating data transfer from various databases into the Hadoop framework, providing users access to data through Hive-based views

•Implemented performance optimizations such as partitioning Kafka Topics, configuring Flume agent batch size, capacity, transaction capacity, roll size, roll count, and roll intervals, ensuring efficient and scalable data processing within the Kafka and Flume clusters

•Leveraged Airflow and Spark for data scrubbing, processing, and transformation tasks, ensuring data quality and accuracy in data pipelines

•Utilized Spark SQL to convert DataFrames and add business logic for data processing and analysis, enhancing data processing capabilities

•Created pipelines using PySpark, Kafka, and Hive to gather new product releases data for a country in a given week, enabling timely insights for business decision-making

•Provided connections to various Business Intelligence tools, such as Tableau and Power BI, to access data warehouse tables, enabling seamless integration of data into BI dashboards and reports
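
The word-frequency analysis mentioned above can be sketched in a few lines of PySpark; the input path is an illustrative placeholder.

```python
# Minimal PySpark word-frequency sketch across multiple text files.
# The input path is an illustrative placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word-frequency-example").getOrCreate()

# Each row of "value" holds one line from one of the input files.
lines_df = spark.read.text("/data/input-docs/*.txt")

word_counts = (
    lines_df
    .select(F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
    .filter(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.col("count").desc())
)

word_counts.show(20)
```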

Data Engineer

Oxagile – New York, NY Jul 2016 – Apr 2018

Oxagile is a reliable software development partner with over 17 years of experience in the video domain. Oxagile has made its way from a promising tech startup to a mature software vendor serving over 1 billion users. Its solid knowledge across 30+ verticals and business domains makes it a go-to partner in helping businesses run at their best.

•Implemented a prototype for real-time analysis using Spark Streaming and Kafka in the Hadoop system

•Consumed data from Kafka queue using Storm and deployed application jar files into AWS instances

•Extracted data from RDBMS (Oracle, MySQL) to Hadoop Distributed File System (HDFS) using Sqoop

•Utilized NoSQL databases like MongoDB for implementation and integration

•Collected business requirements from subject matter experts and data scientists

•Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop, Spark, Hive, and Spark Streaming for ETL pipelines, directly on Hadoop Distributed File System (HDFS)

•Configured Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs in the Hadoop system

•Built a Full-Service Catalog System with a complete workflow using Elasticsearch, Logstash, Kibana, and Kinesis

•Loaded data from various data sources into Hadoop Distributed File System (HDFS) using Kafka

•Integrated Kafka with Spark Streaming for real-time data processing in Hadoop (a streaming-read sketch follows this role's bullet list)

•Transferred data using the Informatica tool from AWS S3 and utilized AWS Redshift for cloud data storage

•Used different file formats like Text files, Sequence Files, and Avro for data processing in the Hadoop system

•Imported data from disparate sources into Spark RDD for data processing in Hadoop

•Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies

•Used image files to create instances containing Hadoop installed and running

•Streamed analyzed data to Hive Tables using Sqoop, making it available for data visualization

•Tuned and operated Spark and its related technologies like Spark SQL and Spark Streaming

•Utilized Hive JDBC to verify the data stored in the Hadoop cluster

•Connected various data centers and transferred data using Sqoop and ETL tools in the Hadoop system

•Developed a task execution framework on EC2 instances using SQL and DynamoDB

•Used shell scripts to dump data from MySQL to Hadoop Distributed File System (HDFS)

•Authored queries in AWS Athena to query files in S3 for data profiling
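
A minimal sketch of the Kafka-to-Spark integration referenced above, written against the Spark Structured Streaming API; the broker addresses, topic name, and output/checkpoint paths are placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Minimal Structured Streaming sketch: read events from Kafka and land them as Parquet.
# Brokers, topic, and paths are placeholders; requires the spark-sql-kafka connector.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-streaming-example").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events-topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
parsed = events.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/events/")
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```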

Hadoop Developer

Accenture, Chicago, IL Mar 2015 – Jun 2016

Irish-American professional services company based in Dublin, specializing in information technology services and consulting. It delivers innovative and comprehensive services and solutions that span cloud; systems integration and application management; security; intelligent platform services; infrastructure services; software engineering services; data and artificial intelligence; and global delivery through its Advanced Technology Centers.

•Used the Fair Scheduler to efficiently allocate resources across the cluster, ensured optimal performance and resource utilization for all applications

•Utilized Zookeeper for providing coordinating services to the Hadoop cluster, ensuring smooth and efficient workflow coordination

•Documented Technical Specifications, Dataflow, Data Models, and Class Models in the Hadoop system, ensuring comprehensive documentation for future reference

•Leveraged Zookeeper and Oozie for seamless cluster coordination and scheduling workflows in Hadoop, optimizing cluster performance

•Managed installation, commissioning, and decommissioning of data nodes, NameNode recovery, capacity planning, and slots configuration in Hadoop, ensuring smooth operations of the cluster

•Conducted production support activities, including monitoring server and error logs, proactively identifying and resolving potential issues, and escalating when necessary, ensuring high availability and reliability of the system

•Implemented partitioning and bucketing in Hive for effective organization of Hadoop Distributed File System (HDFS) data, improving data retrieval and processing efficiency (a partitioned-table sketch follows this role's bullet list)

•Automated build processes and regular jobs like ETL using Linux shell scripts, streamlining workflow and reducing manual efforts

•Imported data from MySQL and Oracle to Hadoop Distributed File System (HDFS) using Sqoop regularly, ensuring data availability for analysis

•Created Hive external tables to store Pig script output and performed data analysis to meet business requirements, leveraging data processing capabilities of Hadoop

•Successfully loaded files to HDFS from Teradata and loaded data from HDFS to Hive, ensuring seamless data transfer between systems

•Utilized Sqoop and Flume for efficient data transfer between databases and HDFS, and streaming log data from servers, optimizing data processing in Hadoop

•Loaded data into HBase for faster access to products in all stores without compromising performance, improving data retrieval efficiency

•Installed and configured Pig for ETL jobs, implementing regular expressions for data cleaning, ensuring data accuracy and quality

•Loaded data from the Linux file system to Hadoop Distributed File System (HDFS), enabling seamless data integration from different sources

•Responsible for building scalable distributed data solutions using Hadoop, ensuring efficient and optimized data processing

•Implemented ETL processes to move data between Oracle and Hadoop Distributed File System (HDFS) using Sqoop, optimizing data integration and retrieval
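
The partitioning side of the Hive bullet above can be sketched by issuing HiveQL through Spark's Hive support; the database, table, and column names are illustrative placeholders, and the bucketed variant would add a CLUSTERED BY ... INTO N BUCKETS clause when the table is created through Hive itself.

```python
# Minimal sketch: create a date-partitioned Hive table and load one partition.
# Database, table, and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-example")
    .enableHiveSupport()  # assumes a Hive metastore is available
    .getOrCreate()
)

# Partition by load date so queries can prune to the days they need.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Load a single day's partition from a staging table (also a placeholder name).
spark.sql("""
    INSERT OVERWRITE TABLE analytics.transactions PARTITION (load_date = '2016-01-15')
    SELECT txn_id, customer_id, amount
    FROM staging.transactions_raw
    WHERE load_date = '2016-01-15'
""")
```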

EDUCATION

BS (Computer Science) from the University of Illinois at Springfield


