
Sr. Big Data Engineer

Location: Orlando, FL
Posted: April 27, 2022


Professional Summary

**+ years in IT.

* years in Hadoop, Big Data, and Cloud.

●Proficient in extracting data and generating analyses with the business intelligence tool Tableau.

●Hands-on with HDFS, YARN, Pig, Hive, Impala, Sqoop, HBase, and Cloudera.

●Knowledge of Spark architecture, including Spark Core, Spark SQL, and Spark Streaming.

●ETL: data extraction, transformation, and loading using Hive, Pig, and HBase.

●Strong knowledge of and hands-on experience with Cassandra, Flume, and YARN.

●Experience implementing user-defined functions (UDFs) for Pig and Hive.

●Extensive knowledge of the development, analysis, and design of ETL methodologies across all phases of the data warehousing life cycle.

●Excellent understanding of Hadoop architecture and its components such as HDFS, Job Tracker, Task Tracker, Name Node, and Data Node.

●Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.

●Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.

●Prepare test cases, document them, and perform unit and integration testing.

●Write complex SQL queries against databases such as DB2, MySQL, and Microsoft SQL Server.

●Experience importing and exporting data between Hadoop and RDBMS using Sqoop and SFTP.

●Extensive experience with databases such as MySQL and Oracle 11g.

●Use the Kafka messaging system to implement real-time streaming solutions with Spark Streaming.

●Expertise with tools in the Hadoop ecosystem, including HDFS, Pig, Hive, Sqoop, Storm, Spark, Kafka, YARN, Oozie, and ZooKeeper.

Technical Skills

Programming/Scripting

Spark, Spark Streaming, Scala, Kafka, MapReduce, SQL, Python, Visual Basic

IDEs

Jupyter Notebooks, Eclipse, IntelliJ, PyCharm, Visual Studio Code

Integrations

Ajax, REST API, Spark API

Database

Apache Cassandra, Apache HBase, MongoDB, Oracle, SQL Server, DB2, Sybase, RDBMS

Data Storage

Data Lake, Data Warehouse, DAS, NAS, SAN

File Management

HDFS, Parquet, Avro, JSON, Snappy, Gzip, ORC

Data Analysis

MapReduce, Hive, HiveQL, SQL, RDDs, DataFrames, Datasets

Methodologies

Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing

Cloud

AWS, Google Cloud Platform, Elastic Cloud

Search

Apache SOLR, Elasticsearch, Apache Lucene

Cloud Services & Distributions

AWS, Azure, Anaconda Cloud, Elasticsearch, Solr, Lucene, Cloudera, Databricks, Hortonworks, Elastic MapReduce

Big Data Processing

Apache Hive, Apache Cassandra, Apache Hadoop, Apache HCatalog, SciPy, Pandas, Mesos, Apache Tez, Apache ZooKeeper

Build Tools

Apache Ant, Apache Maven, SBT

File Formats

JSON, Avro, Parquet, ORC, XML, HDFS

Version Control

GitHub, BitBucket

Continuous Integration

Jenkins, Hudson, Spinnaker

Testing

JUnit, Unit Testing, Functional Testing, Test-Driven Development

BI and Data Visualization

Kibana, Tableau, Splunk

Data Processing

Kibana, Tableau, Sqoop, Presto, Apache Flume, Apache Airflow, Apache Hue, YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Spark Streaming

Hadoop

Apache SOLR, Cloudera Impala, Cloudera, Hortonworks

Professional Experience

Sr. Big Data Engineer 03.2021 – Present

Anthem, Orlando, FL

As part of a Data Science and Engineering team, I collaborated with technologists to build an end-to-end solution for a program-integrity/anomaly-detection system that integrated claims, pharmacy, and third-party data to develop comprehensive analytics for identifying potential fraud, waste, abuse, and improper payments. As a Sr. Big Data Engineer, I coordinated and led the migration of the Claim Examiner module from on-premises to the cloud (AWS).

Unloaded data from Snowflake into AWS S3 buckets together with the Advanced Engineering Design Lab (AEDL) team.

Built and scheduled DAGs in Apache Airflow to process historical and incremental data for digital claims processing, as sketched below.
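
A minimal sketch of the kind of Airflow DAG used for these loads, assuming a hypothetical DAG name, schedule, and task; the processing step is only a placeholder for the actual Spark/EMR work.

```python
# Hedged sketch of an Airflow DAG for the history/incremental claims loads;
# the dag_id, schedule, and processing logic are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_incremental(**context):
    # Placeholder: hand the execution-date window to the actual Spark/EMR step.
    print(f"Processing incremental claims for {context['ds']}")


with DAG(
    dag_id="digital_claims_incremental",  # hypothetical DAG name
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="process_incremental_claims",
        python_callable=process_incremental,
    )
```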

Built several POCs covering connectivity, data volumes, AWS Glue, Glue Crawlers, and EMR.

Programmed Scala/Spark code to support test runs in the on-premises Hadoop environment as well as in AWS.

Conducted end-to-end runs on EMR.

Designed solutions for DocumentDB connectivity and solved data challenges running PySpark scripts in Glue jobs.
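
The Glue jobs mentioned above can be sketched roughly as follows; this is a hedged, generic Glue-style PySpark skeleton in which the catalog database, table, and output path are hypothetical placeholders, and the DocumentDB connection details are omitted.

```python
# Hedged sketch of a Glue-style PySpark job; catalog database, table, and
# output path are hypothetical, and the DocumentDB connection is omitted.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read claims data previously crawled into the Glue Data Catalog.
claims = glue_context.create_dynamic_frame.from_catalog(
    database="claims_db",          # hypothetical catalog database
    table_name="claim_examiner",   # hypothetical catalog table
).toDF()

# Example transformation before writing results back to S3.
cleaned = claims.dropDuplicates(["claim_id"]).filter("claim_status IS NOT NULL")
cleaned.write.mode("overwrite").parquet("s3://example-bucket/claims/cleaned/")

job.commit()
```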

Used Terraform for infrastructure provisioning.

Used Digital and Workos AWS accounts for initial POCs before deploying into Prod.

Ran a crawler execution on S3 data made available by the AEDL team.

Scheduled all job runs in Apache Airflow and constantly monitored CloudWatch logs and EMR logs.

Enhanced 7 DEX submodules to identify unused data and optimized all steps involved in loading, processing, and writing billions of records.

Worked on Airflow scheduler tool to run Spark jobs using Spark Operator.

Created user-defined functions (UDFs) in Scala to automate business logic in the applications.
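
These UDFs were written in Scala; purely for illustration, and to keep the sketches in this document in a single language, the same pattern is shown below in PySpark with a hypothetical business rule.

```python
# The resume's UDFs were written in Scala; this PySpark sketch shows the same
# pattern with a hypothetical business rule for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()


@udf(returnType=StringType())
def claim_priority(amount):
    # Hypothetical rule: flag high-value claims for manual review.
    return "HIGH" if amount is not None and amount > 10000 else "STANDARD"


claims = spark.createDataFrame([(1, 15000.0), (2, 250.0)], ["claim_id", "amount"])
claims.withColumn("priority", claim_priority("amount")).show()
```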

Wrote Scala classes to extract data from MongoDB.

Wrote and executed MongoDB queries.

Parsed incoming JSON files and loaded them for further transformation.

Used the Impala open-source parallel-processing query engine to optimize data processing.

Used the Amazon Athena service to run interactive queries over large-scale datasets in cloud storage.
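
A hedged sketch of submitting an interactive Athena query with boto3; the database, query, and results location below are hypothetical.

```python
# Hedged sketch of an interactive Athena query submitted with boto3; the
# database, query, and output location are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT claim_status, COUNT(*) FROM claims GROUP BY claim_status",
    QueryExecutionContext={"Database": "claims_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```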

Key Technologies: Snowflake, S3, Airflow, Glue, Crawler, EMR, Scala, Spark, DocumentDB, PySpark, Terraform, CloudWatch, MongoDB, JSON, Impala, Athena

Sr. Data Engineer 02.2020 – 03.2021

Lowes, Tampa, FL

Helped build thriving businesses and guided the retail industry in solving complex problems and making digital transformation a reality for clients.

As a Data Engineer on our North American team, I collaborated with technologists such as data scientists, analysts, and business owners to produce enterprise-scale solutions for our clients' needs. This position focused on building out customer data hubs/profile databases and building data warehouse solutions from on-premises to the cloud (GCP) for recommendation items on the live site (lowes.com).

Transitioned Recommendation standalone and embedded algorithms from Hadoop/Hive to Google Cloud BigQuery.

Checked base input tables for last-updated times and monitored them with Stackdriver and Data Studio Explorer.

Moved scripts and the Products, Taxonomy, and Sales tables from HDFS to Google Cloud Platform for QA and Prod.

Validated and compared results from both sources (Hive, BQ) to find discrepancies in the recommendation items.

Identified dependency tables in Hadoop, converted them to ORC files, and migrated them to BigQuery using Google Cloud Storage as an intermediate repository in our data lake, as sketched below.
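
A minimal sketch of that GCS-to-BigQuery step, using the google-cloud-bigquery client to load ORC files; the project, bucket, and table names are hypothetical.

```python
# Hedged sketch of loading ORC files from Cloud Storage into BigQuery; the
# project, bucket, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.ORC)
load_job = client.load_table_from_uri(
    "gs://example-bucket/taxonomy/*.orc",        # hypothetical GCS path
    "example-project.recommendations.taxonomy",  # hypothetical BQ table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```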

Scheduled SQL Jobs (algorithms) in Cloud Composer and Apache Airflow using Trigger DAG operators and Sensor operators.
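
A sketch of this Cloud Composer/Airflow scheduling pattern, combining a sensor on an upstream DAG with a trigger for a downstream one; the DAG ids and timing are hypothetical.

```python
# Hedged sketch of cross-DAG scheduling with a sensor and a trigger operator;
# the DAG ids and timing below are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="recommendations_driver",  # hypothetical driver DAG
    start_date=datetime(2020, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_base_tables = ExternalTaskSensor(
        task_id="wait_for_base_tables",
        external_dag_id="base_table_refresh",  # hypothetical upstream DAG
        timeout=6 * 3600,
    )
    run_algorithm = TriggerDagRunOperator(
        task_id="trigger_recs_algorithm",
        trigger_dag_id="recs_algorithm",  # hypothetical downstream DAG
    )
    wait_for_base_tables >> run_algorithm
```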

Ran Python scripts in GCP Console and Jupyter notebooks.

Ingested collection feedback data from a Kafka topic to BigQuery using a Google Dataflow template.

Updated datasets on a daily basis between Hadoop and GCP.

Monitored Oozie Jobs performance and execution in Ambari.

Collaborated with Data Science team to constantly evaluate performance and logic of all algorithms to detect opportunity areas.

Key Technologies: Hive, GCP, BigQuery, Python, Cloud Composer, Apache Airflow, Hadoop, Dataflow, Ambari, Cloud Storage, SQL, Data Studio Explorer, Stackdriver, Kafka.

Sr. Hadoop Developer 12.2019 – 02.2020

Verizon Wireless, Temple Terrace, FL

The Verizon MARS team works with some of the largest data sets collected from the largest, most reliable network. I was responsible for providing fast, clean, and relevant data for Verizon's network and field organizations, as well as key measurements and insights about performance to front-line employees serving our customers.

The Verizon MARS project processes, stores, and facilitates easy retrieval of customer usage data and call records. Since new usage types are introduced each year, I worked as a Hadoop Developer designing and developing enterprise-grade applications on the Hortonworks Hadoop platform.

Built shell scripts to report the total volume of data ingested daily/weekly/monthly and summary reports by usage type, and to compress, store, and clean up source files.

Designed and developed an analytical/semantic platform on Hadoop using MapReduce, HBase, Hive, Spark, Kafka, Elasticsearch, and Sqoop, and built visualization dashboards using Tableau/Kibana/Qlik.

Updated Hive, Teradata and Oracle queries to include new enhancements as new usage types are generated.

Monitored Hadoop cluster services such as Ambari, YARN, and Oozie jobs on a daily basis.

Created Oozie jobs and shell scripts for the file ingestion process into HDFS.

Used the Tez engine to process ingestions into Hive and store the data in compressed partitions.

Oversaw the technical design, development, and maintenance of databases and master files on large, complex projects.

Responsible for design, implementation, recovery and load strategy.

Communicated architectural approaches and made database structural recommendations.

Coordinated with other Information Systems departments to ensure implementation of databases and monitoring of database performance.

Ensured architectural integrity and consistency across the entire project.

Key Technologies: Hive, Spark, SQL, Teradata, Oracle, Hortonworks HDP, Apache Oozie, Hadoop, Sqoop, Tableau, Ambari, Yarn, HDFS.

Big Data Engineer/Big Data Developer 09.2018 – 10.2019

Capital One, McLean, VA

I was involved in migrating a big data environment from an on-premises system to the AWS cloud platform. This data environment supported a Credit Line Increase Program that manages credit limit offers to customers according to rules based on expenses, income, and other criteria. Capital One collects data from different sources and runs inclusion and exclusion rules, model calculation, segmentation, and ability-to-pay processes to determine which customers are eligible for a credit line increase.

Worked in a private cloud environment on Amazon AWS using a multistage deployment environment (Dev – QA – Prod).

This project required an understanding of business rules, business logic, and use cases to be implemented.

Responsible for reviewing and understanding how Quantum (a Spark wrapper) is used to ingest and process batch and real-time data using Apache Spark, Scala, and SQL.

The project involved data sources as disparate as CSV, Parquet, Avro, Kafka, Snowflake tables, etc.

Developed Quantum workflows to read Parquet files from S3 buckets, apply transformations, joins, filters, and SQL queries to different DataFrames, and create output datasets (see the sketch below).
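
Quantum is a proprietary Spark wrapper, so its exact workflow API is not shown here; the plain PySpark sketch below illustrates the same pattern (read Parquet from S3, join, filter, run SQL, write output) with hypothetical bucket, table, and column names.

```python
# Quantum is proprietary; this plain PySpark sketch shows the equivalent flow
# (read Parquet from S3, join, filter, SQL, write). Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cli-workflow-sketch").getOrCreate()

accounts = spark.read.parquet("s3://example-bucket/accounts/")
offers = spark.read.parquet("s3://example-bucket/offers/")

eligible = accounts.join(offers, "account_id").filter("delinquency_flag = 0")
eligible.createOrReplaceTempView("eligible_accounts")

output = spark.sql(
    "SELECT account_id, offer_amount FROM eligible_accounts WHERE income > 50000"
)
output.write.mode("overwrite").parquet("s3://example-bucket/cli-output/")
```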

Synchronized data sources and ensured their high availability across AWS regions.

Prepared use cases, mock data, and error scenarios to test workflow execution in an EMR cluster promoted from Dev to Test to QA.

Verified and validated that the ability-to-pay AWS Lambda triggered jobs appropriately to launch the cluster and process the accounts.

Developed and coded exclusion rules workflow to connect it to Ability-to-Pay external process using Spark, Quantum, SQL and Python.

Worked within and across Agile teams to design, develop, test, implement, and support technical solutions across a full stack of development tools and technologies.

Created and configured YAML files to set input and output cluster parameters, Centrify groups, JAR file versions, and EC2 instance information such as security groups and subnet IDs.

Automated deployments on AWS using GitHub and Jenkins.

Set up CI/CD pipelines using Jenkins, Maven, GitHub, and AWS.

Designed and implemented an ELK (Elasticsearch, Logstash, Kibana) stack solution for proactive monitoring of application logs and statistics.

Developed a consumer in Node.js to read and send application metrics to a Kinesis Data Stream, to be ingested by Logstash and Elasticsearch and visualized in Kibana dashboards.

Onboarded AWS Lambda and AWS CloudWatch logs to Splunk using an HEC (HTTP Event Collector) token.
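
A hedged sketch of forwarding a CloudWatch/Lambda log event to Splunk over the HTTP Event Collector; the endpoint URL and token below are placeholders, not real values.

```python
# Hedged sketch of forwarding a log event to Splunk's HTTP Event Collector;
# the endpoint URL and token are placeholders, not real values.
import json

import requests

SPLUNK_HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder
SPLUNK_HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # placeholder


def forward_event(event: dict) -> None:
    payload = {"event": event, "sourcetype": "aws:cloudwatch"}
    resp = requests.post(
        SPLUNK_HEC_URL,
        headers={"Authorization": f"Splunk {SPLUNK_HEC_TOKEN}"},
        data=json.dumps(payload),
        timeout=10,
    )
    resp.raise_for_status()
```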

Used GitHub for version control and Jira for issue and project tracking.

Key Technologies: Quantum (Big Data framework), Spark, SQL, Python, AWS, S3 buckets, AWS Lambda, AWS Cloudwatch, EMR Cluster, Splunk, ELK, Kibana, Elasticsearch, Node.js.

Hadoop Cloud Engineer 11.2016 – 09.2018

Thermo Fisher Scientific Inc., Waltham, MA

Worked on data analysis pipelines for the company’s Affymetrix microarray analysis product line which includes products and tools used by researchers studying plant and animal genomics and transcriptomics, including basic research and industrial application of technologies for breeding, population diversity and conservation, trait analysis, and more. This research is used in cancer research, pharmacological trials and more. The products and tools provided collect data and the big data system provides a place to gather and make use of that data.

Met with stakeholders, SMEs, and Data Scientists to gather, determine and document requirements.

Created a variant of the lambda architecture consisting of near-real-time data processing with Spark Streaming and Spark SQL on Spark clusters.

Deployed Hadoop clusters of HDFS, Spark Clusters and Kafka clusters on virtual servers in Azure environment.

Used Azure HDInsight as the interface and to manage the online clusters.

Worked in an Azure cloud environment implementing Azure HDInsight.

Ran the Data Migration Assistant to upgrade the existing SQL Server to Azure SQL Database.

Performed streaming data ingestion to the Spark distribution environment, using Kafka.

Built a prototype for real-time analysis using Spark Streaming and Kafka (see the sketch below).
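
A hedged sketch of that real-time prototype using Spark Structured Streaming with a Kafka source (the original may have used the DStream API); the broker and topic names are hypothetical, and the spark-sql-kafka connector is assumed to be available on the cluster.

```python
# Hedged sketch of the real-time prototype with Spark Structured Streaming and
# a Kafka source; broker/topic names are hypothetical and the spark-sql-kafka
# connector is assumed to be available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-streaming-prototype").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "microarray-events")           # hypothetical topic
    .load()
    .select(col("value").cast("string").alias("payload"))
)

query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```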

Worked on escalated tasks related to interconnectivity issues and complex cloud-based identity management and user authentication, service interruptions with Virtual Machines (their host nodes) and associated virtual storage (Blobs, Tables, Queues).

Used the Spark DataFrame API on the Azure HDConnect platform to perform analytics on Hive data.

Extensively used transformations such as Router, Aggregator, Normalizer, Filter, Join, Expression, Source Qualifier, unconnected and connected Lookup, Update Strategy, Stored Procedure, and XML transformations, along with error handling and performance tuning.

Used Sqoop to export the data back to relational databases for business reporting.

Extensively worked on DataStage server and parallel job controls and sequencers. Designed and developed parallel jobs using stages such as Transformer, Aggregator, Merge, Join, Lookup, Sort, Remove Duplicates, Funnel, Filter, Pivot, and Shared Container.

Implemented all SCD types using server and parallel jobs. Extensively applied error handling, testing, debugging, and performance tuning of targets, sources, and transformation logic, and used version control to promote the jobs.

Involved in loading and transforming large sets of structured, semi-structured and unstructured data.

Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.

Key Technologies: Azure HDInsight, DataStage, Python, Sqoop, Spark, Hadoop, HDFS, Spark DataFrame, Azure HDConnect, Azure IoT, Stream Analytics, Cosmos DB, Cloud Services, Big Data, Data Analytics, Virtualization, Cloud Automation, Hive, Kafka.

Prior Experience

Coca-Cola 07.2009 – 09.2016

Coca-Cola FEMSA – Mexico

Database Design and Development

Database development and administration using SQL Server, as well as access support for in-house systems for sales analysis and forecasts. Development of ad-hoc systems for sales forecasts. Extraction, download, and processing of SAP BW/BO2 system information.

Application Development and Systems Analysis

●Coordination of projects in the areas of Systems, Market Intelligence, and Marketing.
●Management and analysis of information in SQL databases.
●Acquisition of computer equipment, servers, and software for management.
●Programming of systems and macros with databases.
●Planning and elaboration of work plans.

Education

Bachelor of Computer Science

University of Veracruz, VERACRUZ, MEXICO

Certification

Big Data 101

Cognitive Class


