I am a Big Data Engineer with * years of professional experience in the IT industry. As an experienced Big Data consultant, I ensure the successful delivery of high-quality big data solutions by combining an understanding of the business case with a range of skills, frameworks, best practices, and coding ability. I also bring a strong work ethic and the ability to work well with teams to create the right platforms, pipelines, and reporting tools for clients.
Used Spark SQL and Data Frame API extensively to build Spark applications.
Experienced in using CQL (Cassandra Query Language) to query and retrieve data from Cassandra clusters.
Able to learn and adapt to the CI/CD toolchain (GitHub, Jenkins) available in, or proposed for, the customer environment.
Configured the ELK stack for Jenkins logs and syslogs.
Used Spark to stream analyzed data to HBase and make it available to the BI team for visualization and report generation.
Used Spark Structured Streaming to structure real-time data into a DataFrame and update it in real time.
Prototyped analysis and joining of customer data using Spark in Scala and wrote the processed data to HDFS.
Implemented Spark on EMR to process big data across our data lake in AWS.
Developed AWS strategy, planning, and configuration of S3, Security groups, IAM, EC2, EMR and Redshift.
Experience integrating Kafka with Avro for serializing and deserializing data. Expertise with Kafka producer and consumer.
Experience in configuring, installing and managing Hortonworks & Cloudera Distributions.
Involved in continuous integration of applications using Jenkins.
Implemented Spark and Spark SQL for faster testing and processing of data.
Experience writing streaming applications with Spark Streaming/Kafka.
Utilized Spark Structured Streaming to update and process DataFrames in real time.
Experienced with Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, and RDS, and with IAM entities, roles, and users.
Knowledgeable in deploying application JAR files to AWS instances.
Set up Kafka brokers as a source for Structured Streaming to obtain structured data by schema.
Handled schema changes in data streams using Kafka.
Skilled in HiveQL, custom UDFs written in Hive, and optimizing Hive Queries, as well as writing incremental imports into Hive tables.
DATA ARCHITECT April 2021 - Present
Capital One Financial
Working on a project team focused on a two-part project:
- Credit Batch Processing: The focus is developing a batch processing tool that reconciles every credit card transaction with VISA and MasterCard. It runs as a Java program in an ECS cluster.
- Credit Policies: The focus is developing credit policies. A credit policy at Capital One is a set of rules that determines whether a credit card application is accepted or rejected. It runs as a Scala program in an EMR cluster.
Conducting work in alignment with an Agile project methodology, using Jira and holding daily standups.
Implementing and configuring Splunk for log monitoring.
Designing and configuring Splunk dashboards and charts that show statistics on daily executions, with automatic alerts routed through PagerDuty by email, SMS, and phone.
Programming a Java-based online web application form and rules engine to process and evaluate information collected according to the product selected by the applicant.
Modifying existing Java code to process a new file, generate a DataFrame from its content, and join it with the existing DataFrames in the program. This is the first time Avro files have been onboarded in the project.
Programming a Business Rules Management System (BRMS) called Drools to return FICO scores from multiple credit bureaus for factoring into business determinations.
Performing tests on every business rule in a local environment using Cucumber and ATDD.
Versioning code with Git; merged changes are submitted to Jenkins CI for QA testing, which uses a web interface to match an existing product with the recently developed policy.
Storing rulesets, credit policy data, and product data in a PostgreSQL database and configuring the database for the rules engine to decide what rules need to be evaluated when applying for a product.
Writing SQL scripts with DML statements to match a product with its rules.
Upgrading Spark version from 2.2 to 2.4 to support the requirement that the infrastructure be recreated every 45 days to ensure security policies are adhered to.
Configuring system to ensure all data and infrastructure are replicated in East and West regions in AWS. There is a Trex activity to switch between regions every 3 months to test that high availability is provided in case of failure.
Building technical infrastructure with Terraform scripts to avoid dependency on AWS as the cloud service provider.
Enabling availability to APIs and components for all teams and projects by applying JFrog Artifactory.
Documenting project work using Atlassian Confluence.
Implementing SPHINX for internal technical system security.
BIG DATA ENGINEER March 2018 - April 2021
Jersey City, NJ
Created a training program to develop professionals into Machine Learning Developers.
Trained IT professionals in Python and Spark so that, by the end of the program, they understood the capabilities, features, custom solutions, and limitations involved and could deliver high-quality data processing solutions, using PySpark programs for proofs of concept.
Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming and Hive.
Implemented Spark Streaming for real-time data processing with Kafka.
Handled large amounts of data utilizing Spark.
Wrote streaming applications with Spark Streaming/Kafka.
Used Spark SQL to perform transformations and actions on data residing in Hive.
Responsible for designing and deploying new ELK clusters.
Monitored logs and generated visual representations of them using the ELK stack.
Implemented upgrade, backup, and restore procedures for CI/CD tools.
Created the infrastructure design for ELK clusters.
Worked on Elasticsearch and Logstash (ELK) performance and configuration tuning.
Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
Handled schema changes in data stream using Kafka.
Supported clusters and topics through Kafka Manager.
Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.
Coordinated Kafka operation and monitoring with DevOps personnel; assessed and balanced the impact of Kafka producer and Kafka consumer message (topic) consumption.
Versioned code with Git and set up Jenkins CI to manage CI/CD practices.
Pulled data and loaded it into Kibana.
Designed Kibana dashboards over Elasticsearch to visualize the data.
Used Kibana to create custom dashboards, data visualizations, and reports.
Built Jenkins jobs for CI/CD infrastructure from GitHub repos.
AWS BIG DATA ENGINEER Oct 2016 - March 2018
Implemented AWS IAM user roles and policies to authenticate and control access.
Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
Developed AWS CloudFormation templates to create custom infrastructure for our pipeline.
Worked with AWS Kinesis to process large volumes of real-time data.
Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.
Ingested data from various sources into S3 through AWS Kinesis.
Configured CloudFormation, AWS IAM, and security groups in public and private subnets within a VPC.
Worked with AWS Lambda functions for event-driven processing across various AWS resources.
Implemented Spark on EMR to process big data across our data lake in AWS.
Worked with the AWS IAM console to create custom users and groups.
Performed hands-on work with AWS EMR and S3.
Automated provisioning of AWS components such as EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.
Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.
Implemented security measures AWS provides, employing key concepts of AWS Identity and Access Management (IAM).
Installed, configured, and managed monitoring tools such as ELK and CloudWatch for resource monitoring.
Used AWS EMR to process big data across Hadoop clusters of virtual servers, with data stored in Amazon Simple Storage Service (S3).
Launched Amazon EC2 cloud servers from AMIs (Linux/Ubuntu) and configured them for specified applications.
Designed logical and physical data models for various data sources on AWS Redshift.
Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, using Amazon API Gateway.
Used AWS Kinesis for real-time data processing.
Experienced with Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, and ELB, and with IAM entities, roles, and users.
Developed AWS CloudFormation templates for Redshift.
BIG DATA DEVELOPER Aug 2015 - Oct 2016
United Health Group
St. Paul, MN
Wrote and optimized Hive queries in HiveQL.
Performed ETL into the Hadoop Distributed File System (HDFS) and wrote Hive UDFs.
Experienced in importing real-time logs to HDFS using Flume.
Created UNIX shell scripts to automate the build process, and to perform regular jobs like file transfers.
Managed Hadoop clusters and checked cluster status using Ambari.
Used Spark to transform relational database data and load it into Hive dynamic-partition tables via staging tables.
Developed scripts to automate the workflow processes and generate reports.
Transferred data between the Hadoop ecosystem and structured data storage in an RDBMS such as MySQL using Sqoop.
Involved in writing incremental imports into Hive tables.
Extensively worked on HiveQL, join operations, writing custom UDFs, and skilled in optimizing Hive Queries.
Used the Spark API over Hadoop YARN to perform analytics on data in Hive.
Developed Shell Scripts, Oozie Scripts and Python Scripts.
Retrieved data through Hive on the HDFS platform.
Developed job processing scripts using Oozie workflow to run multiple Spark Jobs in sequence for processing data.
Used Ambari to maintain a healthy cluster.
Configured Hadoop components (HDFS, Zookeeper) to coordinate the servers in clusters.
Applied partitioning, bucketing, and joins on Hive tables, utilizing Hive SerDes.
Wrote shell scripts to automate workflows to pull data from various databases into Hadoop.
Loaded data into HBase and Hive tables for consumption.
BIG DATA DEVELOPER April 2014 - Aug 2015
FTR Transportation Intelligence
Experience in configuring, installing and managing Hortonworks (HDP) Distributions.
Secured the cluster using Kerberos and integrated clusters with LDAP at the enterprise level.
Worked on tickets related to various Hadoop/big data services, including HDFS, YARN, Hive, Oozie, Spark, and Kafka.
Worked on Hortonworks Hadoop distributions (HDP 2.5).
Performed cluster tuning and ensured high availability.
Provided cluster coordination services through ZooKeeper and Kafka.
Coordinated cluster upgrade needs, monitored cluster health, and built proactive tools to detect anomalous behavior.
Managed Hadoop clusters via the command line and the Hortonworks Ambari agent.
Monitored multiple Hadoop cluster environments using Ambari.
Worked with cluster users to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.
Managed the cluster using Ambari.
Managed and scheduled batch jobs on a Hadoop Cluster using Oozie.
Monitored Hadoop cluster using tools like Ambari.
Performed cluster and system performance tuning.
Ran multiple Spark jobs in sequence for processing data.
Performed analytics on data using Spark.
Persisted data from Spark to HDFS.
Used Spark SQL and UDFs to perform transformations and actions on data residing in Hive.
Master of Science in Data Science
SAINT PETER’S UNIVERSITY
Jersey City, NJ
Overall GPA: 3.95/4.0
Bachelor of Science in Computer Science
PONTIFICAL CATHOLIC UNIVERSITY OF PERU
Accessing Hadoop Data using Hive (IBM)
Moving Data into Hadoop (IBM)
Spark Fundamental (IBM)
Tableau Desktop Specialist (Tableau)
Analytics Academy: Google Analytics (Google)
Deep Learning A-Z: Hands-On Artificial Neural Networks (Udemy)
NLP – Natural Language Processing with Python (Udemy)
Time Series Analysis and Forecasting (Udemy)
Professional Technical Skills
DATABASE AND DATA WAREHOUSE
Cassandra, HBase, Amazon Redshift, DynamoDB, MongoDB, Oracle, PostgreSQL, MySQL, Hive
DATA STORES (repositories)
Data Lake, HDFS, Data Warehouse, S3
Spark, Scala, Hive, Pig, Java, PySpark, Keras and TensorFlow
VBA, Python (Jupyter Notebook, Pandas, NumPy, Matplotlib, Scikit-learn, Boto3, Psycopg2, BeautifulSoup, GeoPandas, Rasterio)
R-Programming, MATLAB, C++, C#
DEVELOPMENT TOOLS, AUTOMATION, CICD
Git, GitHub, MVC, Jenkins, CI/CD, Jira, Agile, Scrum
ELK LOGGING & SEARCH
Elasticsearch, Logstash, Kibana
Flume, Spark, Kafka, Hive, Pig, Spark Streaming, Spark SQL, DataFrames, Kinesis, Spark Structured Streaming
BIG DATA PLATFORMS
Cloudera CDH, Hortonworks HDP, Amazon Web Services (AWS)/Amazon Cloud
AWS IAM, AWS CloudFormation, AWS Redshift, AWS RDS, AWS EMR, AWS S3, EC2, AWS Lambda, AWS Kinesis, AWS ELK
Tableau, Power BI, Excel, Kibana