
Data Engineer

Location:
Fresno, CA
Posted:
March 23, 2023


Professional Summary

Multiple years of experience in Hadoop, Big Data, and Cloud.

Hands-on with the Spark framework for both batch and real-time streaming data processing.

Hands-on experience processing data using Spark streaming API and Spark SQL.

Skilled in AWS, Redshift, DynamoDB and various cloud tools.

Streamed millions of messages per day through Kafka and Spark Streaming.

Move and transform Big Data into insightful information using Sqoop.

Build Big Data pipelines to optimize utilization of data and configure end-to-end systems.

Use Kafka for data ingestion and extraction into HDFS Hortonworks system.

Use Spark SQL to perform preprocessing using transformations and actions on data residing in HDFS.

Create Spark Streaming jobs to divide streaming data into batches as an input to Spark engine for data processing.
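For illustration only, a minimal PySpark sketch of this kind of micro-batching job; the socket source, host, port, and 10-second batch interval are placeholder assumptions rather than details from any specific engagement:

```python
# Minimal Spark Streaming (DStream) sketch: the stream is divided into
# 10-second micro-batches that the Spark engine processes as RDDs.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batch-sketch")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (illustrative)

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```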

Construct Kafka brokers with proper configurations for the needs of the organization using Big Data.

Write Spark DataFrames to NoSQL databases like Cassandra.
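A hedged sketch of writing a Spark DataFrame to Cassandra with the Spark Cassandra Connector (the connector package must be on the classpath; the host, keyspace, and table names below are placeholders):

```python
# Write a small DataFrame to a Cassandra table via the Spark Cassandra Connector.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("df-to-cassandra")
         .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="demo_ks", table="users")  # placeholder keyspace/table
   .mode("append")
   .save())
```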

Build quality Big Data transfer pipelines for data transformation using Kafka, Spark, Spark Streaming, and Hadoop.

Design and develop new systems and tools to enable clients to optimize and track using Spark.

Work with highly available, scalable and fault tolerant big data systems using Amazon Web Services (AWS).

Provide end-to-end data solutions and support using Hadoop big data systems and tools on AWS cloud services as well as on-premise nodes.

Well versed in Big Data ecosystem using Hadoop, Spark, Kafka with column-oriented big data systems such as Cassandra and HBase.

Implement Spark in EMR for processing Big Data across our Data Lake in AWS System.

Work with various file formats (delimited text files, clickstream log files, Apache log files, Avro, JSON, CSV, XML).

Use Kafka and HiveQL scripts to extract, transform, and load the data into multiple databases.

Perform cluster and system performance tuning on Big Data systems.

Technical Skills

PROGRAMMING Python, Scala, Java

SCRIPTING Python, Unix Shell Scripting

SOFTWARE DEVELOPMENT Agile, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Gradle, Git, GitHub, SVN, Jenkins, Jira

DEVELOPMENT ENVIRONMENTS Eclipse, IntelliJ, PyCharm, Visual Studio, Atom

AMAZON CLOUD Amazon AWS (EMR, EC2, S3, SQL, DynamoDB, Cassandra, Redshift, CloudFormation)

DATABASE NoSQL: Cassandra, HBase, MongoDB; SQL: MySQL, PostgreSQL

HADOOP DISTRIBUTIONS Cloudera, Hortonworks

QUERY/SEARCH SQL, HiveQL, Apache Solr, Kibana, Elasticsearch

BIG DATA COMPUTE Apache Spark, Spark Streaming, Spark SQL

MISC Hive, YARN, Kafka, Flink

VISUALIZATION Kibana, Tableau, Power BI, Grafana

Professional Experience

Data Engineer – Thomson Reuters Corporation, Remote – June 2022 – Present

Thomson Reuters provides the intelligence, technology, and human expertise to find trusted answers, delivering trusted data and information to professionals across three industries: Legal; Tax and Accounting; and News & Media.


Worked on AWS to create and manage EC2 instances and Hadoop clusters.

Wrote shell scripts to automatically load log files into the Hadoop cluster.

Implemented AWS fully managed Kafka streaming to send data streams from the company APIs to a Spark cluster in AWS Databricks.

Streamed data from AWS Fully Managed Kafka brokers using Spark Streaming and processed the data using explode transformations.
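A rough Structured Streaming sketch of this pattern, reading from Kafka and flattening a JSON array field with explode(); the broker address, topic, and message schema are assumptions, and the spark-sql-kafka package is assumed to be available:

```python
# Read a Kafka topic as a stream, parse the JSON payload, and explode an array field.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-explode-sketch").getOrCreate()

schema = StructType([StructField("events", ArrayType(StringType()))])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "api-events")                 # placeholder topic
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("payload"))
          .select(explode(col("payload.events")).alias("event")))

query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```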

Developed AWS CloudFormation templates to create the custom infrastructure of our pipeline.

Performed streaming data ingestion through PySpark.

Finalized the data pipeline using DynamoDB as the NoSQL storage option and a Python Lambda function.
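A minimal sketch, assuming an event-driven Lambda that persists records to DynamoDB with boto3; the table name and event shape are hypothetical:

```python
# Python Lambda handler that writes incoming records to a DynamoDB table.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("pipeline_events")  # hypothetical table name

def lambda_handler(event, context):
    # Assumes an SQS-style event with a "Records" list of JSON bodies.
    for record in event.get("Records", []):
        item = json.loads(record["body"])
        table.put_item(Item=item)
    return {"statusCode": 200, "body": "ok"}
```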

Hands-on with AWS data migration between database platforms, from local SQL Servers to Amazon RDS and AWS Redshift.

Optimized Python code and SQL queries, created tables/views, and wrote custom queries and Hive-based exception processes.

Implemented a data warehouse solution in AWS Redshift.

Deployed the Big Data Hadoop application using Talend on AWS Cloud.

Utilized AWS Redshift to store terabytes of data in the cloud.

Used the Spark SQL and DataFrames APIs to load structured and semi-structured data into Spark.

Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

Ingested large data streams from company REST APIs into the EMR cluster through AWS Kinesis.

Developed consumer intelligence reports based on market research, data analytics, and social media.

Implemented Spark using Scala and SparkSQL for faster testing and processing of data.

Automated AWS components like EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.

Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.

Data Engineer – Deloitte (US Region Delivery Center), Fresno, CA – April 2021 – May 2022

At Deloitte's US Delivery Center (USDC) we help clients achieve a higher level of service in operational efficiency and business value. The USDC leverages scale, talent, and a center delivery model to provide high quality, cost-effective service with standardized processes and procedures.

The US Delivery Center brings a multidisciplinary approach that merges Deloitte’s independently recognized industry, business, and technical experience with leading operating approaches, refined as a result of building and deploying hundreds of solutions.

All locations are strategically positioned to be able to scale and deliver quickly on the most complex projects at any stage.

Created PySpark streaming job in Azure Databricks to receive real time data from Kafka.

Defined Spark data schema and set up development environment inside the cluster.

Processed data with natural language toolkit to count important words and generated word clouds.

Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADL Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS Power BI for reporting.

Used Azure Data Factory to orchestrate data pipelines.

Created a pipeline to gather data using PySpark, Kafka, and Snowflake.

Used Spark Streaming to receive real-time data using Kafka.

Worked with unstructured data and parsed out the information using Python built-in functions.

Configured a Python API Producer file to ingest data from the Slack API using Kafka for real-time processing with Spark.
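A hedged sketch of such a producer using kafka-python and the Slack Web API; the token environment variable, channel ID, topic, and broker address are placeholders:

```python
# Poll a Slack Web API endpoint and publish each message to a Kafka topic.
import json
import os

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                    # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

resp = requests.get(
    "https://slack.com/api/conversations.history",
    headers={"Authorization": f"Bearer {os.environ['SLACK_TOKEN']}"},
    params={"channel": "C0123456789"},                     # placeholder channel
    timeout=10,
)

for message in resp.json().get("messages", []):
    producer.send("slack-messages", value=message)         # placeholder topic
producer.flush()
```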

Developed Spark programs using Python to run in the EMR clusters.

Ingested image responses through a Kafka producer written in Python.

Started and configured master and slave nodes for Spark/HBase.

Designed Spark Python jobs to consume information from S3 Buckets using Boto3.
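As a sketch under assumptions (bucket and prefix are placeholders, credentials come from the environment or an instance role), such a job might list S3 keys with boto3 and parallelize the reads in Spark:

```python
# List objects under an S3 prefix with boto3 and read them in parallel with Spark.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-ingest-sketch").getOrCreate()

s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="example-bucket", Prefix="raw/")  # placeholders
keys = [obj["Key"] for obj in listing.get("Contents", [])]

def fetch(key):
    # A fresh client per task, since boto3 clients are not serializable.
    body = boto3.client("s3").get_object(Bucket="example-bucket", Key=key)["Body"]
    return body.read().decode("utf-8")

contents = spark.sparkContext.parallelize(keys).map(fetch)
print(contents.count())
```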

Set up cloud compute engine instances in managed and unmanaged mode and handled SSH key management.

Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.

Used Spark Streaming as a Kafka consumer to process consumer data.

Wrote Spark SQL to create and read Cassandra tables.

Wrote streaming data into Cassandra tables with Spark Structured Streaming.

Wrote Bash script to gather cluster information for Spark submits.

Developed Spark UDFs using Scala for better performance.

Managed Hive connection with tables, databases, and external tables.

Installed Hadoop using Terminal and set the configurations.

Interacted with data residing in HDFS using PySpark to process data.

Configured Linux across multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration.

Hadoop Engineer – Serviperf / The Coca-Cola Company, Remote – March 2019 – February 2020

Worked in the Data & Analytics Technologies organization, which is responsible for building cloud-based analytics products for APAC, EMEA, the Americas, and Corporate that directly impact Coca-Cola's business growth globally.

Configured Kafka Producer with API endpoints using JDBC Autonomous REST Connectors.

Configured a multi-node cluster of 10 nodes and 30 brokers for consuming high-volume, high-velocity data.

Used GCP BigQuery to store data.

Created GCP BigQuery SQL queries to gather data for business units.
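An illustrative query with the google-cloud-bigquery Python client; the project, dataset, and column names are hypothetical:

```python
# Run a BigQuery SQL query and iterate over the result rows.
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials come from the environment

sql = """
    SELECT business_unit, SUM(order_total) AS revenue
    FROM `example-project.sales.orders`      -- hypothetical table
    GROUP BY business_unit
"""

for row in client.query(sql).result():
    print(row["business_unit"], row["revenue"])
```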

Implemented a parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational Kibana Query Language (KQL) queries, including joins.

Developed distributed query agents for performing distributed queries against shards.

Wrote Producer/Consumer scripts to process JSON response in Python.

Developed JDBC/ODBC connectors between Hive/Snowflake and Spark for the transfer of the newly populated data frame.

Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.

Wrote complex queries against the API data in Apache Hive on Hortonworks Sandbox.

Utilized GCP BigQuery to query the data to discover trends from week to week.

Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, Airflow) on the Hadoop cluster with the latest patches.

Created Hive queries to summarize and aggregate business queries by comparing Hadoop data with historical metrics.

Worked on various real-time and batch processing applications using Spark/Scala, Kafka and Cassandra.

Loaded ingested data into Hive managed and external tables.

Built Hive views on top of the source data tables and implemented secured data provisioning.

Used Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters.


Performed upgrades, patches, and bug fixes on Hadoop in a cluster environment.

Wrote shell scripts for automating the process of data loading.

Evaluated and proposed new tools and technologies to meet the needs of the organization.

Database Developer – Serviperf / Kimberly-Clark, Remote – November 2017 – February 2019

Responsible for designing and developing enterprise data solutions on Cloud Platforms integrating Azure services and 3rd party data technologies. Worked closely with a multidisciplinary agile team to build high quality data pipelines driving analytic solutions. These solutions generated insights from our connected data, enabling Kimberly-Clark to advance the data-driven decision-making capabilities of our enterprise.

Provided technical support for the client portfolio of accounting software throughout Latin America, mainly resolving database-related problems within the software (Microsoft SQL Server Management Studio).

Built and customized integration systems using technologies such as RaaS, SaaS, APIs, and web services. Designed and developed web applications with the C# ASP.NET framework, SQL Server Management, and UI components.

Wrote shell scripts for time-bound command execution.

Worked with application teams to install operating systems, updates, patches, and version upgrades.

Utilized conceptual knowledge of data and analytics, such as dimensional modeling, ETL, reporting tools, data governance, data warehousing, and structured and unstructured data, to solve complex data problems.

Big Data Engineer – Citigroup, New York, NY – November 2016 – November 2017

Built the infrastructure required for extraction, transformation, and loading of data from a variety of data sources using AWS technologies.

Worked with stakeholders (e.g., Executives, Product, Data and Design teams) to assist with data-related technical issues.

Worked with Airflow to schedule Spark applications.

Created multiple Airflow DAGs to manage parallel execution of activities and workflows.
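A minimal sketch of a DAG with two extract tasks that run in parallel before a load step; task names, schedule, and callables are assumptions:

```python
# Airflow DAG where extract_a and extract_b run in parallel, then load runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_a():
    print("extract source A")

def extract_b():
    print("extract source B")

def load():
    print("load to warehouse")

with DAG(
    dag_id="parallel_extract_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task_a = PythonOperator(task_id="extract_a", python_callable=extract_a)
    task_b = PythonOperator(task_id="extract_b", python_callable=extract_b)
    task_load = PythonOperator(task_id="load", python_callable=load)

    [task_a, task_b] >> task_load
```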

Designed multiple applications to consume and transport data from S3 to EMR and Redshift.

Developed PySpark application to process consecutive datasets.

Created EMR clusters using CloudFormation.

Created Lambda Applications triggered based on events over S3 buckets.

Created Spark programs using Scala for better performance.

Adjusted Spark applications' shuffle partition size to achieve the maximum level of parallelism.
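For example, a sketch of this kind of tuning; the value 400 is purely illustrative and would in practice be sized to the cluster's cores and data volume:

```python
# Raise spark.sql.shuffle.partitions so wide transformations get more parallelism.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-tuning-sketch").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")  # illustrative value

df = spark.range(10_000_000).withColumnRenamed("id", "key")
agg = df.groupBy((df.key % 100).alias("bucket")).count()  # wide transformation
agg.show(5)
```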

Used Elasticsearch to monitor application logs.

Performed incremental appends of datasets.

Optimized Spark using map-side join transformations to reduce shuffle.
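A hedged sketch of the same idea in Spark SQL terms: broadcasting the small dimension table so the join happens map-side without shuffling the large table (table contents are placeholders):

```python
# Broadcast the small table so the join avoids shuffling the large one.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

facts = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
dims = spark.createDataFrame(
    [(0, "retail"), (1, "wholesale")], ["customer_id", "segment"]
)

joined = facts.join(broadcast(dims), "customer_id")  # map-side (broadcast) join
joined.explain()
```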

Applied the Kafka Streams library.

Education

ITESM MX, MECHATRONICS ENGINEER


