Senior Data Engineer

Location:
Seattle, WA
Posted:
November 11, 2022


Resume:

PROFESSIONAL PROFILE

*+ years’ experience working with Big Data and Hadoop.

Overall IT experience covers 9+ years.

Understands and articulates the overall value of Big Data; works effectively and proactively with internal and external partners.

Hands-on experience with cloud platforms such as AWS and Azure.

Proficient in various distributions and platforms, including the Apache Hadoop ecosystem, Microsoft Azure, and Databricks Spark; working knowledge of AWS and Hortonworks/Cloudera.

Expert in bucketing, partitioning, multi-threaded computing, and streaming (Python, PySpark); see the partitioning sketch at the end of this profile.

Expert in Performance Optimization and Query Tuning (MS SQL).

Handles multiple terabytes of mobile ad data stored in AWS using Elastic MapReduce (EMR) and the PostgreSQL-based Amazon Redshift.

Provides actionable recommendations to meet data analytics needs on a continuous basis using Hadoop distributed systems and cloud platforms.

Presents analytics and insights using data visualization tools such as Tableau, together with Hadoop tools, to generate reports and dashboards that drive key business decisions.

Experience with data visualization tools, data analysis, and business recommendations (cost-benefit, invest-divest, forecasting, impact analysis).

Delivers effective presentations of findings and recommendations to multiple levels of stakeholders, creating visual displays of quantitative information.

Cleanses, aggregates, and organizes data in the Hadoop HDFS data lake.

Skilled with the Spark framework for both batch and real-time data processing.

Hands-on experience processing data using Spark Streaming API with Scala.

Experience using Hadoop clusters, HDFS, Hadoop tools, Spark, Kafka, and Hive for social data and media analytics within the Hadoop ecosystem.

Highly knowledgeable in data concepts and technologies, including AWS pipelines and cloud repositories (Amazon AWS, MapR, Cloudera).

Uses Hadoop ecosystem tools for ETL, pipelines, and data cleaning in preparation for analysis.

Experience migrating data with Sqoop between HDFS and relational database systems according to client requirements.

Experience in data processing, including collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.

Created different POCs using Azure technologies such as Azure Data Factory, Azure Synapse, and Databricks.

Orchestrated Spark jobs using Azure Data Factory.
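
The partitioning and bucketing noted above can be illustrated with a minimal PySpark sketch; the paths, table, and column names below are hypothetical placeholders, not taken from any employer's codebase.

    # Minimal PySpark sketch of writing a partitioned, bucketed table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition-bucket-sketch")
             .enableHiveSupport()   # persist the table through the Hive metastore
             .getOrCreate())

    # Hypothetical input: ad-event data landed as Parquet.
    events = spark.read.parquet("s3a://example-bucket/raw/ad_events/")

    # Partition by event date, bucket and sort by user id, save as a Hive table.
    (events.write
        .partitionBy("event_date")
        .bucketBy(16, "user_id")
        .sortBy("user_id")
        .format("parquet")
        .mode("overwrite")
        .saveAsTable("ad_events_bucketed"))

Partitioning prunes whole directories at read time, while bucketing pre-shuffles rows on user_id so later joins and aggregations on that key can avoid a full shuffle.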

TECHNICAL SKILLS

APACHE - Hadoop, YARN, Hive, Kafka, Oozie, Spark, Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapReduce

AZURE CLOUD – Databricks, Azure Synapse Analytics, Data Factory, Blob Storage, SQL Server.

DATABASES - Microsoft SQL Server Database (2005, 2008R2, 2012), Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive

OPERATING SYSTEMS - Unix/Linux, Windows 10, Mac OS

SCRIPTING - HiveQL, MapReduce, XML, FTP, Python, Scala, UNIX, Shell scripting, LINUX

AWS CLOUD – EMR, Glue, Lambda, SQS, SNS, S3, Athena, AWS Redshift, Kinesis.

DATA VISUALIZATION TOOLS - Tableau, PowerBI, DataDog

COMPUTE ENGINES - Apache Spark, Spark Streaming

FILE FORMATS - Parquet, Avro, JSON, ORC, Text, CSV

PROFESSIONAL WORK EXPERIENCE

Senior Data Engineer AWS

Expedia Inc., Seattle, WA / 11.2020 to Current

Expedia Inc. is an online travel agency owned by Expedia Group, an American online travel shopping company based in Seattle. The website and mobile app can be used to book airline tickets, hotel reservations, car rentals, cruise ships, and vacation packages.

Develop, test, and update ETLs end-to-end (E2E) using AWS services (Kinesis Firehose, IAM, S3, AWS Lambda, EMR, EC2) to process terabytes of data and store it in HDFS and Amazon S3.

Create, update, and fix Spark applications using Scala as the main language and Spark ecosystem tools such as DataFrames with Spark SQL.

Create and manage data pipelines using Python as the main language and Airflow as the workflow scheduler; see the DAG sketch at the end of this section.

Create queries in Hive (HQL), Spark SQL, Presto, and Microsoft SQL Server, using Qubole as the data warehouse platform.

Run CI/CD for Spark and Scala applications using Spinnaker and Jenkins as deployment tools and Git/GitHub to track and host code.

Work closely with the data science team to provide data for further analysis.

Use Spark APIs to perform the necessary transformations and actions on real-time data, visualizing results in Tableau and Datadog for performance monitoring.

Use Spark SQL and Hive Query Language (HQL) to surface customer insights used for critical decision-making by business users.

Perform streaming data ingestion into the Spark environment using Kafka, as sketched in the streaming example at the end of this section.
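
The Airflow scheduling mentioned above can be shown with a minimal sketch; the DAG id, task, and callable are hypothetical placeholders, assuming Airflow 2.x.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_bookings():
        # Hypothetical extraction step; a real task would pull from an upstream source.
        print("pulling daily booking data")

    with DAG(
        dag_id="daily_booking_etl",        # hypothetical pipeline name
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract_bookings",
            python_callable=extract_bookings,
        )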
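
The Kafka-to-Spark ingestion in the last bullet can be sketched with Structured Streaming; the brokers, topic, and checkpoint path are hypothetical placeholders, Python is used for consistency with the other sketches even though the role's applications were written in Scala, and the job assumes the spark-sql-kafka connector is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

    # Read a hypothetical Kafka topic as an unbounded streaming DataFrame.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "clickstream")
              .load())

    # Kafka values arrive as bytes; cast to string before any downstream parsing.
    messages = stream.selectExpr("CAST(value AS STRING) AS value")

    # Console sink for the sketch; a real job would write to S3/HDFS or a table.
    query = (messages.writeStream
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/clickstream")
             .start())
    query.awaitTermination()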

Big Data Developer

International Business Machines Corporation (IBM), Armonk, NY / 03.2017 to 11.2020

International Business Machines Corporation (IBM) is an American multinational technology corporation headquartered in Armonk, New York, with operations in over 171 countries. IBM produces and sells computer hardware, middleware, and software, and provides hosting and consulting services in areas ranging from mainframe computers to nanotechnology.

Created E2E test cases to validate data between modules following a data-driven testing methodology: reading data from relational databases, applying actions and transformations with Scala and Spark SQL, and storing data in Hive external tables.

Communicated deliverables status to stakeholders and facilitated periodic review meetings.

Created and managed external tables to store ORC and Parquet files using HQL.

Created scripts to test and automate data testing using Gherkin, Scala and Jenkins.

Developed Apache Airflow scripts with Python as the main language to automate the pipelines.

Worked on importing claims information between the DB2 warehouse and RDBMS systems.

Performed performance tuning for Spark Streaming (e.g., setting the right batch interval, the correct number of executors, and appropriate publishing and memory settings).

Performed gradual cleansing and modeling of datasets.

Worked on the CI/CD pipeline for code deployment, engaging different tools (Git/GitHub, Jenkins) from developer code check-in through deployment.

Implemented different data validations for databases and datasets.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.

Analyzed the SQL scripts and designed the solution to implement using PySpark.

Imported data from sources such as AWS S3 into Spark RDDs and performed transformations and actions on them; see the sketch at the end of this section.

Implemented Spark jobs in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
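
The S3-to-Spark pattern referenced above amounts to reading from S3, applying DataFrame transformations (the Spark equivalent of a Hive/SQL query), and triggering actions; the bucket, columns, and output path below are hypothetical placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("s3-transform-sketch").getOrCreate()

    # Hypothetical claims extract landed in S3 as CSV (s3a connector assumed).
    claims = (spark.read
              .option("header", "true")
              .csv("s3a://example-bucket/claims/2020/"))

    # Transformation: DataFrame equivalent of a Hive GROUP BY with a filter.
    totals = (claims
              .filter(F.col("status") == "PAID")
              .groupBy("policy_id")
              .agg(F.sum(F.col("amount").cast("double")).alias("total_paid")))

    # Actions: materialize the result, then persist it back to S3 as Parquet.
    print(totals.count())
    totals.write.mode("overwrite").parquet("s3a://example-bucket/curated/paid_totals/")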

Big Data Developer

Comerica Incorporated, Dallas, TX / 01.2016 to 03.2017

Comerica Incorporated is a financial services company strategically aligned by the Business Bank, the Retail Bank, and Wealth Management. The Business Bank provides companies of all sizes with an array of credit and non-credit financial products and services. The Retail Bank delivers personalized financial products and services to consumers. Wealth Management serves the needs of high-net-worth clients and institutions.

Developed Spark programs using PySpark APIs to compare the performance of Spark with Hive and SQL.

Created different POCs using Azure technologies such as Azure Data Factory, Azure Synapse, and Databricks.

Orchestrated Spark jobs using Azure Data Factory.

Developed Spark scripts using Scala shell commands as per requirements.

Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.

Involved in HBase setup and in storing data into HBase for analysis.

Developed Spark jobs to parse the JSON or XML data.

Used JSON and XML serialization and deserialization to load JSON and XML data into Hive tables.

Used Spark for interactive queries, streaming data processing, and integration with popular NoSQL databases for huge volumes of data.

Used Scala libraries to process XML data stored in HDFS and wrote the processed data back to HDFS.

Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.

Loaded data into Spark RDDs and configured the system to perform in-memory computation to generate the output response.

Wrote different Pig scripts to clean up the ingested data and created partitions for the daily data.

Used Impala for querying HDFS data to achieve better performance.

Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables, handling structured data with Spark SQL; see the sketch at the end of this section.

Implemented HQL scripts to load data into and retrieve data from Hive.

Tested MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.

Used the Avro, Parquet, and ORC data formats to store data in HDFS.

Used Oozie workflows to coordinate Pig and Hive scripts.

Deployed to Amazon Web Services (AWS) Cloud services like EC2, S3, EBS, RDS and VPC.

Worked with various HDFS file formats such as Avro and SequenceFile, and compression formats such as Snappy.

Used the DataStax Spark Cassandra Connector to extract and load data to and from Cassandra.

Used Azure cloud to create different ETL pipelines.
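
The Hive external tables and Spark SQL JSON loading mentioned above can be combined into one minimal sketch; the database, table, columns, and HDFS paths are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Load hypothetical JSON records into a DataFrame and expose them to SQL.
    txns = spark.read.json("hdfs:///data/raw/transactions/")
    txns.createOrReplaceTempView("txns_stage")

    # Hypothetical external table, partitioned by transaction date, stored as ORC.
    spark.sql("CREATE DATABASE IF NOT EXISTS bank")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS bank.txns (
            account_id STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        STORED AS ORC
        LOCATION 'hdfs:///data/warehouse/txns'
    """)

    # Dynamic-partition insert from the staged JSON data.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE bank.txns PARTITION (txn_date)
        SELECT account_id, amount, txn_date FROM txns_stage
    """)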

Data Engineer

Suffolk Construction Company, Boston, MA / 04.2015 to 01.2016

Suffolk Construction Company is an American construction contracting company specializing in the aviation, commercial, education, healthcare, gaming and government sectors.

Developed and ran MapReduce jobs on YARN clusters to produce daily and monthly reports per requirements.

Developed MapReduce jobs using Java for data transformations.

Developed dynamic parameter file and environment variables to run jobs in different environments.

Wrote custom UDFs in Pig and Hive in accordance with business requirements; see the Pig UDF sketch at the end of this section.

Worked on Hive to create numerous internal and external tables.

Partitioned and bucketed Hive tables; maintained and aggregated daily accretions of data.

Extracted the data from RDBMS (Oracle, MySQL) to HDFS using Sqoop.

Wrote Sqoop scripts to move data into and out of HDFS, validating the data before loading to check for duplicates.

Optimized Hive analytics and SQL queries; created tables and views; wrote custom UDFs; and implemented Hive-based exception processing.

Optimized MapReduce jobs by using partitioners for one-to-many joins, saving execution time; designed and tested the reliability of MapReduce jobs using unit testing on the HBase/HDFS dev/QA platforms.

Wrote MapReduce code to process and parse data from various sources and store parsed data into HBase and Hive using HBase-Hive Integration.

Developed Pig scripts to arrange incoming data into suitable and structured data before piping it out for analysis.

Worked on installing clusters, commissioning and decommissioning data nodes, configuring slots, NameNode high availability, and capacity planning.

Executed cluster upgrade tasks on the staging platform before applying them to the production cluster.

Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.

Developed and maintained continuous integration systems in a Cloud computing environment (Azure).

Used instance image files to create new instances with Hadoop installed and running.

Documented Technical Specs, Dataflow, Data Models and Class Models.
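
One of the custom UDFs mentioned above can be sketched as a Python (Jython) UDF for Pig; the function and field names are hypothetical placeholders, and the example assumes Pig's Jython UDF support, which supplies the outputSchema decorator when the module is registered from Pig Latin.

    # udfs.py -- hypothetical Jython UDF module for Pig.
    # The outputSchema decorator is provided by Pig's Jython script engine.

    @outputSchema("clean_name:chararray")
    def normalize_name(raw):
        # Trim whitespace and upper-case a free-text name field.
        if raw is None:
            return None
        return raw.strip().upper()

    # Registered and called from Pig Latin roughly as:
    #   REGISTER 'udfs.py' USING jython AS myudfs;
    #   cleaned = FOREACH records GENERATE myudfs.normalize_name(name);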

Database Administrator

Tata Consultancy Services, Atlanta, GA / 06.2013 to 04.2015

Tata Consultancy Services is a global leader in IT services, consulting & business solutions with a large network of innovation & delivery centers.

Developed database architectural strategies at the modeling, design, and implementation stages.

Analyzed and profiled data for quality and reconciled data issues using SQL.

Translated database designs into actual physical database implementations.

Designed and configured MySQL server cluster and managed each node on the cluster.

Maintained existing database systems.

Optimized large, high-volume, multi-server MySQL databases by applying technical re-configurations.

Standardized MySQL installs on all servers with custom configurations.

Modified database schemas.

Performed security audits of MySQL internal tables and user access, and revoked access for unauthorized users.

Set up replication for disaster and point-in-time recovery.

EDUCATION

TECNOLOGICO DE COLIMA - B.A. Engineering / Mechatronics


