
Senior Big Data Engineer

Location:
Oxon Hill, MD, 20745
Posted:
March 19, 2024


Xavier Ray

Big Data Developer

Phone: 845-***-**** / Email: adz071@r.postjobfree.com

Professional Summary

8+ years of IT experience in the Big Data space, in roles including AWS Cloud Data Engineer, Hadoop Developer, and Senior Big Data Developer. Experienced with Hadoop ecosystem tools, Apache Spark, and AWS services. Skilled in managing data analytics, data processing, machine learning, artificial intelligence, and data-driven projects.

•Worked with various stakeholders to gather requirements to create as-is and as-was dashboards.

•Experienced in Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, EBS, and IAM entities, roles, and users.

•Apply in-depth knowledge of incremental imports, partitioning, and bucketing in Hive and Spark SQL for optimization (a brief sketch appears at the end of this summary).

•Proven hands-on experience in Hadoop Framework and its ecosystem, including but not limited to HDFS Architecture, MapReduce Programming, Hive, Sqoop, HBase, MongoDB, Cassandra, Oozie, Spark RDDs, Spark Data Frames, Spark Datasets, etc.

•Design and build scalable Hadoop distributed data solutions using native Hadoop, Cloudera, Hortonworks, Spark, and Hive.

•Implement PySpark and Hadoop streaming applications with Spark Streaming and Kafka.

•Program user-defined functions (UDFs) in Python or Scala.

•Hands-on with Hadoop-as-a-Service (HAAS) environments, SQL, and NoSQL databases.

•Collect log data from various sources and integrate it into HDFS using Flume, staging the data in HDFS for further analysis.

•Expert with the design of custom reports using data extraction and reporting tools, and development of algorithms based on business cases.

•Collect real-time log data from different sources like web server logs and social media data from Facebook and Twitter using Flume, and store in HDFS for further analysis.

•Recommended and used various best practices to improve dashboard performance for Tableau server users.

•Apply Spark framework on both batch and real-time data processing.

•Hands-on experience processing data using Spark Streaming API with Scala.

•Created dashboards for TNS Value Manager in Tableau using features such as custom SQL, multiple tables, blending, extracts, parameters, filters, calculations, context filters, data source filters, hierarchies, filter actions, and maps.

•Handle large datasets using partitions, Spark in-memory capabilities, broadcasts, and join transformations in the ingestion process.

•Import real-time logs to the Hadoop Distributed File System (HDFS) using Flume.

•Analyzed the MS-SQL data model and provided inputs for converting the existing dashboards that used Excel as a data source.

•Performance-tuned data-heavy dashboards and reports using extracts, context filters, efficient calculations, data source filters, and indexing and partitioning in the data source.

•Extend Hive core functionality using custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs).

•Administer Hadoop clusters (CDM).

•Skilled in phases of data processing (collecting, aggregating, moving from various sources) using Apache Flume and Kafka.

•Drive architectural improvement and standardization of the environments.

•Expertise in Spark for reliable real-time data processing on enterprise Hadoop.

•Document big data systems, procedures, governance, and policies.
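
As a brief illustration of the partitioning and bucketing bullet above, the following PySpark sketch writes a partitioned, bucketed Hive table and runs a partition-pruned query; the database, table, column, and path names are assumptions made for illustration only.

    # Sketch only: database, table, column, and path names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-bucketing")
             .enableHiveSupport()
             .getOrCreate())

    # Raw data read from a placeholder location.
    orders = spark.read.parquet("s3://example-bucket/raw/orders/")

    # Partition by load date so date-filtered queries prune whole directories,
    # and bucket by customer_id so joins on that key can avoid full shuffles.
    (orders.write
           .mode("overwrite")
           .partitionBy("load_date")
           .bucketBy(16, "customer_id")
           .sortBy("customer_id")
           .saveAsTable("analytics.orders_bucketed"))

    # Queries that filter on the partition column only read the matching partitions.
    spark.sql("""
        SELECT customer_id, SUM(order_total) AS total_spend
        FROM analytics.orders_bucketed
        WHERE load_date = '2024-03-01'
        GROUP BY customer_id
    """).show()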

Technical Skills

•IDE: Jupyter Notebooks (formerly iPython Notebooks), Eclipse, IntelliJ, PyCharm

•Project Methods: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking, Lean Six Sigma

•Hadoop Distributions: Hadoop, Cloudera Hadoop, Hortonworks Hadoop

•Cloud Platforms: Amazon AWS - EC2, SQS, S3, Azure, GCP, Elastic Cloud

•Cloud Services: Solr Cloud, Databricks, Snowflake

•Cloud Database & Tools: Redshift, DynamoDB, Cassandra, Apache HBase, SQL, NoSQL, BigQuery, Snowflake, Snowpipes

•Programming Languages & Frameworks: Java, Python, Scala, Spark, Spark Streaming, PySpark, PyTorch

•Scripting: Hive, MapReduce, SQL, Spark SQL, Shell Scripting

•Continuous Integration CI/ CD: Jenkins, Ansible

•Versioning: Git, GitHub, BitBucket

•Programming Methodologies: Object-Oriented Programming, Functional Programming

•File Format & Compression: CSV, JSON, Avro, Parquet, ORC, XML

•File Systems: HDFS, S3

•ETL/ELT Tools: Apache NiFi, Flume, Kafka, Talend, Pentaho, Sqoop, AWS Glue jobs, Azure Data Factory

•Data Visualization Tools: Tableau, Kibana, Power BI, QuickSight, GCP Looker

•Search Tools: Apache Lucene, Elasticsearch

•Security: Kerberos, Ranger

•AWS: AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM, Glue, Step Functions, Crawler, Athena, QuickSight

•Data Query: Spark SQL, Data Frames, RDDs, PySpark, Spark/Scala

Professional Experience

Nov 2023 – Present

Guardian Life Insurance, Bethlehem, PA

Sr. Data Engineer

The Guardian Life Insurance team sought proficiency in Databricks to spearhead the migration of over 40 pipelines from EMR to a Databricks environment, incorporating Databricks Data Asset Bundles. The primary objectives were to optimize performance by leveraging Databricks features and to reduce pipeline runtime. The role also entailed documenting both existing and new infrastructure and facilitating the onboarding of new team members to ensure seamless task completion.

•Collaborated with the platform team to migrate 7 critical pipelines to Databricks utilizing Data Asset Bundles, ensuring seamless transition and enhanced performance

•Employed Databricks to build and manage scalable data pipelines, harnessing data processing and analytics

•Spearheaded the documentation of legacy infrastructures while contributing to the design of new architectural frameworks tailored for complex pipeline logic, ensuring scalability and efficiency

•Analyzed and refactored legacy code to align with the Databricks AWS environment, optimizing functionality and resource utilization

•Used Jira for project management, facilitating efficient task tracking, issue resolution, and collaboration among team members

•Led validation testing for pipelines across higher environments, leveraging comparison tools and SQL to ensure data integrity and accuracy (a minimal comparison sketch follows this section)

•Leveraged Git/BitBucket for version control, proficiently creating branches and pull requests to update pipeline code while ensuring seamless collaboration within the team

•Conducted comprehensive data profiling to gain insights into source data characteristics, facilitating comparison and validation processes

•Served as a senior engineer within a 7-member team, providing guidance, mentoring, and task delegation to junior engineers to ensure project success

•Actively participated in daily stand-up meetings and utilized Kanban boards for effective organization, task tracking, and project management

•Established secure connections using Cisco AnyConnect, ensuring data privacy and network security during remote access to corporate resources

•Implemented data processing and analysis tasks using Hive for structured data stored in Hadoop

•Utilized EC2 instances for scalable computing power, ensuring efficient data processing and analytics tasks in the AWS cloud environment

•Worked on Python for scripting and automation tasks, enhancing productivity and efficiency in data engineering workflows

•Employed PySpark for distributed data processing tasks, harnessing its parallel processing capabilities for large-scale data analytics

•Proficiently utilized Visual Studio, Databricks, JIRA, BitBucket, SQL Workbench, Cisco AnyConnect, Hive, EC2, Python, PySpark, and SQL in the Databricks Cloud Environment on AWS Workspace, demonstrating adeptness in diverse technological environments
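
The validation-testing bullet above refers to this kind of comparison. The PySpark sketch below is illustrative only: the database, table, and column names are placeholders, not actual Guardian pipelines.

    # Hypothetical tables and columns; shown only to illustrate the approach.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("migration-validation").getOrCreate()

    legacy = spark.table("legacy_db.policy_facts")
    migrated = spark.table("lakehouse_db.policy_facts")

    # 1. Row-count parity between the legacy output and the migrated output.
    print("legacy rows:", legacy.count(), "| migrated rows:", migrated.count())

    # 2. Column-level aggregates to catch silent data drift.
    def profile(df):
        return df.agg(
            F.countDistinct("policy_id").alias("distinct_policies"),
            F.sum("premium_amount").alias("total_premium"),
            F.sum(F.col("premium_amount").isNull().cast("int")).alias("null_premiums"),
        )

    profile(legacy).show()
    profile(migrated).show()

    # 3. Key-level differences: rows present in the legacy output but not the new one.
    diff = (legacy.select("policy_id", "premium_amount")
                  .exceptAll(migrated.select("policy_id", "premium_amount")))
    print("rows in legacy but missing after migration:", diff.count())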

Sep 2022 – Nov 2023

Federal Reserve Board of Governors, Washington, DC

Sr. Technical Lead

BDP (Board Data Platform)

BDP is a Nifi application used to deliver data utilizing the Cloudera Data Platform. I was brought on to support the team and assist with the upgrading of the application. This consisted of taking operational tickets, debugging workflows, fixing bugs in the application, and further developing the application.

FNR/FMR (Financial Network of Relationships /Financial Market Infrastructure)

FNR is a Nifi application used to scrape data monthly and ingest it into our data warehouses. I was brought on to support the team and assist with frequent data format changes and the onboarding of new data assets.

•Implemented solutions for ingesting data from various sources and processing the data at rest utilizing Big Data technologies such as Nifi, Hive, Hadoop, Impala, Hue, MapReduce frameworks, HBase, and Microsoft SQL Server.

•Leveraged Azure services such as Azure Databricks to integrate Spark MLlib or other machine learning libraries with ETL pipelines, enabling data-driven insights and predictions within the data engineering workflow

•Continuously monitored and fine-tuned Spark job performance by optimizing configurations, leveraging data pruning techniques, and using the Spark UI for insights.

•Improved the performance of Spark using optimizations including increased parallelism, broadcast joins, and caching.

•Prepared a contingency environment and onboarded FNR & BDP assets on Azure infrastructure.

•Deleted production environment files and onboarded BDP assets.

•Created ETL pipelines using different processors in Apache Nifi.

•Used SBT to manage dependencies for Scala-based Spark projects, ensuring consistent and reproducible builds.

•Experienced in developing and maintaining ETL jobs.

•Performed data profiling and transformation on the raw data using Spark, Python Pandas, and Oracle

•Developed, designed, and tested Spark SQL jobs with both Scala and PySpark, ensuring accurate data processing and analysis

•Created and maintained ETL pipelines in AWS and Snowflake using Glue, Lambda, and EMR for data extraction, transformation, and loading tasks, ensuring efficient and automated data processing workflows.

•Experienced with batch processing of data sources using Apache Spark on Azure.

•Fed data to predictive analytics models using Apache Spark Scala APIs.

•Created Hive external tables, loaded data into them, and queried the data using HQL (a minimal sketch follows this section).

•Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.

•Developed Spark code using Scala and Spark-SQL for faster testing and data processing.

•Imported millions of rows of structured data from relational databases using Sqoop, processed them using Spark, and stored the data in HDFS in CSV format.
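
The Hive external table bullet above refers to the pattern sketched below via the Spark SQL interface; the database, table, column, and HDFS path names are illustrative assumptions, not actual Board datasets.

    # Hedged sketch of the Hive external table + HQL pattern; names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-external-tables")
             .enableHiveSupport()
             .getOrCreate())

    # External table: Hive tracks only metadata; the files stay in place in HDFS.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS fnr.market_positions (
            reporter_id   STRING,
            counterparty  STRING,
            exposure_usd  DECIMAL(18,2)
        )
        PARTITIONED BY (report_month STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/fnr/market_positions'
    """)

    # Register any partitions already sitting under the LOCATION path.
    spark.sql("MSCK REPAIR TABLE fnr.market_positions")

    # Query the data with HQL-style SQL.
    spark.sql("""
        SELECT report_month, SUM(exposure_usd) AS total_exposure
        FROM fnr.market_positions
        WHERE report_month >= '2023-01'
        GROUP BY report_month
        ORDER BY report_month
    """).show()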

Jul 2021 – Sep 2022

Ryder, Miami, FL

Sr. AWS/Cloud Big Data Engineer

Develop a data engineering pipeline for processing and transforming large volumes of data using Spark, PySpark, and Scala. Leverage AWS Glue for ETL (Extract, Transform, Load) tasks and orchestration.

Ryder is a commercial fleet management, dedicated transportation, and supply chain solutions company.

•Utilize PySpark and Scala to perform data transformations such as cleaning, filtering, aggregating, and joining.

•Tune Spark and PySpark jobs for optimal performance, including configuring cluster resources and optimizing data shuffling.

•Monitor and troubleshoot performance bottlenecks.

•Built and analyzed a regression model on Google Cloud using PySpark.

•Developed, designed, and tested Spark SQL jobs with Scala and Python Spark consumers.

•Developed AWS CloudFormation templates to create custom infrastructure for our pipeline.

•Developed Spark programs using PySpark.

•Created User Defined Functions (UDFs) using Python in Spark.

•Created and maintained ETL pipelines in AWS using Glue, Lambda, and EMR.

•Developed scripts for collecting high-frequency log data from various sources and integrated it into AWS using Kinesis, and staged data in the Data Lake for further analysis.

•Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.

•Designed logical and physical data modeling for various data sources on AWS Redshift.

•Defined and implemented the schema for a custom HBase table.

•Created Apache Airflow DAGs using Python (a minimal DAG sketch follows this section).

•Wrote numerous Spark programs in Scala for data extraction, transformation, and aggregation from multiple file formats.

•Implemented AWS IAM user roles and policies to authenticate and control access.

•Worked with AWS Lambda functions for event-driven processing using the AWS boto3 module in Python.

•Create AWS Glue jobs to orchestrate and automate the ETL process.

•Schedule jobs using AWS Glue triggers or other scheduling mechanisms.

•Executed Hadoop/Spark jobs on AWS EMR using programs and stored data in S3 Buckets.

•Configured inbound and outbound traffic access for RDS database services, DynamoDB tables, and EBS volumes, and set alarms for notifications or automated actions on AWS.

•Specified nodes and performed the data analysis queries on Amazon Redshift clusters on AWS.

•Worked on AWS Kinesis for processing huge amounts of real-time data.

•Worked with different data science teams and provided the respective data as required on an ad-hoc basis.

•Applied Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.

•Ensured systems and network Security, maintained performance, and set up monitoring using CloudWatch and Nagios.

•Utilized version controller tools GitHub (GIT), Subversion (SVN), and software build tools Apache Maven and Apache Ant.

•Worked on CI/CD pipeline for code deployment by engaging different tools (Git, Jenkins, Code Pipeline) in the process right from developer code check-in to Production deployment.
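
As referenced above, the following is a minimal Apache Airflow DAG sketch of the orchestration pattern used in this role; the DAG id, schedule, and script paths are hypothetical placeholders rather than Ryder's actual pipelines.

    # Hypothetical DAG; task commands and paths are placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def validate_output(**context):
        # Placeholder validation step; in practice this would check row counts
        # or file manifests produced by the Spark job.
        print("validating output for run", context["ds"])


    with DAG(
        dag_id="daily_fleet_ingest",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    ) as dag:
        extract = BashOperator(
            task_id="extract_from_source",
            bash_command="python /opt/pipelines/extract.py --ds {{ ds }}",
        )
        transform = BashOperator(
            task_id="spark_transform",
            bash_command="spark-submit /opt/pipelines/transform.py --ds {{ ds }}",
        )
        validate = PythonOperator(
            task_id="validate_output",
            python_callable=validate_output,
        )

        # Linear dependency chain: extract, then transform, then validate.
        extract >> transform >> validate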

Jul 2020 – Jul 2021

Ikea, Conshohocken, PA

Sr. Big Data Engineer

IKEA offers well-designed, functional, affordable, high-quality home furnishings, produced with care for people and the environment.

•Automated the deployment of ETL jobs and infrastructure using AWS services like CloudFormation or Terraform.

•Designed deployment pipelines to ensure scalability and resilience, capable of handling varying data loads and recovering from failures efficiently.

•Developed end-to-end ETL pipelines using AWS Glue, leveraging its serverless, fully managed capabilities to ingest data from diverse sources, perform transformations, and load it into data warehouses or data lakes.

•Created robust data pipelines capable of gracefully handling schema changes, ensuring resilience even as data schemas evolve over time.

•Implemented real-time data processing using Apache Spark Structured Streaming in PySpark, enabling timely insights and actions on incoming data streams (a minimal sketch follows this section).

•Developed Spark applications for batch processing using Scala and Spark scripts to meet business requirements efficiently.

•Implemented Spark-Streaming applications to consume data from Kafka topics and insert processed streams into HBase for real-time analytics.

•Utilized Scala shell commands to develop Spark scripts as per project requirements and customized Kibana for visualization of log and streaming data.

•Analyzed and tuned Cassandra's data model for multiple internal projects, collaborating with analysts to model Cassandra tables from business rules and optimize existing tables.

•Defined Spark/Python ETL frameworks and best practices for development, converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

•Designed and deployed new ELK clusters and maintained ELK stack (Elasticsearch, Logstash, Kibana) for log data analysis and visualization.

•Applied Microsoft Azure Cloud Services (PaaS & IaaS), including Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO), and SQL Azure.

•Installed and configured Tableau Desktop to connect to Hortonworks Hive Framework for analytics of locomotive bandwidth data stored in Hive databases.

•Implemented advanced procedures like text analytics and processing using in-memory computing capabilities of Apache Spark, written in Scala, and utilized Spark SQL API for faster processing of data.

•Provided continuous discretized streams (DStreams) of data with a high level of abstraction using Spark Streaming, and moved transformed data to Spark clusters using Kafka for application consumption.
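
Below is a minimal PySpark Structured Streaming sketch of the Kafka-consumption pattern described above; the topic name, schema, broker address, and sink are assumptions, and a production job would write to HBase or a table rather than the console.

    # Requires the spark-sql-kafka-0-10 package on the classpath.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

    # Hypothetical event schema for the JSON payloads on the topic.
    event_schema = StructType([
        StructField("order_id", StringType()),
        StructField("store_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the raw Kafka stream (the value column arrives as bytes).
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "orders")
                .option("startingOffsets", "latest")
                .load())

    # Parse the JSON payloads and aggregate per store over 5-minute windows.
    orders = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("o"))
                 .select("o.*"))

    agg = (orders.withWatermark("event_time", "10 minutes")
                 .groupBy(F.window("event_time", "5 minutes"), "store_id")
                 .agg(F.sum("amount").alias("total_amount")))

    # Console sink for illustration; checkpointing makes the query restartable.
    query = (agg.writeStream
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/checkpoints/orders")
                .start())

    query.awaitTermination()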

Apr 2018 – Jul 2020

Hearst Communications, New York, NY

Sr. Big Data Engineer

Hearst Communications, Inc., often referred to simply as Hearst, is an American multinational mass media and business information conglomerate.

•Worked in AWS and Google Cloud Platform (GCP) environments.

•Utilize techniques like data partitioning and bucketing to optimize data storage and query performance in data lakes or data warehouses on AWS, such as Amazon S3 and Amazon Redshift.

•Develop custom PySpark User-Defined Functions in Python to perform specialized data transformations or calculations that are not achievable with built-in Spark functions (see the UDF sketch after this section).

•Implement dynamic cluster autoscaling in AWS Glue and AWS EMR (Elastic MapReduce) to efficiently allocate computing resources based on the workload, reducing costs during periods of lower demand.

•Utilize AWS Glue Data Catalog to create a centralized metadata repository for tracking and discovering datasets, making it easier to manage and access data assets.

•Integrate AWS Glue and Spark-based ETL pipelines into a CI/CD (Continuous Integration/Continuous Deployment) pipeline to automate testing, deployment, and monitoring.

•Deployed Databricks clusters for development, testing, and production in an automated way.

•Used Databricks, Jenkins, Circle CI, IntelliJ IDEA, and third-party tools in software development.

•Automated cluster creation and management using Google Dataproc.

•Used Databricks running Spark Scala for event processing and transformations on event data.

•Loaded transformed data into staging tables for data analysis with database functions.

•Constructed zero-downtime Spark Scala pipelines that ingested data from Hearst's software products.

•Programmed SQL functions to retrieve data and return results used in reports and other APIs.

•Implemented ETL and ELT pipelines using Python and SQL.

•Migrated legacy batch ETL jobs to streaming pipelines that process real-time events and store them in a data mart.

•Developed Spark jobs using Spark SQL, Python, and Data Frames API to process structured data into Spark clusters.

•Automated, configured, and deployed instances on AWS, Azure environments, and Data centers.

•Migrated SQL database to Azure Data Lake

•Designed and implemented on-screen and log file streaming diagnostics for the streaming pipeline.

•Programmed Python, PowerShell, Bash, and SQL scripts for purposes of validation, migration, decompression, data manipulation, data wrangling, and ETL.

•Used Spark to do transformation and preparation of Dataframes.

•Performed technical analyses to pinpoint the sources of operational problems specific to ETL processes, identified root causes, and applied corrective measures.
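
The custom UDF bullet above refers to the kind of sketch shown below; the normalization logic, data, and column names are hypothetical, chosen only to illustrate registering and applying a PySpark UDF.

    # Illustrative UDF; the business rule itself is a made-up example.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    @F.udf(returnType=StringType())
    def normalize_section(raw_section):
        # Collapse free-form section labels into a small controlled vocabulary.
        if raw_section is None:
            return "unknown"
        s = raw_section.strip().lower()
        if "sport" in s:
            return "sports"
        if "opinion" in s or "editorial" in s:
            return "opinion"
        return s

    articles = spark.createDataFrame(
        [("a1", " Sports / College"), ("a2", "Editorial"), ("a3", None)],
        ["article_id", "section"],
    )

    articles.withColumn("section_clean", normalize_section(F.col("section"))).show(truncate=False)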

Jun 2017 – Apr 2018

Citigroup, New York, NY

Data Engineer

Citigroup Inc., or Citi, is an American multinational investment bank and financial services corporation incorporated in Delaware and headquartered in New York City. Worked on a regulatory compliance and reporting automation project for Citigroup.

•Developed and maintained real-time and batch-processing applications using Spark/Scala, Kafka, and Cassandra to handle large volumes of data efficiently.

•Implemented a regulatory compliance and reporting automation system for Citi to streamline regulatory reporting processes.

•Utilized Apache Hadoop and Spark to process and analyze transactional data to ensure compliance with regulatory requirements.

•Developed data validation and quality checks to ensure the accuracy and completeness of regulatory reports (a minimal sketch follows this section).

•Automated the generation and submission of regulatory reports, reducing manual effort and improving efficiency.

•Integrated the compliance and reporting system with Citi's existing infrastructure and regulatory reporting frameworks.

•Loaded ingested data into both Hive managed and External tables, ensuring data availability for downstream applications.

•Developed custom user-defined functions (UDFs) for complex Hive queries (HQL), enabling advanced data manipulations and transformations.

•Conducted upgrades, patches, and bug fixes in the Hadoop cluster environment to ensure system stability and performance optimization.

•Automated data loading processes by writing shell scripts, improving operational efficiency and reducing manual intervention.

•Designed and developed distributed query agents for executing distributed queries against shards, enhancing query performance and scalability.

•Developed JDBC/ODBC connectors between Hive and Spark to facilitate seamless data transfer and analysis.

•Implemented scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume, staging data for further analysis.

•Leveraged the Spark DataFrame API on the Azure HDInsight platform to perform analytics on Hive data, extracting valuable insights for business decision-making.
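
Below is a minimal PySpark sketch of the sort of data validation and quality checks described above; the table, columns, and rules are illustrative assumptions, not actual Citi schemas or regulatory requirements.

    # Hypothetical report table and rules; shown only to illustrate the pattern.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("regulatory-report-checks").getOrCreate()

    report = spark.table("compliance.daily_transaction_report")

    checks = {
        # Key fields must never be null.
        "null_transaction_id": report.filter(F.col("transaction_id").isNull()).count(),
        # Amounts must be non-negative.
        "negative_amount": report.filter(F.col("amount_usd") < 0).count(),
        # Currency codes restricted to an expected ISO set.
        "bad_currency": report.filter(~F.col("currency").isin("USD", "EUR", "GBP", "JPY")).count(),
        # No duplicate transactions in a single report.
        "duplicate_keys": report.count() - report.dropDuplicates(["transaction_id"]).count(),
    }

    failed = {name: n for name, n in checks.items() if n > 0}
    if failed:
        # In a real pipeline this would block submission and raise an alert.
        raise ValueError(f"Data quality checks failed: {failed}")
    print("All data quality checks passed")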

Jan 2016 – May 2017

Food Lion, Salisbury, NC

Hadoop Data Engineer

Food Lion is a grocery store chain that operates over 1100 supermarkets in 10 states of the Mid-Atlantic and Southeastern United States.

•Used Zookeeper and Oozie for coordinating the cluster and programming workflows.

•Used Sqoop to efficiently transfer data between relational databases and HDFS, and used Flume to stream log data from servers.

•Implemented partitioning and bucketing in Hive for better organization of the data.

•Worked with different file formats and compression techniques according to standards (see the sketch after this section).

•Loaded data from the UNIX file system to HDFS.

•Used UNIX shell scripts to automate the build process and to perform routine jobs such as file transfers between different hosts.

•Documented technical specs, data flows, data models, and class models.

•Documented requirements gathered from stakeholders.

•Successfully loaded files from Teradata to HDFS and loaded data from HDFS into Hive.

•Involved in researching various available technologies, industry trends, and cutting-edge applications.

•Performed data ingestion using Flume with a Kafka source and an HDFS sink.

•Performed storage capacity management, performance tuning, and benchmarking of clusters.
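
As referenced above, the following is a small PySpark sketch of working with different file formats and compression codecs; all paths are placeholders.

    # Paths are placeholders; codecs shown are common, supported options.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("file-format-examples").getOrCreate()

    # Source data lands as gzipped CSV on HDFS (Spark handles decompression on read).
    sales = (spark.read
                  .option("header", "true")
                  .option("inferSchema", "true")
                  .csv("hdfs:///landing/sales/*.csv.gz"))

    # Columnar formats with lightweight compression suit analytical queries better.
    (sales.write
          .mode("overwrite")
          .option("compression", "snappy")
          .parquet("hdfs:///warehouse/sales_parquet"))

    (sales.write
          .mode("overwrite")
          .option("compression", "zlib")
          .orc("hdfs:///warehouse/sales_orc"))

    # Avro keeps a schema with the data and suits row-oriented interchange
    # (requires the spark-avro package on the classpath).
    (sales.write
          .mode("overwrite")
          .format("avro")
          .save("hdfs:///warehouse/sales_avro"))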

Education

Bachelor’s (Computer Science), Morgan State


