Sr Big Data Developer

Location: Washington, VA 22747
Posted: April 05, 2024

Phone Number: 401-***-****; Email: ad4soe@r.postjobfree.com

Professional Summary

• Accomplished Big Data and ETL professional with 12+ years of hands-on experience developing large-scale data systems. I bring a rigorous, analytical approach to building innovative solutions and have a consistent record of delivering high-quality results.

• Deep expertise in Amazon Web Services (AWS), complemented by working knowledge of Google Cloud Platform (GCP) and Microsoft Azure, enabling me to transfer designs and skills across diverse cloud environments.

• Key highlights of my skill set include:

o Mastery in managing data lakes on GCP, implementing best practices for robust data lake architecture.

o Proficiency in Google Cloud Data Prep for efficient data preparation and transformation.

o Use of Terraform on Google Cloud to manage and provision GCP resources, improving the efficiency of data pipeline deployments.

o Designing intricate workflows with Google Cloud Composer to orchestrate complex data processing tasks.

o Harnessing the power of Google Cloud Pub/Sub for scalable, event-driven messaging solutions.

o Implementation of Google Cloud Audit Logging for meticulous auditing and compliance monitoring.

• In the domain of Azure, I showcase expertise in designing scalable data architectures using services like Azure Data Lake, Azure Synapse Analytics, and Azure SQL. My accomplishments include adept construction and management of data pipelines via Azure Data Factory.

• Within the AWS ecosystem, I demonstrate proficiency in AWS EMR and AWS Lambda for high-volume data processing and analysis. Noteworthy achievements include the design and implementation of a data warehousing solution using AWS Redshift and Athena for streamlined data querying and analysis.

• Additional areas of expertise include:

o In-depth knowledge of AWS CloudWatch for comprehensive monitoring and management of AWS resources.

o Hands-on experience in implementing data security and access controls through AWS Identity and Access Management (IAM).

o Development of workflows using AWS Step Functions to handle intricate data processing tasks.

o Formulation of comprehensive technical strategies and scalable CloudFormation Scripts for efficient project execution.

• My extensive experience also covers Spark performance optimization across platforms such as Databricks, Glue, EMR, and on-premises systems. I have actively participated in migrating on-premises data into the cloud.

• My software development proficiency within Big Data and Hadoop ecosystems, utilizing Apache Spark, Kafka, and various ETL technologies, has been a cornerstone of my career. I possess hands-on expertise in the major components of the Hadoop ecosystem, enabling the development of robust data solutions.

• Furthermore, I bring:

o Proficiency in Apache NIFI, integrated with Apache Kafka for streamlined data flow and processing.

o Expertise in designing and managing data transformation and filtration patterns using Spark, Hive, and Python for optimal data processing (a brief sketch of this pattern appears at the end of this summary).

o Experience in upgrading MapR, CDH, and HDP clusters, ensuring seamless transitions and enhanced system performance.

o Development of scalable, reliable data solutions for real-time and batch data movement across multiple systems, ensuring seamless integration.

o A proven ability to validate the technical and operational feasibility of Hadoop developer solutions, providing valuable insights into project success.

o Direct engagement with business stakeholders, honing my ability to deeply comprehend project needs and objectives, aligning technical solutions with overarching goals.
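
Illustrative sketch of the Spark/Python transformation-and-filtration pattern referenced above; the bucket, paths, and column names are hypothetical placeholders rather than details from a specific engagement:

    # Minimal PySpark transformation/filtration sketch; paths and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform_filter_example").getOrCreate()

    # Read raw events, drop malformed rows, derive a date column, and keep recent activity only.
    raw = spark.read.parquet("s3://example-bucket/raw/events/")
    clean = (
        raw.dropna(subset=["event_id", "event_ts"])
           .withColumn("event_date", F.to_date("event_ts"))
           .filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))
    )

    # Write the curated output partitioned by date for efficient downstream queries.
    clean.write.mode("overwrite").partitionBy("event_date").parquet("s3://example-bucket/curated/events/")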

Technical Skills:

Programming Languages: Python, Scala, PySpark, SQL

Scripting: Hive, MapReduce, SQL, Spark SQL, Shell Scripting

IDE: Jupyter Notebooks, Eclipse, IntelliJ, PyCharm, VS Code

Database & Tools: Redshift, DynamoDB, Synapse DB, Bigtable, AWS RDS, SQL Server, PostgreSQL, Oracle, MongoDB, Cassandra, HBase

Hadoop Distributions: Hadoop, Cloudera Hadoop, Hortonworks Hadoop

ETL Tools: Spark, NiFi, AWS Glue, AWS EMR, Databricks, Azure Data Factory, Google Dataflow

File Format & Compression: CSV, JSON, Avro, Parquet, ORC

File Systems: HDFS, S3, Google Storage, Azure Data Lake

Cloud Platforms: Amazon AWS, Google Cloud (GCP), Microsoft Azure

Orchestration Tools: Apache Airflow, Step Functions, Oozie

CI/CD: Jenkins, AWS CodePipeline, Docker, Kubernetes, Terraform, CloudFormation

Versioning: Git, GitHub

Programming Methodologies: Object-Oriented Programming, Functional Programming

Project Methods: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking, Lean Six Sigma

Data Visualization Tools: Tableau, Kibana, Power BI, AWS QuickSight

Search Tools: Apache Lucene, Elasticsearch

Security: Kerberos, Ranger, IAM

Professional Experience

Sr. Big Data Engineer

AmerisourceBergen, Conshohocken, Pennsylvania Jan 2023 – Present

• Orchestrated the deployment of Amazon EC2 instances and skillfully managed S3 storage, customizing instances for specific applications and Linux distributions.

• Managed critical AWS services including S3, Athena, Glue, EMR, Kinesis, Redshift, IAM, VPC, EC2, ELB, Code Deploy, RDS, ASG, and CloudWatch, constructing a resilient and high-performance cloud architecture.

• Offered expert insights on AWS system architecture, implementing rigorous security measures and protocols to safeguard vital assets and ensure compliance.

• Optimized data processing using Amazon EMR and EC2, enhancing system performance and scalability.

• Designed and fine-tuned data processing workflows utilizing AWS services like Amazon EMR and Kinesis, ensuring efficient data analysis.

• Leveraged AWS Glue for data cleaning and preprocessing, performed real-time analysis using Amazon Kinesis Data Analytics, and harnessed services like EMR, Redshift, DynamoDB, and Lambda for scalable data processing.

• Implemented robust data governance and security protocols through AWS IAM and Amazon Macie, ensuring protection of sensitive data.

• Utilized AWS Glue and Fully Managed Kafka for efficient data streaming, transformation, and preparation.

• Established and managed data storage on Amazon S3 and Redshift, ensuring accessibility and organization of data.

• Seamlessly integrated Snowflake into the data processing workflow, elevating data warehousing capabilities.

• Evaluated on-premises data infrastructure for migration opportunities to AWS, orchestrated data pipelines via AWS Step Functions, and employed Amazon Kinesis for event-driven processing.

• Proficiently utilized Apache Airflow for workflow automation, orchestrating complex data pipelines and automating tasks to boost operational efficiency.

• Implemented AWS CloudFormation to automate the provisioning of AWS resources, ensuring consistent and repeatable deployments across multiple environments.

• Employed an array of tools (Scala, Hive, MapReduce, Sqoop, Pig Latin) within Hadoop ecosystems for data cleaning, transformation, analysis, and querying.

• Successfully executed a comprehensive two-part project at Walgreens Boots Alliance involving Files Provisioning and Files Processing, leveraging AWS Glue, Databricks, and other technologies.

• Fine-tuned Kafka configurations to optimize performance, throughput, and latency.

• Developed efficient Spark jobs using Scala and Python, harnessing Spark SQL for swift data processing and analysis. Implemented robust data quality frameworks to ensure accuracy, consistency, and reliability across extensive datasets, conducting thorough profiling, cleansing, and validation processes.

• Conducted streaming data ingestion using PySpark, enhancing data acquisition capabilities (see the streaming sketch at the end of this section).

• Transferred meticulously transformed data to various destinations, including databases, data warehouses, flat files, and Excel spreadsheets, ensuring seamless data mapping between incoming and destination fields.

• Integrated Kafka with other components of the Big Data ecosystem, such as Hadoop, Spark, and Flink.

• Designed and implemented effective data models tailored for NoSQL databases, accounting for the specific data and application requirements.

• Collaborated cross-functionally to establish robust data quality best practices and governance frameworks.

• Utilized AWS Lambda and Serverless Architecture to run code without provisioning servers, significantly reducing operational costs and complexity.
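
Illustrative sketch of the PySpark streaming-ingestion pattern referenced above; the Kafka brokers, topic, schema, and S3 locations are hypothetical, and the Spark-Kafka connector is assumed to be available on the cluster:

    # Sketch: Structured Streaming ingestion from Kafka into S3 as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka_to_s3_stream").getOrCreate()

    # Hypothetical message schema for the incoming JSON records.
    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("status", StringType()),
    ])

    stream = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
             .option("subscribe", "orders")
             .option("startingOffsets", "latest")
             .load()
    )

    # Parse the Kafka value payload and flatten it into columns.
    parsed = stream.select(F.from_json(F.col("value").cast("string"), schema).alias("rec")).select("rec.*")

    # Append micro-batches to S3 with checkpointing for fault tolerance.
    query = (
        parsed.writeStream.format("parquet")
              .option("path", "s3://example-bucket/curated/orders/")
              .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
              .outputMode("append")
              .start()
    )
    query.awaitTermination()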

Sr. Big Data Engineer

PNC Bank, Pittsburgh, Pennsylvania Mar 2020 – Dec 2022

• Architected high-throughput data ingestion pipelines using Google Cloud Dataflow to process streaming and batch data, ensuring minimal latency and high resilience.

• Designed and executed large-scale data migrations to Google BigQuery, employing best practices for data transfer and establishing reliable data streams for real-time analytics (a load-job sketch appears at the end of this section).

• Led the adoption of Google BigQuery ML to enable machine learning capabilities directly within the data warehouse, allowing for the creation, training, and deployment of ML models on large datasets.

• Developed streamlined Terraform templates for agile infrastructure provisioning within GCP.

• Proficiently managed Hadoop clusters on GCP using Dataproc and associated tools.

• Efficiently processed real-time data via Google Cloud Pub/Sub and Dataflow, constructing optimized data pipelines and coordinating executions with Apache Airflow.

• Orchestrated robust data storage in Google Cloud Storage and BigQuery, ensuring resilient data warehousing solutions. Skillfully managed compute instances on GCP and containerized applications using Google Kubernetes Engine (GKE).

• Enhanced data processing efficacy by leveraging Google Cloud Data Prep and Dataflow, thereby elevating system performance. Ensured data integrity and security by implementing meticulous governance measures using Google Cloud IAM.

• Implemented security measures for Kafka clusters, including encryption, authentication, and authorization.

• Leveraged Google Cloud Dataproc for managing Hadoop/Spark clusters, optimizing resource usage, and reducing operational costs with preemptible VMs and autoscaling.

• Collaborated closely with data scientists and analysts to build comprehensive data pipelines and conduct in-depth analysis within GCP's ecosystem.

• Utilized Databricks for efficient ETL pipeline development, ensuring streamlined data processing workflows.

• Developed and maintained ETL pipelines using Apache Spark and Python on Google Cloud Platform (GCP) for large-scale data processing and analysis.

• Orchestrated event-driven architectures using Google Cloud Functions to execute serverless functions in response to cloud events, reducing operational complexity.

• Managed data flow through Kafka topics and partitions, addressing bottlenecks and congestion.

• Effectively harnessed a range of GCP services, including Compute Engine, Cloud Load Balancing, Storage, SQL, Stackdriver Monitoring, and Deployment Manager.

• Configured precise GCP firewall rules to manage VM instance traffic, optimizing content delivery and reducing latency through GCP Cloud CDN.

• Led the migration of enterprise-grade applications to GCP, significantly enhancing scalability and reliability while ensuring seamless integration with existing on-premises solutions.

• Fostered innovation by setting up collaborative environments using GCP’s Cloud Source Repositories and integrated tooling for source code management.

• Designed and implemented Snowflake data warehouses, developed data models, and crafted dynamic data pipelines for real-time processing across diverse data sources.

• Employed natural language toolkits for data processing, analyzing keywords, and generating word clouds.

• Mentored budding engineers in ETL best practices, Snowflake, and Python, while managing and evolving ETL pipelines using Apache Spark and Python on Google Cloud Platform (GCP).

• Automated data conversion and ETL pipeline generation using Python scripts, optimizing performance and scalability through Spark, YARN, and GCP Dataflow.

• Contributed expertise across onshore and offshore development models, architecting scalable data solutions that integrate Snowflake, Oracle, GCP, and various other components.

• Collaborated extensively with data scientists and analysts on diverse projects, including fraud detection, risk assessment, and customer segmentation, actively contributing to recovery plan development and implementation.
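
Illustrative sketch of a batch load into BigQuery of the kind described above, using the google-cloud-bigquery client; the project, dataset, table, column, and Cloud Storage paths are hypothetical placeholders:

    # Sketch: load Parquet files from Cloud Storage into BigQuery, then run a quick row-count check.
    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")

    table_id = "example-project.analytics.transactions"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Kick off the load job and block until it finishes.
    load_job = client.load_table_from_uri(
        "gs://example-bucket/curated/transactions/*.parquet",
        table_id,
        job_config=job_config,
    )
    load_job.result()

    # Validate the load with a simple count over today's records (load_ts is a hypothetical column).
    rows = client.query(
        "SELECT COUNT(*) AS n FROM `example-project.analytics.transactions` "
        "WHERE DATE(load_ts) = CURRENT_DATE()"
    ).result()
    print("rows loaded today:", next(iter(rows)).n)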

Sr. Cloud Data Engineer

Progressive Corporation, Mayfield Village, Ohio Feb 2018 – Feb 2020

• Built modern web apps with microservices by integrating diverse Azure technologies (Cosmos DB, App Insights, Blob Storage, API Management, and Functions).

• Successfully implemented data transformations with Azure SQL Data Warehouse (SQL processing) and Azure Databricks (Python and Scala/Java processing), encompassing critical tasks like cleaning, normalization, and standardization.

• Leveraged Azure Databricks to efficiently process and store petabytes of data from diverse sources, seamlessly integrating Azure SQL Database, external databases, Spark RDDs, and Azure Blob Storage.

• Within Azure HDInsight, meticulously modeled Hive partitions and implemented caching techniques in Azure Databricks to optimize RDD performance and accelerate data processing.

• Extracted and profiled data from diverse formats like Excel spreadsheets, flat files, XML, relational databases, and data warehouses, using SQL Server Integration Services (SSIS) effectively.

• Engineered complex, maintainable Python and Scala code within Azure Databricks, enabling seamless data processing and analytics. Transitioned MapReduce jobs to PySpark in Azure HDInsight for further performance gains.

• Orchestrated ETL workflows with UNIX scripts in Azure Blob Storage and Linux Virtual Machines, automating processes and ensuring error handling and file operations.

• Created comprehensive documentation for Kafka configurations, processes, and best practices.

• Utilized Google Bigtable for high-throughput and scalable NoSQL database solutions, enabling rapid read and write operations for large datasets.

• Planned and architected Azure Cloud environments for migrated IaaS VMs and PaaS role instances, ensuring accessibility and scalability.

• Developed Apache Airflow modules for seamless workflow management across cloud services (a minimal DAG sketch follows this section).

• Communicated effectively with team members and management about Kafka-related initiatives, challenges, and solutions.

• Optimized query performance and empowered data-driven decision making through Snowflake's scalable data solutions and SQL-based analysis.
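
Illustrative sketch of an Apache Airflow DAG of the kind referenced above for coordinating daily ETL steps; the task bodies, IDs, and schedule are placeholders rather than details of a production workflow:

    # Sketch: a minimal Airflow DAG wiring extract -> transform -> load tasks.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract(**context):
        # Placeholder: pull the previous day's records from a source system.
        pass


    def transform(**context):
        # Placeholder: clean, normalize, and standardize the extracted records.
        pass


    def load(**context):
        # Placeholder: write curated output to the warehouse.
        pass


    with DAG(
        dag_id="daily_etl_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        t_extract >> t_transform >> t_load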

Data Engineer

The Ford Motor Company, Dearborn, MI Jan 2015 – Jan 2018

• Utilized PySpark to efficiently ingest structured and unstructured financial data from various sources.

• Utilized Python libraries, including NumPy, SciPy, Scikit-Learn, and Pandas, within the GCP ecosystem to enhance the analytics components of the data analysis framework, along with PySpark data frames and RDDs.

• Built scalable and high-performance data processing applications by leveraging PySpark libraries.

• Utilized Python and Scala for data manipulation, analysis, and modeling tasks within Spark and other data processing frameworks.

• Designed and optimized data models and schemas using Hive for querying structured data stored in the Hadoop Distributed File System (HDFS) (see the Spark SQL sketch at the end of this section).

• Implemented real-time data ingestion and processing pipelines with Apache Kafka, ensuring reliable and scalable data streaming capabilities.

• Integrated Apache HBase for storing and accessing large-scale semi-structured data, optimizing data retrieval performance for analytical queries.

• Orchestrated workflow scheduling and automation using Apache Oozie, ensuring timely execution of data processing tasks and job dependencies.

• Leveraged Impala for interactive querying and analysis of data stored in Hadoop clusters, enabling ad-hoc analytics and exploration.

• Collaborated with business intelligence teams to develop data visualization dashboards and reports using Power BI, providing actionable insights to stakeholders.

• Deployed and configured Kafka clusters, ensuring optimal performance and resource utilization.

• Contributed to the development of data engineering best practices and standards, including code review, documentation, and performance optimization techniques.

• Participated in cross-functional teams to design and implement data solutions, ensuring alignment with business requirements and scalability.

• Conducted comprehensive assessments of existing on-premises data infrastructure, implementing CI/CD pipelines via Jenkins for streamlined ETL code deployment and version control using Git.

• Designed Spark SQL queries for real-time financial data extraction, integrated Spark Streaming for Kafka data consumption, and configured fault-tolerant Kafka clusters.

• Ensured data quality through automated testing in CI/CD pipelines and implemented ETL pipeline automation using Jenkins and Git.
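
Illustrative sketch of the Hive/Spark SQL querying pattern referenced above; the database, table, and column names are hypothetical:

    # Sketch: aggregate a date-partitioned Hive table with Spark SQL and persist the result.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive_query_example")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Daily totals over the last week from a partitioned Hive table on HDFS.
    daily_totals = spark.sql("""
        SELECT txn_date, product_line, SUM(amount) AS total_amount
        FROM finance.transactions
        WHERE txn_date >= date_sub(current_date(), 7)
        GROUP BY txn_date, product_line
        ORDER BY txn_date
    """)

    daily_totals.write.mode("overwrite").parquet("hdfs:///curated/daily_totals/")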

Hadoop Administrator

Marathon Petroleum, Findlay, Ohio Jan 2012 – Dec 2014

• Managed Apache Hadoop clusters at Marathon Petroleum, configuring and maintaining components such as Hive, HBase, Zookeeper, and Sqoop for robust application development and data governance.

• Seamlessly managed Cloudera CDH & Hortonworks HDP installations and upgrades across various environments at Marathon Petroleum, ensuring uninterrupted operations.

• Managed commissioning, decommissioning, and recovery of Data Nodes at Marathon Petroleum, orchestrating multiple Hive Jobs using the Oozie workflow engine.

• Improved MapReduce job execution efficiency through targeted monitoring and optimization strategies at Marathon Petroleum, enhancing performance at a granular level.

• Implemented robust High Availability protocols and automated failover mechanisms with Zookeeper at Marathon Petroleum, eliminating single points of failure for NameNodes.

• Collaborated effectively with data analysts at Marathon Petroleum, providing innovative solutions for analysis tasks, and conducting comprehensive reviews of Hadoop and Hive log files.

• Enabled efficient data exchange between RDBMS, HDFS, and Hive using Sqoop at Marathon Petroleum, facilitating seamless data transfer protocols.

• Integrated CDH clusters seamlessly with Active Directory at Marathon Petroleum, strengthening authentication with Kerberos for enhanced security.

• Maximized fault tolerance by configuring and utilizing Kafka brokers for data writing and leveraged Spark for faster processing when needed at Marathon Petroleum.

• Utilized Hive for dynamic table creation, data loading, and UDF development at Marathon Petroleum. Collaborated with Linux server admin teams to optimize server hardware and OS functionality.

• Proposed and implemented continuous improvements to enhance the efficiency and reliability of Kafka clusters at Marathon Petroleum.

• Proactively partnered with application teams at Marathon Petroleum to facilitate OS updates and execute Hadoop version upgrades. Streamlined workflows using shell scripts for seamless data extraction from diverse databases into Hadoop (see the import sketch at the end of this section).

• Managed and worked with NoSQL databases such as MongoDB, Cassandra, and HBase at Marathon Petroleum.
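
Illustrative sketch of the kind of scripted RDBMS-to-Hadoop extraction referenced above, wrapping Sqoop imports in Python; the JDBC URL, credentials path, and table names are hypothetical:

    # Sketch: loop over source tables and run Sqoop imports into HDFS.
    import subprocess

    TABLES = ["refinery_runs", "shipments"]  # hypothetical source tables

    for table in TABLES:
        cmd = [
            "sqoop", "import",
            "--connect", "jdbc:oracle:thin:@//db-host:1521/ORCL",
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop_pwd",
            "--table", table.upper(),
            "--target-dir", f"/data/raw/{table}",
            "--num-mappers", "4",
            "--as-avrodatafile",
        ]
        subprocess.run(cmd, check=True)  # stop the run if any import fails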

Data Analyst

Experian, Costa Mesa, CA Jan 2011 – Dec 2011

• Collected and integrated data from various sources, ensuring accuracy and consistency.

• Managed databases and data systems, including fixing coding errors and other data-related problems.

• Performed detailed data analysis to identify trends, patterns, and insights that supported decision-making processes.

• Generated regular reports and ad-hoc analyses for management, highlighting key findings and recommendations.

• Developed statistical models to forecast future trends and provide insights that supported business objectives (a brief modeling sketch follows this section).

• Used predictive modeling to improve and optimize customer experiences, revenue generation, ad targeting, and other business outcomes.

• Worked closely with various departments, including marketing, sales, and product development teams, to support their data analysis needs and inform their strategies.

• Collaborated with stakeholders to understand their requirements and deliver data-driven insights and solutions.

• Created and maintained dashboards and visualizations to help stakeholders quickly understand the data and its implications.

• Utilized tools like Tableau and Power BI to make data accessible and understandable to non-technical users.

• Ensured compliance with data governance and privacy policies, especially given the sensitive nature of credit data.

• Implemented and maintained quality control measures to guarantee the integrity of the data used for analysis.

• Stayed current with the latest trends, tools, and technologies in data analytics and the finance/credit industry.

• Explored innovative ways to leverage data for competitive advantage and operational efficiency.

• Conducted market research and analyzed customer behavior to identify opportunities for business growth.

• Worked with customer feedback and transaction data to improve product offerings and customer satisfaction.
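
Illustrative sketch of the kind of predictive-modeling workflow referenced above, using pandas and scikit-learn; the input file, feature columns, and target are hypothetical placeholders:

    # Sketch: fit and evaluate a simple classification model on a tabular extract.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    accounts = pd.read_csv("accounts_sample.csv")  # hypothetical extract
    features = accounts[["utilization", "tenure_months", "late_payments_12m"]]
    target = accounts["defaulted_next_quarter"]

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.25, random_state=42, stratify=target
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))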

Education

Master of Science, Business Analytics

William & Mary, Mason School of Business, Williamsburg, VA

Bachelor of Procurement & Supply Chain Management

Makerere University Business School, Kampala, Uganda


