
Big Data Engineer

Location:
Washington, DC
Salary:
150000
Posted:
July 05, 2025

Samuel Allan

Dallas, TX ***** | 551-***-**** | **************@*****.*** | https://www.linkedin.com/in/samuel-allan-b8656634a/

CERTIFIED AZURE CLOUD BIG DATA ENGINEER/ARCHITECT

Accomplished IT data manager with 11+ years of success in big data architecture, data engineering, data analysis, and database administration across multiple companies and sectors. Proven track record of building, leading, and developing teams to drive operational efficiency.

Critical thinker and problem-solver, highly skilled in strategic planning with a strong focus on business intelligence. Adept at gathering and analyzing business requirements, creating enterprise data architecture, and building scalable, automated data pipelines.

Highly proficient in partnering with data analysts and data scientists to develop dashboards and models from the right datasets, helping stakeholders make informed, data-driven decisions while steering business planning. A solid history of meeting facilitation, presentation, and project management.

Progressive thought leader, bilingual in English and Spanish, effective in collaborating cross-functionally with product managers, data scientists, business intelligence teams, and SMEs to design and launch enterprise data-driven strategies companywide.

CORE EXPERTISE

Big Data Architecture, Big Data Engineering, Data Governance, Microsoft Fabric, Python, SQL, Spark Scala, PySpark, Spark SQL, Prompt Engineering, Data Analysis, Data Science, Machine Learning, Azure Stack, Azure Databricks, Relational Databases, Dimensional Data Modeling, Data Warehouse, Data Lakehouse, Medallion Architecture, Databricks Unity Catalog, Microsoft Azure Purview, PyApacheAtlas SDK, Delta Live Tables, ETL/ELT Pipelines, Kusto, CData Sync, Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage Gen2, Azure Cosmos DB, Azure Functions, Synapse Link, OLTP and OLAP Star Schema, Lucidchart, dbt, Fivetran, Snowflake Virtual Data Warehouse, GitOps CI/CD Pipelines, Spark Auto Loader (cloudFiles), Erwin Data Modeler, Docker, Business Requirements and KPI Gathering, Azure DevOps CI/CD Pipelines, PowerPoint Presentations, Azure Container Instances, Data.world, Azure Virtual Machines, Windows, Linux, ADF Scala Expressions, ADF Triggers, ADF Mapping Data Flows, Self-Hosted Integration Runtime, Managed VNet Integration Runtime, Azure Logic Apps, ADLS Gen2, Azure Key Vault, REST API, SOAP API, OData API, C#, SSIS, SSAS, SSRS, On-Premises Microsoft SQL Server, Azure Active Directory, Change Data Capture (CDC), Azure Data Factory Metadata-Driven Copy, Incremental Change Data Capture (CDC), GCP, AWS, Data Sources (On-Premises SQL Server and Oracle Database, SFTP, DealCloud, iLevel, SAP Ariba, ECC, S/4HANA, SuccessFactors, SOAP, OData & REST API), ChatGPT, Copilot, AutoSys Workload Automation, and Power BI.

PROFESSIONAL EXPERIENCE

TREDENCE INC, San Jose, California Dec 2020 – Apr 2025

Title: Lead Sr. Azure Cloud Big Data Engineer/Azure Cloud Big Data Solutions Architect

Project Sector: Energy Utility, Retail, Transportation, Finance, Business, Healthcare, Oil & Gas Industries.

§Configured a managed virtual network integration runtime in Azure Data Factory that uses a private endpoint to securely connect to data stores and establish a private link to Azure resources.

§Initiated batch and streaming processes for Delta Live Tables. Implemented data quality SQL constraints within DLT. Established Delta Live Tables Pipelines and connected them to the Unity Catalog in a DLT Job creation, facilitating the visualization of directed acyclic graphs (DAGs). Utilized Databricks SQL to analyze delta tables in the Lakehouse. Set up Unity Catalog as a cohesive governance solution for the Databricks workspace and securely shared Delta Lake data with consumers via Delta Sharing.
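
For illustration, a minimal Python sketch of the Delta Live Tables pattern described above (a streaming bronze table plus a SQL-style data quality expectation); the source path, table names, and columns are placeholders rather than actual project values.

import dlt
from pyspark.sql import functions as F

# Hypothetical ADLS landing path for the raw files.
RAW_PATH = "abfss://raw@storageaccount.dfs.core.windows.net/orders/"

@dlt.table(comment="Raw orders ingested as a streaming bronze table.")
def orders_bronze():
    # spark is provided by the DLT pipeline runtime.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(RAW_PATH)
    )

@dlt.table(comment="Cleansed silver table with a data quality constraint applied.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # SQL constraint enforced by DLT
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn("ingested_at", F.current_timestamp())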

§Utilized Auto Loader in Databricks to progressively stream Cloud Files from ADLS, laying the groundwork for bronze, silver, and gold tables in the Lakehouse. Connected ADLS with Unity Catalog metastore and external data sources using the Access Connector for Azure Databricks. Applied Change Data Capture with the "APPLY CHANGES INTO" statement in the DLT streaming pipeline, driving updates to the target silver table. Utilized Python and SQL notebooks, crafting transformation logic in both Spark SQL and PySpark.
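
For illustration, a hedged PySpark sketch of Auto Loader (cloudFiles) streaming from ADLS into a bronze Delta table; the container paths, file format, and table name are assumptions, not the actual project values.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

LANDING_PATH = "abfss://landing@storageaccount.dfs.core.windows.net/events/"
CHECKPOINT = "abfss://bronze@storageaccount.dfs.core.windows.net/_checkpoints/events/"

bronze_stream = (
    spark.readStream.format("cloudFiles")              # Auto Loader source
    .option("cloudFiles.format", "parquet")            # format of the landed files
    .option("cloudFiles.schemaLocation", CHECKPOINT)   # tracks inferred schema and evolution
    .load(LANDING_PATH)
)

(
    bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", CHECKPOINT)
    .trigger(availableNow=True)                        # process newly arrived files, then stop
    .toTable("main.bronze.events")                     # Unity Catalog three-level table name
)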

§Engaged Power BI to delve into Azure Synapse's dedicated SQL pool metadata, devising the Azure Synapse Analyzer Report Dashboard. This was instrumental in enhancing Azure Synapse SQL Dedicated Pool DEV, QA, and PROD performance.

§Extracted finance data from the DealCloud and iLevel REST APIs with the REST connector in ADF. Set pagination rules using QueryParameters and AbsoluteUrl to access subsequent page requests. Analyzed and transformed nested JSON object data from the REST APIs, ensuring correct column mappings and setting the appropriate data types before saving as Parquet files. Harvested data from SFTP locations with the SFTP connector in ADF, implementing fault tolerance to bypass problematic or absent rows. Designed a dedicated SQL pool for the Synapse data warehouse, capitalizing on PolyBase and the COPY INTO SQL statement to populate fact and dimension tables. Set statistics on every column for optimal query performance, as advised by Microsoft, while integrating a clustered columnstore index with the appropriate distribution.
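
To illustrate the pagination behavior configured in the ADF REST connector (following an absolute next-page URL returned by the API), a rough Python sketch of the equivalent logic; the endpoint, field names, and bearer token are hypothetical and not the actual DealCloud or iLevel contract.

import requests

BASE_URL = "https://api.example.com/v1/deals"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}   # placeholder credentials (Key Vault-backed in practice)

def fetch_all_pages(url: str) -> list[dict]:
    """Follow an absolute next-page URL returned in each response body."""
    rows = []
    while url:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload.get("items", []))
        url = payload.get("nextPageUrl")   # the field an AbsoluteUrl pagination rule would point at
    return rows

records = fetch_all_pages(BASE_URL)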

§Designed and spearheaded the implementation of the Medallion Architecture (Bronze, Silver, and Gold layers) within the Azure Databricks Unity Catalog metastore. Utilized tumbling window triggers, alongside LastModifiedDate, in Azure Data Factory copy activities to ensure the transfer of only new and updated files after the full load.

§Utilized Microsoft Fabric OneLake, integrated with ADLS Gen2, to store data in Delta format (Lakehouse). Transformed data using Synapse Data Engineering within notebooks, and subsequently loaded the data into the Microsoft Fabric Synapse Data Warehouse, which inherently utilizes Delta format. Managed workflows and scheduled Spark jobs using Microsoft Fabric Data Factory.

§Used Jira for task scheduling and progress tracking. Capitalized on Azure DevOps for submitting pull requests, overseeing approvals to the main branch to avoid merge conflicts, and finalizing merge requests prior to releasing CI/CD pipelines to the QA and Production environments.

CAPGEMINI, New York, NY May 2017 – Nov 2020

Title: Sr. Azure Cloud Big Data Engineer/Data Modeler

Project Sector: Health Care, Utilities, Food Manufacturing.

§Collaborated with data analysts and data scientists to build data remediation models, implementing data quality rules, safeguarding accuracy, and increasing accessibility to production data.

§Inherited Databricks Delta tables with severe data skew. Used repartition to increase and coalesce to decrease the number of partitions across the Delta tables. Also used OPTIMIZE to produce larger, evenly balanced Parquet files, and Z-ORDER indexing on high-cardinality columns (large numbers of distinct values) to colocate related data in the same set of files, improving performance and query retrieval times.
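
For illustration, a minimal PySpark sketch of the skew-remediation steps above; the table and column names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("sales.transactions")

# Increase parallelism by repartitioning on a better-distributed column...
rebalanced = df.repartition(200, "customer_id")
# ...or reduce the number of small partitions when over-partitioned:
# rebalanced = df.coalesce(64)

rebalanced.write.mode("overwrite").saveAsTable("sales.transactions_rebalanced")

# Compact small files and colocate rows on a high-cardinality column (Databricks Delta).
spark.sql("OPTIMIZE sales.transactions_rebalanced ZORDER BY (customer_id)")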

§Utilized the PySpark machine learning Imputer on Databricks Spark DataFrames to impute missing (null) values. Developed Blob lifecycle management rule-based policies on structured, semi-structured, and unstructured data stored in ADLS by assigning the right tiers among Hot, Cool, Cold, and Archive. Ensured data was protected from both planned and unforeseeable events by selecting the right zone redundancy option.
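
A short sketch of the PySpark Imputer usage described above; the table and column names are illustrative.

from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("silver.meter_readings")   # placeholder source table

# Replace nulls in numeric columns with the column mean (median and mode are also supported).
imputer = Imputer(
    inputCols=["reading_kwh", "temperature_c"],
    outputCols=["reading_kwh_imputed", "temperature_c_imputed"],
    strategy="mean",
)
df_imputed = imputer.fit(df).transform(df)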

§Assessed technologies and engaged with technology vendors in alignment with the enterprise architecture framework of a client, to determine how we can best use their technology cost-effectively.

§Partnered with business divisions and development staff to create a strategy for long-term data platform architecture. Streamlined business process management efforts to support data and application governance efforts.

§Developed Python user-defined functions to read config files/tables. Developed PySpark and Spark SQL scripts to read, explode, and transform JSON semi-structured data. Saved curated data in Parquet and Delta format into the Lakehouse. Used the com.databricks.spark.sqldw format, JDBC drivers, and OAuth 2.0 with a service principal for authentication to save data into the Azure Synapse Analytics dedicated SQL pool data warehouse, using an ADLS container for staging. Also handled business use cases where curated data was saved into Amazon Redshift, Google BigQuery, and Snowflake virtual data warehouses for analytical purposes.
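
A hedged sketch of the Synapse write path mentioned above using the com.databricks.spark.sqldw connector with an ADLS staging container; the JDBC URL, staging path, and table names are placeholders, and the exact service principal configuration varies by workspace.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
curated_df = spark.read.table("gold.fact_sales")   # placeholder curated dataset

synapse_url = (
    "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
    "database=edw;encrypt=true;loginTimeout=30"
)
staging_dir = "abfss://staging@storageaccount.dfs.core.windows.net/synapse-temp"

(
    curated_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", synapse_url)
    .option("tempDir", staging_dir)                  # ADLS container used as the staging area
    .option("enableServicePrincipalAuth", "true")    # OAuth 2.0 service principal auth (assumed setup)
    .option("dbTable", "dbo.FactSales")
    .mode("append")
    .save()
)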

§Created a petabyte-scale data warehouse in an Azure Synapse dedicated SQL pool based on Synapse's massively parallel processing (MPP) architecture, and selected the right distribution among hash, round-robin, and replicated tables to avoid data skew. Developed CREATE TABLE AS SELECT (CTAS) T-SQL statements and copied data into Synapse tables using PolyBase and the COPY INTO T-SQL statement.
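
For context, a hedged sketch of the CTAS pattern with a hash distribution and clustered columnstore index, issued from Python via pyodbc (an assumed client; credentials would come from Key Vault in practice); table and column names are hypothetical.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=edw;"
    "UID=<user>;PWD=<password>"
)

ctas_sql = """
CREATE TABLE dbo.FactSales_New
WITH (
    DISTRIBUTION = HASH(CustomerKey),      -- hash on a well-distributed key to avoid skew
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.stg_FactSales;
"""

cur = conn.cursor()
cur.execute(ctas_sql)
conn.commit()
cur.close()
conn.close()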

§Imported datasets from Azure Synapse Dedicated SQL pool and developed a time series interactive report and dashboard with row-level security in Power BI, after modeling the data to create a star schema.

§Leveraged ADF Metadata-Driven Copy (incremental copy) to copy databases and files into the Landing/Archive zone in ADLS in Parquet format. Leveraged Databricks Auto Loader to incrementally copy newly arriving data into the Bronze layer in Delta format. Leveraged PySpark and Spark SQL to read the datasets into Spark DataFrames, deduplicated and transformed the data with business logic, and developed merge and SCD Type 2 logic before writing with saveAsTable into the Silver layer as Delta tables. Aggregated the refined data and saved it into the Lakehouse Gold layer as Delta tables.
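
A condensed, hedged sketch of the Silver-layer merge logic (only the close-out half of an SCD Type 2 pattern is shown; appending the new versions of changed rows would follow as a second step); table, key, and column names are placeholders.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
updates_df = spark.read.table("bronze.customers_changes")   # placeholder incremental source

silver = DeltaTable.forName(spark, "silver.customers")

(
    silver.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.row_hash <> s.row_hash",   # attributes changed: expire the current row
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .whenNotMatchedInsert(
        values={
            "customer_id": "s.customer_id",
            "row_hash": "s.row_hash",
            "is_current": "true",
            "start_date": "current_date()",
            "end_date": "null",
        },
    )
    .execute()
)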

§Created an Azure Event Hubs namespace, configured the event hub, set partitions, and captured snapshots to ADLS. Leveraged Azure Stream Analytics to stream and analyze real-time input data from Azure Event Hubs and IoT Hub, using lookup reference data from an Azure SQL Database T-SQL SELECT statement, with output to Azure Cosmos DB. Leveraged Azure Synapse Link to run near real-time analytics over operational data in Azure Cosmos DB from an Azure Synapse serverless SQL pool. Configured Apache Kafka producers and consumers for real-time event messaging using Amazon MSK, and leveraged Spark Structured Streaming in Databricks to subscribe and readStream from Kafka topics and writeStream to Delta format.
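
A minimal sketch of the Spark Structured Streaming consumer described above (Kafka topic on Amazon MSK to a Delta sink); broker addresses, topic name, and output paths are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

kafka_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.msk.example.com:9092")   # placeholder MSK brokers
    .option("subscribe", "telemetry-events")                         # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka keys and values arrive as binary; cast to string before downstream parsing.
events = kafka_df.select(
    F.col("key").cast("string"),
    F.col("value").cast("string"),
    "timestamp",
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://bronze@storageaccount.dfs.core.windows.net/_chk/telemetry/")
    .start("abfss://bronze@storageaccount.dfs.core.windows.net/telemetry/")
)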

§Designed and developed dimensional data models for procurement, transportation, and HR department datasets in SAP Ariba, SAP ECC, SAP Blujay, and SAP SuccessFactors using the client's Entity Relationship Diagram (ERD) and Erwin Data Modeler by Quest. Gathered business requirements (KPIs), developed a business matrix from the KPIs, and identified the granularity/grain, facts, and dimensions. Designed conceptual, logical, and physical models with source-to-target mappings, entities, primary keys, foreign keys, surrogate keys, partition columns, and attributes.

§Also helped the client automate their Docker image, run via batch script, to extract Databricks metadata into the data.world collector for data governance. Developed POCs and documented the processes in the Azure DevOps wiki. Made delivery presentations to stakeholders.

§Deployed Fivetran from Snowflake Partner Connect to ingest data in an Amazon S3 bucket into a Snowflake database, transformed the data using SQL and dbt (data build tool) transformations in Fivetran, and pushed to a Git feature branch repository for source code version control. Implemented and automated unit testing in dbt from Snowflake Partner Connect. Led both onshore and offshore teams to migrate Spark applications in the semantic layer into Dev, QA, and PROD using an agile methodology. Partnered with a DevOps engineer to develop Continuous Integration and Continuous Delivery/Deployment (CI/CD) pipelines in Azure DevOps and GitHub GitOps.

§Developed an enterprise data governance framework to deploy a unified data governance solution for a client's data estate. Collaborated with stakeholders, product owners, and managers to implement classification rules and scan the data estate for sensitive PII data due to the exposure of sensitive HR PII data to developers and business users. Ensured data was successfully secured and protected leveraging encryption/decryption operations, dynamic data masking, column-level security, row-level security, transparent data encryption, firewall, and Azure Key Vault for storing secret credentials.

§Leveraged Azure Purview to create a holistic map of a data landscape. Developed a workflow diagram and POC ensuring compliance with data quality, protection, security, and privacy.

§Automated pipeline runs using scheduled and tumbling window triggers, downloaded/configured self-hosted integration runtime to connect ADF to on-premises SQL server, and other legacy sources.

§Configured Azure Logic Apps by connecting it to a company email address and an ADF pipeline to forward error messages in case of pipeline failure. Spearheaded a pipeline deleting subfolders after copying new files automatically from Azure blob storage using delete activity and storage events trigger in ADF.

§Migrated data from On-Premises Legacy Systems (Oracle and Microsoft SQL Server) into Azure Cloud for a company needing to transition to modern cloud technology. Used Azure data lake storage account as landing and staging zones, along with data factory for ELT pipelines before curated data were loaded to Azure Synapse Analytics Dedicated SQL pool, resulting in data being highly available and actionable in Azure cloud for data analysts to consume and create dashboards that assisted stakeholders in making informed data-driven business decisions.

§Secured a HIPAA compliance certification, ingested health care data stored in an FTP location into ADLS and parsed the CCD document in XML file format using ADF.

§Copied data with Change Data Capture from multiple sources, using last modified dates and ADF connectors, into ADLS in Parquet format, and executed merge, upsert, and SCD Type 2 logic in ADF Mapping Data Flows.

ANBLICKS, Dallas, TX Jun 2012 – Apr 2017

Title: SQL Database Administrator/MSBI Data Engineer/Data Analyst

Project Sector: Automotive, Technology, & Commercial Real Estate Industries.

§Integrated data from numerous sources into a SQL data warehouse leveraging SSIS. Maintained the SQL data warehouse with ETL packages. Deployed the packages and data mapping and troubleshot all failed SSIS packages. Implemented MS SQL server security protocol and issued object-level permissions. Inspected notifications and settled problems with corrective action measures.

§Submitted database structures to business management for review of the project structure and documented the database with SQL Doc. Led the deployment of processes, including developing SQL user-defined functions, views, and stored procedures.

§Programmed complex SQL queries and rapidly optimized difficult queries and processes using execution plans. Instituted guidelines for reporting and dashboard structure utilizing SSRS. Actively analyzed data.

EDUCATION & CREDENTIALS

University of Arizona – UAGC, Phoenix, Arizona

Bachelor of Science, Business Information Systems (Continuing)

January 2023 – July 2027

Berkeley College, New York, New York

Bachelor of Science, Computer Science

January 2012 – July 2016

Certifications Available upon Request

Data Analytics (2020) – Colaberry School of Data Science & Analytics, Plano, Texas, USA

Coursera Certification Courses

The Path to Insights: Data Models and Pipelines

Foundations of Business Intelligence

Introduction to Data Analytics in Google Cloud

Data Analysis with Spreadsheets and SQL

Introduction to Data Analytics

Google Prompting Essentials

Certifications & Professional Development:

Microsoft Certified: Azure Data Engineer Associate (DP-203)

AWS Certified Solutions Architect

Microsoft Certified: Azure DevOps Engineer Expert (AZ-400)


