Name: Syed Sohib
Contact: +1-312-***-****
Email: ***********@*****.***
LinkedIn: linkedin.com/in/syed-s-62ba6a227
Professional Summary:
7+ years of hands-on experience as a Data Engineer, focused on architecting and implementing scalable distributed data systems and pipelines.
In-depth expertise across cloud platforms (AWS, Azure, GCP), utilizing services like EC2, S3, Lambda, Glue, Redshift, Azure Data Factory, Synapse, and BigQuery for end-to-end data solutions.
Specialized in designing and fine-tuning ETL/ELT workflows using SSIS, ADF, Glue, DBT, and Python for both batch and real-time processing.
Experienced in dimensional modeling using Star Schema, Snowflake Schema, and Kimball methodologies tailored for both OLTP and OLAP environments.
Strong command of SQL, Python, and Scala for building robust ETL solutions and optimizing complex data queries.
Adept at crafting SQL scripts, stored procedures, and Python-based data integration logic to support transformation pipelines.
Practical knowledge of big data tools including Hadoop, HDFS, Spark, Kafka, Hive, Sqoop, and MapReduce for high-volume data ingestion and processing.
Built streaming data pipelines leveraging Spark Streaming, Kafka, and Azure Event Hubs to support real-time analytics requirements.
Solid background in cloud data warehousing and migrations, including performance-tuned Snowflake implementations.
Successfully led end-to-end migration of legacy systems to Snowflake, enhancing data accessibility and performance with minimal downtime.
Integrated ThoughtSpot with modern data warehouses to enable fast, self-service analytics across departments.
Developed advanced dashboards and insights in ThoughtSpot for executive and operational decision-making.
Knowledgeable in NoSQL systems like Cassandra, MongoDB, and HBase for storing and retrieving high-velocity, unstructured data.
Proficient in data visualization with Power BI, Tableau, and ThoughtSpot to create compelling dashboards and data stories.
Experienced in tuning Spark applications, Hive scripts, and SSIS workflows to ensure peak data processing performance.
Designed pipelines to ingest and harmonize data from RDBMS, APIs, flat files, and cloud data stores into unified reporting layers.
Extensive ETL development involving sourcing, mapping, transforming, and loading data from systems like Oracle, Teradata, MSSQL, flat files, and Excel.
Built and supported production-grade ETL pipelines using Informatica PowerCenter, SSIS, SSAS, and SSRS.
Designed real-time ingestion pipelines using Kafka and AWS Kinesis to capture and process streaming data.
Managed workflow orchestration and dependency handling using Apache Airflow, Oozie, and Azure Synapse Pipelines.
Knowledgeable in data compliance standards such as GDPR and HIPAA for ensuring secure and regulated data processing.
Integrated external systems via REST and SOAP APIs, and used MuleSoft to facilitate enterprise data exchange.
Skilled in Git, Bitbucket, and project tracking tools like Jira, Confluence, and Rally for version control and agile collaboration.
Led cloud infrastructure provisioning using tools like AWS CloudFormation and Azure Resource Manager templates.
Built data lakes and warehouses using Delta Lake on Databricks and Snowflake, ensuring scalability and cost-efficiency.
Created impactful visual reports and dashboards using Power BI, Tableau, and QlikView for various business functions.
Technical Skills:
Data Storage & Databases
AWS S3, ADLS, HDFS, MySQL, MongoDB, AWS Redshift, Snowflake, GCP Cloud Storage, SQL Server
Cloud Platform Services
Azure Data Factory (ADF), Azure Databricks, Azure Data Lake, Azure Storage, Azure SQL, Azure Data Warehouse, Azure Synapse, Azure Cosmos DB, Azure HDInsight, GCP BigQuery, Dataflow, Pub/Sub, Cloud Storage, Dataproc, Cloud Composer, Cloud Functions, Cloud SQL, AlloyDB, AWS (EC2, ECS, Lambda, RDS, S3, etc.), AWS SageMaker, Kinesis, Athena, Glue, EMR, CloudWatch, DMS, Aurora, Apache Airflow
Languages
PowerShell Scripting, Python, Perl, Scala, Java, PL/SQL
Databases & NoSQL Databases
Oracle, MySQL, Microsoft SQL Server, Snowflake, HBase, Cassandra, MongoDB, DynamoDB
RDBMS
Teradata, Oracle PL/SQL, MS SQL Server, MySQL, PostgreSQL, DB2
Operating Systems
Windows, Ubuntu, Linux, UNIX
Development Methodologies
Agile/Scrum, Waterfall.
IDEs & Development Tools
Eclipse, NetBeans, GitHub, Jenkins, Maven, IntelliJ, Visual Studio, Jupyter Notebook
Data Modeling
Star Schema, Snowflake Schema, OLAP/OLTP systems.
Data Visualization
Tableau, Python (Matplotlib, Seaborn), R (ggplot2), Power BI, QlikView, D3.js
ETL Tools
ETL Process, Sqoop, SSIS (SQL Server Integration Services), Ab Initio, Informatica, DBT
Version Control & DevOps Tools
Git, GitHub, Jira, Bitbucket, Jenkins
Education:
• Master of Science in Information Technology Management, Indiana Wesleyan University.
• Bachelor's in Computer Science, Indiana Wesleyan University.
Professional Experience:
Client: Sony Interactive Entertainment Aug 2023 - Present
Role: Sr. Data Analyst / Engineer
Project Details: Sony Interactive Entertainment (SIE) is the video game and digital entertainment subsidiary of Sony Group Corporation, best known as the company behind the PlayStation brand.
Responsibilities:
Developed data cleansing and migration strategies, created source-to-target mappings, and designed data profiling, data validation, and ETL jobs in DataStage.
Supported Business Intelligence report development efforts by working closely with MicroStrategy, Teradata, and ETL teams.
Implemented and optimized Azure services, such as Azure App Service, Azure Functions, Azure SQL Database, and Azure Cosmos DB, to support application workloads.
Conducted data sharing and collaboration in Snowflake, enabling secure data sharing with external partners and stakeholders.
Developed ingestion pipelines for third-party and internal customer data sources using FTP and S3, migrating ETL logic to PySpark and Hive over a unified cloud-based lakehouse.
Developed robust dashboards and scorecards for sales and operations teams using Tableau and Excel.
Built Excel-based templates for data cleaning and reporting, improving team efficiency.
Created custom Java UDFs for Hive transformations and designed Apache Flink jobs to process high-frequency event streams from Kafka.
Built metadata repositories and data catalogs using AWS Glue Crawlers to support lineage, discoverability, and governance.
Utilized AWS Redshift for structured data transformation and reporting, building tables and executing SQL-based ETL processes for risk analytics.
Migrated compliance datasets from SQL Server to Snowflake, using Python and SnowSQL for automation.
Created and deployed CI/CD pipelines using Azure DevOps, managing data model releases and infrastructure-as-code deployments across staging and production.
Built and managed scalable data models in DynamoDB to support real-time profile enrichment and risk modeling.
Developed interactive dashboards in QlikView and Qlik Sense for risk monitoring and fraud event visualization.
Applied DBT for modular data modeling, testing, documentation, and maintaining transformation logic across Snowflake layers.
Leveraged Databricks for real-time and streaming analytics, integrating fraud scoring and ML model inference into Spark-based pipelines.
Created on-demand data tables in S3 using Lambda and Glue, incorporating PySpark logic for fraud-related batch extracts.
Collaborated with HR and compliance teams to configure Oracle HCM Cloud for performance evaluations and internal risk audits.
Developed Scala-based transformations and UDFs in Spark, integrating batch insights into RDBMS systems via Sqoop.
Monitored and diagnosed performance issues in Cassandra clusters, leveraging diagnostic tools for performance tuning and cluster optimization.
Built automated workflows to load transactional data from S3 to Snowflake using Tasks, Streams, Pipes, and custom stored procedures (see the illustrative sketch at the end of this engagement).
Engineered SSIS packages, stored procedures, triggers, and views to support fraud case management and reporting tools.
Implemented Spark pipelines to aggregate and transform transactional data prior to storing in HBase, enabling both batch and real-time fraud detection jobs.
Leveraged IAM, EC2, S3, Glue, Lambda, and EMR for secure, scalable, and efficient cloud infrastructure and data engineering solutions.
Monitored Databricks job performance and health using Airflow, building custom logging and alerting layers.
Automated ETL scheduling using shell scripts and Python, ensuring SLA adherence and reducing manual intervention.
Containerized microservices for fraud detection modules using Kubernetes, migrating legacy VM-based ETL apps to a cloud-native architecture.
Delivered business-critical insights through interactive dashboards and reports in Power BI, Tableau, and ThoughtSpot.
Designed and orchestrated fraud analytics workflows in Cloudera Hadoop using Oozie with MapReduce, Pig, Hive, and Sqoop.
Environment: Azure Synapse, Azure Data Factory, Azure DevOps, AWS S3, Glue, Lambda, EMR, Redshift, SnapLogic, Snowflake, SQL Server, PySpark, Hive, Scala, Spark, Kafka, Flink, HBase, SSIS, DBT, ThoughtSpot, Power BI, Tableau, QlikView, Cassandra, DynamoDB, Kubernetes, Oracle HCM Cloud, Sqoop, Oozie, MapReduce, Git, Confluence, Airflow, Shell Scripting.
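Illustrative sketch (hypothetical, not from the SIE codebase): a minimal Python loader for the S3-to-Snowflake workflow noted above, assuming the snowflake-connector-python package; the account, stage, stream, and table names are placeholders.

import snowflake.connector  # assumes: pip install snowflake-connector-python

# Placeholder connection details; real values would come from a secrets manager.
conn = snowflake.connector.connect(
    account="xy12345", user="etl_user", password="***",
    warehouse="ETL_WH", database="FRAUD_DB", schema="RAW",
)
try:
    cur = conn.cursor()
    # Copy newly landed S3 files (exposed through an external stage) into the raw table.
    cur.execute(
        "COPY INTO RAW.TRANSACTIONS FROM @TXN_STAGE "
        "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE "
        "ON_ERROR = 'CONTINUE'"
    )
    # Merge the change records captured by a stream into the curated layer.
    cur.execute("""
        MERGE INTO CURATED.TRANSACTIONS t
        USING RAW.TRANSACTIONS_STREAM s ON t.txn_id = s.txn_id
        WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = CURRENT_TIMESTAMP()
        WHEN NOT MATCHED THEN INSERT (txn_id, amount, updated_at)
            VALUES (s.txn_id, s.amount, CURRENT_TIMESTAMP())
    """)
finally:
    conn.close()

In production this logic would typically live inside a Snowflake Task triggered on the stream, with Snowpipe handling continuous S3 ingestion.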
Client: Mayo Clinic, Glenview, IL Nov 2020 - July 2023
Role: Sr. Data Engineer
Project Details: Mayo Clinic is a world-renowned nonprofit academic medical center headquartered in Rochester, Minnesota. It is recognized globally for its patient-centered care, medical research, and education.
Responsibilities:
Designed and implemented statistical reporting processes for regular data collection and clinical data analysis.
Applied advanced SQL skills in Oracle and T-SQL to analyze and validate SSIS ETL and data warehouse processes.
Extracted data from different sources, converted it into DataFrames, performed transformations based on feed requirements, and ran the application so that the resulting data was stored in Google Cloud Storage (GCS) buckets.
Migrated data from on-premises SQL Server to cloud databases, including Azure Synapse Analytics and Azure SQL DB.
Designed and implemented data movement pipelines from on-premises SQL Server to Azure SQL Database using ADF, API Gateway, SSIS, Talend, and custom .NET and Python code.
Developed PL/SQL triggers and master tables for seamless generation of unique identifiers across regulatory datasets.
Integrated Azure Cosmos DB with Azure Functions, Logic Apps, and Event Grid for distributed event-driven data processing.
Built and deployed Spark applications in PySpark and Spark SQL to analyze and aggregate structured and semi-structured data from multiple file formats.
Integrated Qlik Replicate with Azure Delta Lake, enabling real-time replication of key financial datasets for downstream analytics.
Designed ETL pipelines and transformations to load, cleanse, and prepare data for regulatory and financial reporting using Snowflake and advanced SQL.
Orchestrated a one-time bulk data migration of multistate-level customer and transaction data from SQL Server to Snowflake using Python and SnowSQL.
Developed Power BI dashboards connected to multiple sources including Azure SQL and Oracle for business insight and fraud intelligence.
Proposed architecture enhancements focused on Azure cost optimization and infrastructure right-sizing to align with FinOps guidelines.
Automated ETL workflows using Apache Airflow and shell scripting, ensuring reliable execution of daily data pipelines (see the illustrative DAG sketch at the end of this engagement).
Performed data cleansing, feature engineering, and scaling using Pandas and NumPy in Python for risk model readiness.
Collaborated with DevOps and solution architects to enforce standards and best practices for deployment and infrastructure automation.
Supported and maintained secure and scalable data environments using Azure SQL Data Warehouse, Cosmos DB, and Azure Analysis Services.
Designed and implemented monitoring and alerting strategies for data pipeline health using Airflow and Azure-native tools.
Built and optimized SSIS packages to support legacy system integrations and assist in phased migration to modern cloud infrastructure.
Environment: Azure Data Factory, Azure SQL Database, Azure Synapse, Azure Analysis Services, ADLS Gen2, Cosmos DB, Databricks, Spark SQL, PySpark, Python, Scala, SSIS, PL/SQL, Snowflake, Qlik Replicate, Power BI, Airflow, Kafka, Shell Scripting, API Gateway, Talend, .NET, Pandas, NumPy
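Illustrative sketch (hypothetical, not from the Mayo Clinic codebase): a minimal Airflow DAG for the daily ETL automation noted above, assuming Airflow 2.x; the DAG name, schedule, and task callables are placeholders.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Placeholder: pull incremental rows from the on-prem SQL Server source.
    pass

def transform(**context):
    # Placeholder: cleanse and reshape the extract with pandas before staging.
    pass

def load(**context):
    # Placeholder: bulk-load the staged data into Snowflake / Azure Synapse.
    pass

with DAG(
    dag_id="daily_clinical_etl",                  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load

Retries plus a linear extract >> transform >> load dependency chain are what provide the SLA reliability described above.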
Client: Independent Health, Buffalo, NY May 2018 - Oct 2020
Role: Data Engineer
Project Details: Independent Health is a not-for-profit health plan serving Western New York, offering a full range of health-related products and services including commercial, Medicare, and Medicaid plans.
Responsibilities:
Designed and maintained data integration programs in Hadoop and RDBMS environments with both RDBMS and NoSQL data stores for data access and analysis.
Developed SQL-based data warehouse environments and created multiple custom database applications for data archiving, analysis, and reporting purposes.
Collaborated with the Data Warehouse and Business Intelligence Architecture teams to understand repository objects that support business requirements and processes.
Prepared UAT materials and test cases covering the steps involved in UAT and ensuring proper coverage of requirements.
Designed and built scalable data pipelines using Apache Spark, Python, and Scala to process structured and semi-structured logistics data from RDBMS and streaming sources.
Architected and deployed AWS-based data workflows, loading structured data into S3 using Lambda, Glue, and PySpark, then filtering it with Elasticsearch and storing it in Hive external tables.
Migrated legacy on-premises data processing applications to AWS, leveraging EC2 and S3 for scalable compute and storage.
Developed real-time analytics pipelines using Google Cloud Dataflow, BigQuery, and Cloud Functions.
Designed cost-efficient storage and high-throughput systems using GCP Cloud Storage, Cloud Spanner, and Bigtable for multi-terabyte datasets.
Led end-to-end migration of a legacy enterprise data warehouse to BigQuery, enhancing performance and reducing maintenance.
Built IoT-based real-time pipelines using Pub/Sub and Cloud Functions to process sensor and shipment data streams.
Created Spark-based applications using Data Frames, RDDs, and MLlib for predictive analytics on package routing and delay patterns.
Tuned and optimized Spark jobs by configuring batch intervals, memory management, and parallelism for real-time and batch jobs.
Wrote custom Kafka consumers in Python to ingest logistics and vehicle telemetry data for fleet optimization (see the illustrative consumer sketch at the end of this engagement).
Implemented and maintained Amazon Redshift for centralized data warehousing and cross-region analytics.
Designed data models and column families in Cassandra; ingested and transformed data from RDBMS and exported transformed datasets to NoSQL stores.
Created and managed MongoDB instances for high-speed, flexible document-based storage of operational metrics.
Administered Snowflake environments, monitored performance, and scaled resources to meet increased data and concurrency demands.
Utilized core Hadoop components like HDFS, YARN, Sqoop, Hive, and MapReduce for ingestion, transformation, and reporting use cases.
Imported and exported large data volumes between HDFS and RDBMS using Sqoop, ensuring minimal impact to production environments.
Developed MapReduce logic and PySpark transformations for large-scale batch processing using NumPy for efficient numerical operations.
Created executive and operational dashboards using Power BI, Tableau, and extracted insights from AWS Athena queries against S3 datasets.
Enforced security best practices and ensured compliance with standards such as GDPR and HIPAA for sensitive logistics and customer data.
Implemented CI/CD workflows with Jenkins for continuous integration and streamlined deployment processes.
Collaborated in Agile teams, attending daily scrums, sprint planning, and retrospectives to deliver production-ready features on time.
Environment: Python, Scala, PySpark, Spark Streaming, Spark SQL, Kafka, AWS EMR, Lambda, S3, Glue, Redshift, EC2, Elasticsearch, Hive, HDFS, HBase, Cassandra, MongoDB, Sqoop, MapReduce, Oozie, Jenkins, NumPy, Power BI, Tableau, AWS Athena, Google BigQuery, GCP Cloud Dataflow, Pub/Sub, Cloud Spanner, Bigtable, Git, Agile Methodologies, GDPR, HIPAA
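Illustrative sketch (hypothetical, not from the Independent Health codebase): a minimal Python Kafka consumer for the vehicle telemetry ingestion noted above, assuming the kafka-python package; the topic, brokers, and message fields are placeholders.

import json
from kafka import KafkaConsumer  # assumes: pip install kafka-python

consumer = KafkaConsumer(
    "fleet.telemetry",                      # hypothetical topic name
    bootstrap_servers=["broker1:9092"],     # hypothetical broker list
    group_id="fleet-optimizer",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Placeholder routing rule: flag vehicles reporting sustained over-temperature readings.
    if event.get("engine_temp_c", 0) > 110:
        print(f"ALERT vehicle={event.get('vehicle_id')} temp={event['engine_temp_c']}")

A consumer group like this lets multiple worker instances share partitions, which is how the ingestion scales with fleet size.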