
Data Engineer - SQL Server

Location:
Jersey City, NJ
Posted:
March 11, 2024


Sajeeda Khatoon

Senior Data Engineer

+1-551-***-****

ad39n7@r.postjobfree.com

www.linkedin.com/in/sajeedade

PROFESSIONAL SUMMARY

●Over a decade of experience in data engineering, with deep expertise in Apache Spark, PySpark, Kafka, Spark Streaming, Spark SQL, Hadoop, HDFS, Hive, Sqoop, Pig, MapReduce, Flume, and Beam.

●Extensive experience with relational databases including Microsoft SQL Server, Teradata, Oracle, and PostgreSQL, and with NoSQL databases including MongoDB, HBase, Azure Cosmos DB, AWS DynamoDB, and Cassandra.

●Hands-on experience with data modeling, physical data warehouse design, and cloud data warehousing technologies including Snowflake, Redshift, BigQuery, and Synapse.

●Experience with major cloud providers & cloud data engineering services including AWS, Azure, GCP & Databricks.

●Created and optimized Talend jobs for data extraction, data cleansing, and data transformation.

●Designed & orchestrated data processing layer & ETL pipelines using Airflow, Azure Data Factory, Oozie, Autosys, Cron & Control-M.

●Hands-on experience with AWS services including EMR, EC2, Redshift, Glue, Lambda, SNS, SQS, CloudWatch, Kinesis, Step Functions, managed Airflow instances, storage, and compute.

●Hands-on experience with Azure services including Synapse, Azure Data Factory, Azure Functions, Event Hubs, Stream Analytics, Key Vault, storage, and compute.

●Hands-on experience with GCP services including Dataproc, VM instances, BigQuery, Dataflow, Cloud Functions, Pub/Sub, Composer, Secret Manager, storage, and compute.

●Hands-on experience with Databricks services including notebooks, Delta tables, SQL endpoints, Unity Catalog, secrets, and clusters.

●Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises data and data processing pipelines to the cloud on AWS, Azure, and GCP.

●Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).

●Hands-on experience with MS SQL Server Business Intelligence, including SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS).

●Experience in data governance and Master Data Management (MDM) using Collibra and Informatica, including standardization to address MDM and other common data management issues.

●Strong expertise in working with multiple databases, including DB2, Oracle, SQL Server, Netezza, and Cassandra, for data storage and retrieval in ETL workflows.

●Experience with Hadoop distributions such as Cloudera and Hortonworks.

●Traced and catalogued data processes, transformation logic, and manual adjustments to identify data governance issues.

●Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server.

●Extensive experience in loading and analyzing large datasets with Hadoop framework (MapReduce, HDFS, PIG, HIVE, Flume, Sqoop, SPARK, Impala, Scala), NoSQL databases like MongoDB, HBase, Cassandra.

●Expert in migrating SQL databases to Azure Data Lake Storage, Azure Data Factory (ADF), Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

●Integrated Kafka with Spark Streaming for real time data processing.

●Experience in designing and developing applications using Big Data technologies such as HDFS, MapReduce, Sqoop, Hive, PySpark and Spark SQL, HBase, Python, Snowflake, S3 storage, and Airflow.

●Experience in efficiently performing ETL using Spark in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system (a minimal sketch appears at the end of this summary).

●Built data integration pipelines using Informatica Intelligent Cloud Services (IICS), ensuring that information flowed reliably between different systems and platforms.

●Developed and implemented complex ETL workflows using Talend Data Integration.

●Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as DBT and DataStage.

●Good knowledge of job orchestration tools such as Oozie, Zookeeper, and Airflow.

●Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used crawlers to populate the AWS Glue Data Catalog with metadata table definitions.

●Skilled in building and publishing customized interactive reports and dashboards with custom parameters, producing tables, graphs, and listings with tools such as Tableau and Power BI, including user filters in Tableau.

●Practical understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
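
The Spark/Kafka ETL bullet above refers to the pattern sketched below. This is a minimal, illustrative PySpark Structured Streaming example; the broker address, topic name, event schema, and output paths are assumptions made for the sketch, not details of any specific project.

# Minimal PySpark Structured Streaming sketch: consume JSON events from Kafka,
# apply a Spark SQL-style transformation, and persist the result as Parquet.
# Requires the spark-sql-kafka connector on the classpath; broker, topic,
# schema, and paths below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-etl").getOrCreate()

event_schema = (StructType()
                .add("event_id", StringType())
                .add("amount", DoubleType())
                .add("event_ts", TimestampType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "events")                     # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .filter(col("amount") > 0))

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/curated/events")                # placeholder path
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()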

TECHNICAL SKILLS

Hadoop/Spark Ecosystem

Hadoop, MapReduce, Pig, Hive/impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.

Hadoop Distribution

Cloudera and Hortonworks

Programming Languages

Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting

Script Languages:

JavaScript, jQuery, Python.

Databases

Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL, HBase, MongoDB

Cloud Platforms

AWS, Azure, GCP, Databricks, Snowflake, DBT

Distributed Messaging System

Apache Kafka

Data Visualization Tools

Tableau, Power BI, SAS, Excel, ETL

Batch Processing

Hive, MapReduce, Pig, Spark

Operating System

Linux (Ubuntu, Red Hat), Microsoft Windows

Reporting Tools/ETL Tools

Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI

EDUCATION

PROFESSIONAL EXPERIENCE

Client: SMBC Bank, New York, NY

Duration: Dec 2021 – Present

Role: Senior Data Engineer - Azure

Responsibilities:

●Designed and implemented end-to-end data solutions on Azure, leveraging services such as Azure Data Factory, Azure Databricks, Azure Data Lake Storage, and Azure SQL Database.

●Developed data pipelines using Azure Data Factory to orchestrate data movement and transformations across diverse data sources and destinations.

●Built and optimized scalable data processing workflows using Azure Databricks, leveraging Spark for data ingestion, transformation, and analysis.

●Proficient in writing Spark jobs using Scala and Python in Azure Databricks notebooks for data cleansing, feature engineering, and advanced analytics.

●Implemented real-time data processing solutions using Azure Event Hubs, Azure Stream Analytics, and Databricks Structured Streaming.

●Led the implementation of data quality initiatives using Informatica Data Quality (IDQ) to ensure the accuracy, consistency, and reliability of critical data assets.

●Developed and maintained data profiling and data quality assessment processes within IDQ to identify data anomalies, inconsistencies, and data quality issues.

●Designed end-to-end data pipelines in Azure Data Factory for data extraction, transformation, and loading.

●Integrated Teradata as a data source within Azure Data Factory pipelines for seamless data movement.

●Extensive experience in implementing data lake architectures using Azure Data Lake Storage for storing and processing large volumes of structured and unstructured data.

●Designed and developed data warehousing solutions using Azure Synapse Analytics, including data modeling, SQL script development, and query optimization.

●Implemented machine learning workflows on Azure using Azure Databricks and Azure Machine Learning, including data preparation, model training, and deployment.

● Demonstrated experience in handling large-scale data processing using Teradata.

●Integrated and leveraged Azure Cosmos DB, a globally distributed NoSQL database, to handle large-scale and highly responsive applications with low latency requirements.

●Developed data ingestion and extraction processes, ensuring data quality and integrity using Azure Data Factory and related data integration technologies.

●Implemented data security and access controls in Azure data products, ensuring compliance with regulatory requirements and protecting sensitive data.

●Proficient in Teradata database design, development, optimization, and maintenance.

●Utilized Azure Monitor to proactively monitor and troubleshoot data pipelines and services, identifying and resolving performance issues and bottlenecks.

●Integrated Azure Application Insights to monitor the performance and usage of data engineering solutions, identifying areas for optimization and improvement.

●Proficient in using control structures, loops, and conditional logic within Teradata stored procedures.

●Utilized Azure Log Analytics to analyze and visualize log data, facilitating effective troubleshooting and monitoring of data pipelines and systems.

●Proficient in implementing and managing big data clusters on Azure Databricks, including cluster provisioning, monitoring, and optimization.

●Implemented data governance practices in Azure, ensuring data security, privacy, and compliance with industry standards and regulations.

●Experienced in implementing data replication and synchronization using Azure Data Factory and Azure Databricks to enable hybrid data integration scenarios.

●Developed automated data quality checks and monitoring solutions using Azure services such as Azure Monitor and Azure Log Analytics.

●Collaborated with cross-functional teams to gather requirements, design data solutions, and deliver projects on time and within budget.

●Designed and executed data cleansing and data standardization routines in IDQ to transform and enrich data for downstream applications.

●Collaborated with data analysts and business stakeholders to define data quality rules and data quality scorecards within IDQ, aligning with business requirements.

●Integrated PySpark with external data sources such as AWS S3, HDFS, and Kafka for seamless data ingestion and extraction.

●Utilized Spark’s Data Frame API to manipulate complex data structures and perform advanced analytics tasks.

●Provided training and support to hotel staff on using the SynXis reservation system, troubleshooting issues, and optimizing booking workflows.

●Conducted system upgrades and enhancements to ensure the stability and performance of the SynXis reservation platform, minimizing downtime and maximizing revenue opportunities.

●Implemented data partitioning and optimization techniques to improve data processing performance and reduce costs in Azure Databricks.

●Developed and optimized PySpark scripts to process large-scale datasets, improving performance by X%.

●Implemented data encryption and security controls in Azure to ensure data protection and compliance with organizational policies.

●Experience in implementing Azure Databricks auto-scaling and auto-termination policies to optimize resource utilization and cost management.

●Implemented CI/CD pipelines using Azure DevOps to automate deployment of data pipelines and workflows.

●Engineered streamlined processes within IICS to extract data from diverse sources, transform it, and load it into destination databases and data warehouses.

●Set up real-time data feeds in IICS, enabling instant data updates for quick decision-making.

●Developed custom PySpark UDFs (user-defined functions) to extend functionality and meet specific business requirements (see the sketch at the end of this section).

●Implemented data lineage and metadata management solutions to track and document data transformations and lineage using Azure services.

●Experience in optimizing and fine-tuning Spark jobs and SQL queries in Azure Databricks to improve performance and resource utilization.

●Implemented data archival and data retention policies using Azure services such as Azure Blob Storage and Azure Data Lake Storage.

●Engineered automated data integration processes using Snowflake's features such as tasks and streams, reducing manual intervention and enhancing data pipeline efficiency.

●Developed and deployed machine learning models using Azure Machine Learning, integrating them into production data workflows and Databricks pipelines.

●Implemented data security measures, including role-based access control (RBAC), encryption, and data masking techniques in Azure environments.

●Proficient in troubleshooting and resolving issues related to Azure services, Databricks clusters, and data pipelines.

●Implemented data cataloging and metadata management solutions using Azure Data Catalog and Databricks Delta Lake.

●Implemented data streaming and real-time analytics solutions using Azure Event Hubs, Azure Stream Analytics, and Azure Databricks.

●Experience in migrating on-premises data and workloads to Azure, including re-architecting and optimizing data processes for the cloud.

●Implemented revenue management strategies within the SynXis reservation system, leveraging data analytics and forecasting tools to optimize room pricing and inventory management.

●Developed documentation and training materials on Transparent Data Encryption (TDE) policies and procedures to educate teams on encryption standards and promote a culture of data security awareness.

●Provided technical expertise and guidance to stakeholders on best practices for implementing Transparent Data Encryption (TDE) within cloud environments, ensuring the confidentiality and integrity of sensitive data.

●Implemented data-driven insights and visualizations using Azure services such as Azure Data Explorer, Azure Synapse Studio, and Power BI.

●Implemented data access controls and auditing mechanisms to ensure data governance and compliance with regulatory requirements.

●Stayed up to date with the latest developments in Azure and Databricks, exploring new features.
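
The custom PySpark UDF work mentioned above follows the pattern sketched below. This is a minimal, illustrative example; the column names and the masking rule are assumptions made for the sketch, not logic from the engagement.

# Illustrative custom PySpark UDF: mask an account number, keeping only the
# last four characters. Column names and masking rule are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

@udf(returnType=StringType())
def mask_account(account_number):
    """Return the account number with all but the last four characters masked."""
    if account_number is None:
        return None
    return "****" + account_number[-4:]

# Small in-memory DataFrame standing in for a real source table.
df = spark.createDataFrame(
    [("1234567890", 250.0), ("9876543210", 75.5)],
    ["account_number", "amount"],
)

masked = df.withColumn("account_masked", mask_account(col("account_number")))
masked.show()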

Environment: Azure SQL, Azure Storage Explorer, Azure Storage, Azure Blob Storage, Azure Backup, Azure Files, Azure Data Lake Storage, SQL Server Management Studio 2016, Teradata, Visual Studio 2015, VSTS, Azure Blob, Power BI, PowerShell, C# .NET, SSIS, DataGrid, ETL (Extract, Transform, Load), Business Intelligence (BI).

Client: Bank of America, Charlotte, NC

Duration: Sep 2020 – Dec 2021

Role: Sr. Big Data Engineer - AWS

Responsibilities:

●Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

●Designed & Implemented Unified Data Processing Layer in Spark on AWS EMR to consolidate data from wide variety of sources.

●Collaborated with data architects to design data infrastructure and storage solutions that support data-centric applications.

●Designed & implemented incremental delta loads to optimize data processing time by 60% & cost by 53%.

●Managed the entire CI/CD process using CircleCI and Git.

●Collected data using Spark Streaming from AWS S3 buckets in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.

●Managed and optimized data pipelines on AWS using Databricks for ETL processes.

●Developed and maintained complex data transformation jobs using Apache Spark on Databricks to process and analyze large datasets.

●Designed and implemented data lakes and data warehouses on AWS using Databricks Delta for structured and semi-structured data storage.

●Utilized Databricks notebooks for data exploration, prototyping, and documentation of data processing workflows.

●Developed and maintained data catalogs and data lineage to track data sources, transformations, and usage.

●Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (a minimal sketch appears at the end of this section).

●Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

●Experience with setting up and managing Databricks clusters for data processing and analysis.

●Proficient in working with Databricks notebooks to develop, test, and deploy ETL pipelines using languages such as Python, Scala, or SQL.

●Experienced in developing scalable and efficient data pipelines using Java frameworks like Apache Spark, Apache Flink, or Spring Batch.

●Created and managed Databricks clusters, optimizing their configurations for cost-efficiency and performance.

●Implemented data quality checks and monitoring processes using Databricks, ensuring data accuracy and reliability.

●Experience with AWS services related to data processing, such as Amazon EMR, Amazon S3, and AWS Glue, as well as proficiency in scripting languages such as Python and Bash for data processing and automation tasks.

●Developed and maintained data transformation pipelines using DBT (Data Build Tool), enabling the creation of structured, reliable, and maintainable SQL-based data models and transformations in Snowflake.

●Designed and implemented Snowflake data warehousing solutions, including schema design, data modeling, and optimization for scalability and performance.

●Knowledge of configuring Databricks jobs and scheduling them to run on a regular basis to ensure the timely processing of data.

●Expertise in integrating Databricks with other tools such as Apache Spark, Apache Hadoop, and cloud services like AWS.

●Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks.

●Worked on AWS CLI, Auto Scaling, and CloudWatch monitoring creation and updates.

●Handled AWS management tools such as CloudWatch and CloudTrail.

●Stored the log files in AWS S3. Used versioning in S3 buckets where the highly sensitive information is stored.

●Integrated AWS DynamoDB with AWS Lambda to store item values and back up DynamoDB streams.

●Collaborated with data analysts and business stakeholders to understand data requirements and translate them into Snowflake schemas and DBT models, ensuring data accuracy, consistency, and accessibility.

●Implemented data quality checks and validation processes in Snowflake using Snowflake SQL, DBT tests, or custom scripts, ensuring data integrity and adherence to data quality standards.

●Used SQL Server Management Tool to check the data in the database as compared to the requirement given.

●Worked with AWS Athena and Talend to enable ad-hoc querying and analysis of data stored in Amazon S3.

●Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and storage of small data sets; maintained the Hadoop cluster on AWS EMR.

●Automated data processing workflows and orchestration using AWS Step Functions, AWS Lambda, or Apache Airflow, ensuring reliable and scalable execution of data pipelines in Snowflake.

●Integrated Snowflake with other AWS services, such as S3, Glue, and Athena, to enable data ingestion, data lake integration, and seamless cross-service data querying and analysis.

●Successfully completed a POC on GCP services such as BigQuery, Dataflow, Pub/Sub, and Cloud Storage, demonstrating the ability to quickly learn and work with new cloud platforms.

●Enforced standards and best practices around data cataloging and data governance efforts.

●Created DataStage jobs using different stages like Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, Row Generator, Etc.

●Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes.

●Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

●Created Unix Shell scripts to automate the data load processes to the target Data Warehouse.

●Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
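
The AWS Glue work described above (migrating campaign data from S3 into Redshift) follows the job structure sketched below. This is a minimal, illustrative Glue PySpark script; the database, table, connection, and bucket names are placeholder assumptions, not values from the project.

# Sketch of an AWS Glue (PySpark) job: read a cataloged S3 dataset, apply a
# light mapping, and load it into Redshift via a Glue connection.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: table registered in the Glue Data Catalog (backed by S3 files).
campaigns = glue_context.create_dynamic_frame.from_catalog(
    database="marketing_db",          # placeholder catalog database
    table_name="campaign_events")     # placeholder catalog table

# Light transformation: rename and cast columns before loading.
mapped = ApplyMapping.apply(
    frame=campaigns,
    mappings=[("campaign_id", "string", "campaign_id", "string"),
              ("spend", "double", "spend_usd", "double")])

# Target: Redshift via a Glue connection, staging through S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-conn",                    # placeholder connection
    connection_options={"dbtable": "analytics.campaign_events",
                        "database": "dw"},
    redshift_tmp_dir="s3://example-temp-bucket/glue-staging/")

job.commit()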

Environment: Python, Spark, AWS EC2, AWS S3, AWS EMR, AWS Redshift, AWS Glue, AWS RDS, AWS SNS, AWS SQS, AWS Athena, Snowflake, Data warehouse, Airflow, Data Governance, Kafka, ETL, Terraform, Docker, SQL, Tableau, Git, REST, Bitbucket, Jira.

Client: CYIENT, Hyderabad, India

Duration: Sep 2017 – Aug 2020

Role: Hadoop/Big Data Engineer

Responsibilities:

●Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.

●Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, writing data back into the RDBMS through Sqoop.

●Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC.

●Strong understanding of Partitioning, bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.

●Interacted with business partners, Business Analysts, and product owners to understand requirements and build scalable distributed data solutions using the Hadoop ecosystem.

●Developed Spark Streaming programs to process near real time data from Kafka, and process data with both stateless and stateful transformations.

●Experience in report writing using SQL Server Reporting Services (SSRS) and creating various types of reports like drill down, Parameterized, Cascading, Conditional, Table, Matrix, Chart and Sub Reports.

●Used DataStax Spark connector which is used to store the data into Cassandra database or get the data from Cassandra database.

●Wrote Oozie scripts and set up workflows using the Apache Oozie workflow engine for managing and scheduling Hadoop jobs.

●Worked on implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.

●Leveraged Snowflake's unique architecture to build data vaults and data marts, providing a unified platform for storing and accessing enterprise-wide data assets.

●Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.

●Worked with Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HQL queries (see the sketch at the end of this section).

●Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer which reduced 60% of execution time.

●Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.

●Developed ETL pipelines in and out of data warehouses using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.

●Experience implementing DevOps practices such as Infrastructure as Code, continuous integration and deployment (CI/CD), and automated testing, together with containerization technologies such as Docker and Kubernetes.

●Experience with Kubernetes and Docker for container management.

●Created Dockerfiles for .NET and Java applications with security enabled.

●Worked with Jenkins and Docker, integrating them for end-to-end automation of builds and deployments.

●Transformed data using AWS Glue DynamicFrames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the Glue workflow feature.

●Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.

●Developed data pipeline programs with Spark Scala APIs, performed data aggregations with Hive, and formatted data (JSON) for visualization.

●Implemented data governance and security policies in Snowflake, including role-based access controls and data encryption, to ensure compliance with regulatory requirements.
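
The Hive partitioning and bucketing work mentioned above follows the table layout sketched below, driven here from PySpark with Hive support enabled. The database, table, and column names are illustrative assumptions, not schema from the project.

# Sketch of a partitioned, bucketed Hive table created and loaded from PySpark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-example")
         .enableHiveSupport()
         .getOrCreate())

# Managed Hive table partitioned by event_date and bucketed by customer_id.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events (
        customer_id STRING,
        event_type  STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Allow dynamic-partition inserts, then load from a (placeholder) staging table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.events PARTITION (event_date)
    SELECT customer_id, event_type, amount, event_date
    FROM staging.raw_events
""")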

Environment: Apache Spark, Map Reduce, Snowflake, Apache Pig, Python, Java, SSRS, HBase, AWS, Cassandra, PySpark, Apache Kafka, HIVE, SQOOP, FLUME, Apache Oozie, Zookeeper, ETL, UDF

Client: Fusion Tech Solutions, Hyderabad, India

Duration: Jun 2014 – Aug 2017

Role: Hadoop Developer

Responsibilities:

●Installed, configured, monitored, and maintained Hadoop cluster on Big Data platform.

●Configured Zookeeper and worked on Hadoop High Availability with the Zookeeper failover controller, adding support for scalable, fault-tolerant data solutions.

●Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.

●Used Pig UDFs to do data manipulation, transformations, joins and some pre-aggregations.

●Created multiple Hive tables, implemented partitioning, dynamic partitioning, and buckets in Hive for efficient data access.

●Used Flume to collect, aggregate, and store dynamic web log data from different sources such as web servers and mobile devices, and pushed it to HDFS.

●Developed data quality monitoring solutions in Snowflake, implementing checks and alerts to ensure the accuracy and reliability of data.

●Stored and rapidly updated data in HBase, providing key-based access to specific data.

●Extracted data from Cassandra and MongoDB through Sqoop, placed it in HDFS, and processed it.

●Configured Spark to optimize data processing.

●Conducted performance tuning and optimization of Snowflake queries and data pipelines, improving processing efficiency and reducing costs.

●Worked on Oozie workflow engine for job scheduling.

●Created HDFS Snapshots to do data backup, protection against user errors and disaster recovery.

●Developed interfaces to migrate data from integrated system to Data Warehouse environment, providing availability of information in real time.

●Designed and developed regulatory and statistical reports for regulatory organizations and high-level decision makers.

●Worked with complex SQL views, Stored Procedures, Triggers, and packages in large databases from various servers.

●Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.

●Developed complex SQL queries using stored procedures, common table expressions (CTEs), temporary table to support Power BI and SSRS reports.

●Embedded Power BI reports on SharePoint portal page and managed access of reports and data for individual users using Roles.

●Developed and maintained multiple Power BI dashboards/reports and content packs.

●Used a test-driven approach for developing the application and implemented unit tests using the Python unittest framework.

●Successfully migrated the Django database from SQLite to MySQL and then to PostgreSQL with complete data integrity.

●Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface.

●Developed and executed various MySQL database queries from Python using the Python MySQL Connector and MySQL database packages (a brief sketch follows below).
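
The MySQL-from-Python work mentioned above follows the pattern sketched below. This is a minimal, illustrative example using the mysql-connector package; the connection details, table, and columns are placeholder assumptions.

# Run a parameterized MySQL query from Python with mysql-connector.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",              # placeholder connection details
    user="app_user",
    password="example-password",
    database="reservations",
)

try:
    cursor = conn.cursor()
    cursor.execute(
        "SELECT booking_id, status FROM bookings WHERE created_at >= %s",
        ("2017-01-01",),
    )
    for booking_id, status in cursor.fetchall():
        print(booking_id, status)
finally:
    conn.close()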

Environment: Hadoop, Spark, Scala, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Sqoop, Oozie, Zookeeper, Apache Pig, Python, Java, SSRS, HBase, Cassandra, PySpark, Apache Kafka, HIVE, FLUME, ETL, UDF.


