Senior Data Engineer
Mohammed Atif
Phone: +1-347-***-****
Email: ****************@*****.***
Professional Summary:
An accomplished Senior Data Engineer with 9+ years of experience designing, developing, and maintaining scalable data pipelines and architectures for high-volume, high-velocity data environments. Proficient in leveraging cloud platforms such as AWS and Azure, as well as big data tools such as Apache Spark, Hadoop, and Kafka, to drive data-driven decision-making across organizations.
Expertise in designing, building, and maintaining scalable ETL/ELT pipelines for processing large datasets from various sources.
Proficient in working with big data tools and frameworks such as Apache Spark, Hadoop, Kafka, Hive, and HDFS for distributed data processing.
Strong experience with cloud platforms like AWS (S3, Redshift, Glue) and Azure (Data Lake, Synapse) for managing data infrastructure and operations.
Skilled in developing and optimizing data warehouse solutions, designing efficient data models (star, snowflake), and handling complex data transformations.
Skilled in building ETL processes, streamlining data ingestion, and transforming raw data into actionable insights. Strong expertise in SQL, Python, and Scala for data processing, with a deep understanding of data modeling, data warehousing, and data lake architectures.
Advanced proficiency in Python, Scala, and SQL for data extraction, transformation, and automation. Familiar with shell scripting for system-level data operations.
Expertise in building real-time data ingestion and processing solutions using tools like Apache Kafka, Spark Streaming, and AWS Kinesis (a minimal sketch follows this summary).
Strong knowledge of relational (SQL Server, MySQL, PostgreSQL) and NoSQL (MongoDB, Cassandra) databases for high-performance data storage and retrieval.
Adept at collaborating with cross-functional teams, including data scientists, analysts, and business stakeholders, to deliver robust, cost-effective, and high-performing data solutions.
Hands-on experience in implementing data governance frameworks, ensuring data quality, security, and compliance with industry regulations (GDPR, HIPAA).
Proven ability to optimize data pipelines and queries for enhanced performance, lower latency, and reduced cost.
Experience working with cross-functional teams (data scientists, analysts, business stakeholders) to align data engineering efforts with business goals.
Proven track record of optimizing data workflows, improving data quality, and implementing governance frameworks to ensure the reliability and security of data assets. Passionate about staying up-to-date with emerging technologies and best practices in data engineering and cloud ecosystems.
Familiar with Agile and Scrum methodologies, contributing to continuous integration (CI/CD) and iterative improvements in data workflows.
Basic understanding of data visualization and reporting tools like Tableau, Power BI, and Qlik for providing actionable insights.
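The real-time ingestion pattern summarized above can be illustrated with a minimal PySpark Structured Streaming sketch; this is an illustrative example rather than project code, and the broker address, topic name, schema, and paths are placeholders.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# land the raw events as Parquet. Requires the spark-sql-kafka package on the
# classpath; broker, topic, schema, and paths below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "events")                      # placeholder topic
       .option("startingOffsets", "latest")
       .load())

parsed = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "/data/landing/events")                  # placeholder sink
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```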
Technical Skills:
Programming Languages: Python, Scala, Java, SQL, Shell Scripting
Big Data Technologies: Apache Spark, Hadoop (HDFS, MapReduce), Hive, HBase, Kafka, Flink
Data Warehousing: AWS Redshift, Azure Synapse, Snowflake, Apache Hive
ETL/ELT Tools: Apache NiFi, Talend, Informatica, AWS Glue, Azure Data Factory
Cloud Platforms: AWS (S3, Redshift, Lambda, EC2), Microsoft Azure (Data Lake, Synapse, Databricks)
Databases: MySQL, PostgreSQL, Microsoft SQL Server, Oracle, MongoDB, Cassandra, DynamoDB
Data Modeling: Star Schema, Snowflake Schema, Dimensional Modeling, Normalization/Denormalization
Real-Time Data Processing: Apache Kafka, AWS Kinesis, Apache Flink, Spark Streaming
CI/CD & DevOps: Jenkins, Git, GitHub, GitLab, Docker, Kubernetes, Terraform
Data Integration: Apache Sqoop, AWS Glue, Azure Data Factory, Kafka Connect
Data Governance: Data Quality, Data Lineage, Metadata Management, GDPR, HIPAA Compliance
Machine Learning & AI: MLlib (Spark), TensorFlow, Scikit-learn
Version Control: Git, GitLab, Bitbucket
Monitoring & Logging: Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana)
Agile & Project Management: JIRA, Confluence, Scrum, Kanban
Data Visualization: Power BI, Tableau, Qlik Sense
Professional Experience:
Ally Bank - Sandy, Utah March 2022 to Present
Senior Data Engineer
Responsibilities:
Knowledgeable about the latest advancements and best practices in machine learning and deep learning, actively engaging in continuous learning through online courses, workshops, and research papers to stay abreast of emerging trends and methodologies.
Extensive hands-on experience in Python programming language, proficient in building robust and scalable applications for data science and machine learning tasks.
Experience in cloud computing platforms, particularly AWS and Microsoft Azure, for deploying scalable and cost-effective machine learning solutions in cloud environments.
Proficiency in leveraging cloud-based services such as AWS SageMaker and Azure Machine Learning for model training, deployment, and management.
Acted as a JIRA administrator, resolving user issues, conducting system maintenance, and ensuring data integrity and security.
Experienced in numerical computing with NumPy, utilizing its array-based operations and mathematical functions for performing complex computations and statistical analysis on large datasets.
Specialization in Generative Artificial Intelligence (Gen AI) and Large Language Models (LLM), with hands-on experience in developing and fine-tuning models for natural language processing (NLP) tasks.
Skilled in leveraging Scala for Big Data processing frameworks such as Apache Spark to perform batch and real-time data processing, analytics, and machine learning.
Used Azure Databricks, Azure Storage Accounts, and related Azure services for source stream extraction, cleansing, consumption, and publishing across multiple user bases.
Methodically documented design decisions and technical specifications, ensuring comprehensive documentation and adherence to industry-standard coding practices for seamless collaboration and knowledge transfer.
Actively engaged in continuous learning and knowledge-sharing initiatives, avidly staying abreast of the latest advancements in AI and machine learning, fostering a culture of innovation and excellence within the team.
Created pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL Database, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
Collaborated with ML engineers and data scientists to build data and model pipelines and helped run machine learning tests and experiments.
Designed and implemented data pipelines using Apache Spark and Apache Kafka to ingest, process, and store data from various sources.
Developed and deployed RESTful APIs using Python and Flask to expose data and analytics services to internal and external clients.
Developed and implemented machine learning algorithms for tasks like classification, regression, and clustering.
Familiarity with deep learning concepts and architectures, including neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks (see the sketch at the end of this role).
Experienced in building data warehouses and analytical systems using Scala and related technologies to support reporting, dashboards, and data visualization.
Developed Spark applications to import data from Teradata into HDFS and created Hive tables over the imported data.
Set up build and deployment automation for Terraform scripts using Jenkins.
Improved the performance of queries against tables in the enterprise data warehouse in Azure Synapse Analytics by using table partitions.
Extracted data from HDFS using Hive and Presto, performed data analysis and feature selection with Spark (Scala and PySpark), and created nonparametric models in Spark.
Monitored Spark clusters using Log Analytics and the Ambari Web UI; transitioned log storage from Cassandra to Azure SQL Data Warehouse, improving query performance.
Worked with Azure Blob and Data Lake Storage and loaded data into Azure Synapse Analytics (SQL DW).
Created custom Docker container images, tagged and pushed images to the registry, and used the Docker console to manage the application life cycle.
Built microservices and deployed them to Kubernetes clusters as well as Docker Swarm.
Environment: HDFS, YARN, MapReduce, Hive, Sqoop, Flume, Oozie, HBase, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Informatica, Oracle, Teradata, CI/CD, PL/SQL, UNIX Shell Scripting, Cloudera, MS Azure.
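The Databricks-to-Synapse copy described in this role could look like the following minimal PySpark sketch. It is illustrative only: the storage account, container, JDBC URL, and table names are placeholders, and `spark` is the session Databricks provides in a notebook.

```python
# Databricks notebook sketch: read JSON from ADLS Gen2, apply light transforms,
# and write to an Azure Synapse dedicated SQL pool via the Synapse connector.
# Storage account, container, JDBC URL, and table names are placeholders.
from pyspark.sql.functions import col, to_date

raw = (spark.read
       .option("multiLine", "true")
       .json("abfss://landing@examplelake.dfs.core.windows.net/accounts/*.json"))

cleaned = (raw
           .withColumn("open_date", to_date(col("open_date")))
           .dropDuplicates(["account_id"]))

(cleaned.write
 .format("com.databricks.spark.sqldw")                      # Azure Synapse connector
 .option("url", "jdbc:sqlserver://example-ws.sql.azuresynapse.net:1433;database=dw")
 .option("forwardSparkAzureStorageCredentials", "true")
 .option("dbTable", "stg.accounts")
 .option("tempDir", "abfss://staging@examplelake.dfs.core.windows.net/synapse-tmp")
 .mode("append")
 .save())
```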
Mayo Clinic – Rochester, Minnesota Oct 2020 to Feb 2022
Senior Data Engineer
Responsibilities:
Led or participated in JIRA migration projects, including planning, data migration, testing, and post-migration support to ensure a smooth transition to new versions or platforms.
Proficient in model diagnostics to assess model assumptions, evaluate performance metrics, and identify areas for improvement, utilizing techniques such as residual analysis, ROC curves, and confusion matrices.
Skilled in data visualization with Matplotlib and Seaborn, creating visually compelling and informative plots, charts, and graphs to communicate insights and findings effectively.
Cultivated strategic partnerships with external subject matter experts (SMEs) and vendors, harnessing their specialized expertise to augment project execution and facilitate seamless knowledge transfer.
Proficiency in building and training machine learning models using TensorFlow and PyTorch, harnessing their deep learning capabilities to develop state-of-the-art neural networks for image classification, natural language processing, and other tasks.
Built and maintained data warehouses and data lakes using SQL and NoSQL databases on AWS and Azure platforms.
Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
Over 7 years of progressive experience in machine learning and data science, with a proven track record of delivering high-impact solutions and driving business value.
Exceptional proficiency in Python, along with extensive command of a comprehensive suite of machine learning libraries and frameworks, including Pandas, NumPy, Matplotlib, Seaborn, TensorFlow, PyTorch, and scikit-learn.
Strong understanding of machine learning algorithms and techniques, including supervised learning, unsupervised learning, and reinforcement learning, for solving diverse business problems.
Proficient in data visualization tools and libraries such as Matplotlib, Seaborn, and Plotly, for creating interactive and informative visualizations to communicate insights effectively.
Knowledgeable about statistical analysis techniques and hypothesis testing methodologies, enabling rigorous analysis and interpretation of data-driven insights.
Familiarity with natural language processing (NLP) techniques, including text preprocessing, sentiment analysis, named entity recognition (NER), and topic modeling, for analyzing unstructured text data.
Designed and developed POCs in Spark using Scala to compare the performance of Spark with MapReduce and Hive; created Hive tables, loaded and analyzed data using Hive queries, and designed and developed custom Hive UDFs.
Working knowledge of Azure cloud components (Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
Trained the team to run the conversion tool that converts Informatica mappings into the format required by Snowflake.
Wrote Hive SQL and Presto SQL scripts, using Python plugins for both Spark and Presto, to create complex tables with high-performance features such as partitioning, clustering, and skew handling.
Automated the ML model-building process by building data pipelines and integrating them with the data-cleaning process.
Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed it in Azure Databricks.
Wrote Spark SQL and PySpark scripts in the Databricks environment to validate monthly account-level customer data stored in S3 (see the sketch at the end of this role).
Used the Spark Cassandra Connector to load data to and from Cassandra.
Implemented Spark scripts using Scala and Spark SQL to read Hive tables into Spark for faster processing, and loaded data from the UNIX file system into HDFS.
Converted row-oriented regular Hive external tables into Snappy-compressed columnar Parquet tables with key-value pairs.
Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
Environment: Spark SQL, HDFS, Hive, Azure, Pig, Apache Sqoop, Java (JDK SE 6, 7), Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, PostgreSQL, IntelliJ, CI/CD, Oracle, Subversion, Control-M, Teradata, and Agile methodologies.
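The monthly account-level validation mentioned in this role could be sketched as a small PySpark check over the S3 extract; this is illustrative only, and the bucket, prefix, column names, and thresholds are placeholders.

```python
# PySpark validation sketch: row counts, null rates, and duplicate checks on a
# monthly account-level extract in S3. Bucket, prefix, and column names are
# placeholders; a real job would log results rather than assert.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession.builder.appName("monthly-account-validation").getOrCreate()

df = spark.read.parquet("s3a://example-bucket/monthly/accounts/2021-12/")

total_rows = df.count()

# Null count per critical column.
null_counts = df.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in ["account_id", "customer_id", "balance"]
]).collect()[0].asDict()

# Accounts that appear more than once.
duplicate_accounts = (df.groupBy("account_id").count()
                        .filter(col("count") > 1)
                        .count())

print(f"rows={total_rows}, nulls={null_counts}, duplicates={duplicate_accounts}")

assert duplicate_accounts == 0, "duplicate account_id values found"
assert null_counts["account_id"] == 0, "null account_id values found"
```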
Homesite Insurance - Boston, MA April 2018 to Sept 2020
Data Engineer
Responsibilities:
Experienced in deploying machine learning models in production environments, including considerations for scalability, performance monitoring, and model retraining.
Actively stay updated with the latest advancements in machine learning research and techniques, participating in conferences, workshops, and online courses to continuously enhance knowledge and skills in the field.
Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, and S3.
Experienced in data preprocessing tasks with Pandas, including cleaning data, handling missing values, and transforming data into the desired format for analysis.
Assisted in developing and testing data solutions for various projects using Python, SQL, and AWS.
Implemented ETL processes using Python and AWS Glue to transform and load data from different sources into a data warehouse (see the sketch at the end of this role).
Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
Load and transform large sets of structured, semi-structured, and unstructured data.
Responsible for managing data coming from different sources and for implementing MongoDB to store and analyze unstructured data.
Supported MapReduce Programs that are running on the cluster and involved in loading data from the UNIX file system to HDFS.
Installed and configured Hive and wrote Hive UDFs.
Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Integrated data quality assurance processes into Dataflow pipelines, including data validation, cleansing, and anomaly detection, to ensure the accuracy, completeness, and consistency of data processed.
Implemented End-to-end solution for hosting the web application on AWS cloud with integration to S3 buckets.
Created and updated Auto Scaling groups and CloudWatch monitoring through the AWS CLI.
Assigned permissions, policies, and roles to users and groups using AWS Identity and Access Management (IAM).
Monitored containers on AWS EC2 machines using the Datadog API and ingested enriched data into the internal cache system.
Used the Hibernate ORM framework with the Spring framework for data persistence and transaction management, and built templates and screens in HTML and JavaScript.
Environment: AWS, Hadoop, HDFS, MapReduce, Pig, Sqoop, UNIX, HBase, Java, JavaScript, HTML.
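The Glue-based ETL referenced in this role might look like the following job-script sketch; it is illustrative only, and the catalog database, table, column mappings, and S3 staging path are placeholders.

```python
# AWS Glue job sketch (PySpark): read a table from the Glue Data Catalog,
# remap columns, and write Parquet to an S3 staging area for warehouse loads.
# Catalog database, table, mappings, and bucket path are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table registered in the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_raw", table_name="policies")

# Rename and cast columns on the way through.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("policy_id", "string", "policy_id", "string"),
        ("premium", "double", "premium_usd", "double"),
        ("start_dt", "string", "start_date", "date"),
    ])

# Land the result as Parquet in S3 for downstream warehouse loading.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-warehouse-staging/policies/"},
    format="parquet")

job.commit()
```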
Info Logitech Systems - Hyderabad, India June 2014 to Dec 2017
Data Engineer
Responsibilities:
Analyzed user requirements and produced specifications for various database applications.
Planned and implemented capacity expansion to ensure the company's databases remain scalable.
Developed AI/ML algorithms and built, trained, and deployed ML models on AWS with Amazon SageMaker (see the sketch at the end of this role).
Diagnosed and resolved database access and performance issues.
Knowledgeable about implementing data quality checks, validation rules, and governance policies within Scala-based data pipelines to ensure data accuracy, consistency, and compliance.
Skilled in monitoring and troubleshooting AWS Glue jobs using AWS CloudWatch metrics, logs, and alarms to ensure job performance, identify errors or issues, and optimize resource utilization.
Designed and developed specific databases for the collection, tracking, and reporting of data.
Designed, coded, tested, and debugged custom queries using Microsoft SQL and SQL Reporting Services.
Created data models for clients' transactional logs and analyzed the data from Cassandra.
Conducted research to collect and assemble data for databases, and was responsible for the design and development of relational databases for collecting data.
Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS.
Built data input forms and designed data collection screens; managed database design, maintenance, administration, and security for the company.
Used Informatica PowerCenter to create mappings, sessions, and workflows for populating data into dimension, fact, and lookup tables simultaneously from different source systems (SQL Server, Oracle, flat files).
Used several RDD transformations to filter the data injected into Spark SQL.
Used Hive Context and SQL Context to integrate Hive meta store and Spark SQL for optimum performance.
Created mappings using various Transformations like Source Qualifier, Aggregator, Expression, Filter, Router, Joiner, Stored Procedure, Lookup, Update Strategy, Sequence Generator and Normalizer.
Implemented AWS Organization to centrally manage multiple AWS accounts including consolidated billing and policy-based restrictions.
Worked extensively with SSIS to import, export, and transform data between systems; used T-SQL to query the SQL Server database for data validation and data conditioning.
Environment: Windows Server, Microsoft SQL Server, Informatica, Query Analyzer, AWS, Enterprise Manager, Import and Export, SQL Profiler.
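The SageMaker train-and-deploy flow mentioned in this role can be sketched with the SageMaker Python SDK; this is illustrative only, and the IAM role ARN, S3 path, training script, and instance types are placeholders.

```python
# SageMaker sketch: train a scikit-learn model from a user-supplied script and
# deploy it to a real-time endpoint. Role ARN, S3 paths, and the train.py
# entry point are placeholders.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"  # placeholder role

estimator = SKLearn(
    entry_point="train.py",           # hypothetical training script
    framework_version="1.2-1",
    instance_type="ml.m5.large",
    role=role,
    sagemaker_session=session,
)

# Train against data staged in S3 (placeholder path).
estimator.fit({"train": "s3://example-ml-bucket/training/"})

# Stand up a real-time inference endpoint for the trained model.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")

print(predictor.endpoint_name)
```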