Data Scientist Engineer

Location:
Colonial Heights, VA
Posted:
February 28, 2024

Resume:

Mirza Mujtaba Baig

Centreville, VA-*****

Mobile#: 571-***-****

Email: ad3zu3@r.postjobfree.com

Professional Summary:

10+ years of IT experience, including 8+ years in Big Data, the Hadoop ecosystem, Java, Python, Spark, and Databricks-related technologies, 1+ year in Oracle technologies, and 8+ months in Java development.

Data Scientist familiar with gathering, cleaning, and organizing data for use by technical and non-technical personnel. Advanced understanding of statistical, algebraic, and other analytical techniques. Highly organized, motivated, and diligent, with a significant background in Data Engineering/Data Science.

Meticulous Data Scientist accomplished in compiling, transforming and analyzing complex information through software. Expert in machine learning and large dataset management. Demonstrated success in identifying relationships and building solutions to business problems.

Hardworking and passionate professional with strong organizational skills, eager to secure a Data Engineer/Data Scientist position. Ready to help the team achieve company goals.

Detail-oriented team player with strong organizational skills and the ability to handle multiple projects simultaneously with a high degree of accuracy. Seeking a full-time position that offers professional challenges and makes use of interpersonal, time-management, and problem-solving skills.

Organized and dependable candidate successful at managing multiple priorities with a positive attitude. Willingness to take on added responsibilities to meet team goals.

Astute Data Engineer/Data Scientist with data-driven and technology-focused approach. Communicates clearly with stakeholders and builds consensus around well-founded models. Talented in writing applications and reformulating models.

Accomplishments:

Experienced professional in research and development

Experienced professional in publications and scientific communities

Experienced professional in analytics, data science, machine learning, data mining, and cloud computing.

Education Details:

Ph.D. in Computer Science and Information Systems, University of the Cumberlands, currently pursuing

Master’s in Computer Science and Information Systems, University of Michigan-Flint, 2014

Bachelor’s in Computer Science, Birla Institute of Technology and Science, Pilani – Dubai, 2011

Technical Skills:

Big Data/ Hadoop

Cloudera CDH 5.1.3, Hortonworks HDP 2.0, Hadoop, HDFS, MapReduce (MRv1, MRv2/YARN), HBase, Pig, Hive, Sqoop, Flume, ZooKeeper, Oozie, Lucene, Cassandra, CouchDB, MongoDB, Kafka, Python, Scala, R, shell scripting

Languages

Java, C, HTML, SQL, PL/SQL, Scala

OS

Windows 8, Windows 7, Windows XP/98, UNIX/LINUX, MAC

Databases

Oracle 9i/10g/11g (SQL, PL/SQL), MySQL, SQL Server, MS Access, Teradata, NoSQL

Web Technologies

HTML, DHTML, XML, WSDL, SOAP, Joomla, Apache Tomcat

Build Tools

Ant, Maven

Development Tools

Adobe Photoshop, Adobe Illustrator, Eclipse, Linux/Mac OS environment, MS Visio, Crystal Reports

Business Domains

Distributed Systems, Online advertising, Social media advertising

Data Analytics

Python, R

ETL Tools

Talend/Informatica

Professional Development:

Experian PLC Mar’21 - Present

Costa Mesa, CA

Lead AWS Data Engineer

Job Responsibilities:

Participated in business and system requirements sessions.

Extracted and assessed data from databases to drive improvement of product development and business strategies and processes.

Assessed accuracy and effectiveness of new and existing data sources and data analysis techniques.

Applied statistical and algebraic techniques to interpret key points from gathered data.

Discovered stories told by data to present information to scientists and business managers.

Utilized advanced querying, visualization and analytics tools to analyze and process complex data sets.

Set up SQL database on AWS cloud servers to store client data for query analysis.

Created DAG workflows using Airflow and Python scripts.
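
For illustration only, a minimal sketch of the kind of Airflow DAG described above, written in Python; the DAG id, task names, schedule, and script path are hypothetical placeholders rather than the actual production workflow.

    # Minimal Airflow DAG sketch (illustrative; names, schedule, and paths are hypothetical).
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def extract_and_validate(**context):
        # Placeholder for the Python logic that pulls and validates the day's data.
        print("extracting data for", context["ds"])


    with DAG(
        dag_id="daily_ingest_pipeline",
        start_date=datetime(2021, 3, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract = PythonOperator(task_id="extract_and_validate", python_callable=extract_and_validate)
        load = BashOperator(
            task_id="load_to_s3",
            bash_command="python /opt/scripts/load_to_s3.py {{ ds }}",  # hypothetical script
        )
        extract >> load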

Developed workflows and performed builds, execution, and deployment using Jenkins, Amazon EMR, Athena, ECS, EC2, EKS, ECR, and CloudWatch, along with storage and networking services such as Amazon S3, MSK, and VPC.

Performed performance optimizations across SQL, PL/SQL, and HiveQL scripts, Java, Scala, Python, and shell scripts, and cron schedules.

Involved in handling JSON data in Spark SQL and resolving technical issues during development, deployment, and support.

Performed testing activities including performance testing, unit testing, load testing, functional testing, and automated testing for the PySpark scripts developed.

Excellent experience in Agile and Scrum methodologies.

Worked on parallel processing of data using in-built functions within shell script in the UNIX environment.

Implemented the algorithms of Clustering, Classification, Regression, Support Vector Machines, Neural Networks for building the prediction models within the technical infrastructure.

Involved in Spark Streaming, Spark SQL, Scala programming, RDD creation, and operations on DataFrames and Datasets for different use cases.

Good conceptual understanding and experience in cloud computing applications using Amazon Web Services (AWS)-EC2, S3, EMR, RDS and Amazon RedShift platforms.

Organized StreamSets pipelines into Docker containers.

Using Gitlab and Jenkins for building and performing the deployment of the code.

Involved in programming with shell scripting, Python, Scala, and SQL.

Utilized TensorFlow, pandas, Keras, NumPy, and PySpark libraries to develop and train prediction models.
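
A minimal sketch of the kind of prediction-model training implied above, assuming a hypothetical CSV file and a binary "label" column; the architecture and hyperparameters are illustrative, not the actual model.

    # Illustrative Keras prediction model; the input file and column names are hypothetical.
    import numpy as np
    import pandas as pd
    from tensorflow import keras

    df = pd.read_csv("training_data.csv")                      # hypothetical input
    X = np.asarray(df.drop(columns=["label"]), dtype="float32")
    y = np.asarray(df["label"], dtype="float32")

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),           # binary prediction
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, batch_size=256, validation_split=0.2)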

Created data-flow documents covering the design and integration of job flows.

Designed solutions on SQL databases (Salesforce, MySQL, Oracle) and NoSQL databases (HBase), and handled streaming data sourced through Kafka/ZooKeeper pipelines.

Integrated Salesforce and AWS cloud solutions for client accounts valued at $100K-$250K, coordinated with senior developers to ensure alignment with business requirements, and identified opportunities to enhance the user interface.

Led a team of 20-25+ members to enhance performance of the Salesforce platform by analyzing user data, troubleshooting issues, and resolving bugs.

Involved in event streaming of data using AWS Lambda, designing data using JSON schemas.

Deconstructed item descriptions in the “home-care” category to predict which product features were most likely to be relevant to a given customer rating, resulting in a lift in conversion rate.

Performed data extraction, aggregation, and consolidation within AWS Glue using PySpark.

Built a price-sensitivity model to offer lower pricing on inventory unlikely to be booked, resulting in a decrease in product vacancy.

Performed sentiment analysis to surface reviews most likely to be relevant to a given user for a given product, improving the booking rate.

Streamlined feature selection for a model predicting the likelihood of a customer making a purchase.

Built Fuzzy Matching Algorithm using K-Nearest Neighbors to identify non-exact matching duplicates.
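
One common way to implement such KNN-based fuzzy matching is character n-gram TF-IDF plus scikit-learn's NearestNeighbors; the sketch below is illustrative only, and the sample records and distance threshold are hypothetical.

    # Illustrative fuzzy duplicate detection: character n-gram TF-IDF + K-Nearest Neighbors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    names = ["Acme Corp", "ACME Corporation", "Globex LLC", "Globex L.L.C.", "Initech"]

    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(names)
    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
    distances, indices = nn.kneighbors(vectors)

    # Flag records whose nearest non-self neighbor falls within a distance threshold.
    for i, (dist, j) in enumerate(zip(distances[:, 1], indices[:, 1])):
        if dist < 0.5:  # hypothetical threshold
            print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} (cosine distance {dist:.2f})")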

Helped build tools for detecting botnets with Machine Learning and Data Mining

Proficient in deep learning, machine learning, graph models and/or reinforcement learning.

Experience with Natural Language Processing, Natural Language Understanding, open-source tools, GPU processing, and mobile marketing analytics.

Expert in R and Python scripting; worked with statistical functions using NumPy and SciPy, visualization using matplotlib, and data organization with pandas; handled packages including ggplot2, caret, dplyr, R Shiny, rjson, plyr, and scikit-learn.

Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 (ORC/Parquet/text files) into Amazon Redshift, performing data extraction, aggregation, and consolidation.
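
A stripped-down sketch of a Glue job of this shape; the catalog database, table, Redshift connection, and bucket names are placeholders, and the aggregation is only an example of the consolidation step.

    # Illustrative AWS Glue job: read cataloged S3 data, aggregate, and write to Amazon Redshift.
    import sys
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: ORC/Parquet/text files on S3, cataloged by a crawler (hypothetical names).
    source = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")

    # Consolidate with Spark SQL functions, then convert back to a DynamicFrame.
    aggregated = source.toDF().groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
    result = DynamicFrame.fromDF(aggregated, glue_context, "result")

    # Sink: Redshift through a pre-configured Glue connection (hypothetical names).
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=result,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "analytics.orders_agg", "database": "dev"},
        redshift_tmp_dir="s3://my-temp-bucket/glue-tmp/",
    )
    job.commit()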

Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Identified, measured and recommended improvement strategies for KPIs across business areas.

Created and implemented new forecasting models to increase company productivity.

Worked with stakeholders to develop quarterly roadmaps based on impact, effort, and test coordination.

Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling.

Built company maven frameworks to test model quality.

Analyzed large datasets to identify trends and patterns in customer behaviors.

Developed polished visualizations to share results of data analyses.

Ran statistical analyses within software to process large datasets.

Modeled predictions with feature selection algorithms.

Devised and deployed predictive models using machine learning algorithms to drive business decisions.

Leveraged mathematical techniques to develop engineering and scientific solutions.

Experience developing applications using Java and various J2EE technologies, including Spring, REST, SOAP, JAX-WS, JAX-RS, Hibernate, JDBC, JSP, Servlets, JSTL, EJB, XML, and JMS.

Extensively used JAXB parsers to parse XML into objects.

Developed and implemented a machine learning model that improved customer churn prediction accuracy, reduced customer attrition, and increased customer retention.

Collaborated with a team of data scientists and software engineers to design and deploy a natural language processing system that achieved a high accuracy rate in sentiment analysis, leading to more accurate customer feedback analysis and improved product development.

Conducted extensive research and experimentation to optimize an image recognition algorithm, improving object detection accuracy and enabling more efficient image processing in real-time applications.

Incorporated a trained Generative Pre-trained Transformer (GPT) model to generate human-like text, including creative writing pieces, news articles, and code snippets, demonstrating the ability to work with state-of-the-art language models.

Demonstrated a passion for AI and music by developing a project that uses neural networks to generate original musical compositions.

Utilized Generative Adversarial Networks to build a project that simulates the aging process on facial images, demonstrating proficiency in GANs and the ability to apply generative models to real-world scenarios such as facial recognition technology.

Integrated ChatGPT into over 900,000 Tesla vehicles in a beta program. This technology powers virtual assistants, offering an advanced level of interaction and customization for drivers, and enables more intuitive engagement with car functions through natural language processing, enhancing the in-car experience and transforming traditional vehicle interfaces into more responsive, user-friendly systems.

Applied generative AI that accounts for key engineering constraints such as aerodynamics and cabin size, significantly streamlining the creative process, reducing the number of iterations needed to refine a design, and ensuring designs are both aesthetically pleasing and functionally viable.

Integrated Ansible playbooks with Terraform templates to provision infrastructure in AWS.

Deployed a 3-tier architecture on the AWS cloud using Terraform (Infrastructure as Code).

Migrated the IaC codebase from Terraform 0.11 to the latest 0.12.x version.

Managed provisioning of AWS infrastructure using CloudFormation and Terraform.

Cognizant/Novartis International AG Jul’19-Mar’21

East Hanover, NJ

Lead AWS Data Engineer

Job Responsibilities:

Participated in business and system requirements sessions.

Sourced data from the Salesforce database into the Big Data environment.

Worked on Cloudera (CDH 6.3.3) for data storage, processing, and migration operations.

Loaded data from front-end applications such as ServiceNow into HDFS for analytical purposes.

Created DAG workflows using Airflow and Python scripts.

Monitored and created dashboards for pipeline jobs using Grafana for alerting purposes.

Performed performance optimizations across SQL, PL/SQL, and HiveQL scripts, Java, Scala, Python, and shell scripts, and cron schedules.

Involved in handling JSON data in Spark SQL and resolving technical issues during development, deployment, and support.

Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON data into Snowflake tables.
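
A minimal sketch of loading nested JSON from an S3 stage into a Snowflake VARIANT column with the Snowflake Python connector; the connection details, stage, table, and JSON paths below are placeholders.

    # Illustrative nested-JSON load into Snowflake (all identifiers are hypothetical).
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
    )
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")

    # @events_stage is an external stage already pointing at the S3 bucket.
    cur.execute("COPY INTO raw_events FROM @events_stage FILE_FORMAT = (TYPE = 'JSON')")

    # Flatten one level of nesting into a view for downstream queries.
    cur.execute("""
        CREATE OR REPLACE VIEW raw_events_flat AS
        SELECT payload:id::STRING        AS event_id,
               payload:user.name::STRING AS user_name,
               f.value::STRING           AS tag
        FROM raw_events, LATERAL FLATTEN(input => payload:tags) f
    """)
    conn.close()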

Performed testing activities related to Performance Testing, Unit Testing, Load Testing, Functional Testing, Automated Testing for the PySpark scripts developed

Performed performance optimizations on HiveQL, Spark SQL, and PySpark scripts.

Performed RDD creation and operations on DataFrames and Datasets for the use case.

Transformed the PySpark scripts using AWS Glue dynamic frames with Python and Spark; cataloged and transformed the data using crawlers and scheduled the job and crawler using workflow feature.

Developed end-to-end data analytics framework utilizing Amazon Redshift, Glue, and Lambda enabling business to obtain KPIs faster with reduced costs.

Analyzed the Salesforce application's performance, evaluated user feedback, and drove continuous improvement efforts to maintain long-term client relationships.

Integrated salesforce.com application with external systems like Oracle and SAP using SOAP API and REST API.

Proficient in Bash shell scripting.

Involved in scheduling of jobs using Control-M environment.

Excellent experience in Agile and Scrum methodologies.

Implemented clustering, classification, regression, support vector machine, and neural network algorithms to build prediction models within the technical infrastructure using Python and R.

Good conceptual understanding of and experience in cloud computing applications using Amazon Web Services (AWS) EC2, S3, EMR, RDS, and Amazon Redshift platforms. Involved in Spark SQL, Spark Streaming, Scala, RDD creation, and operations on Datasets and DataFrames.

Utilized TensorFlow, pandas, Keras, NumPy, and PySpark libraries to develop and train prediction models.

Evaluated Snowflake design considerations for any change in the application.

Worked on Oracle Databases, Redshift, Snowflake.

Extracted, transformed, and loaded data from source systems to AWS/Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.

Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, standardization, and then applied transformations as per the use cases.

Experience analyzing data from Azure data stores using Databricks to derive insights with Spark cluster capabilities; developed batch-processing solutions and used Databricks notebooks for interactive analytics with Spark APIs.

Developed SOAP web services using JAX-WS API

Developed REST web services using JAX-RS API

Extensively used JAXB parsers to parse XML into objects.

Implemented various XML technologies such as XML Schema, JAXB parsers, and XMLBeans.

Led a team of AI engineers in the development and deployment of a predictive modeling system that accurately forecasted customer demand, reducing inventory costs and increasing on-time deliveries.

Implemented an AI system design that enabled seamless human-AI interaction, improving customer satisfaction scores and increasing customer engagement.

Conducted rigorous testing and validation of an AI-powered chatbot, achieving a high accuracy rate in understanding and responding to customer inquiries and reducing customer support response time.

Employed generative techniques for data augmentation in machine learning models, demonstrating how generative approaches can enhance model training by producing synthetic datasets, a practical application of AI within data science.

Used AI-driven content creation to significantly cut down on the resources typically needed for manual copywriting. This technology was instrumental in optimizing the retailer's digital marketing efforts, making online listings more attractive and accessible to potential buyers.

Experience working with key Terraform features such as Infrastructure as Code, execution plans, resource graphs, and change automation; extensively used Auto Scaling launch configuration templates for launching Amazon EC2 instances while deploying microservices.

Implemented cluster services using Kubernetes and Docker to manage local deployments, building a self-hosted Kubernetes cluster using Terraform and Ansible and deploying application containers.

Wrote automation scripts to create resources in the OpenStack cloud using Python and Terraform modules.

Accenture/Johnson Controls Inc. Sep’18-Jul’19

Milwaukee, WI

Lead AWS Data Engineer

Job Responsibilities:

Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.

As a big data engineer, ingested, validated, and transformed program files in an end-to-end data pipeline on AWS.

Wrote PySpark scripts to apply hard record-level checks on data and generate reports for end users, and automated them through Apache Airflow on AWS.
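
A simplified sketch of such record-level hard checks in PySpark; the input path, columns, rules, and report locations are hypothetical, and in practice a script like this would be scheduled as an Airflow task.

    # Illustrative record-level data quality checks in PySpark (identifiers are hypothetical).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("record_level_checks").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/incoming/orders/")

    checked = df.withColumn(
        "dq_errors",
        F.concat_ws(
            ";",
            F.when(F.col("order_id").isNull(), F.lit("missing order_id")),
            F.when(F.col("amount") < 0, F.lit("negative amount")),
            F.when(~F.col("country").isin("US", "CA", "GB"), F.lit("unknown country")),
        ),
    )

    # Rejected records become an end-user report; clean records continue downstream.
    checked.filter(F.col("dq_errors") != "") \
           .write.mode("overwrite").csv("s3://my-bucket/reports/orders_rejects/", header=True)
    checked.filter(F.col("dq_errors") == "").drop("dq_errors") \
           .write.mode("overwrite").parquet("s3://my-bucket/validated/orders/")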

Designed and implemented effective database solutions and models to store and retrieve data.

Built databases and table structures for web applications.

Prepared documentation and analytic reports, delivering summarized results, analysis and conclusions to stakeholders.

Developed, implemented and maintained data analytics protocols, standards, and documentation.

Analyzed complex data and identified anomalies, trends, and risks to provide useful insights to improve internal controls.

Contributed to internal activities for overall process improvements, efficiencies and innovation.

Explained data results and discussed how best to use data to support project objectives.

Communicated new or updated data requirements to global team.

Generated detailed studies on potential third-party data handling solutions, verifying compliance with internal needs and stakeholder requirements.

Designed advanced analytics ranging from descriptive to predictive models to machine learning techniques.

Designed compliance frameworks for multi-site data warehousing efforts to verify conformity with state and federal data security guidelines.

Monitored incoming data analytics requests and distributed results to support [Type] strategies.

Ran statistical analyses within software to process large datasets.

Analyzed large datasets to identify trends and patterns in customer behaviors.

Sourced data from Oracle and Teradata environments into the Big Data environment.

Utilized Salesforce with recursive AWS IDE instances for clients, using GitHub and Amazon services for stability.

Resolved technical issues during development, deployment, and support.

Interacted with clients to elicit architectural and non-functional requirements such as performance, scalability, reliability, availability, and maintainability.

Perform performance optimizations on HiveQL, Spark-SQL, PySpark Scripts

Worked on parallel processing of data using in-built functions within shell script in the UNIX environment

Worked on processing data with lookup data for different use cases using shell scripts in the UNIX environment.

Worked on a reporting script in the Python environment to process files based on different file-naming conventions.

Explored applications of data science algorithms on the big data platform within the Spark environment.

Developed predictive modeling algorithms to strengthen customer relationships and longevity and to personalize customer interactions, utilizing appropriate data mining and machine learning algorithms.

Implemented solutions using neural networks, support vector machines, clustering, classification, regression, and decision trees.

Implemented various random forest models, a recommendation engine, and time-series forecasting models to predict improvements in customer orders, lowering the average customer wait time during product development.

Migrated large datasets to Databricks (Spark); created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.

Migrated data through data pipelines using Databricks, Spark SQL, PySpark, and Scala.

Developed and implemented an algorithm that improved efficiency of data analysis processes, enabling faster insights generation and decision-making.

Collaborated with a team of researchers to explore and implement cutting-edge AI technologies, resulting in the development of a state-of-the-art AI model that outperformed existing models in image classification accuracy.

Trained AI models using large datasets and optimized training processes, increasing model accuracy and reducing training time.

Worked on a video synthesis project, creating conditional generative adversarial networks (cGANs) to generate realistic videos by altering scenes or creating entirely new visual content, showcasing the ability to apply generative models to dynamic, time-dependent data.

Utilized AWS for cloud-based innovations, including generative AI focused on improved vehicle safety and features. By combining AWS's computing power with generative AI, the manufacturer aims to create advanced, efficient driving aids; this partnership reflects BMW's commitment to using the latest technology for better, safer cars and is a step toward more intelligent automobiles.

Wrote Terraform for AWS Infrastructure as Code to build staging and production environments, and defined Terraform modules such as Compute, Network, Operations, and Users for reuse across environments.

Used Terraform to write Infrastructure as Code and created Terraform scripts for EC2 instances, Elastic Load Balancers, and S3 buckets.

Walmart Inc. Apr’18 - Aug’18

Bentonville, AR

Data Engineer

Job Responsibilities:

Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.

Designed compliance frameworks for multi-site data warehousing efforts to verify conformity with state and federal data security guidelines.

Generated detailed studies on potential third-party data handling solutions, verifying compliance with internal needs and stakeholder requirements.

Built databases and table structures for web applications.

Designed and implemented effective database solutions and models to store and retrieve data.

Prepared documentation and analytic reports, delivering summarized results, analysis and conclusions to stakeholders.

Explained data results and discussed how best to use data to support project objectives.

Planned and installed upgrades of database management system software to enhance database performance.

Designed and developed ETL integration patterns using Python on Spark.

Developed a framework for converting existing PowerCenter mappings to PySpark jobs.

Created PySpark jobs to bring data from DB2 to Amazon S3 (a brief sketch follows below).

Provided guidance to the development team working on PySpark as the ETL platform.

Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
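
As an illustrative sketch only, a PySpark job of the shape described in the DB2-to-S3 bullet above; the JDBC URL, credentials, table, and bucket are placeholders, and the DB2 JDBC driver jar is assumed to be available on the cluster.

    # Illustrative PySpark extract from DB2 over JDBC, landed in S3 as Parquet (placeholders throughout).
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("db2_to_s3")
             .config("spark.jars", "/opt/jars/db2jcc4.jar")   # DB2 JDBC driver
             .getOrCreate())

    db2_df = (spark.read.format("jdbc")
              .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
              .option("driver", "com.ibm.db2.jcc.DB2Driver")
              .option("dbtable", "SCHEMA1.CUSTOMERS")
              .option("user", "etl_user")
              .option("password", "***")
              .option("fetchsize", "10000")
              .load())

    # Stamp a load date and land the extract in S3 as partitioned Parquet for downstream ETL.
    (db2_df.withColumn("load_date", F.current_date())
           .write.mode("overwrite")
           .partitionBy("load_date")
           .parquet("s3a://my-bucket/raw/customers/"))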

Developed database architectural strategies at modeling, design and implementation stages to address business or industry requirements.

Documented and communicated database schemas using accepted notations.

Established and secured enterprise-wide data analytics structures.

Created tables and views in Teradata according to requirements.

Performed performance optimizations on Java/JVM frameworks and UNIX shell scripts.

Prepared estimations, release plans, and roadmaps for future releases.

Designed applications based on the identified architecture and supported implementation of the design by resolving complex technical issues faced by the IT project team during infrastructure setup, development, deployment, and support.

Involved in Spark Streaming, Spark SQL, Scala programming, RDD creation, and operations on DataFrames and Datasets.

Built and published customized reports and dashboards and scheduled reports using Tableau Server; created action filters, parameters, and calculated sets for dashboards.

Experience in designing data models in Cassandra and working with Cassandra Query Language.

Performed query and performance tuning on the cluster and suggested best practices for developers.

Worked closely on Cassandra loading activity for history and incremental loads from Teradata and Oracle databases, resolving loading issues and tuning the loader for optimal performance.

Implemented novel iterative development procedures on JupyterLab-based AI notebooks.

Developed a new data schema for the data consumption store for machine learning and AI models to shorten processing time, using SQL, Hadoop, and cloud services.

Part of an innovative and diverse team of data scientists, engineers, developers, and product managers working to strategically capitalize on a rich data ecosystem and leverage ML and AI technologies to better serve the customer.

Led multiple AI and machine learning programs within the product suite, including a predictive analytics service (baselining and forecasting of performance and security KPIs) and a security analytics / anomaly detection service (clustering of devices based on behavior over time), using NLP, LSTM, Kubeflow, Docker, AWS SageMaker, and AWS Greengrass.

Managed data storage and processing pipelines in GCP for serving AI and ML services in production, development, and testing using SQL, Spark, Python, and AI VMs.

Northern Trust Bank May’17 - Mar’18

Chicago, IL

Technology Lead

Job Responsibilities:

Designed documentation protocols and standard operating practices to unify technology management efforts across company.

Participated in verifying compliance with service level agreements by maintaining technical uptime levels.

Led technology governance efforts, planning upgrades, hardware refreshes and software updates.

Drafted technology buying budget plans, prioritizing vital purchases to provide maximal impact from tech spending.

Worked with the application team to develop and automate tasks for enhancing log functionality during data-log processing and improving search functionality.

Experience building and maintaining production web infrastructures with DevOps practices on the Amazon Web Services (AWS) cloud platform.

Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.

Optimized Hive queries using best practices and appropriate parameters, with technologies such as Hadoop, YARN, Python, and PySpark.

Developed Spark applications in Python (PySpark) in a distributed environment to load large numbers of CSV files with different schemas into Hive Parquet tables.
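
A condensed sketch of that pattern; the landing path, column handling, and Hive database/table are hypothetical, and with schemas differing across files some explicit casting or column alignment is usually needed.

    # Illustrative PySpark load of heterogeneous CSV files into a Hive-managed Parquet table.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("csv_to_hive_parquet")
             .enableHiveSupport()       # allows saveAsTable to write to the Hive metastore
             .getOrCreate())

    raw = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv("hdfs:///landing/sales/*.csv"))               # hypothetical landing path

    aligned = (raw.withColumn("amount", F.col("amount").cast("double"))   # example schema fix
                  .withColumn("ingest_date", F.current_date()))

    (aligned.write
            .mode("append")
            .format("parquet")
            .partitionBy("ingest_date")
            .saveAsTable("analytics.sales_parquet"))           # hypothetical Hive database.table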

Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.

Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.

Worked on installing and configuring Hortonworks HDP 2.x and Cloudera (CDH 5.5.1) clusters in dev and production environments.

Performed volumetric analysis for 43 feeds (current approximate data size: 70 TB), based on which the size of the production cluster was decided.

Ability to build complex Splunk infrastructure from scratch.

Wrote multiple Spark jobs to perform data quality checks on data before files were moved to the data processing layer.

Worked on capacity planning for the production cluster.

Installed HUE Browser.

Involved in loading data from UNIX file system to HDFS.

Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.

Tuning configuration files for optimized Splunk performance.

Worked on installation of Hortonworks 2.1 on Azure Linux servers.

Worked on cluster upgrades in Hadoop from HDP 2.1 to HDP 2.3.

Responsible for implementation and ongoing administration of Hadoop infrastructure

Managed and reviewed Hadoop log files.

Imported and exported data between relational databases such as MySQL and HDFS/HBase using Sqoop.

Imported data into Splunk through inputs.conf, props.conf, and transforms.conf.

Worked on indexing HBase tables using Solr, including indexing JSON and nested data.

Responsible for cluster maintenance, monitoring, commissioning and decommissioning DataNodes, troubleshooting, and managing and reviewing data backups and log files.

Day-to-day responsibilities included solving developer issues, performing deployments (moving code from one environment to another), providing access to new users, providing immediate solutions to reduce impact, documenting them, and preventing future issues.

Added, installed, and removed components through Ambari.

Experience working with Splunk current version 6.4.x as well as 4.x and 5.x.


