Feroz Khan
Email: ****.****@*****.***
Phone: 872-***-****
SUMMARY:
Over nine years of IT experience, currently working as a Data Engineer across SQL and Big Data workloads using the Hadoop ecosystem on both on-premises and cloud-based platforms.
Experience with all stages of the SDLC and the Agile development model, from requirement gathering through deployment and production support.
Skilled in working with Matillion ETL, Fivetran, and Snowflake to streamline data ingestion and transformation processes, with strong knowledge of SQL and scripting languages such as Python and Scala.
Proficient in building data pipelines using Python, PySpark, Kotlin and Scala.
Expertise in loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse.
Knowledgeable in Azure Big Data technologies, including Azure Data Lake Analytics and Azure Data Lake Store.
Hands-on experience automating pipeline schedules in Azure Data Factory and creating automated jobs using Airflow.
Customized Power BI reports, dashboards, and Azure Data Factory pipelines based on updated source data.
Extensive hands-on experience with AWS services such as EC2, S3, DynamoDB, RDS, Lambda, Kinesis and Redshift.
Competent in Unified Data Analytics using Databricks, including Delta Lake with Python and Spark SQL.
Proficient in Spark Architecture with Databricks and Structured Streaming.
Experience building RESTful HTTP services with Web API that communicate with UI components using JSON.
Skilled in multiple databases, including MySQL, Oracle, MS SQL Server, MongoDB and DynamoDB.
Developed Snowflake Schemas and normalized dimension tables, creating a Sub Dimension named Demographic.
Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
Strong expertise in Spark and Scala APIs, developing customized UDFs in Python, and converting Hive/SQL queries into Spark transformations.
Proficient in creating and managing complex ETL workflows using Matillion, integrating them into CI/CD pipelines, and supporting both batch and streaming data loads.
Good exposure to Python programming.
Experience in the layers of the Hadoop framework: storage (HDFS), analysis (Pig and Hive), and engineering (jobs and workflows), extending functionality by writing custom UDFs.
Developed and deployed a variety of AWS Lambda functions using the built-in AWS Lambda libraries, as well as Lambda functions written in Scala using custom libraries.
In-depth knowledge of Snowflake Database and Table structures.
Hands-on experience in Azure cloud services: Azure Synapse Analytics, SQL Azure, Data Factory, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
Excellent experience with AWS cloud services such as Amazon Redshift and AWS Data Pipeline.
Skilled in installing, configuring, and using Apache Hadoop ecosystem components such as MapReduce, Hive, Pig, Sqoop, Spark, Kafka, and Oozie.
Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
Experience in dimensional and relational data modeling using ER/Studio, Erwin, and Sybase PowerDesigner.
Expert in writing SQL queries and optimizing the queries in Oracle, SQL Server, and Teradata.
Working experience in PL/SQL, SQL*Plus, Stored Procedures, Functions & Packages.
Excellent technical and analytical skills with a clear understanding of design goals for OLTP development and dimensional modeling for OLAP.
In-depth knowledge of data warehousing concepts, with emphasis on ETL and ELT.
Excellent Programming skills at a higher level of abstraction using Scala and Python.
Good working experience with Spark (Spark Streaming, Spark SQL) and Scala.
Experience in data transformation, data mapping from source to target database schemas, and data cleansing procedures.
Experience in cloud computing on Google Cloud Platform with technologies such as Dataflow, Pub/Sub, BigQuery, and related tools.
Strong experience with big data processing using Hadoop technologies: MapReduce, Apache Spark, Hive, and Pig.
Excellent knowledge of, and extensive experience with, NoSQL databases (HBase).
Strong experience in Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export.
Able to bring the best of GCP technologies and leverage open-source tools such as Terraform and Chef recipes to build solutions.
Expertise in telecom data sets such as PCRF, IN events, PRM, billing, CRM, CBE, and OCS sources.
TECHNICAL SKILLS:
Big Data tools: Hadoop 3.1, HDFS, Hive 3.2, Kafka 2.8, Scala, Oozie, Pig 0.17, HBase 2.4, Sqoop 1.4, AWS, Flink, Apache Spark, Cloud Dataflow, Apache Beam, BigQuery, Airflow, DataStage, TIDAL, Bigtable, Data Lake, Pub/Sub, NiFi.
Data Modeling Tools: Erwin 9.8/9.7, ER/Studio V17, PowerDesigner; real-time and batch processing with Cloud Dataflow, Apache Beam, and Apache Spark.
Cloud Services: Microsoft Azure (Azure DevOps, Azure SQL, Azure Synapse, Azure Data Lake, Azure Data Factory), GCP (BigQuery), AWS (EC2, S3, RDS, SQS, Kinesis, Amazon Redshift, DynamoDB, Lambda).
Programming Languages: Python, Scala, Java, SQL, PL/SQL, T-SQL, UNIX shell scripting.
Project Execution Methodologies: JAD, Agile, SDLC, Waterfall, and RAD
Database Tools: Amazon RDS, Oracle 12c/11g, Teradata 15/14, MDM.
Reporting tools: SQL Server Reporting Services (SSRS), Tableau, Crystal Reports, Strategy, Business Objects
ETL Tools: SSIS, Fivetran, Informatica v10, Matillion, Apache Beam, Cloud Dataflow, Airflow.
Operating Systems: Microsoft Windows 10/8/7, UNIX
Domains worked on: Telecom data sets such as PCRF, PRM, Billing, IN Events, OCS, CRM, CBE, and Franchisee management.
EDUCATION:
Bachelor of Technology from Osmania University in 2015
WORK EXPERIENCE
Client: Caterpillar Inc, Deerfield, Illinois Dec 2022 – Current
Sr Data Engineer
Responsibilities:
Working as a Big Data Engineer to extract historical and real-time data using Hadoop, MapReduce, and HDFS.
Extensively Involved in Big data pipeline development and maintenance.
Worked in SCRUM (Agile) development environment with tight schedules.
Involved in provisioning, installing, configuring, monitoring, and maintaining HDFS, Sqoop, Spark, Kafka, Oozie, Pig and Hive.
Designed and developed ETL pipelines using Matillion ETL to extract, load, and transform data into Snowflake, ensuring optimized job design for scalability.
Maintained complex Matillion Orchestration and Transformation Jobs, leveraging Python scripts, SQL components, and cloud storage stages.
Ensured ETL jobs succeeded and data loaded successfully into Snowflake.
Designed and implemented database solutions in Azure SQL, Azure Data Lake, Azure Data Factory, and Azure Synapse Analytics.
Used DataStage as an ETL tool to extract data from source systems and load it into an Oracle database.
Designed and developed DataStage jobs to extract data from heterogeneous sources, applied transformation logic to the extracted data, and loaded it into data warehouse databases.
Developed a data pipeline using Kafka to store data into HDFS (illustrative sketch at the end of this section).
Imported data from different sources such as HDFS into Spark RDDs, loaded files to HDFS, and wrote Hive queries to process the required data.
Independently coded new programs and designed tables to load and test them effectively for the given POCs using Big Data/Hadoop.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (illustrative DAG sketch at the end of this section).
Developed Python scripts to clean the raw data.
Developed PySpark jobs to transform data from one data lake to another.
Developed and deployed outcomes using Spark and Scala code on a Hadoop cluster running on GCP.
Designed and developed continuous deployment pipelines (CI/CD) in Azure DevOps.
Created and maintained data pipelines using Matillion ETL and Fivetran.
Evaluated Fivetran and Matillion for streaming and batch data ingestion into Snowflake.
Worked on building an Aptitude Operational Data Store (ODS) model in an Oracle Exadata database.
Used Azure Data Factory to create pipelines that transfer data across systems.
Developed various ETL processes for complete end-to-end data integration.
Integrated visualizations into a Spark application using Databricks.
Managed, configured, and scheduled resources across the cluster using Azure Kubernetes Service.
Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop and placed them in HDFS.
Designed, built, and managed ELT data pipelines, leveraging Airflow, Python, dbt, Stitch Data, and GCP solutions.
Modeled, lifted, and shifted custom SQL and transposed LookML into dbt to materialize incremental views.
Developed automated jobs to migrate SVN repos to multiple GitHub repos with history and created a one-way bridge to update Git contents.
Involved in designing the data model and creating the schema on SQL Azure.
Migrated on-premises Oracle ETL process to Azure Synapse Analytics.
Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports by our BI team.
Architected and implemented scalable data pipelines using Google Cloud Dataflow and Apache Beam, automating data processing and transformations across multiple datasets and sources.
Led the migration of batch data processing workflows from legacy systems to Cloud Dataflow, reducing processing times by 30%.
Integrated Matillion ETL pipelines into Azure DevOps CI/CD process for streamlined development and release management.
Collaborated with analytics teams to implement incremental load strategies and reusable job templates using Matillion variables and parameterization.
Installed and configured Hadoop Map Reduce and HDFS.
Automated various data extraction, transformation, and loading tasks with Python.
Designed and customized data models for the data warehouse, supporting data from multiple sources in real time.
Worked on moving data from Teradata to Snowflake for consumption in Databricks.
Involved in designing Azure Data factory/Data Lake solutions.
Performed job functions using Spark APIs in Scala for real time analysis and for fast querying purposes.
Designed and maintained GIT Repositories, views, and access control strategies.
Worked on complex SnowSQL and Python queries in Snowflake.
Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
Built and maintained the Hadoop cluster on AWS EMR and used AWS services like EC2 and S3 for small data set processing and storage.
Created BigQuery authorized views for row level security or exposing the data to other teams.
Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
Involved in performance tuning and monitoring of T-SQL blocks.
Designed & developed various departmental reports by using SAS, SQL, PL/SQL, and MS Excel.
Designed and implemented ETL pipelines from Snowflake to the data warehouse using Apache Airflow.
Developed AWS data pipelines from various data sources in AWS, including AWS API Gateway to receive responses from AWS Lambda, retrieve data, convert responses into JSON format, and store them in AWS Redshift.
Developed scalable AWS Lambda code in Python to convert, compare, and sort nested JSON files (illustrative handler sketch at the end of this section).
Worked with Azure Blob Storage usage and handling files.
Environment: Hadoop 3.1, Agile, Spark 3.0, Python, Azure DevOps, CI/CD, Kafka 2.8, Oozie, Pig 0.17, Hive 3.2, Snowflake DB, EC2, S3, HDFS, PySpark, AWS, BI, ETL, MongoDB, SQL, Airflow, Kinesis, DataStage 8.1, PL/SQL, Azure Synapse, Fivetran, Oracle, Scala, Tableau, T-SQL, GIT.
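Illustrative sketch for the Kafka-to-HDFS pipeline bullet above: a minimal PySpark Structured Streaming job, not the production code; the broker address, topic name, and HDFS paths are hypothetical placeholders.

    # Minimal PySpark Structured Streaming sketch: read events from a Kafka topic and land them in HDFS.
    # Broker, topic, and path values below are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers key/value as binary; cast the payload to strings before persisting.
    events = raw.select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()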
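Illustrative sketch for the Airflow-on-GCP bullet above: a minimal daily DAG chaining extract, transform, and load tasks with core Airflow operators, assuming Airflow 2.x; the bucket, dataset, and table names are hypothetical.

    # Minimal Airflow DAG sketch for a daily ETL job on GCP; task names and commands are illustrative.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def transform(**context):
        # Placeholder transformation step; the real jobs called project-specific logic here.
        print("transforming extract for", context["ds"])


    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract = BashOperator(
            task_id="extract",
            bash_command="gsutil cp gs://source-bucket/extract_{{ ds }}.csv /tmp/",
        )
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load = BashOperator(
            task_id="load",
            bash_command="bq load --autodetect analytics.daily_events /tmp/extract_{{ ds }}.csv",
        )

        extract >> transform_task >> load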
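Illustrative sketch for the nested-JSON Lambda bullet above: a minimal Python handler that flattens nested records and returns them sorted; the event shape (a "records" key) is a hypothetical assumption, not the actual payload format.

    # Minimal AWS Lambda handler sketch: flatten nested JSON records before downstream loading.
    # The event shape and key names are illustrative.
    import json


    def flatten(record, parent_key="", sep="."):
        # Recursively flatten nested dictionaries into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}.
        items = {}
        for key, value in record.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key, sep))
            else:
                items[new_key] = value
        return items


    def lambda_handler(event, context):
        records = event.get("records", [])
        flattened = [flatten(r) for r in records]
        # Sort by a stable key so downstream comparisons are deterministic.
        flattened.sort(key=lambda r: json.dumps(r, sort_keys=True))
        return {"statusCode": 200, "body": json.dumps(flattened)}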
Client: UBS, Moline, Illinois August 2019 – Nov 2022
Data Analyst / Data Engineer
Responsibilities:
Worked as a Data Modeler/Data Engineer to generate data models using Erwin and developed relational database systems.
Designed the Redshift data model and performed Redshift performance analysis and improvements.
Performed the Data Mapping, Data design (Data Modeling) to integrate the data across the multiple databases into EDW.
Used the DataStage Designer to develop processes for extracting, cleansing, transforming, integrating and loading data into staging tables.
Used the DataStage Director and its run-time engine to schedule and run the solution, test and debug its components, and monitor the resulting executables on an ad hoc or scheduled basis.
Created a shell script to run DataStage jobs from UNIX and scheduled the script through a scheduling tool.
Developed robust Matillion ETL jobs to support batch processing into Amazon Redshift and Snowflake, optimizing transformation logic for high-volume datasets.
Automated data quality validation steps using Python within Matillion Transformation Components.
Optimized pipeline for performance and cost, leveraging dynamic work rebalancing in Cloud Dataflow and applying windowing strategies in Apache Beam.
Developed a real-time data processing pipeline using Apache Beam to stream data from GCP Pub/Sub to BigQuery (illustrative pipeline sketch at the end of this section).
Implemented End to End solution for hosting the web application on AWS cloud with integration to S3 buckets.
Worked on AWS CLI Auto Scaling and CloudWatch monitoring creation and updates.
Allotted permissions, policies, and roles to users and groups using AWS Identity and Access Management (IAM) (illustrative boto3 sketch at the end of this section).
Developed scripts to load incoming telecom datasets into base tables and then process them.
Conducted data modeling JAD sessions and communicated data-related standards.
Created scripts for importing data into HDFS/Hive using Sqoop.
Used ER/Studio; many-to-many relationships between entities were resolved using associative tables.
Defined cubes, dimensions, user hierarchies, aggregations in an OLAP environment.
Developed normalized Logical and Physical database models for designing an OLTP application.
Developed SQL scripts for creating tables, Sequences, Triggers, views, and materialized views.
Designed and Developed PL/SQL procedures, functions, and packages to create Summary tables.
Created Teradata objects like Tables and Views.
Utilized Cloud Dataflow to manage and scale the streaming pipeline, reducing latency and increasing throughput for real-time analytics.
Used Jenkins pipelines to drive all micro-services builds out to the Docker registry and then deployed to Kubernetes, Created Pods and managed using Kubernetes.
Designed both 3NF data models and dimensional data models using Star and Snowflake schemas.
Created Rich dashboards using Tableau Desktop and prepared user stories to create compelling dashboards to deliver actionable insights.
Created RESTful services using ASP.NET Core Web API for the Employee Travel application module, and created reusable TypeScript components and services to consume REST APIs using the component-based architecture provided by Angular 2 for a dependent project.
Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
Designed data load workflows using Matillion Task Scheduler, managing dependencies and alert configurations for business-critical data pipelines.
Developed Matillion metadata-driven jobs to dynamically load various source files using parameterized variables and external configuration tables.
Supported BI developers by aligning Matillion ETL outputs with downstream Power BI and SSRS reporting systems.
Collaborated with DevOps to version and deploy Matillion jobs across environments using Git and custom Jenkins pipelines.
Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
Designed and developed Oracle database tables, views, and indexes with proper privileges, and maintained the database by deleting and removing old data.
Created mapping tables to find out the missing attributes for the ETL process.
Gained experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Designed and developed conceptual, logical, and physical data models to meet reporting needs.
Implemented a proof of concept deploying the product on Amazon Web Services (AWS).
Involved in Normalization/DE-Normalization techniques for optimum performance in relational and dimensional database environments.
Developed Pig scripts to parse the raw data, populate staging tables and store the refined data.
Developed various T-SQL stored procedures, triggers, views and adding/changing tables for data load, transformation, and extraction.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Worked on reverse-engineering data models from database instances and scripts.
Worked closely with SSIS and SSRS developers to explain complex data transformation logic.
Environment: ER/Studio, OLAP, OLTP, SQL, PL/SQL, HDFS, Hive, Sqoop, Oracle, Fivetran, Teradata, Kubernetes, Map Reduce, ETL, AWS, Pig, T-SQL, SSIS, SSRS.
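Illustrative sketch for the Pub/Sub-to-BigQuery bullet above: a minimal Apache Beam (Python SDK) streaming pipeline with fixed windowing; the project, subscription, and table identifiers are hypothetical placeholders.

    # Minimal Apache Beam streaming sketch: Pub/Sub -> parse JSON -> 60-second fixed windows -> BigQuery.
    # Project, subscription, and table names are illustrative placeholders.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms.window import FixedWindows


    def run():
        options = PipelineOptions()
        options.view_as(StandardOptions).streaming = True

        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                    subscription="projects/my-project/subscriptions/events-sub")
                | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
                | "Window" >> beam.WindowInto(FixedWindows(60))
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    "my-project:analytics.events",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
            )


    if __name__ == "__main__":
        run()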
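Illustrative sketch for the IAM bullet above: a minimal boto3 script that attaches a managed policy to a group and adds a user to it; the group name, user name, and policy ARN shown are hypothetical examples (the ARN is a standard AWS managed policy).

    # Minimal boto3 IAM sketch: grant a group S3 read-only access and add a user to the group.
    import boto3

    iam = boto3.client("iam")

    GROUP = "data-engineers"
    POLICY_ARN = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"

    # Attach an AWS managed policy to the group.
    iam.attach_group_policy(GroupName=GROUP, PolicyArn=POLICY_ARN)

    # Add an existing user to the group so the permissions apply to them.
    iam.add_user_to_group(GroupName=GROUP, UserName="analyst1")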
Client: Macys, Atlanta GA Jan 2018 – July 2019
Data Analyst/Data Modeler
Responsibilities:
Worked on Data Analysis, Data Modeling and Data Profiling identifying Data Sets, Source Data, Source Data Definitions and Data Formats.
Created physical and logical data models using Power Designer.
Conducted JAD sessions with management, vendors, users, and other stakeholders for open and pending issues to develop specifications.
Created customized reports using OLAP tools such as Crystal Reports for business use.
Used the Python library Beautiful Soup for web scraping to extract data for building graphs (illustrative sketch at the end of this section).
Involved in extensive data analysis on Teradata and Oracle systems, querying and writing SQL.
Designed and developed a business intelligence dashboard using Tableau Desktop, giving executives visibility into cross-business processes.
Implemented ETL pipelines using Matillion to migrate legacy workloads into a modern Snowflake-based data warehouse.
Developed reusable Matillion job templates to process and standardize diverse retail data sources including CRM and Inventory.
Designed the dimensional Data Model of the data warehouse.
Extensively involved in writing PL/SQL, stored procedures, functions, and packages.
Interacted with business users to understand the business requirements.
Enforced referential integrity in the OLTP data model for consistent relationship between tables and efficient database design.
Worked on developing tools that automate AWS server provisioning, application deployments, and basic cross-region failover through AWS SDKs.
Created AWS Lambda functions, provisioned EC2 instances in the AWS environment, implemented security groups, and administered Amazon VPCs.
Worked with data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
Used Python scripts to update content in the database and manipulate files.
Created SSIS Packages using various Transformations.
Led performance tuning efforts within Matillion, optimizing transformation components and leveraging parallel execution strategies.
Integrated Matillion with Tableau for automated data refresh in executive dashboards.
Used Star/Snowflake schemas in the data warehouse architecture.
Developed dashboards using Tableau Desktop.
Created reports using SQL Reporting Services (SSRS) for customized and ad-hoc Queries.
Created partitions and indexes for the tables in the data mart.
Developed database triggers and stored procedures using T-SQL cursors and tables.
Worked on Teradata and its utilities (TPump, FastLoad) through Informatica.
Loaded multi-format data from various sources such as flat files, Excel, and MS Access, and performed file system operations.
Wrote simple and advanced SQL queries and scripts to create standard and ad hoc reports for senior managers.
Environment: Power Designer, OLAP, BI, Python, AWS, SQL, PL/SQL, OLTP, SSIS, Tableau, Oracle, SSRS, MS Access, Informatica, Teradata, Snowflake, T-SQL.
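Illustrative sketch for the Beautiful Soup bullet above: a minimal scraping script that pulls label/value pairs from an HTML table for graphing; the URL and table structure are hypothetical, not the actual source used on the project.

    # Minimal web-scraping sketch: fetch a page and extract (label, value) pairs from its table rows.
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/prices"  # placeholder source page

    response = requests.get(URL, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the first two cells of each table row for later graphing.
    rows = []
    for tr in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append((cells[0], cells[1]))

    print(rows)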
Client: DXC Technology, Ashburn, VA Sep 2016 – Dec 2017
Data Analyst/Data Modeler
Responsibilities:
Worked on Data Analysis, Data Modeling and Data Profiling identifying Data Sets, Source Data, Source Data Definitions and Data Formats.
Created physical and logical data models using Power Designer.
Conducted JAD sessions with management, vendors, users, and other stakeholders for open and pending issues to develop specifications.
Created customized reports using OLAP tools such as Crystal Reports for business use.
Used the Python library Beautiful Soup for web scraping to extract data for building graphs.
Involved in extensive data analysis on Teradata and Oracle systems, querying and writing SQL.
Designed and developed a business intelligence dashboard using Tableau Desktop, giving executives visibility into cross-business processes.
Designed the dimensional Data Model of the data warehouse.
Extensively involved in writing PL/SQL, stored procedures, functions, and packages.
Interacted with business users to understand the business requirements.
Enforced referential integrity in the OLTP data model for consistent relationship between tables and efficient database design.
Worked with data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
Used Python scripts to update content in the database and manipulate files (illustrative sketch at the end of this section).
Created SSIS Packages using various Transformations.
Used Star/Snowflake schemas in the data warehouse architecture.
Developed dashboards using Tableau Desktop.
Used DataStage as an ETL tool to extract data from source systems and load it into an Oracle database.
Created reports using SQL Reporting Services (SSRS) for customized and ad-hoc Queries.
Created partitions and indexes for the tables in the data mart.
Developed database triggers and stored procedures using T-SQL cursors and tables.
Worked on Teradata and its utilities (TPump, FastLoad) through Informatica.
Loaded multi-format data from various sources such as flat files, Excel, and MS Access, and performed file system operations.
Wrote simple and advanced SQL queries and scripts to create standard and ad hoc reports for senior managers.
Environment: Power Designer, OLAP, BI, Python, SQL, PL/SQL, OLTP, SSIS, Tableau, Oracle, SSRS, MS Access, Snowflake, Informatica, Teradata, T-SQL.
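Illustrative sketch for the Python database-update bullet above, assuming an Oracle target via cx_Oracle; the connection string, table, and column names are hypothetical placeholders.

    # Minimal sketch: apply a batch of status updates to an Oracle table from Python.
    import cx_Oracle

    updates = [
        ("ACTIVE", 1001),
        ("INACTIVE", 1002),
    ]

    # Placeholder credentials and DSN; real values came from secured configuration.
    conn = cx_Oracle.connect("app_user", "app_password", "dbhost/ORCLPDB1")
    try:
        cursor = conn.cursor()
        # Batch the updates so a large file of changes is applied efficiently.
        cursor.executemany(
            "UPDATE customer_profile SET status = :1 WHERE customer_id = :2",
            updates,
        )
        conn.commit()
    finally:
        conn.close()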