Senior Data Engineer

Location: North Richland Hills, TX
Salary: 130,000
Posted: January 30, 2025

Resume:

Name: Ranjan Mishra, Senior Data Engineer

Email: ******.*****@*****.***

Phone: 469-***-****

Professional Summary:

8+ years of experience in Data Engineering, spanning data pipeline design, development, and implementation in roles such as Data Engineer, Data Developer, and Data Modeler.

Strong understanding of the Software Development Life Cycle (SDLC), with expertise in requirements analysis, design specification, and testing using both Waterfall and Agile methodologies.

Experience with the Apache Spark ecosystem, including Spark Core, Spark-SQL, and Spark Streaming, for building distributed, high-performance systems using Scala and PySpark.

Proficient in developing distributed applications and optimizing big data processing through techniques such as broadcasting, shuffle parallelism, caching/persisting DataFrames, and cluster resource tuning.
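
A minimal PySpark sketch of the tuning patterns above (broadcast join, caching, shuffle parallelism); the paths, columns, and partition count are hypothetical.

```python
# Minimal sketch: broadcast a small lookup table, cache a reused DataFrame,
# and tune shuffle parallelism. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-tuning-sketch").getOrCreate()

# Shuffle parallelism can be adjusted per workload.
spark.conf.set("spark.sql.shuffle.partitions", "200")

events = spark.read.parquet("/data/events")    # large fact table
regions = spark.read.parquet("/data/regions")  # small dimension table

# Broadcasting the small side avoids shuffling the large table.
joined = events.join(F.broadcast(regions), on="region_id", how="left")

# Cache a DataFrame that feeds several downstream aggregations.
joined.cache()
daily = joined.groupBy("event_date").count()
by_region = joined.groupBy("region_name").agg(F.sum("amount").alias("total_amount"))
```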

Migrated data seamlessly between HDFS and relational database systems using Sqoop, with comprehensive knowledge of Hadoop architecture, including HDFS, MapReduce, and its ecosystem components.

Hands-on expertise with Kafka, designing producers and consumers to stream millions of real-time events with Spark Streaming and Python.
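
An illustrative sketch of this Kafka pattern, assuming kafka-python on the producer side and Spark Structured Streaming (with the spark-sql-kafka connector available) on the consumer side; the topic, schema, and paths are hypothetical.

```python
# Producer side (kafka-python): publish JSON events to a hypothetical topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("usage-events", {"user_id": 42, "bytes_used": 1024})
producer.flush()

# Consumer side: Spark Structured Streaming reading the same topic into Parquet.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([
    StructField("user_id", LongType()),
    StructField("bytes_used", LongType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "usage-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/data/usage-events")
    .option("checkpointLocation", "/checkpoints/usage-events")
    .start()
)
```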

Extensive use of NoSQL databases, including MongoDB, Cassandra, and HBase, for scalable data solutions.

Experienced in leveraging Azure services like Data Factory, Data Lake Gen2, Synapse Analytics, Cosmos DB, Stream Analytics, Blob Storage, and Databricks for end-to-end data solutions.

Proficient in AWS Cloud (EMR, Redshift, EC2) and GCP (BigQuery, Cloud Storage, Cloud Dataflow, Dataproc) for modern cloud-based architectures.

Hands-on expertise in Snowflake for data warehouse solutions, including data extraction, automated ETL processes, and data loading from Data Lakes.

Designed and implemented dimensional data models, including Star Schema and Snowflake Schema, for Data Warehouses, OLAP, and Operational Data Stores (ODS).

Developed and managed ETL pipelines with tools like Apache Airflow, Matillion, dbt, Talend, and Fivetran to automate and transform data workflows.

Skilled in analyzing and manipulating large structured and unstructured datasets using Pandas, NumPy, and SQLAlchemy, extracting actionable insights.
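
A minimal sketch of this kind of analysis with Pandas and SQLAlchemy; the connection string, table, and columns are hypothetical.

```python
# Minimal sketch: read a table through SQLAlchemy and summarize it with Pandas.
# Connection string, table, and columns are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/analytics")
orders = pd.read_sql("SELECT customer_id, order_date, amount FROM orders", engine)

# Aggregate each customer's spend by month for downstream reporting.
orders["order_date"] = pd.to_datetime(orders["order_date"])
monthly = (
    orders.set_index("order_date")
    .groupby("customer_id")["amount"]
    .resample("M")
    .sum()
    .reset_index()
)
print(monthly.head())
```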

Built and deployed machine learning pipelines, including feature engineering, training, scoring, and evaluation, using Spark ML libraries.
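
A minimal Spark ML pipeline sketch covering feature assembly, training, scoring, and evaluation; the inline training data and column names are hypothetical.

```python
# Minimal sketch of a Spark ML pipeline: assemble features, train, score, evaluate.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Hypothetical training data: tenure, monthly usage, and a binary churn label.
df = spark.createDataFrame(
    [(12, 55.0, 0.0), (3, 80.5, 1.0), (24, 20.0, 0.0), (6, 95.0, 1.0)],
    ["tenure_months", "monthly_usage", "label"],
)

assembler = VectorAssembler(inputCols=["tenure_months", "monthly_usage"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# A real workflow would hold out a test split; scoring on the training set here
# keeps the toy example short.
model = pipeline.fit(df)
scored = model.transform(df)

auc = BinaryClassificationEvaluator(labelCol="label").evaluate(scored)
print(f"AUC: {auc:.3f}")
```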

Proficient in creating insightful, interactive visualizations using Tableau and Power BI, developing advanced charts like line plots, scatter plots, and time series visualizations.

Strong coding skills in SQL, including advanced stored procedures, triggers, functions, and complex queries in Oracle, T-SQL, and PL/SQL.

Experienced in tools like Git, Jenkins, and Enterprise GitHub for source code and build management.

Knowledgeable in metadata management with platforms like Azure Purview and Looker for crafting intuitive business dashboards.

Strong communication skills, work ethics, and leadership abilities, ensuring seamless collaboration and delivering high-quality solutions in cross-functional teams.

Technical Skills:

Databases

Snowflake, AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL.

NoSQL Databases

MongoDB, Apache HBase, and Apache Cassandra.

Programming Languages

Python, Scala, MATLAB, R.

Cloud Technologies

Azure, AWS, GCP

Data Formats

CSV, JSON, Parquet

Querying Languages

SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server

CI/CD Integration Tools

Jenkins, Git

Scalable Data Tools

Apache Spark, Hadoop, Hive, MapReduce, Apache Kafka, Spark Streaming, Sqoop

Operating Systems

Linux (Red Hat), Unix, Windows, macOS

Visualization Tools

Tableau, Power BI, Matplotlib

Professional Experience:

Client: Verizon, Dallas, TX Sep 2022 – Present

Role: Senior Data Engineer

Responsibilities:

Gathered and defined business requirements with users, analyzing potential technical solutions.

Developed Spark applications using Spark-SQL in Azure Databricks for data extraction, transformation, and aggregation across multiple file formats to uncover customer usage insights.
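
An illustrative Spark-SQL sketch of this kind of multi-format extraction and aggregation; the mount paths, columns, and query are hypothetical stand-ins, not the actual Databricks jobs.

```python
# Minimal sketch: combine JSON and Parquet sources and aggregate with Spark-SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-insights-sketch").getOrCreate()

# Hypothetical source paths and columns.
usage_json = spark.read.json("/mnt/raw/usage_json/")
usage_parquet = spark.read.parquet("/mnt/raw/usage_parquet/")

usage_json.createOrReplaceTempView("usage_json")
usage_parquet.createOrReplaceTempView("usage_parquet")

insights = spark.sql("""
    SELECT customer_id,
           COUNT(*)        AS sessions,
           SUM(bytes_used) AS total_bytes
    FROM (
        SELECT customer_id, bytes_used FROM usage_json
        UNION ALL
        SELECT customer_id, bytes_used FROM usage_parquet
    ) u
    GROUP BY customer_id
""")

insights.write.mode("overwrite").parquet("/mnt/curated/usage_insights/")
```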

Designed and developed Scala workflows to pull data from cloud-based systems and apply transformations.

Managed large datasets using Spark in-memory capabilities, partitions, broadcasts, and optimized joins during ingestion.

Migrated MapReduce programs to Spark using Scala and PySpark, improving processing efficiency.

Built PySpark jobs running on Kubernetes for faster data processing and used Kafka pipelines for real-time data ingestion into HDFS.

Developed RESTful APIs in Python to track revenue and perform revenue analysis.

Created visualizations in Tableau and Power BI, including dashboards, pie charts, and heat maps for business reporting.

Designed and maintained data pipelines using Azure Data Factory and Azure Databricks, enabling seamless and efficient data processing and analysis workflows.

Utilized Fivetran for automatic data integration from third-party sources into Snowflake, ensuring smooth and efficient data workflows.

Leveraged dbt to build data models and run transformations in Snowflake, enhancing reporting accuracy and performance.

Developed real-time data processing jobs and pipelines using Spark Streaming with Kafka for dynamic data ingestion.

Managed and maintained user accounts, roles, and permissions across Jira, MySQL, and the production and staging servers.

Engaged in data architecture tasks such as data profiling, data analysis, data mapping, and the creation of architecture artifacts.

Designed schemas and data models for Snowflake to support advanced analytical queries and reporting capabilities.

Migrated data from AWS Redshift to partitioned datasets on AWS S3 to enhance data organization and accessibility.
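
A minimal sketch of such a migration using a Spark JDBC read and a partitioned Parquet write to S3; it assumes the Redshift JDBC driver is on the classpath, and the URL, table, credentials, and partition columns are hypothetical.

```python
# Minimal sketch: extract a Redshift table and land it as partitioned Parquet on S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-to-s3-sketch").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://cluster.example.com:5439/analytics")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Partitioning by date columns keeps downstream scans pruned to what they need.
(
    orders.write.mode("overwrite")
    .partitionBy("order_year", "order_month")
    .parquet("s3a://analytics-lake/orders/")
)
```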

Built Kafka Producers and Consumers based on software requirements to handle real-time data processing.

Utilized Kafka’s publish-subscribe model to track real-time data events and trigger processes for data orchestration workflows.

Installed and configured Apache Airflow for S3 buckets and Snowflake, creating DAGs to automate workflows and data processing.
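
A minimal Airflow 2.x DAG sketch of this kind of automation, assuming snowflake-connector-python; the DAG id, stage, table, and credentials are hypothetical.

```python
# Minimal sketch: a daily Airflow DAG that copies staged S3 files into Snowflake.
from datetime import datetime

import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_s3_stage_into_snowflake():
    # Hypothetical connection details; real credentials belong in a secrets backend.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
    )
    try:
        conn.cursor().execute(
            "COPY INTO raw.events FROM @raw.s3_events_stage FILE_FORMAT = (TYPE = PARQUET)"
        )
    finally:
        conn.close()


with DAG(
    dag_id="s3_to_snowflake_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_task = PythonOperator(
        task_id="copy_staged_files",
        python_callable=load_s3_stage_into_snowflake,
    )
```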

Developed optimized MapReduce programs for filtering and cleaning unstructured data and pre-processing large datasets on Hortonworks.

Conducted ETL testing activities, transforming and uploading data into Data Warehouse servers.

Automated and monitored pipelines using Apache Airflow, scheduling Hive and Snowflake operations.

Optimized Snowflake data models and queries for cost efficiency.

Designed 3NF and dimensional models (Star and Snowflake schemas) for ODS and OLTP systems.

Ingested and processed data in Azure Databricks using Azure Data Factory, T-SQL, Spark-SQL, and U-SQL for Azure Data Lake Analytics.

Managed Azure resources, including SQL Database, Azure Analysis Services, and Azure Data Warehouse.

Used Git for version control and Jira for task management in an Agile environment.

Created and optimized PL/SQL stored procedures for data transformation.

Environment: Spark, Scala, PySpark, MapReduce, Python, Azure, Docker, Kubernetes, RESTful APIs, HDFS, Tableau, Snowflake, Apache Kafka, Hive, Git, Jira, Apache Airflow, Power BI, ETL, Agile, SQL.

Client: American Airlines, Dallas, TX June 2020 – August 2022

Role: Data Engineer

Responsibilities:

Developed Spark jobs for data cleaning and ingestion into Hive tables for analytical purposes, handling custom file formats using Custom Input Formats.

Processed streaming and batch data using Spark, PySpark, and integration with NoSQL databases for large-scale data handling.

Designed and implemented scalable ETL pipelines for data cleansing, validation, and loading, including Slowly Changing Transformations for maintaining historical data in a Data Warehouse.
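
A minimal PySpark sketch of a Type 2 slowly changing dimension update of the kind described above; the inline rows and columns are hypothetical, and the final union with unchanged rows is omitted for brevity.

```python
# Minimal Type 2 SCD sketch: expire changed rows and open new current versions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

# Hypothetical current dimension rows and an incoming batch of source changes.
current_dim = spark.createDataFrame(
    [(1, "12 Elm St", True, "2023-01-01", None),
     (2, "9 Oak Ave", True, "2023-01-01", None)],
    "customer_id INT, address STRING, is_current BOOLEAN, start_date STRING, end_date STRING",
)
incoming = spark.createDataFrame(
    [(1, "77 Pine Rd"), (2, "9 Oak Ave")],
    "customer_id INT, address STRING",
)

# Rows whose tracked attribute changed in the incoming batch.
changed = (
    current_dim.alias("cur")
    .join(incoming.alias("new"), F.col("cur.customer_id") == F.col("new.customer_id"))
    .where(F.col("cur.is_current") & (F.col("cur.address") != F.col("new.address")))
)

# Close out the old versions of the changed rows.
expired = (
    changed.select("cur.*")
    .withColumn("is_current", F.lit(False))
    .withColumn("end_date", F.current_date().cast("string"))
)

# Open new current versions carrying the updated attributes.
new_versions = (
    changed.select("new.*")
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date().cast("string"))
    .withColumn("end_date", F.lit(None).cast("string"))
)

# A full refresh would union the unchanged rows with `expired` and `new_versions`.
```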

Created Scala projects with sbt and executed them using spark-submit, reading multiple data formats from HDFS for processing.

Designed and implemented scalable ETL pipelines using Matillion, processing structured and unstructured data in formats like JSON, CSV, and text-delimited files.

Built configurable data pipelines for scheduled updates to customer-facing data stores using Python.

Developed MapReduce jobs for data preprocessing and integrated with Apache Airflow for scheduling and orchestrating Hive, Spark, and MapReduce jobs.

Designed Data Marts and implemented dimensional modeling using Star and Snowflake schemas for reporting and analysis.

Worked with Talend for building ETL pipelines, integrating structured and unstructured data from various sources into Azure Data Lake and Snowflake.

Implemented real-time stream processing using Apache Flink, processing data streams in real-time and reducing latency in analytics.

Migrated traditional systems to Azure, leveraging tools like Azure Data Factory (ADF), Azure Data Lake (ADLS), Azure SQL Database, Azure Blob Storage, Azure Synapse Analytics, and Azure Service Bus.

Created advanced visualizations and dashboards using Tableau, Power BI, Matplotlib, and seaborn, incorporating action filters, parameters, and calculated sets for dynamic reporting.

Designed and modified database objects (Tables, Views, Indexes, Constraints, Triggers) and implemented Stored Procedures and Functions using SQL and PL/SQL.

Worked collaboratively in Agile methodologies, attending daily scrum meetings, sprint planning, and providing feedback during iterative review meetings.

Actively tracked tasks and progress using Jira for Agile project management.

Environment: Spark, PySpark, Hive, HDFS, MapReduce, Apache Airflow, Scala, Python, SQL, PL/SQL, Star Schema, Snowflake Schema, Dimensional Modeling, Azure Data Factory (ADF), Azure Data Lake (ADLS), Azure SQL Database, Azure Synapse Analytics, Tableau, Power BI, Jira, Agile, Scrum.

Client: Western Union, Milwaukee, WI Oct 2018 – April 2020

Role: Data Engineer

Responsibilities:

Gathered business requirements, performed business analysis, and designed various data products.

Developed Spark jobs in Scala and PySpark for data cleaning, preprocessing, and transformation of large datasets.

Built Spark code using Scala, Spark-SQL, and Spark Streaming for efficient data processing and analysis.

Extracted, transformed, and loaded (ETL) data from multiple sources (e.g., JSON, relational databases) into target systems using Spark DataFrames and Informatica PowerCenter.

Implemented data validation with MapReduce programs and loaded processed data into Hive tables (ORC format) from CSV files with diverse schemas.

Designed and implemented AWS Data Pipelines using AWS Lambda, API Gateway, S3, and DynamoDB, retrieving data from Snowflake and converting responses into JSON.
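
An illustrative Lambda handler for this pattern, assuming snowflake-connector-python is packaged with the function; the environment variables, table, and query are hypothetical.

```python
# Minimal sketch: a Lambda handler behind API Gateway that queries Snowflake
# and returns the rows as JSON.
import json
import os

import snowflake.connector


def lambda_handler(event, context):
    account_id = (event.get("queryStringParameters") or {}).get("account_id")

    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        warehouse="REPORTING_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT txn_date, amount FROM transactions WHERE account_id = %s LIMIT 100",
            (account_id,),
        )
        rows = [{"txn_date": str(d), "amount": float(a)} for d, a in cur.fetchall()]
    finally:
        conn.close()

    return {"statusCode": 200, "body": json.dumps(rows)}
```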

Migrated quality monitoring tools from AWS EC2 to AWS Lambda, managing datasets and ensuring data integrity in Snowflake warehouses.

Automated workflows and scripts using Apache Airflow and shell scripting, ensuring the seamless daily execution of production tasks.

Designed ETL workflows with Azure Data Factory, Azure Synapse Analytics, and Azure Logic Apps to efficiently load large datasets into the data warehouse.

Utilized SQL scripting for data modeling, enabling simplified querying and reporting for enhanced insights into customer behavior.

Partnered with end users to address data-related and performance issues during the onboarding of new users.

Developed Apache Airflow pipelines to load data from multiple sources into Redshift, while actively monitoring job schedules for efficiency.

Migrated data from Teradata to AWS, improving both accessibility and cost optimization for data storage and processing.

Transitioned reports and dashboards from OBIEE to Power BI, streamlining visualization and reporting capabilities.

Worked with structured, semi-structured, and unstructured data to create data integration solutions on GCP.

Designed and built data warehouse structures, including facts, dimensions, and aggregate tables, using Star and Snowflake schemas for dimensional modeling.

Reviewed and edited complex SQL queries for data analysis and profiling, optimizing joins (inner, left, right) in Tableau Desktop for live and static datasets.

Created dynamic dashboards in Tableau, performing type conversions and connecting to relational data sources.

Followed Agile methodology, participating in daily SCRUM meetings, sprint planning, showcases, and retrospectives.

Environment: Spark, PySpark, Scala, MapReduce, Tableau, ETL, Power BI, AWS, Azure, Apache Airflow, Snowflake, SQL, Agile, Windows.

Client: Verisk, Nepal Jan 2016 – Dec 2017

Role: Data Engineer

Responsibilities:

Developed near real-time data pipelines using Spark and implemented core transformations with RDD and Dataset API for efficient data processing.

Performed Clickstream Analysis and batch data analysis using MapReduce and Spark with Scala, optimizing MapReduce jobs for efficient use of HDFS through compression techniques.

Migrated legacy MapReduce programs to Spark transformations using Scala for improved performance and scalability.

Built data pipelines and complex ETL processes to handle external client data, integrating data quality checks as part of the pipeline using Python and Spark.

Designed and implemented logistic regression models in Python to predict subscription response rates based on customer transaction history, demographics, and promotional data.
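
A minimal scikit-learn sketch of such a model (the original work may have used a different library); the inline customer data and feature columns are hypothetical.

```python
# Minimal sketch: logistic regression on customer features with a binary response label.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical customer features and response label.
customers = pd.DataFrame({
    "txn_count_90d": [3, 12, 7, 1, 9, 15, 2, 6],
    "avg_txn_amount": [42.0, 180.5, 99.0, 20.0, 130.0, 210.0, 35.5, 88.0],
    "promo_emails_opened": [0, 5, 2, 0, 4, 6, 1, 3],
    "responded": [0, 1, 0, 0, 1, 1, 0, 1],
})

X = customers[["txn_count_90d", "avg_txn_amount", "promo_emails_opened"]]
y = customers["responded"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```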

Performed data cleaning and preparation on XML files, ensuring compatibility with downstream systems.

Created and managed data warehouse structures, including facts, dimensions, and aggregate tables, using Star and Snowflake schemas for dimensional modeling.

Conducted data profiling and analysis using SQL queries, identifying vulnerabilities through SQL injection testing and permission checks, and performing further analysis with Python scripts.

Analyzed SQL scripts and implemented solutions in PySpark for optimized data transformations.

Built dynamic reports and dashboards in Tableau Desktop, applying filters and performing type conversions for insights based on business use cases.

Followed Agile SCRUM methodology, participating in weekly walkthroughs, inspections, and sprint planning to ensure project alignment and progress tracking.

Environment: Spark, Scala, PySpark, RDD, Dataset API, MapReduce, HDFS, Python, SQL, XML, Tableau Desktop, Star Schema, Snowflake Schema, ETL processes, Dimensional Modeling, Agile SCRUM, logistic regression models.

Education

Master’s in Information Technology Management, Webster University, San Antonio, TX, 2024.


