
Azure Data Engineer

Location:
Dallas, TX
Posted:
August 06, 2025


Resume:

Venkat Sai Resham

Senior Azure Data Engineer

Phone: 802-***-****

Email: *************@*****.***

LinkedIn: https://www.linkedin.com/in/venkat-sai-resham/

PROFESSIONAL SUMMARY

Over a decade of comprehensive IT experience specializing in Azure Cloud Services, Big Data technologies, data modeling, analysis, ETL development, validation, deployment, monitoring, visualization and reporting, and requirements gathering.

Demonstrated expertise in the Azure Cloud platform, specifically Azure Data Lake Storage, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Synapse Analytics, Azure Analysis Services, Azure HDInsight, and Databricks.

Strong hands-on experience with Azure Functions (C# and JavaScript) for implementing serverless compute logic in production-grade ETL and event-driven architectures.

Collaborated with cross-functional teams to design and operationalize data cataloging, lineage, classification, and stewardship models in Collibra, ensuring alignment with enterprise compliance and governance standards.

Deep understanding of Microsoft Entra ID / Azure AD, including B2B and B2C external identity federation and enterprise-level identity management.

Proficient in leveraging Microsoft Graph API and designing secure RESTful APIs to access and manage directory data, users, and roles across Azure ecosystems.

Skilled in implementing claims-based authentication workflows and developing custom claims providers for enforcing fine-grained access control policies.

Experienced in implementing metadata management and data governance frameworks using Collibra in combination with Azure-native services, enabling consistent, governed, and discoverable enterprise data assets.

Well-versed in identity governance concepts such as entitlement management, periodic access reviews, and sponsor-based onboarding/offboarding models.

Expertise in OAuth 2.0 and OpenID Connect, with deep knowledge of access token customization and secure authentication flows for enterprise applications.

Strong background in secure development practices aligned with CJIS, NIST, and other regulatory standards, ensuring data protection and compliance in critical systems.

Experience in designing and building a data management lifecycle covering data ingestion, integration, consumption, and delivery, along with reporting, analytics, and system-to-system integration.

Experienced in constructing robust data pipelines with Azure Data Factory and Azure Databricks, ensuring seamless data loading and precise database access management for efficient handling of complex workflows.

Hands-on experience with Azure Databricks for building and optimizing ETL data pipelines, leveraging PySpark and Spark SQL, while also showcasing expertise in seamless data transfer from Data Lake to Azure Blob Storage, utilizing best practices.
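
A minimal PySpark sketch of the kind of Databricks ETL step described above, reading from ADLS Gen2 and writing curated output to Blob Storage; the storage account, container, and column names are placeholders, and cluster credentials are assumed to be configured (for example via a Key Vault-backed secret scope):

    # Minimal PySpark sketch of an ADLS Gen2 -> Blob Storage ETL step in Databricks.
    # Account, container, and column names below are placeholders, not project values.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls_to_blob_etl").getOrCreate()

    # Read raw Parquet data from ADLS Gen2 (abfss); assumes credentials are already
    # configured on the cluster.
    raw_df = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/claims/")

    # Basic cleansing and transformation with Spark SQL functions.
    curated_df = (
        raw_df
        .dropDuplicates(["claim_id"])
        .withColumn("load_date", F.current_date())
        .filter(F.col("amount").isNotNull())
    )

    # Write the curated output to Azure Blob Storage (wasbs), partitioned by load date.
    (curated_df.write
        .mode("overwrite")
        .partitionBy("load_date")
        .parquet("wasbs://curated@exampleblob.blob.core.windows.net/claims/"))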

Architected high-performance data pipelines using Azure Functions for serverless data processing and real-time analytics, while leveraging Azure Event Hubs for efficient data ingestion and utilizing Azure Logic Apps to establish robust data governance frameworks. This approach ensures data quality, security, and regulatory compliance throughout the Azure ecosystem.

Proven track record of implementing real-time data processing solutions with Azure Databricks, facilitating timely insights and decision-making through seamless ingestion, processing, and analysis of streaming data in near real-time.

Proven proficiency in harnessing the power of Azure Synapse Analytics to architect and implement advanced big data and analytics solutions, elevating data warehousing efficiency to new heights.

Automated data storage solutions through Azure Cosmos DB for distributed NoSQL databases, ensuring scalability and high performance. Additionally, leveraged Azure SQL Database for fully managed relational database services, guaranteeing secure and reliable storage for critical business data through partitioning strategies and indexing techniques.

Orchestrated scalable and efficient data engineering pipelines using Azure Kubernetes Service (AKS) and Azure Machine Learning, leveraging deep data engineering knowledge to extract actionable insights and predictions from diverse data sources. Also automated data classification and labeling with Azure Purview for consistent data governance and regulatory compliance.

Highly proficient in Big Data processing, serving as an expert architect for scalable data pipelines within the Hadoop ecosystem.

Demonstrates mastery in leveraging HDFS for distributed storage, utilizing MapReduce for parallel processing, and harnessing Hive for SQL-like querying. Implements advanced data manipulation and excels in real-time streaming with Kafka and Spark Streaming using Scala. Brings extensive experience with Oozie for workflow orchestration, Sqoop for efficient data movement between relational databases and HDFS, and HBase for handling large-scale, semi-structured data.

Optimized and accelerated data processing in Azure Databricks through expert PySpark performance tuning and optimization. Leveraged in-depth knowledge of Spark configurations, Adaptive Query Execution, and advanced techniques like Apache Spark's Catalyst Engine to ensure efficient resource utilization and optimal query execution within PySpark environments, demonstrating a comprehensive skill set in enhancing data processing workflows.
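
An illustrative configuration sketch of the tuning levers mentioned above (Adaptive Query Execution, broadcast joins, caching); the threshold, partition count, and paths are workload-dependent placeholders, not project settings:

    # Illustrative Spark tuning settings; exact values are workload-dependent placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (
        SparkSession.builder.appName("tuned_etl")
        # Enable Adaptive Query Execution so Spark can coalesce shuffle partitions
        # and switch join strategies at runtime.
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        # Raise the broadcast threshold so small dimension tables are broadcast.
        .config("spark.sql.autoBroadcastJoinThreshold", "64MB")
        .config("spark.sql.shuffle.partitions", "400")
        .getOrCreate()
    )

    facts = spark.read.parquet("/mnt/curated/facts")
    dims = spark.read.parquet("/mnt/curated/dims")

    # Explicit broadcast hint for a small dimension table, plus caching for reuse.
    joined = facts.join(broadcast(dims), "dim_key").cache()
    joined.count()  # materialize the cache before downstream actions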

Possesses a robust background in UNIX, LINUX, and shell scripting, leveraging PowerShell for efficient data manipulation, automation, and scripting to drive efficiency and automation in Hadoop Big Data Development.

Demonstrated practical understanding of Data Pipeline Development and Data Modeling methodologies, encompassing Star-Schema Modeling, Snowflake Schema Modeling, and the design of Fact and Dimension tables, informed by industry best practices honed through extensive hands-on experience.

Extensive experience with data serialization formats, including Parquet, ORC, AVRO, JSON, and CSV.

Ample knowledge of algorithm performance evaluation (regression and classification models), regularization, and cross-validation. Good knowledge of data visualization with Matplotlib, Seaborn, Plotly, and ggplot2.

Skilled in EDW, data marts, data modeling using Erwin and the Kimball methodology, ODS, and data warehouse implementations, with a focus on building scalable and efficient data architectures.

Proficient in setting up CI/CD pipelines using tools such as Jenkins, Maven, SonarQube, Nexus, Slack, Azure DevOps, and code pipeline for optimizing ETL workflows.

Database design, data modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions. Worked with MySQL, MS SQL Server, DB2, and Oracle, as well as NoSQL databases.

Implemented Master Data Management strategies to efficiently eliminate redundant data for Employer and Claimant entities, optimizing data integrity and streamlining information across systems.

Highly proficient in Git and GitHub for version control and maintaining code repositories; adept at collaborating on projects and managing code changes with precision.

Excellent performance in building and publishing customized interactive reports and dashboards with customized parameters, including producing tables, graphs, and listings using various procedures and tools such as Tableau and Power BI.

Experienced in Agile/Scrum development and Waterfall project execution methodologies, demonstrating proficiency in Agile methodologies such as extreme programming, SCRUM, and Test-Driven Development (TDD).

EDUCATION:

Master's in Computer Science, University at Albany, Albany, New York. Jan 2012 to Dec 2013

Bachelor's in Electronics and Communication Engineering, Osmania University, Hyderabad, India. Jun 2007 to Jun 2011

CERTIFICATION:

Microsoft Certified Azure Data Engineer Associate – DP-203

Databricks Certified Data Engineer Professional

Microsoft Certified Azure Fundamentals – AZ-900

TECHNICAL SKILLS:

Azure Services

Azure Data Factory, Azure Databricks, Logic Apps, Function Apps, Azure Synapse Analytics, Azure Event Hubs, Airflow, Snowflake, Azure DevOps, Cosmos DB, Azure Data Lake Storage (ADLS) Gen2, Azure Purview, and Azure Kubernetes Service (AKS)

Hadoop Distribution

Cloudera, Hortonworks

Big Data Technologies

HDFS, MapReduce, Hive, Sqoop, Oozie, Zookeeper, Kafka, Apache Spark, Spark Streaming, HBase, YARN, Scala and Pig

Languages

SQL, PL/SQL, Python, HiveQL, Scala and PySpark

Web Technologies

HTML, CSS, JavaScript, XML, JSP, Restful, SOAP

Operating Systems

Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS

File formats

ORC, Avro, CSV, JSON, TXT, XML, Excel

Build Automation tools

Maven, SBT

Version Control & CI/CD

GIT, GitHub, Jenkins

IDE &Build Tools, Design

Eclipse, Visual Studio

Visualization Tools

Power BI, Tableau

Databases

MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB

PROFESSIONAL EXPERIENCE:

Client: US Bank, Irving, TX. Feb 2023 to Present

Role: Senior Azure Data Engineer

Responsibilities:

Managed end-to-end operations of ETL data pipelines, orchestrating robust data workflows with Azure Databricks and Apache Spark for large-scale transformations, advanced analytics, and efficient handling of large datasets within Azure Data Factory.

Built pipelines in the Azure cloud platform, leveraging technologies like Delta Lake, Blob Storage, Azure Data Factory, Cosmos DB, Azure Databricks, and Azure Key Vault.

Developed and deployed production-grade Azure Functions using both C# and JavaScript to implement custom ETL logic and data transformation components.

Implemented secure authentication flows using Microsoft Entra ID (Azure AD), External ID (B2B/B2C), and identity federation models across internal and partner-facing applications.

Designed and consumed RESTful APIs using Microsoft Graph API for directory access, user provisioning, and role management within enterprise data workflows.

Applied claims-based authentication techniques, developing custom claims providers to enforce fine-grained access control and contextual authorization.

Led identity governance integration by implementing access reviews, entitlement management, and sponsor-based user provisioning to enforce least-privilege policies.

Integrated OAuth 2.0 and OpenID Connect protocols within application flows, customizing tokens for secure and context-aware access management.

Integrated the Azure DevOps REST API with continuous integration/continuous deployment (CI/CD) pipelines to automate build, test, and deployment processes, enhancing development workflows and reducing time-to-market.
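
A minimal sketch of the OAuth 2.0 client-credentials flow against Microsoft Entra ID noted above, using the MSAL Python library and then calling Microsoft Graph with the resulting token; the tenant ID, client ID, and secret are placeholders, and the Graph call assumes the app has been granted the relevant application permissions:

    # Minimal OAuth 2.0 client-credentials sketch with MSAL (Python); tenant,
    # client, secret, and scope values are placeholders.
    import msal
    import requests

    app = msal.ConfidentialClientApplication(
        client_id="<client-id>",
        client_credential="<client-secret>",
        authority="https://login.microsoftonline.com/<tenant-id>",
    )

    # Acquire an app-only token for Microsoft Graph.
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

    if "access_token" in result:
        # Call Microsoft Graph to list users (requires appropriate application permissions).
        resp = requests.get(
            "https://graph.microsoft.com/v1.0/users",
            headers={"Authorization": f"Bearer {result['access_token']}"},
            timeout=30,
        )
        resp.raise_for_status()
        print(resp.json()["value"][:5])
    else:
        raise RuntimeError(result.get("error_description", "token acquisition failed"))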

Applied secure development lifecycle practices to meet compliance requirements under frameworks such as CJIS and NIST, with secure audit trails and traceability.

Collaborated with data scientists to productionize ML models using Azure Machine Learning and integrated them with Azure Data Factory for automated model scoring pipelines.

Experienced working in UNIX/Linux environments and writing UNIX shell scripts.

Designed data preprocessing pipelines using PySpark and Python to cleanse, transform, and feature-engineer large volumes of data for model training.

Deployed real-time inference pipelines using Azure Kubernetes Service (AKS) and Event Hubs to support low-latency ML applications.

Implemented model versioning and experiment tracking using MLflow within Databricks, ensuring reproducibility and auditability of ML experiments.

Proficient in SAS Data Integration Studio, SAS Visual Analytics, SAS programming, SAS Data Quality, SAS Workflow Manager, and SAS Data Governance solutions.

Enabled continuous model retraining workflows by orchestrating data refreshes and automated triggers based on model drift detection metrics.

Integrated Azure Logic Apps, Azure Kubernetes Service, and Azure Data Factory analytics to orchestrate comprehensive workflow automation and improve operational efficiency.

Implemented encryption of Personally Identifiable Information (PII) in Azure Data Lake Storage using Azure Key Vault-managed keys.

Transformed visualizations in Power BI reports and dashboards per client requirements.

Implemented calculated columns and measures using DAX in Power BI. Successfully integrated custom visuals based on business requirements in Power BI Desktop.

Proficiently crafted SQL queries encompassing DDL and DML operations, while proficiently implementing indexes, triggers, views, stored procedures, functions, and packages to optimize database performance and functionality.

Leveraged Kafka, Spark Streaming, and Hive to architect and oversee a real-time data pipeline, facilitating seamless ingestion, transformation, and analysis of data, enabling timely insights and informed decision-making with advanced expertise.
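
A hedged Structured Streaming sketch of the Kafka-to-warehouse path described above; the broker address, topic, schema, and output paths are placeholders, and the Spark Kafka connector package is assumed to be available on the cluster:

    # Hedged sketch of a Kafka -> Spark -> Hive-style streaming path using
    # Structured Streaming; broker, topic, schema, and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", StringType()),
    ])

    raw = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "transactions")
        .load())

    events = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
        .withColumn("ingest_ts", F.current_timestamp()))

    # Append micro-batches to a Parquet location queryable downstream (e.g., via Hive).
    query = (events.writeStream
        .format("parquet")
        .option("path", "/mnt/warehouse/transactions")
        .option("checkpointLocation", "/mnt/checkpoints/transactions")
        .outputMode("append")
        .start())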

Executed seamless migrations of legacy applications onto the Microsoft Azure Cloud Platform as a Service (PaaS), establishing robust Continuous Integration/Continuous Deployment (CI/CD) pipelines on Azure for streamlined deployment and maintenance.

Developed efficient Spark core and Spark SQL scripts using Scala for accelerated data processing, ensuring high-performance analytics, and reporting capabilities at an expert level.

Applied Kimball dimensional data modeling methodologies to design data warehouses tailored to business objectives, facilitating reporting and analytics.

Utilized Hive on Spark and Spark-SQL proficiently to execute Hive scripts and efficiently process data, optimizing data processing workflows and bolstering data analysis capabilities with extensive expertise.

Implemented ETL transformations and validations leveraging Spark-SQL/Spark Data Frames in Azure Databricks and Azure Data Factory, meticulously optimizing data processing, and ensuring impeccable data quality with advanced proficiency.

Leveraged Azure Data Lake Storage (ADLS) Gen2 as a scalable foundation for efficient data movement and processing within ETL pipelines, enabling seamless integration and data management across diverse Azure services.

Leveraged the advanced features of Azure Data Lake Storage (ADLS) Gen2, including its hierarchical namespace and fine-grained access control, to enforce robust data governance policies and ensure regulatory compliance within the ETL project.

Employed Azure Blob Storage for optimized storage and retrieval of data files, implementing techniques such as compression, encryption, and other security measures to enhance both security and efficiency.

Implemented Azure Active Directory (Microsoft Entra ID) authentication and authorization mechanisms to ensure secure access control and identity management within Azure data engineering solutions, enhancing data security and compliance measures.

Designed and managed MongoDB replica sets, ensuring high availability and implementing backup strategies for robust disaster recovery.

Integrated Python seamlessly with Event Hubs, Stream Analytics, and Time Series Insights for real-time processing, while leveraging pandas, NumPy, and SciPy for ad-hoc analysis, data cleaning, and pre-processing tasks with adept proficiency.

Integrated on-premises and cloud data sources (MYSQL, Cassandra, Blob storage, Azure SQL DB) using Azure Data Factory, applying transformations, and loading data into Snowflake.

Enhanced performance of Databricks pipelines and Spark jobs through techniques like configuration tuning, caching, and partitioning, while proficiently developing Azure Functions to manage data pre-processing, enrichment, and validation within pipelines, thereby ensuring superior processing efficiency, enhanced data quality, and increased reliability.

Implemented Azure Data Factory pipelines to efficiently process various file formats such as text files, Parquet, CSV, JSON, Avro, and ORC, leveraging Maven for dependency management and seamless integration within the Azure data engineering project.

Utilized data warehousing techniques for Snowflake modeling, including data cleansing, Slowly Changing Dimension (SCD) management, surrogate key assignment, and change data capture.
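
One illustrative way to implement the Slowly Changing Dimension handling mentioned above is a two-pass Delta Lake MERGE (Delta Lake is listed elsewhere in this role); the dimension table, columns, and change-detection condition below are hypothetical:

    # Hedged sketch of SCD Type 2 handling with a Delta Lake MERGE in Databricks;
    # table names and columns are illustrative only.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("scd2_merge").getOrCreate()

    updates = (spark.read.parquet("/mnt/staging/customer_updates")
        .withColumn("effective_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
        .withColumn("is_current", F.lit(True)))

    dim = DeltaTable.forPath(spark, "/mnt/warehouse/dim_customer")

    # Pass 1: expire current rows whose tracked attributes changed.
    (dim.alias("d")
        .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
        .whenMatchedUpdate(
            condition="d.address <> u.address",
            set={"is_current": "false", "end_date": "current_date()"})
        .execute())

    # Pass 2: insert new versions for changed customers and brand-new customers.
    (dim.alias("d")
        .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
        .whenNotMatchedInsertAll()
        .execute())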

Orchestrated end-to-end Dynamics 365 implementation projects, overseeing every phase to guarantee punctual delivery and client contentment.

Spearheaded the transition from Oracle Warehouse Builder (OWB) ETL processes to Oracle Data Integrator (ODI) for enhanced efficiency.

Applied an analytical problem-solving approach, leveraging Azure Data Factory, Data Lake, and Azure Synapse to tackle business challenges effectively, driving data-driven decision-making processes with a high level of proficiency.

Developed ELT/ETL pipelines using Python and Snowflake SnowSQL to enable seamless data movement to and from the Snowflake data store, ensuring data consistency and accuracy.
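
A small illustrative sketch of a Python-driven Snowflake load using the snowflake-connector-python library; the account, credentials, stage, and table names are placeholders:

    # Illustrative Python + Snowflake ELT sketch; connection parameters, stage,
    # and table names are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password="<password>",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )

    try:
        cur = conn.cursor()
        # Load staged Parquet files from an external stage into a staging table,
        # then transform into the reporting table with SQL (ELT pattern).
        cur.execute(
            "COPY INTO STAGING.CLAIMS_RAW FROM @CLAIMS_STAGE "
            "FILE_FORMAT=(TYPE=PARQUET) MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE"
        )
        cur.execute("""
            INSERT INTO ANALYTICS.REPORTING.CLAIMS_DAILY
            SELECT claim_id, claim_date, SUM(amount)
            FROM STAGING.CLAIMS_RAW
            GROUP BY claim_id, claim_date
        """)
    finally:
        conn.close()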

Utilized Informatica and SSIS for data extraction, and employed Erwin and Kimball methodologies to develop schemas within data marts, adhering to both star and snowflake schemas within the data warehouse architecture.

Led seamless deployment and management of critical labor department applications on Azure Kubernetes Service (AKS), ensuring consistent performance.

Enhanced performance of Databricks pipelines and Spark jobs by implementing advanced techniques such as configuration tuning, caching, and partitioning, resulting in superior processing efficiency and scalability.

Designed and developed SSIS (ETL) packages to validate, extract, transform and load data from OLTP system to the Data warehouse.

Used PolyBase, a data virtualization feature of Microsoft SQL Server, to query and analyze data across relational and non-relational data sources seamlessly, providing a unified view of data stored in different formats without the need for complex ETL processes.

Enhanced operational efficiency by optimizing processing time for streaming data, resulting in a 25% reduction in costs through runtime optimization of clusters. Concurrently, conducted ongoing monitoring, automation, and refinement of data engineering solutions, which led to a 30% increase in productivity.

Successfully implemented dynamic scaling solutions using Kubernetes, drastically optimizing resource utilization and exemplifying a strategic commitment to unparalleled efficiency and cost-effectiveness in labor department operations.

Proficiently developed complex SQL views and stored procedures in Azure SQL Data Warehouse and Hyperscale, contributing to a 20% improvement in query performance.

Utilized JIRA for project reporting, creating sub-tasks for development, quality assurance, and partner validation.

Applied Agile methodologies extensively, participating in daily stand-ups and internationally coordinated PI Planning to ensure successful project delivery.

Environment: Azure Databricks, Apache Spark, Azure Data Factory, Cosmos DB, Azure Key Vault, Azure SQL DB, PostgreSQL, MySQL, Cassandra, Azure SQL Data Warehouse, Azure Purview, Collibra, Azure Event Hubs, Azure Functions, Azure Logic Apps, Azure Kubernetes Service, Azure Data Lake Storage (ADLS) Gen2, Snowflake, Python, Scala, Hive, Kafka, Spark Streaming, Spark SQL, Azure Synapse Analytics, Snowflake SnowSQL, SSIS, Git, GitHub, Bitbucket, JIRA.

Client: T-Mobile, Frisco, TX. June 2018 to Jan 2023

Role: Azure Data Engineer

Responsibilities:

Hands-on experience in designing and developing ETL pipelines to facilitate seamless data movement between various data sources and data warehouses.

Expert in managing database intricacies and connections across a diverse range of platforms, including MS SQL Server, MySQL, PostgreSQL, Oracle PL/SQL, and Teradata, ensuring efficient data processing and integration.

Supported the ML lifecycle by building high-throughput ETL pipelines in Azure Data Factory and Databricks, feeding curated datasets to data science teams.

Designed, deployed, and maintained Azure Functions (C#/JavaScript) in production environments to support serverless data transformation and event-driven architectures.

Integrated Microsoft Entra ID (Azure AD), B2B, and B2C identity federation into application pipelines, enabling secure external and internal collaboration.

Utilized Microsoft Graph API and RESTful services for secure directory access, user identity lifecycle management, and system integration.

Implemented claims-based authentication and built custom claims providers for fine-grained authorization and identity-based logic.

Applied identity governance strategies such as entitlement management, access reviews, and sponsor-based provisioning models to enforce compliance.

Configured OAuth 2.0 and OpenID Connect protocols in applications for secure token-based access and API security enforcement.

Practiced secure development techniques aligned with CJIS and NIST guidelines, ensuring code, data, and access controls met regulatory compliance standards.

Integrated ML scoring into batch pipelines using Databricks notebooks, supporting predictive analytics for customer churn and network optimization.

Built automated data validation layers to ensure model training data quality using custom Python scripts in ADF pipelines.
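
A hedged sketch of such a validation layer; the expected columns, thresholds, and dataset path are illustrative, not the actual project checks:

    # Hedged sketch of a simple training-data validation layer; columns,
    # thresholds, and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("training_data_validation").getOrCreate()

    EXPECTED_COLUMNS = {"customer_id", "tenure_months", "monthly_charges", "churn_label"}
    MAX_NULL_RATE = 0.05

    df = spark.read.parquet("/mnt/curated/churn_training/")

    # 1. Schema check: fail fast if required columns are missing.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # 2. Null-rate check on the label column.
    total = df.count()
    null_labels = df.filter(F.col("churn_label").isNull()).count()
    if total == 0 or null_labels / total > MAX_NULL_RATE:
        raise ValueError(f"label null rate too high: {null_labels}/{total}")

    # 3. Range check on numeric features.
    bad_rows = df.filter((F.col("tenure_months") < 0) | (F.col("monthly_charges") < 0)).count()
    if bad_rows:
        raise ValueError(f"{bad_rows} rows with negative feature values")

    print("validation passed")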

Partnered with MLOps teams to set up CI/CD for ML pipelines, enabling smooth deployment of models into production using Azure DevOps and GitHub Actions.

Implemented monitoring for batch-scored ML outputs and generated Power BI dashboards to track prediction accuracy over time.

Executed ETL tasks utilizing Azure Databricks and facilitated the successful migration of on-premise Oracle ETL processes to Azure Synapse Analytics, ensuring smooth data movement and processing across the Azure ecosystem.

Designed and developed SSIS (ETL) packages to validate, extract, transform and load data from OLTP system to the Data warehouse.

Integrated MongoDB databases seamlessly with applications, collaborating with development teams to optimize data access patterns and enhance overall application performance.

Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory, Azure Synapse, and PolyBase.

Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.
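
A minimal Airflow DAG sketch showing the task-and-dependency pattern described above; the DAG ID, schedule, and callables are illustrative placeholders:

    # Minimal Airflow DAG sketch; task names and callables are placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract(**context):
        print("pull data from source systems")


    def transform(**context):
        print("apply business transformations")


    def load(**context):
        print("load curated data to the warehouse")


    with DAG(
        dag_id="daily_claims_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Define dependencies between tasks: extract -> transform -> load.
        t_extract >> t_transform >> t_load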

Proficiently implemented continuous integration and continuous deployment (CI/CD) practices for efficient productionization of data processing workloads and machine learning (ML) models using Azure DevOps, Jenkins, and GitHub Actions.

Developed enterprise-level solutions using batch processing and streaming frameworks (Spark Streaming, Apache Kafka).

Designed and deployed Snowflake stages to seamlessly import data from diverse sources, proficiently managing transient, temporary, and persistent Snowflake tables to support efficient data processing workflows as an Azure Data Engineer.

Created numerous pipelines in Azure using Azure Data Factory v2 to retrieve data from disparate source systems, using Azure activities such as Move & Transform, Copy, Filter, and ForEach.

Implemented data processing workflows in Azure Data Factory pipelines, leveraging activities such as Copy to seamlessly move data between diverse sources and destinations, Filter to extract specific subsets of data based on defined conditions, and For Each to automate iterative tasks, optimizing data processing efficiency and enhancing workflow automation.

Developed Python scripts within Azure Databricks for file validations and automated the process using Azure Data Factory (ADF), streamlining data validation workflows and enhancing data quality assurance measures.

Designed and implemented data loading, aggregation, and exporting frameworks in Spark and Snowflake within Azure Databricks notebooks, facilitating seamless import/export of data and efficient handling of large JSON files.

Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.

Created DataStage jobs using different stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator.

Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
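
A small DStream-style sketch of mini-batch RDD processing; the socket source, parsing logic, and output prefix are placeholders used only to keep the example self-contained:

    # Hedged sketch of DStream-style micro-batch processing with RDD transformations;
    # the source and parsing logic are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="minibatch_stream")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second mini-batches

    lines = ssc.socketTextStream("localhost", 9999)

    # Classic RDD transformations applied to each mini-batch.
    events = (lines
        .map(lambda line: line.split(","))
        .filter(lambda fields: len(fields) == 3)
        .map(lambda fields: (fields[0], float(fields[2]))))

    # Aggregate per key within each batch and persist results per batch interval.
    totals = events.reduceByKey(lambda a, b: a + b)
    totals.saveAsTextFiles("/tmp/stream_output/batch")

    ssc.start()
    ssc.awaitTermination()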

Enhanced existing Python modules, developed APIs to load processed data to HBase tables, and implemented continuous integration using Jenkins and GIT for streamlined building and testing processes.

Created several Power BI dashboard reports and heat map charts, and supported numerous dashboards, pie charts, and heat map charts built on a Teradata database.

Worked within the SDLC using Agile methodology, participating in daily scrum meetings and sprint planning.

Assisted team members in resolving technical issues, troubleshooting, identifying project risks and issues, managing resources, conducting monthly one-on-one sessions, and weekly meetings.

Environment: ETL pipelines, Teradata, T-SQL, U-SQL, Azure Data Lake Analytics, Azure Data Factory, Logic Apps, Azure Synapse Analytics, SSIS, PySpark, Azure Databricks, Oracle, Azure Data Lake Store, Collibra, Azure DevOps, Snowflake, Jenkins, GitHub, Spark Streaming, Apache Kafka, Scala, Spark, Airflow, surrogate keys, RDDs, HBase tables, Agile methodology, Power BI, Teradata database, Git.

Client: State of New Jersey, Trenton, NJ. Feb 2017 to May 2018

Role: Big Data & Hadoop Developer

Responsibilities:

Developed Spark Applications by using Scala and Python and Implemented Apache Spark data processing project to handle data from various RDBMS and streaming sources.

Developed a Spark Streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for future use cases.

Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.

Implemented partitioning and bucketing strategies based on state to optimize data processing, leveraging bucket-based Hive joins for enhanced performance.
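
An illustrative sketch of the partitioning-and-bucketing idea; the project used Hive bucketed tables, while this example uses Spark's equivalent DataFrame bucketing API, with table names, bucket count, and columns as placeholders:

    # Illustrative partitioning + bucketing sketch using Spark's DataFrame bucketing
    # API; table names, bucket count, and columns are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucketing_demo").enableHiveSupport().getOrCreate()

    claims = spark.read.parquet("/data/curated/claims")

    # Partition by state and bucket by the join key so that joins on member_id
    # between tables bucketed the same way can avoid a full shuffle.
    (claims.write
        .mode("overwrite")
        .partitionBy("state")
        .bucketBy(32, "member_id")
        .sortBy("member_id")
        .saveAsTable("claims_bucketed"))

    # Assumes a members table bucketed the same way on member_id.
    members = spark.table("members_bucketed")
    joined = spark.table("claims_bucketed").join(members, "member_id")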

Spearheaded the development of dimensional models and schemas using Erwin, structuring and organizing claims data for analytics in the Hadoop environment.

Developed a PySpark data ingestion framework to ingest source claims data into Hive tables, performing data cleansing, aggregations, and de-dup logic to identify updated and latest records.
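
A hedged sketch of the de-duplication logic described above, keeping only the latest record per claim with a window function; the column names and paths are illustrative:

    # Hedged de-dup sketch: keep only the latest record per claim using a window
    # function; column names and paths are illustrative.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("claims_dedup").enableHiveSupport().getOrCreate()

    raw = spark.read.json("/data/landing/claims/")

    w = Window.partitionBy("claim_id").orderBy(F.col("updated_ts").desc())

    latest = (raw
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)        # latest version of each claim only
        .drop("rn")
        .dropna(subset=["claim_id"]))    # basic cleansing

    # Load the cleansed, de-duplicated records into a Hive table.
    latest.write.mode("overwrite").saveAsTable("claims_latest")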

Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.

Utilized Zookeeper for managing configuration information, facilitating distributed synchronization, and overseeing metadata management.

Leveraged Spark features such as In-Memory processing, Distributed Cache, Broadcast, Accumulators, Map side Joins to implement data preprocessing pipelines with minimal latency.

Enhanced team productivity by introducing and implementing agile methodologies, leading to a 25% acceleration in project delivery timelines.

Experience in data profiling, data mapping, data cleaning, data integration, metadata management, and MDM (Master Data Management).

Responsible for managing data from various sources and involved in HDFS maintenance and loading of structured and unstructured data.

Developed a comprehensive data pipeline leveraging technologies such as Flume, Sqoop, Pig, Kafka, Oozie, and MapReduce to ingest behavioral data into HDFS for analysis.

Created DataStage jobs utilizing various stages including Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator for efficient data processing.

Processed HDFS data and created external tables using Hive, developing reusable scripts for table ingestion and repair across projects.

Worked on loading data from the UNIX file system to HDFS and analyzed large data sets to determine the optimal way to aggregate and report on them.

Implemented automation for deployments using YAML scripts, streamlining the process for massive builds and releases.

Utilized SSIS to construct automated multi-dimensional cubes, facilitating advanced data analysis and visualization.

Leveraged Sqoop to channel data from diverse sources into HDFS and RDBMS for seamless integration and processing.

Automated the data processing with Oozie to automate data loading into the Hadoop Distributed File System (HDFS).

Built Spark-based ETL jobs capable of handling approximately 450 GB of data daily.

Responsible for importing data from MongoDB using Sqoop, customizing BI tools for query analytics using Hive QL, estimating hardware requirements for Name Node and Data Nodes, and implementing open-source web scraping frameworks for data extraction.

Worked on developing ETL processes to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.

Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer which reduced 60% of execution time.
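
An illustrative Python sketch of splitting a large delimited file into smaller batches ahead of FTP transfer; the paths and batch size are placeholders:

    # Illustrative sketch of splitting a large delimited file into smaller
    # batch files before FTP transfer; paths and batch size are placeholders.
    import os


    def split_file(src_path, out_dir, lines_per_batch=500_000):
        """Split src_path into numbered batch files of at most lines_per_batch lines."""
        os.makedirs(out_dir, exist_ok=True)
        batch_num, line_count, out = 0, 0, None
        with open(src_path, "r", encoding="utf-8") as src:
            for line in src:
                if out is None or line_count >= lines_per_batch:
                    if out:
                        out.close()
                    batch_num += 1
                    line_count = 0
                    out = open(os.path.join(out_dir, f"batch_{batch_num:04d}.csv"),
                               "w", encoding="utf-8")
                out.write(line)
                line_count += 1
        if out:
            out.close()
        return batch_num


    if __name__ == "__main__":
        n = split_file("/data/exports/claims_full.csv", "/data/exports/batches")
        print(f"wrote {n} batch files")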

Employed advanced techniques including combiners, partitioning, and distributed cache to optimize the performance of MapReduce jobs.

Worked on GIT to maintain source code in Git and GitHub repositories.

Environment: Sqoop, MySQL, HDFS, Apache Spark (Scala), Hive, Hadoop, HBase, Collibra, Kafka, MapReduce, Zookeeper, Oozie, data pipelines, RDBMS, Python, PySpark, Spark SQL, shell scripting, JIRA.


