Professional Summary:
** years of experience in designing, developing, and managing large-scale data engineering and analytics projects across various domains.
Expertise in Big Data/Hadoop technologies, including Spark, PySpark, Hive, Pig, HBase, Flume, and Kafka, to handle large-scale distributed data processing.
Proficient in building and managing real-time data pipelines using Spark Streaming, Kafka, and Zookeeper for streaming analytics.
Strong hands-on experience with Azure Databricks, Azure Synapse, and Azure-based cloud platforms for data integration and advanced analytics.
Skilled in working with NoSQL databases such as Cassandra, MongoDB, and MariaDB, ensuring high availability and scalability.
Extensive knowledge of AWS cloud services, including EC2, S3, EMR, Redshift, and IAM, to build scalable and secure data solutions.
Deep understanding of development methodologies like Agile/Scrum and Waterfall, and experience in creating UML diagrams and applying design patterns.
Proficient in integrating build tools like Maven, ANT, and Jenkins for automated deployment and continuous integration.
Expertise in data extraction, transformation, and loading (ETL) using tools like Talend, SQL Loader, and Toad for seamless data workflows.
Strong reporting and data visualization skills using Crystal Reports XI, SSRS, and MS Office Suite, including Excel and Visio.
Experienced in handling relational databases like SQL Server, Oracle (11g/12c), DB2, Teradata, Netezza, and MySQL for structured data storage and querying.
Advanced knowledge of operating systems such as Windows, UNIX, and Linux, ensuring efficient system-level operations and configurations.
Proficient in designing and implementing scalable solutions using Spark SQL and Hadoop YARN for distributed data processing and resource management.
Skilled in working with Talend and PostgreSQL to design high-performance data pipelines for complex data integration projects.
Expertise in creating workflows with Oozie for scheduling and automating Hadoop-based tasks.
Demonstrated ability to manage cloud resources with AWS Auto Scaling, CloudWatch, and Route 53, ensuring high availability and reliability.
Proficient in data modeling, performance tuning, and query optimization for large-scale databases using tools like Microsoft SQL Studio and IntelliJ.
Strong interpersonal and leadership skills with a proven track record of delivering quality results in fast-paced, cross-functional team environments.
Technical Skills:
BigData/Hadoop Technologies
Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Apache YARN, Oozie, ZooKeeper.
Languages
HTML5, CSS3, C, C++, XML, SAS, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript
NOSQL Databases
Cassandra, HBase, MongoDB, MariaDB
Web Design Tools
HTML, CSS, JavaScript, JSP, jQuery, XML
Public Cloud
EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift, Databricks, SnapLogic, Snowflake
Development Methodologies
Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools
Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, SoapUI.
Reporting Tools
MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS.
Databases
Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems
All versions of Windows, UNIX, and Linux.
Development Tools
Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.
Professional Experience:
CFNA, OH Aug 2023 – Present
Sr. Data Engineer
Responsibilities:
Developed and optimized SQL queries for efficient data retrieval and transformation across various projects.
Built scalable data pipelines using PySpark to process and analyze large datasets stored in HDFS (see the illustrative sketch below).
Designed and implemented workflows using Apache NiFi to automate data ingestion and transformation processes.
Deployed and managed cloud-based data solutions on AWS, leveraging services like S3 for data storage and retrieval.
Utilized Apache Kafka to create real-time data streaming solutions for high-volume data processing.
Orchestrated ETL processes using SSIS and Informatica, ensuring accurate data flow between systems.
Migrated and optimized data warehouses using Snowflake, improving performance and reducing costs.
Automated data engineering tasks and created reusable scripts using Python and IDEs like PyCharm.
Implemented batch processing workflows using Apache Pig and Hive to aggregate and summarize data.
Used Git for version control and collaboration on data engineering scripts and workflows.
Integrated Sqoop to transfer data between HDFS and relational databases.
Created and managed HBase tables for NoSQL data storage and retrieval in real-time applications.
Designed XML-based data exchange formats to enable seamless communication between systems.
Scheduled and monitored Unix-based cron jobs for regular ETL processes and data updates.
Collaborated in an Agile Scrum environment, participating in sprint planning, reviews, and retrospectives.
Designed and maintained high-performance SQL databases to support analytics and reporting.
Built and managed distributed data applications using HiveQL for querying large datasets.
Created real-time dashboards and insights by integrating data pipelines with S3 and Snowflake.
Led troubleshooting and debugging efforts for complex data engineering workflows on Unix/Linux systems.
Mentored junior team members on best practices in Python, SQL, NiFi, and distributed data processing.
Environment: SQL, Python, PySpark, HDFS, NiFi, AWS, Pig, Hive, S3, Kafka, SSIS, Snowflake, PyCharm, Scrum, Git, Sqoop, HBase, Informatica, XML, Unix.
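For illustration only, a minimal sketch of the kind of PySpark batch pipeline described in this role: reading raw data from HDFS, aggregating it, and writing partitioned Parquet to S3 for downstream loading into Snowflake. The paths, column names, and bucket are hypothetical assumptions, not actual project artifacts.
```python
# Minimal PySpark batch ETL sketch (hypothetical paths, columns, and bucket names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_transactions_etl").getOrCreate()

# Read raw CSV data from HDFS (path and schema inference are assumptions for illustration).
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///data/raw/transactions/"))

# Basic cleansing and daily aggregation.
daily = (raw.filter(F.col("amount").isNotNull())
            .withColumn("txn_date", F.to_date("txn_ts"))
            .groupBy("txn_date", "merchant_id")
            .agg(F.sum("amount").alias("total_amount"),
                 F.count("*").alias("txn_count")))

# Write partitioned Parquet to S3 for downstream loading into Snowflake.
(daily.write
      .mode("overwrite")
      .partitionBy("txn_date")
      .parquet("s3a://example-bucket/curated/daily_transactions/"))
```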
Texas Health Resources, TX Feb 2022 – Jul 2023
Data Engineer
Responsibilities:
Designed and deployed IBM WebSphere applications to support secure and scalable enterprise data solutions.
Built high-performance data pipelines using Apache Hive for querying and analyzing large-scale datasets.
Developed automation scripts in Python and Shell Scripting to streamline ETL processes.
Implemented HBase for NoSQL storage solutions, optimizing for low-latency and high-throughput requirements.
Leveraged Apache Spark with Scala to process and analyze massive datasets in distributed environments.
Optimized data processing workflows with MapReduce for efficient batch processing on HDFS.
Created and maintained data migration workflows using Sqoop to transfer data between HDFS and relational databases.
Built and deployed scalable data architectures on AWS and Azure to support enterprise-level applications.
Utilized Apache Flume for real-time log aggregation and data ingestion into distributed storage systems.
Automated recurring tasks and monitoring scripts on Linux and UNIX platforms to enhance system reliability.
Designed real-time data streaming solutions using Apache Kafka for high-velocity data ingestion and processing (see the illustrative sketch below).
Engineered efficient SQL-based data models for complex querying and reporting in relational databases.
Implemented NoSQL databases like MongoDB and HBase for handling semi-structured and unstructured data.
Integrated data visualization solutions with Tableau to create interactive dashboards and analytics reports.
Ensured data pipeline performance and stability through robust monitoring and logging practices on Linux/UNIX environments.
Conducted root cause analysis and optimized data flows for Spark and HDFS-based ecosystems.
Designed and maintained event-driven architectures using Kafka for real-time data synchronization.
Orchestrated data workflows on Azure Data Factory to integrate on-premises and cloud-based data systems.
Built scalable and fault-tolerant solutions leveraging AWS S3, EMR, and Redshift.
Applied SQL and NoSQL expertise to architect databases for structured, semi-structured, and unstructured data requirements.
Environment: IBM WebSphere, Hive, Python, HBase, Spark, Scala, MapReduce, HDFS, Sqoop, AWS, Azure, Flume, Linux, Shell Scripting, Tableau, UNIX, Kafka, SQL, NoSQL.
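As a hedged illustration of the Kafka-based streaming ingestion described above, a minimal Spark Structured Streaming sketch (written in PySpark here for consistency with the other examples, although this role also used Scala). The topic name, brokers, schema, and paths are assumptions, and the Kafka connector package is assumed to be available on the cluster.
```python
# Structured Streaming sketch: read JSON events from Kafka and land them in HDFS.
# Topic, bootstrap servers, schema fields, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka_event_ingest").getOrCreate()

event_schema = StructType([
    StructField("record_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "clinical-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Append parsed events to HDFS as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streaming/events/")
         .option("checkpointLocation", "hdfs:///checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```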
Windstream, Twinsburg, OH Feb 2020 – Jan 2022
Data Engineer
Responsibilities:
Designed and maintained distributed data processing frameworks using Hadoop to manage and analyze large datasets.
Created data transformation scripts using Pig for aggregating and processing raw data.
Optimized queries and data storage structures using Hive for faster data retrieval in a data warehouse environment.
Implemented ETL workflows using Informatica to integrate data from multiple sources into a unified repository.
Leveraged HBase to manage and store NoSQL data for real-time analytics applications.
Developed and fine-tuned MapReduce jobs for batch processing and analysis of large-scale data.
Managed and monitored distributed storage using HDFS, ensuring high availability and reliability.
Automated data ingestion processes using Sqoop to transfer data between relational databases and HDFS.
Utilized Impala for real-time, interactive SQL querying on large datasets.
Designed and maintained relational databases using SQL, ensuring optimal schema design and indexing.
Built interactive data visualizations and dashboards using Tableau to deliver actionable business insights.
Created and deployed Python scripts for data cleansing, preprocessing, and statistical analysis (see the illustrative sketch below).
Analyzed and modeled data trends using SAS for predictive analytics and decision-making support.
Configured and managed Apache Flume to collect and aggregate real-time log data for processing.
Scheduled and managed workflows with Oozie on Linux, enabling seamless execution of data pipelines.
Environment: Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux.
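A minimal sketch of the kind of Python data-cleansing script mentioned above; the file names, column names, and cleansing rules are hypothetical and shown only to illustrate the approach.
```python
# Data-cleansing sketch with pandas (hypothetical file and column names).
import pandas as pd

def clean_usage_data(path: str) -> pd.DataFrame:
    """Load raw usage records, drop duplicates, fix types, and fill gaps."""
    df = pd.read_csv(path, parse_dates=["usage_date"])
    df = df.drop_duplicates(subset=["account_id", "usage_date"])
    df["minutes_used"] = pd.to_numeric(df["minutes_used"], errors="coerce").fillna(0)
    df["region"] = df["region"].str.strip().str.upper()
    return df

if __name__ == "__main__":
    cleaned = clean_usage_data("raw_usage.csv")
    cleaned.to_csv("cleaned_usage.csv", index=False)
```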
Sonata Software, Hyderabad, India Apr 2017 – Sep 2019
Data Analyst
Responsibilities:
Gathered Sales Analysis report prototypes from business analysts across different business units.
Worked with master SSIS packages to execute a set of packages that load data from various sources into the data warehouse on a scheduled basis.
Performed data extraction, transformation, and loading (ETL) from source systems.
Responsible for ETL design, including identifying source systems, designing source-to-target mappings, data cleansing, data quality checks, and creating source specifications and ETL design documents.
Cleansed customer information received from legacy systems and transformed it into staging tables and target tables in DB2.
Used external tables to transform and load data from legacy systems into target tables.
Used data transformation tools such as DTS, SSIS, Informatica, and DataStage.
Conducted design reviews with business analysts, content developers, and DBAs.
Designed, developed, and maintained the enterprise data architecture, covering business intelligence systems, data governance, data quality, enterprise metadata tools, data modeling, data integration, operational data stores, data marts, data warehouses, and data standards.
Performed daily incremental loads of fact tables from the source system into staging tables.
Coded SQL stored procedures and triggers.
Responsible for data extraction and ingestion from multiple data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
Built pipelines to move hashed and unhashed data from XML files into the data lake.
Developed Spark scripts in Python on Azure HDInsight for data aggregation and validation, and verified their performance against MapReduce jobs.
Worked extensively with the Spark SQL context to create DataFrames and Datasets for preprocessing model data (see the illustrative sketch below).
Experience with cloud service providers such as AWS.
Analyzed data using Pig scripts, Hive queries, Spark (Python), and Impala.
Environment: Linux, Erwin, SQL Server, Crystal Reports 9.0, HTML, DTS, SSIS, Informatica, DataStage 7.0, Oracle, Toad, MS Excel, Pow.
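For illustration, a minimal PySpark sketch of the Spark SQL aggregation and validation work described above for Azure HDInsight; the staging path, column names, and validation rules are assumptions rather than actual project details.
```python
# Spark SQL validation and aggregation sketch (hypothetical paths, columns, and rules).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales_validation").getOrCreate()

# Hypothetical staging dataset on cluster storage.
sales = spark.read.parquet("hdfs:///data/staging/sales/")

# Simple validation: flag rows with missing keys or negative amounts.
invalid = sales.filter(F.col("order_id").isNull() | (F.col("amount") < 0))
print("Invalid rows:", invalid.count())

# Aggregate the remaining rows for the daily fact load.
valid = sales.subtract(invalid)
daily_fact = (valid.groupBy("order_date", "product_id")
                   .agg(F.sum("amount").alias("total_amount")))

# Expose the result to Spark SQL for downstream reporting queries.
daily_fact.createOrReplaceTempView("daily_sales_fact")
spark.sql("SELECT order_date, SUM(total_amount) AS revenue "
          "FROM daily_sales_fact GROUP BY order_date").show()
```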
Met Life, Hyderabad, India Aug 2014 – Mar 2017
Data Analyst
Responsibilities:
Developed data-driven web applications using Python 2.7 and Django, ensuring seamless data integration and reporting.
Created and analyzed interactive dashboards in Tableau to visualize trends and insights for business stakeholders.
Tested and validated RESTful web services using Postman, ensuring accurate data retrieval and integration.
Deployed and managed scalable data solutions on AWS EC2 and stored critical datasets in S3 buckets.
Designed and consumed REST APIs for seamless interaction with external data sources using JSON (see the illustrative sketch below).
Built and maintained dynamic web-based reports using HTML, CSS, and JavaScript, improving data accessibility.
Automated deployment pipelines with Jenkins to streamline data processing workflows and web service integrations.
Extracted and transformed data from Salesforce DB and MySQL, enabling advanced analytics and reporting.
Created and scheduled scripts for data preprocessing and transformation to prepare datasets for visualization in Tableau.
Collaborated with cross-functional teams to ensure accurate integration of database systems with RESTful services.
Environment: Python 2.7, Django, Tableau, Postman, AWS EC2 and S3, RESTful web services, HTML, CSS, JSON, JavaScript, Jenkins, Salesforce DB, MySQL.
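A minimal sketch of consuming a REST API with JSON and preparing a CSV extract for Tableau, as described in this role. The endpoint URL, credentials, and field names are hypothetical, and the sketch uses Python 3 syntax for brevity even though the role itself used Python 2.7.
```python
# Sketch: pull records from a REST API and write a CSV extract for Tableau.
# The endpoint, token, and field names below are placeholders, not real values.
import csv
import requests

API_URL = "https://api.example.com/v1/policies"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}     # placeholder credential

def fetch_policies():
    """Call the REST endpoint and return the parsed JSON payload."""
    response = requests.get(API_URL, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()  # assumed to return a list of policy records

def write_extract(records, path="policies_extract.csv"):
    """Write selected fields to a flat CSV that Tableau can connect to."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["policy_id", "holder", "premium"])
        writer.writeheader()
        for rec in records:
            writer.writerow({k: rec.get(k) for k in ("policy_id", "holder", "premium")})

if __name__ == "__main__":
    write_extract(fetch_policies())
```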