Keerthi B
PROFESSIONAL SUMMARY
·Having 10 years of experience in Information Technology, including hands-on experience with the Hadoop ecosystem (Spark, Kafka, HBase, MapReduce, Python, Scala, Pig, Impala, Sqoop, Oozie, Flume, Storm) and related big data technologies; worked with Spark SQL, Spark Streaming, and the core Spark API to explore Spark features and build data pipelines.
·Experience in building big data solutions based on the Lambda Architecture with the Cloudera distribution of Hadoop, Twitter Storm, Trident, MapReduce, Cascading, Hive, Pig, and Sqoop.
·Strong knowledge of and experience in implementing big data workloads on Amazon Elastic MapReduce (Amazon EMR), managing the Hadoop framework on dynamically scalable Amazon EC2 instances along with Lambda, SNS, SQS, AWS Glue, S3, RDS, and Redshift.
·Significant expertise in integrating data on AWS using platforms such as Databricks, Apache Spark, Airflow, EMR, Glue, Kafka, Kinesis, and Lambda within ecosystems such as S3, Redshift, Snowflake, RDS, and MongoDB/DynamoDB.
·Developed Real-time Data Pipelines with Kafka Connect and Spark Streaming.
·Experienced in development methodologies like Agile/Scrum.
·Extensive proficiency in designing, documenting, and testing ETL jobs and mappings in both Server and Parallel jobs using DataStage to populate data warehouse and data mart tables.
·Expertise in transforming business requirements into analytical models, designing algorithms, and developing data mining, data acquisition, data preparation, data manipulation, feature engineering, SIOP, validation, visualization, and reporting solutions that scale across massive volumes of structured and unstructured data.
·Experienced working with various ETL tool environments such as SSIS and Informatica, along with reporting tool environments like SQL Server Reporting Services and Business Objects.
·Having good experience with all major Hadoop distributions (Cloudera, Hortonworks, and MapR) and hands-on experience with Avro and Parquet file formats, dynamic partitions, and bucketing for best practices and performance improvement.
·Proficient in building Server jobs utilizing stages such as Sequential File, ODBC, Hashed File, Aggregator, Transformer, Sort, and Link Partitioner.
·Experienced with source control systems such as Git and Bitbucket, and with Jenkins for CI/CD deployments.
·Good knowledge of RDBMS concepts (Oracle 12c/11g, MS SQL Server 2012) and strong skills in SQL, stored procedures, and triggers.
·Excellent knowledge of Unit Testing, Regression Testing, Integration Testing, User Acceptance Testing, production implementation, and maintenance.
·Extensive experience developing, publishing, and maintaining customized interactive reports and dashboards with customizable parameters, including the creation of tables, graphs, and listings using different methods and tools such as Tableau, Power BI and Excel.
·Practical expertise with Azure cloud components (SQL DB, SQL DWH, Cosmos DB, HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, and Storage Explorer).
·Extensive experience with Azure Data Factory, Azure Databricks, and importing data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse.
·Practical knowledge of data modeling (dimensional and relational) principles such as Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
·Ability to review technical deliverables, mentor and drive technical teams to deliver quality products.
·Demonstrated ability to communicate and gather requirements, partner with Enterprise Architects, Business Users, Analysts, and development teams to deliver rapid iterations of complex solutions.
TECHNICAL SKILLS:
Big Data Technologies
Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase.
Programming Languages
Python, PL/SQL, SQL, Scala, C++, T-SQL, PowerShell Scripting, JavaScript.
Cloud Services
Azure Data Lake Storage Gen2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS.
Databases
MySQL, SQL Server, Oracle, MS Access, Teradata, and Snowflake.
NoSQL Databases
MongoDB, Cassandra, HBase.
Development Strategies
Agile, Lean Agile, Pair Programming, Waterfall and Test Driven Development.
Visualization & ETL tools
Tableau, Informatica, Talend, SSIS, and SSRS
Version Control & CI/CD tools
Jenkins, Git, and SVN
Operating Systems
Unix, Linux, Windows, Mac OS
Monitoring tools
Apache Airflow, Jenkins
PROFESSIONAL EXPERIENCE:
Client: Kroger, Ohio (Remote) Dec 2023 to Present
Role: Senior Azure Data Engineer
Responsibilities:
·Extensively used Azure Data Factory to seamlessly integrate data from various source systems, allowing for a unified and centralized data processing pipeline.
·Employed ADF triggers, including Event, Scheduled, and Tumbling triggers, offering flexibility in automating data ingestion tasks based on different events and schedules.
·Utilized ADF in order processing pipelines, where Cosmos DB was employed to both source events and store catalog data, ensuring efficient and scalable order handling.
·ADF served as a comprehensive orchestration tool, seamlessly integrating data from upstream to downstream systems, ensuring a smooth flow of information throughout the data processing pipeline.
·Designed and implemented custom user-defined functions, stored procedures, and triggers specifically for Cosmos DB, enhancing the orchestration capabilities for this NoSQL database.
·Implemented Azure Data Factory triggers and custom Cosmos DB functions for seamless, automated data integration, optimizing ETL processes effectively.
·Used PySpark with Databricks for complex data transformations, providing a powerful environment for manipulating and processing data with the added benefits of scalability and parallel processing (a minimal transformation sketch follows this list).
·Designed SSIS packages to validate, extract, transform, and load data from OLTP systems to the Data Warehouse, ensuring data quality and consistency throughout the transformation process.
·Implemented incremental loading from Azure SQL DB to Azure Synapse, optimizing the data loading process and reducing unnecessary data transfers.
·Azure Data Lake was utilized as a storage solution for landing data processed by Azure Databricks, offering a scalable and secure environment for storing large volumes of processed data.
·Played a pivotal role in using Cosmos DB to store events and catalog data for order processing pipelines, providing a NoSQL database solution with high availability and global distribution capabilities.
·The daily production server checklist included a thorough examination of SQL Backups, Disk Space, Job Failures, System Checks, and performance statistics using Azure Monitoring tools.
·Developed Build and Release definitions for Continuous Integration (CI) and Continuous Deployment (CD), streamlining the development and deployment processes.
·CI/CD automation ensured that code changes were tested, validated, and deployed efficiently, reducing manual errors and accelerating the delivery of updates.
·Leveraged Azure Data Share to facilitate seamless file transfer through interfaces, ensuring data consistency and integrity across different components of the system.
·Incorporated Power BI seamlessly into the Azure Data Factory-driven data processing pipeline, providing stakeholders with powerful visual insights and user-friendly reports, ultimately enhancing data-driven decision-making capabilities.
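The following is a minimal PySpark sketch of the kind of Databricks transformation described above; the storage paths, column names, and business rules are illustrative assumptions rather than actual project code.
from pyspark.sql import SparkSession, functions as F

# On Databricks a SparkSession already exists; it is created here so the sketch is self-contained.
spark = SparkSession.builder.appName("orders_curation").getOrCreate()

# Read raw order events landed in Azure Data Lake Storage Gen2 (hypothetical container/path).
raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/orders/")

# Basic cleansing and enrichment before publishing to the curated zone (illustrative rules).
curated = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_timestamp"))
       .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
       .filter(F.col("order_total") > 0)
)

# Write partitioned Parquet that downstream pipelines load incrementally into Azure Synapse.
(curated.write.mode("overwrite")
        .partitionBy("order_date")
        .parquet("abfss://curated@examplelake.dfs.core.windows.net/orders/"))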
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, Visual Studio 2012/2016, Microsoft SQL Server 2012/2016, SSIS 2012/2016, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modeling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hubs, Power BI
Client: Gebbs Health Care Solutions, India Oct 2019 to July 2022
Role: AWS Data Engineer
Responsibilities:
·Developed Spark applications using PySpark and Spark-SQL to efficiently extract, transform, and aggregate data from various file formats.
·Leveraged Spark RDDs, the DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming to ensure a comprehensive approach to data processing.
·Extensively worked with AWS cloud services such as EC2, S3, EMR, Redshift, Lambda, and Glue. AWS Glue and Lambda were employed for building data pipelines (ELT/ETL Scripts), extracting data from diverse sources (MySQL, AWS S3 files), and loading it into the Data Warehouse (AWS Redshift).
·Wrote ETL/ELT scripts to efficiently extract data from different sources, transform it, and load it into AWS Redshift. Spark SQL, PySpark, AWS Athena, and AWS Glue were combined to create robust ETL processes for seamless data movement.
·Contributed to the development of a serverless querying environment by writing to the Glue metadata catalog, enabling refined data querying from AWS Athena.
·Employed Spark Streaming APIs for on-the-fly transformations and actions to build a common learner data model.
·Utilized Spark Streaming to consume XML messages from Kafka, processing UI updates in real time (see the streaming sketch following this list).
·Implemented Terraform for automated provisioning and management of AWS infrastructure, optimizing the deployment process for Spark applications and related resources.
·Integrated AWS Lambda functions seamlessly into data processing workflows, responding to events, processing data, and enhancing the overall efficiency of the ETL/ELT pipeline.
·Raw data was ingested into AWS S3 from Kinesis Firehose for initial processing.
·AWS Lambda functions were triggered upon raw data ingestion, processing and loading refined data into another S3 bucket, and writing to SQS queue as Aurora topics.
·Applied Spark DataFrames for preprocessing jobs, flattening JSON documents into easily manageable flat files.
·Heavily worked with AWS databases, including RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
·Enhanced data accessibility and readability, paving the way for subsequent analysis and transformations.
·Enhanced the efficiency of our data processing workflows, ensuring they aligned precisely with project requirements.
·Successfully orchestrated the integration of various components, ensuring a seamless dataflow from extraction to transformation, storage, and real-time processing.
·Leveraged SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN, actively contributing to the continuous improvement of our data processing algorithms and workflows.
·Incorporated Power BI seamlessly into the data processing pipeline, providing stakeholders with dynamic dashboards and real-time analytics, ultimately enhancing decision-making capabilities with visually impactful insights across diverse data processing stages.
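A hedged sketch of the Spark Structured Streaming pattern referenced above for consuming Kafka messages and landing refined output in S3; the broker addresses, topic name, bucket, and payload handling are illustrative assumptions (the actual job parsed XML payloads).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka_to_s3_stream").getOrCreate()

# Subscribe to a (hypothetical) Kafka topic carrying raw event payloads.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
         .option("subscribe", "ui-updates")
         .option("startingOffsets", "latest")
         .load()
)

# Kafka delivers key/value as binary; keep the payload as a string plus the event timestamp.
parsed = events.select(
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_time"),
)

# Micro-batch the stream into S3 as Parquet, with checkpointing for fault tolerance.
query = (
    parsed.writeStream.format("parquet")
          .option("path", "s3a://example-bucket/refined/ui-updates/")
          .option("checkpointLocation", "s3a://example-bucket/checkpoints/ui-updates/")
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()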
Environment: Python, Flask, NumPy, Pandas, SQL, MySQL, Cassandra, API, AWS EMR, Spark, AWS Kinesis, AWS Redshift, AWS EC2, AWS S3, AWS Elastic Beanstalk, AWS Lambda, AWS Data Pipeline, AWS CloudWatch, Docker, Shell scripts, Agile Methodologies, Power BI.
Client: Apps Tech Solutions, India Dec 2016 to Sep 2019
Role: Big Data Engineer
Responsibilities:
·Used Sqoop and Flume for efficient data ingestion into the Hadoop environment. Applied Sqoop for structured data transfers, ensuring compatibility with the big data environment.
·Employed Flume to capture and transport clickstream data from front-facing application logs.
·Implemented effective error handling mechanisms to maintain data integrity. Monitored and optimized Flume configurations to enhance data transfer efficiency.
·Implemented Kafka functionalities for distributed messaging within the architecture. Leveraged Kafka's distribution and partitioning features for efficient data processing.
·Established a replicated commit log service in Kafka to maintain consistent data feeds.
·Used Kafka as a messaging system to implement real-time streaming solutions.
·Integrated Spark Streaming with Kafka for seamless processing of real-time data. Ensured low-latency data processing, making it suitable for time-sensitive applications. Real-time streaming solutions contributed to immediate insights and actionable intelligence.
·Developed Sqoop scripts to facilitate the migration of data from Oracle to the Big Data environment. Ensured smooth and efficient transfer of data, considering data volume and complexity. Handled incremental loading of customer and transaction data based on date.
·Automated the data migration process, reducing manual intervention and enhancing reliability. Utilized Python scripting, PySpark, and Spark SQL for in-depth analysis of large datasets.
·Applied complex SQL queries on various source systems, including Oracle and SQL Server, and identified inconsistencies in data collected from diverse sources for data quality improvement (a reconciliation sketch follows this list). The analysis provided valuable insights for informed decision-making and enhanced data quality.
·Actively participated in designing and developing data ingestion processes in the Hadoop environment. Ensured seamless data flow between various stages of the Hadoop processing pipeline.
·Led the design of object models, data models, tables, and constraints for the Oracle Database.
·Collaborated with the project team to ensure database design alignment with project goals. Ensured the creation of efficient database structures supporting data processing requirements.
·Developed necessary stored procedures, functions, triggers, and packages for database functionality. Automated regular AWS tasks, including snapshot creation, using Python scripts.
·Ensured seamless integration between Python scripts, Airflow, and AWS services.
·Incorporated Tableau seamlessly into the Spark and AWS-driven data processing pipeline, providing stakeholders with advanced data visualization tools and interactive dashboards, ultimately enhancing the decision-making capabilities with visually impactful insights.
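An illustrative PySpark / Spark SQL sketch of the cross-source reconciliation described above; the database, table, and column names are hypothetical placeholders rather than actual project objects.
from pyspark.sql import SparkSession

# Hive support lets Spark SQL query the tables landed by the Sqoop ingestion jobs.
spark = (
    SparkSession.builder
    .appName("customer_reconciliation")
    .enableHiveSupport()
    .getOrCreate()
)

# Compare customers ingested from Oracle against customers extracted from SQL Server
# (both assumed to be staged as Hive tables) and flag missing or mismatched records.
spark.sql("""
    SELECT o.customer_id,
           o.email AS oracle_email,
           s.email AS sqlserver_email
    FROM   staging.customers_oracle o
    FULL OUTER JOIN staging.customers_sqlserver s
           ON o.customer_id = s.customer_id
    WHERE  o.customer_id IS NULL
       OR  s.customer_id IS NULL
       OR  o.email <> s.email
""").createOrReplaceTempView("customer_mismatches")

# Persist the mismatches for data-quality review and downstream reporting (hypothetical schema).
spark.table("customer_mismatches").write.mode("overwrite") \
     .saveAsTable("dq.customer_mismatches")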
Environment: Hortonworks, Hadoop, Big Data, HDFS, MapReduce, Sqoop, AWS, Oozie, NiFi, Python, SQL Server, Oracle, HBase, Hive, Impala, Pig, Tableau, NoSQL, Unix/Linux, Spark, PySpark, Notebooks.
Client: Birlafinservice, India Jun 2014 to Nov 2016
Role: Data Analyst
Responsibilities:
·Worked with leadership teams to implement tracking and reporting of operations metrics across global programs.
·Worked with large data sets, automated data extraction, and built monitoring/reporting dashboards and high-value, automated Business Intelligence solutions (data warehousing and visualization).
·Gathered business requirements, interacted with users and SMEs to better understand the data, and performed data entry, data auditing, data report creation, and monitoring of all data for accuracy.
·Performed data discovery and built a stream that automatically retrieves data from a multitude of sources (SQL databases, external data such as social network data, user reviews) to generate KPIs using Tableau.
·Wrote ETL scripts in Python/SQL to extract and validate data (a simplified example follows this list).
·Created data models in Python to store data from various sources.
·Interpreted raw data using a variety of tools (Python, R, Excel Data Analysis ToolPak), algorithms, and statistical/econometric models (including regression techniques, decision trees, etc.) to capture the bigger picture of the business.
·Translated requirement changes and provided data-driven insights into their impact on existing database structures and existing user data.
·Worked primarily on SQL Server, creating stored procedures, functions, triggers, indexes, and views using T-SQL.
·Worked on Amazon Web Services (AWS) to integrate EMR, Spark 2, S3 storage, and Snowflake. Installed and configured a multi-node cluster in the cloud using AWS EC2. Handled AWS management tools such as CloudWatch and CloudTrail.
·Stored the log files in AWS S3. Used versioning in S3 buckets where highly sensitive information is stored.
·Used Tableau to create interactive and visually appealing dashboards, providing a comprehensive view of key performance indicators (KPIs) and operational metrics across global programs.
·Used Excel's Data Analysis ToolPak to perform in-depth analysis on large datasets, including statistical and econometric modeling.
·Wrote and modified reports, providing detailed insights into the gathered business requirements and supporting data-driven decision-making.
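A simplified example of the Python ETL/validation scripting mentioned above; the connection string, table names, and validation rules are assumptions for illustration, not the actual production script.
import pandas as pd
from sqlalchemy import create_engine

# Connect to SQL Server (hypothetical server, credentials, and ODBC driver).
engine = create_engine(
    "mssql+pyodbc://analyst:password@reporting-server/SalesDB?driver=ODBC+Driver+17+for+SQL+Server"
)

# Extract the raw orders for the reporting window (hypothetical table and columns).
orders = pd.read_sql(
    "SELECT order_id, customer_id, order_date, amount FROM dbo.Orders WHERE order_date >= '2016-01-01'",
    engine,
)

# Basic validation: flag duplicates, missing keys, and non-positive amounts before loading.
issues = pd.concat([
    orders[orders.duplicated(subset="order_id", keep=False)].assign(issue="duplicate_order_id"),
    orders[orders["customer_id"].isna()].assign(issue="missing_customer_id"),
    orders[orders["amount"] <= 0].assign(issue="non_positive_amount"),
])

# Load the clean rows and the exception report into a (hypothetical) reporting schema.
clean = orders.drop(index=issues.index.unique(), errors="ignore")
clean.to_sql("Orders_Clean", engine, schema="rpt", if_exists="replace", index=False)
issues.to_sql("Orders_Exceptions", engine, schema="rpt", if_exists="replace", index=False)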
Environment: SQL Server, ETL, SSIS, SSRS, Tableau, Excel, R, AWS, Python, Django