Siva Sankar Malapati
Mobile: +1-732-***-****
Email: ****************@*****.***
LinkedIn: www.linkedin.com/in/siva-sankar-reddy-m-ab0821104
Senior Big Data Engineer/Big Data Lead
Professional Summary:
15+ years of diversified experience in software design and development, including extensive experience as a Big Data Engineer solving business use cases for several clients, with expertise in backend applications.
Capable of leading multiple teams across locations; effective in cross-functional, global environments, managing concurrent teams and assignments with strong communication and presentation skills.
Experience in Hadoop and Big Data related technologies. Implemented solutions for ingesting data from various sources and processing data at rest using big data technologies such as Hadoop, MapReduce frameworks, HBase, Hive, Scala, Spark, Python and PySpark.
Technical skills:
Languages
Python, Scala, C++, Java, XML and HTML
Big Data and Analytics
Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Apache Spark, Spark Streaming, HBase, Python, Databricks, Play and Akka
Hadoop Distribution
Cloudera CDH, Hortonworks HDP and Apache NiFi
Databases
Oracle 10g/11g/12c, SQL Server, MySQL, Teradata, PostgreSQL, MS Access, NoSQL databases (HBase, MongoDB), Snowflake.
Cloud Technologies
Amazon Web Services (AWS): S3, Glue, Redshift and EMR; Azure
Version Control
Git, GitHub, Bitbucket, Subversion (SVN) and CVS
Automation Server
Jenkins
IDE & Tools, Design
Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL, Power BI, Tableau, Talend Open Studio.
Operating Systems
Windows 98/2000/XP/7/10/11, macOS, Unix, Linux
Container Platforms
ECS and Kubernetes
AWS Services
EC2, AMI, S3, CloudFormation, CloudTrail, SNS, Glue and CloudWatch
Summary of Qualifications:
Expertise in design, development, requirements gathering, system analysis, team management, test and data management, client relationship management, and product delivery.
Excellent understanding of Hadoop architecture and underlying framework including storage management.
Solid experience in Big Data Analytics with hands-on use of ecosystem components like Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, Oozie and Spark.
Strong experience in analysing large data sets by writing PySpark scripts and Hive queries.
Experience running Apache Hadoop and Cloudera CDH.
Extensively worked in Hadoop environments with good knowledge of technologies like Yarn, HDFS, Hive, Zookeeper, and HBase/Phoenix.
Bulk loading from external stages (AWS S3) and internal stages into Snowflake using the COPY command.
Scheduled Hadoop/Hive/Sqoop/HBase jobs using Oozie.
Worked with various file formats such as SequenceFile, Avro, JSON, Parquet, RC, ORC, CSV and TXT.
Experience in using Power BI to analyze data from multiple sources and create reports with interactive dashboards.
Translated PostgreSQL queries into Spark transformations using Spark RDDs.
Extensively worked on AWS services like EC2, S3, EMR, Glue and Redshift.
Experienced in managing databases and Azure data platform services (Azure Data Lake Storage (ADLS), Azure Data Factory (ADF), Data Lake Analytics, Stream Analytics, Databricks and NoSQL DB), as well as SQL Server, Oracle and data warehouses; built multiple data lakes.
Extensively worked on infrastructure provisioning using Terraform.
Hands-on experience with Airflow to automate tasks, improve workflow efficiency, and solve business challenges.
Set up monitoring and logging to troubleshoot Kubernetes clusters and debug containerized applications.
Extracted data from different data sources using ETL tools (IBM DataStage, Teradata) and the Fivetran tool.
Involved in the design of ETL jobs to apply transformation logic.
Developed automated regression scripts in Python to validate ETL processes between multiple databases such as AWS Redshift and SQL Server.
Extensively worked on Azure Blob Storage, Azure Data Factory, Azure Data Lake, and Azure Databricks.
Designed and developed data models and data warehouses in Snowflake.
Designed and implemented data pipelines using GCP services such as Dataflow, Dataproc, and Pub/Sub.
Created and managed data storage solutions using GCP services such as BigQuery, Cloud Storage and Cloud SQL.
Expertise in design and development of various web and enterprise applications using Typesafe technologies such as Scala, Akka and the Play framework.
Experience with Azure Data Factory pipelines, extracting data from sources such as manual files, Blob Storage, Azure Data Lake and the Synapse data warehouse.
Developed RESTful APIs using Play that accept HTTP requests and produce responses in JSON format.
Developed and maintained ETL processes to move data from source systems to Snowflake.
Proficient in working with Delta Tables and Delta File System.
Good experience using data modelling techniques to derive results from SQL and PL/SQL queries.
Extensively used Apache Airflow to schedule and monitor workflows and to visualize pipeline dependencies, progress, logs, code, task triggers, and success status.
Extensively used Talend Open Studio for big data integration and data management solutions.
Developed Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats.
Good experience loading different file formats into Hive using PySpark (see the sketch at the end of this list).
Experienced with code versioning and dependency management systems such as Git, Bitbucket, and Maven.
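A minimal PySpark sketch of the multi-format-to-Hive loading pattern mentioned above; the application name, HDFS paths, and Hive database/table names are hypothetical placeholders, not taken from any specific engagement.

    from pyspark.sql import SparkSession

    # Hive support lets saveAsTable() register managed tables in the Hive metastore.
    spark = (SparkSession.builder
             .appName("multi-format-hive-load")   # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Read the same logical feed delivered in different formats (paths are placeholders).
    csv_df = spark.read.option("header", "true").csv("hdfs:///data/raw/feed_csv/")
    json_df = spark.read.json("hdfs:///data/raw/feed_json/")
    parquet_df = spark.read.parquet("hdfs:///data/raw/feed_parquet/")

    # Land each frame in Hive; table names are illustrative only.
    csv_df.write.mode("overwrite").saveAsTable("staging.feed_from_csv")
    json_df.write.mode("overwrite").saveAsTable("staging.feed_from_json")
    parquet_df.write.mode("overwrite").saveAsTable("staging.feed_from_parquet")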
Education Summary:
Master of Computer Applications (2005), G Pulla Reddy Engineering College, Kurnool, Sri KrishnaDevaraya University, Anantapur, Andhra Pradesh, India.
Bachelor of Computer Science (2001), Osmania College, Kurnool, Sri KrishnaDevaraya University, Anantapur, Andhra Pradesh, India.
Professional Experience:
Client: MARQETA, U.S.A Jan 2024 – Present
Role: Senior Big Data Engineer, California, U.S.A.
Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and write them to data-service-layer internal tables in ORC format (see the sketch following this role's technology list).
Utilize advanced programming languages such as Python or Scala to build scalable and efficient data processing applications.
Design and optimize data storage solutions using distributed storage systems like Hadoop HDFS, Apache HBase, or Amazon S3.
Implement and optimize distributed computing frameworks such as Apache Spark, Impala or Hadoop MapReduce for data processing and analysis.
Troubleshoot and resolve performance issues, bottlenecks, and data quality issues in big data systems.
Lead and manage the design, development, and implementation of large-scale big data solutions.
Architect data pipelines and ETL processes to collect, process, and store structured and unstructured data from various sources.
Participate in the evaluation and selection of tools and technologies to support the organization's big data infrastructure and initiatives.
Spearheaded the development of a comprehensive data platform from inception, actively participating in requirement gathering and analysis phases to document business requirements.
Redesigned cloud-based data warehouse to enhance security and improve performance.
Enhanced quality of data insights through implementation of automated data validation processes and improved access to data sources.
Leveraged Python boto3 to configure multiple AWS services including Glue, Redshift, EC2, and S3.
Designed and implemented data pipelines in Databricks for ingesting, cleaning, and transforming data from various sources.
Collaborated with cross-functional teams to architect and optimize data workflows on Databricks.
Reduced migration costs of large data sets across multiple cloud providers by 50%.
Owned the PySpark code that creates DataFrames from tables in the data service layer and writes them to the Hive data warehouse.
Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
Wrote SQL queries to process data using Spark SQL.
Created Scala code that uses Spark SQL to generate DataFrames from the CSV-formatted raw layer and write them to data-service-layer internal tables in Parquet format.
Wrote UDFs in PySpark to meet specific business requirements.
Built and maintained data pipelines using AWS Glue, transforming raw data into structured formats for further analysis.
Improved performance through tuning and optimization of PostgreSQL queries and ETL procedures.
Utilized AWS services (S3, Glue, Redshift) with a focus on big data analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, and flexibility.
Technologies: Hadoop, HDFS, Spark, Scala, Spark-SQL, Python, Java, Oracle, PySpark, AWS S3, AWS Glue, AWS EMR, AWS Lambda, Databricks on AWS, SNS, SQS, Apache Airflow, Terraform and GitHub.
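Illustrative sketch of the Avro-to-ORC publishing pattern described in the first bullet of this role; the S3 path, column names, and the dsl.transactions_internal table are hypothetical, and the spark-avro package is assumed to be on the Spark classpath.

    from pyspark.sql import SparkSession

    # Assumes spark-avro is available (e.g. --packages org.apache.spark:spark-avro_2.12:<version>).
    spark = SparkSession.builder.appName("raw-avro-to-dsl-orc").enableHiveSupport().getOrCreate()

    # Raw layer: Avro files landed by upstream ingestion (path is a placeholder).
    raw_df = spark.read.format("avro").load("s3://raw-layer/transactions/")

    # Example Spark SQL transformation before publishing to the data service layer
    # (column names are illustrative only).
    raw_df.createOrReplaceTempView("raw_transactions")
    dsl_df = spark.sql("""
        SELECT txn_id, account_id, CAST(amount AS DECIMAL(18,2)) AS amount, txn_ts
        FROM raw_transactions
        WHERE txn_ts IS NOT NULL
    """)

    # Data-service-layer internal table stored as ORC (table name is illustrative).
    (dsl_df.write
        .mode("overwrite")
        .format("orc")
        .saveAsTable("dsl.transactions_internal"))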
Client: USAA Bank, U.S.A July 2023 – Dec 2023
Role: Senior Big Data Engineer/Big Data Lead, New Jersey, U.S.A.
Designing the business requirement collection approach based on the project scope and SDLC methodology.
Experience writing in house UNIX shell scripts for Hadoop & Big Data Development.
Experience executing batch jobs and feeding data streams into Spark Streaming.
Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
Expertise in extending Hive and Pig core functionality by writing custom UDFs.
Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
Experience in creating complex SQL queries and SQL tuning, and in writing PL/SQL blocks such as stored procedures, functions, cursors, indexes, triggers and packages.
Have good knowledge on NoSQL databases like HBase, Cassandra and MongoDB.
Extensively used ETL methodology for performing data migration, extraction, transformation, and loading using Talend, and designed data conversions from a wide variety of source systems.
Experienced in developing and designing Web Services (SOAP and Restful Web services).
Mentor junior engineers and provide technical guidance on best practices for big data engineering and software development.
Troubleshoot and resolve performance issues, bottlenecks, and data quality issues in big data systems.
Conduct root cause analysis and resolve production problems and data issues.
Performance tuning, code promotion and testing of application changes.
Design and implement multiple ETL solutions with various data sources by extensive SQL Scripting, ETL tools, Python, Shell Scripting, and scheduling tools.
Data profiling and data wrangling of XML, web feeds, and files using Python, UNIX and SQL.
Used REST APIs with Python to ingest data from external sources into BigQuery (see the sketch following this role's technology list).
Hands-on experience in GCP: BigQuery, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell and Dataproc.
Integrated data from various sources into Excel, allowing for comprehensive analysis and reporting.
Utilized automation tools like Git, Terraform, and Jenkins to streamline and automate data pipeline processes.
Define and enforce coding standards, data governance policies, and documentation practices to ensure maintainability and scalability of big data solutions.
Redesigned cloud-based data warehouse to enhance security and improve performance.
Enhanced quality of data insights through implementation of automated data validation processes and improved access to data sources.
Designed and implemented data pipelines in Databricks for ingesting, cleaning, and transforming data from various sources.
Skilled in troubleshooting Control-M job failures, analysing logs, and implementing optimizations to enhance system performance and reduce downtime, resulting in improved operational efficiency.
Worked on Dimensional and Relational Data Modelling using Erwin, Star, and Snowflake Schemas, OLTP/OLAP system, and Conceptual, Logical, and Physical data modelling.
Technologies: Hadoop, Hive, HDFS, Spark, Scala, Spark-SQL, Spark Streaming, Python, Kafka, Impala, Zookeeper, Sqoop, Java, Oracle, PySpark, GCP BigQuery, GCP Dataproc, GCP Cloud Dataflow and GCP Pub/Sub.
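A hedged sketch of the REST-to-BigQuery ingestion pattern mentioned above, using the requests library and the google-cloud-bigquery client; the API endpoint, project/dataset/table id, and record shape are hypothetical placeholders.

    import requests
    from google.cloud import bigquery

    # Hypothetical REST endpoint and BigQuery table; both are placeholders.
    API_URL = "https://api.example.com/v1/records"
    TABLE_ID = "my-project.my_dataset.api_records"

    def ingest_to_bigquery():
        # Pull a page of JSON records from the source API.
        response = requests.get(API_URL, timeout=30)
        response.raise_for_status()
        rows = response.json()          # expected: a list of flat JSON objects

        # Stream the rows into BigQuery; the target table must already exist.
        client = bigquery.Client()
        errors = client.insert_rows_json(TABLE_ID, rows)
        if errors:
            raise RuntimeError(f"BigQuery insert errors: {errors}")

    if __name__ == "__main__":
        ingest_to_bigquery()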
Client: Citi Bank, U.S.A Mar 2019 – May 2023
Role: Technology Lead/Big Data Engineer, Wipro, Hyderabad, India
Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
Wrote SQL queries to process data using Spark SQL.
Used the Sqoop tool to import data; created and executed SRB (Sqoop Run Book) jobs using shell scripts.
Developed REST APIs using Scala and Play framework to retrieve processed data from Cassandra database.
Designed and implemented highly performant data ingestion pipelines from multiple sources using Apache Spark and Databricks Delta Lake.
Wrote UDFs in Scala to meet specific business requirements.
Stayed up to date with the latest GCP services and features and evaluated their potential use in the organization's data infrastructure.
Managed access control and permissions for Databricks resources to safeguard sensitive data and maintain data integrity.
Integrated Databricks with cloud platforms (e.g., AWS, Azure) to leverage cloud-native services for data storage, processing, and analytics.
Orchestrated Databricks jobs and workflows using cloud-native services such as AWS Lambda and Azure Functions.
Extracted data from MySQL and PostgreSQL databases using Sqoop and stored it in HDFS.
Developed REST APIs using Scala, the Play framework and Akka.
Implemented a search micro service (Scala, REST, and Play Framework).
Used Spark DataFrames to transfer data from AWS to PostgreSQL (see the sketch following this role's technology list).
Participated in code reviews and contributed to the development of best practices for data engineering on GCP.
Developed Scala and Python software in an agile environment using continuous integration.
Extensively worked on Scala and experience building enterprise-level software solutions.
Extensively worked on deploying & maintaining Hadoop clusters.
Strong hands-on technical experience delivering solutions using Scala with Object Oriented Design and Programming, Distributed Architecture Design.
Strong Scala development experience and understanding of functional programming concepts.
Created and consumed RESTful web services in Scala and Play using Akka.
Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
Orchestrated all data pipelines using Airflow to interact with services such as Azure Databricks, Azure Data Factory and Azure Data Lake.
Loaded data from the file system into Spark RDDs.
Developed a wrapper script to execute Spark code, and wrote Spark code to transform source and target tables and files used to generate reports.
Created Autosys jobs; handled RTC and Bitbucket code check-in and deployment.
Proficient in data migration to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse using Azure Data Factory.
Generated report on predictive analytics using Python and Power BI including visualizing model performance and prediction results.
Technologies: Hadoop, Hive, HDFS, Spark, Scala, Spark-SQL, Cloudera CDH, Play, Spark Streaming, Python, Kafka, MapReduce, Impala, Akka, Zookeeper, Sqoop, Java, Oracle, PySpark, AWS S3, AWS Glue, Databricks on AWS, Azure Data Factory, Azure Data Lake, Azure Blob Storage.
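A hedged PySpark sketch of the AWS-to-PostgreSQL transfer pattern mentioned above; the S3 bucket, JDBC connection details, and table names are hypothetical, and the PostgreSQL JDBC driver jar is assumed to be available to the Spark job.

    from pyspark.sql import SparkSession

    # Assumes the PostgreSQL JDBC driver is available to Spark (e.g. --jars postgresql-<version>.jar).
    spark = SparkSession.builder.appName("s3-to-postgres").getOrCreate()

    # Source data staged on S3 (bucket and prefix are placeholders).
    df = spark.read.parquet("s3a://example-bucket/curated/customers/")

    # Write the DataFrame into PostgreSQL over JDBC; connection details are illustrative only.
    (df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/analytics")
        .option("dbtable", "public.customers")
        .option("user", "etl_user")
        .option("password", "REDACTED")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save())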
Client: Southern California Edison, U.S.A Feb 2015 – Mar 2019
Role: Data Engineer, Infosys, Hyderabad, India.
Developed PySpark programs, created DataFrames, and worked on transformations.
Wrote HiveQL scripts to create complex tables with performance features such as partitioning, clustering, and skew handling.
Involved in creating Hive tables, loading data in formats such as Avro, JSON, CSV, TXT, and Parquet, and writing Hive queries to analyse data using HQL.
Created Hive tables on HDFS in Parquet format to store data processed by Apache Spark on the Cloudera Hadoop cluster.
Handled requests to the systems using the Play MVC framework.
Worked with different data sources such as HDFS, Hive, and Teradata for Spark processing.
Used Spark to process data before ingesting it into HBase; created both batch and real-time Spark jobs using PySpark.
Responsible for data extraction and data ingestion from different data sources into Azure Data Lake Store by creating ETL pipelines using Spark and Hive.
Designed and developed Hadoop MapReduce programs and algorithms for analysis of cloud-scale classified data stored in Cassandra.
Analysed requirements, contributed to data lake design, and implemented business rules per requirements using Python.
Loaded Hive tables and analysed data using Hive queries.
Understood data mapping from source to target tables and the business logic used to populate the target tables.
Involved in configuration and deployment activities.
Coordinated with the onsite team on status.
Created and maintained a Python-based ETL pipeline to extract semi-structured and structured data.
Used the Oozie workflow engine to run multiple Hive and Pig scripts, and used Kafka for real-time processing by loading log file data directly into HDFS for analysis of data sets in HDFS storage.
Optimized Hive tables using techniques such as partitioning and bucketing to deliver better HiveQL query performance (see the sketch following this role's technology list).
Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality.
Responsible for Cluster maintenance, Monitoring, commissioning, and decommissioning Data nodes, Troubleshooting, Managing and reviewing data backups, Manage & review log files.
Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
Implemented partitioning, dynamic partitions, and buckets in Hive.
Performed DB activities such as indexing, performance tuning, and backup and restore.
Expertise in building ETL pipelines using NoSQL and SQL from AWS and Big Data technologies.
Skilled in data modelling concepts including Star-Schema Modelling and fact/Dimension tables.
Excellent troubleshooting skills for Spark applications and performance tuning.
Proficient in using tools like Power BI and Excel for data analysis and visualization.
Developed Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats.
Technologies: Hortonworks, Cloudera, PySpark, Hadoop, Hive, Python, Oracle Database, SQL database, MapReduce, Impala, Play, Zookeeper, Sqoop, Java, Oracle 10g, Apache NiFi, Akka and Agile methodologies.
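A minimal PySpark sketch of the Hive partitioning and bucketing pattern referenced above; the input path, column names, and the analytics.events_partitioned table are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partitioned-hive-write")
             .enableHiveSupport()
             .getOrCreate())

    # Source DataFrame (path and columns are placeholders).
    events = spark.read.parquet("hdfs:///data/processed/events/")

    # Partition by event date for partition pruning; bucket by customer id for join performance.
    (events.write
        .mode("overwrite")
        .format("parquet")
        .partitionBy("event_dt")
        .bucketBy(32, "customer_id")
        .saveAsTable("analytics.events_partitioned"))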
Client: Walmart, U.S.A Dec 2013 – Jan 2015
Role: Senior Software Engineer, Infosys, Hyderabad, India.
Created test data for identified scenarios; involved in the execution of test cases to validate the functionality of the application.
Coordinating with the Development team to judge the impact on the applications.
Preparing the test plan for initiatives.
Defect retesting and closure activities.
Successfully delivered multiple releases of the project.
Ensuring that the test scripts work without any defects.
Interacted with the development team for issue validations and clarifications.
Responsible for defect tracking and reporting the defects as per Defect Life Cycle using HP ALM defined by TCOE Process.
Coordinated with the onsite team on status.
Reported bugs found in the testing for the review of the development team.
Technologies: Java, J2EE, JSP, Servlets, HTML, XML, JavaScript, WebLogic, Oracle 9i, TOAD.
Client: Standard Bank, South Africa Dec 2010 – Nov 2013
Role: Senior Software Engineer, Infosys, Hyderabad, India.
Involved in the design and development of the Enterprise module.
Worked with the Struts framework to create the web application.
Developed Servlets, JSP and Java Beans using Eclipse.
Designed and developed struts action classes for the controller responsibility.
Created front-end interfaces and Interactive user experience using HTML, CSS, and JavaScript.
Responsible for validation of Client interface JSP pages using Struts form validations.
Developed and enhanced dynamic web applications using MVC framework, increasing web traffic by 20%.
Optimized database queries, reducing server load by nearly 15% and boosting overall application performance.
Implemented a customized exception handling framework that led to a 25% decrease in time spent supporting production.
Technologies: Java, J2EE, JSP, Servlets, Struts 2.0, Spring 3.0, HTML, XML, JavaScript, WebLogic Server, Oracle and TOAD.
Client: Makro Technologies, India Aug 2009 – Oct 2010
Role: Software Developer, Hyderabad, India.
Released defect-free code to the client by resolving and fixing all issues raised in the testing phase.
Deployed web applications onto server environments, ensuring a 99.99% system availability.
Resolved all issues and ensured that all reports behave consistently.
Developed and customized assigned components.
Responsible for generating various JSP screens and Servlet classes.
Technologies: Java, J2EE, JSP, Servlets, HTML, XML, JavaScript, Tomcat and PostgreSQL.