Data Engineer Software Development

Location:
Irving, TX
Salary:
any
Posted:
December 06, 2023

Resume:

Rahul Reddy M

Data Engineer

ad1rbh@r.postjobfree.com

870-***-****

LinkedIn: https://www.linkedin.com/in/rahul-reddy-b3287a278/

Certified AWS Data Analytics – Specialty

AWS Certified Data Analytics – Specialty Data Engineer with extensive experience in software development and data analytics pipelines; a highly skilled and motivated Big Data Engineer with expertise in Machine Learning (ML). Proficient in scaling and deploying Big Data and cloud technologies at large scale, with a working knowledge of Java, Scala, and Python.

SUMMARY:

Over 8 years of experience in software development with a strong focus on data engineering, data analysis, and application development using Java.

Detailed understanding of the Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies, including Waterfall and Agile.

A proven track record of using big data tools for automating large-scale data pipelines.

Strong programming experience in developing rest services, automation scripts, and data engineering jobs using Java, Scala, and Python.

Good understanding of the architecture of Distributed Systems and Parallel processing frameworks for scalable data storage and distributed data processing.

Worked with HDFS, MapReduce, Hive, Yarn, Kafka, Oozie, Sqoop, and HBase tools in the Hadoop ecosystem.

Strong experience using SQL framework for performing various data cleansing, data enrichment, and data aggregation activities.

Implemented query optimization techniques within Redshift, ensuring rapid response times for complex analytical queries.

Strong experience in troubleshooting Spark application failures and fine-tuning long-running jobs.

Utilized PySpark distributed computing capabilities to process and analyze massive datasets, resulting in improved data processing speeds and the ability to handle terabytes of data seamlessly.

Designed and developed machine learning models using Amazon SageMaker, including data preprocessing, feature engineering, and hyperparameter optimization.

Utilized Spark RDD APIs, Spark DataFrames, Spark SQL, and Spark Streaming APIs.

Expertise in using broadcast variables, accumulators, and partitioning, reading text, JSON, and Parquet files, and fine-tuning various configurations in Spark.
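
A minimal PySpark sketch of this pattern (paths, column names, and config values are hypothetical placeholders, not taken from any actual project):

    # Reading JSON/Parquet inputs, a broadcast join, and common tuning knobs.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (
        SparkSession.builder
        .appName("tuning-sketch")
        .config("spark.sql.shuffle.partitions", "200")        # shuffle parallelism
        .config("spark.sql.files.maxPartitionBytes", "128m")   # input split size
        .getOrCreate()
    )

    events = spark.read.json("s3a://example-bucket/raw/events/")
    dims = spark.read.parquet("s3a://example-bucket/curated/dims/")

    # Broadcast the small dimension table to avoid shuffling the large side.
    joined = events.join(broadcast(dims), on="dim_id", how="left")

    # Repartition before writing to control output file sizes.
    joined.repartition(50).write.mode("overwrite").parquet(
        "s3a://example-bucket/output/joined/")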

Hands-on experience writing Kafka producers and consuming streaming messages from Kafka with Spark Streaming.

Worked with major distributions like Cloudera and Hortonworks, including administering the clusters using Cloudera Manager and Ambari, respectively.

Optimized AWS Glue job configurations, significantly boosting data processing speed and overall system performance.

Good understanding and experience in working with NoSQL databases like HBase, Cassandra, and DynamoDB.

Integrated Airflow with cloud platforms (e.g., AWS, Azure) to manage cloud-based data pipelines and services.

Experience working with the Amazon Web Services (AWS) cloud and its services like Snowflake, EC2, S3, RDS, SQS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, KMS, Auto Scaling, CloudFront, metastore, Athena, DynamoDB, CloudWatch, Data Pipeline, DMS, ETL, and other AWS services.

Leveraged big data technologies such as Hadoop and Spark to handle and process massive volumes of data, enabling effective data analysis and ML model training.

Implemented error-handling mechanisms within CloudFormation stacks, enabling automatic rollback in case of provisioning failures. Experience with Oozie and AWS Simple Workflow for orchestrating data ingestion and data processing jobs.

TECHNICAL SKILLS

Hadoop/Big Data: Spark, PySpark, MapReduce, Hive, Pig, Sqoop, Flume, Tableau, Talend, Oozie, Kafka, MongoDB, HBase, Impala, Informatica

Languages: Java, Python, Shell, SQL, Scala, DB SQL, MemSQL

Web Applications: Angular, Node.js, Spring Boot

Cloud: AWS, Azure

Version Control: Git, SVN, Azure DevOps Server

IDE & Build Tools: Eclipse, IntelliJ, VS Code

Hadoop Distribution: Hortonworks, Cloudera

Databases: Oracle, SQL Server, MySQL, NoSQL, Snowflake, Teradata

Build Automation Tools: Docker, Ant, Jenkins, Maven, Kubernetes, SBT

PROFESSIONAL EXPERIENCE:

Cox Automotive, Sacramento, CA June 2021 – Present

Data Engineer

Responsibilities:

Developed Python scripts for scraping charge information from the web and storing it in Azure cloud storage.

Converted unstructured web data into structured data using Python pipelines.

Utilized Spark parallelization for loading and cleaning large volumes of data.

Proficient in Apache Airflow, with hands-on experience in creating, scheduling, and monitoring complex workflows and data pipelines.
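
A minimal Airflow DAG sketch illustrating this kind of scheduled pipeline (the DAG id, task ids, and callables are hypothetical placeholders):

    # Daily scrape-then-load pipeline expressed as an Airflow 2.x DAG.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def scrape_charges():
        # placeholder: fetch charge data from the web
        pass


    def load_to_storage():
        # placeholder: land the scraped data in Azure storage
        pass


    with DAG(
        dag_id="charge_scrape_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        scrape = PythonOperator(task_id="scrape_charges", python_callable=scrape_charges)
        load = PythonOperator(task_id="load_to_storage", python_callable=load_to_storage)
        scrape >> load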

Created Azure resource groups, storage accounts, containers, Databricks, and PostgreSQL resources with Terraform and deployed them as needed.

Automated scraping and ETL processes with exhaustive logging and error handling.

Expert at profiling and summarizing data for assessing ETL pipeline quality and consistency.

Estimated cluster sizes and handled monitoring and troubleshooting of Spark Databricks clusters.

Implemented Spark clusters and configured high-concurrency clusters using Databricks to speed up the preparation of high-quality data.

Ensured data pipeline security by implementing access controls, authentication, and encryption measures within Airflow.

Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Implemented an end-to-end PySpark script for scraping automotive dealer-related information from the web and landing the data in Azure Blob Storage.
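
A hedged sketch of the landing step, assuming the hadoop-azure (wasbs) connector is available; the storage account, container, key, and schema are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dealer-ingest").getOrCreate()

    # Authenticate to the storage account (an account key is one option;
    # a SAS token or service principal works as well).
    spark.conf.set(
        "fs.azure.account.key.exampleaccount.blob.core.windows.net",
        "<storage-account-key>",
    )

    dealers = spark.createDataFrame(
        [("D-100", "Sacramento", 42), ("D-200", "Folsom", 17)],
        ["dealer_id", "city", "inventory_count"],
    )

    # Write to a wasbs:// path backed by Azure Blob Storage.
    (dealers.write
        .mode("append")
        .parquet("wasbs://dealers@exampleaccount.blob.core.windows.net/curated/dealers/"))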

Experience developing SQL applications.

Led the implementation of advanced data warehousing solutions at Cox Automotive, harnessing Snowflake's powerful capabilities.

Used tools like jq and jstream to split large multiline JSON files (>200 GB) into chunks of single-line JSON, enabling downstream parallel processing in Spark.

Experience using Python packages like json-stream for working with large JSON files in Python.
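
A hedged sketch of the streaming-split approach using the json-stream package, assuming the top-level element is a JSON array; file names are hypothetical:

    # Stream a very large JSON array and re-emit it as newline-delimited JSON
    # so downstream Spark reads can be parallelized.
    import json
    import json_stream

    with open("large_input.json", "r") as src, open("output.ndjson", "w") as dst:
        for record in json_stream.load(src):                # records stream lazily
            plain = json_stream.to_standard_types(record)   # materialize one record
            dst.write(json.dumps(plain) + "\n")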

Connected ADF linked services to import data into Snowflake.

Worked in Snowflake to create and maintain tables and views.

Executed programs using the Python APIs for Apache Spark (PySpark).

Created Jenkins pipelines for deploying Azure resources using Maven, Terraform, Artifactory, and Git.

Developed robust Extract, Transform, Load (ETL) pipelines using Snowflake's native features and integrated data from various sources, including databases, APIs, and cloud platforms.

Developed Pipelines that extracted, transformed, and loaded data from a variety of sources, including Azure SQL, Blob storage, Azure SQL Data Warehouse, write-back tool, and reverse tool in Azure Data Factory.

Environment: Azure Data Factory, Databricks (PySpark, Spark SQL), Data Lake, Azure Blob Storage, Snowflake, Airflow, Terraform, Docker, Maven, ETL, Data Warehouse, Azure SQL, GitHub, Python.

Kaiser Permanente, Pleasanton, CA Jan 2020 to May 2021

Data Engineer

Responsibilities:

Involved in writing Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to requirements.

Built application and database servers using AWS EC2, created AMIs, and used RDS for PostgreSQL.

Developed and executed data quality checks using AWS Glue to ensure the accuracy, consistency, and reliability of data loaded into Redshift.
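
A hedged sketch of such a Glue PySpark job; the catalog database, table, connection name, and bucket are hypothetical placeholders:

    from pyspark.context import SparkContext
    from pyspark.sql.functions import col
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the raw table from the Glue Data Catalog.
    claims = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="raw_claims"
    ).toDF()

    # Simple quality checks: required keys present and amounts non-negative.
    clean = claims.filter(col("claim_id").isNotNull() & (col("amount") >= 0))

    # Load the validated rows into Redshift via a catalog connection.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=DynamicFrame.fromDF(clean, glue_context, "clean_claims"),
        catalog_connection="example-redshift-conn",
        connection_options={"dbtable": "analytics.claims", "database": "examplewh"},
        redshift_tmp_dir="s3://example-bucket/tmp/",
    )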

Preprocessed and cleaned raw data, handled missing values and outliers, and transformed data into meaningful features for ML model inputs.

Responsible for developing data pipelines involving ingestion of raw JSON files, transactional data, and user profile information from on-prem data warehouses and processing them using Spark.

Integrated AWS Lambda functions with API Gateway to build efficient and scalable APIs, facilitating seamless communication between client and server.
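
A minimal sketch of a Python Lambda handler behind an API Gateway proxy integration; the resource and payload shape are hypothetical:

    import json


    def lambda_handler(event, context):
        # API Gateway proxy integration passes query parameters here.
        params = event.get("queryStringParameters") or {}
        member_id = params.get("member_id")

        if not member_id:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "member_id is required"})}

        # Placeholder lookup; a real handler would query a data store.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"member_id": member_id, "status": "active"}),
        }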

Utilized the Spark-Synapse SQL connector to write processed data to Synapse SQL directly.

Implemented comprehensive monitoring and logging solutions within Airflow to track workflow progress, detect issues, and facilitate troubleshooting.

Assisted in the development of data-driven products and services, integrating ML models into production systems for real-time decision-making.

Wrote Kafka producers for streaming real-time JSON messages to Kafka topics, processed them using Spark Streaming, and performed streaming inserts to Synapse SQL.
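
A hedged sketch of the consuming side with Spark Structured Streaming, writing each micro-batch out over JDBC as a stand-in for the Synapse connector; the broker, topic, schema, and JDBC URL are hypothetical, and the spark-sql-kafka package is assumed to be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("claims-stream").getOrCreate()

    schema = StructType([
        StructField("claim_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "claims-events")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )


    def write_batch(batch_df, batch_id):
        # Append each micro-batch to the warehouse table over JDBC.
        (batch_df.write.format("jdbc")
            .option("url", "jdbc:sqlserver://example.sql.azuresynapse.net:1433;database=dw")
            .option("dbtable", "dbo.claims_events")
            .mode("append")
            .save())


    events.writeStream.foreachBatch(write_batch).start().awaitTermination()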

Implemented Spark on EMR for processing enterprise data across the data lake in AWS.

Led the containerization efforts by creating Docker images for multiple applications, improving portability, and reducing deployment inconsistencies.

Leveraged CloudFormation to implement Infrastructure as Code practices, enabling version-controlled and repeatable infrastructure deployments.

Built and maintained a centralized data catalog using AWS Glue's metadata repository, enabling efficient data discovery and governance across the organization.

Used Scylla as a data source or destination in ETL workflows, allowing data engineers to efficiently move and transform data between different systems.

Backed up AWS PostgreSQL to S3 via a daily job run on EMR using DataFrames.
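
A hedged sketch of that daily backup using Spark DataFrames (the PostgreSQL JDBC driver is assumed on the classpath; host, credentials, table, and bucket are hypothetical):

    from datetime import date
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pg-backup").getOrCreate()

    members = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://example-rds:5432/appdb")
        .option("dbtable", "public.members")
        .option("user", "backup_user")
        .option("password", "<secret>")
        .option("driver", "org.postgresql.Driver")
        .load()
    )

    # Partition the backup by run date so each daily job writes its own prefix.
    (members.write
        .mode("overwrite")
        .parquet(f"s3://example-backup-bucket/members/run_date={date.today()}/"))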

Worked on different file formats like Text, Avro, Parquet, JSON, and flat files using Spark.

Developed a daily process for incremental imports of data from Teradata into Hive tables using Sqoop.

Orchestrated application deployments on Kubernetes clusters, ensuring high availability and seamless scaling of microservices.

Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.

Environment: AWS, Glue, Redshift, Lambda, CloudFormation, Kafka, Docker, IAS, Machine Learning, Spark, Scala, Java, Spark Streaming, Hive, Teradata, HDInsight, Synapse Analytics.

Nationwide Insurance, Columbus, Ohio Dec 2018 – Dec 2019

Data Engineer

Responsibilities:

Developed data ingestion pipelines to collect insurance and provider data from various external sources such as FTP servers and S3 buckets.

Involved in migrating existing Teradata Data Warehouse to AWS S3-based data lakes.

Involved in migrating existing traditional ETL jobs to Spark jobs on a new cloud data lake.

Good knowledge of data visualization and dashboard designing using Power BI.

In addition, wrote complex Spark applications for performing various denormalizations of the datasets and creating a unified data analytics layer for downstream teams.

Contributed to ingesting 207 source systems into the Data Lake Hadoop environment, including databases (DB2, MySQL, Oracle), flat files, mainframe files, and XML files.

Responsible for fine-tuning long-running Spark applications, writing custom Spark UDFs, and troubleshooting failures.

Designed and implemented data processing pipelines using PySpark, enabling the efficient extraction, transformation, and loading (ETL) of large-scale data from diverse sources into a unified data lake or warehouse.

Implemented Spark clusters and configured high-concurrency clusters using Databricks to speed up the preparation of high-quality data.

Involved in building a real-time pipeline using Kafka and Spark Streaming for delivering event messages to downstream application teams from an external REST-based application.

Utilized Snowflake's automatic scaling and concurrency features to ensure smooth performance during peak workloads.

Involved in creating Hive scripts for performing ad hoc data analysis required by the business teams.

Worked on procedures, triggers, crawlers, and functions.

Implemented Databricks notebooks using SQL and Python, and automated notebooks using jobs.

Utilized Cassandra as a source of data for the data processing pipeline and integrated with Apache Spark and other big data processing frameworks for batch and real-time data processing.
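
A hedged sketch of reading Cassandra into Spark via the spark-cassandra-connector (the connector package is assumed on the classpath; keyspace, table, and host are hypothetical):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cassandra-batch")
        .config("spark.cassandra.connection.host", "cassandra-host")
        .getOrCreate()
    )

    policies = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table="policies", keyspace="insurance")
        .load()
    )

    # Example batch step downstream: aggregate premiums per state.
    policies.groupBy("state").sum("premium").show()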

Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena, and the Glue metastore.

Used broadcast variables, effective and efficient joins, caching, and other Spark capabilities for data processing.

Worked on Jenkins continuous integration of applications.

Environment: AWS, Spark, PySpark, Snowflake, Hive, Teradata, SQL, HDFS, Tableau, Sqoop, Kafka, Impala, Cassandra, Oozie, HBase, Scala, Databricks.

De Shaw, Hyderabad, India Mar 2017 – Oct 2018

Big Data Engineer

Responsibilities:

Imported and exported data between the Hadoop data lake and relational systems like Oracle and MySQL using Sqoop.

Engaged in the development of Spark applications that perform ELT-like operations on the data.

Utilized Spark SQL APIs to transform existing MapReduce jobs to Spark transformations and actions.

Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables.
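
A hedged sketch of the partitioning/bucketing pattern, issued through Spark SQL with Hive support; table and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

    # Partition by trade date and bucket by customer id to speed up joins.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS trades_curated (
            trade_id STRING,
            customer_id STRING,
            amount DOUBLE
        )
        PARTITIONED BY (trade_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS PARQUET
    """)

    # A join that benefits from bucketing on customer_id.
    spark.sql("""
        SELECT t.trade_id, c.segment
        FROM trades_curated t
        JOIN customers c ON t.customer_id = c.customer_id
        WHERE t.trade_date = '2018-06-01'
    """).show()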

Involved in creating Hive external tables to perform ETL on data that is produced daily.

Validated the data being ingested into Hive for further filtering and cleansing.

Incorporated Sqoop jobs for incremental loading from RDBMS into HDFS and further applied Spark transformations.

Coordinated with project managers and evaluated priorities for all assigned tasks.

Wrote shell scripts to run Oozie workflows and execute job runs.

Used Hue to check Hive tables loaded by job runs.

Loaded data into Hive tables from Spark using the Parquet columnar format.

Created Oozie workflows to automate and produce the data pipelines.

Migrated MapReduce code into Spark transformations using Spark and Scala.

Collected and analyzed log data using Apache Flume and staged it in HDFS.

Used JIRA to track and document operational problems according to standards and procedures.

Environment: Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, MapReduce, Git, Confluence, Jenkins.

Intergraph, India June 2015 – Feb 2017

Java Developer

Responsibilities:

Designed user interfaces, object models, application server logic and schema by analyzing the requirements.

Used J2EE patterns for designing applications.

Designed UI pages using HTML, DHTML, JavaScript 1.8, jQuery, JSP, and Struts tag libraries.

Used JMS for exchanging information, such as author publication status reports, between the author and the company.

Developed the client user interface using the Model-View-Controller (MVC) architecture of Struts.

Generated WSDL files using the AXIS2 tool.

Built Spring Boot microservices for the delivery of software products across the enterprise.

Decomposed the existing monolithic code base into Spring Boot microservices.

Created a client library that provided load-balanced and fault-tolerant consumption of Spring Boot microservices from a monolithic application.

Created a POC of authentication and authorization with an OAuth2 Spring Boot microservice, utilizing JWT as the tokenization scheme for OAuth2.

Used Struts Validation framework for client/server validations.

Extensively used design patterns such as Singleton, Factory, and Abstract Factory.

Used EJB Session beans and entity beans to develop business services and persistence.

Developed a mechanism for sending and receiving SOAP messages over JMS via the MQ Series engine.

Implemented business logic using Java Beans for the front end, with storage and retrieval from the backend Oracle DB using SQL queries, functions, sequences, triggers, and cursors.

Followed coding guidelines while implementing the code.

Extensively involved in unit testing, coordinated with the testing team, and fixed bugs at various stages of application development.

Developed Message-Driven Beans (MDBs) involved in building and accessing web services using SOAP over JMS.

Developed web services using the Apache AXIS2 framework.

Implemented web services using the SOAP protocol, UDDI, WSDL, and Service-Oriented Architecture (SOA) concepts.

Environment: Java/J2EE, JSP, Servlets, Java Beans, JSTL, JavaScript 1.8, EJB session beans, entity beans, JMS, HTML 4, DHTML 2.0, Struts, Eclipse, Apache Tomcat 6.0, CVS, Java/J2EE design patterns, EditPlus, JUnit, Oracle SQL, QC, etc.


