
Data Engineer

Location:
Dallas, TX, 75207
Salary:
140000
Posted:
April 15, 2025


Resume:

Sai Kiran Yedulla

SUMMARY OF EXPERIENCE

Results-driven Big Data Engineer with 8+ years of experience designing and optimizing scalable data pipelines using Spark, Hadoop, and AWS/GCP/Azure. Proven track record in automating workflows, improving data processing efficiency by 40%, and collaborating with cross-functional teams to deliver data-driven solutions. Passionate about leveraging cloud technologies and real-time data processing to solve complex business problems.

Strong expertise in the Big Data ecosystem, including Spark, Hive, Sqoop, HDFS, MapReduce, Kafka, Oozie, YARN, HBase, and NiFi.

Developed production-ready Spark applications using Spark RDDs, DataFrames, Datasets, Spark SQL, and Spark Streaming.

Solid experience working with various file formats, including CSV, XML, Parquet, ORC, JSON, and Avro.

Strong knowledge of NoSQL databases, with hands-on work in HBase, Cassandra, and MongoDB.

Experience using AWS services such as Amazon EMR, S3, EC2, Redshift, and Athena.

Worked on Spark Streaming and Spark Structured Streaming with Kafka for real-time data processing.

Good knowledge of Oracle PL/SQL and shell scripting.

Worked extensively in Agile methodology, delivering projects iteratively and collaboratively.

Strong analytical and problem-solving skills, with a track record of resolving complex technical issues.

EDUCATION

Master of Engineering, University of North Texas

Bachelor of Technology (Computer Science), Vardhaman College of Engineering, Hyderabad, India

TECHNICAL SKILLS

Big Data Ecosystem: Spark, MapReduce, HDFS, Hive, HBase, Sqoop, Oozie, ZooKeeper, PySpark, Hue, Cloudera (CDH)

Cloud Services: AWS EC2, EMR, S3, Redshift, Athena, Glue, Step Functions, Lambda, IAM, S3 Event Notifications, RDS; Azure HDInsight, Azure Databricks, Azure Data Factory, Azure SQL DW; GCP BigQuery

Relational Databases: Oracle 12c, MySQL, MS SQL Server, DB2

NoSQL Databases: HBase, Cassandra, MongoDB

Programming & Scripting: Python, SQL, Scala, PL/SQL, Shell scripting, Terraform (HCL), Java, C#

Web Technologies: JavaScript, CSS, HTML and JSP

Operating Systems: Windows, UNIX/Linux, and Mac OS

IDE & Command line tools: Eclipse, IntelliJ

EXPERIENCE

Data Engineer, Goldman Sachs – Texas Nov 2024 – Present

Designed and implemented a real-time data pipeline using Apache Kafka and Spark Streaming to ingest and process high-volume data streams (see the streaming sketch at the end of this role).

Configured and monitored various AWS and GCP services such as EC2, GKE, GCE, RDS, GCS, BigQuery, and CloudWatch.

Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.

Created a Python script that called the Cassandra REST API, transformed the data, and loaded it into Hive.

Automated deployment processes using Terraform to provision and manage cloud infrastructure on AWS and Azure.

Integrated AWS Lambda with event-driven architectures using Amazon S3, DynamoDB, and Kinesis for real-time data ingestion and processing (see the Lambda handler sketch at the end of this role).

Designed, constructed, and maintained data solutions on the AWS cloud platform, utilizing services like S3, EC2, Redshift, RDS, DynamoDB, EMR, Glue, Lambda, and Step Functions.

Secured Lambda functions by managing permissions with AWS IAM roles and integrating with AWS Secrets Manager for sensitive data handling.

Built real-time data ingestion pipelines using Spark Streaming, processing data from Kafka to enable low-latency analytics.

Collaborated with the Data Science team to provide them with clean and reliable data for their analyses.

Environment: Spark, Spark Streaming, Kafka, Scala, CI/CD, AWS EMR, EC2, Glue, Lambda, Redshift, RDS, S3, Secrets Manager, Snowflake
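
Illustrative sketch of the Kafka-to-Spark Structured Streaming ingestion referenced above. The broker address, topic name, event schema, and S3 paths are placeholder assumptions, not details from the actual pipeline.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Placeholder schema for the incoming JSON events.
event_schema = StructType([
    StructField("trade_id", StringType()),
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "trades")
       .option("startingOffsets", "latest")
       .load())

# Parse the JSON payload into typed columns.
events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Land the parsed events as Parquet on S3, with a checkpoint location for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/trades/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/trades/")
         .outputMode("append")
         .start())
query.awaitTermination()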
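
A minimal sketch of the S3 event-driven Lambda pattern mentioned above, assuming a hypothetical DynamoDB table that tracks ingested objects; the table and attribute names are illustrative only.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested_objects")  # hypothetical tracking table

def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        # Record each newly created object so downstream jobs can pick it up.
        table.put_item(Item={"object_key": key, "bucket": bucket, "size_bytes": size})
    return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}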

Data Engineer, United Health Group – Texas Jun 2022 – Sep 2024

Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.

Partnered with ETL developers to ensure data was well cleaned and the Hive-based data warehouse stayed up to date for reporting purposes.

Selected and exported data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded them into AWS Redshift.

Performed statistical data profiling, such as cancel rate, variance, skewness, and kurtosis of trades and runs for each stock daily, grouped into 1-minute, 5-minute, and 15-minute intervals.

Used PySpark and Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse (see the indicator sketch at the end of this role).

Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, OpenShift, and pair RDDs.

Utilized Terraform to manage infrastructure on AWS, creating reusable modules for deploying scalable and secure data environments.

Integrated the Hadoop cluster with the Spark engine to perform batch and GraphX operations.

Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.

Designed custom data validation frameworks in Python to ensure data quality and consistency across pipelines (see the validation sketch at the end of this role).

Developed and validated machine learning models, including Ridge and Lasso regression, for predicting total trade amounts.

Automated data engineering workflows with Python scripts, improving efficiency and reducing manual interventions by 40%.

Leveraged Spark SQL to query structured data efficiently from distributed data stores like HDFS and Amazon S3.

Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.

Environment: Spark, AWS, AWS S3, AWS Redshift, SQL, Snowflake, Jenkins, Git.
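
Illustrative pandas sketch of the moving-average and RSI calculation mentioned in this role; the column names and 14-period window are assumptions, not the production configuration.

import pandas as pd

def add_indicators(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    # Expects 'symbol', 'date', and 'close' columns (assumed names).
    df = df.sort_values(["symbol", "date"]).copy()
    close_by_symbol = df.groupby("symbol")["close"]

    # Simple moving average per symbol.
    df["moving_avg"] = close_by_symbol.transform(lambda s: s.rolling(window).mean())

    # RSI: ratio of average gains to average losses over the window.
    delta = close_by_symbol.transform(lambda s: s.diff())
    gain = delta.clip(lower=0).groupby(df["symbol"]).transform(lambda s: s.rolling(window).mean())
    loss = (-delta.clip(upper=0)).groupby(df["symbol"]).transform(lambda s: s.rolling(window).mean())
    df["rsi"] = 100 - 100 / (1 + gain / loss)
    return df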
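
A minimal sketch of a rule-based PySpark validation check along the lines of the custom data validation framework described above; the rule definitions and column names are hypothetical.

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def validate(df: DataFrame, rules: dict) -> dict:
    """Return the number of violating rows per rule; each rule is a boolean column expression."""
    return {name: df.filter(~condition).count() for name, condition in rules.items()}

# Example usage with hypothetical columns:
# violations = validate(trades_df, {
#     "price_positive": F.col("price") > 0,
#     "symbol_not_null": F.col("symbol").isNotNull(),
# })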

Big Data Developer, FactSet – India Feb 2020 – Aug 2021

Developed multiple Spark applications in Python for data extraction, transformation, and aggregation across multiple file formats, including XML, JSON, CSV, and other compressed formats.

Implemented a data pipeline to read data from DB2 using Spark SQL, load it into DataFrames, and write it out as ORC files (see the DB2-to-ORC sketch at the end of this role).

Developed a PySpark framework to generate Parquet and CSV files from Hive and Snowflake tables.

Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services.

Integrated Spark with data storage systems, particularly Azure Data Lake and Blob Storage.

Designed, built, and implemented large ETL pipelines using PySpark and Azure Data Factory.

Implemented logging and error-handling mechanisms in Python code to ensure robust and maintainable data workflows.

Developed a JSON flattening framework using JSON schemas in Spark (see the flattening sketch at the end of this role).

Developed test scripts for unit and integration testing.

Good experience with Unix commands.

Environment: PySpark, Hive, Sqoop, Python, Snowflake, SQL, DB2, Spark SQL, AWS S3, EMR, EC2.
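
Illustrative sketch of the DB2-to-ORC pipeline described in this role, assuming a generic JDBC connection; the URL, credentials, table name, and output path are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-orc").getOrCreate()

# Read the source table over JDBC into a DataFrame (connection details are placeholders).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "SCHEMA.TRADES")
      .option("user", "db2user")
      .option("password", "db2password")
      .load())

# Write the result out as ORC (placeholder path).
df.write.mode("overwrite").orc("s3a://example-bucket/orc/trades/")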
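
A minimal sketch of the JSON flattening idea mentioned above: nested struct columns are promoted to top-level columns with underscore-joined names. Array handling (explode) is omitted, and this is a generic illustration rather than the original framework.

from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten(df: DataFrame) -> DataFrame:
    # Repeatedly expand struct columns until none remain.
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        if not struct_cols:
            return df
        selected = []
        for field in df.schema.fields:
            if isinstance(field.dataType, StructType):
                # Promote each child field to a top-level column.
                for child in field.dataType.fields:
                    selected.append(col(field.name + "." + child.name)
                                    .alias(field.name + "_" + child.name))
            else:
                selected.append(col(field.name))
        df = df.select(selected)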

Big Data Developer, Wipro – India Aug 2016 – Dec 2019

Converted Hive/SQL queries into Spark transformations using Scala.

Created Sqoop Scripts to import customer profile data from RDBMS and to export it to S3 buckets.

Created DataFrames by loading data from Hive tables using Spark SQL, prepared the data, and stored it in AWS.

Tuned Spark jobs for optimal performance by configuring memory management, partitioning, and caching strategies.

Loaded customer data and event logs from Kafka into HBase using the REST API.

Developed various enrichment applications in Spark using Scala to cleanse and enrich clickstream data with customer profile lookups.

Worked on fine-tuning and performance enhancement of various Spark applications and Hive scripts.

Used Spark concepts such as broadcast variables, caching, and dynamic allocation to design scalable Spark applications (see the broadcast-join sketch at the end of this role).

Created Unix scripts for importing data into Hadoop.

Environment: Hive, Hadoop, Sqoop, Python, Snowflake, REST API, Spark SQL, AWS S3, EMR, EC2.
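
The enrichment and broadcast work in this role was done in Scala; the PySpark sketch below illustrates the same broadcast-join idea for clickstream enrichment with a small customer-profile lookup. The paths and join key are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("clickstream-enrichment").getOrCreate()

clicks = spark.read.parquet("s3a://example-bucket/clickstream/")            # large fact data (placeholder path)
profiles = spark.read.parquet("s3a://example-bucket/customer_profiles/")    # small dimension table (placeholder path)

# Broadcasting the small profile table avoids shuffling the large clickstream dataset.
enriched = clicks.join(broadcast(profiles), on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("s3a://example-bucket/enriched_clicks/")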


