
Data Engineer Real-Time

Location:
Hyderabad, Telangana, India
Posted:
September 10, 2025

Resume:

Sohin Raj Seedarla

DATA ENGINEER

+1-831-***-****

Mail: ***********@*****.***

PROFESSIONAL SUMMARY:

Data Engineer with 5+ years of IT experience and expertise in cloud platforms such as AWS, Azure, and GCP, used to design, analyze, and develop ETL data pipelines and Snowflake solutions. Experienced in developing both batch and real-time streaming data pipelines using cloud services, Python, and PySpark scripts.

Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.

Hands-on expertise in importing and exporting data between relational databases and HDFS, Hive, and HBase using Sqoop. Proficient in transporting and processing real-time event streams using Kafka and Spark Streaming.

Excellent in high-level design of ETL and SSIS packages for integrating data over OLE DB connections from heterogeneous sources (Oracle, Excel, CSV, flat files, and text-format data) using transformations provided by SSIS such as Data Conversion, Conditional Split, Bulk Insert, Merge, and Derived Column.

Experience with Hadoop ecosystem components: HDFS (storage), Spark and MapReduce (processing), Hive, Pig, Sqoop, YARN, and AWS Glue.

Experience with HBase, Cassandra, and MongoDB NoSQL databases, and in creating Sqoop scripts to transfer data from Teradata and Oracle to the big data environment.

Good experience building Amazon Redshift data warehouse solutions; worked on numerous projects to migrate data from on-premises databases to Amazon Redshift, RDS, and S3; implemented a variety of analytics methods using Cassandra with Spark and Scala.

Experience in Migration projects from on-premises Data warehouse to Azure Cloud.

Hands-on experience building serverless architectures using AWS Lambda, API Gateway, Route 53, and S3 buckets, converting existing AWS infrastructure to serverless (AWS Lambda, AWS Kinesis).

Good command of version control systems such as CVS, SVN, Git/GitHub, and Bitbucket, and issue-tracking tools like Jira and Bugzilla.

Experience in Data Integration and Data Warehousing using various ETL tools like IBM WebSphere Data Stage PX, Apache Nifi, and DBT.

Expertise in cloud-based technologies such as AWS, GCP, and Azure.

Experience with Microsoft BI tools (SSIS, SSRS, Azure Data Factory, Power BI): Integration Services (SSIS) and Data Factory for ETL (extraction, transformation, and loading), and Reporting Services (SSRS) and Power BI for reporting.

Experienced in optimizing OLAP queries and aggregations to enhance query performance and ensure timely delivery of analytical insights.

Proficient in Quorum for distributed data processing and analytics.

Working knowledge of AWS cloud services: EC2, EBS, VPC, RDS, SES, ELB, Auto Scaling, CloudFront, CloudFormation, ElastiCache, API Gateway, Route 53, CloudWatch, and SNS.

Experience with Kafka 0.10.1 producers and stream processors for processing real-time data; used Kinesis to build the stream process and stored the resulting data in an S3 data lake.

In-depth knowledge of and practical expertise with the Python-based TensorFlow and scikit-learn frameworks for machine learning and AI.

Strong background in web application development using Python, Django, Amazon, HTML, CSS, JavaScript, and PostgreSQL.

Solid understanding of Relational Database Systems (RDBMS), Normalization, Multi-Dimensional (Star), and Snowflake schema.

Proficient with big data technologies like Hadoop 2.x, HDFS, MapReduce, and Spark, and with statistical programming languages like R and Python.

Experience in building real-time streaming data pipelines using Apache Flink and Apache Kafka for event-driven architectures.

EDUCATION:

M.S. in Data Analytics and Management, Indiana Wesleyan University, Marion, Indiana (2023-2024)

BBA.LLB, Symbiosis Law School, Hyderabad (2016-2019)

TECHNICAL SKILLS:

Big Data Technologies: Zookeeper, Spark, Cloudera, HDFS, YARN, Oozie, Sqoop, Hive, Impala, Apache Flume, Hadoop, Apache Airflow, HBase, MapReduce

Programming Languages: Java, SQL, Scala, PowerShell, Python, C, C++, PL/SQL, T-SQL

Cloud Services: Azure Data Factory, Blob Storage, Azure Event Hubs, Azure SQL DB, Databricks, AWS RDS, Amazon SQS, Amazon S3, Azure Data Lake Storage Gen2, GCP, AWS EMR, AWS S3, Redshift, Glue, Google Cloud Functions, GCS Bucket, Lambda, AWS SNS, BigQuery, Dataflow, Pub/Sub, Cloud Shell

Databases: Oracle, MySQL, Teradata, MS Access, Snowflake, SQL Server

NoSQL Databases: DynamoDB, HBase, Cassandra DB, MongoDB

Data Integration: Apache Airflow, Matillion, Apache Flink

Visualization & ETL Tools: SSRS, Tableau, Talend, SSIS, DBT, Power BI, Informatica

Version Control and Containerization Tools: Kubernetes, Bitbucket, Docker, GitHub

Operating Systems: Linux, Unix, Mac OS, Windows

PROFESSIONAL EXPERIENCES

Client: Visa - Austin, TX Nov 2022 to Present

Role: Data Engineer

Responsibilities:

Designed and set up Enterprise Data Lake to provide support for various use cases, including Analytics, processing, storing, and Reporting of voluminous, rapidly changing data.

Designed and implemented end-to-end data pipelines in Snowflake using SQL, Python, and ETL tools (e.g., Apache Airflow, DBT, Talend).

Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation, and ensuring Integrity in a relational environment by working closely with the stakeholders & solution architect.

Designed and developed a Security Framework to provide fine-grained access to objects in AWS S3 using AWS Glue, Lambda, and DynamoDB.

Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators, as in the sketch below.
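
A minimal, illustrative Airflow DAG of the kind described above; the DAG id, schedule, and extract/load callables are hypothetical placeholders, not the actual project code:

    # Illustrative sketch only: hypothetical DAG id, schedule, and task callables.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_to_gcs(**context):
        # Placeholder: pull data from the source system and stage it in GCS.
        pass

    def load_to_bigquery(**context):
        # Placeholder: load the staged files from GCS into a BigQuery table.
        pass

    with DAG(
        dag_id="example_gcp_etl",  # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_to_gcs", python_callable=extract_to_gcs)
        load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

        extract >> load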

Developed scalable ETL pipelines using PySpark to process and transform large volumes of structured and semi-structured data.
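
A minimal sketch of this kind of PySpark ETL job, assuming hypothetical S3 paths and column names:

    # Illustrative sketch: read semi-structured JSON, apply simple transformations,
    # and write partitioned Parquet. Paths and columns are assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("example_etl").getOrCreate()

    raw = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path

    cleaned = (
        raw.filter(F.col("event_type").isNotNull())
           .withColumn("event_date", F.to_date("event_ts"))
           .dropDuplicates(["event_id"])
    )

    cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/curated/events/"  # hypothetical path
    )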

Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts of enterprise data from BigQuery.

Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.

Set up Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig, and MapReduce access for new users.

Performed end-to-end Architecture & implementation assessment of various AWS services like Amazon EMR, Redshift, and S3.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Used the Spark SQL interfaces for Scala and Python (PySpark), which automatically convert RDDs of case classes to schema RDDs.

Imported data from sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.

Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 resource costs (illustrative sketch below).
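
A simplified sketch of such a cleanup function, assuming a hypothetical 90-day age cutoff; a production version would also confirm an AMI is not referenced by running instances or launch templates before deregistering it:

    # Illustrative Lambda handler: deregisters self-owned AMIs older than 90 days.
    import datetime

    import boto3

    def lambda_handler(event, context):
        ec2 = boto3.client("ec2")
        cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=90)

        for image in ec2.describe_images(Owners=["self"])["Images"]:
            created = datetime.datetime.fromisoformat(
                image["CreationDate"].replace("Z", "+00:00")
            )
            if created < cutoff:
                ec2.deregister_image(ImageId=image["ImageId"])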

Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.

Created AWS Glue jobs and crawlers to catalog data and automate schema discovery across various data sources.
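
A small Boto3 sketch of creating and starting a Glue crawler of the kind described above; the crawler name, IAM role, database, and S3 path are hypothetical placeholders:

    # Illustrative sketch: register a crawler that catalogs an S3 prefix.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="example-sales-crawler",                            # hypothetical
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical
        DatabaseName="example_catalog_db",                       # hypothetical
        Targets={"S3Targets": [{"Path": "s3://example-bucket/curated/sales/"}]},
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    )

    glue.start_crawler(Name="example-sales-crawler")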

Developed a reusable framework to be leveraged for future migrations that automate ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.

Conducted Data blending and data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to the Tableau server.

Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near real-time log analysis and end-to-end transaction monitoring.

Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.

Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.

Built real-time streaming data pipelines using Apache Flink and Apache Kafka for event-driven architectures.

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Dataflow, Data Lake, BigQuery, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Apache Pig, Python, SSRS, Tableau, Apache Flink.

Client: Global Tele Systems, India May 2019 to Aug 2022

Role: Azure Data Engineer

Responsibilities:

Met with business/user groups to understand the business process, gather requirements, and analyze, design, develop, and implement solutions according to client requirements.

Migrated large-scale datasets from legacy systems and cloud storage (AWS S3, Azure Blob, GCS) into Snowflake using Snowpipe, COPY INTO, and external stages.
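
A minimal sketch of this load pattern using the Snowflake Python connector, assuming a pre-created external stage and hypothetical account, table, and stage names:

    # Illustrative sketch: COPY INTO from an external stage; all identifiers
    # and credentials below are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",
        user="example_user",
        password="example_password",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )

    try:
        conn.cursor().execute("""
            COPY INTO RAW.EVENTS
            FROM @EXTERNAL_STAGE/events/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
    finally:
        conn.close()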

Designed and developed Azure Data Factory (ADF) pipelines extensively for ingesting data from relational and non-relational source systems to meet business functional requirements.

Designed and Developed event-driven architectures using blob triggers and Data Factory.

Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.

Automated jobs using different ADF triggers such as event, schedule, and tumbling window triggers.

Created and provisioned different Databricks clusters, notebooks, jobs, and autoscaling.

Ingested huge volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2.

Created several Databricks Spark jobs with PySpark to perform table-to-table operations.

Performed data flow transformation using the data flow activity.

Worked with AWS Glue Job Bookmarks to handle incremental data processing and avoid duplicate loads.

Developed streaming pipelines using Apache Spark with Python.
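
A minimal Spark Structured Streaming sketch in Python of the kind of pipeline described above, reading from Kafka and writing Parquet; the broker address, topic, schema, and paths are illustrative assumptions:

    # Illustrative streaming sketch: Kafka source -> JSON parse -> Parquet sink.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("example_streaming").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
        .option("subscribe", "events")                     # hypothetical topic
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "/mnt/curated/events/")            # hypothetical path
        .option("checkpointLocation", "/mnt/checkpoints/events/")
        .start()
    )
    query.awaitTermination()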

Created and provisioned multiple Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.

Improved performance by optimizing the compute time needed to process streaming data and reduced costs for the company by optimizing cluster run time.

Performed ongoing monitoring, automation, and refinement of data engineering solutions.

Designed and developed a new solution to process near real-time (NRT) data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues (see the sketch below).
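
A small sketch of publishing NRT events to Azure Event Hubs with the azure-eventhub SDK; the connection string and hub name are placeholders:

    # Illustrative producer sketch; a real connection string would be supplied
    # via configuration, not hard-coded.
    import json

    from azure.eventhub import EventData, EventHubProducerClient

    producer = EventHubProducerClient.from_connection_string(
        conn_str="Endpoint=sb://example-namespace.servicebus.windows.net/;...",  # placeholder
        eventhub_name="example-hub",
    )

    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps({"device_id": "sensor-01", "reading": 42})))
        producer.send_batch(batch)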

Created a linked service to land data from an SFTP location into Azure Data Lake.

Extensively used SQL Server Import and Export Data tool.

Worked with complex SQL views, stored procedures, triggers, and packages in large databases across various servers.

Experience working with both Agile and Waterfall methodologies in a fast-paced environment.

Generated alerts on daily event metrics for the product team.

Extensively used SQL queries to verify and validate database updates.

Suggested fixes for complex issues through thorough analysis of the root cause and impact of each defect.

Provided 24/7 on-call production support for various applications, provided resolutions for night-time production job failures, and attended conference calls with business operations and system managers to resolve issues.

Environment: Azure Data Factory (ADF v2), Azure SQL Database, Azure Functions, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, UNIX shell scripting, Azure PowerShell, Databricks, Python, ADLS Gen2, Azure Cosmos DB, Azure Event Hub, Azure Machine Learning.


