Data Engineer

Location:
Long Beach, CA
Posted:
February 21, 2025

SURAGANI TEJASRI

Data Engineer

*****************@*****.*** +1-913-***-****

OBJECTIVE

Results-driven Data Engineer with 5+ years of hands-on experience in designing and building highly scalable Big Data solutions, including data architecture, data warehousing, and data pipelines. Proficient in Postgres, Scala, Hadoop, Hive, Apache Spark, Python, and cloud computing platforms like GCP and Azure. Seeking to contribute my expertise to a leadership role, driving complex projects, mentoring teams, and enhancing business outcomes with innovative technical solutions.

TECHNICAL SKILLS

Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy, Falcon.

NoSQL Databases: HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, and Apache.

Languages: Scala, Python, R, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, HiveQL, Unix, Shell Scripting

Source Code Control: GitHub, CVS, SVN, ClearCase

Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure, GCP

Databases: PostgreSQL, Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2

DB Languages: MySQL, PL/SQL, PostgreSQL, Oracle

Build Tools: Jenkins, Maven, Ant; Logging: Log4j

Business Intelligence Tools: Tableau, Power BI

Development Tools: Eclipse, IntelliJ IDEA, Microsoft SQL Studio, Toad, NetBeans

ETL Tools: Talend, Pentaho, Informatica, Ab Initio, SSIS

Development Methodologies: Agile, Scrum, Waterfall, V model, Spiral, UML

WORK EXPERIENCE

Azure Data Engineer

Synchrony, Kansas City, Kansas, USA, Apr 2023 - Present

Synchrony Financial is an American consumer financial services company. I design and manage data storage solutions with Azure Data Lake, Azure Blob Storage, and Azure SQL Database, and implement ETL processes using Azure Data Factory, Databricks, and SSIS to move data from various sources into a central repository.

Key Responsibilities and Achievements:

Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, as well as compression codecs such as Gzip, Snappy, and LZO.

Leveraged Apache Beam and GCP Dataflow for building data streams, supporting real-time analytics.

Built scalable data pipelines and microservices using Python, PySpark, and SQLAlchemy for data processing in the cloud.

Used Power BI as a front-end BI tool to design and develop dashboards, workbooks, and complex aggregate calculations.

Imported real-time weblogs using Kafka as the messaging system, ingested the data into Spark Streaming, performed data quality checks in the stream, and flagged records as bad or passable.
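
A minimal PySpark Structured Streaming sketch of this kind of quality-check flow; the broker, topic, schema, and output path below are illustrative placeholders rather than the actual project values:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("weblog-quality").getOrCreate()

# Placeholder schema; the real weblog payload was richer than this.
schema = StructType([
    StructField("url", StringType()),
    StructField("status", IntegerType()),
    StructField("user_id", StringType()),
])

logs = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "weblogs")                       # placeholder topic
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("log"))
        .select("log.*"))

# Flag rows with missing keys or out-of-range status codes as "bad".
flagged = logs.withColumn(
    "quality_flag",
    F.when(F.col("user_id").isNull() | ~F.col("status").between(100, 599), "bad")
     .otherwise("passable"))

(flagged.writeStream
 .format("delta")                                             # assumption: Delta sink on Databricks
 .option("checkpointLocation", "/tmp/chk/weblogs")
 .start("/mnt/curated/weblogs"))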

Collaborated with data scientists to build pipelines integrating heterogeneous data sources and optimizing for data science applications.

Developed data pipelines and workflows using Azure Databricks to process and transform large volumes of data, utilizing programming languages such as Python, Scala, or SQL.

Created interactive dashboards and generated insights through business intelligence tools, driving process improvements.

Responsible for estimating cluster size, monitoring, and troubleshooting the Spark Databricks cluster.

Created data tables using PyQt to display customer and policy information and to add, delete, and update customer records.

Developed data architectures and data models for financial analytics using Azure Synapse and Power BI, ensuring data quality and enhancing reporting capabilities.

Worked with Postgres and Azure SQL to ensure the integrity and optimization of databases and built scalable data processing frameworks using Apache Spark.

Mentored junior engineers, providing technical support and guidance in best practices for data engineering and project delivery.

Managed cross-functional projects, collaborating with product managers to align technical solutions with business goals and timelines.

Visualized results in Tableau dashboards and used the Python Seaborn library for data interpretation during deployment.

Developed and optimized data pipelines using Java and SQL Server to support business initiatives, ensuring high performance and efficiency.

Supported data migrations and implemented data integration workflows for cloud-based solutions, including Snowflake and BigQuery.

Utilized Git version control to streamline code development and management across multiple environments.

Designed and implemented Infrastructure as code using Terraform, enabling automated provisioning and scaling of cloud resources on Azure.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics; ingested data into one or more Azure services and processed it in Azure Databricks.
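
A simplified Databricks/Spark SQL sketch of one such load; the storage account, container, column names, and the use of CSV input with a Delta output are illustrative assumptions, and authentication setup for the abfss paths is omitted:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # provided automatically on Databricks

# Placeholder ADLS paths; credentials for abfss access are assumed to be configured on the cluster.
raw = (spark.read
       .format("csv")
       .option("header", "true")
       .load("abfss://raw@mystorageacct.dfs.core.windows.net/sales/"))

raw.createOrReplaceTempView("raw_sales")
curated = spark.sql("""
    SELECT region,
           CAST(amount AS DOUBLE) AS amount,
           TO_DATE(order_date)    AS order_date
    FROM raw_sales
    WHERE amount IS NOT NULL
""")

(curated.write
 .mode("overwrite")
 .format("delta")
 .save("abfss://curated@mystorageacct.dfs.core.windows.net/sales/"))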

Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.

Collaborated with cross-functional teams of analysts, engineers, and data scientists to translate business requirements into technical solutions, enabling actionable insights.

Used Azure Synapse Analytics, Azure Data Lake Storage, and Azure Data Lake Analytics to query large amounts of data stored in Azure Data Lake Storage, creating a virtual data lake without having to go through ETL processes.

Conducted performance tuning and optimization of the Snowflake data warehouse, improving query execution times and reducing operational costs. Created and maintained CI/CD pipelines and applied automation to environments and applications.

Environment: API, Azure, Azure Data Lake, Azure Synapse Analytics, CI/CD, Data Factory, Docker, ETL, Hive, Jenkins, Jira, JS, Kafka, Kubernetes, MapR, Oracle, Pig, Python, RDBMS, Scala, Snowflake, Spark, Spark SQL, Spark Streaming, SQL, Tableau, Teradata

AWS Data Engineer

Credit Suisse (through Wipro), Mumbai, India, Aug 2021 - Dec 2022

Credit Suisse Group AG is a global financial services company and investment bank that provides services such as private banking, asset management, and investment banking. Automated the repetitive data engineering tasks using AWS Lambda, Step Functions, or CloudFormation. Ensured compliance with data governance policies, managing data quality and lineage using AWS tools like Glue Data Catalog.

Key Responsibilities and Achievements:

Developed PySpark applications for various ETL operations across data pipelines, enhancing data processing efficiency and reliability.

Wrote AWS Lambda functions in Spark with cross-functional dependencies, creating custom libraries for deploying the Lambda functions in the cloud. Performed raw data ingestion, which triggered a Lambda function and put refined data into ADLS. Monitored KPIs for data scripts to streamline data ingestion workflows and prepare data.
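
A minimal Python sketch of an ingestion-triggered Lambda handler of this kind; the S3 ObjectCreated trigger, bucket names, and the trivial "refinement" step are assumptions for illustration only:

import json
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event on the raw landing bucket (placeholder names).
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    obj = s3.get_object(Bucket=bucket, Key=key)
    rows = obj["Body"].read().decode("utf-8").splitlines()

    # Minimal refinement step: drop blank lines before writing to the refined zone.
    refined = "\n".join(r for r in rows if r.strip())
    s3.put_object(Bucket="refined-zone-bucket", Key=key, Body=refined.encode("utf-8"))

    return {"statusCode": 200, "body": json.dumps({"rows_ingested": len(rows)})}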

Loaded data from UNIX file system to HDFS, ensuring seamless data integration and accessibility.

Configured Spark Streaming to consume ongoing data from Kafka and store the stream in DBFS.

Analyzed and redesigned SQL scripts using Spark SQL, achieving faster performance and optimized query execution.

Automated and monitored AWS infrastructure with Terraform for high availability and reliability, reducing infrastructure management time by 90% and improving system uptime.

Developed ETL infrastructure using Python, PySpark, and AWS services such as RDS, S3, and Glue to process large datasets and ensure data quality and accuracy.

Applied data mining techniques to extract actionable insights and support business processes.

Collaborated with business analysts to understand user requirements, ensuring the design and development of systems aligned with business goals.

Designed and implemented ETL solutions for financial systems using AWS services (EMR, S3, Lambda), integrating disparate data sources and improving workflow efficiency.

Led data migration and transformation efforts to move from legacy systems to cloud-based solutions, ensuring data consistency and scalability.

Integrated Hadoop and Spark technologies with data lakes to handle large volumes of financial data, providing insights into business performance.

Contributed to the project management process by performing effort estimation, requirements intake, and prioritization of technical tasks across multiple stakeholders.

Contributed to the creation of scalable data structures and data modeling methodologies, improving reporting accuracy.

Developed and maintained data pipelines using Apache Spark and Hadoop for distributed data processing.

Leveraged Solr for real-time data search and retrieval, improving query performance.

Developed shadow pipelines to cross-check data between source systems and destinations.
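
A simplified PySpark sketch of such a shadow reconciliation check; the paths, key column, and compared measure are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shadow-check").getOrCreate()

# Placeholder locations; the real source/destination paths are environment specific.
source = spark.read.parquet("s3://source-zone/trades/")
target = spark.read.parquet("s3://curated-zone/trades/")

# Row-count check plus a column-level comparison keyed on the business identifier.
counts = {"source": source.count(), "target": target.count()}

mismatches = (source.select("trade_id", "amount")
              .join(target.select("trade_id", F.col("amount").alias("target_amount")),
                    on="trade_id", how="full_outer")
              .where(~F.col("amount").eqNullSafe(F.col("target_amount"))))

print(counts, "mismatched rows:", mismatches.count())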

Conducted performance tuning and optimization of database systems, focusing on PostgreSQL and SQL Server.

Developed container-based Docker solutions, worked with Docker images, Docker Hub and Docker registries, and Kubernetes to streamline application deployment and management.

Conducted thorough regression testing for new software releases, collaborating with vendors to resolve issues.

Migrated an existing on-premises application to AWS, utilizing services like EC2 and S3 for data processing and storage, and maintained the Hadoop cluster on AWS to enhance data management.

Environment: API, AWS, CI/CD, Docker, EC2, Elasticsearch, EMR, ETL, Git, HDFS, Jenkins, Kafka, Kubernetes, Lambda, PySpark, S3, SAS, Snowflake, Spark, Spark SQL, SQL

GCP Data Engineer

Clariant, Mumbai, India, Jun 2020 - Jul 2021

Clariant AG is a Swiss multinational specialty chemical company. Integrated data from various sources, including Cloud Spanner, Bigtable, and external APIs, into unified data solutions. Monitored and optimized the data pipelines and storage solutions using Google Stackdriver, Google Cloud Monitoring, and other GCP tools.

Key Responsibilities and Achievements:

Installed and configured various components of the Hadoop ecosystem.

Validated data and generated reports using Power BI.

Created Data Studio report to review billing and usage of services, optimizing queries and contributing to cost-saving measures.

Utilized GCP Dataproc, GCS, Cloud Functions, and BigQuery.

Leveraged GCP features including Google Compute Engine, Google Storage, VPC, Cloud Load Balancing, and IAM to improve infrastructure efficiency.

Developed data pipelines for ingestion, transformation, and storage using Snowflake and Databricks.

Managed large-scale NoSQL databases, including Cassandra and ElasticSearch, for high-volume data storage.

Created Databricks Job workflows to extract data from SQL Server and upload the files to SFTP using PySpark and Python.
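
A condensed PySpark/Python sketch of such a workflow; the JDBC connection details, table name, file path, and SFTP host/credentials are placeholders (a real job would pull secrets from a Databricks secret scope rather than hard-coding them):

import paramiko                              # assumed to be installed on the cluster
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # provided automatically on Databricks

# Placeholder SQL Server connection details.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=sales")
      .option("dbtable", "dbo.daily_orders")
      .option("user", "etl_user")
      .option("password", "***")
      .load())

# Write a small extract locally; large extracts would be written per-partition instead.
local_path = "/dbfs/tmp/daily_orders.csv"
df.toPandas().to_csv(local_path, index=False)

# Upload the extract over SFTP (placeholder host and credentials).
transport = paramiko.Transport(("sftp.partner.example.com", 22))
transport.connect(username="feeds", password="***")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put(local_path, "/inbound/daily_orders.csv")
sftp.close()
transport.close()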

Used Cloud Shell for various administrative tasks and for deploying services.

Worked with large datasets and machine learning workloads using TensorFlow and Apache Spark.

Developed monitoring and notification tools using Python.

Utilized Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK to improve cloud infrastructure management.

Deployed and managed applications in distributed storage environments with Kubernetes and Docker.

Collaborated with analysts and data scientists to develop data science workflows, ensuring seamless integration with production systems.

Delivered production support for business-critical data pipelines, ensuring compliance with organizational SLA standards.

Used Azure Data Factory to ingest data from log files and business custom applications, processed data on Databricks per day-to-day requirements, and loaded them to Azure Data Lakes.

Leveraged Cloud technologies to migrate legacy systems, enabling a 25% reduction in operational costs.

Created BigQuery authorized views for row-level security and for exposing data to other teams.
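
A minimal sketch of creating and authorizing such a view with the BigQuery Python client; the project, dataset, view name, and filter are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")         # placeholder project id

# Create the view in a dataset that the consuming team can query.
view = bigquery.Table("my-project.reporting.emea_customers")
view.view_query = (
    "SELECT customer_id, region "
    "FROM `my-project.raw.customers` "
    "WHERE region = 'EMEA'"                             # placeholder row-level filter
)
view = client.create_table(view)

# Grant the view read access to the source dataset (the "authorized view" step).
source = client.get_dataset("my-project.raw")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])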

Used Sqoop import/export to ingest raw data into Google Cloud Storage by spinning up the Cloud Dataproc cluster.

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
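
A minimal Apache Beam (Python) sketch of this streaming path; the topic, output table, schema, and parsing logic are placeholders:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ParseEvent(beam.DoFn):
    def process(self, message):
        # Pub/Sub delivers raw bytes; parse into a BigQuery-compatible dict.
        record = json.loads(message.decode("utf-8"))
        yield {"event_id": record.get("event_id"),
               "amount": float(record.get("amount", 0))}

def run():
    options = PipelineOptions(streaming=True)   # run on Dataflow by adding the usual runner flags
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | "Parse" >> beam.ParDo(ParseEvent())
         | "WriteBQ" >> beam.io.WriteToBigQuery(
               "my-project:analytics.events",   # placeholder table
               schema="event_id:STRING,amount:FLOAT",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == "__main__":
    run()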

Built data pipelines in Airflow/Composer for orchestrating ETL-related jobs using various Airflow operators, improving ETL process efficiency.
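
A small illustrative Airflow DAG of the kind described; the DAG id, schedule, and tasks are placeholders rather than the actual production jobs:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def validate_load(**context):
    # Placeholder validation step; a real check would query the warehouse.
    print("validating load for", context["ds"])

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_sales_etl",                 # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    validate = PythonOperator(task_id="validate", python_callable=validate_load)
    extract >> validate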

Environment: Airflow, Apache, Azure, Azure Data Lake, BigQuery, Data Factory, ETL, GCP, MySQL, Oracle, PySpark, Python, SDK, Spark, SQL, Sqoop, VPC

Data Engineer

Balaji Telefilms, Mumbai, India, Jun 2019 - May 2020

Balaji Telefilms is an Indian company that produces Indian soap operas in several Indian languages. Participated in the design and development of the data platform, ensuring it meets the needs of the business and supports scalable data solutions.

Key Responsibilities and Achievements:

Designed, developed, and implemented performant ETL pipelines using the Python API (PySpark) of Apache Spark.

Ingested data into Azure services, including Azure Data Lake, Azure Storage, Azure SQL, and Azure DW, and processed streaming data in Databricks.

Designed and managed SQL-based data warehouses to support real-time reporting needs and conducted root cause analysis to identify and resolve data inconsistencies in workflows.

Wrote Spark Streaming applications to consume data from Kafka topics and write processed streams to HBase.
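
A simplified PySpark Structured Streaming sketch of this Kafka-to-HBase path, assuming HBase is reached through its Thrift gateway via the happybase library; the broker, topic, schema, and table names are placeholders:

import happybase                       # assumption: an HBase Thrift gateway is available
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("kafka-to-hbase").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "media_events")                # placeholder topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_hbase(batch_df, batch_id):
    # Collects each micro-batch on the driver; acceptable for modest volumes only.
    conn = happybase.Connection("hbase-thrift-host")          # placeholder host
    table = conn.table("events")
    for row in batch_df.collect():
        table.put(row.event_id.encode(), {b"cf:payload": (row.payload or "").encode()})
    conn.close()

(events.writeStream
 .foreachBatch(write_to_hbase)
 .option("checkpointLocation", "/tmp/chk/kafka_hbase")
 .start())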

Created several Databricks Spark jobs with PySpark to perform table-to-table operations.

Collaborated with business teams to understand informational needs and deliver customized data analytics solutions.

Partnered with cross-functional teams to implement end-to-end data workflows, ensuring seamless integration across systems and platforms.

Participated in various phases of the Software Development Lifecycle (SDLC), including requirements gathering, design, development, deployment, and analysis, ensuring successful application delivery.

Created complex stored procedures, Slowly Changing Dimension (SCD) Type 2 logic, triggers, functions, tables, views, and other T-SQL code, ensuring efficient data retrieval.

Built and optimized data pipelines for processing geospatial data using Apache Spark and Hadoop, providing valuable insights into environmental and business trends.

Worked closely with business analysts and data scientists to develop data models that supported key reporting and decision-making.

Enhanced reporting capabilities by integrating data sources into Snowflake and providing clear and actionable insights through Power BI visualizations.

Worked in a collaborative Agile development environment to design and implement database architecture for business-critical applications.

Implemented Synapse Integration with Azure Databricks notebooks, reducing development work by half and improving Synapse loading performance through dynamic partition switching.

Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow optimizing data flow and processing.

Supported the development of a data warehouse, enhancing analytics capabilities across the organization.

Supported 24/7 production environments, ensuring seamless data operation and issue resolutions.

Developed and maintained data workflows and REST APIs for media content management and built efficient ETL pipelines, enabling real-time data processing for analytics.

Implemented Continuous Integration Continuous Delivery (CI/CD) for end-to-end automation of the release pipeline using DevOps tools like Jenkins.

Environment: Azure, Azure Data Lake, CI/CD, Data Factory, Docker, EC2, EMR, ETL, HBase, Hive, Java, Jenkins, Kafka, Kubernetes, PySpark, Python, Redshift, S3, Snowflake, Spark, SQL

CERTIFICATIONS

Microsoft Certified: Azure Data Fundamentals

Google Cloud Certified Associate Cloud Engineer

AWS Certified Data Engineer - Associate

EDUCATION

Master's in Computer Science

University of Central Missouri, Warrensburg, Missouri.


