SOWMYA ALLURI
+1-813-***-**** ****************@*****.*** LinkedIn
PROFESSIONAL SUMMARY
8+ years of professional experience in Information Technology, including 6 years as a Data Engineer, with expertise in Database Development, Integration, Implementation, ETL Development, SDLC, Data Modeling, Report Development, Big Data Technologies, and full-cycle Testing in both Waterfall and Agile methodologies.
Performed map-side joins on GCP and imported data from sources such as HDFS and HBase into Spark RDDs; worked extensively with the Spark framework for data wrangling to implement use cases based on business requirements.
Experienced in using ETL tools such as Informatica PowerCenter, Apache NiFi, and Talend for data integration, transformation, and loading processes.
Experience building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation, leveraging GCP services such as GCS, Pub/Sub, Dataflow, Dataprep, and BigQuery, and using Python extensively for data manipulation, file handling, and automation, enabling both batch and real-time analytics while maintaining system efficiency and scalability.
Proficient in designing, developing, and maintaining secure and reliable ETL solutions using Talend to support critical business processes across multiple Business Units.
Orchestrated containerized applications using Google Kubernetes Engine (GKE), deploying microservices architecture that improved scalability and reliability by leveraging automated scaling and self-healing capabilities.
Expert in developing SSIS/DTS Packages to Extract, Transform and Load (ETL) data into data warehouses/data marts from heterogeneous sources.
Proficient in UNIX shell scripting for automation and task scheduling.
Experienced in Perl and Python scripting for system administration and data processing tasks.
Designed comprehensive relational and dimensional data models that optimized database performance and facilitated efficient data storage, retrieval, and analytics across diverse business domains.
Demonstrated in-depth expertise in utilizing GCP services, successfully migrating data to Snowflake, and streamlining data procedures to create reliable analytics platforms.
Solid background in building data analysis scripts with the Python, PySpark, and Spark APIs. Extensively used Python libraries such as Pandas, PyTest, PyMongo, Oracle, PyExcel, Boto3, NumPy, Beautiful Soup, Matplotlib, Seaborn, and PySpark.
Proficient in leveraging AWS services such as EC2, S3, EMR, RDS, Lambda, Glue, Redshift, Athena, CloudWatch, SageMaker, CloudFormation, QuickSight, Aurora, and DynamoDB; also experienced in maintaining Hadoop clusters on AWS EMR.
Good experience implementing advanced procedures such as text analytics, processing data with the in-memory computing capabilities of Apache Impala, and building end-to-end applications in Scala.
Created dashboards using Tableau and Power BI to deliver insights based on business use cases and stakeholder requirements, integrating machine learning algorithms to enhance predictive analytics and support data-driven decision-making.
Good understanding of AWS SageMaker and experience with Hortonworks and Cloudera Hadoop environments.
Experienced in building Jupyter notebooks using PySpark for extensive data analysis.
Hands-on experience in installing, configuring, monitoring, and using Hadoop ecosystem components such as Hadoop MapReduce, Hive, Sqoop, Pig, Zookeeper, Oozie, YARN, Hortonworks, and Flume.
Leveraged the Hadoop ecosystem and Spark for big data analytics, focusing on deploying scalable applications for real-time data processing and insights.
Advanced skills in CI/CD pipeline creation and version-control integration, utilizing platforms like Kubernetes for robust build processes and streamlining continuous integration and delivery with tools such as Jenkins, Bitbucket, and GitHub. Good understanding of software development methodologies, including Waterfall and Agile (Scrum), with proficiency in JIRA for project management.
TECHNICAL SKILLS:
GCP Services: Google Cloud Storage, Identity and Access Management (IAM), Snowflake, Firestore, Cloud Functions, Dataproc, BigQuery, Pub/Sub, Dataflow, Cloud Dataprep, Data Studio, Cloud SQL, Cloud Data Fusion, Cloud Composer
Cloud Technologies: AWS (EMR, EC2, S3, Redshift, Athena, Lambda Functions, Step Functions, DynamoDB, CloudWatch, CloudTrail, SNS, SQS, Kinesis, QuickSight, Glue)
Big Data Technologies: HDFS, MapReduce, Hive, YARN, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala
Languages: Python, Scala, SQL, HiveQL, Unix Shell Scripting, Java, XML, XML Schemas, PySpark, R, C/C++
Hadoop Distribution: Cloudera Enterprise, EMC Pivotal, Dataiku, CDH, Hortonworks HDP, Apache
Databases: Oracle, MySQL, SQL Server, Cassandra, MongoDB, PostgreSQL, HBase, Snowflake, Hive, Cosmos DB, BigQuery
Operating Systems: Windows, Linux, UNIX, Ubuntu, macOS
IDEs / Tools / Design: MS Visual Studio, Eclipse, PyCharm, CI/CD, SQL Developer, Workbench, Tableau, Power BI
Version Control: Git, Bitbucket, GitHub
Tools: Airflow, Kubernetes, Docker, Terraform, Git, Jenkins, Talend
CI/CD Tools: Hudson, Jenkins, Bamboo, CruiseControl
PROFESSIONAL EXPERIENCE:
Global Atlantic Financial Corp, New York, NY Jul 2021 to Present Role: Senior GCP Data Engineer
Responsibilities:
Built multiple data pipelines leveraging GCS, Google Cloud Functions, and Data Fusion, and implemented end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
Strong understanding of GCP services such as BigQuery, Cloud Storage, Dataflow, Pub/Sub, and Cloud Dataproc.
Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data.
Developed a PySpark data ingestion framework to ingest source claims data into Hive tables, performing data cleansing, aggregations, and de-duplication logic to identify the latest, updated records (see the sketch following this role).
Developed Airflow DAGs in Python to schedule jobs on an incremental basis (a minimal DAG sketch follows this role).
Handled large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcast variables, and effective, efficient joins and transformations.
Produced unit tests for Spark transformations and helper methods used in designing data processing pipelines.
Wrote multiple batch jobs to process hourly and daily data received from sources such as Adobe and NoSQL databases, using GCP services including Cloud Functions, Dataproc, Cloud Composer, and Cloud Storage.
Leveraged cloud and GPU computing technologies on GCP for automated machine learning and analytics pipelines.
Experienced in building advanced analytics solutions in a Big Data environment using Google Cloud Storage, GCP Dataflow, Data Fusion, Cloud Composer (Apache Airflow), Google Pub/Sub, and Dataproc.
Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python.
Designed and implemented end-to-end ETL processes using Talend to meet complex business requirements, including data extraction, transformation, and loading.
Wrote real-time processing and core jobs in Scala using Spark Streaming with Kafka as the data pipeline system, and enhanced Talend jobs and shell scripts to improve the performance and maintainability of ETL processes in GCP (see the streaming sketch following this role).
Proficient in UNIX shell scripting for automation of routine tasks, improving operational efficiency by 20%.
Utilized Talend strategies such as indexing and partitioning to fine-tune data warehouse and big data environments, improving query response time and scalability.
Defined and captured metadata and rules associated with ETL processes using Talend.
Pulled data from the data lake (GCS) and transformed it with various RDD transformations.
Analyzed the system for new enhancements and functionalities and performed impact analysis of the application for implementing ETL changes.
Demonstrated ability in optimizing database queries and designing efficient indexing strategies for improved performance.
Designed and implemented web services using REST and SOAP protocols in Java, enhancing system interoperability and data exchange capabilities.
Actively involved in code review and bug fixing for improving performance and worked on handling data manipulation using Python Scripts.
Developed and maintained Shell scripts to orchestrate and automate various aspects of ETL processes in the GCP environment, enhancing efficiency and reliability of data pipelines.
Extracted, transformed, and loaded data from source systems to Cloud Storage.
Built performant, scalable ETL processes to load, cleanse and validate data.
Involved in development, building, testing, and deploying to Hadoop cluster in distributed mode.
Developed Analytical dashboards in Snowflake and shared data downstream in GCP.
Built data-centric queries for cost optimization in Snowflake.
Implemented a Continuous Delivery pipeline with Kubernetes, Docker, GitHub, and GCP Cloud Services.
Worked on the configuration management tools like Bitbucket/GitHub.
Developed and deployed to production multiple projects in the CI/CD pipeline for real-time data distribution, storage, and analytics. Persistence to S3, HDFS, Postgres.
Environment: GCP, PySpark, BigQuery, Python, Scala, Talend, Cloud Composer, Snowflake, Airflow, DataFrame, RDD, Hive, Pig, Bitbucket, GitHub, Dataflow, Dataproc, Cloud SQL, MySQL, Postgres, SQL Server, Spark, Spark-SQL.
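Illustrative PySpark sketch of the de-duplication pattern referenced above (keep only the latest record per claim); the GCS path, table, and column names are hypothetical placeholders, not the production objects.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_ingest").enableHiveSupport().getOrCreate()

# Cleansed source extract from the landing zone (placeholder path).
raw = spark.read.parquet("gs://landing-bucket/claims/")

# Rank records per claim by update timestamp and keep the newest one.
ranked = raw.withColumn(
    "rn",
    F.row_number().over(
        Window.partitionBy("claim_id").orderBy(F.col("updated_ts").desc())
    ),
)
latest = ranked.filter(F.col("rn") == 1).drop("rn")

# Persist the de-duplicated, latest records to the target Hive table.
latest.write.mode("overwrite").saveAsTable("claims_db.claims_latest")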
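A minimal Airflow DAG sketch of the incremental scheduling described above; the DAG id, schedule, and task callables are hypothetical placeholders for the production jobs.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_increment(**context):
    # Pull only records newer than the start of this run's data interval (Airflow 2.x).
    start = context["data_interval_start"]
    print(f"Extracting records updated since {start}")


def load_to_bigquery(**context):
    # Load the extracted increment into the target BigQuery table.
    print("Loading increment into BigQuery")


with DAG(
    dag_id="incremental_claims_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_increment", python_callable=extract_increment)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    extract >> load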
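The real-time jobs above were written in Scala with Spark Streaming; the sketch below shows an equivalent flow as PySpark Structured Streaming under assumed broker, topic, schema, and path names (the spark-sql-kafka package is assumed to be on the classpath).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("claims_stream").getOrCreate()

# Expected shape of each Kafka message payload (illustrative).
schema = StructType([
    StructField("claim_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "claims-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Land parsed events in GCS with checkpointing for exactly-once file output.
query = (
    events.writeStream.format("parquet")
    .option("path", "gs://curated-bucket/claims_stream/")
    .option("checkpointLocation", "gs://curated-bucket/_checkpoints/claims_stream/")
    .outputMode("append")
    .start()
)
query.awaitTermination()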
Tenet Healthcare-Dallas, Texas Jul 2020 to Jun 2021
Role: Data Engineer
Responsibilities:
Designed and developed ETL processes in AWS Glue to load diverse data such as ETR, CT scan, and X-ray records, along with patient data from external sources like S3, RDBMS, Teradata, and text files, into AWS Redshift; responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
Developed AWS Lambda functions that create an EMR cluster and auto-terminate it after the job completes (see the boto3 sketch following this role).
Developed data transition programs from DynamoDB to AWS Redshift (ETL Process) using AWS Lambda by creating functions in Python for certain events based on use cases.
Involved in building a data pipeline and performing analytics using the AWS stack (EMR, EC2, EBS, Elasticsearch, Kinesis, SQS, DynamoDB, S3, RDS, Lambda, Glue, and Redshift).
Implemented robust data pipelines in AWS environment using Talend, integrating data from varied sources such as S3, RDBMS, Teradata, and text files into Snowflake, ensuring data consistency and reliability.
Utilized Talend's capabilities to orchestrate data workflows in AWS Glue, optimizing performance and scalability for processing large volumes of healthcare data.
Designed and implemented data quality checks within Talend workflows to ensure accuracy and integrity of healthcare data, crucial for maintaining high standards of patient care and compliance.
Provided support and troubleshooting for production ETL jobs, identifying, and resolving issues promptly to minimize impact on critical healthcare operations.
Wrote PySpark applications that run on an Amazon EMR cluster, fetching data from the Amazon S3 data lake location and queueing it in Amazon SQS (Simple Queue Service) (sketched following this role).
Built Kinesis dashboards and applications that respond to incoming data using AWS SDKs; exported data from Kinesis to other AWS services, such as EMR for analytics, S3 for storage, Redshift for large-scale data, and Lambda for event-driven actions.
Developed multiple POCs using Spark and Scala and deployed them on the Yarn Cluster, comparing the performance of Spark, with Hive and SQL.
Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.
Developed Spark scripts using Scala Shell commands as per the requirements.
Loaded the data into Spark RDD and did in-memory data Computation to generate the output response.
Worked on Spark SQL to analyze and apply transformations on DataFrames created from the SQS queue, loading them into database tables and querying.
Used Amazon S3 to persist the transformed Spark DataFrames in S3 buckets and as a data lake for the data pipeline running on Spark.
Developed email reconciliation reports for ETL loads in the Spark framework.
Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
Proven track record of collaborating with business stakeholders to gather requirements and perform Agile story analysis, ensuring alignment between technical solutions and business needs.
Understood the business requirements for the AWS implementation in depth and came up with a test strategy based on business rules.
Experience with Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.
Used Teradata Studio and Teradata SQL Assistant to run SQL Queries.
Involved in functional testing, integration testing, regression testing, smoke testing, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
Implemented data quality checks using Spark Streaming and arranged bad and passable flags on the data.
Implemented Spark-SQL with various data sources such as JSON, Parquet, ORC, and Hive (see the sketch following this role).
Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model which gets the data from Kafka in near real-time and persist it to Cassandra.
Created Tableau visualizations by connecting to AWS Elastic MapReduce (EMR), and supported continuous integration and delivery (CI/CD) using Jenkins and Puppet.
Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team.
Environment: AWS, S3, Lambda, Glue, Hive, Spark, Snowflake, Talend, Python, SQL, PySpark, EMR, Spark-SQL, Teradata, JSON, Parquet, ORC, Cassandra, Spark RDD, RDS, DynamoDB, Redshift, ECS, AWS Kinesis, Airflow, Oracle, Tableau, Jira, Puppet.
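Hedged boto3 sketch of the Lambda pattern above: launch a transient EMR cluster whose steps auto-terminate it on completion. Cluster name, release label, instance types, and the step script path are placeholders.

import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    response = emr.run_job_flow(
        Name="transient-claims-etl",
        ReleaseLabel="emr-6.5.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Auto-terminate the cluster once all steps have completed.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"cluster_id": response["JobFlowId"]}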
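Illustrative sketch of the EMR PySpark job described above that reads from the S3 data lake and enqueues records to SQS; the bucket, queue URL, region, and column names are assumed placeholders.

import json

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_to_sqs").getOrCreate()

# Read incoming records from the S3 data lake (placeholder path).
records = spark.read.json("s3://datalake-bucket/incoming/claims/")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/claims-queue"

def enqueue_partition(rows):
    # One SQS client per partition; send each record as a JSON message.
    sqs = boto3.client("sqs", region_name="us-east-1")
    for row in rows:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(row.asDict(), default=str))

records.foreachPartition(enqueue_partition)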
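Minimal sketch of combining the JSON, Parquet, ORC, and Hive sources mentioned above with Spark SQL; paths, table names, and the common column set are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi_format").enableHiveSupport().getOrCreate()

# Read each source format into a DataFrame (placeholder locations).
json_df = spark.read.json("s3://example-bucket/raw/events_json/")
parquet_df = spark.read.parquet("s3://example-bucket/raw/events_parquet/")
orc_df = spark.read.orc("s3://example-bucket/raw/events_orc/")
hive_df = spark.sql("SELECT * FROM analytics.events_hive")

# Align on a common set of columns before combining the sources.
cols = ["event_id", "event_type", "event_ts"]
unified = (
    json_df.select(cols)
    .unionByName(parquet_df.select(cols))
    .unionByName(orc_df.select(cols))
    .unionByName(hive_df.select(cols))
)

# Write the unified dataset to a curated zone for downstream consumers.
unified.write.mode("overwrite").parquet("s3://example-bucket/curated/events/")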
Dollar General- Goodlettsville, TN Jan 2019 to Jun 2020 Role: Data Engineer
Responsibilities:
Designed and implemented scalable ETL pipelines using Python, Talend, and Spark, optimizing data flow between various sources and targets including Snowflake and Redshift, which significantly improved data processing efficiency and supported advanced analytics (a Snowflake-load sketch follows this role).
Spearheaded the design and development of scalable ETL pipelines utilizing Talend Big Data to seamlessly integrate data across diverse sources and targets, including Snowflake and AWS services, facilitating enhanced data processing and analytics capabilities for Dollar General.
Provided expertise in performance tuning of Talend jobs, optimizing query response time and scalability in Big Data environments, resulting in improved operational efficiency for Dollar General.
Extensively involved in the Installation and configuration of Cloudera Hadoop Distribution.
Implemented advanced procedures like text analytics and processing using in-memory computing capabilities like Apache Spark written in Scala.
Developed spark applications for performing large-scale transformations and denormalization of relational datasets.
Created reports for the BI team using Scala to export data into HDFS and Hive.
Gained real-time experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using AWS.
Developed Spark jobs in PySpark and Spark SQL on AWS to run on top of Hive tables and create transformed datasets for downstream consumption.
Implemented scheduled downtime for non-prod servers for optimizing AWS pricing.
Developed Spark with Scala and Spark-SQL for testing and processing of data.
Implemented data ingestion from various source systems using Scala and Pyspark.
Hands-on experience implementing Spark and Hive jobs performance tuning.
Performed end-to-end Architecture & implementation assessment of various AWS services like Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
Used Hive as the primary query engine, building external table schemas for the data being processed in AWS.
Generated graphs and reports using packages in RStudio for analytical models. Developed and implemented an R Shiny application that showcases machine learning for business forecasting.
Developed predictive models using Decision Tree, Random Forest, and Naïve Bayes (see the scikit-learn sketch following this role).
Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, AWS, and NLTK in Python for developing various machine learning algorithms. Expertise in R, MATLAB, Python, and their respective libraries.
Researched reinforcement learning and control (Spark, Torch, Scala) and machine learning models (scikit-learn).
Supported machine learning engineers and data scientists in implementing Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, Spark, AWS, clustering, and Principal Component Analysis.
Implemented various statistical techniques to manipulate the data like missing data imputation, principal component analysis and sampling.
Built and published customized interactive reports and dashboards, with report scheduling, using Tableau Server.
Environment: AWS Cloud, Hadoop, Spark, Kafka, Scala, Talend, Hive, YARN, HBase, Jenkins, Docker, Power BI, Splunk, Amazon EMR, Redshift, S3, Athena, Glue, Kinesis, pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, NLTK, Torch.
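Hedged sketch of loading a transformed Spark DataFrame into Snowflake with the Spark-Snowflake connector, as referenced in this role; connection options, credentials handling, and table names are placeholders, and the connector JARs are assumed to be available on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to_snowflake").getOrCreate()

# Curated, transformed dataset staged in S3 (placeholder path).
sales = spark.read.parquet("s3://example-bucket/curated/sales/")

sf_options = {
    "sfURL": "account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "placeholder",   # in practice pulled from a secrets manager
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
}

# Write the DataFrame into a Snowflake table via the connector.
(
    sales.write.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "SALES_FACT")
    .mode("overwrite")
    .save()
)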
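Illustrative scikit-learn comparison of the model families listed above (Decision Tree, Random Forest, Naive Bayes); the data here is synthetic, not the retail dataset.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset for demonstration purposes.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=6, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "naive_bayes": GaussianNB(),
}

# Fit each model and report held-out accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")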
High Radius Technologies- Hyderabad, India Aug 2016 – Jun 2018 Role: Data Analyst
Responsibilities:
Extracted Multiple Population Data using SQL queries through Impala and geo-mapping the data using Tableau Dashboards.
Updated and optimized stored procedures using SQL.
Developed PL/SQL stored procedure, functions, and style sheets to reduce data retrieval time by 50%.
Re-structured schemas with 100+ tables to enhance data integrity.
Designed and implemented reports that track key business metrics and provide actionable insights.
Analyzed financial data to identify bad populations using efficient data-mining techniques that automatically isolate bad segments based on a set configuration; this technique was developed using SQL and Python (a simplified sketch follows this role).
Worked for the BI Analytics team to conduct A/B testing, data extraction, and exploratory analysis.
Generated dashboards and presented the analysis to researchers explaining insights on the data.
Environment: SQL, PL/SQL, Python, JIRA, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel, Tableau, Informatica, Impala.
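A simplified sketch of the SQL-plus-Python "bad population" isolation described above: pull account-level metrics with SQL, then flag segments whose delinquency rate exceeds a configured threshold. The connection string, table, columns, and thresholds are hypothetical.

import pandas as pd
from sqlalchemy import create_engine

# Configurable thresholds that define a "bad" segment (illustrative values).
CONFIG = {"min_accounts": 50, "max_delinquency_rate": 0.15}

# Placeholder connection string; the real source was an enterprise RDBMS.
engine = create_engine("mssql+pyodbc://user:password@dsn_name")

df = pd.read_sql(
    """
    SELECT segment, account_id,
           CASE WHEN days_past_due > 60 THEN 1 ELSE 0 END AS delinquent
    FROM invoices
    """,
    engine,
)

# Aggregate per segment and isolate segments breaching the configured threshold.
summary = df.groupby("segment").agg(
    accounts=("account_id", "nunique"),
    delinquency_rate=("delinquent", "mean"),
)
bad_segments = summary[
    (summary["accounts"] >= CONFIG["min_accounts"])
    & (summary["delinquency_rate"] > CONFIG["max_delinquency_rate"])
]
print(bad_segments)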
Ceequence Technologies- Chennai, India Apr 2015 to Jul 2016 Role: SQL Server DBA
Responsibilities:
Implemented SQL logins, roles, and authentication modes as part of security policies for various categories of users.
Created user-defined data types, stored procedures, triggers, and functions, and rebuilt indexes at regular intervals for better performance.
Monitored performance, recovered databases, maintained databases, used log shipping, and analyzed/tuned long-running slow queries.
Analyzed various aspects of stored procedures and performed tuning as needed to increase productivity and code reusability.
Provided DBA support for a development team customizing and testing new releases of client-server applications.
Created and maintained documentation of data interface specifications, Data Mappings, Transformation Rules, Business decision and migration processes to be available to business/Product owners and IT teams.
Implemented AlwaysOn Availability Groups between primary and secondary production servers, configuring a shared witness server and a Windows failover cluster.
Worked with large datasets and performed data analysis, transformations, data cleansing, de-duping, de-normalizing, and data validations using SQL Server, SSIS packages, and Excel.
Handled databases up to half a TB in size in clustered and replication environments.
Provided database coding to support business applications using PL/SQL and T-SQL.
Designed and developed reports using SQL Server Reporting Services (SSRS) and Crystal Reports.
Experienced in extracting data from Development / UAT / Production servers by using SSIS.
Environment: SQL, Data Mapping, datasets, SQL- Server, SSIS, PL SQL, T-SQL, SSRS, UAT.