Jyothika Bollareddy
Data Engineer
Email: **********************@*****.*** | Phone: 334-***-****
PROFESSIONAL SUMMARY
Over 5 years of IT experience as a Data Engineer, designing and implementing end-to-end data pipelines and analytics solutions for large-scale enterprises across the banking, insurance, and government sectors.
Strong expertise in Big Data technologies such as Hadoop, Apache Spark, Databricks, and Snowflake, building scalable and efficient data processing systems to handle high-volume, high-velocity data.
Extensive experience working with cloud platforms including AWS (EMR, EC2, S3, Lambda, Athena, Glue, RDS, Redshift, DynamoDB), Azure (Blob Storage, ADLS, ADF, Synapse, Azure SQL Server), and GCP (Beam, Dataproc, BigQuery, Dataflow, Composer) to deploy cloud-native data solutions.
Proficient in programming languages including Python, Scala, and Java, using these languages to develop, optimize, and automate data ingestion, transformation, and analysis workflows in distributed computing environments.
Skilled in working with various databases and query languages including SQL, Hive, Oracle, MySQL, DB2, and PostgreSQL, with strong capabilities in data modeling, query optimization, and database management.
Experienced with DevOps and containerization tools like Jenkins for CI/CD automation, Docker for container creation, and Kubernetes for orchestration and scalable deployment of data applications.
Familiar with Machine Learning concepts and workflows, collaborating with data science teams to prepare quality datasets, enabling the training and deployment of predictive models in big data environments.
Extensive experience with scheduling and orchestration tools such as Apache Airflow, Oozie, Azkaban, and AWS Step Functions for automating complex data pipeline workflows and ensuring reliable execution with monitoring and alerting.
Hands-on experience with on-premises Hadoop distributions such as Cloudera and Hortonworks, managing cluster setup, configuration, and maintenance to ensure optimized processing and data ingestion performance.
Expertise in designing and developing ETL processes and data modeling using tools like Erwin, ensuring scalable, maintainable, and optimized data architectures for complex analytical applications.
Proficient in handling various data formats including CSV, Parquet, ORC, Avro, JSON, XML, and TXT, optimizing storage and query performance in both batch and streaming environments.
Developed and maintained data visualization dashboards using tools like Tableau, Power BI, and Looker, delivering interactive reports and analytics to business stakeholders for better decision-making.
Strong documentation skills using platforms such as Confluence, SharePoint, and Lucidchart, creating detailed technical guides, system architecture diagrams, and data flow charts for team collaboration and knowledge sharing.
Knowledgeable in OLAP and analytics tools such as Kyvos, Druid, Azure Analysis Services (AAS), and SQL Server Analysis Services (SSAS) to support multidimensional data analysis and fast query performance.
Experienced with version control and repository management systems including Git, GitLab, GitHub, SVN, and CVS, ensuring effective code collaboration and version tracking in agile environments.
Deep understanding of Software Development Life Cycle (SDLC) methodologies including Agile and Waterfall, participating actively in sprint planning, code reviews, and quality assurance to deliver robust data solutions.
Proficient in using ticketing and project management tools such as JIRA, ServiceNow, Rally, TFS, and Azure DevOps Dashboard, improving team collaboration, issue tracking, and project transparency.
Skilled in optimizing Spark jobs and complex SQL queries to improve performance and reduce latency, leveraging cluster resources efficiently for processing terabytes of data daily.
Implemented data governance and security best practices including encryption, access control, and audit logging across cloud and on-premises platforms to ensure compliance with corporate and regulatory policies.
Automated ETL build and deployment workflows using Jenkins to establish CI/CD pipelines, reducing manual errors and speeding up delivery of data pipelines and applications.
Collaborated cross-functionally with data scientists, analysts, and DevOps teams to design and deliver scalable, end-to-end data engineering solutions aligned with business objectives.
Designed and implemented cloud-based data lakes and data warehouses using AWS, Azure, and GCP native services to enable scalable storage and high-speed querying of large datasets.
Mentored junior engineers and onboarded new team members, sharing knowledge on big data tools, cloud platforms, and best coding practices to ensure consistent project quality.
Proactively researched and adopted emerging big data technologies and tools to improve existing data pipelines and architectures, maintaining industry relevancy and enhancing system capabilities.
Demonstrated excellent problem-solving skills by analyzing and resolving data pipeline failures and performance bottlenecks, improving reliability and operational efficiency of data engineering workflows.
TECHNICAL SKILLS
Languages: Python, Scala, Java
Databases & Query Languages: SQL, Hive, Oracle, MySQL, DB2, PostgreSQL, DynamoDB
Testing Frameworks: JUnit (basic familiarity), pytest (Python testing)
Web Services: REST APIs (experience in data ingestion and integration)
Version Control Tools: Git, GitLab, GitHub, SVN, CVS
Methodologies: Agile, Waterfall, SDLC
Cloud Technologies: AWS (EMR, EC2, S3, Lambda, Athena, Glue, RDS, Redshift, DynamoDB), Azure (Blob Storage, ADLS, ADF, Synapse, Azure SQL Server), GCP (Beam, Dataproc, BigQuery, Dataflow, Composer)
DevOps Tools: Jenkins, Docker, Kubernetes
Scheduler & Orchestration Tools: Apache Airflow, Oozie, Azkaban, AWS Step Functions
Data Formats: CSV, Parquet, ORC, Avro, JSON, XML, TXT
Data Visualization Tools: Tableau, Power BI, Looker
Documentation Tools: Confluence, SharePoint, Lucidchart
OLAP Tools: Kyvos, Druid, Azure Analysis Services (AAS), SSAS
Build Tools: Maven (basic understanding)
Logging Tools: Log4j, SLF4J (basic familiarity)
Ticketing & Project Management Tools: JIRA, ServiceNow, Rally, TFS, Azure DevOps Dashboard
Operating Systems: Linux (Ubuntu, RedHat), Windows
IDEs: IntelliJ IDEA, VS Code, PyCharm
PROFESSIONAL EXPERIENCE
American Express Bank May 2024 – Present
Data Engineer
Designed the overall data architecture following a microservices approach, leveraging AWS Cloud infrastructure including S3, Lambda, and Redshift for scalable data ingestion and processing pipelines.
Led requirement gathering sessions with business analysts to define ETL processes, focusing on real-time data transformations using Apache Spark (PySpark) to support fraud detection analytics.
Utilized AWS EMR clusters to process large volumes of transactional data in batch mode, optimizing cluster performance and reducing processing time by 30%.
Developed data ingestion pipelines that integrated JSON, CSV, and Avro data formats from multiple external APIs and internal systems into Amazon S3 for centralized data storage.
Implemented real-time streaming ingestion using Apache Kafka, enabling event-driven data flows for transactional monitoring and alerting across customer accounts.
Migrated legacy customer data from on-premises Oracle databases to AWS S3 and Redshift, ensuring seamless integration and improved query performance in cloud data warehouse environments.
Employed Snowflake as a cloud-native data warehouse solution for high concurrency analytics and implemented Snowflake Streams and Tasks for continuous data ingestion and processing.
Worked on data modeling and schema design for fact and dimension tables in Snowflake, enabling efficient aggregation and historical trend analysis for business intelligence dashboards.
Automated pipeline scheduling and orchestration using Apache Airflow, ensuring reliable execution and monitoring of ETL workflows with automatic failure recovery mechanisms (see the illustrative DAG sketch at the end of this role).
Applied row-level security and column masking in Snowflake to protect sensitive customer information, aligning with GDPR and PCI-DSS compliance standards.
Integrated Log4j for centralized logging of data pipeline events and errors, facilitating rapid troubleshooting and root cause analysis in distributed environments.
Developed REST APIs for data consumption and provisioning to downstream applications, enabling flexible and secure access to processed datasets via OAuth2 authentication.
Used Jenkins CI/CD pipelines for automating deployment of ETL jobs and infrastructure as code, accelerating development cycles and reducing manual errors.
Leveraged GitHub for version control of code repositories, promoting collaboration across distributed teams and enforcing code review standards.
Ensured data quality with automated validation frameworks using pytest and custom Python scripts, significantly reducing data inconsistencies and improving pipeline robustness.
Designed scalable data storage solutions with Parquet and ORC file formats to optimize storage and query efficiency on S3 and Redshift Spectrum.
Implemented IAM policies for fine-grained access control across AWS resources, ensuring that only authorized users and services have access to sensitive data assets.
Containerized ETL applications using Docker and orchestrated deployments with Kubernetes clusters, improving portability and scalability of data processing components.
Participated actively in Agile sprint ceremonies using Jira, managing backlogs, tracking progress, and facilitating continuous delivery of features aligned with business goals.
Collaborated with the data science team to deploy machine learning models into production pipelines, integrating scoring outputs with ETL workflows for real-time risk scoring and insights.
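A minimal sketch of the Airflow orchestration pattern referenced above, assuming hypothetical DAG, task, and S3 path names rather than the actual production pipeline:

# Illustrative Airflow DAG sketch (hypothetical names and paths): schedules a daily
# Spark transform followed by a Redshift load, with retries for failure recovery.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                               # automatic retry on task failure
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="transactions_daily_etl",            # hypothetical pipeline name
    start_date=datetime(2024, 5, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    transform = BashOperator(
        task_id="spark_transform",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-bucket/jobs/transform_transactions.py "   # placeholder path
            "--run-date {{ ds }}"
        ),
    )
    load_to_redshift = BashOperator(
        task_id="load_to_redshift",
        bash_command="python /opt/etl/load_redshift.py --run-date {{ ds }}",  # placeholder script
    )
    transform >> load_to_redshift

The retries and retry_delay settings in default_args provide the automatic failure recovery behavior described above, while Airflow's scheduler and UI supply the execution monitoring.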
State of California Dec 2022 – Dec 2023
Data Engineer
Designed and implemented service-oriented architecture for state data platforms, leveraging Azure Data Lake Storage (ADLS) and Azure Synapse Analytics to build scalable, secure, and compliant big data pipelines.
Gathered detailed business requirements and designed ETL workflows using Azure Data Factory (ADF) for batch and incremental data loading from diverse sources, ensuring data quality and traceability.
Developed and maintained PySpark scripts on Azure Databricks for large-scale data transformations, improving data processing speed and enabling complex analytics on government datasets (see the illustrative transform sketch at the end of this role).
Integrated various structured and unstructured data formats including JSON, CSV, XML, and Avro into the Azure data ecosystem, supporting multiple reporting and compliance needs.
Implemented streaming ingestion solutions using Kafka to capture real-time data from public service applications, enabling near real-time analytics and decision-making for public health monitoring.
Migrated critical healthcare data from on-premises SQL Server databases to Azure Blob Storage, streamlining data access and analytics while maintaining compliance with state privacy laws.
Created dimensional data models in Azure Synapse to support reporting and visualization requirements, enabling self-service analytics through Power BI dashboards for stakeholders.
Designed data aggregation patterns and multi-layered data marts to support efficient data consumption by internal teams and external agencies while ensuring data consistency.
Utilized Azure Data Factory for orchestrating complex data workflows, automating data ingestion, transformation, and loading, significantly reducing manual effort and errors.
Applied row-level security and Azure Active Directory (AAD) based access controls to protect sensitive citizen data, adhering strictly to GDPR and HIPAA compliance mandates.
Implemented detailed logging and error handling using Azure Monitor and Log Analytics, enabling real-time monitoring and rapid issue resolution across the data pipeline.
Developed RESTful APIs for secure data provisioning to state government applications, incorporating OAuth2 for authentication and role-based access control.
Automated deployment and continuous integration using Jenkins and GitLab CI/CD, reducing release cycles and enhancing pipeline reliability in a regulated environment.
Utilized GitLab for source code version control and collaboration, maintaining strict branching policies and performing code reviews to ensure high code quality.
Implemented extensive data quality checks using custom Python scripts and Azure Data Factory validation activities to ensure accuracy and reliability of ingested data.
Stored processed data in optimized file formats like Parquet and ORC on ADLS, reducing storage costs and improving query performance on Synapse SQL pools.
Managed identity and access management via Azure IAM policies, restricting data access to authorized users only and ensuring audit trails for compliance.
Containerized applications using Docker, deployed on Azure Kubernetes Service (AKS) for scalable and resilient microservices architecture supporting various state projects.
Coordinated Agile sprint planning and retrospectives using Jira, improving team communication, sprint velocity, and timely delivery of critical features aligned with government mandates.
Collaborated closely with data scientists to integrate machine learning pipelines for predictive analytics in public welfare programs, leveraging Azure ML for model training and deployment.
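A minimal sketch of the kind of PySpark batch transform referenced above, assuming hypothetical ADLS containers and column names rather than the actual state data model:

# Illustrative PySpark transform on Databricks/ADLS (placeholder paths and columns).
# In a Databricks notebook the SparkSession is already provided as `spark`.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("records_batch_transform").getOrCreate()

raw = (
    spark.read.option("multiLine", True)
    .json("abfss://raw@exampleaccount.dfs.core.windows.net/intake/")   # placeholder container
)

cleaned = (
    raw.dropDuplicates(["record_id"])                       # hypothetical key column
    .withColumn("ingest_date", F.to_date("ingest_ts"))      # hypothetical timestamp column
    .filter(F.col("status").isNotNull())
)

(
    cleaned.write.mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("abfss://curated@exampleaccount.dfs.core.windows.net/records/")  # placeholder path
)

Writing the curated layer as partitioned Parquet on ADLS reflects the storage-optimization approach described above for Synapse SQL pool queries.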
Progressive Insurance March 2020 – Nov 2022
Data Engineer
Architected and developed a microservices-based data platform using AWS cloud services, enabling scalable ingestion and processing of insurance claim data with high availability and fault tolerance.
Led requirement gathering sessions with business and technical teams to design robust ETL pipelines using AWS Glue for extracting, transforming, and loading large volumes of policy and claims data.
Developed batch and streaming data processing jobs using Apache Spark with Scala on AWS EMR, handling multi-terabyte datasets with optimized resource allocation for faster processing times.
Implemented data ingestion pipelines that processed varied formats like JSON, CSV, Avro, and XML, integrating external vendor data and internal databases into AWS S3 data lake.
Designed and maintained Kafka-based streaming architecture for real-time insurance data flows, enabling near real-time fraud detection and risk assessment for underwriting teams.
Migrated on-premises Oracle and SQL Server databases to AWS Redshift and S3, optimizing data storage costs and improving query performance for analytics and reporting.
Created complex dimensional data models in Redshift for business intelligence use cases, supporting Power BI dashboards that provided actionable insights for claims adjustment teams.
Utilized AWS Lambda functions for serverless ETL orchestration, event-driven data processing, and integration between AWS services, reducing infrastructure management overhead.
Ensured data security and compliance by implementing encryption at rest and in transit using AWS KMS and following HIPAA and PCI-DSS guidelines for sensitive insurance information.
Developed comprehensive logging and monitoring using AWS CloudWatch and Log4j, proactively identifying and resolving data pipeline issues to maintain SLA compliance.
Designed and implemented REST APIs for data access and integration with third-party applications, enforcing security through OAuth2 and token-based authentication mechanisms.
Automated build, test, and deployment workflows using Jenkins integrated with GitHub, enabling continuous integration and delivery in an Agile development environment.
Used GitHub for version control and collaborative code management, enforcing code reviews and branching strategies to maintain code quality and stability.
Applied data validation and quality checks using Apache Hive and custom PySpark scripts, ensuring high data integrity for business-critical reports (see the illustrative quality-check sketch at the end of this role).
Optimized storage using columnar file formats such as Parquet and ORC within the S3 data lake, enhancing data retrieval speed and reducing operational costs.
Managed access controls with AWS IAM policies and role-based permissions, restricting sensitive data access and maintaining audit logs for regulatory compliance.
Employed Docker containers for packaging Spark applications, deploying them on Kubernetes clusters to achieve scalable and portable data processing environments.
Orchestrated complex data workflows using Apache Airflow, scheduling ETL jobs and monitoring data pipeline health with alerting mechanisms for early anomaly detection.
Participated actively in Agile ceremonies using Jira, tracking sprint progress and managing backlogs, which improved collaboration across distributed teams and ensured timely feature delivery.
Collaborated with data scientists to build ML data pipelines, preprocessing data for predictive models that enhanced customer segmentation and fraud detection capabilities.
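A minimal sketch of a PySpark data-quality gate of the kind referenced above, assuming hypothetical column names, paths, and thresholds rather than the actual validation rules:

# Illustrative PySpark data-quality check (placeholder paths, columns, and threshold).
# Rows failing checks are quarantined for review; the job fails if too many fail.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_quality_checks").getOrCreate()

claims = spark.read.parquet("s3a://example-data-lake/curated/claims/")   # placeholder path

checks = (
    claims.withColumn("has_claim_id", F.col("claim_id").isNotNull())     # hypothetical columns
    .withColumn("valid_amount", F.col("claim_amount") >= 0)
    .withColumn("passed", F.col("has_claim_id") & F.col("valid_amount"))
)

failed = checks.filter(~F.col("passed"))
failure_rate = failed.count() / max(checks.count(), 1)

# Quarantine failing rows and stop the pipeline if the error rate exceeds tolerance.
failed.write.mode("append").parquet("s3a://example-data-lake/quarantine/claims/")
if failure_rate > 0.01:                       # illustrative 1% tolerance
    raise ValueError(f"Data quality failure rate {failure_rate:.2%} exceeds threshold")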
EDUCATION
Bachelor of Technology in Computer Science
Jawaharlal Nehru Technological University, Hyderabad
CERTIFICATIONS
Microsoft Certified: Azure Data Engineer Associate