
Ribkarani Annam

Email: ad2f22@r.postjobfree.com

PH: 913-***-****

LinkedIn: https://www.linkedin.com/in/ribkaraniannam/

Sr Data Engineer

Professional Summary

Accomplished Data Engineer and Analyst with over 9 years of expertise in developing and implementing robust data models for enterprise-level applications. Proficient in Big Data tools and cloud services, specializing in AWS and Azure platforms. Known for orchestrating fault-tolerant infrastructures, optimizing data processing, and transforming data landscapes for actionable insights.

Responsible for AWS-centric projects at Apple, overseeing fault-tolerant infrastructures and creating robust data pipelines.

Orchestrated multi-tier application designs using AWS services like EC2, S3, RDS, DynamoDB, and IAM, ensuring high availability and fault tolerance.

Developed scalable AWS Lambda code in Python for processing and comparing nested JSON files, enabling efficient data handling (a minimal sketch follows this summary).

Implemented Spark applications using Scala, optimizing Spark processing for various RDBMS and streaming sources.

Proficient in collecting real-time data from AWS S3 buckets using Spark Streaming, performing transformations, and persisting data in HDFS.

Integrated AWS Glue catalog with crawler to extract and perform SQL query operations on S3 data.

Spearheaded Hadoop cluster management on AWS EMR, utilizing services like EC2 and S3 for small dataset processing and storage.

Employed RUP and agile methodologies for new development and software maintenance.

Developed Python code for workflow automation using the Airflow tool, managing dependencies and job scheduling.

Orchestrated Apache Airflow for scheduling DAGs, automating Hive to Azure Blob Storage data migration.

Led a team of data engineers in designing and implementing complex data ingestion pipelines, reducing time-to-insights by 50%.

Proficient in Azure cloud platforms, including HDInsight, Data Lake, Databricks, Azure Functions, and Synapse.

Designed and executed ETL pipelines in Azure Data Factory, moving data to Azure SQL Data Warehouse and Data Lake.

Leveraged Azure Databricks to compute large datasets and uncover actionable business insights.

Configured Snowpipe to efficiently load data from AWS S3 buckets into Snowflake tables for streamlined analytics.

Expertise in using Spark applications for batch and continuous streaming data processing in Azure Databricks.

Managed version control using GitHub and Azure DevOps Repos for code collaboration.

Crafted visualizations and dashboards in Tableau and Power BI, facilitating intuitive reporting and data-driven decision-making.

Extensive experience in ETL development with tools like Informatica, Kafka, and Sqoop for real-time analytics and data integration.

Proficient in Python, SQL, Scala, HiveQL, and Spark for data processing, analysis, and ETL workflows.
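
As an illustration of the Lambda work noted above, here is a minimal sketch of a Python handler that flattens nested JSON objects landing in S3. The bucket layout, S3 trigger, and key names are hypothetical, and the comparison logic is omitted; this is a sketch of the pattern, not the actual implementation.

```python
# Minimal sketch (assumptions: nested JSON objects land in S3 and the Lambda
# is triggered by an s3:ObjectCreated event; all names are illustrative).
import json
import boto3

s3 = boto3.client("s3")

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten a nested dict into dotted keys."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

def lambda_handler(event, context):
    # Read the object that triggered the event and write a flattened copy.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        flat = flatten(json.loads(body))
        s3.put_object(
            Bucket=bucket,
            Key=f"flattened/{key}",
            Body=json.dumps(flat).encode("utf-8"),
        )
    return {"status": "ok"}
```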

TECHNICAL SKILLS:

Big Data Technologies: Hadoop, Spark, Sqoop, Hive, Flume

Cloud Services: AWS (S3, EC2, SQS, Redshift, Glue), Azure (Data Lake, Databricks, Data Factory)

Programming Languages: Python, Scala, SQL

ETL Tools: Informatica, Apache NiFi, Apache Airflow, Apache Oozie

Visualization Tools: Tableau, Power BI

Databases: Snowflake, DynamoDB, Elasticsearch

Other: Terraform, Git, Apache Kafka, Apache Zookeeper

Professional Experience

Sr. AWS Data Engineer

Apple, Sunnyvale, CA May 2022 to Present

Description: As an AWS Data Engineer at Apple, I enhanced the Customer Experience Insights Platform by designing efficient data pipelines, utilizing Spark for large-scale data processing, integrating SQL and NoSQL databases, optimizing data storage formats, and leveraging AWS services. I ensured data quality, collaborated with teams, and documented processes, resulting in actionable insights and improved customer experiences.

Responsibilities:

Design and develop efficient and scalable data pipelines using programming languages such as Python and Scala. Ensure the pipelines are optimized for performance, reliability, and data quality.

Utilize Spark for data processing and analysis, leveraging its distributed computing capabilities to handle large volumes of data and extract valuable insights.

Integrate SQL databases like Postgres and MySQL and the Redshift data warehouse into the data engineering solutions, ensuring seamless data retrieval, storage, and management.

Incorporate NoSQL databases like MongoDB, Cassandra, and Elasticsearch into the data ecosystem, effectively integrating and leveraging their specific features for data storage and retrieval.

Handle various data storage formats, optimizing them for efficient data processing and storage. Ensure compatibility and interoperability across different storage systems.

Optimize data solutions by leveraging AWS cloud. Make use of specific AWS services like EMR, Lambda, S3, Athena, Glue, and RDS to enhance data processing, storage, and analytics capabilities.

Ensure data quality and governance by implementing data validation checks, monitoring data pipelines for anomalies or errors, and establishing data governance practices to maintain data integrity and compliance.

Contribute to driving actionable insights by designing and implementing data solutions that enable effective analysis and reporting, ultimately improving customer experiences on the Customer Experience Insights Platform.

Stay updated with emerging technologies, industry trends, and best practices in data engineering to continuously enhance the data engineering capabilities and drive innovation within the organization.

Develop and maintain data ingestion processes to efficiently capture and ingest data from various sources, such as transactional databases, event streams, log files and external APIs. Ensure data is collected in a timely and accurate manner to support real-time and batch processing.

Utilize AWS Glue for data transformation and cleansing to enhance data quality, consistency, and compatibility across various sources. Apply data validation, standardization, and enrichment techniques to improve the accuracy and reliability of the data (a boto3/Athena sketch of the Glue catalog pattern appears after the environment line for this role).

Design and optimize data models and schemas to support efficient data storage, retrieval, and analysis. Collaborate with data scientists and analysts to understand their requirements and create data structures that facilitate advanced analytics and reporting.

Utilize AWS services like Amazon EMR (Elastic MapReduce) for distributed processing and analysis of large datasets. Leverage the Spark and Hadoop frameworks to perform complex data transformations, aggregations, and machine learning tasks (a minimal PySpark sketch follows this list).

Implement data storage solutions using AWS services like Amazon S3 (Simple Storage Service), Amazon RDS (Relational Database Service), and Amazon Redshift. Optimize data storage configurations for performance, scalability, and cost efficiency.

Develop and maintain data pipelines and workflows using workflow orchestration tools like Apache Airflow, Luigi, or Azkaban. Schedule and monitor data processing jobs, handle job dependencies and ensure reliable and timely execution of data workflows.

Collaborate with data scientists and analysts to understand their data requirements, provide data engineering support, and assist in the development and deployment of machine learning models and advanced analytics algorithms.
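
To make the Spark-on-S3 pattern above concrete, here is a minimal PySpark batch sketch of the kind of job this role describes. The bucket paths, column names, and the daily aggregation are illustrative assumptions, not the actual pipeline.

```python
# Minimal PySpark sketch (assumptions: runs on an EMR cluster with S3 access;
# paths, columns, and the aggregation are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer-events-batch").getOrCreate()

# Read raw JSON events from S3.
events = spark.read.json("s3://example-raw-bucket/customer-events/")

# Basic cleansing and a daily aggregate per customer.
daily = (
    events
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

# Persist as partitioned Parquet for downstream Athena / Redshift Spectrum queries.
(daily.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://example-curated-bucket/customer-events-daily/"))

spark.stop()
```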

Environment: AWS EMR, S3, EC2, Lambda, Apache Spark, Spark-Streaming, Spark SQL, Python, Scala, Shell scripting, Snowflake, AWS Glue, Oracle, Git, Tableau.
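
The Glue-catalog-plus-Athena pattern referenced in this role can be sketched with boto3 roughly as follows; the crawler, database, table, and result-bucket names are hypothetical, and in practice you would also wait for the crawler run to finish before querying new partitions.

```python
# Illustrative boto3 sketch of refreshing a Glue catalog and querying the
# cataloged S3 data through Athena (all names are hypothetical).
import time
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Kick off the crawler so new S3 partitions become queryable
# (runs asynchronously; shown here only for illustration).
glue.start_crawler(Name="customer_events_crawler")

# Run a SQL query against the cataloged S3 data through Athena.
query = athena.start_query_execution(
    QueryString="SELECT event_date, SUM(event_count) AS events "
                "FROM customer_events_daily GROUP BY event_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Poll until the query finishes, then print the result rows.
query_id = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)[
            "ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```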

Sr. Data Engineer (Azure)

FINRA, Rockville, MD August 2020 to April 2022

Description: As an Azure-based data engineer at FINRA, I led a transformative market surveillance project, optimizing data workflows, compliance, and analytics. Leveraging Azure services like Databricks, Data Factory, and SQL DB, I transformed data landscapes and orchestrated efficient and scalable Azure ecosystems. Collaborating with stakeholders, I delivered enriched reporting insights through Power BI and implemented robust version control strategies. My contributions significantly enhanced regulatory adherence, data processing efficiency, and decision-making capabilities.

Responsibilities:

Led a market surveillance project on the Azure cloud platform, focusing on optimizing data workflows, compliance procedures, and analytical capabilities for regulatory adherence.

Utilize Azure services such as Databricks, Data Factory, and SQL DB to transform complex data landscapes and ensure efficient data processing and storage.

Design and implement robust data ingestion, processing, and ETL pipelines using Azure technologies, ensuring seamless integration and reliable analysis of large volumes of data.

Orchestrate the enhancement of Azure ecosystems for improved efficiency and scalability, optimizing performance and resource management to meet growing demands.

Collaborate closely with stakeholders, including compliance teams, analysts, and decision-makers, to understand their requirements and deliver actionable insights based on comprehensive and accurate data.

Monitor and optimize data workflows, identify areas for improvement, and implement enhancements to ensure regulatory compliance, data accuracy, and processing efficiency.

Collaborate with cross-functional teams, including data analysts, compliance officers, and software engineers, to gather requirements and translate them into technical specifications for data engineering solutions on the Azure platform.

Design and implement scalable and efficient data architectures on Azure, considering factors such as data volume, velocity, and variety. This includes selecting appropriate Azure services and components for data storage, processing, and analytics.

Develop and maintain data pipelines, ensuring the smooth and reliable movement of data from various sources to the target systems. This involves utilizing Azure Data Factory, Azure Logic Apps, or other relevant tools to orchestrate data workflows.

Implement data transformation and enrichment processes to cleanse, validate, and enhance the quality of incoming data. Apply appropriate techniques such as data deduplication, normalization, and aggregation to ensure data accuracy and consistency.

Optimize data processing and analytics workflows by leveraging Azure technologies like Azure Databricks, Azure Synapse Analytics, or Azure HDInsight. This includes optimizing Spark jobs, managing cluster resources, and tuning performance for efficient data processing (a minimal Databricks-style sketch follows this list).

Collaborate with compliance teams to ensure data privacy and security, implementing appropriate measures such as data encryption, access controls, and monitoring mechanisms to adhere to regulatory requirements.

Monitor and troubleshoot data pipelines and systems, proactively identifying and resolving issues related to data ingestion, processing, or availability. Implement effective monitoring and alerting mechanisms to ensure continuous data flow and minimal downtime.

Document data engineering processes, workflows, and configurations for future reference and knowledge sharing. Maintain comprehensive documentation to facilitate collaboration, onboarding of new team members, and compliance with organizational standards.

Develop enriched reporting insights using Power BI, creating visually compelling dashboards, interactive reports, and data visualizations that empower stakeholders to make data-driven decisions.

Collaborate with DevOps teams to automate deployment, configuration, and monitoring of data engineering solutions. Utilize infrastructure-as-code principles and tools like Azure DevOps or Terraform to ensure consistent and reproducible deployments.
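
A minimal Databricks-style PySpark sketch of the ingest-and-cleanse flow described above, assuming ADLS Gen2 storage and Delta tables. The storage account, container, paths, columns, and table names are illustrative assumptions rather than the surveillance project's actual schema.

```python
# Minimal Databricks-style PySpark sketch (assumptions: the cluster already
# has access to the storage account; all names are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trade-surveillance-etl").getOrCreate()

# Read raw trade records landed in Azure Data Lake Storage Gen2.
raw = spark.read.parquet(
    "abfss://raw@examplestorageacct.dfs.core.windows.net/trades/")

# Deduplicate and standardize before downstream compliance analytics.
clean = (
    raw.dropDuplicates(["trade_id"])
       .withColumn("trade_date", F.to_date("trade_ts"))
       .filter(F.col("symbol").isNotNull())
)

# Persist as a managed Delta table for Synapse / Power BI consumption
# (assumes the 'surveillance' database exists; Delta is the default
# table format on Databricks).
(clean.write
      .mode("overwrite")
      .format("delta")
      .saveAsTable("surveillance.trades_clean"))
```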

Environment: Azure (Data Lake, HDInsight, SQL, Data Factory), Databricks, Cosmos DB, Git, Blob Storage, Power BI, Scala, Hadoop, Spark, PySpark, Airflow.

Big Data Engineer

NantHealth, Providence, RI February 2018 to July 2020

Description: Revamped healthcare analytics at NantHealth by orchestrating a healthcare data integration pipeline covering data extraction, processing, and analytics, and by streamlining critical healthcare datasets to drive informed decision-making. Led a multidisciplinary team in designing and implementing a high-efficiency data pipeline, cutting time to insights by 50%, and employed modern technologies and methodologies to elevate healthcare data management.

Responsibilities:

Extracted data from HDFS, including customer behavior, sales and revenue data, supply chain, and logistics data.

Stored the data in AWS S3 using Apache NiFi, an open-source data integration tool that enables powerful and scalable dataflows.

Validated and cleaned the data using Python scripts before storing it in S3.

Used PySpark, a Python API for distributed big data processing, to process and transform the data.

Loaded the transformed data into the AWS Redshift data warehouse for analysis.

Scheduled the pipeline using Apache Oozie, a workflow scheduler system for managing Apache Hadoop jobs.

Developed and maintained a library of custom Airflow DAG templates and operators, which improved consistency and code quality across the team (a minimal DAG template sketch follows this list).

Led a team of three data engineers in designing and implementing a complex data ingestion and processing pipeline for a new data source, which reduced time to insights by 50%.

Analyzed the data in HDFS using Apache Hive, data warehouse software that facilitates querying and managing large datasets.

Converted Hive queries into PySpark transformations using PySpark RDDs and the DataFrame API.

Monitored the data pipeline and applications using Grafana.

Configured Zookeeper to support distributed applications.

Used functional programming concepts and the collection framework of Scala to store and process complex data.

Used GitHub as a version control system for managing code changes.

Developed visualizations and dashboards using Tableau for reporting and business intelligence purposes.
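
The custom Airflow DAG templates mentioned above might look roughly like the following minimal sketch. The task callables, schedule, connection details, and IDs are placeholders, and the import paths assume Airflow 2.x rather than the exact version used on the project.

```python
# Illustrative Airflow DAG template (task bodies and names are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_hdfs(**context):
    """Placeholder: pull the day's extract from HDFS."""
    ...

def load_to_s3(**context):
    """Placeholder: validate and upload the extract to S3."""
    ...

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="healthcare_ingest_daily",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_hdfs",
                             python_callable=extract_from_hdfs)
    load = PythonOperator(task_id="load_to_s3",
                          python_callable=load_to_s3)

    # Extraction must finish before the S3 load runs.
    extract >> load
```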

Environment: S3, Redshift, Apache Flume, PySpark, Oozie, Tableau, Scala, Spark RDDs, Hive, HiveQL, HDFS, Zookeeper, Grafana, MapReduce, Sqoop, GitHub

ETL Developer

Charter Communications, St Louis, MO November 2016 to January 2018

Description: Revolutionized data integration at Charter Communications, employing Informatica, Kafka, and Sqoop for real-time analytics that supported informed decisions and operational efficiency. Orchestrated robust data workflows, leveraging Informatica tools for ETL design, implementing Kafka and Spark Streaming for real-time data ingestion, and utilizing Sqoop and Flume for seamless data transfer from diverse sources, enhancing data accuracy and accessibility. Spearheaded end-to-end data management, ensuring adherence to ETL standards through meticulous SQL query reviews, converting business requirements into technical design documents, and collaborating with stakeholders to develop comprehensive business requirement documents (BRDs), thereby streamlining data processes and enhancing system efficiency.

Responsibilities:

Extensively used Informatica client tools: PowerCenter Designer, Workflow Manager, Workflow Monitor, and Repository Manager.

Used Kafka for live streaming data and performed analytics on it. Worked with Sqoop to transfer data between relational databases and Hadoop.

Built real-time data pipelines using Kafka for streaming data ingestion and Spark Streaming for real-time consumption and processing (see the streaming sketch at the end of this section).

Loaded data from Web servers and Teradata using Sqoop, Flume, and Spark Streaming API.

Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.

Wrote multiple MapReduce programs for data extraction, transformation, and aggregation across multiple file formats, including XML, JSON, CSV, and other compressed formats.

Extracted data from various heterogeneous sources such as Oracle and flat files.

Developed complex mappings using Informatica PowerCenter.

Extracted data from Oracle, flat files, and Excel files, and applied Joiner, Expression, Aggregator, Lookup, Stored Procedure, Filter, Router, and Update Strategy transformations to load data into the target systems.

Worked with the data modeler in developing star schemas.

Handled a high volume of day-to-day Informatica workflow migrations.

Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.

Wrote SQL queries against the repository database to find deviations from the company's ETL standards for user-created objects such as sources, targets, transformations, log files, mappings, sessions, and workflows.

Leveraged existing PL/SQL scripts for daily ETL operations.

Ensured that all support requests were properly approved, documented, and communicated using the QMC tool, and documented common issues and resolution procedures.

Extensively enhanced and managed UNIX shell scripts.

Converted business requirements into technical design documents.

Documented the macro logic and worked closely with Business Analysts to prepare the BRD.

Set up SFTP transfers with internal bank management.

Built UNIX scripts to clean up the source files.

Loaded all the sample source data using SQL*Loader and scripts.
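
The Kafka-plus-Spark streaming ingestion described above can be sketched with Spark Structured Streaming in Python as follows. Broker addresses, the topic name, and HDFS paths are illustrative assumptions, and the original work may well have used the older DStream API; the Kafka connector package is assumed to be on the Spark classpath.

```python
# Minimal Spark Structured Streaming sketch of Kafka ingestion
# (broker addresses, topic, and paths are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Consume the raw event topic from Kafka.
stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
         .option("subscribe", "customer-events")
         .option("startingOffsets", "latest")
         .load()
)

# Kafka delivers bytes; decode the payload before writing it out.
decoded = stream.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload"),
    "timestamp",
)

# Land micro-batches on HDFS as Parquet for downstream batch processing.
query = (
    decoded.writeStream
           .format("parquet")
           .option("path", "hdfs:///data/raw/customer_events/")
           .option("checkpointLocation", "hdfs:///checkpoints/customer_events/")
           .outputMode("append")
           .start()
)

query.awaitTermination()
```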

Environment: Informatica PowerCenter, Kafka, Sqoop, Flume, Spark Streaming API, Hadoop, AWS, Oracle, Flat Files, PL/SQL, UNIX Shell Scripts, SQL*Loader, Technical Documentation Tools, Data Modeling Tools, SFTP

Data Analyst- Python

ICICI Bank, Hyderabad, India June 2014 to October 2016

Description: Enhanced ICICI Bank's analytics: Integrated diverse data sources, performed detailed analysis, and optimized operations, refining customer insights and financial strategies.

Responsibilities:

Experience working on projects with machine learning, big data, data visualization, R and Python development, Unix, and SQL.

Performed exploratory data analysis using NumPy, Matplotlib, and pandas (a small sketch follows this list).

Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.

Experience analyzing data with the help of Python libraries including Pandas, NumPy, SciPy, and Matplotlib.

Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries.

Prepared high-level analysis reports with Excel and Tableau, and provided feedback on data quality, including identification of billing patterns and outliers.

Worked with Tableau sorting and filtering, including basic sorts, basic filters, quick filters, context filters, condition filters, top filters, and filter operations.

Identified and documented data-quality limitations that hindered internal and external data analysts; wrote standard SQL queries to perform data validation; created Excel summary reports (pivot tables and charts); and gathered analytical data to develop functional requirements using data modeling and ETL tools.

Read data from different sources such as CSV files, Excel, HTML pages, and SQL databases, performed data analysis, and wrote results back to CSV, Excel, or a database.

Experienced in using lambda functions with filter, map, and reduce on pandas DataFrames to perform various operations.

Used the pandas API to analyze time series and created a regression test framework for new code.

Developed and handled business logic through backend Python code.
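
A small pandas sketch of the exploratory-analysis workflow described above; the input file, column names, and outlier threshold are assumptions made purely for illustration.

```python
# Small pandas EDA sketch (file, columns, and thresholds are illustrative).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load a transaction extract and parse the timestamp column.
df = pd.read_csv("transactions_sample.csv", parse_dates=["txn_date"])

# Basic quality checks: missing values and obvious outliers.
print(df.isna().sum())
outliers = df[np.abs(df["amount"] - df["amount"].mean()) > 3 * df["amount"].std()]
print(f"{len(outliers)} potential outlier transactions")

# Monthly spend per customer, the kind of aggregate fed into Excel/Tableau reports.
monthly = (
    df.set_index("txn_date")
      .groupby("customer_id")["amount"]
      .resample("M")
      .sum()
      .reset_index()
)

# Quick visual check of the overall monthly trend.
monthly.groupby("txn_date")["amount"].sum().plot(title="Monthly transaction volume")
plt.tight_layout()
plt.show()
```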

Environment: Python, SQL, UNIX, Linux, Oracle, NoSQL, PostgreSQL, and Python libraries such as PySpark and NumPy.

Education:

Bachelor of Technology (CS), ANU University, Guntur, June 2011 to June 2014.


