Dileep Kumar
Phone: +1-636-***-****
Email: ************@*****.***
LinkedIn: https://www.linkedin.com/in/dileep-marneni-2a91a473/
Objective
Skilled Data Engineer with a proven record of designing and implementing data solutions on the AWS cloud platform. Proficient in building and optimizing data pipelines, data warehouses, and ETL processes. Experienced in working with diverse datasets and technologies to deliver robust and scalable solutions. Strong analytical and problem-solving abilities coupled with excellent communication and teamwork skills.
Education
Master’s in Information Systems 2022 - 2023
Saint Louis University Saint Louis, MO
Summary
Designed and implemented ETL pipelines to extract data from multiple sources, including relational databases, APIs, and flat files.
Developed Extract, Transform, Load (ETL) processes to populate and maintain data warehouses from various source systems, ensuring data accuracy, consistency, and timeliness.
Designed and maintained high-performance data warehouses using Amazon Redshift and Snowflake.
Designed and implemented cloud-native data architectures on AWS, leveraging services such as Amazon S3, AWS Glue, Amazon EMR, and Amazon Redshift to ingest, transform, and analyse large volumes of data.
Implemented distributed data processing solutions using Apache Spark and Hadoop, enabling real-time and batch processing of big data workloads with high scalability and fault tolerance (a minimal PySpark sketch follows this summary).
Experienced in deploying and managing big data technologies such as Apache Spark and Hadoop for processing and analysing large datasets.
Proficient in SQL and NoSQL databases, with a focus on implementing and optimizing database systems for efficient, high-performance data storage and retrieval.
Skilled in Python programming for data manipulation, automation, and machine learning tasks.
Well-versed in CI/CD practices and tools like Jenkins and GitLab CI/CD, enabling automated testing, deployment, and monitoring of data pipelines.
Managed Git repositories on GitHub and Bitbucket to version control code, configurations, and data artifacts for data engineering projects.
Managed Kubernetes clusters using tools like Kubernetes Dashboard and cloud-based Kubernetes services such as Amazon EKS, monitoring cluster health, resource utilization, and performance metrics.
Experienced in developing interactive dashboards and reports using tools like Tableau and Power BI to visualize data, and in designing scalable data solutions that drive business insights and enhance decision-making.
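Below is a minimal PySpark sketch of the kind of batch ETL described in this summary; the bucket paths and column names (RAW_PATH, CURATED_PATH, order_id, order_amount) are illustrative placeholders, not references to any specific project.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical S3 locations and column names, for illustration only.
RAW_PATH = "s3://example-raw-bucket/orders/"
CURATED_PATH = "s3://example-curated-bucket/orders/"

spark = SparkSession.builder.appName("orders-batch-etl").getOrCreate()

# Extract: read raw CSV files from S3.
raw = spark.read.option("header", "true").csv(RAW_PATH)

# Transform: cast types, drop rows missing a key, and add a load date for partitioning.
curated = (
    raw.withColumn("order_amount", F.col("order_amount").cast("double"))
       .filter(F.col("order_id").isNotNull())
       .withColumn("load_date", F.current_date())
)

# Load: write partitioned Parquet back to S3 for downstream Redshift/Athena consumption.
curated.write.mode("overwrite").partitionBy("load_date").parquet(CURATED_PATH)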
Skills
Big Data Technologies: Hadoop, Kafka, Spark, Flink, HBase, Databricks
Programming Languages: Python, SQL, Scala
Data Warehousing & ETL Tools: Informatica PowerCenter, Snowflake
Cloud Platforms: AWS (S3, EMR, Redshift, Glue, Athena), Microsoft Azure
Database Systems: RDBMS (PL/SQL, MySQL, SQL Server), NoSQL (MongoDB, Cassandra)
Version Control: Git, GitHub, Bitbucket
Containerization Tools: Kubernetes, Docker
CI/CD: Jenkins, GitLab
Reporting Tools: Power BI, Tableau
Employment History
Data Engineer Oct 2023 – Present
Capital One McLean, VA
Designed and implemented end-to-end data pipelines on AWS, leveraging services such as S3, Glue, EMR, and Redshift to ingest, process, and analyse large volumes of data.
Designed and implemented ETL (Extract, Transform, Load) processes to populate the data warehouse from various source systems.
Designed and implemented data solutions on AWS, covering data ingestion, storage, processing, and analysis, to support business intelligence, analytics, and machine learning initiatives.
Designed and optimized Hive tables, wrote efficient HiveQL queries for data transformation, and tuned performance for large-scale data processing tasks. Used Apache Airflow for workflow orchestration and scheduling of data processing tasks (see the Airflow sketch at the end of this role).
Worked with big data technologies such as Hadoop, Spark, and Kafka to process and analyse large volumes of structured and unstructured data efficiently, and optimized data processing workflows for performance and scalability.
Utilized Amazon EMR to manage and deploy Apache Hadoop and Apache Spark clusters for processing large volumes of data in parallel.
Managed jobs written in Scala and Python to perform data extraction, transformation, and loading (ETL) tasks.
Managed and optimized relational and NoSQL databases (MongoDB and Cassandra) to store and retrieve data effectively; implemented database schemas, indexes, partitions, and other optimizations to improve query performance and resource utilization.
Used SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages to efficiently import data from external sources into databases and export data to external systems.
Implemented CI/CD practices for data pipelines to ensure smooth deployment, testing, and updates.
Implemented Agile methodologies such as Scrum and Kanban to iteratively develop and deploy data pipelines, ETL processes, and analytics solutions.
Managed project backlogs, sprint planning, and sprint reviews to ensure alignment with business objectives and stakeholder priorities.
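A minimal Airflow sketch of the orchestration pattern referenced above (a Spark transformation followed by a Hive refresh); the DAG id, schedule, script path, and table name are assumed placeholders rather than the actual production workflow.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG: names, paths, and scripts are placeholders.
with DAG(
    dag_id="daily_ingest_and_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Submit a PySpark transformation job (e.g., to an EMR cluster).
    spark_transform = BashOperator(
        task_id="spark_transform",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-code-bucket/jobs/transform_orders.py --run-date {{ ds }}"
        ),
    )

    # Refresh Hive partition metadata after the Spark job lands new data.
    hive_refresh = BashOperator(
        task_id="hive_refresh",
        bash_command='hive -e "MSCK REPAIR TABLE analytics.orders;"',
    )

    spark_transform >> hive_refresh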
Data Engineer Nov 2017 – Dec 2021
Max Healthcare New Delhi, India
Integrated data from sources such as clinical trials, research studies, electronic health records (EHRs), laboratory experiments, and drug manufacturing processes, designing and maintaining data pipelines to ensure a smooth flow of data.
Developed and maintained ETL processes to extract data from diverse sources, transform it into a suitable format, and load it into Amazon Redshift (see the Redshift load sketch at the end of this role).
Monitored incoming data feeds from sources such as clinical trials, electronic health records (EHR), laboratory systems, and regulatory databases to ensure data integrity and completeness.
Performed data transformations to standardize formats, resolve data inconsistencies, and harmonize terminology across different data sets to facilitate integration and analysis.
Cleansed and preprocessed data to ensure its accuracy, consistency, and reliability, including handling missing values, data normalization, and standardization.
Designed, developed, and maintained data warehouses and data lakes to store large volumes of structured and unstructured data efficiently.
Implemented data modelling techniques such as star and snowflake schemas to optimize querying and analysis.
Integrated data from streaming platforms (Apache Kafka, Amazon Kinesis), relational and NoSQL databases (MySQL, MongoDB), RESTful APIs, and third-party services to create a unified view of data for analysis and reporting (see the streaming sketch at the end of this role).
Implemented data governance policies and security measures to ensure compliance with regulatory requirements such as HIPAA (Health Insurance Portability and Accountability Act) and GDPR (General Data Protection Regulation), including data anonymization, encryption, access control, and audit trails.
Optimized the performance of databases, data pipelines, and analytical queries to handle large-scale datasets efficiently, using indexing, partitioning, caching, and parallel processing techniques.
Collaborated with data scientists and analysts to perform exploratory data analysis, develop predictive models, and derive actionable insights from the data.
Created interactive dashboards and visualizations using Power BI to communicate findings effectively.
Conducted thorough testing of data pipelines, ETL processes, and analytical models to ensure accuracy, reliability, and robustness; implemented automated testing frameworks and monitored data quality metrics regularly.
Applied strong Python and SQL programming skills to intricate data manipulation and scripting tasks, developing efficient data processing solutions and automating repetitive tasks.
Created automated and scheduled reports using Power BI's subscription and distribution features, enabling stakeholders to receive up-to-date insights without manual intervention.
Demonstrated ability to connect, clean, and transform data from various sources into usable formats within Power BI, ensuring data accuracy and consistency for meaningful visualizations.
Implemented dynamic parameterization within Airflow workflows, allowing for dynamic scheduling and execution of tasks based on changing data sources and conditions.
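A minimal sketch of the Redshift loading step referenced above, using psycopg2 to run a COPY from S3; the cluster endpoint, credentials, IAM role, and table names are placeholders.

import psycopg2

# Placeholder connection details and object names, for illustration only.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)

copy_sql = """
    COPY staging.lab_results
    FROM 's3://example-bucket/lab_results/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    # Load staged Parquet files, then append them to the reporting table.
    cur.execute(copy_sql)
    cur.execute("INSERT INTO warehouse.lab_results SELECT * FROM staging.lab_results;")
conn.close()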
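And a minimal sketch of the streaming integration referenced above, consuming Kafka events and landing them in MongoDB with kafka-python and pymongo; the broker, topic, database, and collection names are assumed for illustration.

import json

from kafka import KafkaConsumer
from pymongo import MongoClient

# Placeholder broker, topic, and collection names, for illustration only.
consumer = KafkaConsumer(
    "ehr-events",
    bootstrap_servers=["broker1:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="ehr-loader",
)

mongo = MongoClient("mongodb://localhost:27017")
events = mongo["clinical"]["ehr_events"]

# Land each event in MongoDB; downstream jobs join it with MySQL reference data.
for message in consumer:
    events.insert_one(message.value)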
Hadoop Developer Oct 2015 – Sep 2017
Broadridge Hyderabad, India
Installed and configured Hive, HDFS, and NiFi, and implemented an HDP Hadoop cluster; helped with performance tuning and monitoring.
Involved in loading and transforming large sets of structured data from router locations to the EDW using a NiFi data pipeline flow.
Designed and maintained data models, schemas, and metadata definitions for the data warehouse.
Collaborated with data analysts and business users to understand reporting requirements and translate them into actionable data models.
Developed PySpark code and Spark SQL for faster testing and processing of data.
Worked with data serialization formats (Parquet, ORC, Avro, JSON, and CSV) for converting complex objects into byte streams.
Created Hive tables to load large data sets of structured data coming from WADL after transformation of raw data (see the Hive table sketch at the end of this role).
Created reports for the BI team, using Sqoop to move data into HDFS and Hive.
Managed and reviewed Hadoop log files.
Evaluated the suitability of Hadoop and its ecosystem for the project and implemented various proof-of-concept applications to support its eventual adoption.
Documented data warehouse structures, data lineage, transformations, and business rules; maintained metadata repositories to track data sources, data definitions, and mappings.
Monitored data warehouse processes, including ETL jobs, data loads, and query performance; identified and resolved issues such as data load failures and data discrepancies in a timely manner.
Collaborated with business analysts, data architects, and other stakeholders to understand data requirements, define data warehouse solutions, and ensure alignment with business goals.
Continuously evaluated and optimized data warehouse solutions and data products to meet evolving business needs.
Stayed abreast of industry trends, emerging technologies, and best practices in data engineering.
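A minimal sketch of the Hive table creation referenced above, issued through Spark SQL; the database, table, columns, and storage location are illustrative placeholders.

from pyspark.sql import SparkSession

# Hypothetical database, table, and storage location, for illustration only.
spark = (
    SparkSession.builder
    .appName("create-hive-table")
    .enableHiveSupport()
    .getOrCreate()
)

# Define an external Hive table over transformed Parquet output.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS edw.device_events (
        device_id STRING,
        event_type STRING,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/edw/device_events'
""")

# Register any partitions written outside of Hive (e.g., by NiFi or Spark jobs).
spark.sql("MSCK REPAIR TABLE edw.device_events")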
Database Administrator Mar 2013 – July 2015
Hitachi Ltd Hyderabad, India
Installed, configured, and maintained MySQL database servers on Linux and Windows platforms, ensuring optimal performance and security configurations.
Conducted performance tuning activities, including query optimization, index optimization, and configuration tuning, to improve database performance and response times.
Performed regular maintenance tasks such as database backups, restores, and integrity checks to ensure data consistency and reliability.
Implemented caching mechanisms, such as query caching and caching solutions like Redis and Memcached, to reduce query execution times and improve overall system performance.
Managed SQL Server instances, databases, and services, performing routine maintenance tasks such as database backups, restores, and integrity checks.
Developed and implemented backup and recovery strategies for MySQL databases, including full backups, incremental backups, and point-in-time recovery procedures.
Configured MySQL replication for data redundancy and high availability, implementing master-slave and master-master replication topologies to ensure data consistency and failover resilience.
Automated backup processes using tools like mysqldump, MySQL Enterprise Backup, and third-party backup solutions, ensuring data integrity and disaster recovery readiness (see the backup sketch at the end of this role).
Reviewed and optimized database schemas, table designs, and data relationships to ensure scalability, maintainability, and data integrity in financial systems.
Developed PowerShell scripts and T-SQL scripts to automate routine database administration tasks, such as database provisioning, backup scheduling, and performance monitoring.
Tested backup and recovery procedures periodically to validate their effectiveness and minimize data loss in the event of hardware failures.
Planned and executed database upgrades, migrations, and patch deployments to apply security patches, bug fixes, and performance enhancements while minimizing downtime and service disruptions.
Maintained comprehensive documentation of database configurations, procedures, and troubleshooting guides to facilitate knowledge sharing and collaboration among team members.
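A minimal sketch of the automated mysqldump backup referenced above; the database name, backup directory, and credential handling are assumptions for illustration.

import datetime
import pathlib
import subprocess

# Placeholder database name and backup directory, for illustration only.
DB_NAME = "finance_db"
BACKUP_DIR = pathlib.Path("/var/backups/mysql")

def run_backup() -> pathlib.Path:
    """Take a consistent logical backup with mysqldump and return the file path."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    outfile = BACKUP_DIR / f"{DB_NAME}_{stamp}.sql"

    # --single-transaction gives a consistent snapshot for InnoDB without locking;
    # credentials are assumed to come from an option file (e.g., ~/.my.cnf), not the CLI.
    with outfile.open("w") as fh:
        subprocess.run(
            ["mysqldump", "--single-transaction", "--routines", DB_NAME],
            stdout=fh,
            check=True,
        )
    return outfile

if __name__ == "__main__":
    print(f"Backup written to {run_backup()}")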