ANWESH M.
Data Engineer
+1-203-***-**** | **********@*****.*** | Hartford, CT | LinkedIn
SUMMARY
Efficient Data Engineer with 5 years' experience in the design, development, and maintenance of large-scale business applications spanning data migration, integration, conversion, and testing. Proficient in the Hadoop ecosystem, having designed and developed applications across diverse domains including Finance, Insurance, Healthcare, and Technology. Proven record in process automation, workflow streamlining, and improving software delivery. Skilled in Big Data solutions using Hadoop technologies (HDFS, Pig, Hive, Impala, Sqoop, Apache Spark, Kafka), Spark SQL, and MapReduce, processing structured, semi-structured, and streaming data with a focus on optimization through process simplification. Experienced with Sqoop for seamless data import/export, cloud platforms including AWS and Azure (Data Lake, Data Factory, Blob Storage, Synapse), and automation tools such as Control-M. Engaged in the full SDLC and Agile methodologies, with strengths in log data integration, unit testing, and data warehousing principles. Strong technical team player with strong communication skills and a knack for problem-solving in distributed systems, leveraging emerging technologies for optimal system performance.
SKILLS
Programming Languages: Python, SQL, PySpark, Scala, Linux Shell Script, R
Big Data Processing & Tools: PySpark, Apache Spark, Hadoop, Hive, Sqoop, Oozie, Impala
Cloud Computing: AWS (EC2, S3, RDS, DynamoDB, SNS, SQS, Kinesis, Step Functions, Lambda, Glue, Redshift, CloudWatch, CloudFormation), Azure (Data Factory, Synapse Analytics)
Relational Databases: MySQL, SQL Server, Oracle, Redshift, Snowflake
NoSQL Databases: DynamoDB, HBase, Cassandra, MongoDB
Data Visualization: Matplotlib, Seaborn, Plotly, Bokeh, Tableau, Power BI
Version Control: GitHub, Azure DevOps
ETL Tools: Talend, SSIS, Matillion, Apache Airflow, AWS Glue, Apache Hadoop, PySpark
Container Platforms & CI/CD: Docker, Kubernetes, Jenkins
Operating Systems: Windows, Linux, macOS
Data Processing Techniques: Data Mining, Clustering, Data Visualization, Data Analytics
Methodologies: Agile SDLC, Waterfall
WORK EXPERIENCE
BNY Mellon, Texas January 2024 – Present
Data Engineer
Engineered Apache Spark-based data pipelines that boosted processing efficiency by 30%, handling over 1 TB of financial data daily.
Orchestrated ETL workflows with Talend and Python, reducing data processing time by 40% and enhancing data accuracy across MySQL, SQL Server, Oracle, and DynamoDB.
Implemented Airflow DAGs (Directed Acyclic Graphs) for scheduling and monitoring data workflows, optimizing task dependencies and execution, which improved pipeline reliability and reduced processing time by 20%.
Worked with the application development team to develop a web service application using Java Spring MVC and Hibernate.
Developed and optimized multi-threaded scripts using the Kafka producer and consumer APIs.
Optimized AWS Redshift clusters, slashing query times by 50% and improving dashboard responsiveness for real-time financial analytics.
Developed real-time data streaming solutions using AWS Kinesis and Lambda Functions, supporting high-frequency trading analytics with data streams of 100K events per second.
Implemented Docker and Kubernetes containers for scalable deployment, integrating CI/CD pipelines with Jenkins, automating data pipeline deployments, and increasing release frequency by 60%.
Designed and implemented scalable data warehousing solutions using Snowflake, optimizing data storage and query performance to support analytics and reporting requirements.
Dell Technologies, Texas August 2022 – January 2024
Data Engineer
Migrated critical data from an on-premises MySQL database to DynamoDB, improving scalability and availability.
Developed a data analysis application using MS Access for a small business, improving customer insights and decision-making.
Developed and implemented complex SQL queries that reduced data retrieval time by 25%.
Developed ETL processes to load and transform data into Snowflake, integrating with various data sources and ensuring high data quality and reliability, which improved overall data accessibility and business insights.
Designed end-to-end architecture for retail data and implemented various AWS services, including Amazon Redshift.
Processed flat files in various formats and stored them using various partitioning models in HDFS.
Good understanding of data ingestion for data orchestration and related Python libraries.
Created machine learning models in Python using libraries like Scikit-learn, achieving 35% accuracy on specific tasks.
Deep knowledge of incremental imports and of partitioning and bucketing concepts in Hive and Spark SQL needed for optimization.
Defined test cases, analyzed bugs, interacted with team members to fix errors, and performed unit testing and User Acceptance Testing (UAT).
Loaded data into Spark RDDs for in-memory computation; performed data analysis using SQL, Python, Spark, Databricks, and Teradata SQL Assistant.
Worked on REST APIs and Python APIs and landed data from external sources to S3.
Created data pipelines to ingest, aggregate, and load consumer response data from AWS S3 buckets into external tables in DynamoDB and Redshift, serving as feeds for Tableau dashboards.
Proficient in Python (2.x, 3.x) for data manipulation and statistical programming.
Utilized object-oriented design patterns to build scalable and efficient data models, facilitating the management of large datasets and complex data transformations.
Created and maintained detailed documentation for debugging procedures, unit testing frameworks, and object-oriented designs, ensuring knowledge transfer and facilitating onboarding for new team members.
Utilized SQL for efficient database querying and management; developed and maintained data models and databases.
Employed HTML5 and CSS3 for web development tasks; executed data integration and ensured metadata management.
Managed and optimized databases using SQL Server, MySQL, MS Access, Oracle, and DynamoDB.
Leveraged NoSQL databases such as Cassandra for specific data requirements.
Designed SSRS reports for sales analysis that boosted revenue forecasting accuracy by 5%.
BCBS, Texas March 2021 – July 2022
Data Engineer (Intern)
Developed and maintained database systems using SQL Server, MySQL, MS Access, Oracle, and DynamoDB in the healthcare sector.
Implemented data transformation processes and pivot tables to enhance data analytics capabilities.
Optimized data manipulation routines in SQL, resulting in 15% faster data processing.
Leveraged statistical programming skills in MATLAB and Simulink for advanced data analysis in healthcare projects.
Designed and managed databases, ensuring efficient data integration and metadata management.
Executed data clustering techniques to identify patterns and trends, contributing to informed decision-making.
Utilized AWS, Azure Data Factory, and Google Cloud Platform for cloud-based data storage and processing.
Automated data ingestion processes using Azure Logic Apps and Azure Functions, improving data availability and reducing manual intervention.
Migrated on-premises data solutions to Azure Synapse Analytics, reducing infrastructure costs and improving scalability.
Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.
Applied SDLC methodologies, including Waterfall and Agile, to ensure systematic and effective project development.
Collaborated with cross-functional teams, employing excellent analytical and communication skills.
Reduced data retrieval time in SQL Server by 20% using optimized indexing strategies.
Utilized Python (2.x, 3.x) for scripting and automation tasks, enhancing data engineering workflows.
Developed and maintained healthcare data visualizations using Tableau for effective reporting and analysis.
Implemented automated unit testing frameworks to streamline the testing process, leading to faster deployment cycles and higher software quality.
Implemented machine learning models with PyTorch, PySpark, Scikit-learn, TensorFlow, and NumPy.
Employed Cassandra for distributed database management, ensuring scalability and reliability.
Conducted regular code versioning and collaboration using Git and SVN.
Delivered efficient SQL Server reporting and integration solutions using SSRS and SSIS.
Applied Agile methodologies for rapid and flexible development cycles in healthcare data projects.
Supported data-driven decision-making processes through comprehensive data analysis and reporting.
Ensured compliance with healthcare data regulations and security standards in database management.
Participated in regular code reviews and knowledge-sharing sessions to foster a collaborative team environment.
Actively participated in the deployment and maintenance of data engineering solutions for healthcare applications.
Trigent, India March 2019 – December 2020
Data Analyst
Collaborated with cross-functional teams to understand healthcare data requirements and designed robust data models for efficient storage and retrieval.
Implemented data integration pipelines using Python (2.x, 3.x) and SQL across diverse database platforms, including SQL Server, MySQL, Oracle, and DynamoDB.
Utilized Python for data analysis and visualization, leveraging libraries like NumPy, Matplotlib, and Seaborn.
Developed an AWS S3-based data lake architecture, reducing data storage costs by 30%.
Employed statistical programming in MATLAB and Simulink to analyze healthcare datasets, providing valuable insights for decision-making.
Created and optimized SQL queries for complex data retrieval tasks, ensuring data accuracy and reliability for downstream analytics.
Reduced data processing time by 25% using optimized ETL pipelines in Python.
Utilized Pandas, NumPy, and Matplotlib for exploratory data analysis, enhancing data quality and facilitating the identification of patterns and trends.
Implemented data clustering techniques to categorize healthcare data, enhancing the efficiency of analytical processes.
Experienced in working with multiple databases, including MySQL, SQL Server, and Oracle.
Built and deployed data pipelines using Python libraries like Pandas, TensorFlow, and Matplotlib.
Automated data pipelines using Azure Data Factory, improving data processing time by 50%.
Used GCP BigQuery to analyze massive datasets, enabling faster and more insightful market research.
Excelled in documenting project specifications, data dictionaries, and technical documentation.
EDUCATION
Master of Data Analytics June 2022
Northeastern University
Bachelor's in Mechanical Engineering October 2020
Chalapathi Institute of Technology
CERTIFICATIONS
Python for Data Analysis and Visualization
Complete Python Boot Camp