Data Engineer

Location:
New Paltz, NY
Salary:
$120,000
Posted:
February 20, 2025


Resume:

VINEESHA MORAM

+1-904-***-**** ***************@*****.***

DATA ENGINEER - AWS, AZURE, ETL, PYTHON

JOB OBJECTIVE:

I am seeking a challenging role where I can leverage my expertise in big data technologies, cloud platforms, data visualization, and data pipeline optimization to drive data-driven decision-making and deliver significant business value. My goal is to continuously innovate and contribute to organizational growth through strategic data engineering and cutting-edge technological solutions.

PROFESSIONAL SUMMARY:

Over 5 years of experience in data engineering with a focus on AWS, Azure, ETL processes, and Python. Proficient in designing, developing, and maintaining robust data pipelines and workflows.

Advanced skills in cloud platforms, particularly AWS and Azure, with hands-on experience using AWS services like S3, EMR, Glue, Redshift, and Lambda, and Azure services like Azure Data Factory, Databricks, SQL DB, Synapse, Key Vault, and Data Lake Storage.

Extensive experience with Snowflake, including data warehousing, data modeling, and optimization. Skilled in designing and managing scalable data pipelines and performing complex queries using Snowflake’s capabilities.

Proven expertise in ETL processes and tools, integrating data from multiple sources and ensuring effective data transformation and loading.

Skilled in Python for developing data processing scripts, automation, and custom data solutions. Proficient in using Python libraries for data manipulation, analysis, and machine learning.

Hands-on experience with real-time data processing technologies like Apache Kafka and Apache Flink, capable of managing and processing real-time data streams for timely insights.

Strong background in data warehousing technologies, including Snowflake and BigQuery, and experienced in designing and optimizing data models for business intelligence.

Skilled in creating interactive and informative dashboards using Tableau and Power BI, designing reports to support data-driven decision-making.

Expertise in developing automation frameworks and scripts for data validation, process testing, and operational efficiency, as well as automating repetitive tasks and workflows.

Experienced in developing and managing APIs for data exchange and integration across various platforms, creating robust solutions to connect disparate systems.

Proven ability to plan and execute cloud data migration projects, ensuring smooth transitions and minimal disruptions. Knowledgeable in Docker and Kubernetes for containerization and orchestration, with experience deploying and managing containerized applications and microservices.

Experience implementing data governance frameworks and ensuring compliance with regulations like GDPR and CCPA, focusing on data privacy, security, and protection.

Familiar with CI/CD practices and tools such as Jenkins, GitLab CI/CD, and Terraform, skilled in automating deployment processes and maintaining high code quality.

Strong collaboration skills with experience working with cross-functional teams, including data scientists, analysts, and business stakeholders, delivering data solutions.

Managed data engineering projects from inception to completion, including scoping, planning, execution, and delivery, with the ability to handle multiple projects and meet deadlines.

Proven problem-solving abilities, troubleshooting complex data engineering issues, including performance bottlenecks and system integration challenges.

Demonstrated adaptability to new technologies and tools, with a commitment to continuous learning and professional growth in the data engineering field.

Proficient in managing and optimizing various database systems, including relational databases, NoSQL databases, and Snowflake, ensuring high performance and availability.

Strong focus on delivering solutions tailored to client needs, providing actionable insights, and aligning data solutions with business objectives.

TECHNICAL SKILLS:

Operating Systems: Linux (RHEL/Ubuntu), Windows (XP/7/8/10), UNIX, macOS.

Programming Languages: Python, R, C, C++, Java, Scala, SAS, Shell, Bash.

Python Libraries/Packages: Psycopg2, Pandas, Matplotlib, SQL, HTTPLib2, Teradata, PyHive, NumPy, SciPy, Boto, AWS CDK, PySide, PyTables, DataFrame.

Statistical Analysis Skills: A/B Testing, Time Series Analysis, Marko

IDE: PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, Sublime Text, Visual Studio Code

Machine Learning and Analytical Tools: Supervised Learning (Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, Classification), Unsupervised Learning (Clustering, KNN, Factor Analysis, PCA), Natural Language Processing, Tableau.

Amazon Web Services: EC2, S3, MQ, ECS, Lambda, Redshift, SageMaker, RDS, SQS, DynamoDB, IAM, CloudWatch, EBS, and CloudFormation

Databases/Servers: Hive, DB2, MySQL, SQLite3, PostgreSQL, MongoDB

ETL: Informatica 9.6, SSIS, Azure Data Factory, Azure Databricks, Azure SQL DB, Azure Synapse Analytics, Azure Blob Storage.

Web Services/Protocols and Miscellaneous: HTTP/HTTPS, REST, RESTful, GitLab, GitHub

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Sqoop, Spark, Airflow, Azure Databricks, Azure HDInsight, Azure Data Lake Storage, Snowflake, Azure Synapse, Kafka, YARN, Hudi, Oozie, Zookeeper.

Build and CI Tools: Docker, Kubernetes, Azure DevOps, Jenkins, Screwdriver

Agile Tools: Jira, Rally

PROFESSIONAL EXPERIENCE

Strategic Financial Solutions NY, Amherst, New York Dec 2023 to Present

Azure Data Engineer

Description

“Strategic Financial Solutions” helps individuals overcome financial challenges with personalized debt relief solutions, fostering stability and community impact. I specialize in deploying data processing solutions on Azure.

Roles and Responsibilities:

Worked on migrating data from an on-prem SFTP server to Azure Cloud using Azure Data Factory.

Designed and implemented data pipelines to extract data from on-premises systems and load it into Azure Delta Lake and Snowflake.

Built ETL pipelines for historical and incremental data migration from PostgreSQL and SQL Server to Snowflake, optimizing data processing and storage efficiency.
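
A minimal sketch of what one incremental slice of such a migration can look like in PySpark, assuming a JDBC source and the Spark Snowflake connector on the cluster; the table names, connection options, and the updated_at watermark column are illustrative placeholders rather than the actual pipeline.

    # Illustrative sketch only: incremental PostgreSQL-to-Snowflake load with PySpark.
    # Connection details, table names, and the watermark column are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pg_to_snowflake_incremental").getOrCreate()

    watermark = "2024-01-01 00:00:00"  # normally read from a control table

    # Pull only rows changed since the last successful load (the incremental slice).
    changed = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/sales")
        .option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{watermark}') AS src")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    # Append the slice to a Snowflake staging table via the Spark Snowflake connector.
    sf_options = {
        "sfURL": "account.snowflakecomputing.com",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "STAGING",
        "sfWarehouse": "ETL_WH",
        "sfUser": "etl_user",
        "sfPassword": "***",
    }
    (
        changed.write.format("snowflake")
        .options(**sf_options)
        .option("dbtable", "ORDERS_STAGE")
        .mode("append")
        .save()
    )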

Configured Databricks Notebooks to load data into Delta tables and implemented data transformations using SQL, Python, and Scala, including JSON file parsing.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation.
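
A minimal sketch of the extract-transform-aggregate pattern with PySpark and Spark SQL; the input path, column names, and Delta output location are hypothetical.

    # Illustrative sketch: extraction, transformation, and aggregation with PySpark/Spark SQL.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl_aggregation").getOrCreate()

    payments = spark.read.json("/mnt/raw/payments/")            # extraction
    cleaned = (
        payments
        .filter(F.col("amount").isNotNull())                    # basic transformation
        .withColumn("payment_date", F.to_date("created_at"))
    )
    cleaned.createOrReplaceTempView("payments_clean")

    # Aggregation expressed in Spark SQL.
    daily_totals = spark.sql("""
        SELECT payment_date, status, SUM(amount) AS total_amount, COUNT(*) AS txn_count
        FROM payments_clean
        GROUP BY payment_date, status
    """)
    daily_totals.write.format("delta").mode("overwrite").save("/mnt/curated/daily_payment_totals")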

Managed SQL pools in Azure Synapse for query performance optimization and configured consistency models in Cosmos DB.

Built and optimized data pipelines from legacy SQL Server to Azure Data Warehouse using Azure Data Factory, Databricks, and Azure SQL DB.

Implemented Azure Linked Services to integrate with Snowflake and used stored procedures within ADF to enhance data processing automation.

Worked on Secure Data Sharing in Snowflake and API-based data access to support external data products.

Integrated with external reporting tools such as Looker, Tableau, and Power BI for business intelligence and analytics.

Contributed to defining technical standards, including schema design, naming conventions, and data governance frameworks.

Managed CI/CD pipelines and automated testing for data pipelines using Jenkins, GitHub Actions, and Azure DevOps, ensuring robust deployment practices.

Worked with GitHub for branching, tagging, and release activities while utilizing Agile methodologies and tracking sprint cycles in JIRA.

Provided mentorship to junior engineers, conducted knowledge-sharing sessions, and participated in code reviews to uphold best practices in data engineering.

Implemented performance optimization techniques such as partitioning, indexing, materialized views, and caching strategies to enhance scalability and efficiency in large-scale data environments.
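
Two of those techniques, clustering (Snowflake's partition-pruning mechanism) and a materialized view, sketched as Snowflake SQL driven from Python. This assumes the snowflake-connector-python package; the table, column, and view names are hypothetical, and materialized views require a Snowflake edition that supports them.

    # Illustrative sketch: applying a clustering key and a materialized view in Snowflake.
    # Connection parameters and object names are hypothetical.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myaccount", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="REPORTING",
    )
    cur = conn.cursor()

    # Cluster the large fact table on the columns most queries filter on.
    cur.execute("ALTER TABLE FACT_PAYMENTS CLUSTER BY (PAYMENT_DATE, REGION)")

    # Precompute a frequently used aggregate as a materialized view.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS MV_DAILY_PAYMENTS AS
        SELECT PAYMENT_DATE, REGION, SUM(AMOUNT) AS TOTAL_AMOUNT
        FROM FACT_PAYMENTS
        GROUP BY PAYMENT_DATE, REGION
    """)

    cur.close()
    conn.close()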

Designed and developed data models for efficient querying and reporting using Snowflake SQL and Azure SQL DB.

Built and optimized data workflows for real-time and batch data processing using Azure Data Factory and Databricks.

Worked on data security best practices, including the use of Azure Key Vault for secrets management and ensuring compliance with data governance standards.

Integrated data lakes and data warehouses with cloud-native services to ensure seamless data flow and high availability.

Environment: Azure, Azure Data Factory, Snowflake, performance tuning, ADLS Gen2, dataflow jobs, copy activity, lookup activity, Data Flow, linked services, StreamSets, Databricks, Snowpipe, Scala, SQL, PySpark, Python, Azure Key Vault, JIRA, GitHub.

Essen Health Care, Bronx, New York Oct 2022 to Nov 2023

Data Engineer

Description

“Essen Health Care” provides comprehensive healthcare services through its network of Urgent Care, Primary Care, and Specialty Care divisions, with a commitment to delivering top-quality, valued care to patients across all medical needs. In this role, I designed and developed strategies for efficient data loading, transformation, and pipeline development using AWS services.

Roles and Responsibilities:

Understand business needs and technical requirements using tools like JIRA and Confluence, and conduct stakeholder meetings to document requirements.

Gather data from various sources such as CRM systems, transaction databases, and online sources, using Apache Kafka or AWS Kinesis for real-time streaming and change tracking, and Apache NiFi or AWS Glue for batch integration.

Use Apache Hudi to efficiently manage record-level inserts, updates, and deletes on large datasets, allowing for incremental data processing and providing support for upserts and data versioning.
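
A minimal sketch of a Hudi upsert from PySpark, assuming the Hudi Spark bundle is available on the cluster; the S3 paths, table name, and key columns are hypothetical.

    # Illustrative sketch: record-level upserts into an Apache Hudi table with PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi_upsert").getOrCreate()

    incremental_df = spark.read.parquet("s3://raw-zone/patients/2023-11-01/")

    hudi_options = {
        "hoodie.table.name": "patients",
        "hoodie.datasource.write.recordkey.field": "patient_id",      # record key for upserts
        "hoodie.datasource.write.precombine.field": "updated_at",     # latest version wins
        "hoodie.datasource.write.partitionpath.field": "state",
        "hoodie.datasource.write.operation": "upsert",
    }

    (
        incremental_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://curated-zone/hudi/patients/")
    )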

Ensure data quality and consistency by handling missing values, outliers, and performing data normalization and transformation, primarily utilizing PySpark and Python libraries like Pandas and NumPy.
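
A small sketch of that kind of data-quality handling in Pandas and NumPy; the file, column names, and thresholds are hypothetical.

    # Illustrative sketch: missing values, outliers, and normalization with Pandas/NumPy.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("visits.csv")

    # Fill missing amounts and drop rows that lack a record key.
    df["charge_amount"] = df["charge_amount"].fillna(0.0)
    df = df.dropna(subset=["visit_id"])

    # Flag outliers with a simple z-score rule and replace them with the median.
    z = (df["charge_amount"] - df["charge_amount"].mean()) / df["charge_amount"].std()
    df.loc[np.abs(z) > 3, "charge_amount"] = df["charge_amount"].median()

    # Min-max normalization of a numeric feature.
    lo, hi = df["visit_duration"].min(), df["visit_duration"].max()
    df["visit_duration_norm"] = (df["visit_duration"] - lo) / (hi - lo)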

Conduct exploratory data analysis (EDA) to gain insights into data distribution and patterns using visualization libraries such as Matplotlib, Seaborn, and Plotly within Jupyter Notebook.

Utilize AWS S3 as the central data lake for storing raw, processed, and curated datasets, providing scalable, cost-effective storage for large volumes of structured and unstructured data with seamless integration into downstream processing and analytics services.

Store processed data in scalable and accessible formats using AWS S3, AWS Redshift, and AWS DynamoDB for low-latency, high-throughput access to real-time data.

Organize data in AWS S3 using partitioning strategies (e.g., by date, region, or other relevant dimensions) to enhance query performance and data retrieval efficiency in downstream analytics.
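
A minimal sketch of date/region partitioning on write so that downstream engines can prune partitions; paths and column names are hypothetical.

    # Illustrative sketch: writing curated data to S3 partitioned by date and region.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("partitioned_write").getOrCreate()

    events = spark.read.parquet("s3://datalake/processed/events/")
    (
        events.withColumn("event_date", F.to_date("event_ts"))
        .write.partitionBy("event_date", "region")     # partition columns become S3 prefixes
        .mode("overwrite")
        .parquet("s3://datalake/curated/events/")
    )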

Develop ETL pipelines using AWS Glue to transform data and load it into Redshift and ClickHouse, with UNIX scripts managing scheduling and execution.

Utilized DBT (Data Build Tool) to develop and manage data models in Snowflake and Redshift, streamlining transformations and improving query performance.

Leverage Databricks along with AWS EMR, Glue, and Lambda to optimize data workflows and pipeline performance.

Build predictive models and perform advanced analytics using Scala and Spark for linear regression, Apache Flink for real-time processing, and PySpark for data computation and modeling.

Integrate ClickHouse for handling real-time data analytics workloads, ensuring sub-second query response times for large datasets.

Automate and manage workflows by integrating Apache Airflow with AWS to schedule and monitor ML workflows and employ Lambda functions for event-driven processing and AWS CloudFormation for infrastructure automation.
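
A minimal Airflow DAG sketch for this kind of orchestration, assuming the apache-airflow-providers-amazon package; exact operator module paths and parameters vary by provider version, and the DAG id, job name, function name, and schedule are hypothetical.

    # Illustrative sketch: an Airflow DAG that runs a Glue job, then invokes a Lambda.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.operators.lambda_function import LambdaInvokeFunctionOperator

    with DAG(
        dag_id="nightly_feature_refresh",
        start_date=datetime(2023, 1, 1),
        schedule="0 2 * * *",
        catchup=False,
    ) as dag:
        transform = GlueJobOperator(
            task_id="run_glue_transform",
            job_name="curate_events_job",
        )
        notify = LambdaInvokeFunctionOperator(
            task_id="invoke_downstream_lambda",
            function_name="publish-refresh-event",
        )
        transform >> notify    # Lambda fires only after the Glue job succeeds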

Facilitate data querying and visualization using AWS Athena and ClickHouse, and create visualizations and reports with Power BI.
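
One way such an ad-hoc Athena query can be issued programmatically, sketched with boto3; the database, table, and results bucket are hypothetical.

    # Illustrative sketch: submitting an Athena query from Python with boto3.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT region, COUNT(*) AS visits FROM curated.events GROUP BY region",
        QueryExecutionContext={"Database": "curated"},
        ResultConfiguration={"OutputLocation": "s3://athena-results/adhoc/"},
    )
    print(response["QueryExecutionId"])   # poll get_query_execution with this id for status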

Use Docker and Kubernetes for containerization and orchestration, monitor services with Prometheus and Grafana, and automate infrastructure deployment with CloudFormation templates.

Generate reports and provide insights using MicroStrategy for advanced reporting and Power BI for developing dashboards and visualizations.

Ensure continuous improvement through regular code reviews, model evaluations, and gathering stakeholder feedback, by adopting tools like Git for version control, Bitbucket for collaboration, Jenkins for CI/CD, and Unix for environment setup and deployment tasks.

Create a Python framework for AWS cloud automation, incorporating multiprocessing for end-of-day (EOD) and intraday data uploads and extractions, utilizing AWS CDK and Terraform for seamless integration and deployment.
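
A stripped-down sketch of what such a framework's parallel upload step can look like with multiprocessing and boto3; the bucket, prefix, and local paths are hypothetical.

    # Illustrative sketch: uploading EOD extracts to S3 in parallel.
    import glob
    from multiprocessing import Pool

    import boto3

    BUCKET = "eod-extracts"

    def upload(path: str) -> str:
        # Each worker builds its own client; boto3 clients should not be shared across forks.
        s3 = boto3.client("s3")
        key = "eod/" + path.split("/")[-1]
        s3.upload_file(path, BUCKET, key)
        return key

    if __name__ == "__main__":
        files = glob.glob("/data/eod/*.csv")
        with Pool(processes=8) as pool:
            for key in pool.map(upload, files):
                print(f"uploaded s3://{BUCKET}/{key}")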

Environment: Hadoop, Spark, Scala, AWS, EMR (Elastic MapReduce), Lambda, S3, Elasticsearch, Athena, Glue, Redshift, CloudWatch, Snowflake, Hive, Sqoop, Oozie, Tableau, DBT, SQL, NumPy, Pandas.

Alepo Technologies Inc, Austin, Texas Oct 2021 to Apr 2022

Data Engineer

Description

Alepo Technologies enables telecoms to seize AI-driven transformation opportunities, offering advanced software solutions for revenue growth and market expansion across fixed and mobile networks with a focus on digital enablement and scalability. This project involved the design, development, and deployment of a comprehensive data engineering solution leveraging AWS services.

Roles and Responsibilities:

Identified and cataloged data sources, defining ingestion frequency, transformation rules, and end-user needs to guide data processing and reporting.

Planned infrastructure using AWS services (S3, Redshift, Glue, EMR, Lambda, Kinesis, EC2, Athena, CloudWatch, SageMaker), ensuring scalability, reliability, and cost-efficiency.

Implemented batch data ingestion by designing AWS Glue jobs, writing Python and Spark extraction scripts, and loading raw data into Amazon S3 for further processing.
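
A minimal sketch of a Glue PySpark job of that shape: read a cataloged source, then land raw Parquet in S3; the catalog database, table, and output path are hypothetical.

    # Illustrative sketch: a Glue job that copies a cataloged table to raw S3 as Parquet.
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    source = glue_context.create_dynamic_frame.from_catalog(
        database="crm_raw", table_name="accounts"
    )
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://raw-zone/accounts/"},
        format="parquet",
    )
    job.commit()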

Set up real-time data ingestion by configuring Amazon Kinesis and managing clusters with Docker and Kubernetes, streaming data into S3 for real-time processing.
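
On the producer side, pushing events into a Kinesis stream is a few lines of boto3; the stream name and payload shape below are hypothetical.

    # Illustrative sketch: publishing events to an Amazon Kinesis data stream.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def publish_event(event: dict) -> None:
        kinesis.put_record(
            StreamName="subscriber-usage-stream",
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event["account_id"]),   # keeps one account's events ordered
        )

    publish_event({"account_id": 42, "event_type": "session_start", "bytes_used": 0})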

Conducted batch processing using AWS Glue and EMR, executing PySpark and SparkSQL jobs for data transformation, and scheduling jobs with AWS Step Functions or Lambda. Implemented real-time data pipelines with Kinesis and AWS Lambda, ensuring reliable, efficient processing for continuous data streams.

Stored raw and processed data in Amazon S3 as a Data Lake, and transformed data into Redshift or Athena for analytics, optimizing schema design for querying.

Ensured data quality and validation using Python scripts and AWS Glue jobs, maintaining audit logs for governance and compliance.

Developed reporting dashboards using Tableau and Power BI, providing actionable insights and supporting business decision-making with advanced analytics.

Deployed and optimized data pipelines using Kubernetes and Amazon EC2, while monitoring system performance and ensuring continuous operation, scalability, and cost-efficiency with AWS CloudWatch.

Environment: AWS, Redshift, Glue, Lambda, Kinesis, S3, Athena, CloudWatch, Kubernetes, Airflow, Docker, Hive, HDFS, Compute Engine, Spark SQL, PySpark, Apache Kafka, Python, IBM DB2, Shell Scripting, Power BI, Tableau.

Brio Technologies Private Limited, Hyderabad, India Jun 2019 to Sep 2021

Data Engineer

Roles and Responsibilities:

Developed scalable data solutions using Apache Hadoop, Hive, Pig, and Oozie to process large structured and semi-structured datasets.

Created complex Hive queries for ETL processes, extracting and loading data from Data Lakes into HDFS for efficient retrieval.

Designed end-to-end ETL pipelines between HDFS and AWS S3, automating workflows with AWS Data Pipeline and Airflow.

Processed large datasets with Spark (DataFrames, RDDs), writing Python scripts to transform and store data in Parquet format in HDFS.

Optimized MapReduce jobs and SQL queries, improving job performance and execution times on large data sets.

Automated batch workflows using Oozie for job scheduling and monitoring in both development and production environments.

Integrated Kafka and Spark streaming for real-time data processing, ensuring high throughput and fault tolerance.
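
A minimal sketch of that integration, shown here with Spark Structured Streaming (the original work may have used DStreams); broker addresses, topic, and HDFS paths are hypothetical.

    # Illustrative sketch: consuming a Kafka topic and landing events in HDFS as Parquet.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "clickstream")
        .option("startingOffsets", "latest")
        .load()
    )

    events = raw.select(F.col("value").cast("string").alias("payload"), F.col("timestamp"))

    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/streams/clickstream/")
        .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")   # fault tolerance
        .outputMode("append")
        .start()
    )
    query.awaitTermination()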

Leveraged AWS services (S3, RDS, EMR, EC2) to build, store, and process distributed data on the cloud, supporting scalable architectures.

Ingested data into HDFS with Pig scripts and extended functionality through custom User Defined Functions (UDFs).

Automated real-time data ingestion using Apache Flume, processing log data from various sources into HDFS.

Built CI/CD pipelines for seamless deployment of data solutions, ensuring efficient release cycles in both sandbox and production environments.

Environment: Python, Shell Scripting, Splunk, Sqoop, Hadoop, HDFS, MapReduce, Hive, Apache Spark, HBase, Pig, Kafka, Flume, Oozie, AWS (EC2, S3, VPC, Redshift, EMR, CloudFormation Templates), Oracle, MySQL, DB2, SQL Server.

Education: Bachelor’s in Computer Science, Vignan University, India

Master’s in Data Science, Saint Peter’s University, NJ, USA.


