
Data Engineer Machine Learning

Location:
Hasbrouck Heights, NJ
Posted:
February 09, 2025


Stalin Goud

Data Engineer

**************@*****.*** +1-510-***-****

PROFESSIONAL SUMMARY

Results-driven Data Engineer with 3+ years of experience in building scalable data pipelines, optimizing ETL workflows, and automating data infrastructure using AWS, Spark, Python, and SQL. Skilled in designing high-performance data architectures to enhance analytics, reduce costs, and improve processing speeds by 30% or more. Proven expertise in big data technologies, stream processing, and cloud-native solutions, with a focus on data security, automation, and performance optimization. Adept at collaborating with cross-functional teams to translate business needs into actionable data solutions.

TECHNICAL SKILLS

Programming Languages: Scala, Python, R, SQL

IDEs: Eclipse, IntelliJ IDEA, PyCharm, Jupyter Notebook, Visual Studio Code

Big Data Ecosystem: Hadoop, MapReduce, Hive, HDFS, Spark, Kafka, PySpark, Apache Airflow, Sqoop, Flume, Nifi, Oozie, Zookeeper, Apache Flink

Machine Learning: Linear Regression, Logistic Regression, Decision Trees, SVM, K-Means, Random Forest

Cloud Platforms and Services:

AWS: S3, EMR, Redshift, Glue, Lambda, Kinesis, Athena, DynamoDB, CloudWatch, IAM, VPC.

Google Cloud: BigQuery, Cloud SQL, Cloud Composer/Airflow, Cloud Storage, Dataflow/Data Fusion, Dataproc, Pub/Sub

Azure: Data Lake, Data Factory, Databricks, Logic Apps, HDInsight, Synapse Analytics, Azure DevOps

Data Science Libraries: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow

CI/CD & DevOps Tools: Jenkins, Docker, Kubernetes, Terraform, Git, Ansible

Microsoft BI & ETL Stack: SSIS, SSRS, SSMS

BI & Analytics: Tableau, Power BI, QuickSight

Databases: SQL Server, PostgreSQL, MongoDB, HBase

Security & Compliance: Data Encryption, Role-Based Access Control (RBAC), GDPR, AWS IAM

Operating Systems: Windows, macOS, Linux

WORK EXPERIENCE

Principal Financial | New Jersey, USA | September 2022 - Present

Data Engineer

Leveraged Spark and Scala APIs to benchmark Hive against Spark SQL, leading to a 25% improvement in data processing efficiency.

Designed and implemented scalable ETL pipelines using AWS Glue, Redshift, and PySpark, improving data processing speeds by 30%.

Developed and optimized real-time streaming solutions using Kafka and AWS Kinesis, reducing latency for critical business insights.
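To illustrate the producer/consumer pattern behind such streaming work, here is a minimal in-process sketch; queue.Queue stands in for a Kafka topic, and all names and the transform are hypothetical, not the production pipeline:

```python
import queue
import threading

# Toy stand-in for a Kafka-style producer/consumer pipeline (no broker;
# queue.Queue plays the topic). Illustrates the pattern only.
topic = queue.Queue()
results = []

def producer(n):
    """Publish n events to the topic, then a sentinel marking end of stream."""
    for i in range(n):
        topic.put({"event_id": i})
    topic.put(None)

def consumer():
    """Consume events until the sentinel, applying a per-event transform."""
    while (msg := topic.get()) is not None:
        results.append(msg["event_id"] * 2)

threading.Thread(target=producer, args=(5,)).start()
consumer()
print(results)  # [0, 2, 4, 6, 8]
```

A real deployment would replace the queue with a Kafka consumer group or a Kinesis shard iterator, but the decoupling of producers from consumers is the same.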

Engineered and deployed Tableau dashboards, resulting in a 10% productivity boost for the team by providing real-time, actionable insights.

Automated data infrastructure and workflow orchestration using Apache Airflow, improving pipeline reliability and efficiency by 40%.
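The core idea Airflow encodes, that each task runs only after its upstream dependencies complete, can be sketched with the standard library's topological sorter (the task names are hypothetical; a real DAG would use Airflow operators and a scheduler):

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of upstream tasks it depends on.
# This mirrors the dependency edges an Airflow DAG declares.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}

# static_order() yields tasks so that every task appears after its upstreams.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load', 'report']
```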

Migrated on-prem SQL workloads to AWS, leveraging S3, Athena, and Redshift, cutting operational costs by 20%.

Built data security frameworks using AWS IAM, encryption, and multi-region backups, ensuring compliance with GDPR and other security standards.
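For illustration, a least-privilege IAM policy of the kind such a framework attaches to a pipeline role might look like the following (the bucket name and Sid are hypothetical, not taken from the actual environment):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyPipelineBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-pipeline-bucket",
        "arn:aws:s3:::example-pipeline-bucket/*"
      ]
    }
  ]
}
```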

Optimized PySpark scripts, achieving a 30% increase in processing speed and significantly enhancing data quality.

Orchestrated a full-scale AWS migration, utilizing Lambda, S3, and Redshift to reduce operational costs by 20%, while improving data quality scores by 50% and boosting machine learning model accuracy by 15%.

Collaborated with product, analytics, and DevOps teams to support data infrastructure scalability, enhancing reporting and analytics capabilities by 35%.

Streamlined Hive schema design and implemented advanced performance techniques, resulting in a 45% improvement in query efficiency and a 30% reduction in processing time.

Integrated Apache Airflow with AWS to oversee multi-stage ML workflows and developed CI/CD pipelines with Git and Jenkins, improving deployment efficiency by 40%.

Architected and executed robust data pipelines on GCP using Apache Airflow, significantly improving workflow automation and reliability.

KPIT Technologies | Maharashtra, India | August 2020 - July 2022

Data Engineer (promoted from Data Engineer Intern)

Processed and integrated over 100 GB of data daily using Spark SQL, ensuring optimal data quality and reducing processing errors.

Developed high-performance ETL pipelines to process 100GB+ of structured & unstructured data daily, reducing query execution time by 25%.

Enhanced SQL query optimization techniques, improving data retrieval speed by 20%.
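One concrete optimization of this kind, letting an index replace a full table scan, can be demonstrated with SQLite's query planner; the schema and data below are a toy example, not the production database:

```python
import sqlite3

# In-memory table with a non-indexed filter column (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

def plan(sql: str) -> str:
    """Return the planner's description of how SQLite will run `sql`."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

query = "SELECT total FROM orders WHERE customer_id = 42"
before = plan(query)  # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # index search on customer_id
print(before)
print(after)
```

The same principle, matching indexes to filter and join predicates, carries over to SQL Server, PostgreSQL, and Redshift sort/dist keys.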

Enhanced data retrieval speed by 20% through the refinement of Hive-based data pipelines.

Engineered Azure Data Lake and Data Factory pipelines to handle structured and unstructured data, boosting analytics capabilities by 30%.

Developed a Python-based live data conduit utilizing Apache Kafka and AWS Lambda, improving data flow by 25% and enabling real-time actionable insights.

Leveraged Apache Airflow for optimized pipeline execution, improving processing efficiency by 25%.

Executed data cleansing workflows using Python and SQL, reducing inaccuracies by 35%, leading to more reliable reporting and decision-making.
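A stripped-down sketch of such a cleansing pass using only the standard library (the schema and rules here are hypothetical; the real workflow combined Python with SQL at larger scale):

```python
import csv
import io

# Hypothetical raw feed: whitespace, inconsistent case, a duplicate id,
# and a row missing its key.
raw = io.StringIO(
    "id,email\n"
    "1, Alice@Example.com \n"
    "2,bob@example.com\n"
    "2,bob@example.com\n"
    ",carol@example.com\n"
)

seen, clean = set(), []
for row in csv.DictReader(raw):
    rid = row["id"].strip()
    email = row["email"].strip().lower()  # normalize whitespace and case
    if not rid or not email or rid in seen:  # drop incomplete or duplicate rows
        continue
    seen.add(rid)
    clean.append({"id": rid, "email": email})

print(clean)  # two valid, de-duplicated records
```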

Created Power BI dashboards to visualize critical KPIs, saving 10 hours weekly in manual reporting tasks.

Architected and deployed ETL processes with AWS Glue for seamless data transfer from S3 and Parquet to AWS Redshift.

EDUCATION

New York Institute of Technology | May 2024

Master of Science in Computer Science, with a specialization in Cyber Security

Vaagdevi College of Engineering | May 2022

Bachelor’s in Computer Science & Engineering


