Lavanya
E-Mail: *************@*****.***
Phone: 631-***-****
Profile Summary:
•Highly dedicated professional with over 9 years of hands-on IT industry experience implementing Big Data and data transformation solutions across industries such as healthcare, e-commerce, and financial services.
•Experience implementing Big Data analytics, cloud data engineering, data warehouse / data mart, data visualization, reporting, data quality, and data virtualization solutions.
•Experienced Data Engineer with a proven track record of designing, developing, and optimizing big data solutions across GCP, AWS, Azure, and Hadoop ecosystems.
•Expertise in distributed computing architectures, including AWS (EC2, Redshift, EMR, Elasticsearch), Azure Databricks, and Hadoop to handle large-scale data processing.
•Skilled in data transformation workflows using dbt on GCP, with extensive experience in Google BigQuery, query optimization, and large-scale data warehousing.
•Proficient in SQL, Spark SQL, and PySpark, designing and optimizing data models, queries, and analytical pipelines.
•Strong expertise in ETL development using Teradata, Informatica, Oracle, MapReduce, Spark-Scala, Hive, and Pig, ensuring efficient data ingestion, transformation, and aggregation.
•Experience in real-time big data analytics using Kafka, Spark Streaming, Cloud Pub/Sub, and BigQuery, enabling near-instant data insights.
•Hands-on experience in containerized applications using Docker and Kubernetes (GKE & EKS) for high availability and scalability.
•Skilled in workflow orchestration using Airflow and Oozie, implementing robust scheduling, debugging, and monitoring mechanisms.
•CI/CD implementation using Google Cloud Build, Jenkins, and Git, automating deployment pipelines for seamless integration and delivery.
•Proficient in serverless computing with AWS Lambda, Cloud Functions, and Glue, streamlining data transformations and automation.
•Expertise in SQL and NoSQL databases, including MongoDB, HBase, Cassandra, PostgreSQL, and SQL Server, optimizing database performance and integrations.
•Developed Java applications for handling MongoDB and HBase data, leveraging Phoenix for SQL layer implementation on HBase.
•Strong understanding of Hadoop ecosystem, including HDFS, YARN, MapReduce, Sqoop, Flume, and Apache Ambari, with hands-on experience in Hadoop cluster management.
•Experience working on message queuing systems like Kafka, implementing real-time streaming solutions for high-velocity data ingestion.
•Worked on Jenkins pipelines to build and deploy Docker containers, ensuring streamlined application deployment.
•Developed RESTful and SOAP-based web services using JSON, XML, QML, and worked on various Python-based applications using PyCharm and Sublime Text.
•Experienced in data visualization and dashboarding using Tableau and Power BI, presenting actionable insights to stakeholders.
•Built predictive models using Machine Learning techniques, Hadoop, Python, SAS, and SQL, delivering data-driven decision-making capabilities.
•Expertise in Java, Scala, C, Shell Scripting, and Multithreading, with proficiency in J2EE, JDBC, and various Java APIs for application development.
•Strong communication and collaboration skills, effectively engaging with cross-functional teams, stakeholders, and leadership to deliver impactful data-driven solutions.
TECHNICAL SKILLS:
Big Data Ecosystem
Hadoop, MapReduce, Apache Spark, PySpark, Hive, Sqoop, Pig, Kafka, Flume, Impala, Oozie, Storm, Apache Airflow
Programming Languages
Python, Java, Scala, SQL, T-SQL, PL/SQL, HiveQL
Google Cloud Services
GCS, BigQuery, Dataflow, Dataproc, Cloud Composer, Compute Engine, Cloud Functions, gsutil, Cloud Shell, Stackdriver
ETL / Reporting Tools
Informatica PowerCenter, Tableau, Power BI, SSIS
Version Control Systems
Git, GitHub, SVN, CVS
EDUCATION:
•Master of Science in Computer Science, Rivier University, GPA 3.8
•Bachelor's in Computer Science and Engineering, PRRM Engineering (JNTU), GPA 3.6
PROFESSIONAL EXPERIENCE:
Client: Verizon, Tampa, FL April 2021 – Jan 2025
Role: Data Engineer
Responsibilities:
•Used the Cloud SDK in GCP Cloud Shell to configure Dataproc, Cloud Storage, BigQuery, Dataflow, and Cloud Composer services.
•Developed data pipelines in GCP using services such as Dataflow, Cloud Pub/Sub, Cloud Functions, Dataproc, and BigQuery.
•Integrated Apache Beam pipelines with various Google Cloud services such as BigQuery, Pub/Sub, Storage, and Datastore, enabling seamless data movement and analysis.
•Built data pipelines in GCP for both batch and streaming data processing using Apache Beam and Cloud Dataflow.
•Implemented real-time data streaming pipelines using Cloud Dataflow and Pub/Sub, enabling timely insights into business operations.
•Implemented and customized Looker dashboards and reports to meet business requirements, providing stakeholders with actionable insights.
•Extensive experience with BigQuery features including managed storage, columnar storage, and serverless querying for scalable data analytics.
•Designed and implemented efficient data models and schemas in BigQuery to support analytical queries and reporting requirements.
•Optimized BigQuery performance through partitioning, clustering, and indexing strategies to improve query speed and reduce costs (a table-creation sketch follows this list).
•Conducted query optimization and performance tuning in BigQuery, analyzing query execution plans and identifying opportunities for optimization.
•Built real-time analytics solutions using BigQuery's integration with Cloud Pub/Sub and Dataflow for streaming data processing.
•Processed and analyzed streaming data in real-time using BigQuery's capabilities for continuous query processing and windowing functions.
•Utilized features like query caching, partition pruning, and materialized views to improve query performance and reduce query costs.
•Developed efficient data models and schema designs in Bigtable to accommodate specific data access patterns and analytical requirements.
•Utilized Bigtable column families, row keys, and column qualifiers to organize and optimize data storage and retrieval.
•Implemented and optimized OCR solutions to digitize and extract text from various document types, improving data entry efficiency.
•Integrated OCR technology into data pipelines to automate the extraction and processing of textual data from scanned documents and images, enhancing overall data processing workflows.
•Proficient in using OCR tools such as Tesseract, ABBYY FineReader, and Google Cloud Vision to convert physical documents into searchable and editable formats.
•Developed and maintained LookML models to ensure consistency and reusability across Looker dashboards.
•Created a POC for processing both readable PDF files and scanned images using Cloud Functions and the Vision API, and built ML models for anomaly detection using both BigQuery ML and Vertex AI.
•Downloaded BigQuery data into pandas and Spark DataFrames for advanced ETL capabilities.
•Used Apache Oozie for scheduling and managing Hadoop jobs; managed and scheduled Oozie jobs on Cloud Dataproc.
•Built a data ingestion framework to process and load bounded and unbounded data from a Google Pub/Sub topic to BigQuery using Dataflow with the Python SDK (a minimal pipeline sketch follows this list).
•Led migration of on-premises workloads to GCP, optimizing performance and enhancing reliability.
•Managed Kubernetes clusters on Google Kubernetes Engine (GKE), ensuring high availability and scalability for containerized applications.
•Deployed and managed highly available Kubernetes clusters on Google Kubernetes Engine to support microservices-based architecture.
•Orchestrated the deployment of applications using Helm charts and Kubernetes manifests, ensuring seamless updates and rollbacks.
•Implemented comprehensive monitoring and logging solutions using Prometheus, Grafana, and Google Cloud's operations suite to maintain system health and performance.
•Secured GKE clusters by implementing best practices, including network policies, RBAC, and integration with Google Cloud IAM for fine-grained access control.
•Configured horizontal and vertical pod autoscaling to ensure optimal resource utilization and application performance under varying loads.
•Designed and tested disaster recovery strategies, including backup and restore procedures for Kubernetes clusters and persistent storage.
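A minimal sketch of the kind of streaming ingestion described above (Pub/Sub to BigQuery on Dataflow with the Beam Python SDK); the project, topic, table, and schema names are placeholder assumptions, not the actual production pipeline.
```python
# Sketch only: a streaming Beam pipeline (Python SDK) that reads JSON messages
# from a Pub/Sub topic and appends them to a BigQuery table when run on Dataflow.
# The topic, table, and schema below are placeholders, not actual project values.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")      # placeholder topic
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:analytics.events",            # placeholder table
                schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```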
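For the BigQuery partitioning and clustering work noted above, a hedged sketch of creating a day-partitioned, clustered table with the google-cloud-bigquery Python client; the project, dataset, and column names are illustrative assumptions.
```python
# Sketch only: creating a day-partitioned, clustered BigQuery table with the
# google-cloud-bigquery client. Project, dataset, and column names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)

# Partition by day on the event timestamp so queries can prune old partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)

# Cluster on the column most often used in filters and joins.
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```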
Projects:
•Implemented a hybrid cloud solution using GKE and on-premises infrastructure, enabling seamless workload migration and cost optimization.
•Integrated GKE with serverless functions using Google Cloud Functions to handle asynchronous workloads and event-driven architectures.
•Developed serverless applications using Cloud Functions, improving agility and reducing time-to-market for new features.
•Implemented CI/CD pipelines with Google Cloud Build and Jenkins, automating deployment processes and increasing efficiency.
•Utilized Apache Beam's SDKs in languages like Java, Scala, Python, or Go to build data pipelines tailored to specific use cases.
•Integrated Apache Beam with various data sources and sinks, including Apache Kafka, Apache Hadoop, and Google BigQuery.
•Created compelling data visualizations and dashboards using GCP's Data Studio or other visualization tools, providing actionable insights to stakeholders.
•Contributed to the development of data processing applications using Apache Flink and Scala, providing real-time analytics capabilities.
•Integrated Apache Kafka for efficient and fault-tolerant data streaming between various components of the data infrastructure.
•Collaborated with Data Scientists to deploy machine learning models into production using Scala-based frameworks.
•Integrated Kubernetes into CI/CD pipelines, automating the deployment process and ensuring consistent, error-free releases.
•Monitored and reported on the status of CI/CD pipelines, ensuring timely detection and resolution of build and deployment failures.
•Implemented infrastructure as code (IAC) using tools like Terraform or Ansible to automate the provisioning and management of infrastructure.
•Evaluated and integrated containerization and orchestration tools like Docker and Kubernetes to enhance application deployment and scalability.
•Implemented UDFs and UDTFs in Hive to process data that could not be handled with Hive's built-in functions.
•Responsible for loading the customer's data and event logs from Kafka into HBase using REST API.
•Integrated Kafka with Spark for real-time data processing, using Kafka producer and consumer components (a streaming-read sketch follows this list).
•Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
•Used Scala libraries to process XML data stored in HDFS, and used Spark for interactive queries, streaming data processing, and integration with NoSQL databases for large data volumes.
•Converted Hive/SQL queries into Spark transformations using RDDs in Scala and Python.
•Designed and developed jobs that handle the initial load and the incremental load automatically using Oozie workflows. Implemented workflows using Apache Oozie framework to automate tasks.
•Collected data from Hive tables to Teradata using Sqoop. Performed incremental data load to Teradata on a daily basis.
•Used Git for version control to commit developed code, which was then deployed with the build and release tool Jenkins.
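A hedged sketch of the Kafka-Spark integration bullet above, assuming Spark Structured Streaming with the Kafka source; the broker, topic, schema, and console sink are placeholder assumptions rather than the actual production job.
```python
# Sketch only: a PySpark Structured Streaming job that consumes JSON events from
# Kafka and parses them into columns. Broker, topic, and schema are placeholders;
# the spark-sql-kafka connector package must be supplied at spark-submit time.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-events").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "customer-events")              # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value and parse the JSON payload.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("event"))
       .select("event.*")
)

query = (
    events.writeStream
    .format("console")   # stand-in sink; a real job would write to HBase, BigQuery, etc.
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/customer-events")
    .start()
)
query.awaitTermination()
```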
Client: Ellucian Jan 2020 – Sep 2020
Position: Data Engineer
•Assisted in designing and developing scalable data pipelines for academic research and administrative reporting using SQL, Python, and ETL tools.
•Built and optimized complex SQL queries, stored procedures, and indexing strategies to improve data retrieval efficiency and query performance.
•Developed ETL workflows to extract, transform, and load data from multiple sources into data warehouses (Snowflake, Redshift).
•Designed and implemented data models to support analytics and reporting, ensuring efficient data organization and accessibility.
•Worked with AWS cloud services to process, store, and manage large datasets securely and efficiently.
•Implemented data quality checks, validation scripts, and error-handling mechanisms, ensuring accuracy, consistency, and completeness.
•Automated data ingestion and transformation pipelines using Apache Airflow, Prefect, and Informatica, reducing manual intervention (a minimal Airflow DAG sketch follows this list).
•Assisted faculty and researchers in data extraction, transformation, visualization, and reporting using Power BI, Tableau, and Looker.
•Collaborated with cross-functional teams, including database administrators, analysts, and researchers, to optimize database performance and query execution.
•Gained hands-on experience in big data technologies (Spark, Hadoop, Kafka) for processing and analyzing large-scale structured and unstructured datasets.
•Optimized data warehouse performance by implementing partitioning, clustering, and materialized views to reduce query response times.
•Assisted in developing and maintaining APIs for data access, enabling seamless integration with applications and analytics platforms.
•Performed log analysis and troubleshooting to identify bottlenecks in data pipelines and improve system efficiency.
•Implemented role-based access control (RBAC) and security best practices to ensure compliance with data governance policies.
•Documented data engineering workflows, best practices, and troubleshooting guides to support future students and staff.
•Conducted data profiling and anomaly detection to identify inconsistencies and enhance data integrity.
•Participated in knowledge-sharing sessions and workshops, mentoring junior students in data engineering and ETL development.
•Researched and evaluated emerging data engineering tools and technologies, recommending improvements for existing workflows.
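A minimal sketch of the workflow automation bullet above, assuming an Airflow 2.x DAG; the DAG name, schedule, and task callables are illustrative placeholders, not the actual pipelines.
```python
# Sketch only: a minimal Airflow 2.x DAG for a daily extract-transform-load run.
# The task callables and schedule are illustrative, not the actual pipeline logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source systems")


def transform():
    print("clean and reshape the extracted data")


def load():
    print("write the transformed data to the warehouse")


with DAG(
    dag_id="daily_reporting_etl",      # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the steps strictly in order: extract -> transform -> load.
    extract_task >> transform_task >> load_task
```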
Client: IBM, Pune, India May 2014 – Jul 2017
Position: Big Data Engineer
•Designed and developed scalable ETL workflows using Apache Spark and Hadoop MapReduce, handling terabytes of structured and unstructured data.
•Built data ingestion pipelines to extract, transform, and load (ETL) data from RDBMS, flat files, and NoSQL databases into Hadoop-based data lakes.
•Implemented incremental data loading and Change Data Capture (CDC) techniques using Sqoop and Spark to improve data freshness.
•Developed custom Scala-based transformations in Spark, ensuring high-performance data processing.
•Created reusable UDFs (User-Defined Functions) in Scala for complex data transformation logic.
•Integrated Hadoop ecosystem tools (Hive, Pig, Sqoop, HBase, and Impala) to process large-scale data efficiently.
•Implemented real-time data streaming pipelines using Apache Kafka and Spark Streaming, enabling log analytics and fraud detection.
•Built event-driven architectures using message queues (RabbitMQ, ActiveMQ) and Kafka, ensuring efficient data movement across distributed systems.
•Processed high-velocity data streams from multiple sources, optimizing streaming jobs with windowing, watermarking, and stateful processing.
•Developed low-latency streaming solutions for real-time monitoring and alerting systems.
•Optimized Spark jobs by fine-tuning shuffle operations, caching, and partitioning strategies, reducing execution time by 50% (a tuning sketch follows this list).
•Improved HDFS storage efficiency by implementing file compaction, small-file merging, and optimized block sizes.
•Tuned Hive queries using bucketing, indexing, and cost-based optimization (CBO), improving query execution performance by 40%.
•Reduced ETL job failures and processing time by implementing fault-tolerant mechanisms and retry logic in Spark.
•Automated end-to-end ETL workflows using Apache Oozie, Shell scripting, and Crontab, ensuring timely data processing.
•Designed and managed job scheduling and monitoring in Oozie, enabling seamless execution of Spark and MapReduce jobs.
•Created custom alerting and logging mechanisms to track failures and performance issues in batch and streaming jobs.
•Implemented Kerberos authentication and role-based access control (RBAC) for securing data access in Hadoop clusters.
•Ensured data encryption at rest and in transit using SSL/TLS and HDFS encryption zones.
•Worked on metadata management and lineage tracking using Apache Atlas for better data governance.
•Maintained audit logs and compliance reports, ensuring adherence to enterprise security policies.
•Worked with HBase and Cassandra for storing high-velocity transactional data.
•Designed column-family-based data models in HBase to optimize read/write performance for large-scale applications.
•Tuned NoSQL queries for high-performance analytics on distributed data stores.
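A hedged sketch of the Spark tuning patterns mentioned above (broadcast join, repartitioning on the aggregation key, caching a reused result); the table paths, column names, and partition count are illustrative assumptions, not the actual workloads.
```python
# Sketch only: common Spark tuning patterns -- broadcasting a small dimension
# table, repartitioning on the aggregation key before a wide shuffle, and
# caching a reused intermediate result. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("etl-tuning").getOrCreate()

transactions = spark.read.parquet("hdfs:///data/transactions")   # large fact table
dim_accounts = spark.read.parquet("hdfs:///data/dim_accounts")   # small lookup table

# Broadcast the small lookup table so the join avoids shuffling the fact table.
enriched = transactions.join(broadcast(dim_accounts), on="account_id")

# Repartition on the grouping key so the shuffle produces evenly sized tasks.
enriched = enriched.repartition(200, "account_id")

# Cache the intermediate result because several downstream jobs reuse it.
enriched.cache()

daily_totals = enriched.groupBy("account_id", "txn_date").sum("amount")
daily_totals.write.mode("overwrite").parquet("hdfs:///data/daily_totals")
```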