Trishika
Data Engineer
**********@*****.*** +1-469-***-****
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Professional Summary:
Around 8 years of professional experience as a Software Developer in designing, developing, deploying, and supporting large-scale distributed systems.
Around 6 years of extensive experience as a Data Engineer and Big Data Developer specializing in the Big Data ecosystem: data ingestion, modeling, analysis, integration, and data processing.
Extensive experience providing Big Data solutions using Hadoop, Spark, HDFS, MapReduce, YARN, Kafka, Pig, Hive, Sqoop, HBase, Oozie, Zookeeper, Cloudera Manager, and Hortonworks.
Strong experience working with Amazon Web Services (AWS) such as EMR, Redshift, DynamoDB, Lambda, Athena, Glue, S3, API Gateway, RDS, and CloudWatch for efficient processing of Big Data.
Hands-on experience building PySpark, Spark (Java), and Scala applications for batch and stream processing, involving transformations, actions, and Spark SQL queries on RDDs, DataFrames, and Datasets (a brief sketch follows this summary).
Strong experience writing, troubleshooting, and optimizing Spark scripts in Python and Scala.
Experienced in using Kafka as a distributed publisher-subscriber messaging system.
Strong knowledge of performance tuning Hive queries and troubleshooting issues related to joins and memory exceptions in Hive.
Good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables.
Experience in importing and exporting data between HDFS and Relational Databases using Sqoop.
Experience in real-time analytics with Spark Streaming and Kafka, and in batch processing using Hadoop, MapReduce, Pig, and Hive.
Experienced in building highly scalable Big-data solutions using NoSQL column-oriented databases like Cassandra, MongoDB and HBase by integrating them with Hadoop Cluster.
Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Functions, Cloud DNS, Cloud Storage, and Cloud Deployment Manager, and applied SaaS, PaaS, and IaaS concepts of cloud computing on GCP.
Extensive work on ETL processes consisting of data transformation, data sourcing, mapping, conversion and loading data from heterogeneous systems like flat files, Excel, Oracle, Teradata, MSSQL Server.
Experience building production ETL pipelines using Informatica PowerCenter, SSIS, SSAS, and SSRS.
Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data per business requirements, and at optimizing existing algorithms for best results.
Experience working with data warehousing concepts such as Star Schema, Snowflake Schema, Data Marts, and the Kimball methodology used in relational and multidimensional data modeling.
Used AWS IAM, Kerberos and Ranger for security compliance.
Strong experience leveraging different file formats like Avro, ORC, Parquet, JSON and Flat files.
Sound knowledge on Normalization and De-normalization techniques on OLAP and OLTP systems.
Good experience with version control tools Bitbucket, GitHub, and Git.
Experience with Jira, Confluence, and Rally for project management, and Oozie and Airflow as scheduling tools.
Strong scripting skills in Python, Scala, and UNIX shell.
Wrote Python and Java APIs for AWS Lambda functions to manage AWS services.
Good knowledge of building interactive dashboards, performing ad-hoc analysis, and generating reports and visualizations using Tableau and Power BI.
Experience in design, development and testing of Distributed Client/Server and Database applications using Java, Spring, Hibernate, Struts, JSP, JDBC, REST services on Apache Tomcat Servers.
Hands-on working experience with RESTful APIs, API lifecycle management, and consuming RESTful services.
Good working experience in Agile/Scrum methodologies, participating in scrum calls for project analysis and development.
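Representative PySpark sketch (illustrative only) of the batch pattern referenced above: DataFrame transformations, a Spark SQL query over a temporary view, and actions that persist results. All paths, column names, and table names are hypothetical placeholders, not artifacts of any client project.

# Minimal PySpark batch sketch: transformations, actions, and Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-transform-example").getOrCreate()

# Read raw CSV data into a DataFrame (lazily evaluated).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("gs://example-bucket/raw/orders/*.csv"))

# Transformations: filter, derive a column, and aggregate.
daily_totals = (orders
                .filter(F.col("status") == "COMPLETED")
                .withColumn("order_date", F.to_date("order_ts"))
                .groupBy("order_date")
                .agg(F.sum("amount").alias("total_amount"),
                     F.count("*").alias("order_count")))

# Spark SQL over the same data via a temporary view.
orders.createOrReplaceTempView("orders")
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS spend
    FROM orders
    WHERE status = 'COMPLETED'
    GROUP BY customer_id
    ORDER BY spend DESC
    LIMIT 10
""")

# Actions: trigger execution and persist results as Parquet.
daily_totals.write.mode("overwrite").parquet("gs://example-bucket/curated/daily_totals/")
top_customers.show(truncate=False)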
Education:
Master's in Computer Science, 2019
University of Illinois Springfield
Bachelor's in Information Technology, 2016
Jawaharlal Nehru Technological University Hyderabad
Technical Skills:
Programming Languages: Python, Scala, SQL, Java, C/C++, Shell Scripting
Web Technologies: HTML, CSS, XML, AJAX, JSP, Servlets, JavaScript
Big Data Stack: Hadoop, Spark, MapReduce, Hive, Pig, YARN, Sqoop, Flume, Oozie, Kafka, Impala, Storm
Cloud Platforms: Google Cloud Platform (GCP), Amazon Web Services (AWS), Microsoft Azure, Azure SQL Database
Relational databases: Oracle, MySQL, SQL Server, PostgreSQL, Teradata, Snowflake
NoSQL databases: MongoDB, Cassandra, HBase
Version Control Systems: Bitbucket, Git, SVN, GitHub
IDEs: PyCharm, IntelliJ IDEA, Jupyter Notebooks, Google Colab, Eclipse
Operating Systems: Unix, Linux, Windows
Professional Experience:
CVS, Connecticut August 2022 – Present
Role: Sr. Data Engineer
Project Overview:
The CVS Aetna Cloud Migration Project aimed to revamp the healthcare data infrastructure by transitioning from on-premises systems to Google Cloud Platform (GCP). As a GCP Data Engineer, I executed the end-to-end data migration, moving large datasets from legacy Hadoop and Spark clusters directly into BigQuery. My responsibilities included designing and implementing efficient data pipelines using Infoworks, which automated the extraction, transformation, and loading (ETL) processes. This approach ensured a seamless transition of data, minimizing downtime and optimizing the workflow. I was involved in every phase of the migration, from initial data assessment and cleansing to final loading and validation in BigQuery.
Responsibilities:
Working in an Agile Scrum development process.
Working with large datasets and solving difficult analytical problems as per client requirements, using Google Cloud Platform services such as BigQuery, Cloud Composer (Airflow), Cloud Dataproc, Cloud Pub/Sub (Kafka), and GCS.
Implementing database programming code in Python per use case and performing unit testing.
Performing necessary validations on data ingestion for incremental and full data loads.
Led the migration of on-premises Hadoop and Spark clusters to GCP Dataproc, improving performance.
Architected and deployed scalable data processing pipelines using Apache Spark on GCP Dataproc.
Designed and implemented real-time data streaming applications using Spark Streaming and Cloud Pub/Sub, enabling real-time analytics on large datasets.
Implemented data lake solutions on GCP using Cloud Storage, BigQuery and Spark enabling efficient storage and analysis of large datasets.
Implemented a Python script to upload and manage data files in Google Cloud Storage, facilitating efficient data backups and streamlined data retrieval for analytics.
Developed Python-based ETL pipelines to integrate data from diverse sources into Google BigQuery, enabling real-time analytics and improving data accessibility for stakeholders.
Implemented error handling and monitoring mechanisms in Airflow DAGs, utilizing GCP Cloud Monitoring and Logging to track and resolve issues proactively, enhancing pipeline reliability.
Integrated Apache Airflow with GCP services like BigQuery, Dataflow, and Cloud Functions, creating seamless data workflows that optimized data processing and transformation.
Integrated Tidal with GCP for seamless data transfer between on-premises systems and cloud services, enabling efficient data movement and improving overall data accessibility.
Implemented monitoring and alerting for Tidal jobs using GCP’s monitoring tools to ensure high availability and timely issue resolution, which resulted in reduced downtime and faster incident response.
Worked on Dataproc clusters for parsing large XML data using PySpark operations such as transformations and actions in Jupyter Notebooks.
Working with Cloud Composer (Airflow) as the workflow orchestration tool, creating modular data transformations for daily activities and scheduling the DAGs (see the sketch at the end of this list).
Building and managing data pipelines to support large-scale data management, in alignment with the data strategy and data processing management.
Working with cross-functional teams, including other data engineers and the architecture team, to promote data to production servers and validate it.
Scheduling and monitoring pipelines through the Airflow web server in GCP Cloud Composer.
Executing functional and regression testing when loading behavioral data and customer profile data into the analytics data store.
Developed PySpark and Scala-based Spark applications for data cleaning, event enrichment, data aggregation, de-normalization, and data preparation for machine learning and reporting teams to consume.
Hands-on experience with GCP, BigQuery, GCS buckets, and Stackdriver.
Designed and maintained complex Jenkins pipelines using Declarative and Scripted Pipeline syntax to support multi-environment deployments and enhance development efficiency.
Utilized Apache Pig and Apache Hive to perform data transformations and querying within the Hadoop ecosystem, enabling complex data analysis and reporting.
Developed and executed MapReduce and Spark jobs on Dataproc for large-scale data processing, leveraging GCP’s managed infrastructure to enhance processing speed and efficiency.
Worked on a POC to evaluate various cloud offerings, including Google Cloud Platform (GCP).
Migrated workloads from Hadoop and Teradata to Google Cloud Platform.
Continuously monitored and managed data pipeline (CI/CD) performance alongside applications from a single console in GCP.
Worked on small-file fragmentation for Hive tables with high partition counts.
Implemented the OpenShift-to-GCP move, avoiding degradation caused by many partitions/small files.
Used version control tools such as Git, along with VS Code, for source code management.
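Representative Cloud Composer (Airflow) DAG sketch for the daily orchestration described above. It is a minimal illustration assuming Airflow 2.x on Cloud Composer; the schedule, task callables, and identifiers are placeholders rather than production code.

# Illustrative Airflow DAG for Cloud Composer: extract -> transform -> load, scheduled daily.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull incremental data from a source system to GCS.
    print("extracting data for", context["ds"])


def transform(**context):
    # Placeholder: submit a Dataproc/PySpark job or run in-process transforms.
    print("transforming data for", context["ds"])


def load(**context):
    # Placeholder: load the transformed files from GCS into BigQuery.
    print("loading data for", context["ds"])


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingest_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",  # run once a day at 06:00 UTC
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load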
Environment: BigQuery, Cloud Storage, Cloud Dataproc, Google Cloud Dataflow, Cloud Pub/Sub, Python, SQL, Apache Spark, Apache Beam, Jupyter Notebooks, IntelliJ, PyCharm, Git, Looker, Jenkins, Hadoop
Pfizer, New York City NY Oct 2021 – Jul 2022
Role: Sr. Data Engineer
Project Overview:
At Pfizer, existing on-premises systems were limiting our ability to scale and analyze vast amounts of clinical and operational data. With increasing demands for faster insights, this project was launched to transition to a cloud-driven architecture. The challenge was not only to migrate large datasets but also to ensure data integrity, security, and compliance with industry regulations. I was responsible for designing the cloud architecture and ensuring optimal use of GCP services such as BigQuery for data analytics and Cloud Storage. I implemented Cloud Pub/Sub to facilitate real-time data streaming, allowing teams to access current data quickly and make informed decisions based on the latest insights.
Responsibilities:
Developed and optimized ETL pipelines to facilitate data movement between Teradata and GCP services, including Google Cloud Storage and BigQuery, enhancing data processing and improving overall workflow efficiency.
Designed and implemented comprehensive ETL solutions using Apache Spark on Google Cloud Dataproc, effectively integrating with Google Cloud Storage and BigQuery to automate data extraction, transformation, and loading processes.
Utilized Apache Airflow on Google Cloud Composer to automate and orchestrate ETL workflows, coordinating Apache Spark jobs and integrating with other GCP services to streamline data processing and scheduling.
Integrated a variety of data sources into ETL pipelines, including databases, APIs, and flat files, employing Apache Spark for efficient data processing and transformation before loading into GCP’s storage solutions.
Developed Python applications to interact with GCP APIs, automating tasks such as retrieving data from Google BigQuery and uploading files to Google Cloud Storage, enhancing operational efficiency.
Implemented Python scripts for interfacing with Google Cloud Monitoring and Logging APIs, setting up automated alerts and generating performance reports to ensure system reliability.
Integrated BigQuery with various data sources, including Google Cloud Storage, Google Analytics, and third-party providers, creating a cohesive data repository that streamlined data ingestion processes.
Developed complex SQL queries and transformation scripts in BigQuery to meet advanced analytics and reporting needs, thereby enhancing data-driven decision-making capabilities.
Created and maintained comprehensive data models in BigQuery to support business intelligence and reporting requirements, ensuring accurate and meaningful data representation.
Improved BigQuery performance by fine-tuning queries, partitioning tables, and clustering data for more efficient retrieval (see the sketch after this list).
Created and deployed Apache Spark jobs across various environments, loading data into stores such as Cassandra, Hive, and HDFS, while implementing encryption for data security.
Implemented various AWS solutions, including EC2, S3, RDS, and Elastic Load Balancer, establishing monitoring, alarms, and notifications using CloudWatch for effective resource management.
Worked extensively with GCP services such as Compute Engine, Cloud Functions, Cloud DNS, and Cloud Deployment Manager, applying SaaS, PaaS, and IaaS concepts for efficient cloud computing solutions.
Developed applications using Apache Spark, Scala, and other technologies, leveraging Jenkins, Docker, and GitHub for CI/CD processes, ensuring smooth deployments across environments.
Scheduled Informatica jobs using Autosys, managing workflows to ensure timely and effective data processing.
Created custom filters and calculations on SOQL for Salesforce queries and utilized Data Loader for ad-hoc data loads.
Extensively worked with Informatica PowerCenter, focusing on mappings, parameters, workflows, and session management for robust data integration.
Responsible for facilitating data load pipelines and benchmarking performance against established standards to ensure efficiency.
Employed SQL tools like TOAD and SQL Developer to run queries and validate data integrity, troubleshooting issues as necessary.
Conducted system reviews to provide unified evaluations of job performance, coordinating onsite and offshore teams to ensure deliverables meet expectations.
Engaged in thorough testing of databases using complex SQL scripts, effectively addressing performance issues and optimizing query execution.
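Illustrative sketch of the BigQuery partitioning and clustering work noted above, using the google-cloud-bigquery Python client; the project, dataset, table, and schema are hypothetical.

# Sketch: create a date-partitioned, clustered BigQuery table with the Python client.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("product_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("example-project.analytics.sales_events", schema=schema)

# Partition by date so queries that filter on event_date scan fewer bytes.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)

# Cluster by the columns most often used in filters and joins.
table.clustering_fields = ["customer_id", "product_id"]

table = client.create_table(table, exists_ok=True)
print(f"Created {table.full_table_id}, partitioned on event_date")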
Environment: Google Cloud Storage, BigQuery, Spark SQL, PySpark, Apache Spark 2.4.5, Scala 2.11, Cassandra, HDFS, Hive, GitHub, Jenkins, Kafka, SQL Server 2008, UNIX, AWS, GCP, ServiceNow, etc.
GM Financials, Texas Dec 2020 – Sep 2021
Role: Sr. Data Engineer
Project Overview:
On the GM Financials project, the team concentrated on maintaining vast amounts of financial and customer data, using BigQuery for data warehousing and analysis. I played a key role in optimizing data storage and retrieval processes. I established data models in BigQuery and implemented partitioning and clustering strategies to enhance query performance. My work involved close collaboration with cross-functional teams, including data scientists and business analysts, to ensure that the migrated data met their analytical needs.
Responsibilities:
Led the migration of a legacy data warehouse from on-prem data sources to GCP, improving scalability and reducing operational costs.
Developed and maintained ETL processes using Infoworks to automate data integration from various sources into BigQuery.
Utilized Apache Spark for large-scale data processing tasks, decreasing processing time.
Automated data workflows with Apache Airflow, creating complex DAGs to manage dependencies and streamline ETL jobs.
Conducted extensive data profiling and quality checks to ensure data accuracy and reliability post-migration.
Established automated deployment processes using Terraform, enhancing consistency in infrastructure management.
Implemented CI/CD practices with Cloud Build, facilitating automated testing and deployment of data pipelines.
Collaborated with business stakeholders to define reporting requirements, resulting in improved data models for analytics.
Developed complex SQL queries for data extraction and reporting in Teradata, providing timely insights for strategic decisions.
Created streaming data pipelines for near real-time analytics using Cloud Pub/Sub, enhancing operational reporting capabilities (see the sketch after this list).
Built comprehensive documentation of ETL processes, architectures, and best practices for knowledge sharing.
Conducted regular data audits to ensure compliance with data governance policies and maintain data integrity.
Developed and maintained reporting dashboards to visualize key performance indicators and operational metrics.
Engaged in user acceptance testing (UAT) to validate reporting solutions and ensure stakeholder satisfaction.
Created test plans and cases for ETL processes to validate functionality and performance before production deployment.
Assisted in developing data governance frameworks to ensure compliance with regulatory requirements.
Collaborated with data architects to design scalable data models that support reporting and analytics needs.
Monitored data pipeline performance metrics and implemented enhancements to improve efficiency and reliability.
Established a version control system for data pipelines to improve collaboration and change management across the team.
Coordinated with security teams to implement best practices for data protection and compliance in cloud environments.
Created visualizations to communicate insights from data analyses to non-technical stakeholders effectively.
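Minimal sketch of a near real-time Pub/Sub-to-BigQuery ingestion path of the kind described above, using the google-cloud-pubsub and google-cloud-bigquery Python clients; subscription, table, and field names are placeholders.

# Sketch: near real-time ingestion from Cloud Pub/Sub into BigQuery.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "example-project"
SUBSCRIPTION_ID = "transactions-sub"
TABLE_ID = "example-project.operations.transactions_raw"

bq_client = bigquery.Client(project=PROJECT_ID)
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)


def handle_message(message):
    """Parse one Pub/Sub message and stream it into BigQuery."""
    record = json.loads(message.data.decode("utf-8"))
    errors = bq_client.insert_rows_json(TABLE_ID, [record])
    if errors:
        # Nack so Pub/Sub redelivers the message for retry.
        print("BigQuery insert errors:", errors)
        message.nack()
    else:
        message.ack()


streaming_pull_future = subscriber.subscribe(subscription_path, callback=handle_message)
print(f"Listening on {subscription_path} ...")
try:
    streaming_pull_future.result()  # block until cancelled or an error occurs
except KeyboardInterrupt:
    streaming_pull_future.cancel()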
Environment: Google Dataproc, Google Cloud Storage, Spark, Spark Streaming, Spark SQL, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Tableau, SOAP, Cassandra, and Agile methodologies.
State Farm, Bloomington, Illinois Sep 2019 – Nov 2020
Role: Big Data Engineer
Project Overview: In the State Farm project, I contributed as a GCP and Hadoop Developer, focusing on enhancing the organization’s data analytics capabilities. I was responsible for designing and implementing ETL pipelines that integrated data from various sources, ensuring high data quality and consistency. We utilized GCP tools like BigQuery for advanced analytics, enabling faster querying and reporting capabilities.
Responsibilities:
Actively involved in Agile methodologies as a key contributor in scrum meetings, promoting collaboration, adaptability, and rapid delivery of data engineering projects.
Designed and implemented scalable cloud data architectures on GCP, integrating with Hadoop to facilitate efficient data ingestion, transformation, and analysis.
Developed and managed end-to-end data pipelines using Cloud Dataflow and Apache Beam, ensuring seamless integration of data from various batch and streaming sources (see the sketch after this list).
Established data lakes using Google Cloud Storage, ensuring structured data organization and accessibility for analytics and business intelligence applications.
Utilized Apache Spark for data transformation and processing, employing both PySpark and Scala to develop efficient data processing algorithms and analytics applications.
Conducted performance tuning for Hadoop jobs, optimizing Hive queries, MapReduce jobs, and Spark applications to enhance processing efficiency and reduce runtime.
Developed real-time data ingestion frameworks using Cloud Pub/Sub and Spark Streaming, ensuring low-latency data access for critical applications.
Engineered solutions for seamless data integration between GCP, Hadoop, and other cloud platforms (AWS, Azure), ensuring data consistency and availability across environments.
Implemented data governance policies and security protocols using Cloud IAM and Kerberos to safeguard sensitive data and ensure compliance with regulatory standards.
Developed RESTful APIs for exposing data and functionalities from GCP and Hadoop services, enabling integration with external applications and services.
Managed and optimized Hadoop clusters using Cloudera Manager and Hortonworks, ensuring high availability and performance of data processing jobs.
Led initiatives to migrate on-premises data to GCP and Hadoop environments, ensuring data integrity and minimal downtime throughout the transition.
Developed custom User Defined Functions (UDFs) in Hive and BigQuery to extend functionality and optimize data transformation processes.
Implemented CI/CD pipelines for automating deployment of data pipelines and applications, enhancing development efficiency and reducing manual intervention.
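Representative Apache Beam (Python SDK) pipeline sketch of the Dataflow work described above: read files from GCS, parse and filter records, and write to BigQuery. Bucket, table, schema, and parsing logic are hypothetical; it can be run on Dataflow by passing the appropriate pipeline options.

# Sketch: batch Apache Beam pipeline (runnable on Cloud Dataflow) that parses
# CSV claim records from GCS and writes them to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_claim(line):
    # Placeholder parsing for a simple three-column CSV record.
    claim_id, state, amount = line.split(",")
    return {"claim_id": claim_id, "state": state, "amount": float(amount)}


def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner, --project, etc.
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromGCS" >> beam.io.ReadFromText(
                "gs://example-bucket/claims/*.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_claim)
            | "KeepValid" >> beam.Filter(lambda row: row["amount"] > 0)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:claims.claims_curated",
                schema="claim_id:STRING,state:STRING,amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()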
Environment: Hadoop 3.0, Hive 2.3, MapReduce, Spark 2.2.1, Shell scripts, SQL, Python, MLlib, HDFS, YARN, Java, Kafka 1.0, Cassandra 3.11, Agile
Client: Genpact Jan 2016 – Jul 2018
Role: GCP and Hadoop Engineer
Project Overview: In my role as a Hadoop and GCP Engineer at Genpact, I was instrumental in designing and implementing robust data integration solutions that streamlined our data processing workflows. The project focused on optimizing the extraction, transformation, and loading (ETL) processes to enhance data accuracy and accessibility across various business units.
Responsibilities:
Ingested data from RDBMS sources including Oracle, SQL Server, and Teradata into Google Cloud Storage (GCS) using Apache Sqoop for efficient data transfer and integration.
Loaded datasets into BigQuery from source CSV files utilizing Apache Spark for batch processing, ensuring optimal data structuring for analytics.
Established environments for accessing loaded data via BigQuery SQL and Google Cloud Dataproc for interactive analytics.
Developed real-time data ingestion and analysis solutions using Apache Kafka and Apache Spark Streaming, enabling timely insights from streaming data.
Wrote and optimized Scala programs to run on Google Cloud Dataproc (YARN) for data analysis, enhancing performance and scalability.
Created and managed BigQuery External Tables for querying data from GCS, optimizing data accessibility and performance.
Developed Apache Beam jobs to parse log data, structuring it in a tabular format to facilitate efficient querying and analysis.
Designed and implemented Apache Airflow workflows to schedule and orchestrate ETL processes, improving data pipeline management and monitoring.
Managed and reviewed log files in Google Cloud Logging, utilizing shell scripts for automation and analysis of operational data.
Migrated existing ETL jobs to Apache Beam and Dataflow, performing complex transformations and aggregations before storing data in BigQuery.
Utilized SQL join queries in BigQuery to combine multiple tables, loading results into Elasticsearch for advanced searching capabilities.
Implemented real-time streaming solutions using Kafka and Kafka Streams, performing transformations and enriching data streams on-the-fly.
Developed Spark Streaming jobs in Scala to consume data from Kafka topics, transforming the data and integrating it into BigQuery for analytics (see the sketch after this list).
Experienced in managing and reviewing extensive Hadoop log files, ensuring optimal cluster performance and reliability.
Assisted in integrating Google Cloud Storage with Snowflake, optimizing data storage solutions and enhancing access for analytics.
Utilized Tableau to create interactive dashboards for weekly, monthly, and daily reporting, publishing visualizations to Google Cloud for broader access.
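The Kafka consumption above was implemented in Scala; an equivalent PySpark Structured Streaming sketch is shown below, with placeholder broker, topic, schema, and output paths (it assumes the spark-sql-kafka package is available on the cluster).

# Sketch: PySpark Structured Streaming job that consumes a Kafka topic,
# parses JSON payloads, and lands them as Parquet on GCS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-example").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "app-events")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; decode and parse the JSON value.
events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "gs://example-bucket/streams/app-events/")
         .option("checkpointLocation", "gs://example-bucket/checkpoints/app-events/")
         .outputMode("append")
         .start())

query.awaitTermination()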
Environment: Google Cloud Composer, Google Cloud Storage, BigQuery, Spark, Spark SQL, Spark Streaming, Scala, Kafka, Hadoop, HDFS, Hive, Oozie, Pig, NiFi, Sqoop, Shell Scripting, HBase, Jenkins, Tableau, Oracle, MySQL, Teradata