
Data Engineer Power BI

Location:
Irving, TX, 75038
Posted:
November 11, 2024



Name: Rohan Rajendra

Email: **************@*****.***

Ph#: +1-346-***-****

LinkedIn: linkedin.com/in/rohanrajendra07/

Experienced Data Engineer with over 6 years of expertise in developing and optimizing scalable data solutions using Apache Spark (Scala, PySpark) and Hadoop (HDFS, Hive). Proficient in cloud platforms such as AWS and GCP, specializing in building and managing high-performance data infrastructure. Advanced skills in designing ETL pipelines with AWS Glue and Apache Airflow, along with real-time analytics using Kafka and Kinesis. Strong command of Python, SQL, and data modeling with Snowflake, complemented by proven expertise in Databricks for large-scale data processing. Adept at crafting actionable insights through dashboards in Tableau and Power BI, with a solid track record of applying Agile methodologies to drive strategic business outcomes and enhance team performance.

Professional Summary:

● Expert in Apache Spark (Scala, PySpark), Hadoop (HDFS, Hive), Kafka, Airflow, and Databricks for processing and analyzing large datasets.

● Experience with the Apache Spark ecosystem, including Spark Core, Spark SQL, DataFrames, and RDDs, with working knowledge of Spark MLlib.

● Proficient in AWS (EC2, S3, RDS, EMR, Lambda, Glue, Redshift, Kinesis) and GCP (BigQuery, Dataflow, Pub/Sub, Cloud Storage). Extensive experience in designing scalable data infrastructure on these platforms.

● Advanced skills in Snowflake, Amazon Redshift, and Oracle for designing and managing high-performance data warehouses.

● Proven experience in building and optimizing ETL pipelines using AWS Glue, Apache Airflow, and Informatica for efficient data integration and transformation.

● Strong command of Python, SQL, Scala, Java, and PL/SQL for data manipulation, query optimization, and scripting.

● Skilled in creating interactive dashboards and reports using Tableau and Power BI to provide actionable insights and support data-driven decision-making.

● Experienced with Apache Kafka, AWS Kinesis, and GCP Pub/Sub for building real-time data pipelines and analytics.

● Proficient in designing complex data models using Dimensional Modeling (Star Schema, Snowflake Schema) and data warehousing best practices (see the schema sketch at the end of this summary).

● Knowledgeable in CI/CD practices with Jenkins and Terraform for automating deployments and managing infrastructure as code.

● Expertise in Docker and Kubernetes for containerizing applications and orchestrating deployment processes.

● Experienced with scikit-learn and TensorFlow for developing and deploying machine learning models in data engineering workflows.

● Skilled in working with various data formats including CSV, JSON, Parquet, ORC, AVRO, and XML for data interchange and processing.

● Advanced skills in SQL query optimization for high-performance data retrieval and analytics.

● Proficient in using Git, GitHub, and Bitbucket for version control and collaborative development.

● Experience in integrating data from diverse sources using tools like Apache Sqoop, AWS Glue Catalog, and GCP Dataflow.

● Knowledgeable in data security best practices, including AWS IAM and Google Cloud IAM for managing access and permissions.

● Agile practitioner with experience in Scrum and Kanban methodologies, actively participating in sprints, stand-ups, and retrospectives.

● Experienced in designing and managing Data Lakes and Lakehouse architectures on cloud platforms using technologies like AWS Lake Formation, Delta Lake, or Google Cloud Storage, enabling efficient storage and processing of unstructured and semi-structured data.

● Skilled in integrating data from multiple sources and developing RESTful APIs for data access and manipulation, enabling seamless data flow across applications and systems.

● Competent in managing cloud infrastructure and services to ensure scalability, reliability, and performance.

● Skilled in tuning performance for Hadoop clusters, Spark jobs, and SQL queries to optimize data processing and retrieval.

● Knowledgeable in data privacy regulations such as GDPR and CCPA, with experience in implementing data anonymization and encryption techniques to ensure data security and compliance.

● Proven ability to work collaboratively with cross-functional teams to gather requirements, design solutions, and deliver impactful data engineering projects.
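As a rough illustration of the dimensional modeling noted above, a star schema pairs a central fact table with foreign keys into surrounding dimension tables. The Spark SQL sketch below is hypothetical: the table and column names are invented for illustration and are not taken from any project described in this resume.

```python
# Minimal star-schema sketch (hypothetical tables/columns): one fact table
# keyed to two dimension tables, created through Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_key BIGINT,
        customer_name STRING,
        region STRING
    ) USING PARQUET
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_date (
        date_key INT,
        calendar_date DATE,
        fiscal_quarter STRING
    ) USING PARQUET
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        customer_key BIGINT,   -- foreign key to dim_customer
        date_key INT,          -- foreign key to dim_date
        order_amount DOUBLE,
        quantity INT
    ) USING PARQUET
""")
```

Reporting queries then join fact_sales to the dimensions on the surrogate keys, which is what makes star schemas well suited to the analytics workloads described above.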

Technical Skills:

Databases: AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL

NoSQL Databases: MongoDB, HBase, Apache Cassandra

Data Warehouses: Snowflake, Amazon Redshift, Oracle, Azure

Programming Languages: Python, SQL, Scala, MATLAB, Java, PL/SQL

Cloud Technologies: AWS (EC2, S3, RDS, EMR, Lambda, Glue, Redshift, Kinesis, CloudFront, CloudWatch, IAM), GCP (BigQuery, Dataflow, Pub/Sub, Cloud Storage)

Data Formats: CSV, JSON, Parquet, ORC, AVRO, XML

Querying Languages: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server

Integration Tools: Jenkins, AWS CodePipeline

Big Data Technologies: Hadoop, HDFS, Hive, Apache Spark, Databricks, MapReduce, Sqoop, Apache Kafka, Apache Pig, Apache Airflow

Data Visualization: Tableau, Power BI, Matplotlib

Version Control: Git, GitHub, Bitbucket

Containerization: Docker, Kubernetes

Machine Learning: scikit-learn, TensorFlow

Operating Systems: Red Hat Linux, Unix, Windows, macOS

Reporting & Visualization: Tableau, Matplotlib

Professional Experience:

Client: Paychex, Rochester, NY Jan 2023 – Present

Role: Data Engineer

Responsibilities:

● Collaborated with cross-functional teams to gather requirements, design, and develop scalable data solutions.

● Developed and optimized PySpark scripts for large-scale data processing, ensuring efficient data transformation and integration.

● Built a Spark streaming pipeline to process real-time data, applying business logic to detect anomalies and storing results in HBase for downstream analytics (a minimal sketch appears at the end of this section).

● Designed and implemented end-to-end ETL pipelines using Snowflake, integrating data from multiple sources and ensuring data integrity and quality.

● Created complex SQL queries and performed performance tuning in Snowflake, optimizing data retrieval for large datasets.

● Designed and developed interactive dashboards and visualizations in Power BI, utilizing calculations, parameters, and data hierarchies to provide actionable insights.

● Used ETL processes to build and maintain data warehousing solutions, implementing slowly changing dimensions for historical data tracking.

● Managed AWS cloud infrastructure, deploying resources such as EC2, S3, RDS, ELB, VPC, CloudFront, and IAM for scalable data processing and storage.

● Employed AWS Glue for serverless ETL, automating the process of data cataloging, transformation, and loading across multiple data stores.

● Implemented AWS Lambda for serverless computing to execute code in response to events, reducing operational overhead and improving scalability.

● Utilized AWS CloudWatch for monitoring system performance and ensuring the reliability and availability of cloud-based services.

● Implemented Apache Airflow for workflow automation, scheduling, and monitoring of ETL processes, ensuring timely data availability.

● Developed and maintained Kafka producers and consumers, enabling real-time data ingestion and processing pipelines for streaming data applications.

● Automated routine tasks using Python, improving efficiency in data manipulation and analysis tasks across various datasets.

● Created and maintained SQL Server databases, ensuring efficient data management and query performance for transactional and analytical processing.

● Designed and implemented data models and schemas for Snowflake and RDS databases to support complex analytical queries and reporting needs.

● Worked on ER modeling and dimensional modeling techniques to design robust data warehouse schemas, including star and snowflake schemas.

● Engaged in Agile methodologies, participating in iterative development cycles, sprint planning, and retrospective meetings to deliver features incrementally.

● Implemented security measures such as encryption and role-based access controls (RBAC) to safeguard sensitive data within the AWS environment.

● Managed and optimized AWS cloud infrastructure, automating resource deployment using Terraform and ensuring cost-effective data processing and storage.

● Optimized PySpark scripts, resulting in a 30% reduction in data processing time, enhancing overall system efficiency.

● Implemented end-to-end encryption using AWS KMS for sensitive data, ensuring compliance with GDPR and enhancing data security.
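The streaming bullet above describes detecting anomalies in real-time data and landing results in HBase. The sketch below is a minimal illustration of that pattern, not the production job: the Kafka broker, topic, event schema, and threshold rule are assumptions, and a Parquet sink stands in for HBase to keep the example self-contained (running it also requires the spark-sql-kafka connector package).

```python
# Minimal Spark Structured Streaming sketch: consume events from Kafka,
# flag anomalies with a simple rule, and write each micro-batch to a sink.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("anomaly-stream-sketch").getOrCreate()

event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
    .option("subscribe", "payroll-events")             # assumed topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Placeholder rule; the real business logic is not described in the resume.
flagged = events.withColumn("is_anomaly", F.col("amount") > 50000)

def write_batch(batch_df, batch_id):
    # The described pipeline wrote to HBase; Parquet keeps this sketch
    # self-contained (an HBase connector would be configured instead).
    batch_df.write.mode("append").parquet("/tmp/anomaly-output")

query = (
    flagged.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/anomaly-checkpoint")
    .start()
)
query.awaitTermination()
```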

Environment: Python, PySpark, AWS (EC2, S3, RDS, ELB, VPC, Lambda, CloudWatch, Glue, IAM, CloudFront), Snowflake, Apache Kafka, Airflow, Power BI, Informatica ETL, SQL Server, NoSQL, Agile, Hadoop, HDFS, Terraform.

Client: Seven Eleven, Irving, TX Sep 2020 – May 2022

Role: AWS Data Engineer

Responsibilities:

● Designed and implemented scalable data infrastructure solutions using Apache Spark on AWS EMR for processing large-scale data transformations and denormalization of datasets.

● Developed Python-based ETL scripts to aggregate, cleanse, and transform data, ensuring data quality and consistency across multiple systems, with a focus on integration with AWS Glue.

● Developed a custom data validation framework using Python, which improved data accuracy by 20% and streamlined data integration processes.

● Utilized AWS Redshift for building and managing data warehouses, optimizing ETL processes for performance and scalability.

● Created and maintained real-time data pipelines using Apache Kafka and AWS Kinesis for efficient processing and analysis of streaming data.

● Engineered complex data models and schema designs in AWS Redshift to support high-performance analytical queries and reporting.

● Automated data workflows and job scheduling with Apache Airflow, integrating with AWS Lambda and Step Functions for seamless orchestration.

● Integrated Databricks with AWS Redshift for real-time data processing, significantly enhancing analytics capabilities and decision-making efficiency.

● Developed interactive Tableau dashboards to visualize data insights, enabling stakeholders to perform self-service analytics and drive informed decision-making.

● Implemented data extraction and transformation processes using SQL and PL/SQL, optimizing queries for faster data retrieval and analysis on AWS Athena.

● Leveraged AWS S3 for secure and scalable storage of processed data, ensuring data accessibility and reliability, with automated data lifecycle management.

● Designed and developed data aggregation and transformation routines in PySpark, optimizing them for performance on distributed systems in AWS EMR.

● Utilized Apache Hive on AWS EMR for querying and managing large datasets stored in HDFS, supporting data analysis and reporting needs.

● Integrated and processed unstructured and semi-structured data from various sources using AWS Glue Catalog and Athena for flexible querying.

● Worked with Dimensional Data Modeling (Star Schema, Snowflake Schema) to structure data for efficient querying and reporting in AWS Redshift.

● Developed and deployed Python APIs hosted on AWS Lambda to expose processed data for consumption by downstream applications and services.

● Configured and managed Snowflake and AWS Redshift data warehouses, optimizing for cost and performance while handling complex data transformations.

● Employed MapReduce techniques for parallel processing of large data sets, optimizing the data transformation and loading processes on AWS EMR.

● Extracted data from MongoDB using Sqoop, processed it in AWS S3 and HDFS, and loaded it into Hive on AWS EMR for further analysis.

● Utilized Tableau Gateway to sync on-premise data sources with cloud-based Tableau Server, ensuring up-to-date reports and dashboards.

● Automated SQL queries and ETL processes using shell scripting, AWS Glue, and Airflow, ensuring consistent and reliable data processing across AWS environments (see the orchestration sketch near the end of this section).

● Implemented data security measures including encryption at rest and in transit using AWS KMS for sensitive data protection.

● Optimized AWS resource usage, resulting in a 20% reduction in data processing costs through strategic instance and storage selection.

● Collaborated with business analysts to translate complex business requirements into scalable data solutions, enhancing data-driven decision-making.

● Implemented data governance policies to ensure compliance with industry standards and regulations.
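The automation bullets above describe scheduling Glue-based ETL through Airflow. Below is a minimal sketch of that orchestration pattern under stated assumptions: the Glue job name, AWS region, and schedule are hypothetical, and the task simply starts a job run via boto3 and reports its state rather than polling it to completion.

```python
# Minimal Airflow DAG sketch: a daily task that starts an AWS Glue job run
# with boto3 and logs the run id and initial state.
from datetime import datetime, timedelta

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_glue_job(**_):
    glue = boto3.client("glue", region_name="us-east-1")      # assumed region
    run = glue.start_job_run(JobName="daily_sales_etl")       # assumed Glue job name
    status = glue.get_job_run(JobName="daily_sales_etl", RunId=run["JobRunId"])
    # A production task would poll get_job_run until a terminal state.
    print("Started Glue run", run["JobRunId"], "state:", status["JobRun"]["JobRunState"])


with DAG(
    dag_id="glue_etl_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="start_glue_job", python_callable=start_glue_job)
```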

● Collaborated in Agile development teams, participating in daily stand-ups, sprint planning, and code reviews to deliver high-quality data solutions using AWS services.

Environment: Python, Spark, PySpark, AWS (EMR, S3, Redshift, Glue, Kinesis, Lambda, Athena, KMS, QuickSight), Apache Kafka, Hadoop, Hive, Tableau, Airflow, MongoDB, SQL, NoSQL.

Client: Sonic Healthcare, Austin, TX Apr 2019 – Aug 2020

Role: Data Engineer

Responsibilities:

● Participated in requirement gathering sessions with business users and sponsors to document business requirements and align data solutions with business goals.

● Developed Spark jobs using Python in a distributed environment for high-performance data processing and real-time analytics.

● Designed and implemented a configurable data delivery pipeline for scheduled updates to customer-facing data stores, enhancing data availability and timeliness.

● Built and maintained scalable ETL processes using Snowflake for data cleansing, transformation, and validation, ensuring data quality and integrity.

● Created data models and schema designs for Snowflake to support complex analytical queries, reporting, and data integration.

● Developed Spark Python scripts to process and load large volumes of data into Hive ORC tables, improving data processing performance.

● Utilized Apache Airflow for scheduling and monitoring ETL jobs, integrating with Snowflake and GCP services for data extraction and loading.

● Leveraged GCP BigQuery for real-time data analysis and reporting, using its advanced features to deliver actionable insights and improve decision-making (see the query sketch at the end of this section).

● Developed interactive Tableau dashboards and published them on Tableau Server to provide real-time data visualization and reporting.

● Automated data pipelines and workflows using Apache Airflow, ensuring efficient and reliable data processing.

● Utilized GCP Cloud Storage for storing and managing large datasets, integrating with Snowflake for analytics and data lifecycle management.

● Worked on dimensional modeling (Star schema, Snowflake schema) and data warehousing solutions to support business intelligence and advanced analytics.

● Implemented SQL and PL/SQL stored procedures for complex data transformations and integrations, optimizing query performance.

● Analyzed and optimized performance of data pipelines and queries, applying performance tuning techniques to ensure efficient data processing and reporting.

● Coordinated with cross-functional teams and stakeholders to deliver data solutions aligned with business needs and project goals.

● Employed Agile methodologies for iterative development and continuous improvement in data engineering processes.

● Conducted regular code reviews and performance evaluations to enhance the quality and efficiency of data solutions.

● Developed and maintained comprehensive documentation for data models, ETL processes, and data pipelines, including data governance and quality control measures.

● Delivered a data solution for Sonic Healthcare that improved data availability and timeliness, facilitating more accurate reporting and analysis.

Environment: Python, Spark, Snowflake, GCP BigQuery, Tableau, Airflow, GCP Cloud Storage, SQL, PL/SQL, Agile, Windows.
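The BigQuery bullet above mentions real-time data analysis for reporting. Below is a minimal sketch of running such a query from Python with the google-cloud-bigquery client; the project, dataset, table, and column names are hypothetical placeholders rather than Sonic Healthcare's actual schema.

```python
# Minimal BigQuery sketch: run an aggregation query and pull the result
# into a pandas dataframe for downstream reporting.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # assumed project id

sql = """
    SELECT test_type,
           DATE(collected_at) AS collection_date,
           COUNT(*)           AS specimen_count
    FROM `example-project.lab_analytics.specimens`   -- assumed dataset/table
    WHERE collected_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY test_type, collection_date
    ORDER BY collection_date, test_type
"""

# to_dataframe() needs pandas installed; .result() alone also works.
daily_counts = client.query(sql).result().to_dataframe()
print(daily_counts.head())
```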

Client: Credit One Bank, Las Vegas, NV June 2018 – Apr 2019

Role: Data Engineer

Responsibilities:

● Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.

● Developed Spark applications in Python to enrich clickstream data merged with user profile data (see the sketch near the end of this section).

● Developed complex yet maintainable and easy-to-use Python code that satisfied application requirements for data processing and analytics using built-in libraries.

● Developed Spark code in Python and Spark SQL for faster testing and processing of data, loading data into Spark RDDs and using in-memory computation to generate output with a smaller memory footprint.

● Designed, developed, tested, and maintained Tableau functional reports based on user requirements.

● Developed ETL processes using Spark SQL, RDDs, and DataFrames.

● Worked on a Python code base for Apache Spark, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.

● Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of PySpark.

● Applied industry-specific data processing techniques to enhance the integration of clickstream and user profile data, optimizing analytics and reporting.

● Worked on migrating MapReduce programs to PySpark transformations.

● Worked with different data feeds such as JSON, CSV, and XML, and implemented the data lake concept.

● Analyzed the SQL scripts and designed the solution to implement using PySpark.

● Used SQL queries and other tools to perform data analysis and profiling.
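The clickstream-enrichment bullets above describe joining click events with user profile data in PySpark and aggregating with Spark SQL. The sketch below illustrates that pattern under stated assumptions: the S3 paths, join key, and column names are invented for illustration and are not taken from the actual project.

```python
# Minimal clickstream-enrichment sketch: load JSON click events, join them
# to user profiles, and aggregate daily clicks per segment with Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-enrichment-sketch").getOrCreate()

clicks = spark.read.json("s3://example-bucket/clickstream/")          # assumed path
profiles = spark.read.parquet("s3://example-bucket/user_profiles/")   # assumed path

enriched = (
    clicks.join(profiles, on="user_id", how="left")                   # assumed join key
    .withColumn("click_date", F.to_date("event_timestamp"))           # assumed column
)

enriched.createOrReplaceTempView("enriched_clicks")

daily_by_segment = spark.sql("""
    SELECT click_date, customer_segment, COUNT(*) AS clicks
    FROM enriched_clicks
    GROUP BY click_date, customer_segment
""")

daily_by_segment.write.mode("overwrite").parquet("s3://example-bucket/reports/daily_clicks/")
```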

● Followed Agile methodology and was involved in daily Scrum meetings, sprint planning, showcases, and retrospectives.

Environment: Spark, Hadoop, Python, PySpark, AWS, MapReduce, ETL, HDFS, Hive, HBase, SQL, Agile, and Windows.

Education:

Saint Peter's University, Jersey City, NJ May 2022 – Feb 2024

Master of Science in Data Science

GPA: 3.8/4.0

Certifications:

AWS Certified Data Engineer – Associate: https://www.credly.com/badges/1b74f73b-1c18-4766-87de-3915d914abc5/linked_in_profile


