
Name: Mirza Baig

Phone No: 615-***-****

Email ID: ady61f@r.postjobfree.com

Data Engineer / Big Data Developer

PROFESSIONAL SUMMARY:

•Around 7 years of programming experience, involved in all phases of the Software Development Life Cycle (SDLC).

•Over 7 years of Big Data experience building highly scalable data analytics applications.

•Strong experience working with Hadoop ecosystem components like HDFS, MapReduce, PySpark, HBase, Oozie, Hive, Sqoop, Pig, Flume and Kafka.

•Good hands-on experience working with various Hadoop distributions, mainly Cloudera (CDH), Hortonworks (HDP) and Amazon EMR.

•Good understanding of Distributed Systems architecture and design principles behind Parallel Computing.

•Expertise in developing production-ready Spark applications utilizing the Spark Core, DataFrame, Spark SQL, Spark MLlib and Spark Streaming APIs, along with scikit-learn and TensorFlow (see the sketch following this summary).

•Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.

•Strong expertise in designing and implementing data models in Snowflake, including tables, schemas, and relationships.

•Worked extensively on Hive for building complex data analytical applications.

•Experience in the design and development of scalable solutions on Google Cloud Platform (GCP).

•Sound knowledge of map-side joins, reduce-side joins, shuffle & sort, distributed cache, compression techniques, and multiple Hadoop input & output formats.

•Worked with Apache NiFi to automate data flow and manage the flow of information between systems.

•Proficient in developing data pipelines and ETL processes to extract, transform, and load data from various sources into Snowflake.

•Good experience working with AWS Cloud services like S3, EMR, Redshift, Athena, DynamoDB, etc.

•Worked on setting up GCP's Cloud Identity and Access Management (IAM) to manage users, roles and privileges.

•Deep understanding of performance tuning and partitioning for optimizing Spark applications.

•Worked on building real-time data workflows using Kafka, Spark Streaming and HBase.

•Extensive knowledge of NoSQL databases like HBase, Cassandra and MongoDB.

•Familiarity with integrating Snowflake with other tools and services, such as data integration platforms or analytics frameworks, to create end-to-end data solutions.

•Solid experience working with CSV, text, SequenceFile, Avro, Parquet, ORC and JSON data formats.

•Proficient with container systems like Docker and container orchestration platforms like EC2 Container Service (ECS) and Kubernetes; also worked with Terraform.

•Designed, implemented, and managed end-to-end data pipelines using Azure Data Factory, orchestrating data movement and transformations across various sources and destinations.

•Managed Docker containerization and orchestration using Kubernetes.

•Extensive experience performing ETL on structured and semi-structured data using Pig Latin scripts.

•Designed and implemented Hive and Pig UDFs using Java for evaluating, filtering, loading and storing data.

•Knowledge of best practices for data engineering in Snowflake, including data partitioning, data distribution, and optimization techniques.

•Experience using the Hadoop ecosystem and analyzing processed data with Tableau.

•Experience with Apache Phoenix to access the data stored in HBase.

•Good knowledge of core programming concepts such as algorithms, data structures and collections.

•Proficient in monitoring and troubleshooting Snowflake instances to ensure smooth operation and resolve any issues or bottlenecks.

•Developed core modules in large cross-platform applications using Java, JSP, Servlets, Hibernate, RESTful web services, JDBC, JavaScript, XML, and HTML.

•Extensive experience in developing and deploying applications using WebLogic, Apache Tomcat and JBoss.

•Development experience with RDBMS, including writing SQL queries, views, stored procedures, triggers, etc.

•Developed ETL workflows using Azure Databricks to process and transform data, ensuring data quality and consistency.

•Strong understanding of Software Development Lifecycle (SDLC) and various methodologies (Waterfall, Agile).

•Experience in data governance practices, including data lineage, data quality, and metadata management in Snowflake.
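
Illustrative code sketch (referenced from the Spark APIs bullet above): a minimal PySpark job combining the DataFrame and Spark SQL APIs for cleansing and summarization. All paths, table names and columns are hypothetical placeholders, not drawn from any specific project.

# Minimal PySpark sketch: read raw records, cleanse, and summarize with Spark SQL.
# All paths, table names, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary-sketch").getOrCreate()

# Read a Parquet dataset (hypothetical path) into a DataFrame.
orders = spark.read.parquet("s3://example-bucket/raw/orders/")

# Basic cleansing: drop rows missing keys and normalize a timestamp column.
clean = (orders
         .dropna(subset=["order_id", "customer_id"])
         .withColumn("order_date", F.to_date("order_ts")))

# Register a temp view and summarize with Spark SQL.
clean.createOrReplaceTempView("orders_clean")
daily = spark.sql("""
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders_clean
    GROUP BY order_date
""")

daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_orders/")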

TECHNICAL SKILLS

Programming Skills

Java/J2EE, JSP, Servlets, AJAX, JDBC, JavaScript, PHP and Python.

Databases

MYSQL, SQL, DB2 and Teradata

Web services

REST, AWS

Servers

Apache Tomcat, WebSphere, JBoss

Operating Systems

Unix, Linux, Windows, Solaris

IDE tools

MyEclipse, Eclipse, NetBeans

QA Tools

SonarQube, Crashlytics (Fabric)

Web UI

HTML, JavaScript, XML, SOAP, WSDL

EDUCATION

Master’s Degree in Computer Science

From Governor’s State University (GSU)

Bachelor’s Degree in Computer Science & Information Technology

From Jawaharlal Nehru Technological University (JNTU)

WORK EXPERIENCE:

CVS Corporation, Richardson, TX. Dec 2021 - May 2023

Data Engineer

Roles & Responsibilities:

•Designed and implemented data storage architecture on GCP, leveraging Google Cloud Storage (GCS) buckets for efficient and scalable data storage.

•Designed, implemented, and managed end-to-end data pipelines using Azure Data Factory, orchestrating data movement and transformations across various sources and destinations.

•Created and managed GCS buckets to store and organize large volumes of data, ensuring proper permissions and data lifecycle management.

•Utilized GCS object versioning and lifecycle management to efficiently manage data retention, archival, and deletion, minimizing storage costs and optimizing data lifecycle processes.

•Implemented Python scripts for data extraction from GCP buckets, enabling seamless access to and retrieval of data for further analysis or processing.

•Designed and implemented data processing pipelines using Google Cloud Dataflow, enabling scalable and efficient data processing and analytics.

•Utilized Azure SQL Database and Azure Synapse Analytics (formerly SQL Data Warehouse) to build scalable and performant data warehousing solutions, enabling efficient data storage and retrieval.

•Leveraged Azure Cosmos DB to manage NoSQL databases, optimizing data access for real-time applications.

•Integrated Cloud Dataflow pipelines with data sources such as Google Cloud Storage and Pub/Sub for seamless data ingestion and output.

•Implemented data transformation and mapping processes to harmonize and standardize healthcare data into FHIR resources, facilitating seamless interoperability and data exchange across systems.

•Implemented data processing solutions using Azure HDInsight (Hadoop, Spark) to handle large-scale data sets, enabling advanced analytics and insights extraction.

•Integrated Azure Stream Analytics and Event Hubs to process real-time streaming data for immediate insights and decision-making.

•Performed data harmonization using Whistle into GCP Healthcare resources based on the business requirements.

•Developed and optimized complex SQL queries in Google BigQuery to extract insights, perform aggregations, and generate reports from large datasets (see the sketch at the end of this section).

•Collaborated with stakeholders to understand FHIR data requirements and ensure alignment with HL7 industry standards.

•Developed and implemented FHIR store purging strategies to maintain data integrity and compliance with data retention policies and regulatory requirements.

•Designed and executed automated purging processes for FHIR stores, utilizing GCP (Google Cloud Platform) services such as Cloud Dataflow to schedule and execute purging jobs.

•Designed data models and schemas using Azure Data Lake Storage and Azure SQL Data Warehouse, ensuring optimal performance for analytical queries.

•Created star and snowflake schemas to support multidimensional analysis and reporting.

•Worked on migration of on-premises data workloads to Azure cloud infrastructure, optimizing performance, scalability, and cost-effectiveness.

•Refactored existing data pipelines to take advantage of Azure services and architecture best practices.

•Utilized FHIR APIs to identify and select the appropriate data for purging based on criteria such as date, resource type, and retention period.

•Developed data purging scripts in Python, effectively removing expired FHIR resources from the store while maintaining data and referential integrity.

•Experience creating email alerts with Splunk, as well as designing and building interactive dashboards to visualize log data and operational metrics for effective data-driven decision-making.

•Implemented CI/CD pipelines using Azure DevOps to automate deployment, testing, and monitoring of data solutions.

•Used version control systems like Git for tracking changes to data engineering code and configurations.

Environment: GCS, BigQuery, SQL, Dataflow, Cloud Functions, Pub/Sub, Python, GCP FHIR Store, Cloud Shell, Unix/Linux, Splunk.
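
Illustrative code sketch (referenced from the BigQuery bullet above): a minimal Python example of pulling a raw object from a GCS bucket and running an aggregation in BigQuery, assuming the google-cloud-storage and google-cloud-bigquery client libraries. Bucket, blob, project, dataset and table names are hypothetical placeholders.

# Minimal sketch of GCS extraction plus a BigQuery aggregation.
# Bucket, blob, project, dataset, and table names are hypothetical placeholders.
from google.cloud import storage, bigquery

# Pull a raw export out of a GCS bucket for downstream processing.
gcs = storage.Client()
blob = gcs.bucket("example-healthcare-raw").blob("exports/claims_2023-01.json")
raw_bytes = blob.download_as_bytes()

# Run an aggregation in BigQuery and iterate over the result rows.
bq = bigquery.Client()
query = """
    SELECT resource_type, COUNT(*) AS resource_count
    FROM `example_project.fhir_exports.resources`
    GROUP BY resource_type
    ORDER BY resource_count DESC
"""
for row in bq.query(query).result():
    print(row.resource_type, row.resource_count)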

Target Corporation, Minneapolis, MN. Mar 2019 – Nov 2021.

Hadoop Developer

Roles & Responsibilities:

•Developed Spark applications using PySpark, utilizing the DataFrame and Spark SQL APIs for faster processing of data.

•Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirements.

•Built a data pipeline consisting of Spark (PySpark), Hive, Sqoop and custom-built input adapters to ingest, transform and analyze operational data.

•Worked with Snowflake-specific features such as virtual warehouses, clustering, and caching to optimize data storage and processing.

•Developed Spark jobs and Hive Jobs to summarize and transform data.

•Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.

•Involved in converting Hive/SQL queries into Spark transformations using Spark (PySpark) DataFrames and Scala.

•Used various data integration tools to move data between different databases and Hadoop.

•Analyzed the SQL scripts and designed the solution to be implemented using PySpark.

•Worked on performance tuning and query optimization in Snowflake to improve query performance and reduce processing time.

•Involved in the installation of Tez, which improved query performance.


•Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications (see the sketch at the end of this section).

•Ingested syslog messages, parsed them and streamed the data to Kafka.

•Worked on the design and development of scalable solutions on Google Cloud Platform (GCP).

•Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data back into HDFS.

•Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.

•Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.

•Worked on migration from on-premises infrastructure to Google Cloud Platform (GCP).

•Analyzed the data by performing Hive queries (HiveQL) to study customer behavior.

•Worked with data analysts and business users to understand their data requirements and provide them with reliable and accessible data within Snowflake.

•Helped DevOps engineers deploy code and debug issues.

•Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

•Developed Hive scripts in HiveQL to de-normalize and aggregate the data.

•Scheduled and executed workflows in Oozie to run various jobs.

•Experience using the Hadoop ecosystem and processing data on Amazon AWS.

Environment: Hadoop, HDFS, HBase, Spark, Scala, Snowflake, Hive, MapReduce, Sqoop, ETL, Java, PL/SQL, Oracle 11g, Unix/Linux.
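
Illustrative code sketch (referenced from the real-time pipeline bullet above): a minimal PySpark Structured Streaming consumer reading from Kafka. Broker addresses, topic name and checkpoint path are hypothetical placeholders, and the console sink stands in for the real downstream store.

# Minimal sketch of a Spark Structured Streaming job consuming from Kafka.
# Broker addresses, topic, and checkpoint path are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to a Kafka topic (requires the spark-sql-kafka package on the classpath).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "syslog-events")
          .load())

# Kafka delivers key/value as binary; cast the value to a string for parsing.
lines = events.select(F.col("value").cast("string").alias("message"))

# Write the stream out (console sink used here purely for illustration).
query = (lines.writeStream
         .outputMode("append")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/syslog")
         .start())

query.awaitTermination()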

Capital One, Seattle, WA. Apr 2016 – Mar 2019

Hadoop Developer

Roles & Responsibilities:

•Built a Spark framework with Scala and migrated existing PySpark applications to it to improve runtime and performance.

•Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirements.

•Performed transformations such as de-normalizing and cleansing data sets, date transformations, and parsing of complex columns.

•Worked with different compression codecs like GZIP, SNAPPY and BZIP2 in MapReduce, Pig and Hive for better performance.

•Worked with Apache NiFi to automate data flow and manage the flow of information between systems.

•Used Ansible for framework automation.

•Handled Avro, JSON and Apache Log data in Hive using custom Hive SerDes.

•Worked on batch processing and scheduled workflows using Oozie.

•Implemented installation and configuration of multi-node cluster on the cloud using Amazon Web Services (AWS) on EC2.

•Worked on the design and development of scalable solutions on Google Cloud Platform (GCP).

•Worked in an Agile sprint-based environment.

•Used the Knox Gateway to provide Hadoop security between users and operators.

•Used cloud computing on a multi-node cluster, deployed Hadoop applications on S3, and used Elastic MapReduce (EMR) to run MapReduce jobs.

•Used HiveQL to create partitioned RC and ORC tables, applying compression techniques to optimize data processing and enable faster retrieval (see the sketch at the end of this section).

•Implemented Partitioning, Dynamic Partitioning and Buckets in Hive for efficient data access.

Environment: Apache Hadoop, HDFS, AWS EMR, Java, MapReduce, Eclipse Indigo, Hive, HBase, Pig, Sqoop, Oozie, SQL, Spring.
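
Illustrative code sketch (referenced from the HiveQL bullet above): partitioned ORC tables with compression, expressed as HiveQL issued through a Hive-enabled PySpark session. Database, table and column names are hypothetical placeholders.

# Minimal sketch of HiveQL partitioning and compression, issued via PySpark
# with Hive support. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned ORC table with compression enabled via table properties.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events_orc (
        event_id STRING,
        user_id  STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")

# Dynamic-partition insert from a staging table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.events_orc PARTITION (event_date)
    SELECT event_id, user_id, payload, event_date
    FROM analytics.events_staging
""")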


