Data Engineer Big

Location:
Allen, TX
Posted:
November 13, 2023

Resume:

Certified Sr. Data Engineer with over * years of IT experience across all phases of the software development process, including analysis, development, testing, and deployment. Deep technical knowledge and practical understanding of Big Data, Cloud, and DevOps tooling, and experienced in programming languages such as Python, PySpark, Scala, SQL, and SnowSQL.

Technical Summary:

Implemented end-to-end data pipelines by collaborating with cross-functional teams and utilizing Agile methodologies to deliver high-quality products on time.

Worked with multiple Big Data distributions and platforms, including Hortonworks, Cloudera, Microsoft Azure, Snowflake, and AWS.

Experience working with AWS Cloud services like S3, EMR, Redshift, Athena, DynamoDB, Glue Data Catalog, etc.

Experience working with Spark and Spark Streaming frameworks for building batch and real-time data transformation pipelines (see the sketch at the end of this summary).

Experience working with Kafka for storing real-time feeds and developing Kafka producers and consumers.

Experience in performance tuning of long-running Spark jobs and troubleshooting failures.

Experience working with Microsoft Azure Cloud services like Azure Data Lake Storage (ADLS) Gen2, Azure Blob Storage, Azure Function Apps, Azure Data Factory (ADF), Azure Databricks, Azure Event Hubs, etc.

Experience in using various Hadoop services like Hive, HBase, Sqoop, Oozie, Impala, Hue, YARN, HDFS, etc.

Experience in using different optimized file formats like Avro, Parquet, Delta Lake, etc.

Experienced in building optimized queries and code in Hive, Pig, and MapReduce.

Experience in extending Hive functionality by writing custom UDFs.

Experience building, orchestrating, and automating end-to-end data pipelines using Oozie, Azure Databricks Workflows, and Apache Airflow.

Experience building REST APIs to enable data consumption.

Expertise in working with different formats of data: structured, semi-structured, and unstructured.

Experience in working with reporting tools like QlikView, Tableau, and Zoomdata.

Experienced in writing Unix shell and Python scripts for various validations and automations.

Hands-on experience with NoSQL databases like DocumentDB, DynamoDB, Cassandra, and MongoDB.

Experienced in branching, tagging, and maintaining versions across environments using SCM tools like Git, Subversion (SVN), and CVS on Linux and Windows platforms.

Built Talend and NiFi integrations for bi-directional data ingestion between different sources.

Experience working in Agile and Waterfall environments.

Worked on different tools and utilities like Eclipse, IntelliJ, SBT, and Maven.

Experience using tools like Elasticsearch, Logstash, Kibana, and Grafana for alerting and monitoring.

Experience in analytical querying and loading data using the Spark Snowflake connector.

Experience working in L3 support and on-call rotations for production support.

Highly motivated self-learner with a positive attitude, willing to learn new concepts and take on challenges.
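
Illustrative example (not tied to a specific engagement): a minimal PySpark Structured Streaming sketch that consumes a Kafka topic and writes the parsed records to a Delta table, as referenced in the Spark Streaming bullet above. The broker address, topic name, schema, and paths are assumed placeholders.

    # Minimal PySpark Structured Streaming sketch: Kafka -> parsed JSON -> Delta sink.
    # Broker, topic, schema, and checkpoint/output paths are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

    # Assumed schema of the JSON payload on the topic.
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
           .option("subscribe", "events-topic")                # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Append the parsed events to a Delta path; checkpointing makes the stream restartable.
    query = (events.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder path
             .outputMode("append")
             .start("/mnt/datalake/events"))                           # placeholder path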

Skills Summary:

Programming Languages: Python, PySpark, SQL, SnowSQL, and Scala

Big Data: Spark, Hadoop, MapReduce, Sqoop, Pig, Hive, Kafka, YARN, Impala

NoSQL: HBase, DynamoDB, Cassandra, MongoDB

Scheduling Tools: Oozie, Apache Airflow, AWS Simple Workflow Service (SWF), Databricks Workflows

Cloud Platforms: Amazon Web Services and Azure

Build and CI/CD: Maven, Jenkins, Git

Monitoring: ELK stack, Grafana

Scripting Languages: Shell Script

ETL Tools: DataStage, DBT

Certifications:

Snowflake SnowPro Core Certification

Databricks Certified Data Engineer Professional

Education:

Jawaharlal Nehru Technological University, Hyderabad, India, Bachelor’s Degree, Electrical and Electronics Engineering, Aug 2008 – May 2012

Experience Summary:

AT&T, Plano, TX Jun 2021 – Present

Worked as: Data Engineer

Responsibilities:

Responsible for maintaining and ingesting large volumes of workforce analytical data to database tables in Databricks.

Performed performance tuning of the PySpark code by analyzing Spark logs and DAGs on Palantir Foundry.

Migrated the existing workflows from Palantir Foundry to Microsoft Azure.

Ingested Office 365 MGDC data into Data Lake ADLS Gen2 using the ADF data pipeline.

Created linked services for multiple source systems like Azure SQL Server, ADLS Gen2, Blob Storage, and Rest APIs.

Implemented various data validation and data quality checks while maintaining the privacy and security of the data.

Reduced manual data handling by building efficient end-to-end data pipelines across ADLS Gen2, Azure Blob Storage, Azure Databricks, and Snowflake using PySpark and SnowSQL.

Developed UDF connectors for Snowflake and SQL Server to read and save the transformed data.

Developed the incremental delta load logic for the ETL data pipeline in Azure Databricks using PySpark and Python (see the first sketch below).

Developed PySpark and SnowSQL scripts to parse various formats of structured and semi-structured data.

Created Views and Materialized Views according to the business needs in Snowflake using SnowSQL.

Implemented dynamic masking and tagging in Snowflake to mask sensitive data and keep track of all tables created in databases.

Worked closely with our data scientist teams and business consumers to shape the datasets as per the requirements.

Maintained the daily scheduled Databricks Spark workflows, monitoring them and running data quality checks for each scheduled run.

Developed ETL mappings for data collection from various data feeds using REST APIs.

Extracted data from APIs and web pages, such as the internal AT&T Stack Overflow instance and the Matomo web analytics platform, via APIs and web scraping as per the requirements, using Python and PySpark in Azure Databricks and Azure Function Apps.

Automated the data ingestion process using Python scripts in Azure Function Apps to extract the data daily and save it to Snowflake (see the second sketch below).

Worked directly with the client user community, such as data analysts, to define and document data.

Created reusable software components (e.g., specialized Spark UDFs) and analytics applications. Supported architecture evaluation of the enterprise data platform through the implementation and launch of data preparation and data science capabilities.

Implemented data mining techniques to achieve data synchronization, redundancy elimination, source identification, data reconciliation, and problem root cause analysis.

Investigated and resolved data issues across platforms and applications, including discrepancies in definition, format, and function.

Utilized Azure DevOps and Repos for Git integration and version control, and collaborated with development teams to ensure code integrity and traceability throughout the CI/CD pipeline.
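
First sketch (illustrative only): the incremental delta load pattern mentioned above, expressed as a Delta Lake MERGE in Databricks PySpark. The table names, key column (record_id), and watermark column (load_date) are assumed placeholders, not the actual schema.

    # Illustrative incremental (delta) load into a Databricks Delta table using MERGE.
    # Table names, key column, and watermark column are hypothetical placeholders.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Pick up only source rows newer than the latest load already in the target.
    last_loaded = spark.sql(
        "SELECT MAX(load_date) AS w FROM analytics.workforce_target"
    ).first()["w"]

    updates = spark.table("staging.workforce_source")
    if last_loaded is not None:  # first run loads everything
        updates = updates.filter(col("load_date") > last_loaded)

    target = DeltaTable.forName(spark, "analytics.workforce_target")

    # Upsert: update rows whose key already exists, insert the rest.
    (target.alias("t")
     .merge(updates.alias("s"), "t.record_id = s.record_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())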

Skills: Spark, Azure Data Lake Storage (ADLS) Gen2, Azure Blob Storage, Azure Function Apps, Azure Data Factory (ADF), Azure Databricks, Azure DevOps, PySpark, SQL, Snowflake, and SnowSQL
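
Second sketch (illustrative only): a timer-triggered Azure Function in Python for the daily automated API pull mentioned above. The endpoint URL is a placeholder, the timer schedule lives in the accompanying function.json binding, and the Snowflake write step is only indicated in a comment.

    # Illustrative timer-triggered Azure Function (Python) for a daily API pull.
    # The schedule is defined in the function.json binding; the URL is a placeholder.
    import logging

    import azure.functions as func
    import requests


    def main(mytimer: func.TimerRequest) -> None:
        """Runs on the daily timer trigger and pulls the latest records."""
        resp = requests.get("https://example.internal/api/daily-metrics")  # placeholder
        resp.raise_for_status()
        records = resp.json()
        logging.info("Pulled %d records", len(records))
        # In the real pipeline the records were written to Snowflake
        # (e.g., via the Snowflake Python connector); omitted here.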

Discovery Communication, Sterling, VA Feb 2020 – May 2021

Worked as: Data Engineer

Responsibilities:

Responsible for migrating an on-prem data lake to an AWS S3-based data lake.

Implemented Spark Streaming to consume data from a Kafka topic, which enabled real-time data availability for data lake users.

Developed PySpark-based Spark applications in AWS EMR used for performing data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning and reporting teams to consume.

Worked on troubleshooting Spark applications to make them more error-tolerant.

Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other capabilities.

Worked on fine-tuning Spark applications to improve the overall processing time for the pipelines.

Designed and developed ETL workflows using AWS Glue to extract data from Amazon S3 and load it into Amazon Redshift, ensuring data accuracy and integrity (see the second sketch below).

Configured and optimized AWS Glue crawlers to automatically infer schema and metadata from semi-structured and unstructured data stored in S3.

Monitored ETL job performance and resource utilization, optimizing job parameters for improved efficiency and reduced processing time.

Ensured data security by implementing encryption for data at rest and in transit and enforced access controls using AWS IAM policies.

Developed the DBT connections to connect to Redshift.

Designed and developed the ETL jobs using DBT to achieve the best performance.

Developed Python operators in Airflow for data validations and quality checks on the daily scheduled data pipelines (see the first sketch below).

Implemented Airflow for authoring, scheduling, orchestration, and monitoring data pipelines and designed several DAGs for automating ETL pipelines.

Responsible for end-to-end data pipeline code development with Bitbucket and Bamboo CI/CD flow.

Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.

Documented operational problems by following standards and procedures using JIRA.
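
First sketch (illustrative only): the Airflow data-quality pattern mentioned above, with a PythonOperator check that fails the run if the daily load produced no rows. The DAG id, task names, and the stubbed load step are assumed placeholders.

    # Illustrative Airflow DAG with a PythonOperator data quality check.
    # DAG id, task names, and the stubbed load step are hypothetical placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def load_to_redshift():
        """Placeholder load step; the real task loaded data via AWS Glue and Redshift."""
        loaded_rows = 1000  # stand-in for the actual number of rows loaded
        return loaded_rows  # the return value is pushed to XCom automatically


    def check_row_count(ti):
        """Data quality check: fail the run if the load produced no rows."""
        row_count = ti.xcom_pull(task_ids="load_to_redshift") or 0
        if row_count == 0:
            raise ValueError("Data quality check failed: 0 rows loaded")


    with DAG(
        dag_id="daily_ingest_dq",          # placeholder DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)
        dq = PythonOperator(task_id="row_count_check", python_callable=check_row_count)
        load >> dq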

Skills: AWS S3, IAM, CloudWatch, Service Catalog, CloudFormation Stacks (CFTs), Spark, Kafka, AWS Glue, AWS EMR, Apache Airflow, PySpark, Python, SQL, Redshift, Bitbucket, Bamboo, and JIRA
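
Second sketch (illustrative only): the general shape of an AWS Glue ETL job that reads a crawler-registered table from the Data Catalog, trims it to the needed fields, and writes to Redshift through a catalog connection. Database, table, connection, and S3 temp-dir names are assumed placeholders.

    # Illustrative AWS Glue ETL job (PySpark): Data Catalog source -> field mapping -> Redshift.
    # Database, table, connection, and S3 temp-dir names are hypothetical placeholders.
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table that a Glue crawler registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="events"
    )

    # Keep and retype only the fields the warehouse needs.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("event_id", "string", "event_id", "string"),
            ("user_id", "string", "user_id", "string"),
            ("amount", "double", "amount", "double"),
        ],
    )

    # Write to Redshift through a catalog connection, staging via S3.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "analytics.events", "database": "dw"},
        redshift_tmp_dir="s3://placeholder-temp-bucket/glue/",
    )

    job.commit()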

Sprint, Reston, VA Jan 2019 – Jan 2020

Worked as: Data Engineer

Responsibilities:

Responsible for ingesting large volumes of user behavioral data and customer profile data to the Analytics Datastore.

Defined the schema, staging tables, and landing zone tables; configured base objects, foreign-key relationships, and complex joins; and built efficient views.

Migrated the data from Teradata to Azure Data Lake using Azure Data Factory (ADF) with a self-hosted integration runtime.

Automated the data pipelines by orchestrating and scheduling them in Azure Data Factory (ADF).

Monitored the daily logs and debugged errors in the Azure Data Factory (ADF) data pipelines.

Developed ETL pipelines using Spark and Hive for performing various business-specific transformations.

Built data applications and automated the Spark pipelines for bulk loads as well as incremental loads of various datasets.

Worked closely with our data scientist teams and business consumers to shape the datasets as per the business requirements.

Automated the data pipeline to ETL all the Datasets along with full loads and incremental loads of data.

Utilized Azure services like ADF, ADLS Gen2, Databricks, and Azure Blob Storage, along with Teradata, extensively for building the data applications.

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources like Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back to the sources.

Wrote Spark applications to perform operations like data inspection, cleaning, loading, and transformation on large sets of structured and semi-structured data (see the sketch below).

Developed Spark applications with PySpark and Spark SQL for testing and processing data.

Reported Spark job stats and made monitoring and data quality checks available for each dataset.
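
Illustrative sketch of the kind of Spark application described above: reading semi-structured JSON from a landing path, applying basic cleaning, and appending into a partitioned analytics table. The input path, column names, and target table are assumed placeholders.

    # Illustrative PySpark batch job: read semi-structured JSON, clean it, and load an
    # analytics table. Path, columns, and target table are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date, trim

    spark = (SparkSession.builder
             .appName("behavioral-data-clean")
             .enableHiveSupport()
             .getOrCreate())

    raw = spark.read.json("/data/landing/user_behavior/")  # placeholder landing path

    cleaned = (raw
               .dropDuplicates(["event_id"])                        # drop duplicate events
               .filter(col("customer_id").isNotNull())              # drop records without a key
               .withColumn("event_date", to_date(col("event_ts")))  # derive partition column
               .withColumn("channel", trim(col("channel"))))        # normalize a text field

    # Append into a partitioned analytics table for downstream consumers.
    (cleaned.write
     .mode("append")
     .partitionBy("event_date")
     .saveAsTable("analytics.user_behavior"))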

Skills: Spark, Hive, PySpark, Spark-SQL, ETL, Azure Data Factory (ADF), Azure Databricks, Azure Data Lake, Azure Blob Storage, and Teradata

Citizens Bank, Providence, RI May 2017 – Dec 2018

Worked as: Big Data Engineer

Responsibilities:

Responsible for building scalable distributed data solutions using Spark.

Migrated MR jobs into Spark applications utilizing the DataFrames API and Spark SQL to improve performance.

Used REST APIs to connect with MapR tables.

Used Spark-Streaming APIs to perform transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real-time and persists into Cassandra.

Expertise in performance tuning of Spark Applications for setting the right Batch Interval time, the correct level of Parallelism, and memory tuning.

Developed DataFrames and case classes for the required input data and performed the data transformations using Spark Core.

Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation, queries, and writing data back into OLTP systems.

Strong working experience with Cassandra, retrieving data from Cassandra clusters and running queries.

Deployed and maintained multi-node Dev and Test Kafka Clusters.

Developed Spark scripts by using Scala as per the requirement.

Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.

Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.

Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.

Developed equivalent Spark Scala code for existing SAS code to extract summary insights from the Hive tables.

Responsible for importing data from different sources like MySQL databases into HDFS and saving it in Avro and JSON file formats.

Imported data from S3 to Hive using Sqoop and Kafka.

Created partitioned Hive tables, loaded and analyzed data using Hive queries, and implemented partitioning and bucketing in Hive (see the sketch below).

Worked on a POC to compare the processing time of Impala with Apache Hive for batch applications to implement the former in the project.

Developed Hive queries to process the data and generate the data cubes for visualization.

Worked on Talend Open Studio for designing ETL jobs for processing data.

Configured Hadoop clusters and coordinated with Big Data Admins for cluster maintenance.
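
Illustrative sketch of the partitioning and bucketing approach referenced above, here expressed with Spark's DataFrameWriter rather than Hive DDL. The source path, table, and column names are assumed placeholders.

    # Illustrative partitioned and bucketed table written with Spark's DataFrameWriter,
    # analogous to the Hive partitioning/bucketing above. Paths and names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition-bucket-demo")
             .enableHiveSupport()
             .getOrCreate())

    txns = spark.read.parquet("/data/raw/transactions/")  # placeholder source path

    # Partition by date for pruning; bucket by account_id to speed up joins and aggregations.
    (txns.write
     .mode("overwrite")
     .partitionBy("txn_date")
     .bucketBy(32, "account_id")
     .sortBy("account_id")
     .saveAsTable("analytics.transactions"))

    # Downstream queries can then prune partitions, e.g. one day's totals per account:
    daily = spark.sql(
        "SELECT account_id, SUM(amount) AS total "
        "FROM analytics.transactions "
        "WHERE txn_date = '2018-06-01' "
        "GROUP BY account_id"
    )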

Skills: Spark, MR, Hadoop, YARN, Spark-SQL, ELK, Hive, HDFS, Kafka, ETL, Cassandra, Scala

Honeywell, India Jun 2015 – Mar 2017

Worked as: Java Developer

Responsibilities:

Used Core Java, Web Services, and JDBC technologies to automate reading information, storing it in a database, and displaying results.

Involved in various phases of the Software Development Life Cycle (SDLC) of the application: requirement gathering, design, analysis, and code development, as well as implementation in a production environment.

Generated Use case diagrams, Class diagrams, and Sequence diagrams.

Designed and developed various modules of the application with design architecture, Spring MVC architecture, and Spring Bean Factory.

Developed the web application modules using Core Java concepts like collections, OO principles, and exception handling.

Created websites with extensive use of JavaScript (including Cookies), and database connectivity.

Implemented the database connectivity using JDBC with Oracle / SQL Server / MySQL database as backend.

Database development required the creation of new tables, SQL stored procedures, functions, views, indexes and constraints, triggers, and required SQL tuning to reduce the response time in the application.

Used Maven for generating system builds and managing dependencies.

Wrote unit tests using JUnit and resolved bugs and other defects.

Wrote JUnit test cases for Spring Controllers and Web Service clients in the service layer.

Used Log4j for debugging, testing, and maintaining the system state.

Built a program to query information from MongoDB and, using this data, created the data structures that allow the front end to present charts and OLAP objects.

Created the production configuration, performed deployments, and monitored results.

Skills: Java Core, Web Services, and JDBC technologies, Spring MVC architecture, Spring Bean Factory, JavaScript, SQL Server, MySQL, Maven, MongoDB.


