Data Engineer

Location:
United States
Posted:
March 14, 2024

Resume:

Name: Saideep K

Senior Data Engineer

Contact: +1-469-***-****

Email: ad4b8h@r.postjobfree.com

PROFESSIONAL SUMMARY:

●Proficient IT professional with over 9 years of experience, specializing in Azure Data Engineering and the Big Data ecosystem: data acquisition, ingestion, modelling, storage, analysis, integration, and processing.

●Extensive experience working with Azure Databricks, Synapse Analytics, Azure Data Factory, Stream Analytics, Azure Analysis Services, Data Lake, Azure Storage, Azure SQL Database, SQL Data Warehouse, and Azure Cosmos DB.

●Designed, implemented, and managed complex data processing pipelines using AWS Batch, integrating with Amazon S3 for data storage and retrieval.

●Automated deployment workflows using CI/CD tools like Jenkins and GitLab CI/CD, streamlining the release process of AWS Batch applications.

●Designed and implemented serverless applications using AWS Lambda, API Gateway, and other AWS services.

●Integrated third-party services with AWS Lambda, such as payment gateways and external APIs, to enhance application functionality.

●Architected and developed a multi-region, highly available serverless platform using AWS Lambda, DynamoDB, and API Gateway.

●Expertise in working with Azure services like HDInsight, Application Insights, Azure Monitor, Azure AD, Function Apps, Logic Apps, Event Hubs, IoT Hub, Storage Explorer, and Key Vault.

●Strong working experience with SQL and NoSQL databases (Azure Cosmos DB, MongoDB, HBase, Cassandra), data modelling, tuning, disaster recovery, backup and creating data pipelines.

●Designed and developed interactive Qlik Sense dashboards and reports for business users to gain actionable insights and make informed decisions.

●Led end-to-end data integration projects using Oracle Data Integrator (ODI) to extract, transform, and load data from various source systems into the data warehouse.

●Integrated Qlik Sense with various data sources, including relational databases and cloud-based data warehouses.

●Conducted data quality checks and ensured data accuracy and integrity within Qlik Sense applications.

●Extensive experience creating pipeline jobs and schedule triggers using Azure Data Factory.

●Leveraged Kubernetes for developing, deploying, and orchestrating microservices, ensuring high availability, scalability, and fault tolerance of distributed applications.

●Designed and implemented streaming and batch data pipelines using Apache Spark, Apache Flink, Amazon Kinesis, and Apache Kafka to process large-scale unstructured data sets efficiently.

●Proficient in handling and ingesting terabytes of streaming data (Kafka, Event Hub, IoT Hub, Kinesis, Spark Streaming, Storm) and batch data, with automation and scheduling (Oozie).

●Good experience designing cloud-based solutions in Azure by creating Azure SQL databases, setting up Elastic Pool jobs, and designing tabular models in Azure Analysis Services.

●Experience working with AWS services: AWS Glue, Amazon Managed Streaming for Apache Kafka (MSK), Athena, AWS CloudFormation templates, ECS, Network Load Balancer, API Gateway, and IAM roles and policies.

●Experience working with GCP services: Dataflow jobs, Apache Beam, Pub/Sub, Cloud Composer, Apache Airflow DAGs for scheduling jobs, BigQuery, and Google Cloud Storage.

●Developed data integration strategies, ensuring seamless data flow between various systems, including CRM, CMS, and marketing automation platforms.

●Strong knowledge in working with ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.

●Implemented and managed the end-to-end deployment of the company's CDP, utilizing tools such as Segment, Tealium, and Adobe Audience Manager.

●Skilled in using Azure authentication and authorization, with experience using visualization tools like Tableau and Power BI. Basic hands-on experience working with Kusto.

●Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).

TECHNICAL SKILLS:

Microsoft Azure: Azure Databricks, Azure Data Factory, Synapse Analytics, HDInsight, ADLS, Azure Storage, Data Explorer, Azure Functions, Event Hub, IoT Hub, Logic Apps, Stream Analytics, Azure Web App, Azure Analysis Services, Application Insights, Azure Active Directory, Key Vault

Hadoop/Big Data/AWS Technologies: Apache Hadoop 2.x/1.x, Cloudera CDH, Hortonworks HDP, HDFS, MapReduce, Sqoop, Hive, Oozie, Spark, Zookeeper, Kafka, Airflow, Flume; Amazon AWS (EMR, EC2, EBS, RDS, S3, Kinesis)

Programming/Scripting Languages: Python, SQL, Scala, Java, R, Pig, HiveQL, C, C++

NoSQL Databases: Azure Cosmos DB, MongoDB, DynamoDB, HBase, Cassandra

Relational Databases: Azure SQL Database, SQL Server, Oracle, MySQL, PL/SQL

Monitoring/Reporting: Power BI, Tableau, Azure Monitor, Log Analytics, custom shell scripts

Java Technologies: Servlets, JSP, JDBC, EJB, Struts, Spring, Hibernate

Web Development: JavaScript, HTML, CSS, jQuery

Development/Build Tools: PyCharm, Anaconda, Eclipse, Ant, Maven, IntelliJ, JUnit, log4j, Gradle

Operating Systems: Linux, Unix, macOS, CentOS, Windows 10, Windows 8, Windows 7

Version Control: Git, Bitbucket, SVN, GitHub

Methodologies: Agile, Waterfall

EDUCATION:

●Master of Science, Texas A&M University, Kingsville

●Bachelor of Technology, Kakatiya Institute of Technology and Science, Warangal, India.

PROFESSIONAL EXPERIENCE:

Client: UPS, NJ July 2023-Present

Senior Data Engineer

Built Function-as-a-Service (FaaS) offerings, a category of cloud computing services that provides a platform for developing, running, and managing application functionality without the complexity of building and maintaining the infrastructure typically associated with application development.

Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
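A minimal illustrative sketch of this serverless pattern in Python with boto3; the table name, payload fields, and handler layout are assumptions for illustration, not specifics of this project:

```python
import json
import boto3

# Hypothetical table name, for illustration only
TABLE_NAME = "shipment_events"

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def lambda_handler(event, context):
    """Handle an API Gateway proxy request and persist the payload to DynamoDB."""
    body = json.loads(event.get("body") or "{}")

    # Write the incoming record; attribute names are illustrative
    table.put_item(Item={
        "event_id": body["event_id"],
        "status": body.get("status", "UNKNOWN"),
    })

    return {
        "statusCode": 200,
        "body": json.dumps({"stored": body["event_id"]}),
    }
```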

Implemented CI/CD pipelines using AWS CodePipeline and CodeBuild for automated deployment of AWS Lambda functions.

Designed and developed data visualization dashboards and reports using IDMC's reporting and analytics tools, providing actionable insights to stakeholders.

Partnered closely with engineering stakeholders to gather requirements, understand project objectives, and design scalable data pipelines to meet them.

Demonstrated expertise in deploying, managing, and orchestrating containerized applications using Kubernetes, ensuring high availability, scalability, and fault tolerance.

Optimized application performance by fine-tuning AWS Lambda function configurations, including memory allocation and execution time.
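As a small, hedged illustration of that kind of tuning via boto3 (the function name and values are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Adjust memory and timeout for a function; the name and numbers are illustrative
lambda_client.update_function_configuration(
    FunctionName="example-ingest-handler",  # hypothetical function name
    MemorySize=1024,  # MB; increasing memory also increases CPU allocation
    Timeout=60,       # seconds
)
```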

Utilized Kubernetes to implement and maintain microservices architectures, leveraging features such as service discovery, load balancing, and automatic scaling to streamline application development and deployment.

Designed and implemented data storage architectures on Google Cloud Platform (GCP), leveraging services such as Cloud Storage, Cloud Bigtable, and Cloud SQL to accommodate large volumes of data.

Led the design and implementation of data warehouse solutions using Kimball's dimensional modeling techniques, enabling business users to access critical information easily.

Integrated AWS Lambda with AWS API Gateway to create scalable and efficient RESTful APIs.

Optimized ETL processes for performance and scalability, identifying and resolving bottlenecks to meet SLAs and processing requirements.

Integrated ETL processes with business intelligence (BI) tools such as Tableau, Power BI, and MicroStrategy for interactive data visualization and reporting.

Designed, implemented, and managed Tealium Tags for various web properties, ensuring accurate data collection and improved analytics capabilities.

Collaborated with marketing and analytics teams to define tracking requirements, resulting in enhanced data-driven decision-making.

Developed and maintained scalable Spark data pipelines using Scala, processing large datasets to extract, transform, and load data from diverse sources, resulting in improved data quality and reliability.

Employed ETL frameworks such as Airflow, Flume, and Oozie to orchestrate and automate the execution of ETL workflows, ensuring robustness, reliability, and scalability of data pipelines in production environments.

Utilized high-level programming languages such as Java, Scala, and Python to implement data processing logic and algorithms within Spark and MapReduce frameworks, optimizing performance and resource utilization.

Utilized [CI/CD Tool] for automated build and deployment of SaaS applications, reducing release cycles by [time] and improving software quality.

Created ECS containers, Network Load Balancers, API Gateway, and other AWS services using AWS CloudFormation templates in Dev, QA, and Prod environments.

Collaborated with development teams to integrate CI/CD practices into the development process, ensuring faster time-to-market.

Leveraged Spark/MapReduce development expertise to design and deploy production-quality ETL pipelines, integrating with distributed storage and compute technologies such as S3, Hive, and Spark for efficient data processing and analysis.

Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.

Developed file cleaners in Python, utilizing libraries such as Boto3 and Pandas.
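A brief sketch of such a file cleaner, assuming CSV inputs in S3; the bucket, key layout, and cleanup rules are illustrative:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def clean_file(bucket: str, key: str) -> None:
    """Download a raw CSV from S3, apply basic cleanup, and write a cleaned copy back."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # Basic cleanup: drop fully empty rows and normalize column names
    df = df.dropna(how="all")
    df.columns = [c.strip().lower() for c in df.columns]

    out = io.StringIO()
    df.to_csv(out, index=False)
    s3.put_object(Bucket=bucket, Key=f"cleaned/{key}", Body=out.getvalue())
```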

Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.

Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.

Created external tables with partitions using Hive, AWS Athena, and Redshift.

Developed the PySpark code for AWS Glue jobs and for EMR.
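A minimal sketch of a Glue-style PySpark job along these lines; the S3 paths, Glue connection, and Redshift table names are placeholders, not details from this engagement:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read Parquet campaign files from S3 (path is a placeholder)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaign/"]},
    format="parquet",
)

# Light cleanup before loading (field name is illustrative)
cleaned = source.drop_fields(["_corrupt_record"])

# Load into Redshift through a pre-defined Glue catalog connection (names are placeholders)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "campaign_stage", "database": "analytics"},
    redshift_tmp_dir="s3://example-bucket/tmp/",
)

job.commit()
```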

Good understanding of other AWS services such as S3, EC2, IAM, and RDS; experience with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.

Environment: AWS Glue, Athena, AWS Jupyter Notebooks, ECS Containers, IDMC, Network Load Balancer, API Gateway, GCP, Dataflow Jobs, Cloud Composer, Terraform, BigQuery, Python, Snowflake.

Client: Texas Higher Education Coordinating Board, Austin, TX Jun 2020 – 2023

Senior Azure Data Engineer

Extensive working experience with Microsoft Azure and the Cloudera Hadoop distribution platform.

Made use of Azure Data Factory (ADF) for Extract, Transform, Load (ETL) to ingest data from legacy external data stores (SAP HANA, SFTP servers, Cloudera Hadoop HDFS) into Azure Data Lake Storage Gen2 (ADLS Gen2).

Demonstrated proficiency in big data technologies such as Hadoop, Spark, and Hive, contributing to the design and implementation of distributed data processing solutions to address complex business requirements.

Responsible for implementing complex ETL jobs that transform data visually with ADF Data Flows and using Azure Databricks, Azure Blob Storage, Azure SQL Database, and Cosmos DB.

Created and automated data ETL processes for the database application and mappings in Informatica.

Managed incoming data from multiple sources and involved in maintenance of the HDFS and loading of the structured data.

Successfully led the implementation and integration of SaaS CDP solutions, enabling real-time data synchronization across multiple systems and platforms.

Successfully designed, deployed, and managed Tealium Tags for multiple web properties, ensuring accurate data tracking and improved analytics capabilities.

Worked with Spark SQL and the Apache Spark RDD and DataFrame APIs, applying appropriate transformations and aggregations provided by Spark.

Involved in creating database objects like tables, views, stored procedures, triggers, packages, and functions using T-SQL to provide structure and maintain data efficiently.

Coordinated with the Kafka team and built an on-premises data pipeline. Supported Kafka integrations, performed performance tuning, and identified bottlenecks to improve performance and throughput.

Collaborated with cross-functional teams using Git and Jira for version control and task management, ensuring transparency, collaboration, and accountability throughout the development lifecycle.

Involved from project analysis to production implementation, with emphasis on identifying data validation, developing logic and transformations as per project requirements, and developing notebooks to load the data into Delta Lake.

Implemented Dynamic Data Masking, Transparent Data Encryption, TLS, Virtual Network Firewall rules.

Worked on Spark integration with Hive and DB2 at the ingestion layer to handle different file formats like Parquet and JSON.

Developed an automated Databricks workflow notebook in Python to run multiple data loads in parallel.
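A hedged sketch of one way to run such loads in parallel from a driver notebook, assuming Databricks' built-in dbutils.notebook utility; the notebook paths and parameters are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical child notebooks, one per source table
LOAD_NOTEBOOKS = [
    ("/Shared/loads/load_students", {"table": "students"}),
    ("/Shared/loads/load_courses", {"table": "courses"}),
    ("/Shared/loads/load_enrollments", {"table": "enrollments"}),
]

def run_load(path, params):
    # dbutils is available implicitly inside a Databricks notebook
    return dbutils.notebook.run(path, 3600, params)  # 3600 s timeout per child run

# Launch the child notebooks concurrently from the driver notebook
with ThreadPoolExecutor(max_workers=len(LOAD_NOTEBOOKS)) as pool:
    results = list(pool.map(lambda nb: run_load(*nb), LOAD_NOTEBOOKS))

print(results)
```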

Developed ETL pipelines using Apache PySpark, Spark SQL, and DataFrame APIs.

Created Airflow scheduling scripts (DAGs) in Python.
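A minimal sketch of such a scheduling script, using the Airflow 1.x-style imports consistent with the Airflow v1.9.0 listed in the environment below; the DAG id, schedule, and callables are illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    # Placeholder for the real extraction logic
    print("extracting source data")

def load():
    # Placeholder for the real load logic
    print("loading into the warehouse")

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_daily_etl",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task
```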

Proficient in writing complex SQL queries and leveraging Snowflake's advanced analytical functions for data transformations, aggregations, and business intelligence tasks.

Utilized Snowflake's security features, including role-based access control (RBAC), encryption, and external token-based authentication, to enforce data protection and compliance.

Demonstrated expertise in Apache Spark and Scala, leveraging these technologies to design, develop, and deploy large-scale data processing and analytics solutions.

Utilized Spark's distributed computing capabilities to perform ETL (Extract, Transform, Load) operations, batch processing, and real-time stream processing on massive datasets.

Leveraged Scala's functional programming features to write concise, expressive, and high-performance Spark applications, optimizing data processing workflows and improving efficiency.

Worked with Sqoop framework to load batch process data from distinctive data sources into Hadoop.

Identified required tables and views and exported them into Hive. Performed ad-hoc queries using Hive joins, partitioning, and bucketing techniques for faster data access.

Implemented various Hive queries for analytics. Created External tables, optimized Hive queries and improved the cluster performance by 30%.
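As an illustration of the external-table pattern above, run through Spark SQL; the database, table, columns, and storage path are assumptions for the sketch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-external-table-example")
         .enableHiveSupport().getOrCreate())

# External, partitioned table over curated files (names and location are placeholders)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS curated.enrollments (
        student_id STRING,
        course_id  STRING,
        credits    INT
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION 'abfss://curated@examplelake.dfs.core.windows.net/enrollments/'
""")

# Register newly landed partitions, then query by partition for faster access
spark.sql("MSCK REPAIR TABLE curated.enrollments")
spark.sql("""
    SELECT course_id, COUNT(*) AS enrolled
    FROM curated.enrollments
    WHERE load_date = '2022-09-01'
    GROUP BY course_id
""").show()
```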

Built a Delta Lake for the curated layer and made high-quality data available for the data science team.

Used various ADF activity types (data movement, transformation, and control activities), including Copy Data, Data Flow, Get Metadata, Lookup, Stored Procedure, and Execute Pipeline.

Created Hive tables as per requirements. Designed and implemented Hive queries and functions for evaluation, filtering, loading, and storing of data.

Worked with Azure Functions, Web Apps, and Logic Apps, and implemented encryption techniques.

Developed, maintained, and monitored data ingestion & CI/CD pipelines as per the design architecture.

Worked on integrating Git into the continuous integration (CI) environment along with Jenkins.

Worked with Azure DevOps for collaboration on code development and to build and deploy applications.

Environment: Microsoft Azure – Data Lake Storage (Gen1 & Gen2), Azure Data Factory (ADF), Azure SQL Database, Cosmos DB, Azure Function Apps, Azure Logic Apps, Azure Web Apps, Azure Blob Storage, Azure DevOps, Airflow v1.9.0, Git, Azure Databricks, Spark SQL, pandas, NumPy; Parquet files, Delta files; Cloudera Hadoop (CDH5), HDFS, Sqoop, Hive, Apache Spark, Python, IDMC, Snowflake.

Client: Texas Health Care, Arlington, TX Jun 2017 - May 2020

Senior Azure Data Engineer

Architected and implemented robust API services on the AWS cloud, harnessing Java, Scala, and Kotlin for high-performance and scalable solutions.

Led AWS cloud projects, specializing in PySpark, to process large-scale datasets efficiently and derive valuable insights, ensuring optimal performance on EMR and EC2 instances.

Implemented continuous monitoring solutions for PostgreSQL databases in AWS, using CloudWatch and other tools to proactively identify and address potential performance bottlenecks.

Designed and implemented ETL pipelines using Apache Spark and Scala, extracting data from various sources, applying transformations, and loading it into target data stores for analysis and reporting.

Utilized Spark SQL, DataFrame API, and Spark Streaming to perform complex data transformations, aggregations, and analytics tasks, ensuring data quality and consistency throughout the process.

Integrated ETL pipelines with other data processing technologies and tools such as Apache Kafka, Apache Hadoop, and Apache Hive, enabling end-to-end data processing and analytics solutions.

Conducted training sessions for team members on PostgreSQL best practices, AWS database services, and healthcare industry compliance requirements.

Leveraged AWS Step Functions to design and implement scalable and resilient serverless workflows, providing a seamless and reliable orchestration mechanism for distributed applications.
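A small hedged sketch of triggering such a workflow with boto3; the state machine ARN and input payload are placeholders:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Start a serverless workflow execution; ARN and payload are illustrative
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example-etl",
    input=json.dumps({"run_date": "2019-01-01"}),
)

print(response["executionArn"])
```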

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Demonstrated mastery in deploying and managing containerized applications on AWS, utilizing ECS to achieve optimal resource utilization and scalability.

Implemented comprehensive performance tuning strategies for AWS-hosted applications, ensuring optimal resource utilization and responsiveness.

Developed and maintained ETL pipelines using AWS Glue, SNS, S3, PySpark to process and analyze healthcare data.

Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.

Implemented automation processes using AWS services, streamlining deployment workflows, and reducing manual intervention.

Applied DevOps practices to enhance collaboration between development and operations teams.

Environment: AWS Glue, Athena, AWS Jupyter Notebooks, ECS Containers, IDMC, Network Load Balancer, API Gateway, GCP, Dataflow Jobs, Cloud Composer, Terraform, BigQuery, Python, Snowflake.

Client: Activision, CA Oct 2016 - May 2017

Data Engineer

Involved in developing batch and stream processing applications that require functional pipelining using Spark APIs.

Developed Databricks notebooks for data preparation (data cleansing, data validation, etc.) and applied transformations as per requirements.

Involved in building a data pipeline and performing analytics using AWS stack (EMR, EC2, S3, RDS, Lambda, SQS, Redshift).

Developed Sqoop jobs for data ingestion, incremental data loads from RDBMS to HDFS.

Utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake. Loaded data into S3 buckets, then filtered and loaded it into Hive external tables.
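A brief sketch of that S3-to-Hive load path in PySpark; the bucket, filter condition, and table name are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("s3-to-hive-example")
         .enableHiveSupport().getOrCreate())

# Read raw events from the S3 data lake (path is a placeholder)
raw = spark.read.parquet("s3a://example-datalake/raw/events/")

# Keep only valid, recent records before exposing them to analysts
filtered = raw.filter(
    F.col("event_type").isNotNull() & (F.col("event_date") >= "2016-10-01")
)

# Append into a Hive external table that was created separately
filtered.write.mode("append").insertInto("analytics.events_external")
```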

Involved in extracting and enriching multiple Cassandra tables using joins in Spark SQL. Also converted Hive queries into Spark transformations.

Developed various data loading strategies and performed various transformations for analyzing the datasets by using Hortonworks Distribution for Hadoop ecosystem.

Working knowledge in creation and modification of SQL stored procedures, functions, views, indexes, and triggers.

Fetched live data from an Oracle database using Spark Streaming and Kinesis, fed from an API Gateway REST service.

Automated the process of transforming and ingesting terabytes of monthly data using Kinesis, S3, and Oozie. Performed ETL operations using Python, Spark SQL, S3 and Redshift on terabytes of data to obtain customer insights.

Developed Oozie workflows for scheduling and orchestrating the ETL process.

Experience with Apache Hadoop big data components like HDFS, MapReduce, YARN, Hive, HBase, and Sqoop.

Analyzed and optimized pertinent data stored in Snowflake using PySpark and Spark SQL.

Managed and deployed configurations for the entire datacenter infrastructure using Terraform.

Experience with analytical reporting and facilitating data for Tableau dashboards.

Used Git for version control and Jira for project management and tracking issues and bugs.

Used StreamSets for analytics and was involved in debugging and optimizing data pipelines that collect logs and metrics from various application APIs.

Environment: Databricks, Python, SQL, Hadoop 2.x, Hive, Hortonworks HDP 2.0, Snowflake, Spark, AWS, AWS EC2, AWS S3, Lambda, SQS, Redshift, ECS, Sqoop, Kinesis, Oozie, HBase, Oracle, Terraform, Ansible, StreamSets, Cassandra, Tableau, Maven, Git, Jira.

Client: Value Labs, Hyderabad, India March 2014 - Jul 2016

Hadoop Developer

Involved in import and export of data between Hadoop Data Lake and Relational Systems like MySQL, Oracle using Sqoop.

Working knowledge of creating Kafka topics and partitions and writing custom partitioner classes.

Experienced in writing Spark applications in Python (PySpark).

Made use of the AWS environment to launch the applications in different regions and implemented CloudFront with AWS Lambda to reduce latency.

Worked and acquired a lot of knowledge in Amazon Web Services (AWS) cloud services like EC2, S3, EMR, EBS, RDS and VPC.

Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them to RDDs, processing them, and storing the results in Cassandra.
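A hedged sketch of that Kafka-to-Cassandra flow, assuming Spark 2.x with the spark-streaming-kafka and spark-cassandra-connector packages on the classpath; the topic, brokers, message schema, keyspace, and table are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka-0-8 package

spark = SparkSession.builder.appName("kafka-to-cassandra-example").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=10)

# Direct DStream from Kafka (topic and broker list are placeholders)
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"}
)

def save_batch(rdd):
    if rdd.isEmpty():
        return
    # Kafka messages arrive as (key, value) pairs; parse CSV values into columns
    rows = rdd.map(lambda kv: kv[1].split(",")).map(
        lambda f: (f[0], f[1], float(f[2]))
    )
    df = spark.createDataFrame(rows, ["player_id", "event_type", "score"])
    # Write via the spark-cassandra-connector (keyspace and table are placeholders)
    (df.write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="game", table="events").mode("append").save())

stream.foreachRDD(save_batch)

ssc.start()
ssc.awaitTermination()
```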

Configured, deployed, and maintained multi-node Dev and Test Kafka Clusters.

Processed and transferred the data from Kafka into HDFS through Spark Streaming APIs.

Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.

Built Cassandra nodes on AWS and set up the Cassandra cluster using Ansible automation tools.

Developed Oozie Bundles to Schedule Pig, Sqoop and Hive jobs to create data pipelines.

Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.

Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop.

Wrote a Storm topology to emit data into Cassandra.

Involved in the process of Cassandra data modelling and building efficient data structures.

Environment: Apache Hadoop, Hive, Impala, Snowflake, Oracle, Spark, Python, Pig, Sqoop, Oozie, MapReduce, Git, HDFS, Cassandra, Apache Kafka, Storm, Bitbucket, Linux, Tableau, Jenkins, Jira.


