
Azure Data Engineer

Location:
Las Vegas, NV
Posted:
February 22, 2025

Resume:

Venkateshwara Raju

Email id: ***************************@*****.***

PH: +1-478-***-****

Senior Data Engineer

Professional summary:

5+ years of IT experience across a variety of industries working on Big Data using Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.

Fluent programming experience with Scala, Java, Python, SQL, and T-SQL, with hands-on experience developing and deploying enterprise applications using major Hadoop ecosystem components such as MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.

Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities in Scala. Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.

Hands-on experience with Azure analytics services: Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure Data Factory (ADF), Azure Databricks (ADB), etc.

Experience working with data warehouses such as Oracle and SAP HANA, and databases such as Azure SQL DB and Azure SQL DW.

Experience building ETL data pipelines in Azure Databricks leveraging PySpark and Spark SQL, building orchestration in Azure Data Factory for scheduling, and working with the Azure Logic Apps integration tool.
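
For illustration, a minimal sketch of the kind of PySpark/Spark SQL ETL step described above; the paths, table, and column names are hypothetical and not taken from any actual project:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical ETL step: read raw files, clean them, aggregate, and write a curated table.
    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    raw = spark.read.parquet("/mnt/raw/orders")                      # assumed source path
    cleaned = (raw
               .filter(F.col("order_status").isNotNull())            # drop incomplete rows
               .withColumn("order_date", F.to_date("order_ts")))     # derive a date column

    cleaned.createOrReplaceTempView("orders_clean")
    daily = spark.sql("""
        SELECT order_date, COUNT(*) AS order_count
        FROM orders_clean
        GROUP BY order_date
    """)

    daily.write.mode("overwrite").parquet("/mnt/curated/daily_orders")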

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, AWS Lambda, EMR and other services of the AWS family.

Created and maintained CI/CD (continuous integration and deployment) pipelines, applied automation to environments and applications, and worked with automation tools such as Git, Terraform, and Ansible.

Technical Skills:

Data Engineering Tools: Azure Data Factory, Azure Synapse Analytics, Databricks, AWS Glue, Amazon Redshift, Google Cloud Dataflow, BigQuery.

Data Visualization: Power BI, Tableau, Kibana.

CI/CD Tools: Azure DevOps, Jenkins, Maven.

Data Management: SQL, NoSQL, Cosmos DB, Cassandra, MySQL, PostgreSQL, MongoDB.

Programming & Scripting: Python (Pandas, PySpark), Scala, .NET, Java, PowerShell.

Data Processing & Analytics: Spark, Hadoop, Databricks, Machine Learning, Stream Analytics, Power BI.

Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake, Spark components.

Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB.

Education: Master's degree from Trine University, MI, USA.

Project experience:

Client: Starlegacy IT Solutions, Jan 2023 – Present

Role: Azure Data Engineer

Responsibilities:

Developing, deploying, and maintaining scalable and efficient data pipelines on Azure to support various healthcare data needs. This includes integrating data from multiple sources, ensuring data quality, and making it available for analytics, reporting, and operational processes.

Utilizing Azure services like Azure Data Factory, Azure Synapse Analytics, and Azure Databricks to process and transform large volumes of healthcare data.

Ensured data integrity by implementing robust validation protocols and data quality checks throughout clinical trials.

Designed and optimized data capture mechanisms to streamline patient data collection processes in clinical trials.

Developed systems for anomaly detection, enabling early identification of irregular patterns in clinical trial data.

Collaborated with research teams to perform metadata analysis on clinical trial datasets, ensuring proper documentation and organization.

Focused on patient safety by analyzing and managing clinical trial data to identify any potential safety concerns.

Leveraged AI technologies to improve data processing and predictive analytics in clinical trials.

Communicated complex data insights to stakeholders, enhancing communication skills in cross-functional teams.

Maintained thorough documentation of data processes, ensuring transparency and traceability in clinical trial workflows.

Ensuring that all data management processes comply with healthcare regulations (e.g., HIPAA) and UC Health’s internal security policies. Implementing robust security measures, including encryption, access controls, and monitoring.

Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and integrating them with other Azure services. Migrated on-premises data (Oracle/Teradata) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).

Designed optimal pipeline architecture on the Azure platform and created ADF pipelines to ingest data from different source systems and transform it using various activities.

Used Azure DevOps and Jenkins pipelines to build and deploy different resources (code and infrastructure) in Azure.

Created numerous pipelines in Azure Data Factory v2 to ingest data from disparate source systems using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks. Maintained and supported optimal pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks. Automated jobs using Event, Schedule, and Tumbling Window triggers in ADF.
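
As a rough illustration of the kind of PySpark transformation run from Databricks inside such ADF pipelines, the sketch below deduplicates an incremental load with a window function; the Delta paths and column names are hypothetical:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adf-databricks-transform").getOrCreate()

    # Hypothetical: keep only the latest record per business key from a staging load.
    updates = spark.read.format("delta").load("/mnt/staging/member_updates")   # assumed path
    latest = (updates
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("member_id").orderBy(F.col("updated_at").desc())))
              .filter(F.col("rn") == 1)
              .drop("rn"))

    latest.write.mode("overwrite").format("delta").save("/mnt/curated/members")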

Created CI/CD pipelines in Azure to deploy ADF pipelines that ingest data from different source systems and transform it using various activities.

Created and provisioned the Databricks clusters needed for batch and continuous streaming data processing, and installed the required libraries on the clusters.
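
A minimal Structured Streaming sketch of the continuous processing mentioned above, assuming the Kafka connector library is installed on the cluster; the broker, topic, and paths are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Hypothetical continuous ingestion: read a Kafka topic and land it as Delta files.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
              .option("subscribe", "events-topic")                 # placeholder topic
              .load()
              .select(F.col("value").cast("string").alias("payload")))

    query = (events.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/events")
             .start("/mnt/raw/events"))
    query.awaitTermination()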

Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
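
A hedged sketch of such an ingestion path in Spark: a JDBC read from Oracle followed by a write through the MongoDB Spark connector (v10.x "mongodb" format). Connection strings, credentials, and table names are placeholders, and both the Oracle JDBC driver and the MongoDB connector are assumed to be available on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-mongo-sketch").getOrCreate()

    # Hypothetical: pull a table from Oracle over JDBC.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB")  # placeholder URL
              .option("dbtable", "SALES.ORDERS")                          # placeholder table
              .option("user", "etl_user")
              .option("password", "********")
              .load())

    # Write the DataFrame to a MongoDB collection.
    (orders.write.format("mongodb")
     .option("connection.uri", "mongodb://mongo-host:27017")
     .option("database", "staging")
     .option("collection", "orders")
     .mode("append")
     .save())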

Developed custom UDFs in Python for use in Spark SQL. Utilized Elasticsearch and Kibana for indexing and visualization.
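
A minimal sketch of a Python UDF registered for use from Spark SQL; the function, view, and column names are made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    # Hypothetical UDF: normalize free-text status codes before indexing.
    def normalize_status(value):
        return value.strip().upper() if value else None

    spark.udf.register("normalize_status", normalize_status, StringType())  # callable from SQL

    df = spark.createDataFrame([(" active ",), ("Closed",)], ["status"])
    df.createOrReplaceTempView("records")
    spark.sql("SELECT normalize_status(status) AS status FROM records").show()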

Environment: API, Azure, Azure Analysis Services, Azure Synapse Analytics, Azure Blob Storage, Cassandra, CI/CD, Azure Data Factory, Elasticsearch, ETL, HBase, HDFS, Java, Jenkins, Kafka, Kubernetes, Azure Data Lake, Oracle, Power BI, PySpark, Python, RDBMS, Snowflake, Spark, Spark Streaming, SQL, SSIS, Teradata.

Client: Jaguar Land Rover, India, July 2018 – Jan 2022

Role: GCP Data Engineer

Description: Unilever is a global consumer goods company known for its diverse portfolio of products in food, beverages, cleaning agents, beauty, and personal care. Contributed to the design and optimization of data architectures and pipelines to enhance data analytics and reporting capabilities, facilitating seamless data integration and migration to cloud platforms.

Responsibilities:

Developing and designing scalable, secure, and efficient data architectures on Google Cloud Platform (GCP) to support the company's data analytics, reporting, and business intelligence needs.

Creating and maintaining data pipelines using GCP services such as Google Cloud Dataflow and Google Cloud Dataproc with Apache Beam to process and integrate large volumes of data from multiple sources.
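
A minimal Apache Beam (Python SDK) sketch of such a pipeline targeting the Dataflow runner; the project, bucket, and table names are placeholders, not real resources:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical pipeline: parse newline-delimited JSON events and append them to BigQuery.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events-*.json")
         | "Parse" >> beam.Map(json.loads)
         | "Write" >> beam.io.WriteToBigQuery(
             "my-project:analytics.events",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))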

Implementing and managing data warehouses on GCP, such as Google BigQuery, to store structured and unstructured data, enabling advanced analytics and reporting.
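
For illustration, a small query against such a BigQuery warehouse using the google-cloud-bigquery client; the project, dataset, and column names are hypothetical:

    from google.cloud import bigquery

    # Hypothetical reporting query over a warehouse table.
    client = bigquery.Client(project="my-project")

    query = """
        SELECT product_category, COUNT(*) AS record_count
        FROM `my-project.warehouse.records`
        GROUP BY product_category
        ORDER BY record_count DESC
    """

    for row in client.query(query).result():
        print(row.product_category, row.record_count)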

Integrating data from various internal and external sources, including policy management systems, customer databases, and third-party APIs, ensuring seamless data flow across the organization.

Involved in migrating the on-prem Hadoop system to GCP (Google Cloud Platform).

Worked on analysis and understanding of data from different domains in order to integrate it into the Data Marketplace.

Developed PySpark programs, created data frames, and worked on transformations.

Worked with AWS and GCP clouds, using GCP Cloud Storage, Dataproc, Dataflow, and BigQuery, as well as AWS EMR, S3, Glacier, and EC2 with EMR clusters.

Continuously optimizing data processes, improving performance, and reducing costs by fine-tuning GCP resources and services, with a focus on supporting fintech applications and SaaS solutions.

Analyzed data using PySpark and Hive based on ETL mappings, and implemented solutions tailored for financial technology and SaaS use cases.
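
A minimal PySpark-over-Hive sketch of the kind of analysis described above; the database, table, and columns are illustrative placeholders:

    from pyspark.sql import SparkSession

    # Hypothetical analysis of a Hive table from PySpark.
    spark = (SparkSession.builder
             .appName("hive-analysis-sketch")
             .enableHiveSupport()
             .getOrCreate())

    summary = spark.sql("""
        SELECT channel, SUM(amount) AS total_amount
        FROM finance_db.transactions
        GROUP BY channel
    """)
    summary.show()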

Experienced working with Azure services such as Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Logic Apps, and SQL Data Warehouse, and with GCP services such as BigQuery, Dataproc, Pub/Sub, etc.

Experienced in implementing continuous delivery pipelines with Maven, Ant, Jenkins, and GCP. Experienced in Hadoop and Hadoop ecosystem components.

Developed multi-cloud strategies to make better use of GCP (for its PaaS offerings) and gained experience migrating legacy systems to GCP technologies.

Environment: GCP, PySpark, Dataproc, BigQuery, fintech, SaaS, AWS, Hadoop, Hive, GCS, Python, Snowflake, DynamoDB, Oracle Database, Power BI, SDKs, Dataflow, Glacier, EC2, EMR cluster, SQL Database, Synapse, Databricks.


