Big Data Engineer

Location:
Irving, TX
Posted:
March 14, 2024

Venkata Madisa

Sr Data Engineer

ad4b4c@r.postjobfree.com

Mobile: 682-***-****

Professional Summary:

Over 9 years of experience in Data Engineering, with highly proficient knowledge of Data Analysis and Big Data.

Experienced in Big Data work on Hadoop, Spark, PySpark, Hive, HDFS, and other NoSQL platforms.

Azure specialist adept at architecting and deploying cloud-native solutions. Proficient in designing and optimizing Azure services, ensuring high-performance data pipelines, scalability, and robust security measures for impactful technical outcomes.

Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.

Experienced in technical consulting and end-to-end delivery, covering architecture, data modeling, data governance, and the design, development, and implementation of solutions.

Multi-Channel Marketing Automation experience on Salesforce Marketing Cloud: leveraging Journey Builder and Automation Studio, and orchestrating campaigns across email, mobile, social, and web platforms.

Customer Segmentation and Personalization: data extensions and SQL queries for precise segmentation, with dynamic content and personalization based on preferences and behaviors.

Analytics and Reporting Capabilities: Einstein Analytics for data-driven decision-making, and customizable dashboards and reports for campaign performance and customer insights.

Hands-on experience with Google Cloud services such as GCP, BigQuery, GCS Buckets, and G-Cloud Functions. Experienced in Informatica ILM (Information Lifecycle Management) and its tools.

Efficient in all phases of the development lifecycle, including Data Cleansing, Data Conversion, Data Profiling, Data Mapping, Performance Tuning, and System Testing.

Experience in Big Data Hadoop Ecosystem in ingestion, storage, querying, processing and analysis of Big data.

Good understanding of Ralph Kimball (Dimensional) and Bill Inmon (Relational) modeling methodologies.

Experienced working extensively on Master Data Management (MDM) and the applications used for MDM. Experience in transferring data from AWS S3 to AWS Redshift using Informatica.

Good knowledge of SQL queries and of creating database objects such as stored procedures, triggers, packages, and functions using SQL and PL/SQL to implement business logic.

Supported ad-hoc business requests, developed stored procedures and triggers, and extensively used Quest tools such as TOAD. Good understanding of and exposure to Python programming.

Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.

Experience in migrating data using Sqoop between HDFS/Hive and relational database systems, and vice versa, according to client requirements.

Extensive experience working with business users/SMEs as well as senior management.

Strong experience using MS Excel and MS Access to load and analyze data based on business needs.

Proficient in Data Analysis, gathering business requirements, and handling requirements management.

Technical Skills:

Hadoop: Spark, Hive, Oozie, Sqoop, Kafka, HDFS, YARN, Zeppelin, and HBase

AWS: EMR, Glue, Athena, DynamoDB, Redshift, RDS, Data Pipelines, Lake Formation, S3, IAM, CloudFormation, EC2, ELB/CLB

Operating systems: Amazon Linux 1 and 2, custom AMIs based on Amazon Linux with encryption, Windows, CentOS, RHEL

Programming languages: Java, Python, Scala, Spark, Glue ETL

Web: Servlets, JSP, Spring MVC, Spring Boot, Hibernate

Frontend: HTML, XML, ReactJS, AngularJS, NodeJS

Databases: DynamoDB, HBase, Teradata, MongoDB, MySQL, SQL Server 2008, PostgreSQL

Version control: Git, SVN, SourceTree

Scripting languages: Shell scripting, PowerShell, Bash

DevOps platforms: Docker, Jenkins, *, Ansible

Streaming platforms: Kafka, Confluent Kafka

Azure: Data Lakes, Data Factory, SQL Data Warehouse, Data Lake Analytics, Databricks, and other Azure services

Certifications: Advanced Excel with MS Excel, Azure Data Engineer, Google Professional Data Analytics

WORK EXPERIENCE:

Client: PNC, Pittsburgh, PA March 2022 – Present

Position: Senior AWS Data Engineer

Responsibilities:

Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB; deployed AWS Lambda code from Amazon S3 buckets and created a Lambda function configured to receive events from the S3 bucket.
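
For illustration, a minimal sketch of an event-driven Lambda handler of this kind, assuming a hypothetical DynamoDB table and S3 event source (all names are placeholders, not taken from the project):

```python
import json
import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
# "file-metadata" is a placeholder table name used only for illustration.
table = dynamodb.Table("file-metadata")


def lambda_handler(event, context):
    """Triggered by S3 object-created events; records each new object in DynamoDB."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table.put_item(Item={"object_key": key,
                             "bucket": bucket,
                             "size": record["s3"]["object"].get("size", 0)})
    return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}
```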

Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definitions of key business elements from Aurora.

Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loads.

Created notebooks in Azure Databricks and integrated them with ADF to automate the workflow.

Wrote code that optimizes the performance of AWS services used by application teams and provided code-level application security for clients (IAM roles, credentials, encryption, etc.).

Created AWS Lambda functions using Python for deployment management in AWS; designed and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.

Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Python to perform event-driven processing.

Designed various Azure Data Factory pipelines to pull data from various data sources and load it into Azure SQL Database.

Developed batch processing solutions using Data Factory and Azure Databricks.

Developed Spark code using Python and Spark-SQL for faster testing and processing of data.
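
For context, a small PySpark sketch of this Python-plus-Spark-SQL pattern; the source path and the order_date/amount columns are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Illustrative source path; the real pipelines read from project-specific locations.
orders = spark.read.parquet("s3a://example-bucket/orders/")
orders.createOrReplaceTempView("orders")

# The same aggregation expressed once in Spark SQL and once with the DataFrame API.
daily_sql = spark.sql("""
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")
daily_df = (orders.groupBy("order_date")
                  .agg(F.count("*").alias("order_count"),
                       F.sum("amount").alias("total_amount")))
daily_sql.show(5)
```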

Worked on analyzing the AT&T Inventory, Expenses, and orders data, performed data cleansing/engineering, and provided the BRD to the business.

Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.

Created Databricks notebooks and scheduled a Spark job to extract data from files in ADLS Gen2.

Experience in supporting data analysis projects using Elastic Map Reduce (EMR) on the AWS cloud.

Developed batch scripts to fetch data from ADLS storage and perform the required transformations in PySpark using the Spark framework.

Designed and implemented highly performant data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks

Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames in AWS Glue.
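
A hedged sketch of how a Hive-style aggregate query can be expressed as DataFrame transformations inside a Glue job; the catalog database, table, and column names are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Placeholder catalog database/table; DataFrame equivalent of:
#   SELECT customer_id, SUM(amount) AS total FROM sales WHERE year = '2022' GROUP BY customer_id
sales = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="sales").toDF()

totals = (sales.filter(F.col("year") == "2022")
               .groupBy("customer_id")
               .agg(F.sum("amount").alias("total")))

totals.write.mode("overwrite").parquet("s3://example-bucket/output/sales_totals/")
job.commit()
```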

Experience in creation of Snowflake objects like databases, schemas, tables, stages, sequences, views.

Worked on extracting data from various file formats such as JSON, XML, CSV, and Parquet.

Creating data validation checks and scripts to ensure high data quality and availability.

Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.

Gather business requirements and design and develop data ingestion layer and presentation layer.

Created data pipelines for the ingestion, aggregation, and loading of consumer response data from an AWS S3 bucket into Hive external tables in HDFS, serving as the feed for an AWS QuickSight dashboard.

Created various AWS Lambda functions and API Gateways so that data submitted via API Gateway is accessible to the Lambda functions.

Worked in a SAFe (Scaled Agile Framework) team with daily standups, sprint planning, and quarterly planning.

Developed JSON scripts for deploying the pipeline in Azure Data Factory that processes the data using the Cosmos Activity.

Created complex ETL Azure Data Factory pipelines using mapping data flows with multiple input/output transformations, schema modifier transformations, and row modifier transformations using Scala expressions.

Worked on Azure Databricks to run Spark-Python notebooks through ADF pipelines.

Responsible for building CloudFormation templates for SNS, SQS, Elasticsearch, DynamoDB, Lambda, EC2, VPC, RDS, S3, IAM, and CloudWatch service implementations, integrated with Service Catalog.

Implemented Azure Databricks clusters, notebooks, jobs, and autoscaling.

Performed regular monitoring activities on Unix/Linux servers, such as log verification, server CPU usage, memory checks, load checks, and disk space verification, to ensure application availability and performance using CloudWatch and AWS X-Ray. Implemented the AWS X-Ray service inside Confidential, allowing development teams to visually detect node and edge latency distribution directly from the service map.

Developed dynamic Data Factory pipelines using parameters and triggered them as desired using events such as file availability on Blob Storage, on a schedule, or via Logic Apps.

Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 (ORC/Parquet/Text files) into AWS Redshift.

Utilized Python libraries such as Boto3 and NumPy for AWS.

Used Amazon EMR for MapReduce jobs and test locally using Jenkins.

Created external tables with partitions using Hive, AWS Athena and Redshift.

Developed the PySpark code for AWS Glue jobs and for EMR.

Good understanding of other AWS services such as S3, EC2, IAM, and RDS; experience with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.

Extensively worked on Azure Data Lake Analytics with the help of Azure Databricks to implement SCD-1 and SCD-2 approaches.

Experience in writing SAM template to deploy serverless applications on AWS cloud.

Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.

Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts from enterprise data, automated using Oozie.

Worked on cloud deployments using Maven, Docker, and Jenkins.

Designed and coordinated with the Data Science team in implementing advanced analytical models in a Hadoop cluster over large datasets.

Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
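
A minimal sketch of this mini-batch pattern using the classic Spark Streaming (DStream) API; the socket source and 10-second batch interval are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, batchDuration=10)  # 10-second mini-batches

# Placeholder socket source; the actual jobs read from project-specific streams.
lines = ssc.socketTextStream("localhost", 9999)

# RDD-style transformations applied to each mini-batch.
event_counts = (lines.map(lambda line: (line.split(",")[0], 1))
                     .reduceByKey(lambda a, b: a + b))
event_counts.pprint()

ssc.start()
ssc.awaitTermination()
```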

Developed Custom Email notification in Azure Data Factory pipelines using Logic Apps and standard notifications using Azure Monitor.

Created monitors, alarms, notifications and logs for Lambda functions, Glue Jobs, EC2 hosts using CloudWatch

Used AWS Glue for data transformation, validation, and cleansing.

Hands-on experience with Azure Analytics Services - Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure Data Factory (ADF), Azure Databricks (ADB), etc.

Used Python Boto3 to configure AWS services such as Glue, EC2, and S3.
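
An illustrative Boto3 snippet along these lines; the job, instance filter, bucket, and key names are hypothetical:

```python
import boto3

glue = boto3.client("glue")
ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# Trigger a Glue job run (job name is a placeholder).
run = glue.start_job_run(JobName="example-etl-job")
print("Started Glue run:", run["JobRunId"])

# List running EC2 instances.
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}])

# Upload a local file to S3 (bucket and key are placeholders).
s3.upload_file("/tmp/report.csv", "example-bucket", "reports/report.csv")
```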

Worked closely with data engineers to define ETL processes, data transformations, and data integration workflows.

Environment: Teradata Data Warehouse, Data Engineering, Teradata, Snowflake, PostgreSQL, SQL Server Integration Services (SSIS), Data Analysis, SQL Server Reporting Services (SSRS), Google Cloud Platform (GCP), Hadoop.

Client: Verizon Wireless, New Jersey May 2021 – Feb 2022

Position: Azure Data Engineer

Responsibilities:

-Designed and implemented scripts and indexing strategies for migrating data to Azure Redshift from SQL Server and MySQL databases.

-Utilized Azure Data Factory and Azure Databricks for data ingestion from Azure Blob Storage and performed SQL query operations.

-Configured data loads from Azure Blob Storage into Azure Redshift using Azure Data Factory.

-Implemented JSON schema for defining table and column mapping from Azure Blob Storage to Azure Redshift.

-Developed indexing and data distribution strategies optimized for sub-second query response in Azure Redshift.

-Developed a statistical model using artificial neural networks to enhance the admission process for ranking students.

-Created Triggers, PowerShell scripts, and parameter JSON files for deployment in Azure environments.

-Implemented end-to-end data solutions on Azure using Azure Databricks, Azure Data Factory, Azure SQL Data Warehouse, and Power BI.

-Developed PySpark code for Azure Databricks jobs.

-Proficient in Azure services like Azure Blob Storage, Azure SQL Database, Azure Data Lake Storage, and Azure Data Factory.

-Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store using Azure Data Factory.

-Experience in writing ARM templates for deploying serverless applications on the Azure cloud.

-Created pipelines to load data from Azure Data Lake into staging SQL databases and Azure SQL Database.

-Utilized Python scripts for processing semi-structured data formats like JSON in Azure environments.

-Created on-demand tables on Azure Blob Storage files using Azure Functions and Azure Data Factory.

-Experienced in writing Spark applications in Python for Azure Databricks.

-Designed and developed schema data models for Azure environments.

-Managed Hadoop clusters on Azure HDInsight.

-Imported data from Azure Blob Storage into Spark RDDs and performed transformations and actions on RDDs.
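
A small PySpark sketch of this RDD workflow, assuming the cluster is already configured with credentials for the storage account and that the CSV layout carries a level field in its third column (container, account, and path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-rdd-example").getOrCreate()
sc = spark.sparkContext

# Placeholder container/account; assumes the cluster already has
# fs.azure.account.key settings (or equivalent) for this storage account.
path = "wasbs://example-container@exampleaccount.blob.core.windows.net/events/*.csv"

lines = sc.textFile(path)                          # load blobs into an RDD
events = lines.map(lambda line: line.split(","))   # transformation
error_count = events.filter(lambda cols: cols[2] == "ERROR").count()  # action
print("Error events:", error_count)
```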

-Designed and deployed ETL workflows using Azure Data Factory and SSIS packages.

-Gathered business requirements and worked with stakeholders to define data mapping and transformation logic.

-Ingested data from RDBMS sources and exported transformed data to Cassandra in Azure environments.

-Worked with version control systems like Azure DevOps (formerly Team Foundation Server) for code management.

-Built Azure Data Factory pipelines to extract data from various sources and load it into Azure SQL Database.

-Recreated application logic and functionality in Azure Data Lake, Azure Data Factory, Azure SQL Database, and Azure SQL Data Warehouse.

-Developed MapReduce programs for data extraction, transformation, and aggregation from various file formats in Azure environments.

-Implemented automated processes for flattening JSON data from upstream sources using Azure Databricks and Hive UDFs.
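
A hedged sketch of this kind of JSON flattening in PySpark, using built-in functions (explode and nested-field selection) rather than custom Hive UDFs; the ADLS paths and the nested address/orders schema are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Placeholder path; upstream files are assumed to be newline-delimited JSON
# with a nested "address" struct and an "orders" array.
raw = spark.read.json("abfss://example@exampleaccount.dfs.core.windows.net/raw/customers/")

flat = (raw.select("customer_id",
                   F.col("address.city").alias("city"),   # nested struct field
                   F.explode("orders").alias("order"))     # one row per array element
           .select("customer_id", "city",
                   F.col("order.order_id").alias("order_id"),
                   F.col("order.amount").alias("amount")))

flat.write.mode("overwrite").parquet(
    "abfss://example@exampleaccount.dfs.core.windows.net/curated/orders/")
```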

-Ingested data from Azure Data Lake into Azure Data Warehouse using Azure Data Factory and Azure Databricks.

-Performed data cleaning and preparation on XML files in Azure environments.

-Built analytical dashboards to track student records and GPAs using Azure services.

-Utilized deep learning frameworks like TensorFlow and Keras to build deep learning models for clients.

-Participated in requirements meetings and data mapping sessions to understand business needs in Azure environments.

Environment: Hadoop, AWS EMR, EC2, S3, Athena, Glue, DBT, Apache Spark, Airflow, Docker, PySpark, SparkSQL, Python (OOP), SQL, Kafka, HBase, Hive, Pig, UNIX, Shell scripting, Tableau, Git, Jenkins, Jira.

Client: Microsoft, Kentucky March 2019 – April 2021

Position: Senior Data Engineer

Responsibilities:

Led the design and implementation of a scalable data processing pipeline on AWS using Sagemaker, Lambda functions, and S3 buckets, resulting in a 30% reduction in processing time and significant cost savings. Employed Python for ETL tasks, ensuring efficient data extraction, transformation, and loading processes.

Spearheaded the integration of Machine Learning Operations (MLOps) practices into the client's infrastructure, leveraging AWS Sagemaker for model training, deployment, and monitoring. Developed CI/CD pipelines using industry-standard tools, ensuring seamless delivery of machine learning models into production environments.

Acted as a key technical advisor to the Microsoft client, providing expertise in AWS services, Python programming, and ETL best practices. Collaborated with cross-functional teams to optimize data workflows and enhance system reliability, ultimately driving improvements in data quality and decision-making capabilities.

Developed a PySpark job to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
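
A minimal sketch of this pattern, combining a PySpark CSV write to S3 with Boto3 bucket and prefix management; bucket names and paths are placeholders, and the cluster is assumed to have the s3a connector and region configuration in place:

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-s3").getOrCreate()

# Read incoming CSV files and land them in S3 (paths are placeholders).
df = spark.read.option("header", "true").csv("hdfs:///data/incoming/*.csv")
df.write.mode("overwrite").option("header", "true").csv("s3a://example-bucket/curated/daily/")

# Simple bucket/"folder" management with Boto3 (S3 folders are just key prefixes;
# region configuration is omitted for brevity).
s3 = boto3.client("s3")
s3.create_bucket(Bucket="example-logs-bucket")
s3.put_object(Bucket="example-logs-bucket", Key="logs/")  # empty prefix acts as a folder
for obj in s3.list_objects_v2(Bucket="example-bucket", Prefix="curated/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```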

Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

Developed a daily process to do incremental imports of data from DB2 and Teradata into Hive tables using Sqoop.

Analyzed the SQL scripts and designed the solution to implement using Spark.

Worked on importing metadata into Hive using Python, migrated existing tables and the data pipeline from the legacy environment to the AWS cloud (S3), and wrote Lambda functions to run the data pipeline in the cloud.

Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

Extensively worked with partitions, dynamic partitioning, and bucketing of tables in Hive; designed both managed and external tables and worked on the optimization of Hive queries.
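
An illustrative Spark-with-Hive sketch of dynamic partitioning on an external table; the table names, staging data, and HDFS location are placeholders, and a bucketed variant is noted in a comment:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-partitioning")
         .enableHiveSupport().getOrCreate())

# Allow dynamic partition values to come from the data itself.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Tiny stand-in for the real staging data.
staging = spark.createDataFrame(
    [("C1", "CUST9", 120.0, "2024-03-01"), ("C2", "CUST7", 85.5, "2024-03-02")],
    ["claim_id", "customer_id", "amount", "load_date"])
staging.createOrReplaceTempView("claims_staging")

# External table partitioned by load_date; a bucketed variant would add
# CLUSTERED BY (customer_id) INTO 16 BUCKETS to the DDL.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS claims_ext (
        claim_id STRING, customer_id STRING, amount DOUBLE)
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///warehouse/external/claims'
""")

# Dynamic-partition insert: one partition per distinct load_date in the data.
spark.sql("""
    INSERT OVERWRITE TABLE claims_ext PARTITION (load_date)
    SELECT claim_id, customer_id, amount, load_date FROM claims_staging
""")
```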

Designed, developed and created ETL (Extract, Transform and Load) packages using Python to load data into Data warehouse tools (Teradata) from databases such as Oracle SQL Developer, MS SQL Server.

Utilized the built-in Python json module to parse member data in JSON format using json.loads and json.dumps, and loaded it into a database for reporting.
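
A short sketch of that json.loads/json.dumps flow, with sqlite3 standing in for the reporting database and a made-up member record:

```python
import json
import sqlite3

# Example payload; real member records arrive from upstream feeds.
raw = '{"member_id": "M123", "name": "Jane Doe", "plans": ["dental", "vision"]}'

member = json.loads(raw)                  # JSON text -> Python dict
plans_json = json.dumps(member["plans"])  # nested list back to JSON text for storage

# sqlite3 stands in for the reporting database used on the project.
conn = sqlite3.connect("reporting.db")
conn.execute("CREATE TABLE IF NOT EXISTS members (member_id TEXT, name TEXT, plans TEXT)")
conn.execute("INSERT INTO members VALUES (?, ?, ?)",
             (member["member_id"], member["name"], plans_json))
conn.commit()
conn.close()
```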

Collaborated with data scientists and ML engineers at Microsoft to understand model requirements and develop scalable data pipelines and infrastructure on AWS cloud platform, supporting various AI and ML initiatives across Microsoft's product portfolio.

Designed and implemented optimized data ingestion processes on AWS, leveraging services like Amazon S3, Amazon RDS, and Amazon Redshift to efficiently collect data from diverse sources such as Azure, third-party APIs, and streaming platforms, ensuring data availability and accessibility for ML model development.

Built and maintained data transformation and cleansing routines on AWS using technologies like AWS Glue, Apache Spark, and AWS Lambda, ensuring data quality and consistency for ML model training and inference while adhering to Microsoft's data governance and compliance standards.

Used Python libraries and SQL queries/subqueries to create several datasets that produced statistics, tables, figures, charts, and graphs; good experience of software development using IDEs such as PyCharm and Jupyter Notebook.

Worked on bash scripting to automate the Python jobs for day-to-day administration.

Performed data extraction and manipulation over large relational datasets using SQL, Python, and other analytical tools.

Extensively worked with Teradata utilities such as BTEQ, FastExport, FastLoad, and MultiLoad to export and load Claims & Callers data to/from different source systems, including flat files.

Environment: AWS EMR, AWS Glue, Redshift, Hadoop, HDFS, Teradata, SQL, Oracle, Hive, Spark, Python, Sqoop, MicroStrategy, Excel, Guidewire (GW), Nextgen (NG), M1, Phoenix (PHX), Customer Data Management (CDM).

Client: Zen3, Hyderabad, India Nov 2017 – Feb 2019

Position: Hadoop Engineer

Responsibilities:

Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.

Worked on building data warehouse structures and creating facts, dimensions, and aggregate tables through dimensional modeling with Star and Snowflake schemas.

Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.

Involved in creating AWS pipelines by extracting customers' Big Data from various data sources into Hadoop HDFS, including data from Excel, flat files, Oracle, SQL Server, Teradata, and server log data.

Developed a Python script to validate that the files in one S3 bucket match the same daily files in another bucket.
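
A hedged sketch of such a cross-bucket validation script; the bucket names and daily prefix convention are assumptions:

```python
from datetime import date

import boto3


def list_keys(bucket, prefix):
    """Return the set of object keys (relative to the prefix) in a bucket."""
    s3 = boto3.client("s3")
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"][len(prefix):])
    return keys


# Placeholder buckets; the prefix narrows the check to today's files.
prefix = f"daily/{date.today():%Y-%m-%d}/"
source = list_keys("example-source-bucket", prefix)
target = list_keys("example-target-bucket", prefix)

missing = source - target
print("Files missing from target:", sorted(missing) or "none")
```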

Collaborated with cross-functional teams, including business analysts, data engineers, and data scientists, to define data requirements and architectural needs.

Designed and implemented scalable data architectures, including data lakes and data warehouses, to support the organization's data-driven initiatives.

Developed stored procedures in MS SQL to fetch the data from different servers using FTP and processed these files to update the tables.

Analyzed various types of raw files such as JSON, CSV, and XML with Python using Pandas, NumPy, etc.
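
A small illustrative Pandas/NumPy profiling sketch; the file paths and the amount column are assumptions, and read_xml requires pandas 1.3+ with lxml installed:

```python
import numpy as np
import pandas as pd

# Paths are placeholders for the raw feeds being profiled.
orders = pd.read_csv("data/orders.csv")
events = pd.read_json("data/events.json", lines=True)  # newline-delimited JSON
invoices = pd.read_xml("data/invoices.xml")             # needs pandas >= 1.3 and lxml

# Quick profiling of each frame.
for name, frame in [("orders", orders), ("events", events), ("invoices", invoices)]:
    print(name, frame.shape, frame.dtypes.to_dict())
    print(frame.describe(include="all").head())

# Example NumPy-backed derived column (assumes an "amount" column exists).
orders["amount_zscore"] = (orders["amount"] - orders["amount"].mean()) / np.std(orders["amount"])
```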

Responsible for Designing Logical and Physical data modelling for various data sources on Confidential Redshift.

Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.

Experience building data pipelines in Python/PySpark/HiveQL/Presto/BigQuery and building Python DAGs in Apache Airflow.
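
A minimal Airflow 2.x DAG sketch of this kind of pipeline; the DAG id, schedule, and task bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from the source system")


def transform():
    print("applying PySpark/Hive transformations")


with DAG(
    dag_id="example_daily_pipeline",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task
```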

Used Microsoft Visual Source Safe (VSS) and Team Foundation Server (TFS) for integration, maintenance and Security of code.

Created an ETL pipeline using Spark and Hive to ingest data from multiple sources.

Involved in using SAP and transactions done in SAP - SD Module for handling customers of the client and generating the sales reports.

Coordinated with clients directly to get data from different databases.

Worked on MS SQL Server, including SSRS, SSIS, and T-SQL.

Designed and developed schema data models.

Implemented data security measures, including encryption, access controls, and data masking, to protect sensitive data.

Stayed current with industry trends and emerging technologies in data architecture, ensuring the organization's data infrastructure remained up to date.

Acted as a technical mentor to junior team members, providing guidance on data architecture best practices

Documented business workflows for stakeholder review.

Client: AT&T Telecommunications April 2015 – Aug 2017

Position: Big Data Engineer/Spark Developer

Responsibilities:

Involved in the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, and utilized optimization techniques, linear regression, K-means clustering, Naïve Bayes, and other approaches.

Developed Spark batch job to automate creation/metadata update of external Hive table created on top of datasets residing in HDFS.

Developed a data serialization Spark common module for converting complex objects into serialized bytes using Avro, Parquet, JSON, and CSV formats.
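
A hedged sketch of writing one DataFrame in the four serialization formats mentioned; paths are placeholders, and Avro output assumes the external spark-avro package is on the classpath:

```python
from pyspark.sql import SparkSession

# Avro support requires the external spark-avro package
# (e.g. --packages org.apache.spark:spark-avro_2.12:3.3.0).
spark = SparkSession.builder.appName("serialization-formats").getOrCreate()

df = spark.read.parquet("hdfs:///data/events/")  # placeholder source

base = "hdfs:///data/serialized/events"
df.write.mode("overwrite").format("avro").save(f"{base}_avro")
df.write.mode("overwrite").parquet(f"{base}_parquet")
df.write.mode("overwrite").json(f"{base}_json")
df.write.mode("overwrite").option("header", "true").csv(f"{base}_csv")
```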

Worked on ER modeling, dimensional modeling (Star schema, Snowflake schema), data warehousing, and OLAP tools.

Populated HDFS and PostgreSQL with huge amounts of data using Apache Kafka.

Designed and developed a REST API (Commerce API) that provides functionality to connect to PostgreSQL through Java services.

Created an iterative macro in Alteryx to send JSON requests, download JSON responses from the web service, and analyze the response data.

Designed a batch audit process in batch/shell scripts to monitor each ETL job along with reporting status, including table name, start and finish time, number of rows loaded, status, etc.

Developed Spark jobs in PySpark to perform ETL from SQL Server to Hadoop.
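
A minimal sketch of such a PySpark JDBC extract from SQL Server landing in HDFS; the host, database, credentials, and columns are placeholders, and the Microsoft JDBC driver is assumed to be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-hdfs").getOrCreate()

# Connection details are placeholders.
jdbc_url = "jdbc:sqlserver://example-host:1433;databaseName=sales"

orders = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.orders")
          .option("user", "etl_user")
          .option("password", "*****")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())

# Light transformation, then land the data in HDFS as Parquet.
cleaned = orders.dropDuplicates(["order_id"]).filter("order_status IS NOT NULL")
cleaned.write.mode("overwrite").parquet("hdfs:///data/raw/orders/")
```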

Responsible for continuous monitoring and managing Elastic MapReduce (EMR) cluster through AWS console.

Designed and implemented data acquisition, ingestion, Management of Hadoop infrastructure and other Analytics tools (Splunk, Tableau).

Working knowledge of build automation and CI/CD pipelines.

Developed Python scripts to automate the data ingestion pipeline for multiple data sources and deployed Apache NiFi in AWS.

Designed and developed Tableau visualizations, including dashboards using calculations, parameters, calculated fields, groups, sets, and hierarchies.

Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Flink, Kafka, SQL Server 2012, T-SQL, CI/CD, Git, XML.


