Senior Data Engineer

Location: Raleigh, NC 27603
Posted: December 18, 2023

Teja Anupoju

+1-682-***-****

ad12fi@r.postjobfree.com

Senior Data Engineer with experience in the Automobile, Media, Health Care, Software, and Engineering domains. Expert in building data pipelines and dashboards that deliver insights for opportunity identification and process re-engineering, backed by a clear narrative. Specialized in Data Analytics, Data Engineering, and Data Visualization. Analytically minded professional with a proven ability to solve complex quantitative business challenges. Exceptional verbal and written communication skills, with a track record of effectively conveying insights to both business and technical teams. Adept at using data to drive strategic decision-making and delivering impactful presentations.

Professional Summary:

8+ years of IT experience in Big Data engineering, analysis, design, implementation, development, maintenance, and testing of large-scale applications using SQL, Hadoop, Python, Java, and other Big Data technologies.

Hands-on experience in installing, configuring, supporting, and managing Hadoop clusters using the Cloudera and Hortonworks distributions of Hadoop.

Experience working with Elasticsearch, Logstash and Kibana.

Experience working with the AWS stack (S3, EMR, EC2, SQS, Glue, and Redshift).

Designed AWS Glue jobs to convert nested JSON objects to Parquet files and load them into S3.
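
Illustrative of the kind of Glue/PySpark job described above; this is a minimal sketch in plain PySpark, and the bucket names, nested field names, and output layout are placeholders rather than the actual job's values (the real jobs may also have used the awsglue DynamicFrame API).

    # Sketch: flatten nested JSON and write Parquet back to S3 (paths and fields are placeholders).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("json-to-parquet-sketch").getOrCreate()

    # Read nested JSON objects from a raw-zone prefix.
    raw = spark.read.json("s3://example-raw-bucket/events/")

    # Flatten a few nested attributes into top-level columns.
    flat = raw.select(
        col("id"),
        col("payload.customer.id").alias("customer_id"),
        col("payload.vehicle.vin").alias("vin"),
        to_date(col("event_ts")).alias("event_date"),
    )

    # Write back to S3 as Parquet, partitioned by date for analytics queries.
    flat.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-curated-bucket/events_parquet/"
    )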

Experience in developing solutions to analyze large data sets efficiently.

Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala for data cleansing, filtering, and data aggregation, with detailed knowledge of the MapReduce framework.

Proficient at leveraging cutting-edge data manipulation, transformation, and analysis tools to transform raw data into actionable insights.

Experienced in developing PySpark programs, creating DataFrames, and working on transformations.

Created Hive, SQL and HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.

Worked on developing ETL processes to load data from multiple data sources to HDFS using Kafka and Sqoop.

Experience in using Apache Kafka for collecting, aggregating and moving large amounts of data.

Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and from RDBMS to HDFS.

Experience in data analysis using Hive, Impala.

Experience in developing large scale applications using Hadoop and Other Big Data tools.

Responsible for updating and configuring different vehicle components.

In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.

Hands-on experience with Hadoop ecosystem components such as HDFS, Cloudera, YARN, Hive, HBase, Sqoop, Flume, Kafka, Impala, and Airflow, and with programming in Spark using Python and Scala.

Skilled at working with cross-functional teams to understand business requirements and transform them into scalable PySpark solutions.

Experienced in data manipulation using Python for loading and extraction, and in Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
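
A small illustrative snippet of this kind of Pandas/NumPy work; the file name and column names are assumptions, not values from an actual project.

    # Sketch: basic cleansing and per-group numerical summary (file and columns are placeholders).
    import numpy as np
    import pandas as pd

    df = pd.read_csv("daily_extract.csv", parse_dates=["reading_ts"])
    df = df.dropna(subset=["sensor_id"])
    df["value"] = df["value"].astype(float)

    # Per-sensor summary using NumPy-backed aggregations.
    summary = df.groupby("sensor_id")["value"].agg(
        mean="mean", p95=lambda s: np.percentile(s, 95)
    )
    print(summary.head())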

Experience using Ansible to automate deployment processes.

Experience in scheduling and monitoring jobs using Airflow; also developed custom operators in Airflow.
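
As a hedged illustration of a custom Airflow operator, the sketch below checks that an expected S3 object exists before downstream tasks run; the class name, bucket, and key parameters are hypothetical, not the operators actually built.

    # Sketch of a custom Airflow operator (Airflow 2.x style); names and parameters are illustrative.
    import boto3
    from botocore.exceptions import ClientError
    from airflow.models.baseoperator import BaseOperator


    class S3KeyExistsOperator(BaseOperator):
        """Hypothetical operator that fails the task if an expected S3 object is missing."""

        def __init__(self, bucket: str, key: str, **kwargs):
            super().__init__(**kwargs)
            self.bucket = bucket
            self.key = key

        def execute(self, context):
            s3 = boto3.client("s3")
            try:
                s3.head_object(Bucket=self.bucket, Key=self.key)
            except ClientError:
                raise FileNotFoundError(f"s3://{self.bucket}/{self.key} not found")
            self.log.info("Found s3://%s/%s", self.bucket, self.key)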

Experience with SQL and NoSQL databases (HBase and Cassandra).

Developed Spark scripts to import large files from AWS S3 buckets.

Integrated SNS and Kinesis with hospital information systems (HIS), electronic health record (EHR) systems, and other healthcare databases to promote data exchange and interoperability.
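
A minimal boto3 sketch of this integration pattern; the stream name, topic ARN, and event fields are placeholders, and the real systems would add authentication, retries, and PHI safeguards.

    # Sketch: push an EHR-style event to Kinesis and fan out a notification via SNS (names are placeholders).
    import json
    import boto3

    kinesis = boto3.client("kinesis")
    sns = boto3.client("sns")

    def publish_admission_event(event: dict) -> None:
        # Stream the full record for downstream processing.
        kinesis.put_record(
            StreamName="ehr-events",
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=event["patient_id"],
        )
        # Send a lightweight notification to subscribed hospital systems.
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:ehr-admissions",
            Message=json.dumps({"patient_id": event["patient_id"], "type": event["type"]}),
        )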

Performed structural modifications using Hive and analyzed data using visualization/reporting tools (Tableau).

Experience in using the Hadoop ecosystem and reporting on data using Tableau, QuickSight, and PowerBI.

Used Django and Flask, two leading Python web frameworks, to power data-driven solutions for the automobile sector.

Led disaster recovery planning and testing, which produced a solid backup plan with minimal delay in the event of data loss or cluster failure.

Knowledgeable in gathering, collecting, and integrating data from a variety of sources, including databases, APIs, external datasets, and data lakes.

Improved API performance by optimizing database queries, implementing caching mechanisms, and reducing response times, resulting in a more responsive and scalable system.

Took part in cross-functional meetings to collect requirements and offer technical knowledge on issues pertaining to APIs.

Experienced in using database tools like SQL Navigator, TOAD.

Experience in the dynamic field of telematics, with a focus on the intersection of connectivity and automotive technology.

Experienced in developing and implementing telematics systems for Ford.

Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Extensively used Informatica PowerCenter for end-to-end data warehousing ETL routines, including writing custom scripts and loading data from flat files.

Hands-on experience with Hadoop/Big Data technologies for the storage, querying, processing, and analysis of data.

Experienced in using various Hadoop infrastructures such as Hive and Sqoop.

Expert in Amazon EMR, Spark, Kinesis, S3, Boto3, Elastic Beanstalk, ECS, CloudWatch, Lambda, ELB, VPC, ElastiCache, DynamoDB, Redshift, RDS, Athena, Zeppelin, and Airflow.

Experienced in testing data in HDFS and Hive for each transaction of data.

Education Details:

Bachelor's in Electronics and Communications (K L University).

Certifications:

AWS Solution Architect: https://www.credly.com/badges/e96a9664-2ad6-4465-b952-e5d7860b9be4.

GCP Collaboration Engineer.

Professional Experience:

Client: FORD, Detroit, MI Apr 2022 to Present

Sr Data Engineer

Designed and implemented data pipelines using Hive, Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.

Used the AWS Glue ETL service to consume raw data from an S3 bucket, transform it per requirements, and write the output back to S3 in Parquet format for data analytics.

Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.

Managed the troubleshooting and optimization of AWS Glue jobs, removing difficulties in performance and increasing pipeline efficiency.

Collaborated with data scientists to improve the organization's statistical analysis capabilities by integrating machine learning models into AWS Glue workflows.

Familiarity with AWS CloudWatch for monitoring and managing AWS resources.

Swiftly resolved data pipeline problems and performance slowdowns by organizing and analyzing log data in CloudWatch Logs.

Stored data in AWS S3 (used like HDFS) and ran EMR programs on data stored in S3.

Involved in designing Amazon Redshift clusters, schemas, and tables, and provided AWS Redshift and EMR maintenance support.

Hands-on with AWS data migration between database platforms, moving local SQL Server databases to Amazon RDS and EMR Hive.

Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Involved in writing Java and Node.js APIs for AWS Lambda to manage some of the AWS services.

Worked on AWS Lambda to run code without managing servers, triggered by S3 and SNS events.

Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS Lambda, creating functions in Python for certain events based on use cases.
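
A minimal sketch of one way such a Lambda could stage DynamoDB stream records for a Redshift COPY; the bucket, key prefix, and flattening logic are assumptions for illustration, not the production code.

    # Sketch: Lambda handler that stages DynamoDB stream records to S3 for a later Redshift COPY.
    import json
    import uuid
    import boto3

    s3 = boto3.client("s3")
    STAGING_BUCKET = "example-redshift-staging"  # placeholder bucket

    def handler(event, context):
        rows = []
        for record in event.get("Records", []):
            if record["eventName"] in ("INSERT", "MODIFY"):
                image = record["dynamodb"]["NewImage"]
                # DynamoDB stream images are typed ({"S": ...}); flatten the simple cases.
                rows.append({k: list(v.values())[0] for k, v in image.items()})
        if rows:
            s3.put_object(
                Bucket=STAGING_BUCKET,
                Key=f"incoming/{uuid.uuid4()}.json",
                Body="\n".join(json.dumps(r) for r in rows).encode("utf-8"),
            )
            # A separate step (e.g. a scheduled COPY ... FORMAT AS JSON) loads the staged files into Redshift.
        return {"staged": len(rows)}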

Implemented the AWS cloud computing platform using RDS, Python, DynamoDB, S3, and Redshift.

Moved data from the traditional databases like MS SQL Server, MySQL, and Oracle into Hadoop by using HDFS.

Involved in the creation of Hive tables, loading, and analyzing the data by using hive queries.

Worked on the installation and configuration of EC2 instances on AWS (Amazon Web Services) to establish clusters in the cloud.

Developed Spark SQL scripts using PySpark to perform transformations and actions on DataFrames and Datasets in Spark for faster data processing.

Used PySpark and Spark SQL for extracting, transforming, and loading the data according to the business requirements.
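
The following is a minimal PySpark/Spark SQL sketch of that extract-transform-load pattern; the paths, table, and column names are illustrative assumptions.

    # Sketch: DataFrame transformation plus a Spark SQL aggregation (paths and columns are placeholders).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, col

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    orders = spark.read.parquet("s3://example-curated-bucket/orders/")
    orders = orders.withColumn("order_date", to_date(col("order_ts")))
    orders.createOrReplaceTempView("orders")

    # Business rule expressed in Spark SQL.
    daily = spark.sql("""
        SELECT order_date, region, SUM(amount) AS total_amount
        FROM orders
        WHERE status = 'COMPLETED'
        GROUP BY order_date, region
    """)

    daily.write.mode("overwrite").parquet("s3://example-curated-bucket/daily_sales/")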

Worked on CI/CD solution, using Git, Jenkins, Docker, and Kubernetes to setup and configure big data architecture on AWS cloud platform.

Exposed to all aspects of software development life cycle (SDLC) like Analysis, Planning, Developing, Testing, implementing and post-production analysis of the projects. Worked through Waterfall, Scrum/Agile Methodologies.

Used Pig as ETL tool to do event joins, transformations, filter, and some pre-aggregations.

Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
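
A hedged sketch of this kind of consumer, shown here with Structured Streaming rather than the DStream API; the broker addresses, topic, and sink are placeholders, and the HBase write inside foreachBatch is only indicated since the actual connector is not shown.

    # Sketch: read a Kafka topic as a stream and hand each micro-batch to a sink function.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "vehicle-events")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"))
    )

    def write_batch(batch_df, batch_id):
        # In the real job each micro-batch would be written to HBase via a connector or bulk puts.
        batch_df.write.mode("append").parquet("s3://example-stream-sink/vehicle_events/")

    query = (
        events.writeStream.foreachBatch(write_batch)
        .option("checkpointLocation", "s3://example-stream-sink/checkpoints/")
        .start()
    )
    query.awaitTermination()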

Worked on integrating a Kafka publisher into Spark jobs to capture errors from the Spark application and push them into a database.

Implemented monitoring solutions to track API usage, identify performance issues, and proactively address potential issues.

Configured a MongoDB cluster with high availability and load balancing, and performed CRUD operations.

Loaded data from flat files into the target database using ETL processes, applying business logic to insert and update records during the load.

Performed day-to-day Database Maintenance tasks including Database Monitoring, Backups, Space, and Resource Utilization.

Used Airflow to schedule the end-to-end jobs.

Implemented Apache Airflow and other dependency enforcement and scheduling tools.

Created multiple DAGs in Airflow for daily, weekly, and monthly loads.
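
A minimal example of what a daily-load DAG can look like; the DAG id, callables, and schedule below are placeholders rather than the actual pipelines.

    # Sketch: a two-task daily DAG (Airflow 2.x).
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull source data")

    def load():
        print("load into the warehouse")

    with DAG(
        dag_id="daily_load_sketch",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task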

Prepared and integrated data into PowerBI for effective reporting and analysis.

Used Tableau Data Visualization tool for reports, integrated tableau with Alteryx for Data & Analytics.

Used JIRA as the Scrum Tool for Scrum Task board and work on user stories.

Integrated Oozie workflows with MapReduce, Hive, Pig, HDFS, and other Hadoop ecosystem components to support a variety of data processing tasks.

Experienced in implementing retry and error-handling mechanisms in Oozie workflows to improve the resilience of data processing jobs.

Worked on creating Oozie workflows and coordinator jobs to trigger jobs on time based on the availability of data.

Experience in building scripts using Maven and working with continuous integration systems like Jenkins.

Environment: AWS (S3, EMR, EC2, Lambda, Glue, CloudWatch), Hadoop Ecosystem, Hive, Pig, ETL, Python, Java, Node.js, PowerBI, MongoDB, API Development, Integration, Apache Airflow, Snowflake, CI/CD, Kubernetes, Sqoop, MSSQL, Git, Jenkins.

Client: The Walt Disney, Orlando, FL Oct 2021 to Mar 2022

Sr Data Engineer

Designed and built scalable distributed data solutions using AWS, and planned the migration of the existing on-premises Cloudera Hadoop distribution to AWS based on business requirements.

Deploying and configuring Cloudera clusters to ensure the availability and reliability of big data infrastructure.

Monitoring and tuning the performance of the Cloudera cluster to ensure efficient data processing and analytics.

Troubleshoot and maintain ETL/ELT jobs running using Matillion.

Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring integrity in a relational environment by working closely with the stakeholders and solution architect.

Maintained AWS Glue updates and best practices, offering suggestions for data engineering process optimization and continuous improvement.

Used AWS EMR to process big data across Hadoop clusters of virtual servers, with data stored on Amazon Simple Storage Service (S3).

Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and Spark.

Worked in AWS environment for development and deployment of custom Hadoop applications.

Built AWS data pipelines using various AWS resources, including API Gateway to receive responses from AWS Lambda, with Lambda functions retrieving data from Snowflake and converting the responses into JSON format, backed by Snowflake, DynamoDB, and AWS S3.
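
A minimal sketch of a Lambda behind API Gateway (proxy integration) that queries Snowflake and returns JSON; the environment variables, warehouse/database names, and query are assumptions for illustration only.

    # Sketch: Lambda handler returning Snowflake query results as a JSON API response.
    import json
    import os
    import snowflake.connector

    def handler(event, context):
        conn = snowflake.connector.connect(
            account=os.environ["SF_ACCOUNT"],
            user=os.environ["SF_USER"],
            password=os.environ["SF_PASSWORD"],
            warehouse="REPORTING_WH",   # placeholder names
            database="ANALYTICS",
            schema="PUBLIC",
        )
        try:
            cur = conn.cursor()
            cur.execute("SELECT title, release_year FROM titles LIMIT 10")
            rows = [{"title": t, "release_year": y} for t, y in cur.fetchall()]
        finally:
            conn.close()
        # Shape expected by API Gateway's Lambda proxy integration.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(rows),
        }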

Worked with Snowflake features such as Snowpipe for real-time data ingestion.

Implemented and maintained data security measures in Snowflake to protect sensitive information.

Collaborated with cross-functional teams to integrate Snowflake solutions into existing systems and workflows.

Conducted performance tuning and optimization of Snowflake queries and data processing workflows.

Created and maintained documentation for Snowflake architecture, data models, and ETL processes.

Used Git to manage API version control, making sure that branching strategies and versioning were clear to allow for backward compatibility.

Experienced in implementing APIs to synchronize data between systems, guaranteeing consistency and real-time updates across platforms.

Took part in API testing to confirm overall system reliability, error handling, and data accuracy.

Overcame difficulties with versioning and integrating disparate JSON formats for data.

Experienced in tracking API usage, finding performance problems, and taking proactive measures to fix potential issues by implementing monitoring solutions.

Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.

Experienced in providing highly available and fault tolerant applications utilizing orchestration technology on Google Cloud Platform (GCP).

Experienced in the planning and capacity requirements for migrating an on-premises IBM BigInsights solution to a cloud-native GCP-based solution, involving tools such as Dataproc, Dataflow, Cloud Functions, Google Cloud Storage, and Pub/Sub.

Expertise in configuration, logging, and exception handling.

Oversaw and evaluated Hadoop log files, analyzed SQL scripts, and designed solutions for the process using PySpark.

Used PyCharm, Jupyter Notebook as IDE and Git for version control. Testing and deploying the application on Tomcat and NGINX. Used Jenkins for continuous integration of the code.

Performed stress and performance testing and benchmarking for the cluster.

Environment: AWS, Snowflake, GCP, API Integration, Big Data Tools, IBM, ETL, JSON, Lambda, Git, Jupyter, PyCharm, PySpark, Dataflow, SQL, Cloud Storage, Matillion.

Client: Anthem INC Oct 2019 to July 2021

Data Engineer

Designed the Azure Cloud relational databases analyzing the clients and business requirements.

Worked on environmental data migration from On-Premises to cloud databases (Azure SQL DB).

Implemented disaster recovery and failover servers in cloud by replicating the environmental data across multiple regions.

Extensive experience with AKS (Azure Kubernetes Service).

Have extensive experience in creating pipeline jobs and implementing scheduling triggers.

Mapped data flows using Azure Data Factory(V2) and used Key Vaults to store credentials.

Have ample experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).

Experience in creating Elastic pool databases and scheduled Elastic jobs for executing T-SQL procedures.

Developed business intelligence solutions using SQL server and load data to SQL & Azure Cloud databases.

Worked on creating tabular models in Azure Analysis Services to meet customers' reporting requirements.

Built pipelines using Azure Data Factory, Azure Databricks and loading data into Azure data lake, Azure data warehouse and monitored databases.

Created tabular models in Azure Analysis Services based on business requirements.

Proposed designs with Azure cost and spend in mind and developed recommendations to right-size the data infrastructure.

Worked on creating correlated sub-queries to resolve complex business queries involving multiple tables from various data sets.

Created Tableau reports with complex calculations and worked on Ad-hoc reporting using PowerBI.

Solid Experience in building interactive reports, dashboards, and integrating modeling results and have strong data visualization experience in Tableau and PowerBI.

Developed numerous backend modules using the Python Flask web framework with ORM models.

Designed and developed RESTful APIs that exposed the web applications' core functionality and enabled efficient interaction between components.

Created API endpoints for CRUD functions, data retrieval, and authentication in Python utilizing frameworks like Flask and Django.
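
A minimal Flask sketch of a token-protected endpoint of that kind; the in-memory store and static token stand in for the real database and authentication layer.

    # Sketch: token-protected CRUD-style endpoints (store and token are placeholders).
    from functools import wraps
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)
    MEMBERS = {}                 # placeholder for the real database
    API_TOKEN = "change-me"      # placeholder for real token/secret management

    def require_token(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            if request.headers.get("Authorization") != f"Bearer {API_TOKEN}":
                abort(401)
            return view(*args, **kwargs)
        return wrapped

    @app.route("/members/<member_id>", methods=["GET"])
    @require_token
    def get_member(member_id):
        member = MEMBERS.get(member_id)
        return jsonify(member) if member else abort(404)

    @app.route("/members/<member_id>", methods=["PUT"])
    @require_token
    def upsert_member(member_id):
        MEMBERS[member_id] = request.get_json()
        return jsonify(MEMBERS[member_id]), 201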

Implemented token-based authentication and authorization systems to guarantee API security.

Worked on projects designed with waterfall and agile methodologies, delivered high-quality deliverables on time.

Extensively worked on tuning and optimizing SQL queries to reduce run time.

Designed and developed JAVA API (Commerce API) which provides functionality to connect to the Cassandra through Java services.

Successfully designed and developed a Java multi-threading based collector, parser, and distributor process to collect, parse, and distribute data arriving at thousands of messages per second.

Implemented Google BigQuery as a data layer between Google Analytics and PowerBI. A large volume of web behavior data tracked in Google Analytics needed to be pulled into a BI system for better reporting; with the native PowerBI connector for GA, we were getting sampled data that did not give accurate results.
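
A brief sketch of pulling unsampled GA export data with the BigQuery Python client; the project, dataset, and fields are placeholders, not the actual property.

    # Sketch: query the Google Analytics BigQuery export (names are placeholders).
    from google.cloud import bigquery

    client = bigquery.Client(project="example-analytics-project")

    sql = """
        SELECT date, trafficSource.medium AS medium, COUNT(*) AS sessions
        FROM `example-analytics-project.ga_export.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN '20210101' AND '20210131'
        GROUP BY date, medium
    """
    for row in client.query(sql).result():
        print(row.date, row.medium, row.sessions)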

Environment: Azure, Azure SQL DB, Databricks, SSIS & SSRS, Java API, Cassandra, PowerBI, Google BigQuery, Python, SQL, ADF (Azure Data Factory), AKS, Sqoop.

Client: National Grid, Waltham, MA Jan 2018 to Sept 2019

Data Engineer/Analyst

Migrated the existing SQL CODE to Data Lake and sent the extracted reports to the consumers.

Created PySpark engines that process huge environmental data loads within minutes, implementing various business logic.

Worked extensively on Data mapping to understand the source to target mapping rules.

Developed data pipelines using python, PySpark, Hive, Pig and HBase.

Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).

Migrated an entire Oracle database to BigQuery and used PowerBI for reporting.

Developed PySpark scripts to encrypt specified sets of data using hashing algorithms.
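
A minimal sketch of that masking step using Spark's built-in SHA-2 function; the input path and column names are assumed for illustration.

    # Sketch: hash a sensitive column and drop the original (path and columns are placeholders).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sha2

    spark = SparkSession.builder.appName("hash-pii-sketch").getOrCreate()

    readings = spark.read.parquet("/data/raw/site_readings/")
    masked = (
        readings.withColumn("site_id_hash", sha2(col("site_id").cast("string"), 256))
        .drop("site_id")
    )
    masked.write.mode("overwrite").parquet("/data/masked/site_readings/")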

Designed and implemented a part of the product of knowledge lens which takes environmental data on a real time basis from all the industries.

Performed data analysis and data profiling using SQL on various extracts.

Effectively interacted with business experts and data scientists and defined planning documents and configuration processes for different sources and targets.

Created reports of analyzed and validated data using Apache Hue and Hive, generated graphs for data analytics.

Worked on data migration into HDFS and Hive using Sqoop.

Involved in business requirements gathering for successful implementation and POC (proof-of-concept) of Hadoop and its ecosystem.

Performed database cloning for testing purposes.

Wrote multiple batch processes in Python and PySpark to process huge amounts of time-series data, generating reports and scheduling their delivery to industries.

Created analytical reports on this real time environmental data using Tableau.

Generated final reporting data using Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.

Responsible for HBase bulk load process, created HFiles using MapReduce and then loaded data to HBase tables using complete bulk load tool.

Commit and Rollback methods were provided for transactions processing.

Fetched data to/from HBase using MapReduce jobs.

Used HBase for storing the Meta data of files and maintaining the file patterns.

Worked in complete Software Development Life Cycle (analysis, design, development, testing, implementation and support) using Agile Methodologies (Jira).

Environment: Python, Tableau, PySpark, SQL, Hadoop, databases, Hive, MapReduce, Sqoop, Data Analytics, Oracle, DB2, MongoDB, ADLS, ADF.

Client: Infosys Pvt Ltd, Bengaluru NOV 2014 to MAY 2017

Data Engineering Analyst

Understanding of how data relates to Big Data and to working with large data sets.

Hadoop’s method of distributed processing.

YARN and MapReduce on Hadoop.

Hadoop Administration.

Moving Data into Hadoop.

Query optimization through SQL tools for quick response time.

Worked on Database views, tables, and objects.

Worked on various project phases such as design, development, testing, and deployment.

Worked on various bug fixes in the dev environment.

Used GitHub as a version control tool.

Developed backend modules using the Tornado framework, later moving to the Flask framework.

Experienced working on Python versions 2.7 and 3.

Spark fundamentals and why it is an essential tool set for working with Big Data.

Accessing Hadoop Data Using Hive.

Designing and maintaining data systems and databases, including fixing coding errors and other data-related problems.

Mining data from primary and secondary sources, then reorganizing said data in a format that can be easily read by either human or machine.

Using statistical tools to interpret data sets, paying particular attention to trends and patterns that could be valuable for diagnostic and predictive analytics efforts.

Demonstrating the significance of this work in the context of local, national, and global trends that impact both the organization and the industry.

Collecting data from various data sources, refining, and preparing data for analytics.

Design, develop, manage, and utilize the Tableau platform to extract meaningful insights, drill down into data, and prepare reports with utmost accuracy using various visualization and data modeling methods.

Define new KPIs and consistently measure them in the datasets. Test published dashboards and schedule refresh reports.

Use statistical methods to analyze data and generate useful business reports; work with the management team to create a prioritized list of needs for each business segment; identify and recommend new ways to save money by streamlining business processes.

Use data to create models that depict trends in the customer base and the consumer population as a whole; work with departmental managers to outline the specific data needs for each business method analysis project.

Preparing reports for executive leadership that effectively communicate trends, patterns, and predictions using relevant data.

Environment: GitHub, Python, SQL, Spark, Cassandra, Data Analytics, Tableau, Hadoop, Hive, Sqoop, Big Data Tools.

Technical Skills:

Hadoop Ecosystem: Hadoop, HDFS, Hive, Spark Streaming, Scala, Kafka, Storm, ZooKeeper, HBase, YARN, Spark, Sqoop, Flume.

Programming Languages: C++, Java, Python, Scala, SQL, JavaScript

Hadoop Distributions: Apache Hadoop, Cloudera Hadoop Distribution (CDH3, CDH4, CDH5), and Hortonworks Data Platform.

NoSQL Databases: HBase, Cassandra, MongoDB.

API: Restful API, API Doc

Cloud: AWS, EMR, Glue, CloudWatch, AWS S3, SNS, Azure, Snowflake, Kinesis.

Query Language: HiveQL, SQL, PL/SQL

IDE’s: Eclipse, IntelliJ, PyCharm

Frameworks: MVP, Struts, Spring, Hibernate

Operating System: Linux, Unix, Windows.

Scripting: Shell Scripting

Version Control: SVN, Git, CVS


