Akshitha
************@*****.***
Senior Data Engineer
Professional Summary
Data Engineering professional with around 8 years of combined experience in data engineering, Big Data implementations, and Spark technologies.
Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, and knowledge of Spark MLlib.
Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
Solid experience and understanding of designing and operationalizing large-scale data and analytics solutions on Snowflake Data Warehouse.
Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL (see the sketch after this summary).
Experience in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
Implemented a cluster for the NoSQL tool HBase as part of a POC to address HBase limitations.
Strong knowledge of the architecture and components of Spark; efficient in working with Spark Core.
Strong knowledge of Hive analytical functions; extended Hive functionality by writing custom UDFs.
Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.
Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
Hands-on experience in SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
Experience in data warehousing and business intelligence across various domains.
Created Tableau dashboards designed for large data volumes sourced from SQL Server.
Extracted, transformed, and loaded (ETL) source data into respective target tables to build the required data marts.
Active involvement in all scrum ceremonies - Sprint Planning, Daily Scrum, Sprint Review and Retrospective meetings - and assisted the Product Owner in creating and prioritizing user stories.
Strong experience in working with UNIX/LINUX environments, writing shell scripts.
Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
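A minimal sketch of the Python-plus-SnowSQL loading pattern referenced above, using the snowflake-connector-python package; the account, credentials, file path, and table name are placeholders rather than details of any actual engagement.

    import snowflake.connector

    # Connect with placeholder credentials; real values would come from a secrets store.
    conn = snowflake.connector.connect(
        account="my_account",
        user="etl_user",
        password="********",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    cur = conn.cursor()
    try:
        # Stage the extracted file into the table's internal stage, then bulk-load it.
        cur.execute("PUT file:///tmp/orders.csv @%ORDERS_RAW OVERWRITE = TRUE")
        cur.execute("""
            COPY INTO ORDERS_RAW
            FROM @%ORDERS_RAW
            FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        """)
    finally:
        cur.close()
        conn.close()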
Technical Skills
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Hadoop Distributions: Cloudera and Hortonworks.
Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting
Script Languages: JavaScript, jQuery, Python.
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake; NoSQL databases (HBase, MongoDB).
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application Servers: Apache Tomcat, WebLogic, WebSphere; Tools: Eclipse, NetBeans
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL
OLAP/Reporting: SQL Server Analysis Services and Reporting Services.
Cloud Technologies: MS Azure, Amazon Web Services (AWS).
Professional Experience
Cigna, Bloomfield, Connecticut Jan 2022 to Present
Senior Azure Data Engineer
Responsibilities:
Used Azure Data Factory extensively for ingesting data from disparate source systems.
Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
Automated jobs using different triggers (Event, Scheduled and Tumbling) in ADF.
Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
Designed and developed user-defined functions, stored procedures, and triggers for Cosmos DB.
Analyzed the data flow from different sources to targets to provide the corresponding design architecture in the Azure environment.
Take initiative and ownership to provide business solutions on time.
Created High level technical design documents and Application design documents as per the requirements and delivered clear, well-communicated and complete design documents.
Created DA specs and Mapping Data flow and provided the details to the developer along with HLDs.
Created Build definition and Release definition for Continuous Integration and Continuous Deployment.
Created Application Interface Document for the downstream to create a new interface to transfer and receive the files through Azure Data Share.
Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
Improved performance by optimizing the compute time needed to process the streaming data and saved cost by optimizing cluster run time.
Performed ongoing monitoring, automation, and refinement of data engineering solutions; prepared complex SQL views and stored procedures in Azure SQL DW and Hyperscale.
Designed and developed a new solution to process NRT data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue.
Created a Linked Service to land data from an SFTP location into Azure Data Lake.
Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using different Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
Created several Databricks Spark jobs with PySpark to perform table-to-table operations (see the sketch below).
Extensively used SQL Server Import and Export Data tool.
Created database users, logins, and permissions during setup.
Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata utilities, Windows Remote Desktop, UNIX shell scripting, Azure PowerShell, Databricks, Python.
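Below is an illustrative PySpark sketch of the table-to-table Databricks jobs mentioned above; the storage account, container, paths, columns, and table names are hypothetical, and it assumes the cluster already has access to the ADLS location.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("claims_daily_load").getOrCreate()

    # Read raw claim records landed in ADLS by the ADF copy pipeline.
    raw = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/claims/")

    # Basic cleanup plus a daily aggregation.
    daily = (
        raw.filter(F.col("claim_status").isNotNull())
           .withColumn("claim_date", F.to_date("claim_timestamp"))
           .groupBy("claim_date", "plan_id")
           .agg(F.count("*").alias("claim_count"),
                F.sum("claim_amount").alias("total_amount"))
    )

    # Persist to a curated table consumed by downstream reporting.
    daily.write.mode("overwrite").saveAsTable("curated.claims_daily")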
CVS, New York, NY Aug 2020 to Dec 2021
Sr Data Engineer
Responsibilities:
Worked with the analysis teams and management teams and supported them based on their requirements.
Architected, designed, and developed business applications and data marts for reporting.
Interacted with users for verifying User Requirements, managing Change Control Process, updating existing documentation.
Facilitated and participated in Joint Application Development (JAD) sessions for communicating and managing expectations with the business users and end users.
Involved in Agile development methodology; active member in scrum meetings.
Designed and Configured Azure Cloud relational servers and databases analyzing current and future business requirements.
Interacted with the SMEs (Subject Matter Experts) and stakeholders to get a better understanding of client business processes and gathered and analyzed business requirements.
Designed, documented and deployed systems pertaining to Enterprise Data Warehouse standards and best practices.
Installed the Hadoop distribution system.
Worked on migration of data from On-prem SQL Server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
Designed and developed data pipeline in Azure cloud which gets customer data from API and process it to Azure SQL DB.
Worked on pulling data from the database for consumption in Databricks.
Performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
Wrote UDFs in Scala and PySpark to meet specific business requirements (see the PySpark sketch below).
Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
Ingested data into HDFS using Sqoop and scheduled an incremental load to HDFS.
Worked with Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase to Azure.
Worked with Azure Blob and Data Lake Storage, loading data into Azure Synapse Analytics (DW).
Involved in creating Pipelines and Datasets to load the data onto Data Warehouse.
Built a data warehouse on the Azure platform using Azure Databricks and Data Factory.
Implemented a Python-based distributed random forest via Python streaming.
Worked on collecting large data sets using Python scripting.
Stored DataFrames into Hive as tables using Python (PySpark).
Used Python Packages for processing JSON and HDFS file formats.
Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
Worked on creating tabular models on Azure analysis services for meeting business reporting requirements.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
Worked on all data management activities on the project, including data sources and data migration.
Worked with the data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
Used Azure reporting services to upload and download reports.
Environment: Spark 2.8, Azure Blob, JSON, ADF, Scala 2.13, PySpark 3.0, Azure SQL DB, HDFS, Hive SQL, Azure Data Warehouse, SQL & Agile Methodology
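A minimal sketch of the kind of PySpark UDF and Hive table write mentioned above; the column, business rule, and table names are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf_example").enableHiveSupport().getOrCreate()

    def normalize_member_id(raw_id):
        """Strip non-digits and left-pad a member id to 10 digits (example rule)."""
        if raw_id is None:
            return None
        digits = "".join(ch for ch in raw_id if ch.isdigit())
        return digits.zfill(10)

    normalize_member_id_udf = F.udf(normalize_member_id, StringType())

    members = spark.table("staging.members")
    cleaned = members.withColumn("member_id", normalize_member_id_udf(F.col("member_id_raw")))

    # Store the DataFrame into Hive as a table, as described in the bullets above.
    cleaned.write.mode("overwrite").saveAsTable("curated.members")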
Colorado Access, Aurora, CO Jul 2019 to Jul 2020
Senior Data Engineer
Responsibilities:
Integrated services like Bitbucket, AWS CodePipeline, and AWS Elastic Beanstalk to create a deployment pipeline.
Created S3 buckets in the AWS environment to store files, some of which are required to serve static content for a web application.
Configured S3 buckets with various lifecycle policies to archive infrequently accessed data to appropriate storage classes based on requirements.
Possess good knowledge in creating and launching EC2 instances using AMIs of Linux, Ubuntu, RHEL, and Windows, and wrote shell scripts to bootstrap instances.
Used IAM to create roles, users, and groups and implemented MFA to provide additional security to the AWS account and its resources; used AWS ECS and EKS for Docker image storage and deployment.
Used Bamboo pipelines to drive all microservice builds out to the Docker registry and then deployed them to Kubernetes; created and managed pods using Kubernetes.
Designed an ELK system to monitor and search enterprise alerts; installed, configured, and managed the ELK Stack for log management within EC2 / Elastic Load Balancer for Elasticsearch.
Created development and test environments for different applications by provisioning Kubernetes clusters on AWS using Docker, Ansible, and Terraform.
Worked on deployment automation of all the microservices to pull images from the private Docker registry and deploy to a Docker Swarm cluster using Ansible.
Installed Ansible Registry for local upload and download of Docker images, as well as from Docker Hub.
Created and maintained highly scalable and fault-tolerant multi-tier AWS and Azure environments spanning multiple availability zones using Terraform and CloudFormation.
Implemented domain name service (DNS) through Route 53 to have highly available and scalable applications; maintained monitoring and alerting of production and corporate servers using the CloudWatch service.
Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR.
Migrated on-premises database structures to the Confidential Redshift Data Warehouse.
Wrote various data normalization jobs for new data ingested into Redshift.
Wrote scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.
The data was ingested into this application using Hadoop technologies like Pig and Hive.
Worked on AWS Data Pipeline to configure data loads from S3 into Redshift (see the COPY sketch below).
Used a JSON schema to define table and column mappings from S3 data to Redshift.
Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
Created EBS volumes for storing application files for use with EC2 instances whenever they are mounted to them.
Experienced in creating RDS instances to serve data through servers for responding to requests. Automated regular tasks using Python code and leveraged Lambda function wherever required.
Knowledge on Containerization Management and setup tools Kubernetes and ECS.
Environment: AWS (EC2, S3, EBS, ELB, RDS, SNS, SQS, VPC, CloudFormation, CloudWatch, ELK Stack), Bitbucket, Ansible, Python, Shell Scripting, PowerShell, ETL, AWS Glue, Jira, JBoss, Bamboo, Docker, WebLogic, Maven, WebSphere, Unix/Linux
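A hedged sketch of the S3-to-Redshift load with a JSONPaths column mapping described above; the bucket, IAM role ARN, cluster endpoint, credentials, and table are placeholders.

    import psycopg2

    COPY_SQL = """
        COPY analytics.web_events
        FROM 's3://example-data-lake/events/2020/07/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS JSON 's3://example-data-lake/jsonpaths/web_events_jsonpaths.json'
        TIMEFORMAT 'auto'
        TRUNCATECOLUMNS;
    """

    # Connect to the (placeholder) Redshift cluster and run the COPY; Redshift
    # pulls the JSON files directly from S3 using the jsonpaths mapping file.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="********",
    )
    try:
        with conn, conn.cursor() as cur:
            cur.execute(COPY_SQL)
    finally:
        conn.close()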
GM Financial, Texas Dec 2018 to Jun 2019
Data Engineer
Responsibilities:
As a Data Engineer worked with the analysis teams and management teams and supported them based on their requirements.
Architected, designed, and developed business applications and data marts for reporting.
Implemented the installation and configuration of a multi-node cluster on the cloud using Amazon Web Services (AWS) EC2.
Developed a reconciliation process to make sure the Elasticsearch index document count matches the source records.
Maintained Tableau functional reports based on user requirements.
Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
Used Agile (SCRUM) methodologies for Software Development.
Developed data pipelines to consume data from Enterprise Data Lake (MapR Hadoop distribution - Hive tables/HDFS) for analytics solution.
Created Hive External tables to stage data and then move the data from Staging to main tables.
Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into the HDFS system.
Developed incremental- and complete-load Python processes to ingest data into Elasticsearch from Hive.
Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
Developed REST services to write data into the Elasticsearch index using Python Flask.
Developed complete end to end Big-data processing in Hadoop eco system.
Used AWS Cloud with Infrastructure Provisioning / Configuration.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
Created dashboards for analysing POS data using Tableau.
Developed Tableau visualizations and dashboards using Tableau Desktop.
Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
Implemented partitioning, dynamic partitions, and buckets in Hive (see the sketch below).
Deployed RMAN to automate backup and maintaining scripts in recovery catalog.
Worked on QA of the data and on adding data sources, snapshots, and caching to the reports.
Environment: AWS, Python, Agile, Hive 2.1, Oracle 12c, Tableau, HDFS, PL/SQL, Sqoop 1.2, Flume 1.6
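An illustrative PySpark sketch of the partition-and-bucket pattern described above (table and column names are made up); note that Spark's bucketBy is the Spark-side analogue of Hive bucketing rather than a byte-identical Hive layout, and the partitions are created dynamically by partitionBy on write.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pos_sales_load")
             .enableHiveSupport()
             .getOrCreate())

    # Source: external staging table registered in the Hive metastore.
    staged = spark.table("staging.pos_sales_ext")

    # Target: partitions are created dynamically from sale_date values in the data,
    # and rows are bucketed by store_id to speed up joins and sampling on that key.
    (staged.write
        .partitionBy("sale_date")
        .bucketBy(16, "store_id")
        .sortBy("store_id")
        .format("orc")
        .mode("overwrite")
        .saveAsTable("curated.pos_sales"))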
Hill Soft Solutions, India Jan 2017 to Dec 2017
Hadoop Engineer
Responsibilities:
Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily data.
Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
Imported data from different sources like HDFS and HBase into Spark RDDs.
Developed Spark scripts using Python shell commands as per the requirements.
Issued SQL queries via Impala to process the data stored in HDFS and HBase.
Used the Spark - Cassandra Connector to load data to and from Cassandra.
Used RESTful web services APIs to connect with MapR tables; the database connection was developed through the RESTful web services API.
Involved in developing Hive DDLs to create, alter, and drop Hive tables, and worked with Storm and Kafka.
Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
Experience in data migration from RDBMS to Cassandra. Created data-models for customer data using the Cassandra Query Language.
Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
Involved in developing Spark scripts for data analysis in both Python and Scala. Designed and developed various modules of the application with J2EE design architecture.
Implemented modules using Core Java APIs, Java collection and integrating the modules.
Experienced in transferring data from different data sources into HDFS using Kafka producers, consumers, and Kafka brokers.
Installed Kibana using Salt scripts and built custom dashboards that visualize aspects of the important data stored in Elasticsearch.
Used File System Check (FSCK) to check the health of files in HDFS and used Sqoop to import data from SQL Server to Cassandra.
Streamed transactional data to Cassandra using Spark Streaming and Kafka (see the sketch below).
Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
Created ConfigMap and DaemonSet files to install Filebeat on Kubernetes pods to ship log files to Logstash or Elasticsearch, monitoring the different types of logs in Kibana.
Created a database in InfluxDB, worked on the interface created for Kafka, and checked the measurements on the databases.
Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics, partitions, etc.
Generated consumer group lag metrics from Kafka using its API.
Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
Involved in creating Hive tables, and loading and analyzing data using Hive queries.
Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
Loaded data from different sources (databases & files) into Hive using the Talend tool.
Used Oozie and Zookeeper operational services for coordinating the cluster and scheduling workflows.
Implemented Flume, Spark, and Spark Streaming framework for real time data processing.
Environment: Hadoop, Python, HDFS, Hive, Scala, MapReduce, Agile, Cassandra, Kafka, Storm, AWS, YARN, Spark, ETL, Teradata, NoSQL, Oozie, Java, Talend, Linux, Kibana, HBase
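A hedged sketch of streaming transactional data from Kafka into Cassandra with Spark Structured Streaming; it assumes the Kafka source package and the DataStax spark-cassandra-connector are on the classpath, and the brokers, topic, keyspace, and schema are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("txn_stream").getOrCreate()

    txn_schema = StructType([
        StructField("txn_id", StringType()),
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read raw transaction events from the Kafka topic.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "transactions")
           .load())

    txns = (raw.select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
               .select("t.*"))

    def write_to_cassandra(batch_df, batch_id):
        # Micro-batch write through the Cassandra connector.
        (batch_df.write.format("org.apache.spark.sql.cassandra")
         .options(table="transactions", keyspace="payments")
         .mode("append")
         .save())

    query = (txns.writeStream
             .foreachBatch(write_to_cassandra)
             .option("checkpointLocation", "/tmp/checkpoints/txn_stream")
             .start())
    query.awaitTermination()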
Smart Software Tech. Dev. Pvt. Ltd., India May 2014 – Dec 2016
Data Engineer/Data Analyst
Responsibilities:
Understood the data visualization requirements from the business users.
Wrote SQL queries to extract data from the sales data marts as per the requirements.
Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
Designed and deployed rich graphic visualizations with drill-down and drop-down menu options, parameterized using Tableau.
Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
Explored traffic data from databases, connected it with transaction data, and presented and wrote reports for every campaign, providing suggestions for future promotions.
Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis (see the sketch below).
Data cleaning, merging, and exporting of the dataset were done in Tableau Prep.
Carried out data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
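A small sketch of the SQL-extract-to-Excel/Python step above; the server, database, query, and output path are placeholders for illustration.

    import pandas as pd
    import pyodbc

    # Placeholder SQL Server connection; real credentials would come from a config.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sales-db.example.local;DATABASE=SalesMart;"
        "UID=report_user;PWD=********"
    )

    query = """
        SELECT campaign_id, region, SUM(sales_amount) AS total_sales
        FROM dbo.fact_sales
        WHERE sale_date >= '2016-01-01'
        GROUP BY campaign_id, region
    """

    # Pull the result set into a DataFrame for further analysis in Python...
    df = pd.read_sql(query, conn)

    # ...and hand a copy to the business users as an Excel workbook.
    df.to_excel("campaign_sales_summary.xlsx", index=False)
    conn.close()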
Education:
Master's in Computer Science from Lamar University, Beaumont, Texas.
Bachelor's in Computer Science from JNTU, Hyderabad.