
Data Engineer Senior

Location:
Hyderabad, Telangana, India
Posted:
July 02, 2024


Resume:

Archana K

Senior Data Engineer / Hadoop / AWS & Big Data Specialist

Phone: +1-937-***-****

Email Id: ************@*****.***

LinkedIn: https://www.linkedin.com/in/archana-k-a04241304/

Summary:

10+ years of strong experience as a Data Engineer, including requirements analysis, design specification, and testing across the full development life cycle in both Waterfall and Agile methodologies.

Strong experience using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Hue, Spark Streaming, and Oozie.

Strong experience with the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, and Security Groups.

Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka; worked on reading multiple data formats on HDFS using Scala.

Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as relational PostgreSQL.

In-depth knowledge and experience in creating, securing, and managing Databricks clusters running on Amazon EC2 instances.

Experience in building large-scale batch data pipelines with data processing frameworks on the AWS cloud platform using PySpark on EMR and Glue ETL.

Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML. Proficient with columnar file formats such as RC, ORC, and Parquet, and have a good understanding of compression techniques used in Hadoop processing such as Gzip, Snappy, and LZO.

Experience with Microsoft Azure cloud services such as Azure SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, and Azure Data Factory.

Hands-on experience writing MapReduce programs in Java to process different data sets.

Solid experience with AWS services such as CloudFormation, S3, Athena, Glue, EMR/Spark, RDS, Redshift, DataSync, DMS, DynamoDB, Lambda, Step Functions, IAM, KMS, SM, etc.

Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive: wrote HiveQL queries for data extraction and join operations, developed custom UDFs as required, and have good experience optimizing Hive queries.
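As a minimal illustration of this kind of HiveQL work, the PySpark sketch below registers a simple Python UDF and runs a partition-pruned join. The table, column, and function names (transactions, accounts, mask_account) are hypothetical, not details from the original projects.

# Hypothetical sketch: a HiveQL-style join with a custom UDF via PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("hive-udf-demo")
         .enableHiveSupport()
         .getOrCreate())

# Register a simple Python UDF (stand-in for a custom Hive UDF).
def mask_account(acct):
    return "XXXX" + acct[-4:] if acct else None

spark.udf.register("mask_account", mask_account, StringType())

# Join two (made-up) Hive tables and restrict to one partition so the
# query prunes data instead of scanning the full table.
result = spark.sql("""
    SELECT t.txn_id, mask_account(a.account_no) AS account, t.amount
    FROM transactions t
    JOIN accounts a ON t.account_id = a.account_id
    WHERE t.ingest_date = '2024-01-01'
""")
result.show(10)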

Experience in importing and exporting data with Sqoop between HDFS and relational database systems, and loading the data into partitioned Hive tables.

Developed custom Kafka producers and consumers for publishing to and subscribing from various Kafka topics.
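A minimal sketch of that producer/consumer pattern using the kafka-python client; the broker address and the topic name txn-events are assumptions made for illustration.

# Minimal Kafka producer/consumer sketch (kafka-python); broker and topic
# names are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # assumed broker address
TOPIC = "txn-events"           # assumed topic name

# Producer: serialize dicts as JSON and publish to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"txn_id": 123, "amount": 42.50})
producer.flush()

# Consumer: subscribe to the same topic and deserialize each message.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # e.g. {'txn_id': 123, 'amount': 42.5}
    break                  # stop after one message in this demo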

Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.

Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.

Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.

Implemented data quality checks and monitoring systems to ensure the integrity and reliability of data used in AI/ML models.

Worked with Databricks and Snowflake environments.

Hands-on experience with data ingestion tools such as Kafka and Flume, and with the workflow management tool Oozie.

Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to apply the required validations to the data.

Experience in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.

Extensive experience working on various databases and database script development using SQL and PL/SQL.

Worked with various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.

Very strong interpersonal skills and the ability to work both independently and in a group; able to learn quickly and adapt easily to new working environments.

Experience with Azure cloud platforms (HDInsight, Databricks, Blob, Data Factory, Synapse, SQL DB, SQL DWH).

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).

Technical Skills:

Big Data Technologies : HDFS, Yarn, MapReduce, Hive, HBase, Apache Spark, Scala, Kafka.

Programming Languages : SQL, HiveQL, Scala, Python, Unix Shell Scripting, PL/SQL and T-SQL

Hadoop Distributions : Apache Hadoop, Amazon EMR (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS), Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, Cosmos DB, Azure DevOps, Active Directory).

ETL Tools : Informatica, Talend

SDLC : Agile, Scrum

Data Modeling : ER-win, MS Visio

Development Tools : Microsoft SQL Studio, IntelliJ, PyCharm, Eclipse, Docker, Kubernetes, Jenkins, Jira.

Databases : Snowflake, MongoDB, MariaDB, MySQL

Reporting Tools : SSIS, SSRS, SSAS, Tableau and Power BI

Cloud Platform : Amazon Web Services (AWS), Azure

Source control tools : Git, GitHub.

Client: U.S. Bank, Irving, TX March 2022 - Present

Role: Data Engineer

Responsibilities:

Managed Apache Hadoop clusters for efficient processing of large-scale financial datasets, optimizing performance and resource utilization.

Established a robust AWS infrastructure, incorporating S3 for storage, Redshift for analytics, EMR for processing, EC2 for computation, Lambda for serverless architecture, and SNS for event notifications.

Developed and maintained Unix shell scripts to automate data workflows, resulting in a 15% improvement in data processing efficiency.

Implemented Python scripts for ETL processes, facilitating the seamless integration of data from multiple sources into the data warehouse.

Applied Spark for large-scale data processing, optimizing system performance and enabling real-time analytics.

Leveraged Scala for specific data engineering tasks, showcasing a strong commitment to staying abreast of emerging technologies.

Led migration projects to modern data platforms like Cloudera CDP and Hortonworks HDP, ensuring seamless integration and data continuity in the financial sector.

Deployed and managed Amazon EMR services (EMR, EC2, EBS, RDS, S3) for scalable and cost-effective data processing, enhancing operational efficiency.

Developed data pipelines with AWS Glue for automated ingestion, transformation, and loading of financial data, improving data accessibility and integrity.
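The Glue pipelines themselves are not shown in the resume; the following is a hedged sketch of a typical PySpark-based Glue job. The catalog database finance_raw, table transactions, and the output bucket name are placeholders.

# Hedged AWS Glue job sketch: read from the Glue Data Catalog, apply a
# simple transform, and write Parquet to S3. Names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Ingest: read the raw table registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="finance_raw", table_name="transactions"
)

# Transform: drop obviously bad rows using plain Spark SQL operations.
df = raw.toDF().filter("amount IS NOT NULL AND amount > 0")

# Load: write the curated data back to S3 as Parquet.
df.write.mode("overwrite").parquet("s3://example-curated-bucket/transactions/")

job.commit()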

Implemented Elasticsearch for real-time indexing and search capabilities, facilitating data discovery and analytics for financial insights.

Designed serverless architectures using AWS Lambda to automate data processing tasks, optimizing workflows and reducing manual intervention.
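As an illustration of that serverless pattern, here is a small, assumed Lambda handler reacting to an S3 ObjectCreated event; the processed/ prefix and the plain copy step are placeholders for real processing logic.

# Assumed Lambda handler for an S3 ObjectCreated trigger; in a real job the
# copy step would be replaced by validation/transformation logic.
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Copy the incoming object under a "processed/" prefix (placeholder).
        s3.copy_object(
            Bucket=bucket,
            Key=f"processed/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )
    return {"status": "ok"}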

Implemented real-time data streaming solutions with Amazon Kinesis, capturing and analyzing financial transactions and events for actionable insights.
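A short, hypothetical example of publishing one event to a Kinesis data stream with boto3; the stream name txn-stream, the region, and the event fields are invented for illustration.

# Hypothetical Kinesis producer sketch; stream name and region are assumed.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"txn_id": 123, "amount": 42.50, "merchant": "ACME"}
kinesis.put_record(
    StreamName="txn-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["txn_id"]),   # spreads records across shards
)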

Implemented a proof of concept deploying the product on Amazon Web Services (AWS).

Utilized AWS SQS and DynamoDB to build scalable messaging and database solutions, ensuring high availability and data integrity for financial applications.

Optimized query performance and data warehousing with Amazon Redshift, enabling faster analytics and reporting for business stakeholders.

Deployed financial applications in containerized environments using Amazon ECS, ensuring portability and scalability across cloud infrastructure.

Developed data processing algorithms using Scala and Python, driving actionable insights and decision-making in financial analytics.

Environment: HDFS, Hive, Sqoop, Pig, Oozie, Cassandra, MySQL, Kafka, Spark, Redshift, Cornerstone, DBT, Snowflake, Data modeling, Scala, Cloudera Manager (CDH4), Amazon Web Services (AWS), SNS, Glue, Python, Git, SSIS, T-SQL, Jenkins

Client: First Tech Federal Credit Union, San Jose, CA April 2020 - February 2022

Role: Data Engineer

Responsibilities:

Responsible for providing data science solutions in an Agile environment, from data gathering to deliverables.

Responsible for researching and creating AWS architecture in collaboration with data engineers and DevOps team to migrate models from on-premises to AWS cloud.

Worked with the project team to develop and maintain the process of ETL (extract, transform, and load).

Ran SQL scripts and created indexes and stored procedures for data analysis.

Designed various ETL strategies from various heterogeneous sources.

Worked with different data formats such as JSON and XML.

Used Spark SQL and Python on the Spark engine to develop end-to-end ETL pipelines.
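A compact sketch of this kind of Spark SQL/Python ETL pipeline; the S3 paths, table layout, and column names are assumptions, not project specifics.

# Minimal PySpark ETL sketch; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: vendor files previously on-boarded into a landing S3 bucket.
members = spark.read.json("s3a://example-landing/members/")
accounts = spark.read.parquet("s3a://example-landing/accounts/")

# Transform: join, standardize, and drop records missing a member id.
joined = (
    members.join(accounts, "member_id", "left")
    .withColumn("email", F.lower(F.col("email")))
    .filter(F.col("member_id").isNotNull())
)

# Load: write the curated dataset partitioned by load date.
(joined.withColumn("load_date", F.current_date())
       .write.mode("append")
       .partitionBy("load_date")
       .parquet("s3a://example-curated/members_enriched/"))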

Participated in code/design analysis, strategy development, and project planning.

Gathered and imported data from different sources into Spark RDDs for further transformation and analysis; worked with vendors to on-board external data into target S3 buckets.

Deployed code for stream processing using Apache Kafka and Amazon S3.

Monitored and controlled local disk storage and log files using Amazon CloudWatch.

Explored AWS services and Blue Prism classification models for automating document classification.

Mentored analysts across different teams on Python libraries, packages, frameworks, and AWS services.

Automated Python scripts on the server and locally, depending on the job and data size.

Delivered effective presentations of findings and recommendations to multiple levels of leadership by creating visual displays of quantitative information.

Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.

Developed and implemented predictive models using machine learning algorithms such as linear regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering, KNN, PCA and regularization for data analysis.
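As a hedged example of one of the model types listed above, the snippet below trains a random forest classifier with scikit-learn; the features.csv file and the churned target column are hypothetical stand-ins for the real feature tables.

# Hypothetical random forest classifier on a pre-built feature table.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("features.csv")          # assumed feature table
X = df.drop(columns=["churned"])          # predictors
y = df["churned"]                         # binary target (assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
print(classification_report(y_test, model.predict(X_test)))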

Utilized natural language processing (NLP) techniques to optimize customer satisfaction.

Designed rich data visualizations to present data in human-readable form with Matplotlib.

Developed MapReduce/Spark Python modules for predictive analytics and machine learning in Hadoop on AWS.

Worked on data cleaning and ensured data quality, consistency, and integrity using pandas and NumPy.

Used NumPy, SciPy, pandas, NLTK (Natural Language Toolkit), and Matplotlib to build the models.

Applied various artificial intelligence (AI)/machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, and regression models.

Collaborated with cross-functional partners to understand their business needs, formulating end-to-end analyses that include data gathering, analysis, ongoing deliverables, and presentations.

Environment: HDFS, Hive, Sqoop, Pig, Impala, Flume, Oozie, Kafka, Spark, HBase, Unix Shell Scripting, Cloudera, Amazon Web Services (AWS), ETL, Yarn, Spark SQL, Redshift, MongoDB, Databricks, R Programming, Athena, Power BI, Tableau, PySpark, Snowflake, Jenkins, SQL Server (SSIS, SSRS), Python

Client: Optum, Eden Prairie, MN September 2018 - March 2020

Role: Data Engineer

Responsibilities:

Orchestrated Azure cloud migration projects, ensuring seamless integration of insurance systems and data workflows.

Proficient in ETL processes using Informatica, facilitating efficient data movement and transformation.

Expertise in ER-win for data modeling, designing optimized structures for insurance domain datasets.

Managed Snowflake and MongoDB databases, ensuring data integrity, security, and scalability.

Proficient in SQL for querying and managing relational databases, facilitating data analysis and reporting.

Experienced in HiveQL for big data processing, enabling effective management of large datasets.

Utilized Scala for scalable data processing solutions, improving overall system performance and reliability.

Implemented Python scripts for automation, streamlining routine tasks and enhancing operational efficiency.

Conducted regular performance tuning of Azure services, ensuring optimal resource utilization and system responsiveness.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

Hands-on experience developing SQL scripts for automation purposes.

Good understanding of data ingestion, Airflow operators for data orchestration, and related Python libraries.
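A minimal Airflow DAG sketch showing the orchestration pattern referenced above; the DAG id, schedule, and task callables are placeholders rather than details from the original project.

# Minimal Airflow DAG sketch; dag_id, schedule, and tasks are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from source")        # placeholder extract step

def load():
    print("load data into the warehouse")  # placeholder load step

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task   # run extract before load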

Led incident response and resolution efforts, minimizing downtime and ensuring uninterrupted business operations.

Provided mentorship and technical guidance to junior team members, fostering a collaborative learning environment.

Led data analytics projects from the requirements stage through scope analysis, model development, deployment, and support.

Supervised entire data analysis including data collection, data transformation, and data loading.

Determined operational objectives by studying business rules, gathering information, evaluating output, and clarifying requirements and formats.

Prepared data mapping documents for the data migration process based on an analysis of source databases.

Environment: HDFS, Hive, Sqoop, Shell Scripting, Oozie, MapReduce, Spark, Spark SQL, Kerberos, Kafka, Zookeeper, Python, Java, HBase, Cassandra, MongoDB, Falcon, Yarn, MS Azure, Snowflake

Client: Macy's, Duluth, GA December 2016 - August 2018

Role: Data Engineer

Responsibilities:

Worked as a Data Engineer on several Hadoop Ecosystem components with Cloudera Hadoop distribution.

Worked on managing and reviewing Hadoop log files.

Tested and reported defects in an Agile Methodology perspective.

Worked on migrating Pig scripts to Spark and Spark SQL to improve performance.

Extensively involved in writing Oracle PL/SQL stored procedures, functions, and packages.

Loaded data from different sources (databases and files) into Hive using the Talend tool.

Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems.

Interviewed business users to gather requirements and documented the requirements.

Used Flume to collect, aggregate, and store web log data from different sources.

Created data structures to store the dimensions in an effective way to retrieve, delete, and insert data.

Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.

Imported and exported data into HDFS and Hive using Sqoop and Flume.

Used Pattern matching algorithms to recognize the customer across different sources and built risk profiles for each customer using Hive and stored the results in HBase.

Developed and maintained stored procedures, implemented changes to database design including tables.

Ingested data from various sources and processed the Data-at-Rest utilizing Big Data technologies.

Developed advanced PL/SQL packages, procedures, triggers, functions, indexes, and collections to implement business logic using SQL Navigator.

Environment: HBase, Oozie, Hive, Sqoop, SDLC, OLTP, SSAS, SQL, Oracle, PL/SQL, ETL, Flume

Client: RedPine Solutions, Hyderabad, India June 2013 - August 2016

Role: Hadoop Developer

Responsibilities:

Worked on analyzing Hadoop cluster using different big data analytic tools including Pig, Hive, and MapReduce.

Worked with the Data Science team to gather requirements for various data mining projects.

Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.

Worked on debugging, performance tuning of Hive & Pig Jobs.

Involved in running Hadoop jobs to process millions of records and applying compression techniques.

Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.

Worked on tuning the performance of Pig queries.

Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest behavioral data into HDFS for analysis.

Moved relational database data into Hive dynamic partition tables using Sqoop and staging tables.

Optimized Hive queries using partitioning and bucketing techniques to control data distribution.
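To illustrate the partitioning-and-bucketing idea, here is a hedged sketch of the HiveQL DDL, wrapped in PySpark's spark.sql so the example stays self-contained in Python; the table and column names are invented.

# Hedged sketch: a partitioned, bucketed Hive table created via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Partition by date (coarse pruning) and bucket by customer_id (even data
# distribution and faster joins on the bucketed key). Names are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Queries that filter on the partition column only scan matching partitions.
spark.sql(
    "SELECT SUM(amount) FROM sales_bucketed WHERE order_date = '2016-01-01'"
).show()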

Involved in loading data from LINUX file system to HDFS.

Imported and exported data into HDFS and HBase using Sqoop from MySQL.

Experience working on processing semi-structured data using Pig and Hive.

Supported MapReduce programs running on the cluster.

Gained experience in managing and reviewing Hadoop log and JSON files.

Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, Pig, and Sqoop.

Environment: Hadoop, HDFS, HBase, Pig, Hive, MapReduce, Sqoop, Oozie, LINUX, Big Data


