
Data Analyst Engineer

Location:
Charlotte, NC

Meghana Reddy Nayakanti

adrr96@r.postjobfree.com

+1-205-***-****

Data Engineer

4+ years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytical solutions with HDFS, MapReduce, Spark, PySpark, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.

Professional Summary:

●3 years of experience in Hadoop components like MapReduce, Flume, Kafka, Pig, Hive, Spark, HBase, Oozie, Sqoop, PySpark and Zookeeper.

●Experience working with different Hadoop distributions such as CDH and Hortonworks; good knowledge of the MapR distribution and Amazon EMR.

●Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

●Experience in analyzing data using Python, SQL, PySpark, Spark SQL for Data Mining, Data Cleansing and Machine Learning.

●Experience with the Requests, ReportLab, NumPy, SciPy, PyTables, Matplotlib, httplib2, urllib2, Beautiful Soup, and pandas (DataFrames) Python libraries during the development lifecycle.

●Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, extending default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing.

●Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.

●Experience developing MapReduce programs on Apache Hadoop for analyzing big data as per requirements.

●Strong knowledge of distributed systems architecture and parallel processing; in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

●Good experience in creating data ingestion pipelines, data transformations, data management, data governance, and real-time streaming at an enterprise level.

●Solid experience using various file formats such as CSV, TSV, Parquet, ORC, JSON, and Avro.

●Experience working with data lakes: repositories of data stored in its natural/raw format, usually as object blobs or files.

●Excellent understanding of Data Ingestion, Transformation and Filtering.

●Provided output for multiple stakeholders at the same time.

●Coordinated with the Machine Learning team to perform data visualization using Power BI and Tableau.

●Developed Spark and Scala applications for performing event enrichment, data aggregation, and denormalization for different stakeholders.

●Designed new data pipelines and made existing pipelines more efficient.

●Expert in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.

●In-depth understanding of Hadoop architecture and its various components such as YARN, ResourceManager, ApplicationMaster, NameNode, DataNode, and HBase design principles.

●Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.

●Experience migrating data from RDBMS and unstructured sources into HDFS, and back, using Sqoop.

●Experience in job workflow scheduling and monitoring tools like Oozie and good knowledge on Zookeeper to coordinate the servers in clusters and to maintain the data consistency.

●Profound understanding of Partitions and Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.

●Worked on NoSQL databases like HBase, Cassandra and MongoDB.

●Experienced in performing CRUD operations using the HBase Java Client API and the Solr API.

●Good experience working with cloud environments such as Amazon Web Services (AWS) EC2 and S3.

●Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins, and AWS.

●Experience writing Shell scripts in Linux OS and integrating them with other solutions.

●Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.

●Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.

●Hands-on experience fetching live-stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka (an illustrative streaming sketch follows this summary).

●Good knowledge of creating data pipelines in Spark using Scala.

●Experience developing Spark programs for batch and real-time processing, including Spark Streaming applications.

●Good knowledge of Spark components such as Spark SQL, MLlib, Spark Streaming, and GraphX.

●Experience in data processing such as collecting, aggregating, and moving data from various sources using Apache Kafka.

●Experienced in moving data from Hive tables into Cassandra for real-time analytics, and in using Cassandra Query Language (CQL) to perform analytics on time-series data.

●Good knowledge of custom UDFs in Hive and Pig for data filtering.

●Excellent communication, interpersonal, and analytical skills; a highly motivated team player with the ability to work independently.
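The following is a minimal, illustrative sketch of the near-real-time pattern referenced above, written with Spark Structured Streaming's Kafka source in PySpark; the broker address, topic, event schema, and output paths are hypothetical placeholders rather than details from any of the projects listed below.

```python
# Minimal PySpark Structured Streaming sketch: consume a Kafka topic and
# land the parsed events as Parquet for downstream dashboards.
# Requires the spark-sql-kafka connector package on the classpath.
# Broker address, topic name, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Assumed event schema, for illustration only.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                      # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; decode the value and parse the JSON payload.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "/data/streams/events")            # placeholder sink
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```

In practice the parsed stream could just as well be written to HBase or Cassandra through the appropriate connector instead of Parquet files.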

Education:

●Master’s in Data Science, University of Alabama at Birmingham (2019-2020).

●Bachelor’s in Information Science, M.V.J College of Engineering, Bengaluru, Karnataka, India (2012-2016).

Skill Set:

Big Data Space: Hadoop, MapReduce, Pig, Hive, HBase, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark, Ambari, Elasticsearch, Solr, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy, AWS, PySpark

Hadoop Distributions: Cloudera (CDH3, CDH4, CDH5), Hortonworks, MapR, Amazon EMR

Databases & Warehouses: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, Teradata

Java Space: Core Java, J2EE, JDBC

Languages: Python, Java, SQL, PL/SQL, Scala, JavaScript, Shell Scripts, C/C++

Web Technologies: HTML, CSS, JavaScript, DOM, XML, XSLT

IDEs: Eclipse, IntelliJ IDEA

Operating Systems: UNIX, Linux, macOS, Windows

RDBMS: Teradata, Oracle 9i/10g/11g, MS SQL Server, MySQL, DB2

Version Control: Git, SVN, CVS

ETL Tools: Informatica, Ab Initio, Talend

Reporting: Tableau, Spotfire, Power BI

Professional Experience:

Client: CapitalOne
Location: Remote (Plano, TX)
Designation: Data Engineer
Duration: May 2021 - Present

Responsibilities:

●Migrated data from on-premises systems to AWS storage buckets.

●Developed a Python script to transfer data from on-premises systems to AWS S3.

●Developed a Python script to call REST APIs and extract data to AWS S3 (see the ingestion sketch at the end of this section).

●Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.

●Created YAML files for each data source, including Glue table stack creation.

●Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.

●Developed Lambda functions with assigned IAM roles to run Python scripts, along with various triggers (SQS, EventBridge, SNS).

●Created a Lambda deployment function and configured it to receive events from S3 buckets (a handler sketch appears at the end of this section).

●Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.

●Implemented data streaming capability using Kafka and Talend for multiple data sources.

●Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.

●Experience in designing star and snowflake schemas for data warehouse and ODS architectures.

●Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.

●Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.

●Experience in writing stored procedures and complex SQL queries using relational databases like Oracle, SQL Server and MySQL.

●Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.

●Good knowledge of data marts, OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.

●Experienced in version control tools like GIT and ticket tracking platforms like JIRA.

●Experienced in using Atlassian tools like Jira and Bitbucket.
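Below is a minimal, illustrative sketch of the two Python ingestion patterns mentioned above (transferring on-premises files to S3 and extracting from a REST API into S3), using boto3 and requests; the bucket name, prefix, and API endpoint are hypothetical placeholders, not details from the project.

```python
# Sketch of two ingestion patterns: copy local (on-premises) files to S3,
# and pull JSON from a REST API and land it in S3.
# Bucket, prefix, and endpoint values are hypothetical placeholders.
import json
import pathlib

import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "example-landing-bucket"   # placeholder bucket name


def upload_directory(local_dir: str, prefix: str) -> None:
    """Upload every file under local_dir to s3://BUCKET/prefix/..."""
    for path in pathlib.Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir)}"
            s3.upload_file(str(path), BUCKET, key)


def extract_api_to_s3(url: str, key: str) -> None:
    """Call a REST endpoint and write the JSON response to S3."""
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(resp.json()))


if __name__ == "__main__":
    upload_directory("/data/exports", "onprem/exports")                      # placeholder path
    extract_api_to_s3("https://api.example.com/v1/orders", "api/orders.json")  # placeholder endpoint
```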

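And a minimal sketch of a Lambda handler receiving S3 events, as in the deployment function described above; it assumes the function's execution role grants read/write access to the bucket, and the copy-to-a-processed-prefix step is illustrative only.

```python
# Sketch of an AWS Lambda handler wired to S3 ObjectCreated events.
# Assumes the function's IAM execution role can read/write the bucket;
# the "processed/" destination prefix is an illustrative placeholder.
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Example action: copy the newly created object under a processed/ prefix.
        s3.copy_object(
            Bucket=bucket,
            Key=f"processed/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )

    return {"status": "ok", "records": len(records)}
```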
Environment: Python, Scala, Harmony, Shell scripting, Linux, Oracle, Spotfire, Git, Atlassian tools, Agile methodologies, Hadoop, Hive, Presto, Snowflake, MicroStrategy, ThoughtSpot, Tableau, PyCharm, Airflow.

Client: P&G
Location: Fractal Analytics (India)
Designation: Data Engineer (Consultant)
Duration: April 2018 - May 2019

Responsibilities:

●Partnered with third-party vendors to securely load their data from file systems into SQL Server databases.

●Developed workflows based on the schema designed per requirements.

●Installed, configured, and maintained data pipelines.

●Worked with data analysts as part of product and business intelligence teams.

●Responsible for cleansing data from source systems using Ab Initio components such as Join, Dedup Sorted, Denormalize, Normalize, Reformat, Filter by Expression, and Rollup.

●Developed Python scripts to connect to various databases (Teradata, SQL Server, Snowflake) to retrieve and consolidate data available within P&G for monthly data analysis on the Corporate Planning project (a consolidation sketch appears at the end of this section).

●Developed Python scripts to access Amazon S3 source data used for running the BAU (business-as-usual) monthly cycles and to load the data into different databases.

●Analyzed SQL scripts and designed the solutions to implement using PySpark.

●Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target destinations (see the PySpark sketch at the end of this section).

●Connected MySQL to Tableau to create a dynamic dashboard for the analytics team.

●Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.

●Developed a holistic database with retail panel data provided by different vendors for different countries in the Asia Pacific region, harmonizing data at different levels of granularity.

●Developed interactive dashboards with a monthly refresh so that the leadership team could make timely decisions.

●Undertook ad-hoc requests for building dashboards and reports to enable more insightful decisions for different markets.

●Experienced in creating and managing event handlers, package configurations, logging, and system and user-defined variables for SSIS packages.

●Configured, supported, and maintained all network, firewall, storage, load balancers, operating systems, and software in AWS EC2.

●Implemented Amazon EMR for big data processing on a Hadoop cluster of virtual servers backed by Amazon EC2 and S3.

●Good knowledge of data manipulation, tombstones, and compactions in Cassandra; well experienced in avoiding faulty writes and reads.

●Performed data analysis with Cassandra using Hive External tables.

●Designed the Column families in Cassandra.

●Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.

●Configured a continuous integration system to execute suites of automated tests at desired frequencies using Jenkins, Maven, and Git.

●Involved in loading data from LINUX file system to HDFS.

●Followed Agile Methodologies while working on the project.
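Below is a minimal, illustrative sketch of the PySpark read-merge-enrich-load pattern referenced above; the paths, join key, and derived column are hypothetical placeholders rather than details from the project.

```python
# Sketch of a PySpark batch task: read two external sources, merge them,
# add a derived column, and write the result to a target destination.
# Paths, join key, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("merge-enrich-load-sketch").getOrCreate()

# Read external sources (a CSV extract and a Parquet reference table).
sales = (spark.read.option("header", "true").option("inferSchema", "true")
         .csv("s3a://example-bucket/raw/sales/"))
products = spark.read.parquet("s3a://example-bucket/reference/products/")

# Merge and enrich: join on a shared key and derive a revenue column.
enriched = (sales.join(products, on="product_id", how="left")
            .withColumn("revenue", F.col("quantity") * F.col("unit_price")))

# Load into the target destination, partitioned by business date.
(enriched.write.mode("overwrite")
 .partitionBy("business_date")
 .parquet("s3a://example-bucket/curated/sales_enriched/"))
```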

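And a minimal sketch of the cross-database consolidation pattern referenced above, stacking the same monthly extract from several sources with pandas; the query, table, and column names are assumptions, and the connector libraries are only suggested in the comments.

```python
# Sketch of consolidating a monthly extract from several databases into one
# pandas DataFrame. DBAPI connections are passed in by the caller; in practice
# they might come from libraries such as teradatasql, pyodbc (SQL Server), or
# the Snowflake connector -- those choices, the table, and the columns are
# assumptions for illustration only.
import pandas as pd

MONTHLY_QUERY = "SELECT market, category, sales_amount, period FROM monthly_sales"


def consolidate(connections: dict, period: str) -> pd.DataFrame:
    """Run the same extract against each connection and stack the results."""
    frames = []
    for source_name, conn in connections.items():
        df = pd.read_sql(MONTHLY_QUERY, conn)
        df = df[df["period"] == period]      # keep only the current cycle's period
        df["source"] = source_name           # tag rows with their origin system
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```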
Environment: Postgres, Alteryx, Knime, VBA, AWS EC2, S3, EMR, Shell Scripting, Scala, Agile methods, MySQL, Linux, MongoDB, Python, Git, JIRA

Client: Reckitt Benckiser
Location: Fractal Analytics (India)
Designation: Data Analyst (Associate)
Duration: June 2016 - March 2018

Responsibilities:

●Worked on Cuddle, an AI-based app delivering relevant business insights in bite-sized cards on mobile.

●End-to-end ownership of key insight generation for multiple clients, giving them a more detailed view of sales trends and how to optimize them.

●Extracted and manipulated raw data using tools like Alteryx to refine it into more meaningful datasets for better business understanding (an analogous pandas sketch appears at the end of this section).

●Comprehensive knowledge of Software Development Life Cycle (SDLC).

●Exposure to Waterfall, Agile and Scrum models.

●Highly adept at promptly and thoroughly mastering new technologies with a keen awareness of new industry developments and the evolution of next generation programming solutions.

●Worked on Ab Initio sandboxes and developed graphs.

●Experience using data manipulation tools like Knime for ETL processing.

●Prepared Low-level Design documents and mapping documents.

●Prepared Unit Test Plan Document after the thorough data validation.

●Prepared Hand-Off documents for the maintenance team.

●Provided one month of warranty support.

●Converted business requirements into ETL designs and developed the jobs.

●Developed multi-dimensional cubes to prepare reports and compared table data with the new cubes to ensure end-to-end validation.

●Developed TI scripts to clean the data and transform it to requirements.
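Below is an analogous pandas sketch of the kind of raw-data refinement described above; the original workflows were built in visual tools such as Alteryx, so this is only an illustration of the equivalent steps in code, with hypothetical file and column names.

```python
# Analogous pandas sketch of refining raw vendor panel data: standardize
# column names, drop duplicates, fix types, and aggregate to a reporting
# grain. File and column names are hypothetical placeholders.
import pandas as pd

raw = pd.read_csv("vendor_panel_raw.csv")

clean = (raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
         .drop_duplicates()
         .assign(sales_value=lambda d: pd.to_numeric(d["sales_value"], errors="coerce"))
         .dropna(subset=["sales_value"]))

# Aggregate to a country / category / month grain for reporting.
summary = (clean.groupby(["country", "category", "month"], as_index=False)
           ["sales_value"].sum())

summary.to_csv("vendor_panel_refined.csv", index=False)
```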


