
Poojitha.V - Data Engineer

469-***-****

adzahg@r.postjobfree.com

Professional Summary:

●Dynamic and motivated IT professional with over 8 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.

●In-depth knowledge of Hadoop architecture and its components, including YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.

●Extensive experience in Hadoop-based development of enterprise-level solutions utilizing components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.

●Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

●Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark, including SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.

●Handled ingestion of data from different sources into HDFS using Sqoop and Flume, and performed transformations using Hive and MapReduce before loading the data into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables.

●Good working experience with Apache Hadoop ecosystem components such as MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, Flume, HBase, and Zookeeper.

●Extensive experience working on Spark, performing ETL using Spark Core and Spark SQL and real-time data processing using Spark Streaming.

●Extensively worked with Kafka as middleware for real-time data pipelines.

●Wrote UDFs in Java and integrated them with Hive and Pig.

●Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators, including PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator (see the DAG sketch at the end of this summary).

●Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create an SQL layer on HBase.

●Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.

●Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.

●Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset, Lookup file set, Complex flat file, Modify, Aggregator, XML.

●Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.

●Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).

●Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on various applications using Python IDEs such as Sublime Text and PyCharm.

●Built and productionized predictive models on large datasets by utilizing advanced statistical modeling, machine learning, and other data mining techniques.

●Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.
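As an illustration of the Airflow scheduling work described above, below is a minimal DAG sketch in Python (Airflow 2.x-style imports assumed); the DAG id, schedule, task names, and the fetch_data callable are hypothetical placeholders rather than a specific production pipeline.

# Minimal Airflow DAG sketch; ids, schedule, and callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def fetch_data(**context):
    # Placeholder extract step: in practice this would call a source API
    # and stage the payload for downstream tasks.
    print("fetching data for", context["ds"])


with DAG(
    dag_id="example_ingest_dag",        # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=fetch_data)
    load = BashOperator(task_id="load", bash_command="echo 'load step placeholder'")

    extract >> load                     # run the extract task before the load task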

Technical Skills:

Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, YARN, Apache Spark, Mahout, Spark MLlib

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos DB

Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL

Cloud Technologies: AWS (Lambda, EC2, EMR, S3, Kinesis, SageMaker), Microsoft Azure, GCP

Frameworks: Django REST framework, MVC, Hortonworks

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

Versioning tools: SVN, Git, GitHub (Version Control)

Network Security: Kerberos

Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Monitoring Tool: Apache Airflow, Agile, Jira, Rally

Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Associative rules, NLP and Clustering.

Machine Learning Tools: Scikit-learn, Pandas, TensorFlow, SparkML, SAS, R, Keras

Experience

Client: Capital One, Plano, TX March 2021 - Present

Role: Data Engineer

Responsibilities:

●Working with the AWS platform and data technologies (Python, SQL, Snowflake) to build pipelines that improve data reliability and efficiency.

●Automating Python programs and scripts to fetch data through APIs, load it into Snowflake, and build Tableau dashboards for end users.

●Developing, building, and executing programs and scripts to automate various tasks using different programming languages and tools such as Python, Airflow, SQL, Snowflake, and Tableau.

●Documenting and assisting in the resolution of cyber issues by generating daily security reports with Python scripts that handle configuration and infrastructure vulnerabilities.

●Worked on NiFi processors to create data pipelines that copy data from JMS MQ to Kafka topics, with in-flight processing such as JSON-to-XML conversion.

●Created a custom NiFi processor to process data with business logic.

●Involved in the development of data frameworks in Python, Java, and Scala.

●Wrote shell scripts to initiate jobs with the required features and environment.

●Monitored the Spark Web UI, DAG scheduler, and YARN ResourceManager UI to optimize queries and performance in Spark.

●Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation using PySpark.

●Developed Spark Streaming jobs in Scala to consume data from Kafka topics, transform the data, and insert it into HBase.

●Worked with different file formats, including Parquet, Avro, and ORC.

●Created a framework using Spark Streaming and Kafka to process data in real time and feed it to APIs (the pattern is sketched below).
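A minimal PySpark sketch of the Spark-plus-Kafka streaming pattern referenced above; the broker address, topic name, schema, and sink path are hypothetical placeholders, and the production framework fed HBase and downstream APIs rather than the simple file sink shown here.

# Structured Streaming sketch: read JSON events from Kafka and write them out
# continuously. Broker, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                       # placeholder topic
       .load())

# Kafka delivers the message body as bytes; cast and parse the JSON payload.
events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/tmp/events")                     # placeholder sink
         .option("checkpointLocation", "/tmp/events_chk")
         .start())

query.awaitTermination()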

Environment: Hive, Spark, Spark SQL, Kafka, Spark Streaming, Scala, NiFi, AWS EC2, AWS EMR, AWS S3, Unix shell scripting, HBase (NoSQL), CI/CD, Control-M, YARN, Jenkins, JIRA, JDBC.

Client: Walgreens, Wheeling, IL Sept 2019 - March 2021

Role: Data Engineer

Responsibilities:

●Evaluated and extracted/transformed data for analytical purposes within a big data environment.

●In-depth understanding of Spark architecture, including Spark Core, Spark SQL, and DataFrames.

●Developed Spark applications using Python (PySpark) to transform data according to business rules.

●Used Spark SQL to load data into Hive tables and wrote queries to fetch data from those tables.

●Experienced in performance tuning of Spark applications, setting the correct level of parallelism and tuning memory.

●Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, and effective, efficient joins.

●Sourced data from various systems into the Hadoop ecosystem using big data tools like Sqoop.

●Worked with Oracle and Teradata for data import/export operations from different data marts.

●Involved in creating Hive tables, loading them with data, and writing Hive queries.

●Worked extensively with data migration, data cleansing, and data profiling.

●Tuned Hive to improve performance and solved performance issues in Hive with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.

●Implemented partitioning, dynamic partitioning, and bucketing in Hive (a sketch appears at the end of this section).

●Modeled Hive partitions extensively for data separation.

●Involved in building applications using Maven and integrating with continuous integration servers like Jenkins to build jobs.

●Expertise in Hive queries; created user-defined aggregate functions and worked on advanced optimization techniques.

●Implemented ETL code to load data from multiple sources into HDFS using Pig scripts.

●Automation tools like Oozie were used for scheduling jobs.

●Exported the analyzed data to relational databases using Sqoop for visualization and report generation.

●Responsible for building scalable distributed data solutions using Hadoop. Managed data coming from different sources and was involved in HDFS maintenance and loading of structured and unstructured data.

●Written multiple MapReduce programs in Java for Data Analysis.

●Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.

●Loaded data from various data sources into HDFS.

●Experienced in migrating HiveQL to minimize query response time.

●Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements.

●Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.

●Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, Data frames and Spark SQL APIs.

●Worked on Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement.

●Experienced in handling different types of joins in Hive.

●Implemented the Databricks API in a Scala program to push the processed data to AWS S3.

●Responsible for performing extensive data validation using Hive.

●Used storage formats like Avro to access multi-column data quickly in complex queries.

●Experience working with EMR clusters in the AWS cloud and working with S3, Redshift, and Snowflake.

●Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.

●Used Pig as an ETL tool to perform transformations, event joins, filtering, and some pre-aggregations.

●Worked on different file formats and different Compression Codecs.

●Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
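The Hive partitioning and bucketing work called out earlier in this section could look roughly like the sketch below, run here through PySpark with Hive support; the database, table, and column names are illustrative assumptions, and the bucketed write uses Spark's own DataFrameWriter bucketing rather than a Hive-side DDL.

# Sketch of dynamic partitioning and bucketing; all table and column names
# below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow fully dynamic partition inserts.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_by_date (
        order_id STRING,
        store_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert: the partition value (sale_date) comes from the data.
spark.sql("""
    INSERT OVERWRITE TABLE sales_by_date PARTITION (sale_date)
    SELECT order_id, store_id, amount, sale_date
    FROM sales_staging
""")

# Bucketing by store_id using Spark's DataFrameWriter (requires saveAsTable).
(spark.table("sales_by_date")
 .write
 .mode("overwrite")
 .bucketBy(32, "store_id")
 .sortBy("store_id")
 .saveAsTable("sales_bucketed"))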

Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Sqoop, Oozie, Java, Linux, Maven, Teradata, Zookeeper, Git, AutoSys, HBase, CI/CD, Spark, Python, Cloudera Manager CM 5.1.1, PuTTY.

Client: JP Morgan Chase, Columbus, OH December 2018 - August 2019

Role: Hadoop Developer

Responsibilities:

●Involved in creating Hive tables, loading data, and analyzing it using Hive queries.

●Optimized existing Hive scripts using many Hive optimization techniques.

●Built reusable Hive UDF libraries that business users could reuse.

●Built and implemented Apache Pig scripts to load data from and store data into Hive.

●Involved in loading data from the Linux file system to HDFS.

●Implemented Flume to import streaming log data and aggregate the data into HDFS.

●Imported data from various data sources, performed transformations using Spark, and loaded the data into Hive.

●Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.

●Used Scala to write the code for all the use cases in Spark.

●Monitored the Hadoop cluster using tools like Ambari and Cloudera Manager.

●Explored various modules of Spark and worked with DataFrames, RDDs, and SparkContext.

●Performed data analysis using Spark with Scala.

●Created RDDs, DataFrames, and Datasets.

●Experience in using DStreams in Spark Streaming, accumulators, broadcast variables, various levels of caching, and optimization techniques in Spark.

●Created numerous Spark Streaming jobs that pull JSON messages from Kafka topics, parse them in flight using Java code, and land them on the Hadoop platform.

●Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets the data from Kafka in near real time and persists it into databases.

●Experience working on Spark performing ETL using Spark SQL; loaded all datasets into Hive from source CSV files using Spark (see the sketch at the end of this section).

●Experience in working with NoSQL databases like HBase, Cassandra, and MongoDB.

●Worked closely with admin team on Configuring Zookeeper and used Zookeeper to co-ordinate cluster services.

●Created views in Tableau Desktop that were published to the internal team for review, further data analysis, and customization using filters and actions.
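A short PySpark sketch of the CSV-to-Hive loading mentioned above; the input path, database, and table name are assumptions made for illustration, not the actual source files.

# Sketch: load source CSV files into a Hive table with Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read the source CSVs with a header row and inferred column types.
learners = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/data/source/learners/*.csv"))   # placeholder path

# Persist as a Hive table so downstream Hive/Spark SQL queries can use it.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
learners.write.mode("overwrite").saveAsTable("analytics.learner_data")

# Example downstream query against the loaded table.
spark.sql("SELECT COUNT(*) AS row_count FROM analytics.learner_data").show()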

Environment: Hive, MapReduce, Ambari, Spark, Knox, Spark SQL, Spark Streaming, Scala, Kafka, Zookeeper, Oozie, Unix shell/Bash scripting, Python, Tableau, YARN, JIRA, JDBC

Client: NetApp, India June 2014 - July 2017

Role: Big Data Developer

Responsibilities:

●As a Big Data Developer, I worked on Hadoop ecosystem components including Hive, HBase, Oozie, Pig, Zookeeper, Spark Streaming, and MCS (MapR Control System) with the MapR distribution.

●Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.

●Built code for real-time data ingestion using Java, MapR Streams (Kafka), and Storm.

●Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.

●Worked on Apache Solr, which is used as the indexing and search engine.

●Involved in the development of the Hadoop system and in improving multi-node Hadoop cluster performance.

●Worked on analyzing the Hadoop stack and different big data tools including Pig, Hive, the HBase database, and Sqoop.

●Developed data pipelines using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.

●Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

●Worked with different data sources like Avro data files, XML files, JSON files, SQL server and Oracle to load data into Hive tables.

●Used J2EE design patterns like Factory pattern & Singleton Pattern.

●Used Spark to create structured data from large amounts of unstructured data from various sources.

●Implemented the usage of Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

●Performed transformations, cleaning, and filtering on imported data using Hive, MapReduce, and Impala and loaded the final data into HDFS.

●Developed Python scripts to find SQL injection vulnerabilities in SQL queries.

●Experienced in designing and developing POC’s in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.

●Responsible for coding MapReduce program, Hive queries, testing and debugging the MapReduce programs.

●Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into Cassandra (the core write pattern is sketched at the end of this section).

●Involved in the process of data acquisition, data pre-processing and data exploration of telecommunication project in Scala.

●Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.

●Specified the cluster size, resource pool allocation, and Hadoop distribution by writing specification texts in JSON format.

●Imported weblogs and unstructured data using Apache Flume and stored the data in Flume channels.

●Exported event weblogs to HDFS by creating an HDFS sink, which directly deposits the weblogs in HDFS.

●Used RESTful web services with MVC for parsing and processing XML data.

●Utilized XML and XSL Transformation for dynamic web-content and database connectivity.

●Involved in loading data from the Unix file system to HDFS. Involved in designing the schema, writing CQL queries, and loading data using Cassandra.

●Built the automated build and deployment framework using Jenkins, Maven etc.
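The real-time Cassandra load described above was written in Scala against Spark Streaming; the core DataFrame-to-Cassandra write it relied on could look roughly like this PySpark sketch. It assumes the DataStax spark-cassandra-connector is on the classpath, and the keyspace, table, and column names are illustrative placeholders.

# Sketch of a DataFrame write to Cassandra via the spark-cassandra-connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-write-sketch").getOrCreate()

# In the real job this DataFrame was produced from a Spark Streaming RDD;
# a small literal DataFrame stands in for it here.
events = spark.createDataFrame(
    [("dev-01", "OK"), ("dev-02", "DEGRADED")],
    ["device_id", "status"],
)

(events.write
 .format("org.apache.spark.sql.cassandra")
 .options(table="device_status", keyspace="telemetry")   # placeholder names
 .mode("append")
 .save())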

Environment: Hadoop, Hive, HBase, Oozie, Pig, Zookeeper, MapR, HDFS, MapReduce, Java, MS, Jenkins, Agile, Apache Solr, Apache Flume, Amazon EMR, Spark, Scala, Cassandra, Apache Kafka, MVC

Education:

Graduated in Computer Science from Gitam University, India

Master's in Computer Science from the University of Illinois, Springfield.


