Harini Reddy
*******.************@*****.***
SUMMARY:
•Over 5 years of professional IT experience as a Data Engineer/Data Analyst building data pipelines using the Hadoop big data ecosystem, Spark, Google Cloud Storage, Python, SQL, Tableau, GitHub, and ETL tools.
•Experience writing SQL queries and stored procedures, creating databases, and composing DDL and DML statements.
•Experienced in multiple domains, including Finance, Retail, E-commerce, and Healthcare.
•Used Spark and Google Cloud Storage to build scalable, fault-tolerant infrastructure processing 15 TB of data per day, contributing to a 15% increase in the total number of users.
•Experience with GCP services (BigQuery/Bigtable, Dataproc, Pub/Sub, Dataflow, App Engine, and Looker).
•Proficient with the Hive data warehouse: creating tables, distributing data through partitioning and bucketing strategies, and writing and optimizing HiveQL queries.
•Experience in ingestion, storage, querying, processing, and analysis of big data, with hands-on experience in Apache Spark, Spark SQL, and Spark Streaming.
•Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
•Expertise in collecting, exploring, analyzing, and visualizing data by building Tableau/Looker reports and dashboards.
•Experience in designing, developing, and deploying projects on the GCP suite, including BigQuery/Bigtable, Dataflow, Dataproc, Google Cloud Storage, Composer, and Looker.
•Designed, tested, and maintained data management and processing systems using Spark, GCP, Hadoop, and shell scripting.
•Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of sources using SQL, GCP, and big data technologies.
•Knowledge of Google Cloud Platform (GCP) services including Compute Engine, Cloud Load Balancing, Cloud Storage, Dataproc, Cloud Pub/Sub, Cloud SQL, BigQuery, Stackdriver monitoring, Cloud Spanner, Looker, and Deployment Manager.
•Experience in Azure development, having worked on Azure web applications, Azure Storage, Azure SQL Database, Virtual Machines, Azure Data Factory, HDInsight, Azure Search, and Notification Hubs.
TECHNICAL SKILLS:
Big Data: Apache Spark, HDFS, YARN, Hive, Sqoop, MapReduce, Tez, Ambari, ZooKeeper, data warehousing
Databases: MySQL, SQL Server, DB2, Cassandra, Teradata, BigQuery, Druid
Cloud: Google Cloud Platform (Cloud Storage, BigQuery, Dataproc, Dataflow, Cloud Pub/Sub, Data Catalog), Azure
Methodologies: Agile, Waterfall
Languages: Scala, Python, PySpark, SQL, HiveQL, Shell Scripting
Data Visualization Tools: Looker, Power BI, Microsoft Excel (pivot tables, graphs, charts, dashboards)
Tools: Automic, Hue, Looker, IntelliJ IDEA, Eclipse, Maven, ZooKeeper, VMware, PuTTY, DbVisualizer
PROFESSIONAL EXPERIENCE:
Data Engineer at Miracle Software Systems Jan 2019 – Present
Client: Walmart, Inc., Bentonville, AR
Responsibilities:
•Responsible for building ETL (Extract, Transform, Load) pipelines from the data lake to different databases based on requirements.
•Ensured successful creation of the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of sources using Spark, SQL, HDFS, Hive, MapReduce, Druid, Python, Unix, Hue, and shell scripting.
•Created data ingestion processes to maintain the global data lake on GCP and BigQuery.
•Handled huge volumes (TBs) of data and wrote efficient code using Spark SQL and Beam to load the data into Google Cloud Storage, lowering run time during cloud migration.
•Built catalog tables using batch processing and multiple complex joins to combine dimension tables of store and e-commerce transactions, which receive millions of records every day.
•Developed proof-of-concept prototypes with fast iterations, and developed and maintained design documentation, test cases, monitoring, and performance evaluations using Git, PuTTY, Maven, Confluence, ETL, Automic, ZooKeeper, and Cluster Manager.
•Automated data quality rules and de-duplication processes to keep the data lake accurate.
•Used shell scripting to automate validations between the databases of each module and report data quality to users, using the Aorta and Unified Data Pipeline frameworks.
•Built data pipelines for multiple modules (sales, traffic, wages, etc.) to load data from the on-premises Hadoop cluster to Google Cloud Storage to serve user requirements, and supported all business and engineering teams (650 users).
•Modified and created workflows in Automic to schedule Hive queries, Hadoop commands, and shell scripts on a daily basis.
•Worked on a migration project to move data from different sources to Google Cloud Platform (GCP) using the UDP framework, transforming the data with Spark Scala scripts.
•Independently troubleshot and analyzed complex production problems related to data, network file delivery, and application issues identified by business owners, and provided recovery solutions.
•Improved performance and optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
•Built Tableau/Looker dashboards to report store-level and region-level sales for Walmart US and global data.
•Built pipelines to copy data from source to destination in Azure Data Factory (ADF v1).
•Monitored produced and consumed datasets in ADF.
•Created linked services for both source and destination servers.
•Created automated workflows using triggers.
•Used DDL and DML for writing triggers, stored procedures, and data manipulation.
•Set up SQL Server linked servers to connect multiple servers/databases.
Environment: Hadoop, GCP, Google Cloud Storage, BigQuery, Dataproc, Spark, Scala, Teradata, Hive, Aorta, Sqoop, SQL, DB2, UDP, GitHub, Azure (Azure Data Factory, Azure databases), Power BI.
Hadoop Developer at Crayon Data Apr 2016 - July 2017
Client: Independence Blue Cross, PA
Responsibilities:
•Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
•Worked with numerous file formats such as text, SequenceFile, Avro, Parquet, ORC, JSON, XML, and flat files using MapReduce programs.
•Extended the daily process to do incremental imports of data from DB2 and Teradata into Hive tables using Sqoop.
•Analyzed SQL scripts and designed the solution for implementation in Scala.
•Resolved performance issues in Hive and Pig scripts by analyzing joins, grouping, and aggregation and how they translate to MapReduce jobs.
•Hands-on experience with the Hadoop ecosystem (HDFS, MapReduce, HBase, Hive, Impala, Spark, Kafka, Kudu, Solr).
•Loaded data into Spark RDDs and performed in-memory computation to generate output matching the requirements.
•Wrote Spark applications in Scala to perform data cleansing, validation, transformation, and summarization activities per the requirements.
•Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
•Extensively used HiveQL to query data in Hive tables and loaded data into HBase tables.
•Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and optimized Hive queries.
•Designed Oozie workflows for job scheduling and batch processing.
Environment: Hadoop, Spark, Scala, Teradata, Hive, Pig, Impala, Sqoop, Oozie, SQL, DB2, Spark SQL.
Data Engineer at Thoughtpulse Software Technology July 2015 - Mar 2016
Client: HealthPartners, Bloomington, MN
Responsibilities:
•Applied the principles and best practices of Software Configuration Management (SCM) in Agile, Scrum, and Waterfall methodologies.
•Designed Oozie workflows for job scheduling and batch processing.
•Performed investigation, analysis, recommendation, configuration, installation, and testing of new hardware and software.
•Verified and validated the Business Requirements Document, Test Plan, and Test Strategy documents.
•Used Git for branching, tagging, and merging, and maintained the Git source code tool.
•Used shell scripting (Bash and ksh), PowerShell, Ruby, and Python scripts for merging, branching, and automating processes across environments.
•Worked closely with other data engineers, product managers, and analysts to gather and analyze data requirements to support reporting and analytics.
•Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Scala.
•Followed Agile methodology and participated in sprints and daily scrums to deliver software tasks on time and with high quality, in coordination with onsite and offshore teams.
CERTIFICATIONS:
•Google Cloud Certified - Professional Data Engineer.
EDUCATION:
Master of Science, Computer Science
University of Illinois at Springfield, Springfield, IL
Bachelor of Technology, Information Technology
Jawaharlal Nehru Technological University, Hyderabad, India