Anushya
****.****@*****.***
OBJECTIVE
Skilled professional with nearly 7 years of experience building large-scale data processing systems and delivering solutions to complex business problems. Capable of designing, implementing, and optimizing data transformation processes, and aiming to apply these skills in the Data Engineer role at your company.
SUMMARY
Hands-on Software and Data Engineer with around 7 years of verifiable success leading teams in delivering appropriate data solutions.
Excellent understanding of, and hands-on experience with, Hadoop architecture, the Hadoop Distributed File System (HDFS), and components such as NameNode, DataNode, JobTracker, TaskTracker, YARN, and MapReduce, along with Pig, Hive, Sqoop, Oozie, Flume, Spark, and Python.
Experience in supporting data analysis projects using EMR, EC2, S3, Data Pipeline, RDS, Lambda, Glue, and Athena on the Amazon Web Services (AWS) cloud.
Good understanding of Spark internals and performance optimization techniques, with hands-on experience in creating optimized Spark jobs using Python, Spark, and SQL.
Experience working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
Worked with various SQL and NoSQL database systems and data warehouses including MySQL, Oracle, SQL Server, MongoDB, DynamoDB, Redshift and Snowflake.
Experience in building ETL systems using Python and an in-memory computing framework (Apache Spark), and in scheduling and maintaining data pipelines at regular intervals in Apache Airflow.
Experience in working with Apache Sqoop to import and export data to and from HDFS and Hive.
Experience in configuring and working with Flume to load data from multiple sources to HDFS.
Experience with shell scripting to schedule ETL processes and export the results to RDBMS and NoSQL databases.
Experience in building custom Python scripts that call external service APIs, enrich the data, and store it in the data warehouse.
Comprehensive knowledge of SDLC, enterprise architecture, agile methodologies, cloud services, and web-based applications.
Exceptional ability to learn new technologies and to deliver results under tight deadlines.
KEY SKILLS
Python, Spark, SQL, Hive, Pig, Sqoop, Oozie, Flume, Shell scripting, Airflow, Git
Big Data Ecosystem: Hadoop, MapReduce, HDFS
AWS Services: S3, EMR, EC2, Athena, Data Pipeline, Lambda, Glue, SNS, SES
Databases: PostgreSQL, MySQL, Oracle, SQL Server, MongoDB, DynamoDB, Snowflake
PROFESSIONAL EXPERIENCE
TRANSAMERICAN AUTO PARTS, Compton, California
Data Engineer Oct. 2018 – Jan. 2020
Extensive working knowledge of Structured Query Language (SQL), Python, Spark, Hadoop, HDFS, AWS, RDBMS, data warehouses, and document-oriented NoSQL databases.
Automated the download of raw data into the data lake from various source systems such as SFTP, FTP, and S3 using shell scripting, enabling business users to consume the data as job-as-a-service and query-as-a-service.
Developed Hive scripts on EMR to parse raw data, store the results in S3, and ingest them into the data warehouse (Snowflake), which is used by enterprise customers.
Designed ETL jobs to process raw data using Spark and Python in Glue, EMR, and Databricks.
Implemented Spark jobs in Python on AWS Glue to transform semi-processed data into processed data used by data scientists.
Implemented connectors in Python to pull raw data from sources such as Google DCM, DBM, AdWords, Facebook, Twitter, Yahoo, and Tubular; parsed the data with the Spark framework and loaded it into Hive tables.
Responsible for creating data pipeline flows, scheduling jobs programmatically as DAGs in the Airflow workflow engine, and supporting the scheduled jobs.
Implemented MapReduce programs using PySpark to parse raw data per business user requirements and store the results in the data lake (AWS S3).
Implemented several data pipeline jobs that pull raw data from different sources into an AWS S3 bucket, process it with PySpark on an EMR cluster, and store the processed data in an AWS S3 bucket.
Created Spark jobs per business requirements; the jobs run on EMR and are triggered by Lambda.
Implemented AWS integration services such as SQS and SNS to notify engineers of job state.
Regularly interacted with management and product owners on project status, priority setting, and sprint timeframes.
SD MACTEC IT SOLUTIONS (P) LTD., HYDERABAD, INDIA
Data Engineer Mar. 2015 – Mar. 2018
Responsible for validating and cleansing data.
Documented operational problems, following standards and procedures, using the software reporting tool JIRA.
Created and modified shell scripts to schedule data cleansing scripts and ETL loading processes.
Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
Implemented a batch process from RDS to downstream systems using AWS Glue, S3 and Python.
Imported and exported data from Oracle databases using Sqoop and loaded it into Hive managed tables, which were then used for Tableau visualization.
Extended Hive core functionality with custom User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs).
Created Hive tables and implemented partitioning, dynamic partitions, and buckets on the tables.
Managed Hive tables in a Big Data environment while facilitating data transfer between HDFS and RDBMS in both directions.
Used PySpark to transform logs and ingest the data into Hive tables and RDBMS.
Exported result sets from Hive to MySQL and PostgreSQL using shell scripts.
Experience with version control tools such as Git.
Used Spark DataFrames to ingest data from different databases.
Worked with ETL tools such as SSIS for SQL Server and reporting tools such as SSRS, Power BI, and Tableau.
Involved in full development life cycle including requirements analysis, high-level design, coding, testing, and deployment.
Developed and executed detailed ETL-related functional, performance, integration, and regression tests, along with documentation.
GLOBAL LOGIC TECHNOLOGIES LTD., HYDERABAD, INDIA
Software/QA Engineer Aug. 2012 – Feb. 2015
Wrote automated scripts in Python for system testing and performed GUI-based testing using Selenium and Python.
Took a leading role in test automation and manual testing, actively contributing to detailed test plans, test cases, and test scenarios for different application modules according to functional requirements and business specifications.
Extensively used SQL queries for data validation and backend testing.
Validated source data files to ensure that the correct data was captured for loading into target tables.
Analyzed mapping documents and compared schemas against database tables, logged defects, and worked with the Database Modeling Team to resolve them.
Ensured that data warehousing test processes, methodologies, and tools were applied appropriately and that test phase entry/exit criteria were defined as agreed by stakeholders and applied by the test team.
Created indexed views, complex stored procedures, functions, and triggers to support efficient data manipulation and data consistency, and used DDL and DML commands for the application.
Created indexes on selected columns to speed up queries and analyses in SQL Server Management Studio.
Proficient in creating joins and sub-queries for complex queries involving multiple tables.
Used SQL to check data validity and data integrity.
Created database objects such as triggers to enforce data integrity.
Scheduled jobs and alerts using SQL Server Agent.
EDUCATION
CAMPBELLSVILLE UNIVERSITY, CAMPBELLSVILLE, KY
Master of Science, Information Technology and Management
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, Hyderabad, INDIA
Bachelor of Technology, Electrical and Electronics Engineering