
Data Engineer

Location:
Scarborough, ON, Canada
Salary:
110000 CAD
Posted:
July 19, 2023

Resume:

Sr Data Engineer

Name: SREEKANTH VANGALA Email: adyen3@r.postjobfree.com

Phone: 365-***-**** LinkedIn: https://www.linkedin.com/in/sreekanthvangala/

Diligent Data Engineer with 6+ years of work experience. Experienced in working with large data sets of structured and unstructured data. Performed risk analysis, data visualization, and reporting on various projects.

Technical expertise in Data Integration, Data Profiling, and Data Cleaning. Worked on Tableau, Power BI, and Shiny in R to create dashboards and visualizations. Understanding of statistical modeling and supervised/unsupervised Machine Learning techniques. Adaptable to a fast-paced work environment with interpersonal, leadership, and coordination skills.

Expert in Data Warehousing, Data Engineering, Feature Engineering, Big Data, ETL/ELT, and Business Intelligence. As a big data architect and engineer, specializes in AWS and Azure frameworks, Cloudera, the Hadoop ecosystem, Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, tools such as Tableau, Airflow, DBT, and Presto/Athena, and Data DevOps frameworks/pipelines, with strong programming/scripting skills in Python. Experienced as a Data Warehouse developer, Data Modeler, and Data Analyst, covering data modeling, implementation, and support and maintenance of various applications in both OLTP and OLAP systems. Worked in data engineering and data pipeline design, development, and implementation as a Sr. Data Engineer and Data Modeler/Data Analyst using the AWS and Azure clouds.

Professional Summary:

6+ years of experience in Analysis, Design, Development, and Implementation as a Data Engineer.

Hands-on experience in developing Spark applications using Spark tools like RDD transformations, Spark core, Spark streaming, and SparkSQL.
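
For illustration, a minimal PySpark sketch of the RDD transformation and Spark SQL pattern described above (the data, app name, and column names are hypothetical, not project code):

    # Illustrative PySpark sketch: RDD transformations plus Spark SQL (hypothetical data).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-and-sql-demo").getOrCreate()
    sc = spark.sparkContext

    # RDD transformations: parse raw lines, map to key/value pairs, aggregate counts per key.
    lines = sc.parallelize(["user1,click", "user2,view", "user1,click"])
    pairs = lines.map(lambda line: (line.split(",")[1], 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.collect())

    # Spark SQL on a DataFrame built from the same kind of records.
    df = spark.createDataFrame([("user1", "click"), ("user2", "view")], ["user", "event"])
    df.createOrReplaceTempView("events")
    spark.sql("SELECT event, COUNT(*) AS cnt FROM events GROUP BY event").show()

    spark.stop()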

Experience in the development and design of various scalable systems using Hadoop technologies in various environments. Extensive experience in analyzing data using Hadoop Ecosystems including HDFS, MapReduce, Hive & PIG.

Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports.

Good knowledge of Amazon Web Services (AWS) concepts such as Athena, EMR, and EC2, which provide fast and efficient processing for Teradata big data analytics.

Experience in developing pipelines in Spark using Scala and Python; worked on Snowflake.

Expertise in big data architectures such as the Hadoop distributed system (Azure, Hortonworks, Cloudera), MongoDB, NoSQL, HDFS, and parallel processing with the MapReduce framework.

Experience in Extraction, Transformation, and Loading (ETL) data from various sources into Data Warehouses, as well as data processing like collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, PowerBI, and Microsoft SSIS.

Experience in designing, developing, and scheduling reports/dashboards using Tableau.

Experienced in the development of Big Data projects using Hadoop, Hive, HDP, Pig, Flume, Storm, and MapReduce open-source tools.

Experience with the Azure cloud platform (Data Lake, Data Storage, Data Factory, Databricks, Azure SQL Database) and migration experience from SQL databases to Azure data pipelines through the Kafka-Spark API.

Proficiency in SQL across several dialects (MySQL, PostgreSQL, Redshift, SQL Server).

Proficient in data processing such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka. Worked on the Airflow scheduler engine for job scheduling and on large datasets using PySpark, NumPy, and pandas. Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
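
As an illustration of the Airflow-based job scheduling mentioned above, a minimal Airflow 2.x DAG sketch follows; the DAG id, schedule, and spark-submit command are assumptions, not the actual project workflow:

    # Minimal Airflow 2.x DAG sketch (hypothetical names and command).
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_ingest",              # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit a PySpark job that aggregates the day's data (illustrative command/path).
        run_spark_job = BashOperator(
            task_id="run_spark_aggregation",
            bash_command="spark-submit /opt/jobs/aggregate_daily.py {{ ds }}",
        )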

Technical Skills:

Hadoop Components / Big Data

HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake

Languages

Scala, SQL, Python, HiveQL, KSQL, Boto3

IDE Tools

Eclipse, IntelliJ, PyCharm

Cloud platform

AWS, Azure

Reporting and ETL Tools

Tableau, Power BI, Talend, AWS Glue

Databases

Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies

Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, DataBricks, Kafka, Cloudera

Data Analysis Libraries:

Pandas, NumPy, SciPy, Scikit-learn, NLTK, Plotly, Matplotlib

BI Tools:

Alteryx, Tableau, Power BI, Sisense

Containerization

Docker, Kubernetes

CI/CD Tools

Jenkins, Bamboo, GitLab

Operating Systems

UNIX, LINUX, Ubuntu, CentOS.

Software Methodologies

Agile, Scrum, Waterfall

Reporting Tools

Power BI, QlikView, Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0

Data Engineer/Big Data Tools/Cloud / Visualization

Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, etc.; AWS, AWS Lambda, AWS Glue, Redshift, Athena, Azure Databricks, Azure Data Explorer, Azure HDInsight, GCP, BigQuery, Pub/Sub, Salesforce, Google Shell, Linux, PuTTY, Bash Shell, Unix, etc.; Tableau, Power BI, SAS, Matplotlib, Seaborn, Bokeh

Professional Experience:

Client: Carfax, London, ON Jan 2022 – Till Date

Role: Data Engineer

The client provides worldwide express delivery, ground small-parcel delivery, less-than-truckload freight delivery, supply chain management services, customs brokerage services, and trade facilitation and electronic commerce solutions. Implementation is on a hybrid cloud involving Azure and AWS, developing and adding features to existing data analytic applications built with Spark and Hadoop on a Scala, Java, and Python development platform on top of AWS services. Tracked and monitored the project and reported bugs. Different technologies are used to process the incoming data, including Oozie, Spark, Python, and Scala. The entire flow is processed on the on-premises AS400 server; the final data is loaded into Teradata, and a few intermediate datasets are loaded into Hive tables.

Key Contributions:

Developed highly complex Python and Scala code, which is maintainable, easy to use, and satisfies application requirements, data processing, and analytics using inbuilt libraries.

Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations, performing read/write operations, and saving the results to the output directory in HDFS/AWS S3.
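
A minimal sketch of this read/transform/write pattern with Spark DataFrames; the S3 paths, columns, and aggregation are illustrative assumptions:

    # Illustrative PySpark DataFrame pipeline: read from S3, transform, write back (hypothetical paths/columns).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

    # Read source data (assumes the S3 connector and credentials are configured on the cluster).
    orders = spark.read.parquet("s3a://example-bucket/raw/orders/")

    # Transform: filter bad rows and aggregate revenue per day.
    daily = (
        orders.filter(F.col("amount") > 0)
              .groupBy("order_date")
              .agg(F.sum("amount").alias("revenue"))
    )

    # Write results to the output directory.
    daily.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_revenue/")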

Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.

Built dashboards using Tableau to allow internal and external teams to visualize and extract insights from big data platforms.

Responsibilities:

Designed a Data Quality Framework to perform schema validation and data profiling on Spark (PySpark).
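
The framework's code is not part of this resume; the sketch below shows, under assumed column names and an assumed input path, what basic schema validation and profiling might look like in PySpark:

    # Illustrative PySpark data-quality checks (hypothetical columns and input path).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-sketch").getOrCreate()

    df = spark.read.json("s3a://example-bucket/raw/transactions/")

    # Schema validation: required columns must be present in the inferred schema.
    required = {"customer_id", "amount"}
    missing = required - set(df.columns)
    assert not missing, f"Missing required columns: {missing}"

    # Basic profiling: row count and null counts per column.
    print("rows:", df.count())
    df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()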

Created Hive tables, loaded data, and wrote Hive queries that run internally as MapReduce jobs.

Worked on ETL Processing which consists of data transformation, data sourcing and mapping, Conversion, and loading.

Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.

Used Amazon Elastic Cloud Compute (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.

Worked on building data warehouse structures and creating fact, dimension, and aggregate tables through dimensional modeling with Star and Snowflake schemas.

Worked on Docker containers, combining them with the workflow to keep deployments lightweight.

Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
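
The original MapReduce programs are not reproduced here; as an illustration, a Hadoop Streaming style mapper/reducer pair in Python that aggregates counts per key from CSV input (the column layout is assumed):

    # mapper.py -- Hadoop Streaming mapper: emit "key<TAB>1" for each CSV record.
    import sys
    import csv

    for row in csv.reader(sys.stdin):
        if row:                          # skip blank lines
            print(f"{row[0]}\t1")        # assumes the first column is the grouping key

    # reducer.py -- Hadoop Streaming reducer: sum counts per key (input arrives sorted by key).
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")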

Worked on the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
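
For illustration only, the sketch below publishes a record with the kafka-python client rather than the Kafka REST API used on the project; the broker address and topic name are assumptions:

    # Illustrative Kafka producer using kafka-python (broker and topic are hypothetical).
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Send one JSON record; downstream consumers land these on HDFS.
    producer.send("vehicle-events", {"vin": "TEST123", "event": "inspection"})
    producer.flush()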

Worked on MongoDB using CRUD (Create, Read, Update, and Delete), Indexing, Replication, and Sharding features. Worked on SQL queries in dimensional and relational data warehouses. Performed data analysis and data profiling using complex SQL queries on various systems.
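
A minimal pymongo sketch of the CRUD and indexing operations mentioned above, with a hypothetical connection string, database, and collection:

    # Illustrative MongoDB CRUD and index operations with pymongo (names are hypothetical).
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    coll = client["inventory"]["vehicles"]

    coll.insert_one({"vin": "TEST123", "make": "Honda", "year": 2020})      # Create
    doc = coll.find_one({"vin": "TEST123"})                                 # Read
    coll.update_one({"vin": "TEST123"}, {"$set": {"year": 2021}})           # Update
    coll.delete_one({"vin": "TEST123"})                                     # Delete

    # Index to speed up lookups by VIN.
    coll.create_index([("vin", ASCENDING)], unique=True)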

Followed agile methodology for the entire project.

Environment: Spark, Scala, AWS, ETL, Kafka, Tableau, Hadoop, Python, Snowflake, HDFS, Hive, MapReduce, PySpark, Pig, Docker, Sqoop, Teradata, JSON, MongoDB, SQL, Agile, Unix, Groovy, Kubernetes, Jenkins

Client: Ritchie Bros, Vancouver, CA Jan 2020 - Dec 2021

Role: Data Engineer / Hadoop Developer

Description: A leading financial services company and pioneer in the online brokerage industry. Having executed the first-ever electronic trade by an individual investor, the company has long been at the forefront of the digital revolution, focused on delivering complete and easy-to-use solutions for traders, investors, and stock plan participants, and aims to enhance their financial independence through a powerful digital offering and professional guidance.

Pulled data from various data sources into the Hadoop platform and standardized all data through a series of master data management processes to meet client goals. As a Hadoop developer, involved in maintaining huge data volumes and in designing and developing predictive data models for business users according to the requirements.

Key Contributions:

Developed end-to-end Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities according to the requirements.

Developed Spark code in Python and the SparkSQL environment for faster testing and processing of data, loading the data into Spark RDDs and performing in-memory computation to generate the output response with lower memory usage.

Responsibilities:

Involved in Requirements and Analysis: Understanding the requirements of the client and the flow of the application.

Developed Simple to complex MapReduce Jobs using Hive and Pig.

Developed PIG scripts to transform the raw data into intelligent data as specified by business users.

Utilized Spark, Scala, and Python for querying and preparing data from big data sources.

Wrote pre-processing queries in Python for internal Spark jobs.

Prepared dashboards using Tableau to summarize Configuration, Quotes, and Orders e-commerce data.

Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
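
A minimal sketch of such a federated load with Spark DataFrames; the JSON path, JDBC connection details, and join key are illustrative assumptions:

    # Illustrative federated ETL with Spark DataFrames (paths, JDBC URL, and columns are hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("federated-etl-sketch").getOrCreate()

    # JSON source.
    quotes = spark.read.json("hdfs:///data/raw/quotes/")

    # Relational source over JDBC (the driver jar must be on the classpath).
    customers = (
        spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/sales")
             .option("dbtable", "public.customers")
             .option("user", "etl_user")
             .option("password", "***")
             .load()
    )

    # Join, select, and load the result into a warehouse-facing location.
    result = quotes.join(customers, "customer_id").select("customer_id", "quote_amount", "region")
    result.write.mode("append").parquet("hdfs:///data/curated/quotes_enriched/")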

Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storage of small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.

Implemented AWS Elastic Container Service (ECS) scheduler to automate application deployment in the cloud using Docker Automation techniques.

Analyzed the SQL scripts and designed the solution to implement using PySpark.

Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.

Used SQL queries and other tools to perform data analysis and profiling.

Environment: Spark, Scala, AWS, ETL, Hadoop, Python, Snowflake, HDFS, Hive, Tableau, MapReduce, PySpark, Pig, Teradata, Docker, JSON, XML, Apache Kafka, SQL, PL/SQL, Agile, Windows, JavaScript, JSP, Struts, J2EE, batch, Servlet, HTTP, JDBC, Spring MVC, Spring ORM, Spring, Hibernate

Client: Discover, India Apr 2017 – Nov 2019

Role: Big Data Engineer / BI Developer

The project implements BI reporting solutions by creating reports, dashboards, and scorecards in Tableau, allowing end users to measure the results of key metrics and identify/interpret significant trends.

Key Contributions:

As a developer, created various methods in Python to automate the manual ingestion and post-ingestion workflow from start to end. The main methods include fetching the configuration parameters of the respective file sections from HDFS, storing the original file in HDFS, decompressing the original file (unzipping, extracting individual or necessary worksheets from spreadsheets), building Hive tables on top of HDFS files, creating new directories in HDFS, executing shell commands, data profiling, automatic detection of column data types, automatic HiveQL DDL generation for the file being processed, converting spreadsheets and CSV files to specified delimited text files, stripping unnecessary header and footer rows, and ingesting the raw file into HDFS.
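
The automation code itself is not included in the resume; as an illustration of the automatic HiveQL DDL generation step, a small hypothetical helper that derives a CREATE TABLE statement from a CSV header:

    # Hypothetical sketch: generate a Hive external-table DDL from a CSV header row.
    import csv

    def generate_hive_ddl(csv_path, table_name, hdfs_location):
        """Build a CREATE EXTERNAL TABLE statement, treating every column as STRING for simplicity."""
        with open(csv_path, newline="") as f:
            header = next(csv.reader(f))
        columns = ",\n  ".join(f"`{col.strip()}` STRING" for col in header)
        return (
            f"CREATE EXTERNAL TABLE IF NOT EXISTS {table_name} (\n  {columns}\n)\n"
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
            f"STORED AS TEXTFILE LOCATION '{hdfs_location}'\n"
            "TBLPROPERTIES ('skip.header.line.count'='1');"
        )

    if __name__ == "__main__":
        # Tiny demo input so the sketch runs standalone (file name and columns are made up).
        with open("sample.csv", "w", newline="") as f:
            f.write("vin,make,model_year\n")
        print(generate_hive_ddl("sample.csv", "staging.sample", "/data/raw/sample"))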

Responsibilities:

Gathering business requirements, business analysis, and designing various data products.

Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.

Developed Pig UDFs to provide capabilities for manipulating the data according to business requirements, worked on developing custom Pig Loaders, and implemented various requirements using Pig scripts.

Worked on developing ETL workflows on the data obtained using Python for processing it in HDFS and HBase using Flume.

Developed simple and complex MapReduce programs in Hive, Pig, and Python for Data Analysis on different data formats.

Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks, and analysis.

Involved in writing Scala code using higher-order functions for iterative algorithms in Spark for Performance considerations.

Performed Spark jobs with the Spark core and SparkSQL libraries for processing the data.

Involved in converting MapReduce programs into Spark transformations using Spark RDDs with Scala and Python.

Performed transformations using Python and Scala to analyze and gather the data in the required format.

Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.

Wrote Terraform scripts to automate AWS services, including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPCs, subnets, security groups, and S3 buckets, and converted existing AWS infrastructure to AWS Lambda deployed via Terraform and AWS CloudFormation.

Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

Created scripts in Python to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
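
A minimal sketch of one such load, reading a CSV object from S3 and batch-writing items to DynamoDB with boto3; the bucket, key, and table names are assumptions, and the Snowflake load is omitted:

    # Illustrative S3 -> DynamoDB load with boto3 (bucket, key, and table names are hypothetical).
    import csv
    import io
    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("vehicle_records")

    # Read the CSV object from S3 into memory.
    obj = s3.get_object(Bucket="example-bucket", Key="incoming/vehicles.csv")
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    # Batch-write each row as a DynamoDB item.
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row)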

Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.

Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.

Environment: Spark, Scala, Python, PySpark, AWS, MapReduce, Pig, ETL, HDFS, Hive, HBase, SQL, Agile, Windows, Unix, GCP, Hadoop 2.6.2, HDFS 2.6.2, MapReduce 2.9.0, Hive 1.1.1, Sqoop 1.4.4, Apache Spark 2.0, NiFi, Oozie, Java 7, Python 3, Snowflake, Apache Airflow, Tableau


