
Aayush Joshi

612-***-**** adn1hw@r.postjobfree.com

Professional Summary:

8+ years of technical software development experience with 6+ years of expertise in Big Data, Hadoop Ecosystem, Cloud Engineering, Data Warehousing.

Experience in large scale application development using Big Data ecosystem - Hadoop (HDFS, MapReduce, Yarn), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume, Nifi, AWS, Azure, Google Cloud Platform.

Sound experience with AWS services like Amazon EC2, S3, EMR, Amazon RDS, VPC, Amazon Elastic Load Balancing, IAM, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and Lambda to trigger resources.

Experience in building data pipelines using Azure Data Factory and Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, controlling and granting database access.

Good experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, Storage Explorer.

Worked on Google Cloud Platform (GCP) services like Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.

In-depth understanding of Hadoop architecture and its various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, MapReduce, and Spark.

Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.

Strong Hadoop and platform support experience with the full set of tools and services in major Hadoop distributions – Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.

Expertise in writing Hadoop jobs using MapReduce, Apache Crunch, Hive, Pig, and Splunk.

Profound knowledge of developing production-ready Spark applications using Spark components like Spark SQL, MLlib, GraphX, DataFrames, Datasets, Spark ML, and Spark Streaming.

Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modelling, tuning, disaster recovery, backup and creating data pipelines.

Experienced in scripting with Python (PySpark), Java, Scala and Spark-SQL for development, aggregation from various file formats such as XML, JSON, CSV, Avro, Parquet, ORC.

Strong experience in data analysis using HiveQL, Hive ACID tables, Pig Latin queries, and custom MapReduce programs, achieving improved performance.

Experience with the ELK stack to develop search engines over unstructured data stored in NoSQL databases and HDFS.

Created Kibana visualizations and dashboards to view the number of messages processing through the streaming pipeline for the platform.

Extensive knowledge of all phases of Data Acquisition, Data Warehousing (gathering requirements, design, development, implementation, testing, and documentation), Data Modelling (analysis using Star Schema and Snowflake for FACT and Dimension tables), Data Processing and Data Transformations (Mapping, Cleansing, Monitoring, Debugging, Performance Tuning and Troubleshooting Hadoop clusters).

Implemented CRUD operations using Cassandra Query Language (CQL) and analysed data from Cassandra tables for quick searching, sorting, and grouping on top of the Cassandra File System.
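
For illustration, a minimal sketch of this kind of CQL CRUD access from Python using the DataStax cassandra-driver; the contact point, keyspace, and table names here are hypothetical:

    # Hypothetical sketch: basic CQL CRUD against a Cassandra table (names are illustrative).
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # contact point is an assumption
    session = cluster.connect("demo_ks")      # hypothetical keyspace

    # Create
    session.execute(
        "INSERT INTO users (user_id, name, city) VALUES (%s, %s, %s)",
        (1, "alice", "Dallas"),
    )

    # Read
    row = session.execute("SELECT name, city FROM users WHERE user_id = %s", (1,)).one()

    # Update
    session.execute("UPDATE users SET city = %s WHERE user_id = %s", ("Austin", 1))

    # Delete
    session.execute("DELETE FROM users WHERE user_id = %s", (1,))

    cluster.shutdown()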

Hands-on experience on Ad-hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.

Good understanding of security requirements such as Azure Active Directory, Sentry, Ranger, and Kerberos authentication and authorization infrastructure.

Expertise in creating Kubernetes clusters with CloudFormation templates and PowerShell scripting to automate deployments in a cloud environment.

Sound knowledge of developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party integrations as part of an enterprise site platform using Informatica.

Experience using bug tracking and ticketing systems such as Jira and Remedy; used Git and SVN for version control.

Highly involved in all aspects of the SDLC using Waterfall and Agile Scrum methodologies.

Experience in designing interactive dashboards, reports, performing ad-hoc analysis, and visualizations using Tableau, Power BI, Arcadia, and Matplotlib.

Sound knowledge of and hands-on experience with NLP, Image Detection, MapR, the IBM InfoSphere suite, Storm, Flink, Talend, ER Studio, and Ansible.

Extensive experience implementing Continuous Integration (CI), Continuous Delivery, and Continuous Deployment (CD) on various Java-based applications using Jenkins, TeamCity, Azure DevOps, Maven, Git, Nexus, Docker, and Kubernetes.

Technical Skills:

Big Data Ecosystem

HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, ZooKeeper, Nifi, Sentry, Ranger

Hadoop Distributions

Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon AWS (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis), Microsoft Azure (Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory)

Scripting Languages

Python, Java, Scala, R, PowerShell Scripting, Pig Latin, HiveQL.

Cloud Environment

Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

NoSQL Database

Cassandra, Redis, MongoDB, Neo4j

Database

MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2

ETL/BI

Snowflake, Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI

Operating systems

Linux (Ubuntu, Centos, RedHat), Windows (XP/7/8/10)

Web Development

JavaScript, NodeJS, HTML, CSS, Spring, J2EE, JDBC, Okta, Postman, Angular, JFrog, Mockito, Flask, Hibernate, Maven, Tomcat, WebSphere

Version Control

Git, SVN, Bitbucket

Others

Machine learning, NLP, Spring Boot, Jupyter Notebook, Jenkins, Splunk, Jira

Certification:

Microsoft Certified: Azure Data Engineer Associate

Exam DP-200: Implementing an Azure Data Solution

Exam DP-201: Designing an Azure Data Solution

Analysing and Visualizing Data with Power BI

Professional Work Experience:

Client: Medtronic, Chanhassen, MN Nov 2020 - Present

Role: Senior Big Data Engineer

Key Responsibilities:

The objective was to generate a dashboard with advanced search filters in Power BI to show client data. Data is ingested from AWS S3, in zipped Excel format, into an Azure Data Lake Gen2 (ADLS) container in Parquet format using a watermark file stored in ADLS. The subsequent step is to clean the data according to the business requirements, which is done in Spark Scala. External tables are then created on top of the curated data, which serve as the data source for the Power BI dashboard.
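
For illustration, a minimal PySpark sketch of the raw-to-curated step described above (the actual implementation was in Spark Scala); the container paths, column names, and cleaning rules are hypothetical:

    # Hypothetical sketch of the raw-to-curated hop in ADLS Gen2; paths and columns are illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("curate-client-data").getOrCreate()

    raw = spark.read.parquet("abfss://raw@storageacct.dfs.core.windows.net/client_data/")

    # Basic cleaning per business rules: drop duplicates, standardize types, filter bad rows.
    curated = (
        raw.dropDuplicates(["client_id"])
           .withColumn("event_date", F.to_date("event_date"))
           .filter(F.col("client_id").isNotNull())
    )

    # Write curated Parquet; external tables are then defined over this path
    # and used as the Power BI data source.
    curated.write.mode("overwrite").parquet(
        "abfss://curated@storageacct.dfs.core.windows.net/client_data/"
    )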

Installed and configured Apache Hadoop big data components like HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, Ambari, and Nifi.

Migrated from JMS Solace to Apache Kafka, using Zookeeper to manage synchronization, serialization, and coordination across the cluster.

Designed and created Azure Data Factory (ADF) pipelines extensively for ingesting data from different relational and non-relational source systems to meet business functional requirements.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Created and provisioned numerous Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.

Developed ADF pipelines to load data from on-premises sources to Azure cloud storage and databases.

Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.

Worked extensively with Spark Context, Spark SQL, RDD transformations, actions, and DataFrames.

Developed custom ETL solutions, batch processing, and real-time data ingestion pipelines to move data in and out of Hadoop using PySpark and shell scripting.

Ingested gigantic volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2 by using Azure Cluster services.

Designed numerous applications to consume and transport data from S3 to EMR and Redshift, maintained on EC2.

Processed numerous terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded it into AWS Redshift.

Ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.
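
As an illustration of pushing records into a Kinesis Data Stream (with Firehose then delivering to S3), a minimal boto3 sketch; the region, stream name, and payload are hypothetical:

    # Hypothetical sketch: write JSON records to a Kinesis Data Stream with boto3.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

    record = {"device_id": "sensor-42", "reading": 98.6}
    kinesis.put_record(
        StreamName="telemetry-stream",            # hypothetical stream
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["device_id"],
    )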

Performed full loads of data from S3 to Azure Data Lake Gen2 and SQL Server using Azure Data Factory V2.

Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

Developed Spark Scala notebooks to perform data cleaning and transformation on various tables.

Implemented end-to-end data pipeline using FTP Adaptor, Spark, Hive, and Impala.

Implemented Spark using Scala and utilized Spark SQL heavily for faster development and processing of data.

Built scripts for data modelling and mining to give PMs and EMs easier access to Azure Logs and App Insights.

Handled ingesting enterprise data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HBase tables.

Worked on data analysis and reporting of customer usage metrics using Power BI, and presented this analysis to leadership and a motivated team of engineers and product managers to drive product growth.

Worked on various performance enhancements such as using distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
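
For illustration, a small PySpark sketch of a map-side (broadcast) join combined with a partitioned Hive table; the table and column names are hypothetical:

    # Hypothetical sketch: broadcast (map-side) join of a small dimension with a large fact,
    # then writing the result into a date-partitioned Hive table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    fact = spark.table("sales_fact")          # large table
    dim = spark.table("store_dim")            # small lookup table

    joined = fact.join(broadcast(dim), "store_id")   # hint Spark to ship the small side to every task

    (joined.write
           .mode("overwrite")
           .partitionBy("sale_date")
           .saveAsTable("sales_enriched"))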

Perform ongoing monitoring, automation, and refinement of data engineering solutions.

Created linked services to land data from an SFTP location into Azure Data Lake.

Created several Databricks Spark jobs with PySpark to perform table-to-table operations.

Experience working with both Agile and Waterfall methods in a fast-paced environment.

Environment: Azure Data Factory (ADF v2), Azure Databricks (PySpark), Azure Data Lake (ADLS Gen 2), Spark (Python/Scala), Hive, Apache Nifi 1.8.0, Jenkins, Kafka, Spark Streaming, Docker Containers, PostgreSQL, RabbitMQ, Celery, Flask, ELK Stack, AWS, MS Azure, Azure SQL Database, Azure Function Apps, Blob Storage, SQL Server, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Azure Cosmos DB, Azure Event Hub, Sqoop, Flume, Impala, AWS S3, Spark SQL, SQL, Agile Methodology.

Client: Deloitte, Mechanicsburg, PA Dec 2019 - Oct 2020

Role: Big Data Engineer

Key Responsibilities:

Built and worked on data streaming pipelines to handle the creation and mastering of new and existing patient records, consolidating patient information across healthcare providers and sharing this information securely across multiple organizations (such as insurance and settlement companies) while maintaining confidentiality and security. This data is used in building the CRM platform to add value to overall healthcare insight.

Developed different data loading strategies and performed various transformations for analysing the datasets using the Hortonworks Distribution of the Hadoop ecosystem.

Worked in a Databricks Delta Lake environment on AWS using Spark.

Developed a Spark-based ingestion framework for ingesting data into HDFS, creating tables in Hive, and executing complex computations and parallel data processing.
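
For illustration, a minimal sketch of what one step of such an ingestion framework can look like in PySpark; the source paths, formats, and table names are hypothetical:

    # Hypothetical sketch: reusable ingest step that lands raw HDFS files as a Hive table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def ingest(path, fmt, table):
        """Read a raw dataset from HDFS and register it as a managed Hive table."""
        df = spark.read.format(fmt).option("header", "true").load(path)
        df.write.mode("overwrite").saveAsTable(table)
        return df.count()   # simple row-count check for an ingestion audit log

    ingest("hdfs:///data/raw/patients/", "csv", "staging.patients")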

Ingested real-time data from flat files and APIs using Kafka.

Developed data ingestion pipeline from HDFS into AWS S3 buckets using Nifi.

Created external and permanent tables in Snowflake on the AWS data.

Developed Impala queries for faster querying and performed data transformations on Hive tables.

Worked on creating Hive tables and writing Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data from Oracle and MySQL.

Implemented Spark to migrate MapReduce jobs into Spark RDD transformations and Spark Streaming.

Developed application to clean semi-structured data like JSON into structured files before ingesting them into HDFS.

Automated the process of transforming and ingesting terabytes of monthly data using Kafka, S3, Lambda, and Oozie.

Developed the schema and designs for mapping the raw input format (CSV) into a common record format (CRF).

Created the schema and design to convert and map the CRF into CIF (Common Interchange Format) for interacting with another system called Reltio (a master data management tool).

Executed the program using the Python API for Apache Spark (PySpark).

Modified the pipeline to allow messages to receive new incoming records for merging, handling missing (NULL) values, and triggering a corresponding merge of records on receipt of a merge event.

Integrated Apache Storm with Kafka to perform web analytics and to stream clickstream data from Kafka to HDFS.

Created an internal tool using shell scripts for comparing RDBMS and Hadoop to verify that all the data in source and target matches, reducing the complexity of moving data.

Wrote Python scripts for internal testing which push data read from a file into a Kafka queue, which in turn is consumed by the Storm application.
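
For illustration, a minimal sketch of such a test producer using the kafka-python client; the broker address, topic, and file path are hypothetical:

    # Hypothetical sketch: push each line of a test file into a Kafka topic for the Storm app to consume.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # broker is an assumption

    with open("sample_records.txt", "rb") as f:
        for line in f:
            producer.send("patient-records", value=line.strip())  # hypothetical topic

    producer.flush()
    producer.close()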

Worked on Spark SQL, created DataFrames by loading data from Hive tables, produced partitioned data stored in AWS S3, and interacted with the SQL interface using the command line or JDBC.
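
A small PySpark sketch of that Hive-to-S3 flow, with hypothetical table, bucket, and partition column names:

    # Hypothetical sketch: build a DataFrame from a Hive table with Spark SQL and
    # store the result in S3, partitioned for downstream querying.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    claims = spark.sql("""
        SELECT provider_id, claim_date, claim_amount
        FROM warehouse.claims
        WHERE claim_amount > 0
    """)

    (claims.write
           .mode("overwrite")
           .partitionBy("claim_date")
           .parquet("s3a://analytics-bucket/curated/claims/"))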

Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.

Helped the QA team with testing and investigating Spark job run failures.

Created and managed Kafka topics and producers for the streaming data.

Worked in an Agile development environment and participated in daily scrums and other design-related meetings.

Imported and exported the analysed data to relational databases using Sqoop for visualization and to produce reports for the BI team using Power BI with an automated trigger API.

Continuously monitored and managed the Hadoop cluster using Cloudera Manager.

Environment: Python, Databricks, PySpark, Kafka, Reltio, GitLab, PyCharm, AWS S3, Delta Lake, Snowflake, Cloudera CDH 5.9.16, Hive, Impala, Flume, Apache Nifi, Java, Shell scripting, SQL, Sqoop, Oozie, Oracle, SQL Server, HBase, Power BI, Agile Methodology.

Client: BYJU’S, Bangalore, KA Feb 2017 - Aug 2019

Role: Big Data Analyst

Key Responsibilities:

Worked with Hortonworks distribution. Installed, configured, and maintained a Hadoop cluster based on the business and the team requirements.

Experience with big data components like HDFS, MapReduce, YARN, Hive, HBase, Druid, Sqoop, Pig, and Ambari.

Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume analytics, and additionally reviewed use cases before onboarding them to HDFS. Captured data and brought it into HDFS using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.

Used Cloudera Hue and Zeppelin notebooks to interact with the HDFS cluster. Used Cloudera Manager, Search, and Navigator to configure and monitor resource utilization across the cluster.

Enhanced scripts of existing modules written in Python. Migrated ETL jobs to Pig scripts to apply transformations, joins, and aggregations, and to load data into HDFS.

Developed an ETL pipeline to extract archived logs from different sources for additional processing using PySpark. Used Cron schedulers for weekly automation.
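
For illustration, a minimal sketch of such a log-parsing step in PySpark; the log pattern, paths, and field names are hypothetical, and the cron entry in the comment is one possible way to drive the weekly run:

    # Hypothetical sketch: parse archived web logs into structured columns and write them out weekly.
    # A cron entry such as:  0 2 * * 0 spark-submit parse_logs.py   could drive the schedule.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.appName("parse-archived-logs").getOrCreate()

    logs = spark.read.text("hdfs:///archive/logs/2017/*.gz")

    parsed = logs.select(
        regexp_extract("value", r'^(\S+)', 1).alias("client_ip"),
        regexp_extract("value", r'\[([^\]]+)\]', 1).alias("timestamp"),
        regexp_extract("value", r'"\S+ (\S+)', 1).alias("url"),
        regexp_extract("value", r' (\d{3}) ', 1).alias("status"),
    )

    parsed.write.mode("append").parquet("hdfs:///processed/logs/")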

Designed and created Apache NiFi jobs to move records from transactional systems into the data lake raw zone.

Responsible for loading, managing, and reviewing terabytes of log files using Ambari and Hadoop streaming jobs.

Migrated from JMS Solace to Kafka, using Zookeeper to manage synchronization, serialization, and coordination.

Used Zookeeper to provide coordination, synchronization, and grouping services across the cluster.

Involved in writing rack topology scripts and Java MapReduce programs to parse raw data.

Used Sqoop to migrate data between traditional RDBMS and HDFS. Ingested data into HDFS from Teradata, Oracle, and MySQL. Identified required tables and views and exported them into Hive. Performed ad-hoc queries using Hive joins, partitioning, bucketing techniques for faster data access.

Worked with the new Business Data Warehouse (BDW), improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.

Used Nifi to automate the data flow between disparate systems. Designed dataflow models and complex target tables to obtain relevant metrics from various sources.

Developed Bash scripts to get log files from FTP server and executed Hive jobs to parse them.

Implemented different Hive queries for analytics and called them from a java client engine to run on different nodes. Worked on writing APIs to load the processed data to HBase tables.

Created External tables, optimized Hive queries and improved the cluster performance by 30%.
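
An illustrative sketch of the external-table pattern, run here through spark.sql; the table name, HDFS path, and partition column are hypothetical:

    # Hypothetical sketch: define an external Hive table over existing HDFS data and
    # register its partitions so queries can prune by event_date.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.page_views (
            user_id STRING,
            page    STRING,
            dwell_s INT
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/curated/page_views/'
    """)

    spark.sql("MSCK REPAIR TABLE analytics.page_views")   # pick up existing partitions

    # Partition pruning keeps the scan limited to one day of data.
    daily = spark.sql("SELECT page, COUNT(*) AS views FROM analytics.page_views "
                      "WHERE event_date = '2018-06-01' GROUP BY page")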

Performed data analysis using HiveQL, Pig Latin and custom MapReduce programs in Java.

Responsible for gathering, cleaning, and extracting data from various sources to generate reports, dashboards, and analytical solutions. Helped in debugging the Tableau dashboards.

Troubleshot defects by identifying the root cause and fixed them during the QA phase. Used SVN for version control.

Environment: Hortonworks 2.0, Hadoop, Hive v1.0.0, HBase, Sqoop v1.4.4, Pig v0.12.0, Druid, Zookeeper, Kafka v0.8.1, Python, SQL, Java, Teradata, Oracle, MySQL, Tableau v9.x, SVN, Jira.

Client: Innova point InfoTech, Pune, MH Jan 2015 - Jan 2017

Role: Data Analyst

Key Responsibilities:

Worked on a reporting project where we analysed requirements by developing test cases to meet the business requirements and performed different types of testing such as unit testing and integration testing. Cleaned the data using Python and defined new process improvements by maintaining the databases, interpreting data, and analysing results using statistical methods.

Collaborating with different teams throughout the product development life cycle as per business requirements.

Managed data and databases that support performance improvement initiatives.

Developed data collection systems, data analytics and other strategies that optimize statistical efficiency and quality.

Planned various test case scenarios to detect bugs, classified the errors based on severity and priority, and informed the development team.

Skilled in formulating database designs, data models, data mining techniques, and segmentation techniques.

Designed SQL, SSIS, and Python based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into data warehouses.
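
For illustration, a minimal sketch of one such Python batch ETL step using pandas and SQLAlchemy; the connection strings, table names, and transformation are hypothetical:

    # Hypothetical sketch: pull a day's orders from the transactional database,
    # apply a light transformation, and load the result into the warehouse.
    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("mssql+pyodbc://user:pass@oltp_dsn")   # placeholder DSN
    target = create_engine("mssql+pyodbc://user:pass@dwh_dsn")    # placeholder DSN

    orders = pd.read_sql(
        "SELECT order_id, customer_id, amount, order_date FROM dbo.orders "
        "WHERE order_date = CAST(GETDATE() - 1 AS DATE)",
        source,
    )

    # Simple cleansing/derivation step.
    orders["amount"] = orders["amount"].fillna(0.0)
    orders["load_ts"] = pd.Timestamp.now()

    orders.to_sql("stg_orders", target, schema="etl", if_exists="append", index=False)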

Designed complex SQL queries and scripts to extract, aggregate and validate data from MS SQL, Oracle, and flat files using Informatica and loaded into a single data warehouse repository.

Gathered and extracted data, generated reports using Tableau/SSRS, and discovered trends in the analysed data.

Prepared test case modules for system and integration testing to assist in tracking defects.

Conducted Quality inspections on products and parts.

Expert in UAT testing and working on defects using JIRA and ALM.

Documented the test cases, their results, and the expected outcomes to determine the quality of the software product.

Produced data cleansing results reports and error exception reports for the business SMEs.

Performed regression testing on the software after bugs were fixed to verify that no new bugs were introduced in the product.

Worked on a major data cleansing and conversion project where the legacy data resides in MS Access, CSV files, MS Excel, and MySQL and the target system is Salesforce.

Extracted, transformed, and analysed metrics from different sources to generate reports, dashboards, and analytical solutions for increasing sales.

Environment: Python, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel, Tableau, Informatica.

Client: Alliant InfoTech, Indore, MP Nov 2012 - Dec 2014

Role: Junior Data Analyst

Key Responsibilities:

Worked with Java, web services, and Hibernate in a fast-paced development environment.

Followed Agile methodology, interacted directly with the client on features, implemented optimal solutions, and tailored software to client needs.

Used SVN to commit and push code to Integration and QA environments on time for BA and QA sign-offs.

Strict attention to accuracy, detail, presentation, and timeliness of delivery.

Involved in the design and implementation of the web tier using Servlets and JSP. Experience using Apache POI for reading Excel files. Developed the user interface using JSP and JavaScript to view all online trading transactions.

Designed and developed Data Access Objects (DAO) to access the database.

Extensively used the DAO Factory and Value Object design patterns to organize and integrate the Java objects.

Coded Java Server Pages for dynamic front-end content that uses Servlets and EJBs.

Worked on HTML pages using CSS for static content generation with JavaScript for validations.

Involved in using the JDBC API to connect to the database and carry out database operations.

Used JSP and JSTL Tag Libraries for developing User Interface components.

Good knowledge of performing code reviews.

Experience with Salesforce Object Query Language (SOQL).

Knowledge of Spring Batch, which provides capabilities such as processing large volumes of records, job processing statistics, job restart, skip, and resource management.

Designed and developed a web-based application using HTML5, CSS, JavaScript, AJAX, and the JSP framework.

Performed unit testing, system testing and integration testing.

Environment: Java, SQL, Hibernate, Eclipse, Apache POI, CSS, JDK5.0, J2EE, Servlets, JSP, spring, HTML, Java Script Prototypes, XML, JSTL, XPath, jQuery.

EDUCATION:

St. Cloud State University

Master of Science, Engineering Management. CGPA: 3.66/4

Coursework: Facility Systems Design, Big Data Analytics, Project Management.

Institute of Management & Entrepreneurship Development, MH, India

Bachelor of Business Administration, Finance & Information Technology. GPA: 8.85/10

Coursework: Management Information Systems(MIS), Cyber Security, Organisational Behaviour, Financial Management.


