
Data Analysis

Location:
Naperville, IL
Salary:
70
Posted:
July 11, 2023

Resume:

Binduvallika Karuturi

Sr. Data Engineer

Email: adx8e7@r.postjobfree.com

Contact: +1-575-***-****

LinkedIn: linkedin.com/in/binduvallika-karuturi6226

Professional Summary

•9+ years of experience as a data engineer, with extensive knowledge of statistical data analysis, including translating business needs into analytical models, developing algorithms, and implementing strategic solutions for large data sets.

•Experience working with a variety of Big Data distributions, including Hortonworks, Cloudera, and MapR.

•Familiarity with HDFS, MapReduce, Hive, Pig, Sqoop, Spark, Kafka, YARN, Oozie, and ZooKeeper, among other Hadoop ecosystem tools.

•Skills with Sqoop, Flume, Spark Streaming, and Kafka, as well as other batch and streaming ingestion tools.

•Proficiency working with a variety of data and file optimization formats, including Avro, Parquet, Sequence, and ORC.

•Proficient knowledge of Scala and Python for developing Spark applications.

•Skilled in writing efficient queries and code for use with the Hadoop frameworks Hive, Pig, and MapReduce.

•Ability to extend Hive by creating user-defined functions (a brief illustrative sketch appears at the end of this summary).

•Ability to write Pig Latin transformations for data cleansing.

•Proficient in using the Oozie scheduler to automate a wide variety of processes, including Hive and Spark jobs.

•Knowledgeable about reporting applications such as QlikView, Tableau, Apache Superset, and Domo.

•Skilled in creating validation and automation scripts in Unix shell and Python.

•Experience working with NoSQL databases such as DocumentDB, DynamoDB, Cassandra, and MongoDB.

•Used Spring Boot to construct REST endpoints in scalable microservices applications.

•Constructed NiFi and Talend connectors to allow for the bidirectional ingestion of data from various sources.

•Experience with multiple cloud platforms, including AWS, Azure, and GCP.

•Ability to use Amazon Web Services (AWS) resources including RDS, Redshift, DynamoDB, and ElastiCache (Memcached, Redis).

•Applied Azure Cosmos DB, Azure Databricks, Event Hubs, and Azure HDInsight to application development.

•Created Power BI and Tableau reporting dashboards with both historical data and incremental updates.

•Used Python to execute Ansible playbooks for deploying Logic Apps to Azure.

•Utilized Kubernetes and Docker to scale microservices applications.

•Expertise in GCP tools such as BigQuery, Pub/Sub, Dataproc, and Datalab.

•Proven success in both Agile and waterfall settings.

•Worked with tools and build utilities such as Eclipse, IntelliJ, SBT, and Maven.

•Ability to create Java-based microservice modules using Spring Boot.

•Knowledge of alerting and monitoring tools such as Elasticsearch, Logstash, Kibana, and Grafana.

•Ability to use tools like Jenkins, Drone, and Travis CI for integration and deployment.

•Skills in using Snowflake to make data accessible to third-party apps via API.

•Contributed to application scaling using containerization tools like Docker and Kubernetes.

•Proven track record in Level 3 (L3) production support and on-call shifts.

•Proficiency with Git and Bitbucket for version control, and with Eclipse and IntelliJ as integrated development environments.
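
The summary above mentions developing Spark applications in Python and Scala and extending Hive with user-defined functions. As a minimal illustrative sketch only (not taken from the resume; the database, table, and column names are hypothetical), a validation UDF registered for Spark SQL over Hive tables in PySpark might look like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    # Hive-enabled Spark session (assumes a configured Hive metastore).
    spark = (
        SparkSession.builder
        .appName("hive-udf-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Example column-validation UDF: keep only the digits of a phone number.
    def normalize_phone(value):
        if value is None:
            return None
        return "".join(ch for ch in value if ch.isdigit())

    # Register the function so Spark SQL queries against Hive tables can call it.
    spark.udf.register("normalize_phone", normalize_phone, StringType())

    # Hypothetical database, table, and column names, for illustration only.
    cleaned = spark.sql(
        "SELECT customer_id, normalize_phone(phone) AS phone FROM sales_db.customers"
    )
    cleaned.show()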

Technical Skills:

Big data technologies : Hadoop, MapReduce, YARN, ZooKeeper, Hive, Pig, Sqoop, Spark, Impala, Oozie, Kafka

Cloud technologies : AWS (EC2, EMR, Redshift, CloudWatch, Lambda, S3, Kinesis, Athena, DynamoDB), Azure, GCP (BigQuery, Composer, Airflow, Dataflow, Dataproc, GCS buckets, gsutil)

Languages : Java, Scala, Python, SQL, PL/SQL, Pig Latin, HiveQL, shell scripting, JavaScript

Distributions : Hortonworks, Cloudera

NoSQL databases : HBase, MongoDB, Cassandra

Java & J2EE technologies : Core Java, Spring, Struts, JMS, EJB, RESTful services, Servlets, Hibernate

Application servers : WebSphere, JBoss, Tomcat, ECP cloud

Databases : Microsoft SQL Server, MySQL, Oracle, DB2, PostgreSQL

Operating systems : Unix, Windows, Linux

Build tools : Jenkins, Screwdriver, Maven

Business intelligence tools : Splunk

Development tools : IntelliJ, PyCharm, VS Code, Eclipse

Development methodologies : Agile/Scrum, Waterfall

Version control tools : Git, Bitbucket

Professional Experience:

Client: Sonesta Hospitality - Newton, MA Oct 2020 – Present

Sr Azure Big Data Engineer

Roles & Responsibilities:

•Analyze, design, and develop modern data solutions utilizing Azure PaaS services to facilitate data visualization. Determine the impact of the new implementation on existing business processes and on the current production state of the application.

•Use Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics) to extract, transform, and load (ETL) data from source systems to Azure data storage services; ingest data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and process it in Azure Databricks.

•Employ Python NumPy and Pandas for data ingestion, transformation, and cleaning.

•Implement and evaluate a punctuation-restoration post-processor for speech recognition RNN output using Keras and TensorFlow.

•Use S3 as the data lake solution, and DynamoDB and RDS as the database solutions. Utilize CloudWatch to monitor product health. Implement auto-scaling architectures for failover management.

•Created a Power BI data model utilizing datasets from multiple data sources (Postgres, Oracle, FAST) by establishing meaningful relationships between them, resulting in a 60% performance boost.

•Responsible for importing data from diverse sources using Sqoop and executing transformations with Hive, Impala, and MapReduce before loading the data into HDFS.

•Proficient in designing and modeling data structures and schemas within Azure Cosmos DB, including document-based, key-value, graph, or columnar data models, based on specific application requirements.

•Implemented extensive Lambda architectures utilizing Azure data platform capabilities such as Azure Data Lake, Azure ML, Azure Data Factory, Azure Data Catalog, Azure SQL Server, HDInsight, and Power BI.

•Conceived and created custom Hive HQL queries.

•Designed and architected data-centric applications, considering data requirements, scalability, and performance. Developed application architecture patterns that leverage data processing and storage technologies effectively.

•Performed data quality assessments and implemented data cleansing and enrichment processes in Neo4j to ensure the accuracy, completeness, and consistency of graph data, leveraging graph algorithms and custom procedures.

•Worked on migration of the on-premises data warehouse to GCP BigQuery.

•Developed Cypher queries in Neo4j to uncover hidden patterns and relationships within large graph datasets.

•Web scraping automation project: designed and developed an end-to-end web scraping pipeline from public URLs to GCP and BigQuery using Python.

•Utilized numerous machine learning algorithms, including linear regression, logistic regression, random forest ensembles, support vector machines, and boosting, depending on client requirements.

•Utilizing diverse SQL queries and RDBMS and ETL tools, modeled a new data mart on SQL Server to comprise a subset of data from multiple data sources.

•Developed and maintained technical documentation for Hadoop cluster deployment and Hive query execution.

•Handled importing data from multiple data sources, performed transformations using Hive, Impala, and MapReduce, inserted data into HDFS, and extracted MySQL data using Sqoop into HDFS.

•Used Spark SQL to access Hive tables from Spark to accelerate data processing.

•Using Scala, configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS (a brief sketch appears at the end of this section).

•Deep understanding of the consistency models offered by Azure Cosmos DB, including strong, bounded staleness, session, and eventual consistency. Selected and configured appropriate consistency models based on application requirements.

•Stayed up to date with the latest developments in Neo4j technology, attending conferences, participating in online communities, and exploring new features and enhancements to continually enhance graph data engineering capabilities.

•Responsible for developing a data infrastructure using Flume, Sqoop, and Pig to extract and store weblog data in HDFS.

•Benchmarked Cassandra cluster based on the expected traffic for the use case and optimized for low latency

•Designed and implemented a data ingestion pipeline using Azure Data Factory to extract data from various sources and load it into Cosmos DB for further processing and analysis.

•Created dimensional models within the Data Mart schema, capturing the relevant business entities and their attributes to facilitate easy data analysis.

•Experience as a Big Data Engineer in Agile development environments, utilizing Agile principles and practices to ensure the success of projects.

•Developed a NiFi workflow to pick up multiple retail files from the FTP location and move them to HDFS on a daily basis.

•Expertise in creating, debugging, scheduling, and monitoring tasks using Airflow for ETL batch processing to transfer into Snowflake for analytical processes.

•Utilized PowerShell scripting to interact with distributed computing frameworks such as Apache Hadoop and Apache Spark, submitting and monitoring jobs for large-scale data processing.

•Wrote PowerShell scripts to extract data from various sources, including APIs, databases, log files, and flat files, enabling seamless data integration into the Big Data ecosystem.

•Transformed the Data Catalog service into a Spring Boot application.

•Used NiFi data pipelines to process large data sets and configured lookups for data validation and integrity.

•Used REST API endpoints to interact with Power BI web services to orchestrate report deployment across multiple environments, resulting in a productivity increase of 40%.

•Implemented Python scripts and frameworks for data extraction from various sources, including APIs, databases, and log files.

•Developed Hive-integrated Tableau dashboards and reports to visualize the purchase value time series to monitor business metrics and provide stakeholders with business insights.

•Expertise in writing and optimizing SQL queries in Oracle, SQL Server 2008/12/16, and Teradata; developed and managed SQL and Python code bases for data cleansing and data analysis using Git version control.

•Developed Spark code with Scala and Spark-SQL/Streaming to accelerate testing and data processing.

•Prepared data analytics processing and data regression to make analytics results accessible to visualization systems, applications, and external data stores.

•Wrote Scala functions as required for column validation and data cleansing logic.

•Utilized Azure DevOps (ADO) to manage and version control configuration files, ensuring consistency and traceability in the deployment and configuration of data engineering environments.

•Developed machine learning models for predictive analytics on time series using recurrent neural networks (LSTM).

•Scheduled Informatica Jobs with the Autosys scheduling application.

•Implemented a one-time data migration from SQL Server to Snowflake at the multistate level using Python.

•Design and implement data enhancements for synthetic voice and text data.

•Utilize Python to perform data manipulation duties such as grouping, aggregating, filtering, and replacing missing values.

•Introduced variations and diversity in synthetic data to represent a wide range of scenarios and edge cases.

•Generated large-scale synthetic datasets to validate system performance, scalability, and stress testing.

•Designed and implemented scalable GenRocket-based test data generation solutions. Generated synthetic test data closely resembling production data to ensure realistic and exhaustive testing scenarios.

•During test data generation, data masking techniques have been implemented in GenRocket to safeguard sensitive and personally identifiable information (PII).

•Implemented security measures within the Data Hub to protect sensitive data.

•Collaborated with cross-functional teams, including data scientists, analysts, and business stakeholders, in an Agile environment. Utilized Azure DevOps (ADO) boards and agile project management methodologies for effective project planning and tracking.

•Designed and implemented data models and schemas within the Data Hub, facilitating efficient data storage, retrieval, and analysis.

•Involved in requirements gathering and capacity planning for multi data center Apache Cassandra cluster.

•Extensively worked on spark streaming and Apache Kafka to fetch live stream data.

•Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark databricks cluster.

•Extensive knowledge of Informatica power center Mappings, Mapping Parameters, Workflows, Variables, and Session Parameters.

•Develop analysis and prediction algorithms for feature correlations and conduct Hypothesis Testing to determine the significance level.

•Implemented data quality frameworks and processes within the Data Hub to identify and address data quality issues.

•Developed and optimized complex SQL queries and stored procedures to extract data from various data sources for use in SSRS reports.

•Knowledge of ETL performance tuning and optimization techniques, such as partitioning, indexing, and parallel processing, to ensure efficient data processing and faster execution times.

•Worked with the RPA Centre of Excellence to ensure quality in project deliveries; mentored colleagues through RPA training and provided guidance on best practices and development techniques.

Environment: Python, NumPy, Pandas, Keras, TensorFlow, Azure, Azure CLI, Azure HDInsight, Eclipse, IntelliJ
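
One of the bullets above describes configuring Spark Streaming to receive real-time data from Kafka and land it in HDFS. The resume cites Scala; purely as an illustration of the same pattern, a PySpark Structured Streaming sketch (broker address, topic, and paths are hypothetical) might look like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Assumes the spark-sql-kafka connector package is available on the cluster.
    spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame (hypothetical broker and topic).
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "reservations")
        .load()
    )

    # Kafka delivers the payload as binary; cast it to a string column.
    payload = events.select(col("value").cast("string").alias("json_payload"))

    # Append the stream to HDFS as Parquet, with checkpointing for fault tolerance.
    query = (
        payload.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streaming/reservations")
        .option("checkpointLocation", "hdfs:///checkpoints/reservations")
        .outputMode("append")
        .start()
    )

    query.awaitTermination()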

Client: Centene Corporation - St Louis, Missouri Jun 2018 – Sep 2020

Sr AWS Data Engineer

Roles & Responsibilities:

•Design, develop, and implement RDBMS and NoSQL databases, as well as construct views, indexes, and stored procedures.

•Responsible for designing and deploying multi-tier applications utilizing AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), with an emphasis on high availability, fault tolerance, and auto-scaling, using AWS CloudFormation.

•Created and maintained graph database schemas and models in Neo4j, ensuring optimal performance, data integrity, and query performance by utilizing best practices for graph data modeling and indexing.

•Supported persistent storage on AWS with Elastic Block Storage (EBS), S3, and Glacier; created and configured volumes and snapshots for EC2 instances.

•Data modeling of product information and customer characteristics; development of a data warehouse solution to support BI operations.

•Implemented NiFi pipelines to export data from HDFS to different endpoints and cloud locations such as AWS S3.

•Implemented data ingestion and handling clusters in real-time processing using Kafka.

•SQL queries on RDBMS like MySQL/PostgreSQL and HiveQL on Hive tables for data extraction and initial data analysis. Migration of Access databases to SQL Server.

•Construct data pipelines comprising data ingestion, data transformations including aggregation, filtering, and cleansing, and data storage.

•Migrate data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.

•Used Databricks for encrypting data using server-side encryption.

•Successfully led the migration project from Oracle 19 to PostgreSQL 15, ensuring a smooth transition and minimizing downtime.

•Collaborated with cross-functional teams, including data scientists, analysts, and stakeholders, to understand data requirements and generate synthetic datasets that aligned with project objectives.

•Utilized Agile project management tools and software, such as Jira to track and manage work items, user stories, tasks, and sprint backlogs, ensuring transparency and visibility into project progress.

•Implement a one-time data migration of multistate-level data from SQL Server to Snowflake utilizing Python and SnowSQL.

•Participated in online meetings to demonstrate the capabilities of RPA for on-going projects and potential clients.

•Utilized MongoDB monitoring tools and metrics to monitor database performance, identify bottlenecks, and optimize resource utilization. Implemented performance tuning strategies to improve query response times and overall database performance.

•Experienced in working with real-time streaming, using Kafka as a data pipeline with the Spark Streaming module.

•Integrated SSRS reports with other data visualization tools and technologies, such as Power BI, to provide a comprehensive reporting solution.

•Used Spark Streaming APIs to process data from Kafka in real time and store it in Cosmos DB.

•Used Pig for data cleansing and extracting the data from the web server output files to load into HDFS.

•Developed Data Mapping, Data Governance, and transformation and cleansing rules involving OLTP, ODS.

•Proficient in using Python libraries such as Pandas, NumPy, and SciPy for data manipulation, cleaning, and exploratory data analysis. Leveraged these libraries to handle large datasets efficiently and extract valuable insights.

•Worked on scheduling all operations with Python-based Airflow scripts, adding tasks to DAGs and establishing dependencies between them.

•Optimized MongoDB queries and indexes to improve query performance and minimize response times. Utilized explain plans and query profiling to identify and address performance bottlenecks.

•Ingestion of data from SQL and NoSQL databases and multiple data formats, including XML, JSON, and CSV.

•Data ingestion of real-time customer behavior data into HDFS utilizing Flume, Sqoop, and Kafka, and data transformation utilizing Spark Streaming.

•Active in the open-source community as an enthusiast of Spark for ETL, Databricks, cloud adoption, and data engineering.

•Utilized statistical modeling techniques to generate synthetic data that accurately reflected the underlying patterns and distributions of real-world data.

•Implemented mechanisms in GenRocket to refresh or reset test data for repeated testing cycles.

•Experience using ETL tools like Informatica, Talend, or Apache NiFi to extract, transform, and load data from multiple sources into a data warehouse or data lake.

•Performed Data Analysis and Data Visualization on survey data using Tableau Desktop, as well as compared respondent demographics data using Python (Pandas, NumPy, Seaborn, and Matplotlib).

•Establishing a pipeline to import data from Spark into Snowflake Database using AWS Data Pipeline.

•Perform ETL operations using Scala Spark and PySpark with IntelliJ for Java and PyCharm for Python.

•Implement and execute the parallel processing of a Map-Reduce job using Java for server log data.

•Monitor and assess the health of the data warehouse by providing cost-effective failover and disaster recovery solutions.

•Employed the Informatica Power Center ETL tool to extract data from disparate sources and transfer it into the target systems.

•Performed RPA tool (UiPath) installation and infrastructure testing.

•Utilize YARN for large-scale distributed data processing, in addition to diagnosing and resolving Hadoop cluster performance issues.

•Expertise in creating, debugging, scheduling, and monitoring Airflow jobs for ETL batch processing to integrate into Snowflake for analytical processes (a brief DAG sketch appears at the end of this section).

•Constructed ETL using PySpark, utilizing the DataFrame API and the Spark SQL API.

•Perform data management and data queries using Spark and deal with streaming data using Kafka to ensure rapid and reliable data transfers and processing.

•Use AWS S3 as the HDFS storage solution, AWS Glue as the ETL solution, and AWS Kinesis as the data streaming solution to deploy the data pipeline in the cloud.

•Migrate the data warehouse from RDBMS to AWS Redshift and perform log data analysis using AWS Athena on S3; manage the Hadoop cluster using AWS EMR.

•Conducted A/B tests on metrics including customer retention, acquisition, sales revenue, and volume growth to evaluate product performance.

•Pandas, NumPy, and Seaborn were utilized for exploratory data analysis.

•Extend the functionality of Hive by employing User Defined Functions such as UDF, UDTF, and UDAF.

•Developed predictive modeling using Python packages such as SciPy and scikit-learn as well as Mixed-effect models and time series models in R in accordance with business requirements.

•Involved in the process of adding a new Datacenter to existing Cassandra Cluster

•Involved in upgrading the Cassandra test clusters.

•Flatten the API or Kafka Data (in JSON file format) before loading it into Snowflake DB for various functional services.

•Employed Python and R to perform Dimension Reduction with PCA and Feature Engineering with Random Forest to capture key features for predicting annual sales and best-selling product.

•Created dimension tables within the star schema to store descriptive attributes and hierarchies, enabling easy navigation and slicing of data.

•Created Hive Integrated Tableau dashboards and reports to visualize the purchase value time series in order to track business metrics and provide business insights to stakeholders.

Environment: Spark, Python, Scala, Kafka, AWS, EC2, SQL, Hive, Java, Oracle, Glue, Athena, S3, Parquet, Power BI, Data Studio, Tableau, Oozie, HBase, Databricks, EMR, HDInsight.
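
Several bullets above describe scheduling ETL batch processing into Snowflake with Python-based Airflow scripts, adding tasks to DAGs and wiring up their dependencies. The following is a rough, hypothetical sketch only (DAG name, schedule, and callables are assumptions, not taken from the resume):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables; the real jobs would run Spark extracts,
    # transformations, and a Snowflake load.
    def extract_from_source(**context):
        print("pulling the day's files from the source system")

    def transform_with_spark(**context):
        print("running the PySpark transformation job")

    def load_into_snowflake(**context):
        print("copying curated data into Snowflake")

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_snowflake_etl",  # hypothetical DAG name
        default_args=default_args,
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
        transform = PythonOperator(task_id="transform", python_callable=transform_with_spark)
        load = PythonOperator(task_id="load", python_callable=load_into_snowflake)

        # Establish dependencies between the tasks.
        extract >> transform >> load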

Big Data Engineer

Option Care Health, Cleveland, Ohio. Jan 2016 – Jun 2018

Responsibilities:

•Design, develop, and implement RDBMS and NoSQL databases, as well as construct views, indexes, and stored procedures.

•Responsible for designing and deploying multi-tier applications utilizing AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), with an emphasis on high availability, fault tolerance, and auto-scaling, using AWS CloudFormation.

•Supported persistent storage on AWS with Elastic Block Storage (EBS), S3, and Glacier; created and configured volumes and snapshots for EC2 instances.

•SQL queries on RDBMS like MySQL/PostgreSQL and HiveQL on Hive tables for data extraction and initial data analysis. Migration of Access databases to SQL Server.

•Construct data pipelines comprising data ingestion, data transformations including aggregation, filtering, and cleansing, and data storage.

•Created NiFi flows to trigger Spark jobs and used PutEmail processors to get notifications of any failures.

•To ensure the stability, quality, and maintainability of data engineering solutions, Agile engineering practices such as continuous integration, automated testing, and routine code reviews were utilized.

•Experienced with cloud platforms such as Amazon Web Services and Azure, and with Databricks integration on both Azure and AWS.

•Proficient with Azure Data Lake Storage (ADLS), Databricks and Python notebooks, Databricks Delta Lake, and Amazon Web Services (AWS).

•Implement a one-time data migration of multistate-level data from SQL Server to Snowflake utilizing Python and SnowSQL.

•Strong expertise in Python programming language for developing data-intensive applications, data processing pipelines, and ETL (Extract, Transform, Load) workflows.

•Worked on scheduling all operations with Python-based Airflow scripts, adding tasks to DAGs and establishing dependencies between them.

•Ingestion of data from SQL and NoSQL databases and multiple data formats, including XML, JSON, and CSV.

•Data ingestion of real-time customer behavior data into HDFS utilizing Flume, Sqoop, and Kafka, and data transformation utilizing Spark Streaming.

•Performed Data Analysis and Data Visualization on survey data using Tableau Desktop, as well as compared respondent demographics data using Python (Pandas, NumPy, Seaborn and Matplotlib).

•Establishing a pipeline to import data from Spark into Snowflake Database using AWS Data Pipeline.

•Perform ETL operations using Scala Spark and PySpark with IntelliJ for Java and PyCharm for Python.

•Implement and execute the parallel processing of a Map-Reduce job using Java for server log data.

•Monitor and assess the health of the data warehouse by providing cost-effective failover and disaster recovery solutions.

•Employed the Informatica Power Center ETL tool to extract data from disparate sources and transfer it into the target systems.

•Developed a comprehensive migration plan, including data extraction, transformation, and loading strategies, to migrate large-scale Oracle databases to PostgreSQL 15.

•Expertise in creating, debugging, scheduling, and monitoring Airflow jobs for ETL batch processing to integrate into Snowflake for analytical processes.

•Set up Cassandra backups using snapshots.

•Used OpsCenter to monitor prod, dev, test, and fast Cassandra clusters.

•Constructed ETL using PySpark, utilizing the DataFrame API and the Spark SQL API.

•Perform data management and data queries using Spark and deal with streaming data using Kafka to ensure rapid and reliable data transfers and processing.

•Use AWS S3 as the HDFS storage solution, AWS Glue as the ETL solution, and AWS Kinesis as the data streaming solution to deploy the data pipeline in the cloud.

•Migrate the data warehouse from RDBMS to AWS Redshift and perform log data analysis using AWS Athena on S3; manage the Hadoop cluster using AWS EMR.

•Data purification, data manipulation, and data wrangling using Python to remove invalid datasets and decrease prediction errors.

•Conducted A/B tests on metrics including customer retention, acquisition, sales revenue, and volume growth to evaluate product performance.

•Pandas, NumPy, and Seaborn were utilized for exploratory data analysis.

•Extend the functionality of Hive by employing User Defined Functions such as UDF, UDTF, and UDAF.

•Flatten the API or Kafka data (in JSON file format) before loading it into Snowflake DB for various functional services (a brief flattening sketch appears at the end of this section).

•Employed Python and R to perform Dimension Reduction with PCA and Feature Engineering with Random Forest to capture key features for predicting annual sales and best-selling product.

•Created Hive Integrated Tableau dashboards and reports to visualize the purchase value time series in order to track business metrics and provide business insights to stakeholders.

•Use Git for version control and Maven for building, testing, and deploying Java projects.

Environment: Hadoop 3.0, Spark 2.3, MapReduce, Java, MongoDB, HBase 1.2, JSON, Hive 2.3, Zookeeper 3.4, AWS, Scala 2.12, Python, Cassandra 3.11, HTML5, JavaScript, Oracle.
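
One bullet above mentions flattening API or Kafka JSON data before loading it into Snowflake. A minimal PySpark sketch of that kind of flattening step follows; the paths and field names are hypothetical, and the actual Snowflake load (for example via COPY INTO or a connector) is not shown:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json-sketch").getOrCreate()

    # Hypothetical nested JSON records landed from an API or a Kafka topic.
    raw = spark.read.json("s3a://landing-bucket/orders/")

    # Flatten nested structs and explode the array of line items so each row
    # becomes a single flat record ready for staging and loading into Snowflake.
    flat = (
        raw.select(
            col("order_id"),
            col("customer.id").alias("customer_id"),
            col("customer.state").alias("customer_state"),
            explode(col("line_items")).alias("item"),
        )
        .select(
            "order_id",
            "customer_id",
            "customer_state",
            col("item.sku").alias("sku"),
            col("item.qty").alias("qty"),
        )
    )

    # Stage as Parquet; the Snowflake load would follow from this staged data.
    flat.write.mode("overwrite").parquet("s3a://staging-bucket/orders_flat/")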

Spark/Hadoop Developer

Infosys - Chennai, India May 2014 – Jun 2015

Responsibilities:

•Interacting with multiple teams understanding their business requirements for designing flexible and common components.

•Created HBase tables to load large sets of semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.

•Experience in performance tuning of the Hadoop cluster, namely CPU, memory, I/O, and YARN configuration.

•Experience in ingesting Structured, unstructured and log data to Hadoop HDFS and Netezza & Greenplum using Spark & Sqoop and Informatica.

•Analyzing/Transforming data with Hive and Pig.

•Wrote MapReduce jobs using the Java API and Pig Latin (a brief Hadoop Streaming sketch in Python appears at the end of this section).

•Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.

•Wrote Pig scripts to run ETL jobs on the data in HDFS and further do testing.

•Used Hive to do analysis on the data and identify different correlations.

•Imported data using Sqoop to load data from MySQL into HDFS and Hive on a regular basis.

•Written Hive queries for data analysis to meet the business requirements.

•Involved in collecting, aggregating, and moving data from servers to HDFS using Apache Flume.

•Involved in using Oozie to define and schedule jobs, managing Apache Hadoop jobs as a directed acyclic graph (DAG) of actions with control flows.

•Experience in handling data cleansing, data profiling, data lineage and de-normalization, & aggregation of big data.

•Worked with different Hive file formats such as RCFile, SequenceFile, ORC, and Parquet.

•Used Splunk tools for log aggregation and implemented log data analysis.

•Involved in writing, developing, and testing optimized Pig Latin scripts.

•Developed job flows to automate the workflow for Pig and Hive jobs.

•Responsible for writing Hive Queries for analyzing data using Hive Query Language (HQL).

•Tested and reported defects in an Agile Methodology perspective.

•Created R function and Spark stream to pull customer sentiment data from Twitter.

•Experience in using Pentaho Data Integration tool for data integration, OLAP analysis and ETL process.

•Experienced in converting ETL operations to Hadoop system using Pig Latin operations, transformations and functions.

Environment: Hadoop, YARN, HDFS, MapReduce, Hive, Oozie, HiveQL, Netezza, Informatica, HBase, Pig, MySQL, NoSQL, Spark, Sqoop, Pentaho.
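
The MapReduce work in this role is described as using the Java API and Pig Latin. Purely to illustrate the same map/reduce pattern in Python, here is a hypothetical Hadoop Streaming job that counts web server log lines by HTTP status code; the log layout, file name, and paths are assumptions, not taken from the resume:

    #!/usr/bin/env python
    # log_count.py - word-count-style aggregation over access logs via Hadoop Streaming.
    # Example submission (paths and jar location are hypothetical):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /logs/raw -output /logs/status_counts \
    #     -mapper "python log_count.py map" \
    #     -reducer "python log_count.py reduce" \
    #     -file log_count.py
    import sys

    def mapper():
        # Emit (http_status, 1) per log line; the status is assumed to be field 9.
        for line in sys.stdin:
            fields = line.split()
            if len(fields) > 8:
                print("%s\t1" % fields[8])

    def reducer():
        # Input arrives sorted by key, so counts accumulate per status code.
        current_key, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t", 1)
            if key == current_key:
                count += int(value)
            else:
                if current_key is not None:
                    print("%s\t%d" % (current_key, count))
                current_key, count = key, int(value)
        if current_key is not None:
            print("%s\t%d" % (current_key, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()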

Genpact - Hyderabad, India Jun 2013 – May 2014

ETL Developer

Roles & Responsibilities:

•Collected business requirements and drafted technical design documents, source-to-target mapping documentation, and mapping specification documents.

•Considerable experience with Informatica PowerCenter.

•Parsed


