
Data Engineer Senior

Location: Aurora, IL
Posted: November 28, 2023


Name: Akshith

Senior Data Engineer

Phone: 815-***-****

Email: ad1i4k@r.postjobfree.com

PROFESSIONAL SUMMARY

●Over 9 years of experience in Data Engineering, including deep expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.

●Experienced in Data Acquisition, Big Data Analytics, Data Processing, Distributed Computing, Data Analysis, Data Modelling, Cloud Services, Agile Methodologies, Data Pipelines, SQL and NoSQL databases, ETL.

●Big Data/Hadoop, Data Analysis, and Data Modeling professional with applied Information Technology experience.

●Proficient with Unix-based command-line interfaces; expertise in handling ETL tools such as Informatica.

●Strong experience working with HDFS, MapReduce, Spark, PySpark, Hive, Pig, Sqoop, Flume, Kafka, Oozie, and HBase, along with broader Big Data and database development work.

●Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.

●Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new features.

●Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.

●Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.

●Created Hive tables, loaded them with data, and wrote ad-hoc Hive queries that run internally on MapReduce and Tez; replaced existing MR jobs and Hive scripts with Spark SQL and Spark data transformations for more efficient processing; developed Kafka producers and consumers for streaming millions of events per second (a minimal producer sketch follows this summary).

●Involved in defining the structure of the data mart, determining the data sources to be included, and implementing the necessary extract, transform, and load (ETL) processes to populate the data mart with relevant data.

●Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server.

●Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of Big Data ecosystems like Hadoop 2.0 (HDFS, Hive, Pig, Impala), Spark (SparkSQL, Spark MLlib, Spark Streaming).

●Experienced in writing complex SQL, including stored procedures, triggers, joins, and subqueries.

●Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).

●Extensive experience in loading and analyzing large datasets with Hadoop framework (MapReduce, HDFS, PIG, HIVE, Flume, Sqoop, SPARK, Impala, Scala), NoSQL databases like MongoDB, HBase, Cassandra.

●Strong experience in the analysis, design, development, testing, and implementation of Business Intelligence solutions using data warehouse/data mart design, ETL, BI, and client/server applications, and in writing ETL scripts using regular expressions and tools such as Informatica, Pentaho, and SyncSort.

●Experienced with the Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.

●Expert in designing Server jobs using various types of stages like Sequential file, ODBC, Hashed file, Aggregator, Transformer, Sort, Link Partitioner and Link Collector.

●Proficiency in Big Data practices and technologies, including Hadoop ecosystem tools such as HDFS, MapReduce, Spark, PySpark, Hive, Pig, HBase, Sqoop, Oozie, Flume, Kafka, and Impala.

●Solid knowledge of AWS services such as EMR, Redshift, S3, and EC2, including configuring servers for auto-scaling and elastic load balancing.

●Experience in working with various distributions like Cloudera (CDH) and Hortonworks.

●Working experience in developing Apache Spark programs using python and SQL.

●Experience integrating data in AWS with Snowflake.

●Experienced with distributed version control systems such as GitHub, GitLab and Bitbucket to keep the versions and configurations of the code organized.

●Excellent Interpersonal and communication skills, efficient time management and organization skills, ability to handle multiple tasks and work well in a team environment.

●Transforming the data into a format that is suitable for analysis, such as converting data from wide to long format or normalizing data.

●Highly organized with the ability to manage multiple projects and meet deadlines and can work collaboratively with all the team members to ensure high-quality products.

●Working knowledge of Azure cloud components (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, CosmosDB).

●Experienced in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse, and controlling database access.

●Extensive experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.

●Good knowledge of security requirements and their implementation using Azure Active Directory, Sentry, Ranger, and Kerberos for authentication and authorization of resources.

●Experienced in working with the Spark ecosystem using Scala and Hive queries on data formats such as text files and Parquet.

●Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.

●Experienced with JSON-based RESTful web services and XML/QML-based SOAP web services; worked on various applications using Python IDEs such as Sublime Text and PyCharm.

●Developed web-based applications using Python, DJANGO, QT, C++, XML, CSS3, HTML5, DHTML, JavaScript and jQuery.

●Involved in publishing of various kinds of live, interactive data visualizations, dashboards, reports and workbooks from Tableau Desktop to Tableau servers.

●Conducted training for users on interacting with, filtering, sorting, and customizing views on existing visualizations generated through Tableau Desktop.
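
A minimal kafka-python producer sketch of the event-streaming pattern referenced above; the broker address, topic name, key scheme, and event fields are illustrative assumptions, not the production setup:

```python
import json
import time
from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Placeholder broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",      # wait for full acknowledgment for durability
    linger_ms=10,    # small batching window to sustain high throughput
)

for i in range(1000):
    event = {"event_id": i, "ts": time.time(), "type": "page_view"}
    # Keyed sends keep related events on the same partition, preserving per-key order.
    producer.send("events", key=str(i % 10).encode("utf-8"), value=event)

producer.flush()  # block until all buffered events are delivered
producer.close()
```

The acks="all" and linger_ms settings illustrate the usual trade-off at high event rates: a small batching delay in exchange for throughput and delivery guarantees.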

TECHNICAL SKILLS

Data Technologies: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase

Programming Languages: Python, PL/SQL, SQL, Scala, C, C#, C++, T-SQL, PowerShell, JavaScript, Perl

Cloud Technologies: AWS, Microsoft Azure, GCP, Databricks, Snowflake

Cloud Services: Azure Data Lake Storage Gen2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS, Dataflow, BigQuery, VM, Delta Tables, Cloud Functions, Clusters

Databases: MySQL, SQL Server, IBM DB2, Postgres, Oracle, MS Access, Teradata, Snowflake

NoSQL Databases: MongoDB, Cassandra, HBase

Development Methodologies: Agile, Lean Agile, Pair Programming, Waterfall, Test-Driven Development

ETL, Visualization & Reporting: Tableau, DataStage, Informatica, Talend, SSIS, SSRS

Frameworks: Django, Pandas, NumPy, Matplotlib, TensorFlow, PyTorch

Version Control & CI/CD Tools: Jenkins, Git, CircleCI, SVN

Operating Systems: Unix, Linux, Windows, macOS

Monitoring Tools: Apache Airflow, Control-M

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP, Clustering

PROFESSIONAL EXPERIENCE

Centene Corporation, Saint Louis June 2022 to Present

Role: Senior Data Engineer - Azure

Responsibilities:

●Analyze, design, and build modern data solutions using Azure PaaS services to support data visualization; understand the current production state of the application and determine the impact of new implementations on existing business processes.

●Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingest data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and process it in Azure Databricks.

●Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.

●Architect and implement medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).

●Design, set up, maintain, and administer Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

●Performed ETL operations using SSIS and loaded the data into Secure DB.

●Hands-on experience with Data Vault concepts and data models; well versed in data warehousing concepts and their implementation.

●Designed, reviewed, and created primary objects such as views, indexes based on logical design models, user requirements and physical constraints.

●Worked with stored procedures for data set results for use in Reporting Services to reduce report complexity and to optimize the run time. Exported reports into various formats (PDF, Excel) and resolved formatting issues.

●Designed, developed, and maintained data pipelines in Azure Databricks to process and analyze large volumes of data for real-time and batch processing.

●Orchestrated complex data workflows and ETL processes using Azure Synapse Pipeline to move data between various data sources and destinations.

●Leveraged Azure Blob Storage and Azure Data Lake as storage solutions to store and manage data efficiently and cost-effectively.

●Created Application Interface Document for the downstream to create a new interface to transfer and receive the files through Azure Data Share.

●Creating pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks (a representative PySpark sketch follows this list).

●Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks.

●Created, provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.

●Created Linked service to land the data from SFTP location to Azure Data Lake.

●Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.

●Created several Databricks Spark Jobs with PySpark to perform several table-to-table operations.

●Working with complex SQL, Stored Procedures, Triggers, and packages in large databases from various servers.

●Involved in migrating the client data warehouse architecture from on-premises into Azure cloud.

●Create pipelines in ADF using linked services to extract, transform and load data from multiple sources like Azure SQL, Blob storage and Azure SQL Data warehouse.

●Created storage accounts as part of an end-to-end environment for running jobs.

●Monitoring end to end integration using Azure monitor.

●Implementation of data movements from on-premises to cloud in Azure.

●Develop batch processing solutions using Data Factory and Azure Databricks.

●Implement Azure Databricks clusters, notebooks, jobs, and auto-scaling.

●Preparing ETL test strategy, designs and test plans to execute test cases for ETL and BI systems.

●Implemented DevOps practices such as Infrastructure as Code, Continuous Integration and Deployment (CI/CD), and automated testing; worked with containerization technologies such as Docker and Kubernetes.

●Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.

●Worked on transformations to transform the data required by the analytics team for visualization and business decisions.

●Collaborated with data architects to define data architecture and design data solutions aligned with business requirements.

●Provided on-call support for data pipeline issues and incidents, ensuring minimal downtime and data loss.

●Conducted data migration projects from on-premises environments to Azure cloud, ensuring minimal disruption to business operations.
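
A minimal PySpark sketch of the kind of table-to-table Databricks job described above; the ADLS path, column names, and target table are illustrative assumptions, not the actual production pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists; getOrCreate() simply reuses it.
spark = SparkSession.builder.appName("daily-claims-load").getOrCreate()

# Illustrative ADLS Gen2 source path (placeholder storage account and container).
source_path = "abfss://raw@examplestorage.dfs.core.windows.net/claims/"

claims = spark.read.format("parquet").load(source_path)

# Example transformation: derive a date column and aggregate per member per day.
daily_totals = (claims
                .withColumn("claim_date", F.to_date("claim_ts"))
                .groupBy("member_id", "claim_date")
                .agg(F.sum("claim_amount").alias("total_amount"),
                     F.count("*").alias("claim_count")))

# Persist the result as a Delta table (the target schema/table name is assumed).
(daily_totals.write
 .format("delta")
 .mode("overwrite")
 .saveAsTable("analytics.daily_claim_totals"))
```

In practice a job like this would be scheduled from ADF or as a Databricks job, with the source path and table names supplied as parameters.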

Environment: Azure SQL, Azure Storage Explorer, Azure Storage, Azure Blob Storage, Azure Backup, Azure Files, Azure Data Lake Storage, SQL Server Management Studio 2016, Visual Studio 2015, VSTS, Azure Blob, Power BI, PowerShell, .NET, SSIS, DataGrid, ETL (Extract, Transform, Load), Business Intelligence (BI).

Bank of America, Charlotte, NC Jan 2020 to May 2022

Role: Sr. Data Engineer - AWS

Responsibilities:

●Involved in full Software Development Life Cycle (SDLC) - Business Requirements Analysis, preparation of Technical Design documents, Data Analysis, Logical and Physical database design, Coding, Testing, Implementing, and deploying to business users.

●Designed and implemented end-to-end data pipelines on AWS using services such as AWS Glue, AWS Lambda, and AWS EMR.

●Wrote complex SQL using joins, subqueries, and correlated subqueries; expertise in SQL queries for cross-verification of data.

●Created ingestion framework for creating Data Lake from heterogeneous sources like Flat files, Oracle Db, mainframe, and SQL server Databases.

●Design and Develop ETL Processes in AWS Glue to load data from external sources like S3, glue catalog, and AWS Redshift.

●Used DynamoDB to log ETL process errors while validating input files against target table structures, including data type mismatches.

●Developed complex ETL mappings for Stage, Dimensions, Facts, and Data marts load. Involved in Data Extraction for various Databases & Files using Talend.

●Created Talend jobs using the dynamic schema feature. Have used Big Data components (Hive components) for extracting data from hive sources.

●Efficiently ingested large files (around 600 GB) into S3.

●Performance tuning: used map cache properties, multi-threading, and parallelized components for better performance with huge source data; tuned SQL source queries to filter unwanted data out of the ETL process.

●Used Glue jobs to read data from S3 and load it into Redshift tables, reading metadata from the Data Catalog in JSON format (see the Glue sketch after this list).

●Extensively used AWS S3 buckets, Lambda functions, and DynamoDB.

●Participated in loading data into the data warehouse using Talend big data (Hadoop) ETL components, AWS S3 buckets, and AWS services for the Redshift database.

●Developed and maintained data dictionaries, data catalogs, and data lineage documentation for improved data understanding and traceability.

●Developed ETL Processes in AWS Glue to migrate Campaign data from external sources like S3, ORC/Parquet/Text Files into AWS snowflake.

●Worked with GitHub to clone repositories and commit changes across code versions, and released to Bitbucket by creating pull requests for the merge process.

●Extensively worked towards performance tuning/optimization of queries, contributing to an improvement in the deployed code.

●Created a data pipeline involving various AWS services including S3, Kinesis Firehose, Kinesis Data Streams, SNS, SQS, Athena, and Snowflake.

●Worked on end-to-end deployment of the project that involved Data Analysis, Data Pipelining, Data Modelling, Data Reporting and Data documentations as per the business needs.

●Construct the AWS data pipelines using VPC, EC2, S3, Auto Scaling Groups (ASG), EBS, Snowflake, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, CloudTrail.

●Launched and configured Amazon EC2 cloud servers using AMIs (Linux/Ubuntu), configuring the servers for the specified applications.

●Authored Python (PySpark) scripts and custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.

●Gathering the data stored in AWS S3 from various third-party vendors, optimizing it and joining with internal datasets to gather meaningful information.

●Design and Develop ETL Processes in AWS Glue to migrate Campaign data from external sources like S3, ORC/Parquet/Text Files into AWS Redshift

●Built AWS CI/CD data pipelines and an AWS data lake using EC2, AWS Glue, and AWS Lambda.

●Migrate data from on-premises to AWS storage buckets.

●Developed Python scripts to transfer data from on-premises to AWS S3 and to call REST APIs and extract data to AWS S3.

●Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions.

●Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.

●Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the source to target mappings developed.

●Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.

●Skilled in visualizing and presenting data using Tableau, creating interactive dashboards, and generating meaningful insights for stakeholders.

●Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

●Performed benchmark tests reading data from databases and object stores using Pandas and PySpark APIs to compare results, identify potential improvement areas, and provide recommendations.

●Read and write Parquet, JSON files from S3 buckets using Spark, Pandas data frame with various configurations.

●Developed code to assign default lifecycle policies to buckets and objects and to auto-purge objects based on the default policy in Mercury, an internal implementation of AWS S3.

●Developed interactive and visually appealing dashboards and reports using Power BI, enabling data-driven decision-making for stakeholders.

●Designed and implemented data visualizations and charts in Tableau to effectively communicate complex data insights and trends to non-technical users.

●Work closely with the application customers to resolve JIRA tickets related to API issues, data issues, consumption latencies, onboarding, and publishing data.
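
A minimal AWS Glue (PySpark) job sketch of the S3/Data Catalog to Redshift load described above; the database, table, connection, mapping, and temp-bucket names are placeholders, not the actual job:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table registered in the Glue Data Catalog (names are placeholders).
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="campaign_db", table_name="raw_events")

# Map and cast source columns to the Redshift target layout (illustrative mapping).
mapped = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("event_ts", "string", "event_ts", "timestamp"),
        ("amount", "double", "amount", "double"),
    ])

# Write to Redshift through a Glue catalog connection (connection name is a placeholder).
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.events", "database": "dev"},
    redshift_tmp_dir="s3://example-temp-bucket/glue-tmp/")

job.commit()
```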

Environment: Python, Spark, AWS EC2, AWS S3, AWS EMR, AWS Redshift, AWS Glue, AWS RDS, AWS Kinesis Firehose, Kinesis Data Streams, AWS SNS, AWS SQS, AWS Athena, Snowflake, SQL, Tableau, Git, REST, Bitbucket, Jira.

CDK Global, Chicago, Illinois Aug 2017 to Dec 2019

Role: Hadoop/Big Data Engineer

Responsibilities:

●Worked on developing ETL processes (DataStage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.

●Developed Spark scripts and UDFs using both the Spark DataFrame DSL and Spark SQL for data aggregation and querying, and wrote data back into RDBMS through Sqoop.

●Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, with compression codecs such as Gzip, Snappy, and LZO.

●Strong understanding of Partitioning, bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.

●Interacted with business partners, Business Analysts, and product owners to understand requirements and build scalable distributed data solutions using the Hadoop ecosystem.

●Developed Spark Streaming programs to process near-real-time data from Kafka, with both stateless and stateful transformations (a minimal streaming sketch follows this list).

●Experience in report writing using SQL Server Reporting Services (SSRS) and creating various types of reports like drill down, Parameterized, Cascading, Conditional, Table, Matrix, Chart and Sub Reports.

●Used the DataStax Spark connector to store data into and retrieve data from Cassandra.

●Wrote Oozie scripts and set up workflow using Apache Oozie workflow engine for managing and scheduling Hadoop jobs.

●Worked on implementation of a log producer in Scala that watches for application logs, transforms incremental logs and sends them to a Kafka and Zookeeper based log collection platform.

●Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.

●Worked with HIVE data warehouse infrastructure-creating tables, data distribution by implementing partitioning and bucketing, writing, and optimizing the HQL queries.

●Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer which reduced 60% of execution time.

●Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.

●Developed ETL pipelines in and out of data warehouses using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.

●Transformed the data using AWS Glue dynamic frames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the Glue workflow feature.

●Worked on cluster installation, commissioning and decommissioning of data nodes, name node recovery, capacity planning, and slot configuration.

●Developed data pipeline programs with Spark Scala APIs, performed data aggregations with Hive, and formatted data (JSON) for visualization.
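
A minimal sketch of the Kafka consumption described above, written with Spark Structured Streaming rather than the DStream-based jobs of that era; the broker, topic, schema, and output paths are illustrative placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Requires the spark-sql-kafka connector package on the Spark classpath.
spark = SparkSession.builder.appName("kafka-log-stream").getOrCreate()

# Illustrative schema for the incoming JSON log events.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "app-logs")                    # placeholder topic
       .load())

# Kafka delivers bytes; cast the value and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Write the parsed stream to Parquet with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/events")               # placeholder output path
         .option("checkpointLocation", "/chk/events")  # placeholder checkpoint dir
         .outputMode("append")
         .start())

query.awaitTermination()
```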

Environment: Apache Spark, Map Reduce, Snowflake, Apache Pig, Python, Java, SSRS, HBase, AWS, Cassandra, PySpark, Apache Kafka, HIVE, SQOOP, FLUME, Apache Oozie, Zookeeper, ETL, UDF

Togglenow Software Solutions, India Jul 2014 to May 2017

Role: Hadoop/SQL Developer

Responsibilities:

●Involved in understanding the requirements of end users and business analysts and developed strategies for ETL processes.

●Generated database monitoring and data validation reports in SQL Server Reporting Service (SSRS).

●Created partitioned tables for a very large database to improve performance.

●Designed dynamic SSIS packages to transfer data across different platforms, validate data during transfer, and archive data files for different DBMSs.

●Responsible for writing SQL queries, stored procedures, views, triggers, T-SQL and DTS/SSIS

●Deployed SSIS packages and Reports to Production Servers.

●Worked on Migration of packages from DTS using SQL Server Integration Service (SSIS).

●Reported all events and requirements through established reporting mechanisms in SSRS.

●Generated test data and tested databases against the functional deliverables in the project documentation and specifications.

●Designed and developed OLAP using MS SQL Server Analysis Services (SSAS).

●Designed and developed MS SQL Server Reporting Services (SSRS) under SQL Server 2008.

●Generated periodic reports based on the statistical analysis of the data using SQL Server Reporting Services (SSRS).

●Proficient in designing and implementing data integration solutions using Informatica PowerCenter, IICS, Edge, EDC, and IDQ.

●Developed mappings/Reusable Objects/Transformation by using mapping designer, transformation developer in Informatica Power Center.

●Designed and developed ETL Mappings to extract data from flat files, and Oracle to load the data into the target database.

●Proven ability to analyze complex data requirements and design efficient and scalable ETL workflows to meet business objectives.

●Expertise in data profiling, data cleansing, and data quality management using Informatica Data Quality (IDQ) to ensure data accuracy and consistency.

●Skilled in utilizing Tivoli Workload Scheduler (TWS) for job scheduling and automation in data integration processes.

●Proficient in creating mappings, workflows, and sessions in Informatica PowerCenter for data extraction, transformation, and loading.

●Developed Informatica mappings for complex business requirements using transformations such as Normalizer, SQL Transformation, Expression, Aggregator, Joiner, Lookup, Sorter, Filter, and Router.

●Used ETL to load data using PowerCenter/Power Connect from source systems like Flat Files and Excel Files into staging tables and load the data into the target database.

●Developed complex mappings using multiple sources and targets in different databases, flat files.

●Extensively used Informatica client tools: Source Analyzer, Warehouse Designer, Transformation Developer, Mapping Designer, Mapplet Designer, and the Informatica Repository.

●Worked on creating Informatica mappings with different transformations like lookup, SQL, Normalizer, Aggregator, SQ, Joiner, Expression, Router etc.

●Designed and developed Informatica ETL interfaces to load data incrementally from Oracle databases and flat files into the staging schema (the incremental watermark pattern is sketched after this list).

●Used various transformations like Unconnected/Connected Lookup, Aggregator, Expression Joiner, Sequence Generator, Router etc.

●Responsible for the development of Informatica mappings and tuning for better performance.
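
A hedged Python sketch of the watermark-based incremental-load pattern those Informatica interfaces implemented; this is not Informatica itself, and the connection strings, control table, and column names are assumptions for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection strings; real DSNs and credentials would differ.
source = create_engine("oracle+oracledb://user:pwd@source-host:1521/?service_name=ORCL")
staging = create_engine("mssql+pyodbc://user:pwd@staging_dsn")

# 1. Read the last successfully loaded watermark from a control table.
with staging.connect() as conn:
    last_ts = conn.execute(
        text("SELECT MAX(loaded_through) FROM etl_control WHERE mapping = 'orders'")
    ).scalar()

# 2. Pull only the rows changed since the watermark (the incremental delta).
delta = pd.read_sql(
    text("SELECT * FROM orders WHERE last_updated > :wm"),
    source,
    params={"wm": last_ts},
)

if not delta.empty:
    # 3. Land the delta in the staging schema; downstream mappings pick it up from there.
    delta.to_sql("stg_orders", staging, schema="staging",
                 if_exists="append", index=False)

    # 4. Advance the watermark so the next run starts where this one ended.
    with staging.begin() as conn:
        conn.execute(
            text("UPDATE etl_control SET loaded_through = :wm WHERE mapping = 'orders'"),
            {"wm": delta["last_updated"].max()},
        )
```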

Environment: Informatica PowerCenter 10.4/10.2, Oracle 9i/8i, Flat Files, SQL, T-SQL, MDX, SQL Server 2008, Reporting Services (SSRS), Integration Services (SSIS), Analysis Services (SSAS), DTS, Windows XP/7/Vista, XML, MS Excel, MS Access, SAS, Linux.

Education: Bachelor's in Computer Science from Jaya Prakash Narayan College of Engineering.


