
Data Engineer Analyst

Location: McKinney, TX
Posted: January 11, 2024

Lokesh Kumar

Email id: ad2ojb@r.postjobfree.com

Contact no: +1-414-***-****

LinkedIn: https://www.linkedin.com/in/lokesh-kumar-a549a1291/

Azure Data Engineer / Data Engineer / Big Data Analyst / Data Analyst

PROFESSIONAL SUMMARY:

Highly skilled Data Engineer with 8+ years of experience in the IT industry, covering data ingestion, data lake management, data warehouse design, report generation, and advanced analytics.

My experience spans the entire Software Development Life Cycle (SDLC), and I am well-versed in the Agile Development model. I've been involved in all stages, from initial requirement gathering to deploying solutions and providing ongoing production support.

My extensive experience includes managing data migration, conversion, quality assurance, integration, and metadata management services. Additionally, I possess a strong background in migrating data to Snowflake and other databases.

As an ETL expert, I have consistently created data transformation packages, facilitated data migration, performed data cleansing, executed data backups, and ensured daily transaction synchronization. My toolkit includes Informatica, Alteryx, and AutoSys.

Experience in data migration using Sqoop to transfer data between HDFS, Hive, and relational database systems, as well as with other data ingestion systems such as Flume and Kafka.

Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).

My hands-on experience with components of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Flume, Sqoop, ZooKeeper, and Oozie, has enriched my capabilities in data engineering.

Experience working with Azure Blob Storage, Azure Data Lake, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Analytics, PolyBase, Azure HDInsight, and Azure Databricks.

Designed and implemented end-to-end ETL (Extract, Transform, Load) pipelines in Azure Data Factory for efficient data movement and transformation.

Designed and implemented Spark-based data processing workflows in Azure Databricks for large-scale data analytics and machine learning.
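
A minimal sketch of the kind of Databricks PySpark workflow described above; the storage account, container paths, and column names are illustrative assumptions, not actual project values.

```python
# Illustrative PySpark job of the kind run in an Azure Databricks notebook.
# Paths, container names, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-aggregation").getOrCreate()

# Read raw data landed in ADLS Gen2 (abfss path is an assumed example).
raw = spark.read.parquet("abfss://raw@examplestorage.dfs.core.windows.net/sales/")

# Basic cleansing and a daily aggregate per region.
daily = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_ts"))
       .groupBy("order_date", "region")
       .agg(F.sum("amount").alias("total_amount"),
            F.countDistinct("order_id").alias("order_count"))
)

# Write the curated result back to the lake, partitioned by date.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "abfss://curated@examplestorage.dfs.core.windows.net/sales_daily/"
)
```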

Development level experience in Microsoft Azure providing data movement and scheduling functionality to cloud-based technologies such as Azure Blob Storage and Azure SQL Database.

Good experience in deploying, managing, and developing MongoDB clusters.

Experience in setting up, configuring, and managing Azure SQL Databases. This includes creating and optimizing database schemas, writing T-SQL queries, and ensuring high availability.

Knowledge and experience with AWS services like Redshift, S3, Glue, Athena, Lambda, and CloudWatch, and with EMR applications such as Hive and Presto.

Hands-on experience in migrating on-premises ETL processes to Google Cloud Platform (GCP), employing cloud-native tools like BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.

Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
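
A minimal Airflow DAG sketch of this kind of scheduling, assuming a daily Sqoop import followed by a Spark job; the DAG id, connection string, and script paths are hypothetical.

```python
# Minimal Apache Airflow DAG sketch for scheduling Hadoop/Spark work.
# DAG id, schedule, and the shell commands are assumed examples.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_ingest_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",   # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="sqoop_import",
        bash_command="sqoop import --connect jdbc:oracle:thin:@db:1521/ORCL "
                     "--table ORDERS --target-dir /data/raw/orders",
    )
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_orders.py",
    )
    ingest >> transform
```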

Experience in developing conceptual, logical, and physical database designs for Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) systems.

Good experience in technical consulting and end-to-end delivery with data modeling and data governance.

Strong SQL development skills including writing Stored Procedures, Triggers, Views, and User Defined functions.

Applied ELT methodologies to efficiently move and transform data within the data processing ecosystem.

Utilized ELT for optimal data integration, ensuring timely availability of transformed data for analytics.

Experience working with NoSQL databases and big data tools including Hadoop, Hive and Spark.

Experience in creating Teradata SQL scripts using OLAP functions like RANK() OVER to improve query performance while pulling data from large tables.

Excellent knowledge of and extensive experience using NoSQL databases (Cassandra, MongoDB, DynamoDB) as well as Snowflake.

Experience with Google Cloud components, Google Container Builder, and GCP client libraries.

Supported various business teams with data mining and reporting by writing complex SQL, including OLAP functions such as ranking, partitioning, and windowing functions.

Extensive knowledge in designing Reports, Scorecards, and Dashboards using Power BI and Tableau.

Strong analytical skills with ability to quickly understand a client's business needs. Involved in meetings to gather information and requirements from the clients.

EDUCATION:

Bachelor’s in computer science – Anna University

Master’s in computer science – Illinois Institute of Technology

TECHNICAL SKILLS:

Cloud Platform: AWS (S3, DMS, RDS, EMR, Snowflake, EC2, Glue), Azure (Azure Data Factory, Azure Databricks, Gen2 Storage, Azure DevOps, Blob Storage, Event Hub, Log Analytics, Sentinel, Cosmos DB, ADLA, ADLS), GCP.

Big data Technologies: Apache Spark, Hadoop, HDFS, Hive, Pig, Apache Kafka, Apache Flink.

CI/CD: Jenkins, Azure DevOps

Programming Languages: Scala, Python, R, C, C++, C#, .NET, Java

Database: DynamoDB, Neo4j, Oracle DB, MS SQL Server, MySQL, MS Access

Data Visualization: Tableau and Power BI

IDE: PyCharm, Eclipse, Jupyter Notebook, Databricks, Visual Studio Code.

ETL: Informatica, Alteryx

Frontend: HTML, CSS, .NET

Hadoop Ecosystem: Hive, Sqoop, Pig, YARN, MapReduce, Apache Spark, HDFS, Kafka.

Hive: Hive Architecture, Tables, Partitioning, Bucketing.

Sqoop: Import, Export

Spark: Spark SQL, RDDs, DataFrames, batch processing applications, streaming applications.

WORK EXPERIENCE:

Client: Abbott, Austin, TX Jan 2022 to Present

Role: Azure Data Engineer

Responsibilities:

Developed data engineering processes in Azure using components such as Azure Data Factory, Azure Data Lake Analytics, HDInsight, Databricks, and more.

Created intricate SQL/U-SQL/PySpark code for data engineering pipelines within Azure Data Lake Analytics and Azure Data Factory.

Conducted comprehensive data analysis using SQL, Excel, Data Discovery, and other tools, encompassing both legacy systems and new data sources.

Demonstrating comprehensive skills in querying, managing, and optimizing relational databases for efficient data retrieval and manipulation.

Managed and optimized SQL-based databases, ensuring data consistency, integrity, and efficient storage, with a focus on performance tuning and query optimization.

Integrated Azure Data Factory with other Azure services like Azure SQL Database, Azure Blob Storage, and Azure Data Lake Storage for seamless data movement.

Designed intuitive and user-friendly dashboards in Tableau, optimizing layout, color schemes, and interactive elements to enhance user experience.

Proficient in utilizing Databricks for processing and analyzing large-scale datasets, demonstrating expertise in big data technologies.

Optimized Power Query for efficient data extraction, transformation, and loading (ETL) processes, streamlining the integration of diverse data sets.

Demonstrated proficiency in the Hadoop ecosystem, including HDFS, MapReduce, YARN, and various related tools for distributed data processing.

Integrated Hadoop with Apache Spark, combining the strengths of both frameworks to perform complex data processing and analytics tasks.

Developed robust data processing solutions using C#, demonstrating proficiency in handling diverse datasets efficiently.

Utilized C# for seamless interaction with relational databases like SQL Server, Oracle, and MySQL.

Implemented optimized SQL queries and stored procedures for efficient data retrieval and manipulation.

Implemented security measures in Hadoop clusters, including authentication, authorization, and encryption, to ensure the confidentiality and integrity of stored and processed data.

Implemented advanced Data Analysis Expressions (DAX) formulas for complex calculations and custom metrics, enhancing the depth of analysis in Power BI reports.

Designed and implemented strategies for incremental data loading to enhance efficiency and reduce the processing time of Informatica ETL workflows.

Implemented Change Data Capture techniques in Informatica for efficiently capturing and processing changes in source data, minimizing the amount of data processed during each ETL run.

Utilized DAX time intelligence functions to perform dynamic date calculations, such as year-to-date, quarter-to-date, and moving averages, enabling trend analysis.

Evaluated and worked with Azure Data Factory as an ETL tool to process business-critical data into aggregated Hive tables in the cloud. Deployed and developed big data applications such as Spark, Hive, Kafka, and Flink in the Azure cloud.

Extensive experience in using Talend for data integration, designing and deploying ETL processes to move and transform data across various systems.

Utilized Talend's extensive connectivity options to integrate with diverse data sources, including databases, cloud storage, and APIs, ensuring comprehensive data access.

Applied Hive for data warehousing on Hadoop, creating and managing structured tables to facilitate efficient querying and reporting.

Leveraged Hive's MapReduce abstraction to simplify complex data processing tasks, enabling developers to focus on high-level queries rather than low-level programming.

Developed and executed notebooks in languages like Python, Scala, and SQL for interactive data exploration, analysis, and visualization.

Configured and customized SnapLogic connectors (Snaps) to interface with a diverse set of applications, databases, and cloud services, ensuring seamless data flow.

Utilized SnapLogic's visual interface to define and execute complex data transformations, ensuring data quality and adherence to business rules.

Integrated Pentaho ETL outputs seamlessly with Business Intelligence (BI) tools for reporting and analytics.

Ensured compatibility and optimized data structures for BI platforms such as Tableau, Power BI, or Looker.

Proficient in working with MarkLogic, a NoSQL database, showcasing expertise in schema-less data storage and retrieval for flexibility and scalability.

Designed semantic data models within MarkLogic, enabling the representation of data relationships and facilitating advanced queries for comprehensive analytics.

Developed Python-based Lambda functions for data processing tasks, leveraging libraries like Boto3 for AWS interactions.

Integrated Python scripts seamlessly with AWS Lambda for a cohesive serverless solution.
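
A hedged sketch of such a Boto3-based Lambda handler; the event shape follows the standard S3 notification format, while the filtering logic and output bucket name are hypothetical examples.

```python
# Sketch of a Python AWS Lambda handler that processes an object dropped in S3.
# Bucket/key handling follows the standard S3 event shape; the transform and
# output bucket name are hypothetical.
import csv
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Example transform: keep only rows with a non-empty customer_id column.
        rows = [r for r in csv.DictReader(io.StringIO(body)) if r.get("customer_id")]

        if rows:
            out = io.StringIO()
            writer = csv.DictWriter(out, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
            s3.put_object(Bucket="example-clean-bucket", Key=key, Body=out.getvalue())
    return {"processed": len(event["Records"])}
```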

Applied dynamic SQL in PL/SQL for flexible and adaptive query execution, allowing queries to be constructed and executed at runtime.

Implemented solutions for multi-cloud connectivity, ensuring seamless data transfer and interoperability within AWS and across other cloud environments.

Integrated Amazon EMR with other AWS services, such as Amazon S3 for data storage, AWS Glue for ETL, and Amazon RDS for managed relational databases, creating comprehensive data processing pipelines.

Managed and optimized data warehouses on AWS, ensuring data consistency, integrity, and efficient retrieval for analytical purposes.

Played a key role in cost estimation, billing, and the implementation of cloud services.

Environment: Windows Remote Desktop, Azure Function Apps, Azure Data Lake, Azure Data Factory (ADF v2), Azure Cloud, Blob Storage, SQL Server, Azure PowerShell, Azure SQL Server, Azure Data Warehouse, Databricks, Python, and AWS.

Client: JPMC, Fair Lawn, NJ Aug 2019 to Dec 2021

Role: AWS/Data Engineer

Responsibilities:

Collaborated with cross-functional teams to gather requirements, integrate with BI assets, review pipelines, and translate needs into Spark SQL and PySpark ETL workflows.

Developed an ETL framework for migrating data from on-premises sources like Hadoop and Oracle to AWS, leveraging Apache Airflow, Apache Sqoop, and Apache Spark (PySpark).
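
A sketch of one step of such a migration, assuming a PySpark JDBC read from Oracle landed in S3 as Parquet; the connection URL, credentials, and table names are placeholders.

```python
# Sketch of one migration step: read a table from Oracle over JDBC with PySpark
# and land it in S3 as Parquet. Connection details are placeholders; the Oracle
# JDBC driver must be available on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-s3").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
    .option("dbtable", "SALES.ORDERS")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .option("fetchsize", 10000)
    .load()
)

# Write to the raw zone of the S3 data lake, partitioned for downstream jobs.
orders.write.mode("overwrite").partitionBy("ORDER_DATE").parquet(
    "s3a://example-datalake/raw/sales/orders/"
)
```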

Assisted clients in migrating fulfillment center data platforms from on-premises to AWS cloud.

Automated DAG generation based on source systems and migrated tables to S3 storage.

Utilized AWS Data Pipeline to set up data pipelines for ingesting data from Spark and migrating it to Snowflake Database.

Implemented advanced SQL performance tuning techniques, optimizing database queries, and enhancing overall system efficiency for seamless data processing.

Optimized Hive queries for enhanced performance by fine-tuning configurations, partitioning tables, and utilizing indexing where applicable.
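
As an illustration of the partitioning approach, a sketch using Spark's Hive support; the database, table, and column names are assumed examples.

```python
# Illustrative partitioned Hive table and a partition-pruned query, issued
# through Spark's Hive support. Table and column names are assumed examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioning").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.transactions (
        txn_id STRING,
        account_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (txn_date DATE)
    STORED AS PARQUET
""")

# Restricting on the partition column lets Hive/Spark prune partitions
# instead of scanning the whole table.
recent = spark.sql("""
    SELECT account_id, SUM(amount) AS total
    FROM analytics.transactions
    WHERE txn_date >= DATE '2021-01-01'
    GROUP BY account_id
""")
recent.show()
```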

Integrated Azure Data Factory with various Azure services, such as Azure SQL Database, Azure Blob Storage, and Azure Data Lake Storage, for comprehensive data processing.

Designed data transformation activities within Azure Data Factory pipelines, utilizing Azure Data Flow for complex data transformations and mappings.

Demonstrated expertise in utilizing Pentaho for end-to-end data integration, covering extraction, transformation, and loading (ETL) processes.

Developed and maintained Pentaho Data Integration (PDI) jobs for seamless data flow within the organization.

Utilized Pentaho Data Integration (PDI) toolset to design and implement complex ETL workflows.

Utilized SSIS transformations for data cleansing, normalization, and enrichment, ensuring data quality and consistency in the destination system.

Integrated Hive with external systems and data sources, ensuring seamless data exchange and enabling a unified view of data across different platforms.

Utilized Native SQL for designing and implementing ETL processes, facilitating the extraction, transformation, and loading of data into data warehouses for analysis.

Applied functional programming principles in Scala, enhancing code modularity, reusability, and supporting a more declarative programming style.

Developed web applications using the Scala Play Framework, combining the benefits of Scala with a powerful and reactive web framework for building scalable applications.

Implemented strategies for scalability and performance optimization in Azure Synapse.

Monitored and tuned Synapse workloads for optimal efficiency.

Implemented a Hadoop-based data lake architecture for scalable storage and processing of large volumes of raw and structured data. Utilized HDFS for data storage and MapReduce for parallel processing.

Designed stored procedures to optimize query execution plans, improving database performance and reducing response times for critical data retrieval tasks.

Connected Tableau to various data sources, including databases, Excel files, cloud platforms, and web data connectors, ensuring seamless data integration for analysis.

Utilized Tableau Calculations to create advanced calculations, custom fields, and formulas, providing deeper insights and supporting complex analytical scenarios.

Applied Tableau for mapping and geospatial analysis, leveraging geographic data to visualize patterns, trends, and regional insights within dashboards.

Implemented transaction management within stored procedures to ensure data integrity, coordinating multiple SQL statements as a single, atomic unit of work.

Used AWS EMR to transform and move large amounts of data into and out of AWS S3.

Leveraged Amazon EMR for Hadoop and Apache Spark-based data processing, optimizing the distribution and parallelization of tasks across nodes for efficient computation.

Successfully integrated on-premises and cloud-based systems, addressing the challenges of hybrid cloud environments using SnapLogic's hybrid integration capabilities.

Developed and managed APIs within SnapLogic, facilitating efficient communication and data exchange between applications and systems.

Integrated C# applications with big data platforms such as Apache Hadoop and Spark, showcasing adaptability across diverse ecosystems.

Implemented data pipelines for processing and analyzing large-scale datasets efficiently.

Developed and consumed RESTful APIs using C#, facilitating seamless integration with external data sources.

Implemented security measures on Amazon EMR clusters, configuring access controls, encryption, and authentication to ensure the confidentiality and integrity of data.

Established and maintained golden records within MDM, serving as the authoritative source for master data entities and supporting data quality initiatives.

Created Lambda functions to run the AWS Glue job based on the AWS S3 events.

Monitored and optimized the performance of AWS Lambda functions, adjusting configurations for efficiency.

Implemented best practices for AWS Lambda, including memory allocation, timeouts, and concurrent execution settings.

Integrated AWS Lambda with other AWS services, such as AWS S3, DynamoDB, or Aurora, for seamless data access and storage.
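
A minimal sketch of a Lambda handler that starts a Glue job from an S3 event, in the spirit of the bullets above; the Glue job name and argument keys are hypothetical.

```python
# Sketch of a Lambda function that starts an AWS Glue job when an S3 object
# arrives. The Glue job name and argument keys are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

def handler(event, context):
    runs = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Pass the new object location to the Glue job as job arguments.
        response = glue.start_job_run(
            JobName="example-campaign-etl",
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        runs.append(response["JobRunId"])
    return {"job_runs": runs}
```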

Worked with Big Data technologies such as Hadoop and Spark, incorporating them into data warehousing solutions. Demonstrated proficiency in managing and processing large volumes of data for analytics and reporting.

Presented dashboards to business users and cross-functional teams, defining key performance indicators (KPIs) and identifying data sources.

Documented source-to-target mappings for both data integration and web services.

Environment: AWS (S3, DMS, RDS, EMR, Snowflake, EC2, Glue), Azure Databricks, ADF, Azure Data Lake, Flask, Zeppelin, SQL, Python, Git, Tableau, MS Office, JIRA, Windows, Kubernetes, Airflow, Nebula, Oracle DB, DynamoDB, HDFS, PySpark, SQL Server Management Studio (SSIS, SSRS).

Client: CBRE Group, Milwaukee, WI Jan 2017 to July 2019

Role: AWS Data Analyst/Data Engineer

Responsibilities:

Involved in full Software Development Life Cycle (SDLC) - Business Requirements Analysis, preparation of Technical Design documents, Data Analysis, Logical and Physical database design, Coding, Testing, Implementing, and deploying to business users.

Developed complex data mappings in Informatica, integrating data from Oracle, Teradata, and Sybase into the target database.

Designed and implemented complex mappings with various transformations in Informatica like Source Qualifier, Aggregator, Router, Joiner, Union, Expression, Lookup, Filter, Update Strategy and more.

Developed Python scripts to automate the ETL process using Apache Airflow and CRON scripts in the UNIX operating system as well.

Worked on Google Cloud Platform (GCP) services like Compute Engine, Cloud Load Balancing, Cloud Storage, and Cloud SQL.

Used Google Cloud Functions with Python to load data into BigQuery for CSV files arriving in a GCS bucket.
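
A hedged sketch of such a Cloud Function, assuming a storage-finalize trigger and an autodetected CSV schema; the dataset and table names are placeholders.

```python
# Sketch of a Cloud Function (Python) that loads a CSV landing in GCS into
# BigQuery. Dataset/table names are assumed placeholders.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered by a new object in the GCS bucket (google.storage.object.finalize)."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV objects

    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        uri, "example_dataset.landing_table", job_config=job_config
    )
    load_job.result()  # wait for the load to complete
```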

Worked with EMR, S3, and EC2 services in the AWS cloud, migrating servers, databases, and applications from on-premises to AWS.

Integrated T-SQL with .NET applications, providing seamless connectivity between database systems and custom software solutions.

Deployed Tableau visualizations to Tableau Server and Tableau Online, facilitating secure and scalable sharing of dashboards across teams and organizations.

Performed data blending and joins In Tableau to combine data from multiple sources, enabling comprehensive analysis and reporting in a single view.

Created maintenance plans in T-SQL for database backups, integrity checks, and index optimizations, automating routine tasks for database health and reliability.

Achieved continuous integration and delivery to Databricks by defining build and release pipeline tasks on AWS DevOps that build and deploy the application libraries into DBFS.

Created views and materialized views using DDL, providing customized perspectives on data for simplified querying and reporting.

Integrated C# applications with cloud platforms such as Azure and AWS for scalable and flexible data processing.

Integrated Matillion with Tableau for efficient data visualization and reporting.

Ensured optimal performance and compatibility between Matillion-transformed data and Tableau dashboards.

Implemented strategies to enhance Tableau's connectivity and interaction with Matillion-produced datasets.

Designed and implemented data models in C#, ensuring alignment with data architecture principles.

Collaborated in architecting scalable and maintainable data solutions, showcasing a deep understanding of data systems.

Utilized Databricks notebooks for interactive and iterative data analysis, fostering collaboration and documentation of data processing workflows.

Managed Spark clusters efficiently within Databricks, optimizing resource allocation and ensuring high-performance data processing.

Integrated machine learning workflows within Databricks, leveraging MLlib and MLflow for scalable and collaborative machine learning model development.
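
A brief sketch of MLlib training tracked with MLflow inside Databricks, under the assumption of a simple logistic-regression pipeline; the training table and feature columns are hypothetical.

```python
# Sketch of MLflow tracking around a Spark MLlib model inside Databricks.
# The training table and feature columns are assumed placeholders.
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
train = spark.table("analytics.churn_training")  # assumed Hive/Delta table

assembler = VectorAssembler(inputCols=["tenure", "monthly_spend"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churned", maxIter=50)

with mlflow.start_run(run_name="churn-logreg"):
    model = Pipeline(stages=[assembler, lr]).fit(train)
    mlflow.log_param("max_iter", 50)
    mlflow.spark.log_model(model, artifact_path="model")
```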

Contributed to the creation and maintenance of ELT specifications, including source-to-target mappings.

Conducted project-specific data analysis as part of the ELT process, analyzing and mapping required data.

Developed and maintained Matillion workflows to execute ELT tasks seamlessly.

Designed, developed, and maintained Matillion ETL workflows to facilitate efficient data processing.

Applied Matillion's intuitive interface for orchestrating end-to-end ETL processes within cloud environments.

Implemented monitoring and logging features within Matillion to track the progress and performance of ETL jobs.

Experience with cloud-based version control platforms like GitHub.

Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift.
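
A minimal AWS Glue (PySpark) job sketch along these lines; the S3 paths, Glue connection name, and Redshift table are assumed placeholders.

```python
# Sketch of an AWS Glue (PySpark) job moving campaign files from S3 into
# Redshift. Connection name, paths, and table are assumed placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read Parquet campaign files from S3 as a DynamicFrame.
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaigns/"]},
    format="parquet",
)

# Write into Redshift through a pre-defined Glue JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="example-redshift-conn",
    connection_options={"dbtable": "analytics.campaigns", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/temp/",
)
job.commit()
```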

Environment: Informatica PowerCenter 9.5, Python, Databricks, Oracle, Teradata, AWS Glue, Talend, Google Cloud Platform (GCP), SQL, Erwin, Unix Shell Scripting, CRON, PostgreSQL Server, and AWS Redshift

Client: Tesco, Bengaluru, India Aug 2014 to Dec 2015

Role: Data Analyst

Responsibilities:

Worked on a migration project which required gap analysis between legacy systems and new systems.

Involved in various projects related to Data Modeling, Data Analysis, Design and Development for both OLTP and Data warehousing environments.

Utilized a variety of data stores, including data warehouses, relational databases, in-memory caches, and searchable document databases.

Ensured seamless integration between different data stores, enabling unified access and retrieval.

Worked on a data lake in AWS S3, copying data to Redshift and implementing business logic with custom SQL, orchestrated with Unix and Python scripts for analytics solutions.

Worked at conceptual/logical/physical data model level using Erwin according to requirements.

Strong background in SQL Server development, including T-SQL and stored procedure development.

Expertise in database design principles, including star schema, slowly changing dimensions (SCD types), OLAP, and OLTP concepts.

Integrated Tableau with R and Python scripts, extending analytical capabilities by incorporating advanced statistical and machine learning models into dashboards.

Implemented parameterization and dynamic filtering in Tableau, allowing end-users to interactively explore and analyze data based on specific criteria.

Utilized Tableau Prep for data cleaning and preparation tasks, ensuring data quality and facilitating a streamlined ETL process before visualization.

Skillful in T-SQL profiling for optimizing and fine-tuning database queries.

Performed and utilized necessary PL/SQL queries to analyze and validate the data.

Designed and developed T-SQL stored procedures to extract, aggregate, transform, and insert data.

Used forward engineering approach for designing and creating databases for OLAP model.

Implemented regression analysis, hypothesis testing, and other statistical methodologies.

Designed and implemented data cubes in SSAS, optimizing data structures for efficient querying and analysis, and ensuring optimal performance for end users.

Utilized MDX (Multidimensional Expressions) and DAX (Data Analysis Expressions) to create complex calculations, custom measures, and KPIs within SSAS models.

Worked with system architects to create functional code set crosswalks from source to target systems.

Wrote ETL transformation rules to assist the SQL developer.

Implemented dynamic package configurations in SSIS, allowing for flexible configuration settings based on runtime parameters and environmental variables.

Implemented indexing strategies using DDL to optimize query performance, facilitating faster data retrieval and efficient data manipulation.

Performed component integration testing to check whether the logic had been applied correctly from one system to another.

Coordinated with the offshore team on updates and required project details.

Environment: Erwin, T-SQL, OLTP, AWS, PL/SQL, OLAP, DDL, SSIS, SSAS, Teradata, Tableau, SQL, ETL, SAS, SSRS.


