Nithin B.
Sr. Data Engineer
*********.*@*******.*** +1-940-***-**** LinkedIn
PROFESSIONAL SUMMARY
• 6+ years of experience as a Data Engineer, with extensive work designing, developing, and implementing data models for enterprise-level applications and BI solutions.
• Experience in designing and building the Data Management Lifecycle, covering data ingestion, data integration, data consumption, data delivery, and integration with reporting, analytics, and system-to-system interfaces.
• Proficient in Big Data environments, with hands-on experience using Hadoop ecosystem components for large-scale processing of structured and semi-structured data.
• Strong experience with all phases including Requirement Analysis, Design, Coding, Testing, Support, and Documentation.
• Extensive experience with Azure cloud technologies like Azure Data Lake Storage, Azure Data Factory, Azure SQL, Azure Data Warehouse, Azure Synapse Analytics, Azure Analysis Services, Azure HDInsight, and Databricks.
• Solid knowledge of AWS services like AWS EMR, Redshift, S3, and EC2 and related concepts, including configuring servers for auto-scaling and elastic load balancing.
• Experience with monitoring the web services using Hadoop and Spark for controlling the applications and analyzing their operation and performance.
• Experienced in Python data manipulation for loading and extraction as well as with Python libraries such as NumPy, Pandas, and SciPy for data analysis and numerical computations.
• Good knowledge and experience with NoSQL databases like HBase, Cassandra, and MongoDB and SQL databases like Teradata, Oracle, PostgreSQL, and SQL Server.
• Experience in the development and design of various scalable systems using Hadoop technologies in various environments and analyzing data using MapReduce, Hive, and PIG.
• Hands-on use of Spark and Scala to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
• Strong knowledge in working with ETL methods for data extraction, transformation, and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.
• Hands-on experience in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools like HDFS, Spark, Sqoop, Hive, Flume, Kafka, Impala, PySpark, Oozie, and HBase.
• Leveraged Databricks features such as notebooks, clusters, and libraries to create scalable and maintainable data pipelines.
• Experience with different ETL tool environments like SSIS, Informatica, and reporting tool environments like SQL Server Reporting Services, and Business Objects.
• Experience in deployment of applications and scripting using the Unix/Linux Shell scripting.
• Solid knowledge of Data Marts, Operational Data Store, OLAP, Dimensional Data Modeling with Star Schema Modeling, Snowflake Modeling for Dimensions Tables using Analysis Services.
• Extensive experience with various databases like Teradata, MongoDB, Cassandra DB, MySQL, Oracle, and SQL Server.
• Experience in creating Teradata SQL scripts using OLAP functions such as RANK and RANK() OVER to improve query performance when pulling data from large tables.
• Strong Experience in working with Databases like Teradata and proficiency in writing complex SQL, PL/SQL for creating tables, views, indexes, stored procedures, and functions.
• Knowledge and experience with Continuous Integration and Continuous Deployment using tools like Docker (containerization) and Jenkins.
• Excellent working experience in Agile/Scrum development and Waterfall project execution methodologies.
SKILLS
Big Data Technologies: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase, Azure Stream Analytics.
Programming Languages: Python, Java, Matplotlib, Seaborn, SQL, Shell Scripting, FastAPI, RESTful APIs.
Cloud Services:
Azure Cloud: Azure Data Lake Storage Gen 2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, Synapse.
AWS Cloud: EC2, S3, Glacier, AWS RDS, Amazon SQS, AWS EMR, Lambda, AWS SNS, CloudWatch, Kinesis, CloudFront, QuickSight, AWS Glue, Athena.
ETL & Visualization Tools: Tableau, Informatica, Talend, SSIS, SSRS, Grafana, Power BI.
Databases: Microsoft SQL, Oracle, MS Access, PostgreSQL, Azure SQL Database, Amazon DynamoDB, MongoDB, Azure Cosmos DB, Teradata, Cassandra DB, HBase.
Data Warehousing Tools: AWS Redshift, Snowflake, Azure Synapse Analytics.
Version Control & Orchestration: Jenkins, GitHub, GitLab, SVN, Azure Repos, Azure DevOps.
Documentation: Confluence, SharePoint.
CERTIFICATIONS
Azure Certified Data Engineer – Associate
PROFESSIONAL EXPERIENCE
Client: TRUIST, Charlotte NC (Dec 2022 – Present)
Role: Sr. Data Engineer
• Worked with business/user groups to gather requirements and to create and develop pipelines.
• Migrated applications from Cassandra DB to Azure Data Lake Storage Gen 1 using Azure Data Factory, created tables, and loaded and analyzed data in the Azure cloud.
• Developed the process to ingest data from a web service into the Azure cloud and load it into Azure SQL DB.
• Developed Spark applications in Python for a distributed environment, using PySpark to load high-volume files with different schemas into PySpark DataFrames and process them for reloading into Azure SQL DB tables.
• Collaborated with data scientists and analysts to understand data requirements and transform them into efficient Databricks workflows.
• Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet, and XML.
• Worked on creating ETL packages using SSIS to extract data from various data sources like Access database, Excel spreadsheet, and flat files, and maintain the data using SQL Server.
• Implemented authentication, error handling, and data pagination strategies when interacting with external APIs, and integrated the extracted API data seamlessly into existing data pipelines and storage solutions.
• Used Informatica for creating, executing, and monitoring sessions and workflows.
• Automated data ingestion into the Lakehouse, transformed the data with Apache Spark, and stored it in Delta Lake.
• Ensured data quality and integrity of the data using Azure SQL Database and automated ETL deployment and operationalization.
• Used Databricks, Scala, and Spark for creating the data workflows and capturing the data from Delta tables in Delta Lakes.
• Performed streaming of pipelines using Azure Event Hubs and Stream Analytics to analyze the data from the data-driven workflows.
• Worked with Delta Lakes for consistent unification of Streaming, processed the data, and worked on ACID transactions using Apache Spark.
• Worked with Azure Blob Storage and developed a framework for handling large volumes of data and system files.
• Implemented a distributed stream processing platform with low latency and seamless integration with data and analytics services inside and outside Azure to build a complete big data pipeline.
• Used NiFi to load data into HDFS as ORC files.
• Wrote TDCH scripts and used Apache NiFi to load data from Mainframe DB2 to the Hadoop cluster.
• Designed and implemented data pipelines that move and transform data between Azure Synapse and other data sources using appropriate tools and technologies.
• Worked with PowerShell scripting for maintaining and configuring the data. Automated and validated the data using Apache Airflow.
• Created orchestrations within Azure Data Factory to manage complex data workflows and dependencies.
• Optimized Hive queries by applying best practices and the right parameters, using Hadoop, YARN, Python, and PySpark.
• Used Sqoop to extract the data from Teradata into HDFS and export the patterns analyzed back to Teradata.
• Used Accumulators and Broadcast variables to tune the Spark applications and to monitor the created analytics and jobs.
• Tracked Hadoop cluster job performance, performed capacity planning, and tuned Hadoop performance for high availability and cluster recovery.
• Worked with Tableau for generating reports and created Tableau dashboards, pie charts, and heat maps according to the business requirements.
• Worked with all phases of the Software Development Life Cycle and used agile methodology for development.
Environment:
Apache Spark, Apache Hadoop, Databricks, Delta Lake, Azure Data Lake Storage Gen 1, Azure Blob Storage, Apache NiFi, Azure Event Hubs, Azure Stream Analytics, Python, Scala, PowerShell, TDCH scripts (Teradata), Microsoft Azure (Azure Data Factory, Azure SQL Database, Azure Synapse), AWS Cloud, Informatica, Apache Airflow, Tableau, Teradata, Cassandra, Hadoop Distributed File System (HDFS), Git.
Client: GAF, Parsippany NJ. (Oct 2021 – Oct 2022)
Role: Data Engineer
• Designed and executed real-time data streaming solutions utilizing Apache Kafka and Azure Stream Analytics to capture and process customer behavior data in real-time, significantly enhancing data flow efficiency.
• Leveraged Apache Airflow to automate ETL tasks, ensuring efficient and scheduled data processing for real-time customer analytics, resulting in streamlined operations.
• Performed ETL on data from different source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, U-SQL, and Azure Data Lake Analytics, ingesting data into one or more Azure services (Azure Storage, Azure SQL, Azure DW).
• Implemented data processing and analytics workflows on Azure Databricks for real-time data insights and advanced analytics capabilities.
• Integrated data from various sources, including PostgreSQL (a relational database) and Cassandra (a NoSQL database), for sourcing and storing both structured and unstructured customer data.
• Proficiently handled complex SQL queries to extract, aggregate, and analyze data stored in Data Warehouse using tools like SQL Server Management Studio.
• Developed data engineering solutions using Python and relevant libraries like Seaborn and TensorFlow to transform and process data effectively.
• Collaborated with dashboard development teams, using Power BI, to create interactive data visualizations and communicate real-time analytics results effectively.
• Deployed data solutions on Azure cloud services, including Azure Data Lake Storage and Azure SQL Data Warehouse, utilizing Azure Portal for cloud management.
• Implemented data privacy measures and data engineering solutions, utilizing Azure Data Lake Storage, to secure sensitive information and comply with regulatory requirements.
• Established CI/CD pipelines with Jenkins to automate code integration, testing, and deployment of data solutions, improving development efficiency.
• Set up monitoring using Azure Monitor to track data pipeline and infrastructure performance, proactively identifying and addressing issues. Enforced robust security measures using Azure Active Directory for identity and access management, ensuring data privacy and compliance.
• Utilized Kusto Query Language (KQL) for querying and analyzing data in Azure Log Analytics, extracting valuable insights for monitoring and troubleshooting data pipelines and infrastructure.
• Utilized GitHub and Azure Repos for version control, enabling effective codebase management, collaboration with team members, and ensuring code quality and consistency.
• Worked on importing and exporting data from Teradata into Azure Cloud and Hive using Sqoop, Info Works, and Azure Data Factory (ADF) for analysis, visualization, and generating reports.
• Maintained comprehensive documentation of data engineering processes and workflows using Confluence, facilitating knowledge transfer and efficient maintenance.
• Worked within an agile development environment, following the Scrum methodology, and used JIRA for sprint planning, daily stand-ups, and retrospectives, enhancing collaboration and adaptability within the data engineering team.
Environment:
Apache Kafka, Apache Airflow, PostgreSQL, ETL, Azure Databricks, SQL, Python, Seaborn, TensorFlow, Power BI, Azure Data Lake Storage, Azure SQL Data Warehouse, Azure Portal, CI/CD, Jenkins, Azure DevOps, Azure Monitor, Azure Active Directory, Kusto Query Language (KQL), Azure Log Analytics, Azure Repos, Confluence, Agile, Scrum, JIRA.
Client: ICICI Bank, India (Feb 2020 – Jul 2021)
Role: Data Engineer
• Designed and executed Extract, Transform, Load (ETL) processes to efficiently ingest and process vast volumes of financial data using Hadoop.
• Worked with the Hadoop ecosystem, including components like Hive, HBase, and more.
• Stored and managed vast volumes of transaction data, customer data, and other relevant information by using HDFS.
• Implemented AWS services including Amazon S3 and Amazon EMR to enhance fraud detection capabilities, enabling efficient data management and real-time analytics in the financial industry.
• Used PIG for data preparation and ETL (Extract, Transform, Load) processes, cleaning and transforming data before analysis.
• Worked with SQL databases like MySQL and NoSQL databases like MongoDB as needed for data integration, then optimized complex HQL (Hive Query Language) and Spark SQL queries for data analysis and reporting.
• Implemented Hive and Hive Query Language (HQL) for data warehousing, enabling structured data analysis and reporting.
• Utilized Apache Kafka for real-time data ingestion, ensuring prompt processing and analysis of financial transactions.
• Leveraged Tableau to create interactive data visualizations and reports, facilitating data-driven decision-making and insights sharing.
• Established Continuous Integration and Continuous Deployment (CI/CD) pipelines for the big data analytics system, automating code integration, testing, and deployment processes.
• Implemented real-time monitoring using tools like Apache Spark and Kafka to detect and respond to fraudulent activities promptly.
• Maintained comprehensive project documentation using Confluence, ensuring knowledge transfer, and supporting effective system maintenance and enhancements.
• Implemented Agile practices, including Kanban, to promote collaboration and adaptability within the data engineering team, resulting in improved alignment of projects with business requirements and the more efficient delivery of data solutions.
Environment:
Hadoop, HDFS, AWS, Amazon S3, PIG, MySQL, MongoDB, HQL (Hive Query Language), Spark SQL, Apache Kafka, Tableau, CI/CD, Apache Spark, Confluence, Agile, Kanban.
Client: Biocon, India (Jul 2017 – Dec 2019)
Role: Hadoop Developer
• Developed Hive and Bash scripts for source data validation and transformation. Automated data loading into HDFS and Hive for pre-processing the data using One Automation.
• Gathered data from Data warehouses in Teradata and Snowflake.
• Developed Spark/Scala and Python code for a regular expression project in the Hadoop/Hive environment.
• Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
• Built Big Data applications using Cassandra and Hadoop.
• Utilized Sqoop, ETL, and HadoopFileSystem APIs for implementing data ingestion pipelines.
• Worked on batch data of different granularity, ranging from hourly and daily to weekly and monthly.
• Performed Hadoop administration and support activities, installing and configuring Apache Big Data tools and Hadoop clusters using Cloudera Manager.
• Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows.
• Assisted in upgrading, configuration, and maintenance of various Hadoop infrastructures like Ambari, PIG, and Hive.
• Developed and wrote SQL and stored procedures in Teradata. Loaded data into Snowflake and wrote SnowSQL scripts.
• Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet, and ORC.
• Worked extensively on Teradata, Hadoop-Hive, Spark, SQL, PL/SQL, and SnowSQL.
• Worked with SQL, T-SQL, and PL/SQL scripts, views, indexes, stored procedures, and other components of database applications.
• Worked with Hadoop on the Hortonworks Data Platform and ran services through Cloudera Manager.
Environment:
Apache Ambari, AWS Glue, Hadoop HDFS, Apache HBase, Apache Hive, AWS, Tableau, SQL, Python, AWS CloudWatch, Jenkins, Git, ELK Stack.
EDUCATION
Master of Science: Computer Science, University of Dayton, OH.
Bachelor of Technology: Computer Science, Mahaveer Institute of Science and Technology, Hyderabad.