
Data Engineering SQL Server

Location:
Centreville, VA
Posted:
November 21, 2023


Resume:

Raja Srinivas

Sr. Data Engineer

+1-213-***-**** ad1cie@r.postjobfree.com LinkedIn: http://linkedin.com/in/raja-srinivas-1809b627b

Professional Summary:

Over ** years of experience as a highly motivated IT professional specializing in Data Engineering, with expertise in designing data-intensive applications spanning the Hadoop ecosystem, Big Data analytics, data warehousing, data marts, cloud data engineering, data visualization, reporting, and data quality solutions.

In-depth knowledge of Hadoop architecture and its components, including YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce paradigm.

Extensive experience in developing enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and Yarn.

Proficient in enhancing performance and optimizing existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and Pair RDDs.

Experienced in designing ETL data flows, creating mappings/workflows, and performing data migration and transformation using SQL Server SSIS, extracting data from SQL Server, Oracle, Access, and Excel Sheets.

Skilled in working with databases, both SQL and NoSQL, including Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, and Teradata.

Proficient in troubleshooting database issues, managing connections, and creating Java applications for MongoDB and HBase.

Expertise in designing and creating RDBMS components such as tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.

Worked on solutions that integrate seamlessly with several AWS native services, enabling a robust, scalable, and flexible data engineering and analytics environment.

Proficient in fact/dimension modeling, including Star Schema, Snowflake Schema, transactional modeling, and Slowly Changing Dimension (SCD) implementation.

Extensive expertise in SSIS, SSRS, Power BI, Tableau, Talend, Informatica, T-SQL, and reporting/analytics.

Expert in leveraging native AWS services within Databricks to derive actionable insights from data, implement advanced analytics, and build intelligent applications, all within a secure, scalable, and collaborative environment.

Experienced in CI/CD pipeline instantiation, creation, and maintenance, utilizing automation tools such as GIT, Terraform, and Ansible.

Possesses hands-on experience with a diverse range of cloud technologies across leading platforms, including AWS (e.g., EC2, S3, RDS, Lambda, EMR), Google Cloud Platform (e.g., BigQuery, Cloud Dataproc, Google Cloud Storage), and Azure (e.g., Blob Storage, Data Lake, Azure SQL, Azure Databricks), with expertise in migrating on-premises ETLs and managing data warehousing and analytics solutions.

Worked on various applications using Python, Java, Django, C++, XML, CSS3, HTML5, DHTML, JavaScript, and jQuery.

Familiar with JSON-based RESTful web services and XML-based SOAP web services, and with Python IDEs and editors such as PyCharm and Sublime Text.

Experienced in utilizing various Python packages for data analysis and machine learning, including ggplot2, NLP, Reshape2, Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, SQLAlchemy, PyQt, and pytest.

Proficient in executing the entire SDLC, encompassing system requirements gathering, architecture design, coding, development, testing, and maintenance, using Agile and Waterfall methodologies.

Technology Stack:

Programming Languages

Python, R, SQL, Java, .NET, PySpark, Scala, Spark.

Python Libraries

Requests, ReportLab, NumPy, SciPy, PyTables, cv2, imageio, Python-Twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3.

Web Frameworks and Architectures:

Frameworks: Django, Flask, Pyramid; IDEs/Editors: PyCharm, Sublime Text

Architectures: MVW, MVC, WAMP, LAMP.

DBMS

Oracle, PostgreSQL, Teradata, IBM DB2, MySQL, PL/SQL, MongoDB, Cassandra, DynamoDB, HBase.

Web Services

REST, SOAP, Microservices.

Big Data Ecosystem Tools

Cloudera distribution, Hortonworks Ambari, HDFS, MapReduce, YARN, Pig, Sqoop, HBase, Hive, Flume, Cassandra, Apache Spark, Oozie, Zookeeper, Hadoop, Scala, Impala, Kafka, Airflow, dbt, NiFi.

Reporting Tools

Power BI, SSIS, SSAS, SSRS, Tableau.

Containerization / Orchestration Tools

Kubernetes, Docker, Docker Registry, Docker Hub, Docker Swarm.

Cloud Technologies

AWS: Amazon EC2, S3, RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, Step Functions, EMR

GCP: BigQuery, GCS Bucket, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc

Azure: Azure Web Apps, App Services, Azure Storage, Azure SQL Database, Virtual Machines, Azure Search, Notification Hub

Data Modeling Techniques

Relational data modeling, ER/Studio, Erwin, Sybase PowerDesigner, Star Join Schema, Snowflake modeling, Fact and Dimension tables.

Streaming Frameworks

Kinesis, Kafka, Flume.

Version Control and CI/CD Tools

Concurrent Versions System (CVS), Subversion (SVN), Git, GitHub, Mercurial, Bitbucket, Docker, Kubernetes.

Professional Experience:

Client: Samsung Electronics (Dallas, TX)

Role: Senior Data Engineer Aug 2021 - Present

Responsibilities:

Assisted in mapping the existing data ingestion platform to the AWS cloud stack as part of an enterprise cloud initiative.

Managed AWS Management Tools like CloudWatch and CloudTrail, storing log files in AWS S3 with versioning for sensitive information.

Automated regular AWS tasks, such as snapshot creation using Python scripts.
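
A minimal sketch of the kind of snapshot-automation script described above, using Boto3; the tag filter, region, and function name are illustrative assumptions, not the production script.

    import boto3

    # Sketch: create EBS snapshots for volumes carrying an assumed backup tag.
    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

    def snapshot_tagged_volumes(tag_key="Backup", tag_value="daily"):
        """Create a snapshot for every volume carrying the given tag."""
        volumes = ec2.describe_volumes(
            Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
        )["Volumes"]
        for vol in volumes:
            snap = ec2.create_snapshot(
                VolumeId=vol["VolumeId"],
                Description=f"Automated snapshot of {vol['VolumeId']}",
            )
            print("Created", snap["SnapshotId"])

    if __name__ == "__main__":
        snapshot_tagged_volumes()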

Utilized AWS Redshift, S3, and Athena services to query large amounts of data stored on S3, creating a Virtual Data Lake without the need for ETL processes.

Installed and configured Apache Airflow for AWS S3 buckets and created DAGs to run Airflow.
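
A minimal Airflow DAG sketch in the spirit of the bullet above; the bucket name, schedule, and task are illustrative assumptions.

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def list_new_files(**context):
        # List objects that landed in the (assumed) ingest bucket for this run.
        s3 = boto3.client("s3")
        contents = s3.list_objects_v2(Bucket="example-ingest-bucket").get("Contents", [])
        print(f"{len(contents)} objects found")

    with DAG(
        dag_id="s3_ingest_example",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="list_new_files", python_callable=list_new_files)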

Prepared scripts for automating the ingestion process using PySpark and Scala from various sources like API, AWS S3, Teradata, and Redshift.

Integrated AWS DynamoDB with AWS Lambda for storing item values and backing up DynamoDB Streams.
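
A sketch of a Lambda handler wired to a DynamoDB stream, as described above; the backup table name and key handling are hypothetical.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    backup_table = dynamodb.Table("items_backup")  # hypothetical backup table

    def handler(event, context):
        # Each record carries the DynamoDB stream image of the changed item.
        processed = 0
        for record in event.get("Records", []):
            if record.get("eventName") in ("INSERT", "MODIFY"):
                keys = record["dynamodb"]["Keys"]
                new_image = record["dynamodb"].get("NewImage", {})
                backup_table.put_item(Item={"pk": str(keys), "payload": str(new_image)})
                processed += 1
        return {"processed": processed}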

Deployed SNS, SQS, Lambda functions, IAM roles, custom policies, EMR with Spark and Hadoop setup, and bootstrap scripts using Terraform for QA and production environments.

Managed AWS Hadoop clusters and services using Hortonworks Ambari.

Set up Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics to capture and process streaming data, storing and analyzing the output in S3, DynamoDB, and Redshift.

Developed several MapReduce jobs using PySpark and NumPy, incorporating Jenkins for continuous integration.

Developed schemas that allowed for flexibility and scalability, enabling easy integration of new data sources into the Snowflake data warehouse.

Proficient in designing and implementing serverless ETL workflows using AWS Glue's orchestration features. Created and scheduled workflows to automate data processing and pipeline execution.
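
A sketch of the kind of Glue ETL job those workflows would schedule; the catalog database, table, and output path are illustrative assumptions.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the (assumed) Glue Data Catalog table and land curated Parquet on S3.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="events")
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://example-curated/events/"},
        format="parquet",
    )
    job.commit()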

Implemented custom MapReduce programs in Java API for data processing.

Extracted data from various sources such as CSV files, Excel, HTML pages, and SQL.

Debugged complex SSIS packages, SQL objects, and SQL job workflows.

Implemented Spark jobs for Change Data Capture (CDC) on PostgreSQL tables, updating target tables using JDBC properties.
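
A minimal PySpark JDBC sketch of that CDC pattern; the connection details, table names, and watermark column are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cdc_example").getOrCreate()

    jdbc_url = "jdbc:postgresql://db-host:5432/appdb"  # assumed endpoint
    props = {"user": "etl_user", "password": "***", "driver": "org.postgresql.Driver"}

    # Pull rows changed since yesterday using an assumed updated_at watermark column.
    changes = (spark.read.jdbc(jdbc_url, "public.orders", properties=props)
                    .filter("updated_at >= date_sub(current_date(), 1)"))

    # Apply the changes to the target; merge/upsert strategies vary by warehouse.
    changes.write.jdbc(jdbc_url, "analytics.orders_latest", mode="append", properties=props)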

Designed and implemented Scala programs using Spark Data Frames and RDDs for data transformations and actions.

Implemented seamless integration between AWS Glue and Amazon S3 for data lake storage, as well as with Amazon Redshift or other data warehouses for efficient data loading and querying.

Developed PySpark scripts utilizing SQL and RDD in Spark for data analysis and storing the results back into S3.

Integrated Kafka Publisher in Spark jobs to capture errors and push them into Postgres tables.

Built NiFi data pipelines in a Docker container environment during the development phase.

Managed log files from various sources by moving them to HDFS for further processing through Elasticsearch, Kafka, Flume, and Talend.

Performed Sqoop-based file transfers through HBase tables to process data in several NoSQL databases such as Cassandra and MongoDB.

Integrated Apache Flink with various data sources and sinks, including Apache Kafka, Apache Pulsar, Amazon Kinesis, and databases, to ingest and deliver real-time data seamlessly.

Experienced in utilizing AWS Glue Data Catalog to catalog and organize metadata for various data sources. Managed schemas, tables, and partitions to enable efficient data discovery and querying.

Utilized Pig to communicate with Hive using HCatalog and HBase using Handlers.

Hands-on experience with confidential Big Data product offerings such as InfoSphere BigInsights and InfoSphere Streams.

Proficient in monitoring AWS Glue job performance using AWS CloudWatch and troubleshooting issues using logs and debugging techniques to ensure smooth ETL operations.

Implemented real-time data processing and streaming analytics using Databricks Streaming, making it possible to analyze and respond to consumer electronics device data in real time.

Utilized Step Functions in conjunction with AWS Glue for seamless data cataloging and ETL processes, automating Glue jobs and cataloging tasks within Step Functions workflows.

Extended Apache Iceberg to support real-time data streaming by integrating it with streaming platforms such as Apache Kafka, enabling immediate processing of sensor and device data.

Developed user-friendly data visualizations and dashboards using complex datasets from different sources using Tableau.

Proficient in using QuickSight's custom expressions to create calculated fields, perform complex data manipulations, and add custom logic to visualizations.

Used Apache Beam SQL to express data transformations and aggregations in SQL-like queries, simplifying the development of complex data processing logic.

Skilled in connecting Amazon QuickSight to various data sources, including Amazon S3, Amazon Redshift, and other supported databases, to extract and visualize data seamlessly.

Utilized Databricks integration with machine learning libraries like MLlib, scikit-learn, and TensorFlow to build advanced analytics and predictive models for product usage patterns and customer behavior analysis.

Client: Homesite Insurance (Boston, Massachusetts)

Role: Data Engineer Oct 2018 - Jul 2021

Responsibilities:

Designed and built data pipelines using ELT/ETL tools and technologies, specifically leveraging ADF (Azure Data Factory) and Synapse.

Performed ELT optimization, designing, coding, and tuning big data processes using Microsoft Azure Synapse or similar technologies.

Integrated NoSQL databases with Hadoop and Apache Spark clusters, facilitating complex analytics and machine learning on insurance data.

Implemented secure authentication and authorization mechanisms using FastAPI's built-in OAuth2 support, enhancing API security and user access control.
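
A minimal sketch of FastAPI's built-in OAuth2 bearer flow, as referenced above; the token check is a stub and the endpoint is hypothetical.

    from fastapi import Depends, FastAPI, HTTPException, status
    from fastapi.security import OAuth2PasswordBearer

    app = FastAPI()
    oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

    def get_current_user(token: str = Depends(oauth2_scheme)) -> str:
        # Stand-in for real token validation (e.g. decoding a JWT against the user store).
        if token != "valid-demo-token":
            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid token")
        return "demo-user"

    @app.get("/claims")
    def read_claims(user: str = Depends(get_current_user)):
        return {"user": user, "claims": []}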

Led data migration projects from traditional SQL databases to NoSQL systems, ensuring data integrity and minimal downtime.

Utilized Azure Event Grid for managing event services across various applications and Azure services.

Migrated on-premises data from Oracle, SQL Server, DB2, and MongoDB to Azure Data Lake Storage using Azure Data Factory.

Developed data pipelines in Azure Data Factory (ADF) utilizing Linked Services, Datasets, and Pipelines to efficiently extract, transform, and load data from various sources including Azure SQL, Blob storage, Azure SQL Data Warehouse, write-back tool, and reverse operations.

Developed Python scripts for data validations and automation processes using ADF and implemented alerts using ADF and Azure Monitor for data observability functions.

Used Azure Resource Manager (ARM) to deploy, update, or delete all the resources for a solution in a single, coordinated operation.

Created resources using Azure Terraform modules and automated infrastructure management in Azure IaaS environments.

Used Apache Flink's CEP (Complex Event Processing) library to detect complex patterns and conditions in data streams, enabling applications like fraud detection, anomaly detection, and predictive maintenance.

Set up Azure infrastructure monitoring through Datadog and application performance monitoring through AppDynamics.

Designed, implemented, and managed virtual networking within Azure and connected it to on-premises environments, configuring ExpressRoute, Virtual Network, VPN Gateways, DNS, and Load Balancer.

Integrated Azure Boards with other Azure DevOps services, such as Azure Repos and Azure Pipelines, for seamless end-to-end project management.

Constructed NiFi flows for data ingestion, ingesting data from Kafka, microservices, and CSV files from edge nodes.

Deployed dbt projects to dbt Cloud and configured the necessary connections and environments for data transformation and analytics workflows.

Implemented task monitoring and logging using Celery's built-in monitoring tools and integration with logging frameworks like Logstash and Kibana.

Utilized Databricks notebooks and Spark framework, deployed serverless services like Azure Functions, and configured HTTP triggers with Application Insights for system monitoring.
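
A minimal HTTP-triggered Azure Function sketch of the kind mentioned above (Python v1 programming model); the payload is a placeholder, and the log output flows to Application Insights when it is enabled on the Function App.

    import json
    import logging

    import azure.functions as func

    def main(req: func.HttpRequest) -> func.HttpResponse:
        logging.info("Pipeline status request received")  # surfaced in Application Insights
        pipeline = req.params.get("pipeline", "unknown")
        body = {"pipeline": pipeline, "status": "ok"}      # placeholder payload
        return func.HttpResponse(json.dumps(body), mimetype="application/json", status_code=200)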

Implemented fine-grained access control and security measures using Apache Iceberg to protect sensitive consumer data, ensuring compliance with data privacy regulations.

Extensively worked on running Spark jobs in the Azure HDInsight environment and used Spark as a data processing framework.

Built automated ETL pipelines using SnowSQL, ensuring seamless extraction, transformation, and loading of insurance data from diverse sources into Snowflake.

Implemented robust error handling mechanisms in PL/SQL procedures, ensuring data integrity and reliability during ETL operations.

Designed end-to-end ETL pipelines using SnowSQL and Snowflake tasks, automating data extraction, transformation, and loading processes.

Proficient in extracting data from different types of sources such as databases, flat files, APIs, and web services using Pentaho's connectors and transformation steps.

Implemented stateful processing in Apache Beam pipelines using timers and keyed state, allowing for complex computations and aggregations over data streams.

Improved the performance and optimization of Hadoop algorithms using Spark SQL, Pair RDDs, YARN, and Spark Context.

Designed a custom Spark REPL application and used Hadoop scripts for HDFS data loading and manipulation.

Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.

Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
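
A Structured Streaming sketch of that Kafka consumer; the broker, topic, and the sink inside foreachBatch are illustrative assumptions (a real HBase write would go through an HBase connector).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_stream_example").getOrCreate()

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
              .option("subscribe", "events")                     # assumed topic
              .load()
              .select(col("key").cast("string"), col("value").cast("string")))

    def write_batch(batch_df, batch_id):
        # Stand-in sink; replace with an HBase connector write in the real pipeline.
        batch_df.write.mode("append").parquet("/tmp/events_out")

    query = stream.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()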

Implemented automation using Databricks jobs and workflows to schedule and orchestrate data processing tasks, ensuring data pipelines run smoothly.

Created complex SQL queries in SnowSQL to support various analytical requirements, optimizing them for performance to handle large datasets efficiently.

Employed Apache Flink's built-in state management to maintain and update application state across event time, making it possible to perform complex computations on streaming data.

Involved in building Data Models and Dimensional Modeling using 3NF, Star, and Snowflake schemas for OLAP and Operational Data Store (ODS) applications.

Maintained metadata repositories in DataStage to document data lineage, data definitions, and job dependencies, aiding in data governance and compliance efforts.

Optimized query performance by leveraging Apache Iceberg's features like predicate pushdown and statistics collection, allowing for faster data retrieval.

Collaborated with RabbitMQ in cluster mode, serving as a reliable message queue within the OpenStack environment.

Parameterized Pentaho jobs and transformations, allowing for dynamic configuration and flexibility in handling different data scenarios.

Created SSIS packages to implement error/failure handling with event handlers, row redirects, and logging.

Containerized FastAPI applications using Docker for seamless deployment and consistency across development, testing, and production environments.

Created data analytics reports using Power BI, utilizing DAX expressions, Power Query, Power Pivot, Power BI Desktop, as well as SQL Server Reporting Services (SSRS).

Client: Goldman Sachs Group, Inc. (New York, NY)

Role: Data Engineer July 2015 - Sep 2018

Responsibilities:

Developed dimensional data models, Star Schema models, and Snowflake data models using Erwin.

Worked with Spark and MPP architectures, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.

Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.

Configured DataStage to support real-time data integration by utilizing CDC, messaging systems, and event-driven triggers.

Tuned SQL queries and PL/SQL procedures, significantly enhancing database performance and reducing query execution times.

Designed and developed Pig Latin scripts and Pig command line transformations for data joins and custom processing of MapReduce outputs.

Imported data using Sqoop and SFTP from various sources like RDBMS, Teradata, Mainframes, Oracle, and Netezza to HDFS, and performed transformations on it using Hive, Pig, and Spark.

Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables.

Used Python & SAS to extract, transform & load source data from transaction systems, generated reports, insights, and key conclusions.

Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources, including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.

Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.

Created data models for customer data using the Cassandra Query Language.

Worked extensively with Google Cloud Functions and Python to load data into BigQuery for CSV files on arrival in GCS buckets.
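
A sketch of such a Cloud Function, triggered by a GCS object-finalize event; the dataset and table names are illustrative assumptions.

    from google.cloud import bigquery

    def load_csv_to_bq(event, context):
        """Background Cloud Function: load an arriving CSV from GCS into BigQuery."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        # "analytics.raw_events" is a hypothetical dataset.table in the default project.
        client.load_table_from_uri(uri, "analytics.raw_events", job_config=job_config).result()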

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.

Worked in GCP with BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, and Dataproc.

Worked on Google Cloud Platform (GCP) services like Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.

Stored data files in Google Cloud Storage buckets on a daily basis, using Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.

Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.

Leveraged cloud and GPU computing technologies on GCP for automated machine learning and analytics pipelines.

Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.

Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, using operators such as BashOperator, Hadoop operators, Python callables, and branching operators.
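
A Composer-style DAG sketch combining those operator types; the schedule, task logic, and branching rule are illustrative assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import BranchPythonOperator, PythonOperator

    def choose_path(**context):
        # Assumed rule: run a full load on the first day of the month, otherwise incremental.
        return "full_load" if context["ds"].endswith("-01") else "incremental_load"

    with DAG("gcp_etl_example", start_date=datetime(2023, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        branch = BranchPythonOperator(task_id="choose_path", python_callable=choose_path)
        full_load = BashOperator(task_id="full_load", bash_command="echo full load")
        incremental_load = PythonOperator(task_id="incremental_load",
                                          python_callable=lambda: print("incremental load"))
        branch >> [full_load, incremental_load]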

Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.

Made use of the GCP Stackdriver and Cloud Monitoring tools to proactively monitor and fix problems with data pipelines and storage solutions.

Client: BP America Inc. (Houston, Texas)

Role: Python Developer May 2013 - Jun 2015

Responsibilities:

Worked on development of data ingestion pipelines using the ETL tool Talend and Bash scripting with big data technologies.

Consumed JSON messages using Kafka and processed the JSON file using Spark Streaming to capture UI updates.

Loaded data into HDFS from different data sources like Oracle and DB2 using Sqoop and loaded it into Hive tables.

Implemented MVC architecture and Java EE frameworks like Struts2, Spring MVC, and Hibernate.

Worked with various Python IDEs such as PyCharm, along with PySpark and the Bottle framework, and deployed applications to AWS.

Developed web applications in the Django framework using the model-view-controller (MVC) architecture.

Analyzed the SQL scripts and designed the solution using PySpark for faster performance.

Performed integration of JWT authentication into FastAPI applications, ensuring robust and stateless authentication for API endpoints.

Developed automated Python scripts utilizing the Boto3 library for AWS security auditing and reporting across multiple AWS accounts, leveraging AWS Lambda.
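
A sketch of one such Boto3 audit check (flagging old IAM access keys) runnable inside Lambda; the 90-day threshold is an assumed policy, and pagination is omitted for brevity.

    from datetime import datetime, timezone

    import boto3

    MAX_KEY_AGE_DAYS = 90  # assumed policy threshold

    def find_stale_access_keys():
        iam = boto3.client("iam")
        stale = []
        for user in iam.list_users()["Users"]:
            keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
            for key in keys:
                age = (datetime.now(timezone.utc) - key["CreateDate"]).days
                if key["Status"] == "Active" and age > MAX_KEY_AGE_DAYS:
                    stale.append({"user": user["UserName"], "key": key["AccessKeyId"], "age_days": age})
        return stale

    def lambda_handler(event, context):
        return {"stale_keys": find_stale_access_keys()}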

Employed the Django framework to build web applications, implementing the Model-View-Controller (MVC) architecture.

Worked with Python NumPy, SciPy, Pandas, Matplotlib, Stats packages to perform dataset manipulation, data mapping, data cleansing and feature engineering. Built and analyzed datasets using R and Python.

Implemented a Python-based distributed random forest via PySpark and MLlib.
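
A sketch of a distributed random forest with Spark ML (the MLlib DataFrame API); the input path, feature columns, and label column are illustrative assumptions.

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rf_example").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/training/")  # assumed training set

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
    model = rf.fit(train)

    accuracy = model.transform(test).filter("label = prediction").count() / test.count()
    print("Holdout accuracy:", accuracy)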

Developed data ingestion modules using AWS Step Functions, AWS Glue and Python modules.

Worked on CloudFormation Templates (CFT) in YAML and JSON format to build AWS services with the Infrastructure as Code paradigm.

Utilized Shell and Python scripting with the AWS CLI and Boto3.

Configured AWS IAM and Security Group in Public and Private Subnets in VPC.

Involved heavily in setting up the CI/CD pipeline using Jenkins, Maven, Nexus, GitHub, Chef, and AWS.

Used AWS Elastic Beanstalk for deploying and scaling web applications and services developed with Java, Node.js, Python, and Ruby on familiar servers such as Apache and IIS.

Used Jenkins AWS Code Deploy plugin to deploy and Chef for unattended bootstrapping in AWS.

Implemented CloudTrail in order to capture the events related to API calls made to AWS infrastructure.

Education: Bachelor of Technology in Computer Science, Malla Reddy Engineering College, Hyderabad, India

Certifications: AWS Certified Solutions Architect



