
Shivani

Sr. Data Engineer

Email Id: ad2w30@r.postjobfree.com

Contact Details: 732-***-****

Professional Summary:

9+ years of IT experience as a Senior Data Engineer, with expertise in the design, development, maintenance, and support of Big Data applications, and in data pipeline design, development, and implementation as a Data Engineer/Data Developer and Data Modeler.

Successfully implemented various Big Data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, data quality, and data virtualization solutions. Well-versed in AWS cloud services such as EC2, S3, Glue, Athena, DynamoDB, and Redshift.

Skills extend to designing and implementing data engineering pipelines and data analysis using AWS services like EMR, Glue, EC2, Lambda, Elastic Beanstalk, Athena, and Redshift, along with Sqoop and Hive.

Proficient in programming with Python, Scala, Java, and SQL; implemented Spark on EMR for processing enterprise data across the data lake in AWS.

Designed and executed end-to-end data pipelines for extracting, cleansing, processing, and analyzing large volumes of behavioral and log data, with a strong focus on data analytics in AWS Cloud.

Hands-on experience includes migrating on-premise ETLs to the Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer, with a focus on IT data analytics projects.

Experience handling heterogeneous data sources, including Oracle and IBM DB2 databases and XML files, using SSIS.

Hands-on experience building in-house ETL pipelines, visualizations, and analytics-based data quality solutions using AWS, Azure Databricks, and other open-source frameworks.

Contributed to infrastructure automation by writing Terraform scripts for AWS services, including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPC, subnets, security groups, and S3 buckets, and converted existing AWS infrastructure to AWS Lambda deployed via Terraform and AWS CloudFormation.

Hands-on experience extends to various GCP services, including BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, GSUTIL, Data Proc, and Operations Suite (Stackdriver).

Possess in-depth knowledge of Spark Streaming, Spark SQL, and other components of Spark, along with extensive experience in Scala, Java, Python, SQL, T-SQL, and R programming.

Developed and deployed enterprise-based applications using Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.

Expert in designing complex reports, including drill-through reports, parameterized reports with cascading parameters, report models, and ad hoc reports, using SQL Server Reporting Services (SSRS) based on client requirements.

Expertise in managing changes in data structures and business requirements, efficiently adapting Informatica workflows to accommodate evolving data needs while minimizing disruptions.

Highly skilled in Big Data technologies such as Hadoop, PySpark, Hive, Pig and Spark, enabling efficient data storage, retrieval, and analysis on large-scale datasets.

Hands-on experience across the Hadoop ecosystem, including extensive experience in Big Data technologies such as HDFS, MapReduce, NoSQL, Spark, Python, and Scala.

Proficiency in Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities in Scala.

Integrated various data sources like Oracle SE2, SQL Server, Flat Files, and Unstructured files into data warehousing systems.

Worked on SQL Server 2014 upgrade and migration projects, upgrading DTS packages to SSIS by rewriting the complete functionality.

Have extensive experience in developing MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering, and aggregation, and I possess detailed knowledge of the MapReduce framework.

Well-versed in using IDEs such as Eclipse, IntelliJ IDE, PyCharm IDE, Notepad++, and Visual Studio for development.

Have a strong background in Machine Learning algorithms and Predictive Modeling, including Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering, with a deep understanding of data architecture.

Skilled in developing automated ETL pipelines within Snowflake, streamlining data workflows and enhancing overall efficiency in data processing.

Skilled in implementing security protocols on Snowflake, including role-based access controls, encryption, and auditing, ensuring compliance with industry standards and regulations.

Hands-on experience in database migration, upgrading, and version management in MySQL environments.

Experience working with NoSQL databases like Cassandra and HBase, enabling real-time read/write access to large datasets via HBase.

Developed Spark Applications capable of handling data from various RDBMS (MySQL, Oracle Database) and Streaming sources.

Well-versed in using GitHub/Git 2.12 for source and version control, with a strong foundation in core Java concepts, including Object-Oriented Design (OOD) and Java components like the Collections Framework, exception handling, and I/O.

Technical Skills:

Languages: Java, Scala, Python, SQL, and C/C++

Big Data Ecosystem: Hadoop, MapReduce, Kafka, Spark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend

Hadoop Distribution: Cloudera Enterprise, Databricks, Hortonworks, EMC Pivotal

Databases: Oracle, SQL Server, PostgreSQL

Web Technologies: HTML, CSS, JSON, JavaScript, Ajax

Streaming Tools: Kafka, RabbitMQ

Cloud: GCP, AWS, Azure, AWS EMR, Glue, RDS, Kinesis, DynamoDB, Redshift Cluster

GCP Cloud: BigQuery, Cloud Dataproc, GCS Bucket, Cloud Functions, Apache Beam, Cloud Shell, GSUTIL, BQ Command Line, Cloud Dataflow

AWS Cloud: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon SageMaker, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon SQS, AWS Identity and Access Management (IAM), Amazon CloudWatch, Amazon EBS, and AWS CloudFormation

Azure Cloud: Azure Data Lake, Data Factory, Azure SQL Database, Azure Databricks

Operating Systems: Linux Red Hat/Ubuntu/CentOS, Windows 10/8.1/7/XP

Testing: Hadoop Testing, Hive Testing, MRUnit

Application Servers: Apache Tomcat, JBoss, WebSphere

Tools and Technologies: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo

IDEs: IntelliJ, Eclipse, NetBeans

Professional Experience:

Client: Ryder Logistics and Transportation January 2022- Present

Role: Sr. AWS Data Engineer

Responsibilities:

Developed disaster recovery plans using AWS Backup for critical data stored in Amazon S3 and RDS, and utilized AWS Reserved Instances to optimize costs for Amazon EMR and RDS services, achieving significant savings.

Utilized AWS Glue and Amazon Elastic MapReduce (EMR) for Extract, Transform, Load (ETL) processes, optimizing data transformation and preparation.
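
For illustration, a minimal sketch of a Glue PySpark ETL job of this kind; the catalog database, table name, and S3 path are hypothetical placeholders, not the actual project configuration:

```python
# Minimal sketch of a Glue PySpark ETL job (names/paths are hypothetical).
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Rename/cast columns before writing to the curated zone.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_ts", "string", "order_ts", "timestamp"),
        ("amount", "double", "amount", "double"),
    ],
)

# Write the transformed data back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```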

Reviewed and updated security configurations in Amazon CloudWatch, CloudTrail, and IAM to align with industry standards and compliance requirements.

Designed ETL Process using Informatica to load data from Flat Files, and Excel Files to target Oracle Data Warehouse database.

Implemented robust error handling mechanisms in Informatica workflows, promptly identifying and resolving data issues to maintain the reliability of the data processing pipeline.

Worked on advanced analytics using Python libraries like Boto3, Pandas and NumPy for data manipulation and modification. Created and optimized complex SQL queries to extract and analyze data from relational databases.

Developed Spark applications in Python on a distributed environment to load massive numbers of CSV files with differing schemas into Hive tables.
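
A minimal PySpark sketch of this pattern; the feed names, landing paths, and target schema are illustrative assumptions:

```python
# Minimal PySpark sketch: load CSV feeds with differing schemas into Hive tables.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv_to_hive_loader")
    .enableHiveSupport()
    .getOrCreate()
)

# Each feed gets its own Hive table, so schemas can differ per feed.
feeds = {
    "orders": "s3://example-bucket/landing/orders/",      # hypothetical path
    "customers": "s3://example-bucket/landing/customers/",  # hypothetical path
}

for table, path in feeds.items():
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(path))
    # Overwrite the managed table on each run; switch to append for deltas.
    df.write.mode("overwrite").saveAsTable(f"staging.{table}")
```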

Created custom T-SQL procedures to read data from flat files and load it into a SQL Server database using the SQL Server Import and Export Data Wizard.

Designed data models and schemas suitable for NoSQL databases to support scalable and flexible data storage.

Automated routine database tasks, including backups, updates, and monitoring, to ensure the reliability and availability of PostgreSQL databases.

Improved the performance of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Installed and configured HDFS, Pig, Hive, Hadoop, and MapReduce in a Snowflake environment for real-time data loading from various sources using Kafka.

Implemented a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and Zookeeper-based log collection platform.

Utilized Scala to create data models and algorithms for optimizing data storage, retrieval, and processing, resulting in more efficient and cost-effective data solutions.

Implemented CI/CD pipelines for Scala applications, automating the testing, building, and deployment processes to accelerate development cycles.

Maintained comprehensive documentation for Snowflake configurations and processes, and provided training to team members, ensuring knowledge continuity and efficient use of the platform.

Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.

Integrated diverse data sources into Tableau, including databases, spreadsheets, and cloud-based platforms, to create comprehensive and unified analyses.

Managed version control for Tableau workbooks and data sources, facilitating collaboration among team members and providing a clear audit trail for changes.

Created different visualizations in reports using visuals such as bar charts, pie charts, line charts, cards, and slicers, and applied transformations in the Power Query Editor to clean up the data in Power BI.

Developed algorithms within Power BI to detect and highlight data anomalies, supporting data quality assurance and ensuring the reliability of analytical results.

Environment: AWS (S3, RDS, Glue, EMR, CloudWatch, CloudTrail, IAM), ETL, Informatica, Python, Boto3, Pandas, NumPy, SQL, Spark, Hive, T-SQL, Hadoop, Spark SQL, DataFrames, YARN, HDFS, Pig, Snowflake, Kafka, Tableau, CI/CD, Zookeeper, Scala, Power BI, Cassandra, NoSQL, PostgreSQL.

Client: Charter Communications, Negaunee, MI. July 2018- December 2021

Role: Sr. GCP Data Engineer

Responsibilities:

Developed Spark applications using Python and executed an Apache Spark data processing project to manage data from diverse RDBMS and streaming sources.

Assumed responsibility for constructing scalable distributed data solutions with Hadoop. Constructed data pipelines in Google Cloud Platform (GCP) using Airflow for ETL tasks, employing various Airflow operators.

Utilized Apache Airflow in a GCP Composer environment to construct data pipelines, making use of various operators such as the Bash operator, Hadoop operators, and Python callable and branching operators.


Created a NiFi dataflow to ingest data from Kafka, perform data transformations, and store the results in HDFS, and exposed a port for running Spark Streaming jobs.

Proficiently managed the Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.

Worked on SQL Server 2014 upgrade and migration projects and upgraded DTS packages to SSIS by rewriting the complete functionality.

Collaborated with Spark to enhance the performance and optimize existing algorithms within Hadoop.

Utilized the GCP Cloud Shell SDK to configure services like Dataproc, Storage, and BigQuery. Employed the GCP environment to utilize Cloud Functions for event-based triggering, Cloud Monitoring, and alerting.

Leveraged GCP Cloud Functions in Python to load data into BigQuery for incoming CSV files in a GCS bucket.
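
A minimal sketch of such a Cloud Function, assuming a hypothetical target table and the standard GCS object-finalize event payload used by background Cloud Functions:

```python
# Sketch of a GCS-triggered Cloud Function that loads a new CSV into BigQuery.
from google.cloud import bigquery

TABLE_ID = "my-project.staging.incoming_events"  # hypothetical target table

def load_csv_to_bq(event, context):
    """Triggered by a google.storage.object.finalize event."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait so failures surface in the function logs
```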

Developed Python scripts to do file validations in Databricks and used ADF to automate the process.

Worked with Spark's RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.

Employed Spark Streaming APIs for real-time transformations and actions in order to construct common solutions.

Developed a Python-based Kafka consumer API for data ingestion from Kafka topics. Ingested Extensible Markup Language (XML) messages via Kafka and processed XML files using Spark Streaming to capture User Interface (UI) updates.
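
A minimal consumer sketch using the kafka-python client; the topic name, broker addresses, and consumer group are illustrative assumptions:

```python
# Minimal kafka-python consumer sketch (topic/brokers/group are examples).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ui-updates",                      # example topic carrying XML/JSON messages
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    group_id="ingestion-service",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    value_deserializer=lambda raw: raw.decode("utf-8"),
)

for message in consumer:
    payload = message.value
    # Hand the payload off to the downstream pipeline here (e.g. write to a
    # staging location or publish to Spark Streaming).
    print(message.topic, message.partition, message.offset, len(payload))
    consumer.commit()                  # commit only after successful handling
```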

Designed a pre-processing job utilizing Spark DataFrames to flatten JSON documents into flat files.

Imported/exported data between different sources such as Oracle, Access, and Excel using the SSIS/DTS utilities.

Loaded DStream data into Spark RDDs and conducted in-memory data computations to generate output responses.

Devised a GCP Cloud Composer Directed Acyclic Graph (DAG) for loading data from on-premises CSV files into GCP BigQuery tables, scheduling it in incremental mode.
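
A minimal Composer/Airflow DAG sketch of this kind of incremental load, assuming the apache-airflow-providers-google operators and hypothetical bucket, file, and table names:

```python
# Sketch of a Composer DAG that appends daily CSV extracts to BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="onprem_csv_to_bq_incremental",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    upload = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="/data/exports/orders_{{ ds }}.csv",          # hypothetical export path
        dst="landing/orders/orders_{{ ds }}.csv",
        bucket="example-landing-bucket",
    )

    load = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="example-landing-bucket",
        source_objects=["landing/orders/orders_{{ ds }}.csv"],
        destination_project_dataset_table="my-project.staging.orders",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",  # incremental: append each day's slice
        autodetect=True,
    )

    upload >> load
```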

Configured Snowpipe to retrieve data from Google Cloud buckets and load it into Snowflake tables.
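
A hedged sketch of the Snowpipe setup, assuming a pre-created GCS external stage and notification integration with hypothetical names:

```python
# Sketch: creating a Snowpipe that auto-ingests CSVs landed in a GCS bucket.
# Stage, integration, table, and connection values are hypothetical placeholders.
import snowflake.connector

CREATE_PIPE_SQL = """
CREATE OR REPLACE PIPE staging.orders_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = 'GCS_NOTIFICATION_INT'
AS
  COPY INTO staging.orders
  FROM @staging.gcs_orders_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
"""

conn = snowflake.connector.connect(
    account="example_account",
    user="loader_svc",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
)
try:
    conn.cursor().execute(CREATE_PIPE_SQL)
finally:
    conn.close()
```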

Possessed a solid understanding of Cassandra architecture, replication strategies, gossip, and snitches.

Employed Hive Query Language (HQL) for analyzing partitioned and bucketed data, executing Hive queries on Parquet tables.

Utilized Hive for data analysis to meet specific business logic requirements.

Utilized Apache Kafka to aggregate web log data from multiple servers and make it accessible to downstream systems for data analysis and engineering tasks.

Implemented Kafka security measures and improved its performance.

Developed Oozie coordinators to schedule Hive scripts for creating data pipelines.

Assisted in cluster setup and testing of HDFS, Hive, Pig, and MapReduce to enable access for new users.

Involved in writing C#.NET code for SSIS 2005/2008 packages.

Environment: Spark, Spark Streaming, Spark SQL, GCP, Dataproc, BigQuery, GCS, GKE, Pub/Sub, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau.

Client: Thomson Reuters, Eagan, MN. June 2016- June 2018

Role: AWS Data Engineer

Responsibilities:

Crafting and deploying AWS solutions involving EC2, S3, EBS, Elastic Load Balancer (ELB), and auto-scaling groups.

Establishing and constructing AWS infrastructure for a variety of resources, including VPC, EC2, S3, IAM, EBS, Security Groups, Auto Scaling, and RDS, using Cloud Formation JSON templates.

Creating AWS Cloud Formation templates tailored to generate VPCs, subnets, and NAT gateways of custom sizes to ensure the successful deployment of web applications and database templates.

Developing stored procedures in MS SQL to retrieve data from different servers using FTP and processing these files to update tables.

Conducting data analysis and profiling of source data to gain a better understanding of the data sources.

Worked with Azure Data Storage, Azure Data Factory, Azure services, Azure SQL Server, Azure data warehouse, MySQL, ETL, Kafka, Power BI, SQL Database, T-SQL, U-SQL, GitHub, Azure Data Lake, Azure Databricks, and SSIS.

Downloading BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities. Executing data transformation and cleansing using SQL queries, Python, and PySpark. Crafting Hive SQL scripts to generate intricate tables with performance features such as partitioning, clustering, and skewing.
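
A short sketch of pulling BigQuery results into pandas for downstream transformation; the project, dataset, and query are illustrative assumptions:

```python
# Sketch: pull a BigQuery result set into a pandas DataFrame for further ETL.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project
query = """
    SELECT order_id, order_ts, amount
    FROM `my-project.staging.orders`
    WHERE order_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
"""
# .to_dataframe() requires pandas (plus pyarrow/db-dtypes) on the client side.
df = client.query(query).to_dataframe()
print(df.describe())
```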

Extracted data from multiple applications and data sources through database integration using SSIS and linked-server ODBC connections.

Constructing ETL pipelines using Spark and Hive to ingest data from multiple sources. Taking charge of ETL processes and data validation using SQL Server Integration Services.

Writing Python scripts to automate the identification of trends, outliers, and data anomalies and loading data from Web APIs to staging databases.
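
A minimal sketch of such a script, assuming a hypothetical API endpoint and column names, a simple z-score outlier rule, and a PostgreSQL staging table:

```python
# Sketch: flag outliers in an API extract, then stage the rows in a database.
import pandas as pd
import requests
from sqlalchemy import create_engine

# Pull the latest metrics from a hypothetical Web API.
response = requests.get("https://api.example.com/v1/daily-metrics", timeout=30)
response.raise_for_status()
df = pd.DataFrame(response.json())

# Mark rows whose value deviates more than 3 standard deviations from the mean.
z_scores = (df["value"] - df["value"].mean()) / df["value"].std()
df["is_outlier"] = z_scores.abs() > 3

# Load the annotated extract into a staging table (connection string is a placeholder).
engine = create_engine("postgresql+psycopg2://etl_user:***@staging-db/staging")
df.to_sql("daily_metrics_stage", engine, if_exists="append", index=False)
```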

Reverse-engineering existing data models to accommodate new changes using Erwin. Generating artifacts for the data engineering team, including source-to-target mappings, data quality rules, data transformation rules, and joins.

Performing data visualization for different modules using Tableau and the ONE Click method.

Developing, deploying, and overseeing event-driven and scheduled AWS Lambda functions triggered by events on various AWS sources, including logging.
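
A minimal sketch of an event-driven Lambda handler of this kind, reading the standard S3 event notification payload and logging each object; the function and return shape are illustrative:

```python
# Sketch of an S3-triggered Lambda handler that logs incoming object events.
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        logger.info("Object created: s3://%s/%s", bucket, key)
    return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}
```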

Creating dashboards in Tableau with ODBC connections from various sources like BigQuery and the Presto SQL engine.

Worked extensively on error handling and logging in SSIS and T-SQL scripts at the package and task levels.

Presenting analyses of automation test results during daily Agile stand-up meetings. Deploying code using blue/green deployments with AWS CodeDeploy to minimize application deployment downtime.

Leading a team to successfully customize and deploy the Oracle SQL Path Finder tool for querying and joining data from different databases to deliver an application.

Collaborating with the Data Science team to implement advanced analytical models in a Hadoop Cluster over large datasets.

Environment: ETL, SSIS, Erwin, SSMS, SSDT, Excel, Python, DAX, Power BI, Agile, TFS, AWS, EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake.

Client: Careator Technologies Pvt Ltd, Hyderabad, India. July 2013-February 2016

Role: Data Engineer

Responsibilities:

Examined the legacy database's stored procedures to understand their purpose and converted them into T-SQL scripts.

Generated ETL Log Reports and ETL Audit Reports using SQL Server Reporting Services (SSRS) to provide project managers with insight into ETL package run statuses, including ETL names, data flow types, start and finish times, status, and error messages.

Identified various data sources and fields, creating a comprehensive data mapping document for the ETL process.

Imported and exported data from SQL Server databases, Excel spreadsheets, and flat files by developing SSIS packages and configuring connection managers.

Designed SSIS packages to handle incremental data loads and manage slowly changing dimensions through the use of lookup and derived column transformations.

Constructed intricate SQL queries involving inner joins, left joins, and case-when statements on tables and views to retrieve data for reporting purposes.

Developed and validated machine learning models, including Ridge and Lasso regression, to predict potential loan default amounts.
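
A minimal scikit-learn sketch of fitting and comparing Ridge and Lasso models; the input file, feature columns, and target are hypothetical:

```python
# Sketch: fit and compare Ridge and Lasso regressions with scikit-learn.
import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score, train_test_split

data = pd.read_csv("loans.csv")                       # hypothetical input file
X = data[["loan_amount", "income", "credit_score"]]   # hypothetical features
y = data["default_amount"]                            # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    model.fit(X_train, y_train)
    print(type(model).__name__, cv_r2.mean(), model.score(X_test, y_test))
```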

Crafted worksheets and data visualization dashboards enriched with parameters and filters for enhanced user interaction.

Created interactive dashboards with features like highlighting, actions, and custom filters to facilitate data exploration by users.

Designed dual-axis line charts showcasing sales and profit trends over specific periods. Published Tableau reports and dashboards on Tableau Server, empowering business users to explore and analyze data.

Environment: MS SQL Server 2014, SQL Server Management Studio, Visual Studio, SQL Server Integration Service, Tableau (Desktop/Server), MS Excel.


