Meghana
Sr. Data Engineer
**********@*****.*** 314-***-****
PROFESSIONAL SUMMARY:
Having 10+ years of experience in Information Technology, including hands-on experience with the Hadoop ecosystem (Spark, Kafka, HBase, MapReduce, Python, Scala, Pig, Impala, Sqoop, Oozie, Flume, Storm) and other big data technologies; worked on Spark SQL, Spark Streaming, and the core Spark API to build data pipelines.
Experience in building big data solutions with Lambda Architecture on the Cloudera distribution of Hadoop using Twitter Storm, Trident, MapReduce, Cascading, Hive, Pig, and Sqoop.
Strong knowledge and experience implementing big data workloads on Amazon Elastic MapReduce (Amazon EMR) for processing and managing the Hadoop framework on dynamically scalable Amazon EC2 instances, along with Lambda, SNS, SQS, AWS Glue, S3, RDS, and Redshift.
Significant expertise integrating data on AWS using platforms such as Databricks, Apache Spark, Airflow, EMR, Glue, Kafka, Kinesis, and Lambda within ecosystems such as S3, Redshift, RDS, and MongoDB/DynamoDB.
Experience developing data pipelines that use Kafka to land data in HDFS, building real-time data pipelines with Kafka Connect and Spark Streaming, and developing end-to-end processing pipelines that receive data from distributed messaging systems such as Kafka and persist it into Hive.
Experienced in development methodologies like Agile/Scrum.
Extensive knowledge and proficiency in designing, developing, documenting, and testing ETL jobs and mappings in both Server and Parallel jobs using DataStage to populate data warehouse and data mart tables.
Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing Data Mining, Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, SIOP, Validation, Visualization, and reporting solutions that scale across massive volumes of Structured and Unstructured Data.
Expertise in using ETL methods for data extraction, transformation, and integration in enterprise-wide ETL solutions, and in Data Warehouse tools for reporting and data analysis.
Practical experience working with various ETL tool environments such as SSIS and Informatica, along with reporting tool environments like SQL Server Reporting Services and Business Objects.
Good experience on all flavors of Hadoop (Cloudera, Hortonworks, and MapR) and hands-on experience with Avro and Parquet file formats, Dynamic Partitions, and Bucketing for best practices and performance improvement.
Proficient in building Server jobs utilizing stages such as Sequential File, ODBC, Hashed File, Aggregator, Transformer, Sort, and Link Partitioner.
Proficiency in Big Data Practices and Technologies like HDFS, Map Reduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Spark, Kafka.
Developed data pipelines to synchronize and aggregate data from various data sources, built the query data platform, provided guidelines for partner teams to use the aggregated data, and published standard data sets and schemas.
Experience with source control systems such as Git and Bitbucket, and with Jenkins for CI/CD deployments.
Good knowledge of RDBMS concepts (Oracle 12c/11g, MS SQL Server 2012) and strong SQL query-writing skills (using Erwin and SQL Developer tools), including Stored Procedures and Triggers, and experience writing complex SQL queries involving multiple tables with inner and outer joins.
Excellent knowledge of Unit Testing, Regression Testing, Integration Testing, User Acceptance Testing, Production Implementation, and Maintenance.
Extensive experience developing, publishing, and maintaining customized interactive reports and dashboards with customizable parameters and user filters, including the creation of tables, graphs, and listings, using Tableau.
Practical expertise with Azure cloud components (SQL DB, SQL DWH, Cosmos DB, HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, and Storage Explorer).
Extensive experience with Azure Data Factory, Azure Databricks, and importing data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, as well as controlling database access.
Self-starter with a desire to solve unstructured data challenges and the capacity to analyze and optimize big data sets.
Prior experience constructing Map Reduce programs in Apache Hadoop for analyzing large amounts of data. Practical knowledge of data modeling (dimensional and relational) principles such as Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
Ability to review technical deliverables, mentor and drive technical teams to deliver quality products.
Demonstrated ability to communicate and gather requirements, partner with Enterprise Architects, Business Users, Analysts, and development teams to deliver rapid iterations of complex solutions.
Versatile and adaptable with a strong drive to stay up with the most recent technological advancements.
TECHNICAL SKILLS:
Big Data Technologies
Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, ZooKeeper, Apache Flume, Apache Airflow, Cloudera, HBase.
Programming Languages
Python, PL/SQL, SQL, Scala, C, C#, C++, T-SQL, PowerShell Scripting, JavaScript
Cloud Services
Azure Data Lake Storage Gen 2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS.
Databases
MySQL, SQL Server, Oracle, MS Access, Teradata, and Snowflake.
NoSQL Databases
MongoDB, Cassandra, HBase.
Development Strategies
Agile, Lean Agile, Pair Programming, Waterfall and Test Driven Development.
Visualization & ETL tools
Tableau, Informatica, Talend, SSIS, and SSRS
Version Control & CI/CD tools
Jenkins, Git, and SVN
Operating Systems
Unix, Linux, Windows, Mac OS
Monitoring tools
Apache Airflow, Jenkins
PROFESSIONAL EXPERIENCE:
Client: DTCC, New York, NY Aug 2022 to Present
Role: Senior Data Engineer
Responsibilities:
Made heavy use of Azure Data Factory (ADF) to ingest data from various source systems and as an orchestration tool to integrate data from upstream to downstream systems; automated tasks in ADF using several trigger types (Event, Schedule, and Tumbling Window).
Used Cosmos DB in order-processing pipelines to source events and store catalog data.
Created user-defined functions, stored procedures, and triggers for Cosmos DB.
Determined the appropriate design architecture for the Azure environment by analyzing the data flow from the various sources to the target.
Skilled in using Azure Data Factory to carry out incremental loading from Azure SQL DB to Azure Synapse.
Extensively used SQL Server Import and Export Data tool.
Analyzed existing database, tables, and other objects to prepare to migrate to Azure Synapse.
Implemented a side-by-side migration of MS SQL Server 2016.
Involved in the daily production server checklist: SQL backups, disk space, job failures, and system checks; reviewed performance statistics for all servers using a monitoring tool, researched and resolved any issues, and checked connectivity.
Designed and developed SSIS (ETL) packages to validate, extract, transform and load data from OLTP system to the Data warehouse.
Led and contributed to the project management efforts for the implementation of data management policies and governance standards using Agile and Kanban methodologies.
Designed and implemented Tables, Functions, Stored Procedures, and Triggers in SQL Server 2016 and wrote the supporting SQL.
Took initiative and ownership to provide business solutions on time.
Created High level technical design documents and Application design documents as per the requirements and delivered clear, well-communicated and complete design documents.
Created DA specs and Mapping Data flow and provided the details to developer along with HLDs.
Created Build definition and Release definition for Continuous Integration (CI) and Continuous Deployment (CD).
Created an Application Interface Document for the downstream team to create a new interface to transfer and receive files through Azure Data Share.
Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
Conducted quarterly Data owner meetings to communicate upcoming Data Governance initiatives, processes, policies, and best practices.
Used Spark Streaming to perform RDD transformations on mini-batches of ingested data in Databricks for streaming analytics.
Built and deployed the necessary libraries on the various Databricks clusters needed for batch and continuous streaming data processing.
Demonstrated the capability to stakeholders while integrating Azure Active Directory authentication for each Cosmos DB request submitted.
Designed and built a new solution to process NRT data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues; created a linked service to land the data from an SFTP location into Azure Data Lake.
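For illustration only, a minimal Python sketch of publishing NRT events to Azure Event Hubs with the azure-eventhub SDK (v5-style API); the connection string and hub name are placeholders, not values from this engagement.

    from azure.eventhub import EventHubProducerClient, EventData

    # Placeholder connection string and hub name
    producer = EventHubProducerClient.from_connection_string(
        conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;...",
        eventhub_name="nrt-events",
    )

    # Batch a few JSON payloads and send them to the hub
    batch = producer.create_batch()
    batch.add(EventData('{"deviceId": "d1", "reading": 42}'))
    batch.add(EventData('{"deviceId": "d2", "reading": 17}'))
    producer.send_batch(batch)
    producer.close()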
Created several pipelines in Azure Data Factory v2 utilizing a variety of activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
Created several Databricks Spark jobs using PySpark to carry out table-to-table transformations.
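A representative PySpark sketch of one such table-to-table job; the database and table names are hypothetical, and in Databricks the SparkSession is already provided.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.getOrCreate()  # returns the existing session in Databricks

    # Read a source table, apply typed transformations, and write a curated table
    orders = spark.table("bronze.orders")
    curated = (orders
               .filter(col("order_status") == "COMPLETE")
               .withColumn("order_date", to_date(col("order_ts")))
               .select("order_id", "customer_id", "order_date", "order_total"))

    curated.write.mode("overwrite").saveAsTable("silver.orders_curated")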
Used complex SQL, Stored Procedures, Triggers, and packages in large databases across several servers.
Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, and resource issues; held monthly one-on-ones and weekly meetings.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, Visual Studio 2012/2016, Microsoft SQL Server 2012/2016, SSIS 2012/2016, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modeling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hubs.
Client: Centene Corporation, St Louis, Missouri Apr 2019 to Jul 2022
Role: Sr. Data Engineer
Responsibilities:
Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
Extensive experience in working with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda and Glue).
Worked on building the data pipelines (ELT/ETL Scripts), extracting the data from different sources (MySQL, AWS S3 files), transforming, and loading the data to the Data Warehouse (AWS Redshift)
Working knowledge of Spark RDD, Data Frame API, Data set API, Data Source API, Spark SQL and Spark Streaming.
Created SQL scripts for daily extracts, ad-hoc requests, reporting and analyzing large data sets from S3 using AWS Athena, Hive and Spark SQL.
Created ETL and built data pipelines using Spark SQL, PySpark, AWS Athena, and AWS Glue.
Wrote to the Glue metadata catalog, which in turn enabled querying the refined data from Athena, achieving a serverless querying environment.
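A minimal Boto3 sketch of querying Glue-cataloged data through Athena in this serverless fashion; the database, table, region, and result-bucket names are placeholders.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Submit a query against a (placeholder) Glue Data Catalog database
    submission = athena.start_query_execution(
        QueryString="SELECT event_type, COUNT(*) AS cnt FROM refined_events GROUP BY event_type",
        QueryExecutionContext={"Database": "refined_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = submission["QueryExecutionId"]

    # Poll until the query finishes, then fetch the result rows
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]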
Developed Spark Applications by using Python and Implemented Apache Spark data processing Project to handle data from various RDBMS and Streaming sources.
Worked with Spark to improve performance and optimize the existing algorithms in Hadoop.
Used SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in real time and persists it to Cassandra.
Performed API calls using Python scripting; performed reads and writes to S3 using the Boto3 library.
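A small Boto3 sketch of the kind of S3 reads and writes referenced above; the bucket and key names are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Write a small object to S3 (placeholder bucket/key)
    s3.put_object(Bucket="my-data-bucket", Key="raw/orders/2020-01-01.json", Body=b'{"order_id": 1}')

    # Read the object back and decode its contents
    response = s3.get_object(Bucket="my-data-bucket", Key="raw/orders/2020-01-01.json")
    payload = response["Body"].read().decode("utf-8")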
Developed a Kafka consumer API in Python for consuming data from Kafka topics.
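A minimal sketch of such a consumer using the kafka-python package; the broker, topic, and consumer-group names are placeholders.

    import json
    from kafka import KafkaConsumer

    # Subscribe to a (placeholder) topic and deserialize JSON payloads
    consumer = KafkaConsumer(
        "orders_topic",
        bootstrap_servers=["broker1:9092"],
        group_id="orders-etl",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        record = message.value  # already a Python dict
        # downstream processing / persistence would go here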
Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML file using Spark Streaming to capture User Interface (UI) updates.
Performed raw data ingestion into S3 from Kinesis Firehose, which triggered a Lambda function that put the refined data into another S3 bucket and wrote to an SQS queue.
Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
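An illustrative PySpark sketch of flattening nested JSON into a flat file; the field names and S3 paths are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # Read nested JSON documents and explode the nested array into rows
    raw = spark.read.json("s3://my-data-bucket/raw/orders/")
    flat = (raw
            .select(col("order_id"), explode(col("line_items")).alias("item"))
            .select("order_id", col("item.sku").alias("sku"), col("item.quantity").alias("quantity")))

    # Write the flattened result as a delimited flat file
    flat.write.mode("overwrite").option("header", "true").csv("s3://my-data-bucket/flat/orders/")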
Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
Developed custom AWS Step Functions state machines using AWS Lambda functions, allowing for greater flexibility and customization in workflow design.
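A simplified Boto3 sketch of registering such a state machine; the Lambda ARNs, role ARN, and state names are placeholders and not the actual workflow.

    import json
    import boto3

    sfn = boto3.client("stepfunctions", region_name="us-east-1")

    # Amazon States Language definition chaining two Lambda tasks (placeholder ARNs)
    definition = {
        "StartAt": "ExtractData",
        "States": {
            "ExtractData": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-data",
                "Next": "LoadData",
            },
            "LoadData": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load-data",
                "End": True,
            },
        },
    }

    sfn.create_state_machine(
        name="etl-workflow",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/states-execution-role",
    )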
Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
Environment: Python, Flask, NumPy, Pandas, SQL, MySQL, Cassandra, API, AWS EMR, Spark, AWS Kinesis, AWS Redshift, AWS EC2, AWS S3, AWS Elastic Beanstalk, AWS Lambda, AWS Data Pipeline, AWS CloudWatch, Docker, Shell scripts, Agile Methodologies.
Client: Target, Minneapolis, MN Nov 2016 to Mar 2019
Role: Big Data/Hadoop Developer
Responsibilities:
Involved in analyzing business requirements and prepared detailed specifications that follow project guidelines required for project development.
Involved in Data Ingestion using Sqoop/Flume
Worked with Flume to bring clickstream data from front-facing application logs.
Used Kafka features such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds.
Involved in writing Spark applications using Scala/Java.
Experience in using Kafka as a messaging system to implement real-time Streaming solutions using Spark Streaming
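To illustrate this pattern, a minimal PySpark Structured Streaming sketch that reads from Kafka and persists micro-batches to an HDFS-backed, Hive-readable location; the broker address, topic name, and paths are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-streaming").enableHiveSupport().getOrCreate()

    # Read a stream of events from a (placeholder) Kafka topic
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events_topic")
              .load()
              .select(col("value").cast("string").alias("payload")))

    # Persist micro-batches as Parquet under an HDFS warehouse path
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///warehouse/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .start())
    query.awaitTermination()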
Developed Sqoop scripts to migrate data from Oracle to the big data environment.
Involved in Analysis, Design, and Implementation/translation of Business User requirements.
Worked on collecting large data sets using Python scripting, PySpark, and Spark SQL.
Worked on large sets of Structured and Unstructured data.
Actively involved in designing and developing data ingestion, aggregation, and integration in the Hadoop environment.
Developed Sqoop scripts to import/export data from relational sources and handled incremental loading of customer and transaction data by date.
Extensively worked with Avro and Parquet files and converted data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
Performed data analysis and data profiling using complex SQL queries on various source systems including Oracle 10g/11g and SQL Server 2012.
Identified inconsistencies in data collected from different sources.
Participated in requirement gathering and worked closely with the architect in designing and modeling.
Designed object model, data model, tables, constraints, necessary stored procedures, functions, triggers, and packages for Oracle Database.
Automated regular AWS tasks, such as snapshot creation, using Python scripts.
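A small Boto3 sketch of this kind of snapshot automation; the region and tag filter are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Snapshot every EBS volume carrying a (placeholder) Backup=daily tag
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "tag:Backup", "Values": ["daily"]}]
    )["Volumes"]

    for volume in volumes:
        ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description=f"Nightly snapshot of {volume['VolumeId']}",
        )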
Installed and configured Apache Airflow for the AWS S3 bucket and created DAGs to run in Airflow.
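A minimal Airflow 2.x-style DAG sketch for such an S3 ingestion; the bucket name, schedule, and task logic are placeholders, not the production DAGs.

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def list_new_objects():
        # List keys in a (placeholder) landing bucket; real ingestion logic would follow
        s3 = boto3.client("s3")
        contents = s3.list_objects_v2(Bucket="my-landing-bucket").get("Contents", [])
        print(f"Found {len(contents)} objects to ingest")

    with DAG(
        dag_id="s3_ingestion",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="list_new_objects", python_callable=list_new_objects)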
Prepared scripts to automate the ingestion process using PySpark and Scala as needed through various sources such as API, AWS S3, Teradata and Redshift.
Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.
Environment: Hortonworks, Hadoop, Big Data, HDFS, MapReduce, Sqoop, AWS, Oozie, NiFi, Python, SQL Server, Oracle, HBase, Hive, Impala, Pig, Tableau, NoSQL, Unix/Linux, Spark, PySpark, Notebooks.
Client: Magnaquest Technologies Limited, Hyderabad, India Dec 2013 to Sept 2016
Role: Data Analyst
Responsibilities:
Worked with leadership teams to implement tracking and reporting of operations metrics across global programs.
Worked with large data sets, automate data extraction, built monitoring/reporting dashboards and high-value, automated Business Intelligence solutions (data warehousing and visualization)
Gathered Business Requirements, interacted with Users and SMEs to get a better understanding of the data and performed Data entry, data auditing, creating data reports & monitoring all data for accuracy.
Designed, developed, and modified various Reports.
Performed data discovery and built a stream that automatically retrieves data from a multitude of sources (SQL databases, external data such as social network data, user reviews) to generate KPIs using Tableau.
Wrote ETL scripts in Python/SQL for extraction and validating the data.
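An illustrative Python/SQL sketch of extraction plus basic validation of this kind; the connection string, table, and validation rules are hypothetical.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder SQL Server connection string
    engine = create_engine(
        "mssql+pyodbc://user:password@dbserver/sales?driver=ODBC+Driver+17+for+SQL+Server"
    )

    # Extract the data with SQL
    orders = pd.read_sql("SELECT order_id, customer_id, order_total, order_date FROM orders", engine)

    # Validate: no duplicate keys, no null customers, no negative totals
    assert orders["order_id"].is_unique, "Duplicate order_id values found"
    assert orders["customer_id"].notna().all(), "Null customer_id values found"
    assert (orders["order_total"] >= 0).all(), "Negative order totals found"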
Created data models in Python to store data from various sources.
Interpreted raw data using a variety of tools (Python, R, Excel Data Analysis ToolPak), algorithms, and statistical/econometric models (including regression techniques, decision trees, etc.) to capture the bigger picture of the business.
Created and presented Dashboards to provide analytical insights into data to the client.
Translated requirement changes, analyzing, providing data driven insights into their impact on existing database structure as well as existing user data.
Worked primarily on SQL Server, creating Stored Procedures, Functions, Triggers, Indexes, and Views using T-SQL.
Worked on Amazon Web Services (AWS) to integrate EMR, Spark 2, S3 storage, and Snowflake.
Implemented installation and configuration of a multi-node cluster on the cloud using Amazon Web Services (AWS) EC2.
Handled AWS management tools such as CloudWatch and CloudTrail.
Stored the log files in AWS S3. Used versioning in S3 buckets where highly sensitive information is stored.
Environment: SQL Server, ETL, SSIS, SSRS, Tableau, Excel, R, AWS, Python, Django