Vamshi Krishna
Mobile number: +1-469-***-****
Email- *************@*****.***
LinkedIn- https://www.linkedin.com/in/vamshikrishnaaaa/
SUMMARY
9.5 years of experience as a Big Data Engineer with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and Spark RDDs; worked extensively with PySpark and Scala.
Extensive experience in core Hadoop ecosystem development for enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Impala, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
Experienced with the most commonly used Airflow operators in Python.
Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create an SQL layer on HBase.
Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server and perform data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on various applications using Python IDEs such as Sublime Text and PyCharm.
EDUCATION AND CERTIFICATION:
Bachelor's in Computer Science from JNTU Hyderabad, 2012
Master's in Computer Science from George Mason University, VA, 2015
AWS Certified Solutions Architect - Associate
TECHNICAL SKILLS:
Programming: Python, R, Scala, Java, COBOL, C
Scripting Languages: UNIX Shell script, Python, R
IDEs: IntelliJ, Eclipse, Visual Studio, PyCharm, Jupyter Notebook
ETL/Data Warehouse Tools: SSIS, DataStage, Informatica PowerCenter, SnapLogic, Microsoft Fabric, GA4
BI Tools: Power BI, Tableau, Python visualization libraries, Fabric, GA4
Databases: MySQL, SQL Server, Oracle 10g/11g, PostgreSQL 9.3
Big Data Ecosystem: Spark (Scala/Python), Hive, Hadoop, PySpark, Trino
Data Modeling Methodologies: Object-Relational Modeling, ER Modeling, Dimensional Modeling
Cloud Technologies: Azure, Databricks, AWS, GCP
Development Methodologies: Agile, Waterfall Model
PROFESSIONAL EXPERIENCE
OPTUM (Remote) Sept 2023 – Current
Role: Data Engineer
Responsibilities:
•Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
•Responsible for maintaining quality reference data in the source by performing operations such as cleaning, transformation, and ensuring integrity in a relational environment, working closely with stakeholders and the solution architect.
•Designed and developed Security Framework to provide fine-grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
•Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
•Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3.
•Built data pipelines using core Hadoop ecosystem tools such as HDFS, Hive, Impala, Sqoop, and Oozie for batch processing, and built pipelines for real-time and near real-time data using Spark, Spark SQL, Spark Streaming, and Kafka.
•Used Python to predict the quantity a user might want to order for a specific item so it can be suggested automatically, using Kinesis Firehose and an S3 data lake.
•Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
•Used Spark SQL with the Scala and Python interfaces, which automatically convert RDDs of case classes into schema RDDs (DataFrames).
•Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
•Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 costs (a minimal sketch follows this section).
•Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.
•Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
•Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate cleanup.
•Integrated AWS QuickSight with various data sources, including AWS Redshift, S3, and RDS, to deliver comprehensive analytics solutions that support cross-functional teams.
•Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
•Devised a Python package to automate raw flat file collecting, summary report generation, and SQL database upload.
•Responsible for creating CloudFormation templates for services such as SNS, SQS, Elasticsearch, DynamoDB, Lambda, EC2, VPC, RDS, S3, IAM, and CloudWatch. Implemented and integrated these services with Service Catalog.
•Utilized Windows batch scripts, Unix scripts, SQL, and PL-SQL for automation, unit testing, and customized business needs.
•Used Hadoop technologies such as Spark and Hive, including the PySpark library, to create Spark DataFrames.
•Designed SSIS solutions for both real-time and batch data processing scenarios within data lakes, supporting near real-time analytics and operational reporting needs while managing large volumes of data effectively.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau
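Illustrative sketch of the AMI-cleanup Lambda noted above: a minimal Boto3 handler, assuming self-owned AMIs are scanned per region and deregistered when not tagged as in use. The region list, tag key, and retention rule below are hypothetical placeholders, not the production logic.

import boto3

REGIONS = ["us-east-1", "us-west-2"]   # hypothetical application regions
KEEP_TAG = "InUse"                     # hypothetical tag marking AMIs to keep

def lambda_handler(event, context):
    """Deregister self-owned AMIs that are not tagged as in use."""
    removed = []
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        for image in ec2.describe_images(Owners=["self"])["Images"]:
            tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
            if tags.get(KEEP_TAG) != "true":
                ec2.deregister_image(ImageId=image["ImageId"])
                removed.append(image["ImageId"])
    return {"deregistered": removed}

In practice the retention rule would typically also check AMI age and launch-template references before deregistering.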
HUMANA (Remote) Sept 2021 – Aug 2023
Data Engineer
Responsibilities:
•Played a pivotal role in analysing over 8,000 COVID claims data, providing essential insights for provider reimbursements and supporting critical decision-making processes.
•Developed and maintained complex SQL queries, views, functions, and reports on Snowflake to meet customer needs.
•Architected and built scalable distributed data solutions using the Hadoop ecosystem.
•Utilized Scala to develop and optimize ETL processes, ensuring efficient data integration and processing within the healthcare domain. This contributed to improving data accuracy and streamlining analytics, supporting the organization's data-driven decision-making.
•Worked on Azure Data Factory to integrate both on-prem (MySQL, Cassandra) and cloud (Blob storage, Azure SQL DB) data and applied transformations to load it back into Azure Synapse.
•Managed, Configured, and scheduled resources across the cluster using Azure Kubernetes Service.
•Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
•Developed dashboards and visualizations to help business users analyze data as well as providing data insights to upper management with a focus on Microsoft products like SQL Server Reporting Services (SSRS) and Power BI.
•Performed the migration of large datasets to Databricks (Spark), created and administered clusters, loaded data, configured data pipelines, loading data from ADLS Gen2 to Databricks using ADF pipelines.
•Created various pipelines to load data from Azure Data Lake into Staging SQL DB and followed by to Azure SQL DB.
•Created Databricks notebooks to streamline and curate data for various business use cases and mounted Blob storage on Databricks (a minimal sketch follows this section).
•Experienced with Azure Kubernetes Service in producing production-grade Kubernetes clusters that allow enterprises to reliably deploy and run containerized workloads across environments.
•Utilized Azure Logic Apps to build workflows to schedule and automate batch jobs by integrating apps, ADF pipelines, and other services like HTTP requests, email triggers, etc.
•Ingested data in mini-batches and performed RDD transformations on those mini-batches of data by using Spark Streaming to perform streaming analytics in Databricks.
Environment: Azure SQL DW, Databricks, Azure Synapse, Cosmos DB, ADF, SSRS, Power BI, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Apache Spark.
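A minimal sketch of the Databricks Blob storage mount and curation step referenced above, assuming an Azure storage account key held in a Databricks secret scope; the storage account, container, secret scope, column name, and paths are hypothetical placeholders (dbutils and spark are provided by the Databricks notebook runtime).

# Mount an Azure Blob container onto DBFS (names are placeholders).
storage_account = "examplestore"
container = "raw"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="demo-scope", key="storage-key")
    },
)

# Curate: read raw CSV, drop duplicates and rows missing a key, write Delta for downstream use.
raw_df = spark.read.option("header", "true").csv("/mnt/raw/claims/")
curated_df = raw_df.dropDuplicates().na.drop(subset=["claim_id"])
curated_df.write.format("delta").mode("overwrite").save("/mnt/curated/claims")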
Client: CVS Health AZ (Remote) March 2020 – Aug 2021
Role: Senior Data Engineer
Summary: As a Data Engineer at CVS, my role was to design, develop, and maintain robust data pipelines and infrastructure to support the organization's data-driven initiatives. Responsibilities included developing data models, designing and managing data warehouses, implementing ETL processes, ensuring data quality, and enforcing data governance policies.
Responsibilities:
Developed robust data pipelines on AWS utilizing tools such as Dataflow and Apache Beam to ensure efficient data ingestion, processing, and storage.
Implemented scalable and fault-tolerant solutions for real-time and batch processing using AWS services
Designed and optimized data models on BigQuery to support complex analytical queries and enable data-driven decision-making across the organization.
Automated ETL processes using AWS services and scripting languages like Python, reducing manual intervention and increasing efficiency.
Automated ETL workflows and scheduled data loads in Matillion to ensure timely and accurate data updates.
Integrated data from multiple sources, including APIs, databases, and flat files, into Snowflake using Matillion.
Collaborated with cross-functional teams to understand data requirements and deliver end-to-end data solutions aligned with business objectives.
Conducted performance tuning and optimization of data pipelines to improve processing speed and reduce costs, resulting in enhanced overall system performance.
Implemented security best practices and data governance policies to ensure compliance with regulatory standards and protect sensitive data on AWS.
Provided training and support to end-users on accessing and utilizing data in AWS and Snowflake.
Utilized AWS monitoring and logging capabilities to proactively identify and troubleshoot issues within data pipelines, ensuring high availability and reliability.
Mentored junior team members on AWS best practices, data engineering techniques, and coding standards to foster skill development and knowledge sharing within the team.
Actively participated in continuous learning and kept abreast of the latest developments in AWS and data engineering technologies to drive innovation and maintain a competitive edge.
Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into the RDBMS through Sqoop (a minimal sketch follows this section).
Implemented Apache Nifi for efficient data orchestration, automating the flow of data between systems and ensuring a streamlined data processing pipeline.
Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
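A minimal sketch of the Spark UDF and Spark SQL aggregation pattern referenced above, assuming a hypothetical claims dataset; the write-back step is shown with Spark's JDBC writer rather than Sqoop, and the bucket, table, column names, and connection details are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("claims-aggregation").getOrCreate()

# Hypothetical UDF: bucket claim amounts into bands for reporting.
def amount_band(amount):
    if amount is None:
        return "unknown"
    return "high" if amount >= 1000 else "low"

band_udf = F.udf(amount_band, StringType())
spark.udf.register("amount_band", amount_band, StringType())  # also usable from Spark SQL

claims = spark.read.parquet("s3://example-bucket/claims/")  # hypothetical path

# Aggregation with the DataFrame DSL.
by_band = claims.withColumn("band", band_udf("amount")).groupBy("band").count()

# Equivalent aggregation with a Spark SQL query.
claims.createOrReplaceTempView("claims")
by_band_sql = spark.sql(
    "SELECT amount_band(amount) AS band, COUNT(*) AS cnt FROM claims GROUP BY amount_band(amount)")

# Write results back to an RDBMS over JDBC (requires the JDBC driver on the classpath).
by_band.write.mode("overwrite").jdbc(
    url="jdbc:postgresql://host:5432/analytics",
    table="claim_band_counts",
    properties={"user": "etl", "password": "***", "driver": "org.postgresql.Driver"})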
State of DC, Dept of Veterans Oct 2018 – Dec 2019
Data Engineer
Responsibilities:
Developed and executed validation procedures for data query and update forms, ensuring data accuracy and integrity, which resulted in a 25% reduction in data entry errors.
Responsible for building scalable distributed data solutions using Hadoop
Collaborated with business stakeholders, Business Analysts, and product owners to gather requirements and design scalable, distributed data solutions within the Hadoop ecosystem.
Created Spark Streaming applications to handle near real-time data from Kafka, performing both stateless and stateful transformations
Designed and managed HIVE data warehouse infrastructure, including table creation, data partitioning, bucketing, and optimizing HQL queries.
Developed automated processes to divide large data files into smaller batches, enhancing FTP transfer efficiency and reducing execution time by 60%.
Engineered ETL workflows using Data Stage Open Studio, loading data from diverse sources into HDFS with the help of Flume and Sqoop, and applied structural transformations using MapReduce and Hive.
Scripted Spark programs and developed UDFs with Spark DSL and Spark SQL for data aggregation and querying, and utilized Sqoop for transferring data back to RDBMS.
Implemented numerous MapReduce tasks using Java API, Pig, and Hive to extract, transform, and aggregate data across various formats such as Parquet, Avro, XML, JSON, CSV, ORCFILE, and other compressed formats.
Demonstrated a deep understanding of Hive concepts such as Partitioning and Bucketing, and designed both Managed and External tables to enhance performance.
Created custom PIG UDFs to manipulate data according to business needs and developed specialized PIG Loaders.
Built ETL pipelines for data warehousing, integrating Python and Snowflake's SnowSQL, and executed SQL queries against Snowflake
Gained experience in generating reports using SQL Server Reporting Services (SSRS), including various report types such as drill-down, parameterized, cascading, conditional, table, matrix, chart, and sub-reports.
Leveraged the DataStax Spark Connector to store data in the Cassandra database or retrieve data from it.
Developed and configured Oozie scripts to manage and schedule Hadoop jobs using the Apache Oozie workflow engine.
Used AWS Glue dynamic frames with PySpark to transform data, organized the transformed data with Crawlers, and scheduled jobs using workflow features (a minimal sketch follows this section).
Involved in cluster management activities, including the installation, commissioning, and decommissioning of data nodes, name node recovery, capacity planning, and configuring slots.
Created data pipeline programs using Spark Scala APIs, performed data aggregations with Hive, and formatted data in JSON for visualization and reporting purposes.
Environment: AWS, Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, Flume, Apache Oozie, Zookeeper, ETL, UDF, MapReduce, Snowflake, Apache Pig, Python, Java, SSRS
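A minimal sketch of the AWS Glue dynamic-frame transformation referenced above; the Glue catalog database, table, field mappings, and S3 path are hypothetical placeholders.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a catalogued source table into a DynamicFrame (database/table names are placeholders).
source = glue_context.create_dynamic_frame.from_catalog(
    database="claims_db", table_name="raw_claims")

# Rename and cast fields on the way through.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("claim_id", "string", "claim_id", "string"),
              ("amount", "double", "claim_amount", "double")])

# Write the transformed data to S3 as Parquet (bucket/prefix are placeholders).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/claims/"},
    format="parquet")

job.commit()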
Credit Karma (Intuit) Charlotte NC May 2016 – Sept 2018
Data Engineer
Responsible for designing, developing, and maintaining data pipelines and infrastructure on the AWS platform. As a Data Engineer, worked with large volumes of data, ensuring its quality, reliability, and accessibility.
Built and maintained scalable and reliable data pipelines, ensuring the smooth flow of data from various sources to the desired destinations in the AWS cloud environment.
Led a data conversion project at Credit Karma, utilizing Python libraries such as NumPy and SciPy to ensure accurate and efficient data transformation.
Designed and implemented ETL processes to migrate data from legacy systems to Teradata, enhancing data accessibility and integrity.
Developed and optimized PL/SQL scripts to extract and manipulate large datasets from Teradata, improving query performance and data retrieval efficiency.
Developed tools using Python and shell scripting to automate menial tasks.
Designed and developed a Security Framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker (a minimal DAG sketch follows this section).
Experienced in developing Web Services with Python programming language.
Used Spark and Scala for developing machine learning algorithms that analyze clickstream data.
Implemented a CI/CD pipeline using Jenkins, Airflow for Containers from Docker, and Kubernetes.
Worked on complex SQL queries and PL/SQL procedures and converted them into ETL tasks.
Supported current and new services that leverage AWS cloud computing architecture, including EC2, S3, and other managed service offerings.
Created a task scheduling application to run in an EC2 environment on multiple servers.
Designed, built, and deployed a set of Python modeling APIs for customer analytics that integrate multiple machine learning techniques for user behavior prediction and support multiple marketing segmentation programs.
Worked on the design of Star & Snowflake schema data model.
Performed data purging and applied changes using Databricks and Spark data analysis.
Extensively used Databricks notebooks for interactive analysis with Spark APIs.
Used a metadata tool to import metadata from the repository, create new job categories, and create new data elements.
Used Spark for data analysis and stored final computation results in HBase tables.
Developed a fully automated continuous integration system using Git, Jenkins, MySQL, and custom tools developed in Python and Bash.
Managed large datasets using PySpark, pandas, and Dask DataFrames.
Wrote custom Python scripts to implement ETL transformation logic and perform data-driven analysis, data quality checks, and data profiling.
Worked with SQL Profiler, Index Tuning Wizard, and estimated query plans to tune the performance of SQL queries and stored procedures.
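A minimal sketch of the Airflow-on-AWS monitoring pattern referenced above, assuming a single PythonOperator task that polls a SageMaker training job via Boto3; the DAG id, schedule, and training job name are hypothetical placeholders rather than the actual workflow definition.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

TRAINING_JOB_NAME = "example-training-job"  # hypothetical SageMaker job name

def check_training_status(**_):
    """Poll SageMaker for the training job status and fail the task if the job failed."""
    sm = boto3.client("sagemaker")
    status = sm.describe_training_job(TrainingJobName=TRAINING_JOB_NAME)["TrainingJobStatus"]
    if status == "Failed":
        raise RuntimeError(f"SageMaker training job {TRAINING_JOB_NAME} failed")
    return status

with DAG(
    dag_id="sagemaker_ml_workflow_monitor",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    monitor_training = PythonOperator(
        task_id="monitor_training",
        python_callable=check_training_status,
    )

A real multi-stage workflow would chain additional tasks (data prep, training kick-off, model registration) ahead of this monitoring step.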
Kroger, Cincinnati OH March 2015 – May 2016
Data Analyst
Participated in design and analysis sessions with business analysts, source-system technical teams, and end users.
Deep understanding of writing test cases to ensure data quality, reliability, and a high level of customer confidence in their bookings (logistics, airlines, etc.).
Continuously improved the quality, efficiency, and scalability of data pipelines.
Designed and built scalable and reliable data infrastructure and pipelines (ingestion, integration, ETL, real-time connectors) to support airline customer data for measurement and reporting.
Responsible for developing and maintaining ETL jobs, including ETL implementation and enhancements, testing and quality assurance, troubleshooting issues, and ETL/Query performance tuning.
Designed, developed, maintained, and supported data warehouse and OLTP processes via Extract, Transform, and Load (ETL) software using Informatica.
Worked with PySpark to use Spark libraries through Python scripting for data analysis.
Developed an ETL pipeline to extract archived logs from disparate sources and store them in an AWS S3 data lake for further processing using PySpark (a minimal sketch follows this section).
Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
Developed custom ETL solutions, batch processing and real-time data ingestion pipeline to move data in and out of Hadoop using PySpark and shell scripting.
Managed and expanded the existing ETL framework for enhanced functionality and expanded sourcing.
Translated business requirements into ETL and report specifications. Performed error handling using session logs.
Involved in creating database objects such as tables, views, procedures, triggers, and functions using T-SQL to define, structure, and maintain data efficiently.
Performed code reviews of ETL and SQL processes.
Created complex mappings in PowerCenter Designer using Aggregator, Expression, Filter, and Sequence Generator transformations.
Performed Relational Data Modelling, ER Diagrams (forward & reverse engineering), dimensional modelling, OLAP multidimensional Cube design & analysis, defining slowly changing dimensions, and surrogate key management.
Worked with the testing team to resolve bugs related to day one ETL mappings before production.
Maintained ETL release document for every release and migrated the code into higher environments through deployment groups.
Created weekly project status reports, tracking the progress of tasks according to schedule and reporting any risks and contingency plans to management and business users.
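A minimal sketch of the archived-log ingestion pipeline referenced in the PySpark bullet above, assuming gzipped JSON logs landed in an archive bucket; the source and target S3 paths and the partition column are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("archived-log-ingestion").getOrCreate()

SOURCE_PATH = "s3://example-archive/logs/*.gz"      # hypothetical archive location
TARGET_PATH = "s3://example-datalake/raw/logs/"     # hypothetical data lake prefix

# Read compressed JSON logs, stamp each record with its load date, and partition by it.
logs = spark.read.json(SOURCE_PATH).withColumn("load_date", F.current_date())

(logs.write
     .mode("append")
     .partitionBy("load_date")
     .parquet(TARGET_PATH))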