Data Engineer Machine Learning

Location:
Dallas, TX
Posted:
February 13, 2025

Resume:

BUYYANI VIKRAM

+1-469-***-**** *****************@*****.***

PROFILE SUMMARY

●A versatile professional with 10+ years of experience in Information Technology and 8+ years as a Cloud Python Developer and Data Engineer, spanning Data Analysis, Statistical Analysis, Machine Learning, Deep Learning, Data Mining on large structured and unstructured data sets, and Big Data.

●MLOps and ML Engineer with a proven track record in deploying and managing end-to-end machine learning pipelines, ensuring seamless integration from development to production environments.

●Experience in the AWS environment for the development and deployment of custom Hadoop applications.

●Hands-on experience with major components in Hadoop Ecosystem like MapReduce, HDFS, YARN, Hive, Pig, HBase, Sqoop, Oozie, Cassandra, Impala and Flume.

●Skilled in collaborating with cross-functional teams, including data scientists, software engineers, and operations teams, to streamline communication and promote a collaborative, agile approach to MLOps.

●Proven ability to assess and implement security best practices in MLOps, ensuring data privacy, integrity, and compliance with industry regulations and standards.

●Expertise in designing and deployment of Hadoop cluster and different Big Data analytic tools.

●Experience in building end to end data science solutions using R, Python, SQL and Tableau by leveraging machine learning based algorithms, Statistical Modeling, Data Mining, Natural Language Processing (NLP) and Data Visualization.

●Strong experience in software development in Python (libraries used: Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, pandas DataFrames, NetworkX, urllib2, MySQLdb for database connectivity) and IDEs such as Sublime Text, Spyder, and PyCharm.

●Design, develop, and implement ETL processes using Ab Initio to extract, transform, and load data from various sources (databases, files, APIs).

●Create and maintain Ab Initio graphs for ETL processes, data integration, and transformation workflows.

●Experienced in Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.

●Use Ab Initio’s Data Profiler and other tools to identify data inconsistencies, gaps, and anomalies.

●Experience with Design, code, debug operations, reporting, data analysis and Web Applications using Python.

●Proficient in Python programming concepts such as object-oriented programming, multi-threading, exception handling, and collections.

●Designed and implemented ETL pipelines using Databricks on Azure to process and analyze large datasets, reducing data processing time by 50% (a minimal PySpark sketch of this pattern appears after this summary list).

●Developed and optimized Spark jobs in Databricks to handle complex transformations and aggregations, ensuring efficient use of resources and quick turnaround times.

●Collaborated with data scientists and analysts to integrate Databricks with other data platforms like Snowflake and Redshift, facilitating seamless data flow and analytics.

●Proficient in working with AWS Redshift, Google BigQuery, and Snowflake to store and analyze large volumes of structured data at scale.

●Managed Databricks clusters, including performance tuning, job scheduling, and cost optimization, resulting in a 30% reduction in operational costs.

●Conducted workshops and training sessions on Databricks best practices, enhancing team proficiency and project success.

●Excellent understanding of machine learning techniques and algorithms, such as K-NN, Naive Bayes, SVM, Decision Forests, Random Forest etc.

●Hands-on experience working with various Relational Database Management Systems (RDBMS) like MySQL, Microsoft SQL Server, Oracle & non-relational databases (NoSQL) like MongoDB and Apache Cassandra.

●Good knowledge of writing SQL queries, stored procedures, functions, tables, views, and triggers on databases such as Oracle and MySQL.

●Extensively worked on Spark using Scala on a cluster for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.

●Knowledge on GCP tools like BigQuery, Pub/Sub, Cloud SQL and Cloud functions.

●Proficient in data visualization tools such as Tableau, Python Matplotlib/Seaborn, R ggplot2/Shiny to generate charts like Box Plot, Scatter Chart, Pie Chart and Histogram etc., and to create visually impactful and actionable interactive reports and dashboards.

●Optimize Databricks workflows and Spark jobs for performance and scalability.

●Manage and optimize cloud-based resources (e.g., AWS, Azure, GCP) for Databricks implementations.

●Experience in using Teradata ETL tools and utilities such as BTEQ, MLOAD, FASTLOAD, TPT, and FastExport.

●Experience in Schema Design and designing efficient, portable, fast & Dimensional Data Models using Avro and ORC for Big Data Applications.

●Experience in designing, coding, and testing data extraction pipelines from various data sources using Sqoop, Kafka, Spark Streaming, FTP, and RESTful web service APIs.

●Experience in designing, coding and testing ETL processing using Java, Scala, Spark, Hive and SQL.

●Experience in designing and implementing scheduling workflows using Oozie.

●Experience in Data Analysis and Data Visualization using tools like Tableau and QlikView.

●Extensively worked with CI/CD and build tools like Git, sbt, Maven, and Jenkins.

●Experience in Creating Automated Deployment using Unix Shell Scripting and Jenkins.
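
A minimal PySpark sketch of the Databricks ETL pattern referenced in the summary bullet above; the mount paths, column names, and aggregations are hypothetical placeholders rather than details of an actual pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # On Databricks a SparkSession named `spark` is provided; getOrCreate() keeps
    # the sketch runnable outside a notebook as well.
    spark = SparkSession.builder.appName("transactions_etl").getOrCreate()

    # Hypothetical raw landing zone (JSON files).
    raw = spark.read.json("/mnt/raw/transactions/")

    cleaned = (
        raw.dropDuplicates(["transaction_id"])
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("amount").isNotNull())
    )

    # Daily aggregates for downstream analytics.
    daily = (
        cleaned.groupBy(F.to_date("transaction_ts").alias("txn_date"))
               .agg(F.count("*").alias("txn_count"), F.sum("amount").alias("total_amount"))
    )

    # Delta is the default table format on Databricks.
    daily.write.format("delta").mode("overwrite").save("/mnt/curated/daily_transactions/")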

TECHNICAL SKILLS

Hadoop Technologies

HDFS, YARN, Spark, MapReduce, HBase, Phoenix, Solr, Hive, Impala, Pig, NiFi, Sqoop, Hue UI, Cloudera, Kerberos

Programming Languages

Java, Python, C, PL/SQL, XML

Web Technologies

JSP, Servlets, Struts, Spring, JavaScript, HTML/HTML5, CSS/CSS3, jQuery, Bootstrap, Ajax

Analysis/Design Tools

Ab Initio ETL, Data Modelling, Design Patterns, UML, Axure, Photoshop

Cloud Tools

Azure Blob Storage, Azure Databricks, Azure Functions, Azure Key Vault, Azure SQL Database, Google Cloud Storage, BigQuery, Dataproc, Spanner, AWS Glue

Testing/Logging Tools

JUnit, Mockito, Log4J

Build/Deploy Tools

ANT, Maven, Gradle, TeamCity, Jenkins, uDeploy, Docker.

Database Technologies

Oracle, DB2, MySQL, MongoDB, Informix, MS SQL, Cassandra

Web Services

REST, SOAPUI

Version Control

Git, SVN, CVS

Platforms

Windows, Mac OS X, Linux

Scheduler Tools

Oozie, Autosys, Apache Airflow

Responsibilities:

Designed and built statistical models and feature extraction systems and used models to solve business problems related to the company’s data pipeline and communicated these solutions to executive stakeholders.

Researched and implemented various Machine Learning Algorithms using the R language.

Devised a machine learning algorithm using Python for facial recognition.

Used R for a prototype on a sample data exploration to identify the best algorithmic approach and then wrote Scala scripts using Spark Machine Learning module.

Developed complex queries and ETL processes using Databricks.

Experience in leveraging Databricks for data ingestion, transformation, and ETL processes using Spark-based data processing frameworks like PySpark to build scalable and performant data pipelines.

Hands-on experience of Databricks features for collaborative work, such as version control, notebook sharing, and integration with Git, enabling effective teamwork and code collaboration.

Used Scala scripts with the Spark MLlib API to run decision tree, ALS, logistic regression, and linear regression algorithms.
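
The MLlib work above was done in Scala; as a rough PySpark equivalent of the same fit/transform pattern, here is a minimal sketch with a hypothetical training table, feature columns, and hyperparameters.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib_sketch").getOrCreate()

    # Hypothetical training table with numeric features f1..f3 and a binary label.
    df = spark.table("features.training_data")

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    train = assembler.transform(df).select("features", "label")

    # Logistic regression shown here; decision trees and ALS follow the same
    # estimator.fit(...) / model.transform(...) pattern.
    lr = LogisticRegression(maxIter=20, regParam=0.01, labelCol="label", featuresCol="features")
    model = lr.fit(train)

    model.transform(train).select("label", "prediction", "probability").show(5)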

Designed and engineered REST APIs and/or packages that abstract feature extraction and complex prediction/forecasting algorithms on time series data.

Developed Python applications for AWS services aggregation and reporting and used Django configuration to manage URLs and application parameters.

Developed pre-processing pipelines for DICOM and non-DICOM images.

Developed and presented analytical insights on medical data and image data.

Expert in ingesting batch data from different sources into AWS using Spark.

Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to orchestrate and run the loads.
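
A minimal sketch of the kind of DAG described above, assuming the Airflow Snowflake provider package and hypothetical connection, stage, and table names.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    # Hypothetical external stage pointing at the S3 bucket, and a target table.
    COPY_SQL = """
        COPY INTO analytics.raw_transactions
        FROM @analytics.s3_landing_stage/transactions/
        FILE_FORMAT = (TYPE = 'JSON')
    """

    with DAG(
        dag_id="s3_to_snowflake_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        copy_into_snowflake = SnowflakeOperator(
            task_id="copy_s3_files_into_snowflake",
            snowflake_conn_id="snowflake_default",
            sql=COPY_SQL,
        )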

Created several types of data visualizations using Python and Tableau.

Collected data needs and requirements by interacting with other departments.

Worked on different data formats such as JSON, XML.

Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.

Configured EC2 instances, IAM users, and IAM roles, and created S3 data pipes using the Boto API to load data from internal data sources.

Implemented Agile Methodology for building an internal application.

Conducted statistical analysis on healthcare data using Python and various tools.

Experience with cloud-hosted version control platforms like GitHub.

Worked closely with Data Scientists to identify data requirements for experiments.

Deep experience in using DevOps technologies like Jenkins, Docker, Kubernetes, etc.

Worked on JSON-based REST web services and Amazon Web Services (AWS); responsible for setting up the Python REST API framework using Django. Worked on deployment, data security, and troubleshooting of the applications using AWS services.

Implemented AWS Lambda to drive real-time monitoring dashboards from system logs.
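
A minimal sketch of that Lambda pattern, assuming a CloudWatch Logs subscription as the trigger and a hypothetical custom-metric namespace feeding the dashboard.

    import base64
    import gzip
    import json

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def handler(event, context):
        # CloudWatch Logs subscriptions deliver a base64-encoded, gzipped payload.
        payload = base64.b64decode(event["awslogs"]["data"])
        log_data = json.loads(gzip.decompress(payload))

        # Count error lines in this batch (hypothetical matching rule).
        error_count = sum(
            1 for e in log_data.get("logEvents", []) if "ERROR" in e.get("message", "")
        )

        # Publish a custom metric that a CloudWatch dashboard can chart in near real time.
        cloudwatch.put_metric_data(
            Namespace="AppMonitoring",
            MetricData=[{"MetricName": "ErrorCount", "Value": error_count, "Unit": "Count"}],
        )
        return {"errors": error_count}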

Worked on migrating an on-premises virtual machine to Amazon Web Services (AWS) cloud.

Developed merge jobs in Python to extract and load data into a MySQL database.

Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations based on the source-to-target mappings (STMs) developed.

Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.

Responsibilities:

Successfully designed and implemented end-to-end data pipelines to process high-volume bank transaction data using GCP technologies.

Demonstrated expertise in SQL and BigQuery to design and optimize complex queries to extract insights from terabytes of transactional data.
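
A minimal sketch of a parameterized BigQuery query issued through the Python client; the project, dataset, and column names are hypothetical, and the date filter is what keeps scans over terabytes of transactions bounded.

    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT
            DATE(transaction_ts) AS txn_date,
            COUNT(*) AS txn_count,
            SUM(amount) AS total_amount
        FROM `my-project.transactions.raw_events`
        WHERE DATE(transaction_ts) BETWEEN @start_date AND @end_date
        GROUP BY txn_date
        ORDER BY txn_date
    """

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01"),
            bigquery.ScalarQueryParameter("end_date", "DATE", "2024-01-31"),
        ]
    )

    for row in client.query(query, job_config=job_config).result():
        print(row.txn_date, row.txn_count, row.total_amount)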

Developed complex ETL processes using Hadoop, Python, and PySpark to transform raw transaction data into a structured format for downstream analytics.

Built scalable data processing pipelines using SparkSQL and PySpark to handle large datasets and distributed computing.

Expertise in containerization and deployment using Kubernetes to deploy the data processing pipelines.

Designed and implemented CI/CD pipelines using Jenkins to automate the build, testing, and deployment of data pipelines.

Expertise in designing and implementing scalable data processing solutions on Google Cloud Platform (GCP), including GCP data processing services like Dataflow, Dataproc, and BigQuery.

Proficient in building data pipelines using GCP Dataflow for real-time and batch data processing, including data ingestion, transformation, and enrichment.
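
A minimal Apache Beam sketch of the ingestion, transformation, and enrichment pattern described above; it runs locally on the DirectRunner, and the same pipeline can be submitted to Dataflow with the usual --runner/--project/--temp_location options. Bucket paths and the record schema are hypothetical.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_and_enrich(line):
        # Parse a JSON record and tag it with a derived field (hypothetical schema).
        record = json.loads(line)
        record["is_high_value"] = record.get("amount", 0) > 10_000
        return record

    options = PipelineOptions()  # picks up runner/project options from the command line

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromGCS" >> beam.io.ReadFromText("gs://example-bucket/transactions/*.json")
            | "ParseAndEnrich" >> beam.Map(parse_and_enrich)
            | "KeepHighValue" >> beam.Filter(lambda r: r["is_high_value"])
            | "FormatAsJson" >> beam.Map(json.dumps)
            | "WriteToGCS" >> beam.io.WriteToText("gs://example-bucket/output/high_value")
        )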

Strong understanding of GCP BigQuery for data warehousing and analytics, including optimizing query performance, managing partitions, and working with nested and repeated data structures.

Experience in setting up and managing GCP Dataproc clusters for distributed data processing using Apache Spark or Apache Hadoop.

Proficient in GCP Pub/Sub for building event-driven data processing systems and real-time streaming applications.
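
A minimal Pub/Sub publish/subscribe sketch with hypothetical project, topic, and subscription names.

    from concurrent.futures import TimeoutError

    from google.cloud import pubsub_v1

    # Publish one event (message payloads are bytes).
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "transactions")
    publisher.publish(topic_path, b'{"transaction_id": "t-1", "amount": 42.0}').result()

    # Pull events asynchronously and acknowledge them after processing.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "transactions-sub")

    def callback(message):
        print("received:", message.data)
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=30)
    except TimeoutError:
        streaming_pull.cancel()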

Hands-on experience with GCP Storage services like Cloud Storage for scalable and durable object storage, including bucket configuration and access control.

Knowledge of GCP Identity and Access Management (IAM) for managing user roles and permissions within GCP projects and services.

Experience with GCP Stackdriver for monitoring and logging, including setting up custom metrics, alerts, and dashboards for data engineering workflows.

Proficient in GCP Deployment Manager or Terraform for infrastructure provisioning and management as code.

Strong understanding of CI/CD (Continuous Integration/Continuous Deployment) practices in the context of GCP, including automating deployment processes and version control using tools like Cloud Build or Jenkins.

Leveraged Teradata's Parallel Data Warehouse (PDW) architecture to design and implement scalable data warehousing solutions, accommodating exponential data growth and ensuring high performance by conducting Teradata database tuning and optimization through indexing, partitioning, and query plan enhancements.

Developed data pipelines using Airflow for scheduling and executing ETL tasks across multiple environments.

Expertise in data modeling techniques and database design principles to develop efficient and scalable data storage solutions.

Developed and maintained data quality frameworks to ensure data consistency and accuracy in downstream analytics.

Worked closely with data scientists to develop data models and predictive analytics using transactional data.

Conducted performance tuning and optimization to improve data processing and query performance.

Experienced in managing and maintaining data warehousing solutions using GCP BigQuery.

Developed and maintained data governance policies to ensure regulatory compliance for financial data.

Demonstrated proficiency in data visualization using tools such as Tableau and PowerBI.

Created and maintained technical documentation to ensure knowledge transfer and maintainability of data pipelines.

Worked in an Agile development environment and used Jira to manage project tasks and progress.

Conducted regular code reviews and provided constructive feedback to improve code quality and maintainability.

Maintained up-to-date knowledge of industry trends and emerging technologies related to data engineering and analytics.

Responsibilities:

Partnered with Product and Engineering teams to identify disparate data sources that need to be tapped for building end-to-end customer records and sessionization.

Responsible for schema design and big data modelling.

Responsible for loading raw and untransformed data into the data lake.

Designed, developed, and tested data pipelines using Sqoop, Scala, Spark, sbt, and Oozie for ingesting internal source data and other third-party external data from disparate sources, APIs, and flat files into Hadoop.

Designed, developed and tested data pipelines to ingest real time data from Kafka topics using Kafka, Spark Streaming, Scala and Sbt.
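
The original ingestion code was written in Scala; as a rough PySpark equivalent, the sketch below reads a Kafka topic with Structured Streaming and lands it as Parquet. Broker addresses, topic, and paths are hypothetical, and the spark-sql-kafka connector package must be available to the job.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka_ingest_sketch").getOrCreate()

    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
             .option("subscribe", "transactions")
             .option("startingOffsets", "latest")
             .load()
    )

    # Kafka delivers key/value as binary; cast to string before downstream parsing.
    parsed = events.select(
        F.col("key").cast("string").alias("key"),
        F.col("value").cast("string").alias("payload"),
        "timestamp",
    )

    query = (
        parsed.writeStream.format("parquet")
              .option("path", "/data/landing/transactions/")
              .option("checkpointLocation", "/data/checkpoints/transactions/")
              .outputMode("append")
              .start()
    )
    query.awaitTermination()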

Demonstrated expertise in utilizing Amazon Redshift for executing database migrations on a large scale.

Designed and developed ETL processes in AWS Glue to migrate data from external sources like S3, ORC/Parquet/Text Files into AWS Redshift.
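
A minimal AWS Glue job sketch for the S3-to-Redshift migration pattern described above; it assumes the Glue runtime (the awsglue library exists only there) and uses hypothetical paths, a hypothetical catalog connection, and placeholder table names.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read Parquet files from a hypothetical S3 landing path into a DynamicFrame.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-landing/transactions/"]},
        format="parquet",
    )

    # Write to Redshift through a Glue catalog connection; Glue stages the data
    # in S3 before issuing the COPY.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection="redshift-connection",
        connection_options={"dbtable": "analytics.transactions", "database": "dw"},
        redshift_tmp_dir="s3://example-temp/redshift-staging/",
    )

    job.commit()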

Responsible for transforming data and joining with disparate enterprise data sources for creating datasets for analytical and data science use cases.

Experienced in Amazon Web Services (AWS) and Microsoft Azure services such as AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, Azure Storage, and Azure Data Lake.

Designed and developed data pipelines using Java, Scala, Kafka, Spark, sbt, and Hive for data transformation and for building fact and dimension tables for analytical and data science use cases.

Created Data Pipelines for creating data models to help data scientists in applying Machine Learning and Predictive Analytics Algorithms for building Recommendation Engine.

Responsible for data quality, timeliness and correctness.

Designed and built custom Data Quality check utilities for ensuring Data Quality.

Designed and Implemented SLA’s for each of the Business Data flows.

Designed and Implemented Oozie workflows.

Responsible for Data Quality, Debugging, Performance Enhancements and maintaining productionized Hadoop Applications.

Reports & Data Visualization using Tableau.

Wrote RESTful web service APIs for serving data to other enterprise applications.
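
A minimal Flask sketch of a REST endpoint serving data to other applications; the route, payload shape, and the in-memory lookup standing in for the real data source are all hypothetical.

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Hypothetical placeholder standing in for the real enterprise data source.
    DAILY_METRICS = {
        "2024-01-01": {"txn_count": 1250, "total_amount": 98234.50},
    }

    @app.route("/api/v1/metrics/<txn_date>", methods=["GET"])
    def get_metrics(txn_date):
        metrics = DAILY_METRICS.get(txn_date)
        if metrics is None:
            return jsonify({"error": "not found"}), 404
        return jsonify({"date": txn_date, **metrics})

    if __name__ == "__main__":
        app.run(port=8080)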

EDUCATION

Master of Science in Computers and Information Systems, Park University, Parkville, MO.

CVS Health, Irving, Texas

Designation: Sr. Data Engineer

Duration: Sep 2023 – Current

Citi Group, Irving, Texas

Designation: Data Engineer

Duration: Jan 2023 – Aug 2023

Cognizant Technology Solutions

Designation: Data Engineer

Duration: Dec 2011 – June 2022


