Sai Teja
***********@*****.***
Principal Research Data Analyst
PROFESSIONAL SUMMARY:
9+ years of experience in the Software Development Life Cycle (SDLC), specializing in Data Analysis and Big Data Engineering.
Expertise in advanced Data Analysis techniques, including Data Validation, Cleansing, and Verification, with a focus on resolving data mismatches.
Proficient in deploying Continuous Delivery pipelines using modern tools like Jenkins X, CircleCI, and AWS CodePipeline.
Deep understanding of Distributed Systems Design, with knowledge of emerging HDFS alternatives like Alluxio.
Extensive experience in developing scalable Spark Applications using advanced features of RDDs, DataFrames, Spark SQL, and Spark 3.0+ enhancements.
Skilled in troubleshooting and optimizing Spark applications, including experience with the latest Spark performance tuning techniques.
Expertise in configuring Spark for optimal performance, including dynamic allocation, adaptive query execution, and advanced caching strategies.
Solid experience with AWS cloud services, including EMR, S3, Lambda, ECS, EKS, Redshift, and Athena, staying current with the latest AWS offerings.
Proficiency in real-time data processing using Spark Structured Streaming and Kafka, including Kafka Streams and KSQL.
In-depth experience with cloud environments, focusing on AWS solutions like EC2, S3, and advanced AWS networking and security features.
Experience in modern architecture approaches using Spring Boot, Docker, Kubernetes, and service mesh technologies like Istio.
Expertise in handling various data formats such as Avro, Parquet, and newer formats like Delta Lake and Iceberg.
Skilled in utilizing and managing different Hadoop distributions and their latest versions, including Cloudera's CDP and Databricks.
Advanced proficiency in Teradata, including newer features and optimizations in Teradata Vantage.
Strong skills in data analysis and reporting using advanced Excel features and Power BI.
Proficient in data import/export using tools like Apache Sqoop and cloud-native solutions like AWS Glue.
Expertise in writing complex Hive queries, including the latest optimizations and integrations with cloud data warehouses.
Experience in migrating databases to Azure Data Lake, utilizing Azure Synapse Analytics, Azure SQL Database, and Azure Databricks.
Strong programming skills in Java, Scala, and Python, keeping current with the latest language features and frameworks.
Comprehensive experience in Data Analysis and Migration using ETL tools like Talend, Apache NiFi, and Informatica PowerCenter.
Proficiency in writing and testing SQL and PL/SQL, with experience in newer database technologies like Snowflake and Google BigQuery.
Solid understanding of Core Java concepts, current with the latest Java versions and features.
Experience configuring Spark Streaming with Kafka, and storing and processing data using cloud-native solutions.
Knowledge in developing custom UDFs in Hive and adapting to newer technologies like Presto and Trino for big data querying.
Strong experience with databases including Oracle, MySQL, and newer technologies like Amazon Aurora and CockroachDB, along with proficiency in complex SQL querying.
TECHNICAL SKILLS:
Big Data Tools: Apache Spark, Kafka, Apache Flink, Apache Beam, Databricks
Languages: Python (Pandas, NumPy, SciPy, Scikit-learn), Scala, SQL, Java, R, Julia
BI Tools: SSIS, SSRS, SSAS, Tableau, Power BI, Looker
Modeling Tools: Erwin, ER/Studio, Sybase Power Designer, ArchiMate, Lucidchart
Data Warehousing: Snowflake, Google BigQuery, Informatica, SSIS, Data Stage
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP)
Database Tools: Oracle, MySQL, Microsoft SQL Server, Teradata, MongoDB, Cassandra, HBase, PostgreSQL, Redis, Elasticsearch
ETL Tools: Pentaho, Informatica PowerCenter 9.6, Talend, Apache NiFi, SAP BusinessObjects XI R3.1/XI R2, Web Intelligence
Reporting Tools: Business Objects, Crystal Reports, QlikView/Qlik Sense
Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant, Docker, Kubernetes, Git
Operating System: Windows, Unix, Linux
Other: Machine Learning and AI
PROFESSIONAL EXPERIENCE:
Client: LabCorp, Durham, NC July 2023 to Present
Role: Principal Research Analyst
Roles & Responsibilities:
Leveraged advanced data analytics using AI and ML algorithms in Python and R for predictive modeling and trend analysis.
Implemented Big Data processing solutions with Apache Spark and Databricks for enhanced data processing and analysis.
Utilized cloud-native services in AWS, Azure, and Google Cloud Platform (GCP) for scalable and efficient data storage and processing.
Developed and maintained data pipelines using Apache Airflow and Apache NiFi for automated data flow and scheduling.
Employed SQL and NoSQL databases like MongoDB, Cassandra, and Couchbase for diverse data storage needs.
Created real-time data streaming solutions using Apache Kafka, Apache Flink, and Amazon Kinesis for instant data availability and analysis.
Applied advanced ETL techniques using tools like Talend, Informatica, and Matillion for efficient data transformation and loading.
Utilized containerization and orchestration technologies like Docker and Kubernetes for deploying and managing data applications.
Incorporated data virtualization techniques for real-time data integration and access without the need for physical data movement.
Implemented data governance and quality control measures using Collibra and Alation to ensure data integrity and compliance.
Utilized graph databases like Neo4j for analysis and visualization of complex data relationships.
Developed robust data models and architectures using the latest methodologies for optimized storage and retrieval.
Employed advanced data visualization tools like Tableau, Power BI, and Looker for insightful and interactive reporting.
Integrated machine learning model deployment in data pipelines for automated predictive insights.
Utilized Jupyter and Zeppelin notebooks for collaborative data exploration and analysis.
Implemented cybersecurity measures for data protection and privacy in collaboration with IT security teams.
Utilized serverless architectures like AWS Lambda and Azure Functions for cost-efficient and scalable data processing.
Applied advanced statistical techniques and hypothesis testing to derive insights from complex datasets.
Created and maintained robust data warehouses in Snowflake and Redshift for centralized data storage and analysis.
Implemented automated data quality checks using Python scripts and SQL queries to ensure data accuracy.
Developed custom APIs for seamless data access and integration with other systems and platforms.
Utilized Git and CI/CD pipelines for version control and efficient deployment of data applications.
Wrote detailed documentation and maintained a knowledge repository for data processes and systems.
Monitored and optimized cloud resources for cost-effective data operations using cloud management tools.
Engaged in agile project management methodologies for iterative development and delivery of data solutions.
Collaborated with cross-functional teams to align data analytics strategies with business objectives.
Conducted training sessions for stakeholders on data tools and dashboards to enable data-driven decision-making.
Stayed current with emerging data technologies and tools, and continuously explored opportunities for innovation and improvement in data practices.
Environment: HDFS, Python, Django, Flask, Clinical Informatics, PySpark, MapReduce, Hive, Snowflake, AWS, EC2, S3, Lambda, Redshift, Pig, Sqoop, Oozie, Teradata, Informatica, PL/SQL, Oracle SQL, UC4, Kafka, GitHub, Hortonworks Data Platform (HDP), Spark, Scala, Perl, Toad, MS Access, Excel.
Client: Troy University, Troy, AL Aug 2021 to May 2023
Role: Graduate Research Analyst
Roles & Responsibilities:
Analyzed large datasets using advanced data analytics techniques and tools like Python, R, and SQL.
Developed and maintained ETL pipelines using modern cloud-based platforms such as AWS Glue, Azure Data Factory, and Google Cloud Dataflow.
Implemented real-time data processing and analytics using Apache Kafka, Apache Flink, and Spark Streaming.
Optimized database performance and query execution speed in DB2 z/OS and other relational database systems.
Performed data cleansing and validation within CRM systems like Salesforce to ensure data accuracy and reliability.
Created various parser programs to extract data from systems like Autosys, TIBCO, XML, Informatica, and Java-based applications.
Generated ETL exception and validation reports post data loading into data warehouses.
Conducted thorough User Acceptance Testing (UAT), maintaining comprehensive documentation including test plans and reports.
Automated data workloads and integrated them with cloud-based data warehouses like Snowflake.
Developed applications for refreshing BI tools, such as Power BI, using automated trigger APIs.
Utilized SQL, Hive, and Pig for data importation, cleaning, filtering, and analysis.
Retrieved data from NoSQL databases like Cassandra using CQL and Java APIs.
Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages for database optimization.
Implemented Spark using Scala and Spark SQL for faster data testing and processing.
Employed Apache Spark with Python for Big Data Analytics projects.
Wrote and tested code for Ingest automation processes, both full and incremental loads, using tools like Sqoop, MapReduce, Shell scripts, and Python.
Created and maintained optimal data pipeline architecture in cloud environments like Microsoft Azure, using Azure Data Factory and Azure Databricks.
Led One Time Data Migration projects from SQL Server to Snowflake using Python and SnowSQL.
Involved in extensive data validation using complex SQL queries and backend testing.
Developed ETL pipelines in and out of data warehouses and produced regulatory and financial reports.
Worked on Oozie jobs for creating Hadoop Archive (HAR) files.
Designed data ingestion processes from various sources using Apache NiFi and Kafka.
Implemented machine learning backend pipelines using Python libraries like Pandas and NumPy.
Developed Spark programs for faster data processing compared to traditional MapReduce programs.
Integrated external services and APIs within platforms like MuleSoft to enrich datasets for deeper analysis.
Environment: Agile, Power BI, Azure, Azure Databricks, Azure Data Factory, Azure Data Lake, Hadoop, Hortonworks, Snowflake, HDFS, Solr, HAR, HBase, Oozie, Scala, Informatica, Python, SOAP web services, Java, WebLogic, PL/SQL, Oracle, MS Office, Access, Excel, Tableau, Apache Airflow, Jira.
Client: Maybank, Kuala Lumpur, Malaysia Oct 2019 to Aug 2021
Role: Data Analyst/Data Modeler
Project Overview: Tax Fraud Detection for Malaysia Department of Taxation
The project's primary goal was to strengthen the Department of Taxation's ability to detect tax fraud using advanced data analytics. As a Data Analyst, I employed COBOL, DB2, CICS, and Python to extract insights from complex tax datasets, combining traditional mainframe tools with modern technologies like AWS and SQL. This integration produced a real-time fraud detection system that promptly identifies anomalies and patterns, safeguarding tax revenues and preventing potential losses. By pairing mainframe technologies with contemporary tools, the project strengthened the department's capacity to detect and combat tax fraud and to protect the state's financial resources.
Responsibilities:
Conducted Initial Analysis to identify and rectify erroneous/misaligned data, collaborating with Data Verification and Data Cleaning teams for resolution.
Prioritized business and information needs in consultation with management.
Established standardized processes for data analysis across diverse systems, pinpointing data discrepancies and recommending resolutions.
Applied statistical techniques to interpret and analyze data, extensively using SQL queries for data validation and analysis.
Performed Data Quality Analysis (DQA), Data Profiling, and detailed Data Analysis (DDA) on source data.
Collaborated with business users to validate Complex ETL Mappings and Sessions, ensuring accurate data loading.
Leveraged AWS Glue for ETL automation, streamlining data processing workflows, and enhancing accuracy.
Utilized AWS Lambda to create serverless data processing pipelines, orchestrating analysis tasks within AWS.
Designed ER diagrams, logical and physical database models, aligning with business needs.
Employed Python's pandas library for data cleaning, transformation, and analysis, integrating statistical techniques and data visualization.
Utilized Amazon QuickSight, a business intelligence and data visualization tool, to create interactive dashboards, reports, and visualizations.
Developed customized Tableau dashboards to present KPIs and metrics tailored to business requirements.
Documented data mapping and transformation processes in functional design documents.
Managed Salesforce customization, security access, workflow approvals, and data utilities.
Integrated Apex applications into Salesforce, enhancing account management.
Performed data validation using Excel functions, including pivot tables and VLOOKUPs.
Defined roles and responsibilities for data governance, ensuring clear accountability.
Assumed responsibility for loading, extracting, and validating client data.
Conducted data analysis using relational databases such as MS Access, Teradata, and SQL Server.
Environment: COBOL, DB2, CICS, Mainframe OS, Quick Edit, MS Office 2007, MS Visio 2003, SharePoint, Python, PowerPoint, MS Project, UML, SQL, MS Excel 2000, AWS, Windows XP Professional.
Client: SBI, India May 2016 to Oct 2019
Role: Data Analyst
Roles & Responsibilities:
Resolved on-call production issues, including Hive query problems, and implemented workarounds for defects within SLA durations.
Applied advanced statistical methods and financial modeling to assess and quantify risk exposure in derivative markets, enhancing risk management strategies.
Conducted performance tuning for ETL processes at various levels, including source, target, and database, using tools like Apache NiFi and Azure Data Factory for optimized throughput.
Developed and optimized PL/SQL Procedures, Functions, Packages, Triggers, and Views, ensuring efficient database management and operations.
Utilized advanced SQL functions and Python scripts to extract and relay information from databases to front-end interfaces.
Created and maintained Spark streaming pipelines in Java for parsing JSON data and storing it in cloud-based data warehouses like Snowflake.
Extensively used tools like Apache Sqoop for efficient data import from traditional databases like Oracle to Hadoop ecosystems.
Designed and managed Hive tables, optimizing data loading and query execution using Hadoop and cloud-based data platforms.
Developed and deployed AWS Lambda functions in Python and Java for event-driven processing and automation tasks.
Created comprehensive Data Mapping Repository documents as part of advanced Metadata Management services.
Engineered complex Hive queries to process and visualize data, facilitating insightful data cube generation.
Implemented schema extraction techniques for Parquet and Avro file formats in Hive, enhancing data processing efficiency.
Gained expertise in Talend Open Studio for ETL job design, focusing on data transformation and integration.
Implemented data partitioning strategies in Hive, including dynamic partitions and bucketing, for optimized data management.
Collaborated cross-functionally with infrastructure, network, database, application, and BI teams to ensure high data quality and availability.
Utilized Python and Scala for developing and maintaining data processing pipelines in Apache Spark, improving processing speed and efficiency.
Developed and maintained data pipelines in Azure Databricks, ensuring smooth data flow and processing.
Leveraged Google Cloud BigQuery for large-scale data analytics, optimizing SQL queries for performance and cost-efficiency.
Implemented machine learning algorithms using Python libraries like Pandas and scikit-learn to derive insights from large datasets.
Worked with real-time data streaming technologies like Apache Kafka and Flink to facilitate immediate data processing and analytics.
Environment: Hadoop, MapReduce, AWS, HDFS, Pig, HiveQL, MySQL, UNIX Shell Scripting, Tableau, Java, Spark.
Education:
Master's in Bioinformatics (Computer Science) from Troy University, Troy, Alabama