
Senior Data Engineer with Big Data Expertise

Location:
McKinney, TX
Posted:
December 12, 2025


Lakshmi Mikkilineni

Data Engineer

Phone: +1-214-***-**** Location: TX Email: *******.*******@*****.***

SUMMARY

• Around 5 years of technical software development experience with expertise in Big Data, Hadoop Ecosystem, Analytics, Cloud Engineering, and Data Warehousing.

• Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, data munging, and machine learning.

• Strong proficiency in building large-scale applications using the Big Data ecosystem: Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume, NiFi, and AWS.

• Exhibiting substantial experience with Azure services including HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.

• Extensive involvement with AWS services such as EC2, S3, EMR, RDS, VPC, Elastic Load Balancing, IAM, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and Lambda for triggering downstream processing.

• In-depth understanding of Hadoop architecture and its components: HDFS, JobTracker, TaskTracker, NameNode, DataNode, MapReduce, and Spark.

• Proficient in Scala programming for developing high-performance and scalable applications. Leveraged Scala's concise syntax and functional programming capabilities to build robust software solutions.

• Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), including data modeling, tuning, disaster recovery, backup, and creating data pipelines.

• Proficient in scripting with Python (PySpark), Java, Scala, and Spark-SQL for development and data aggregation from various file formats such as XML, JSON, CSV, Avro, Parquet, and ORC.

• Experienced in developing end-to-end ETL data pipelines that extract data from various sources and load it into RDBMS using Spark.

• Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS, using Spark SQL with various data sources such as JSON, Parquet, and Hive (see the sketch at the end of this section).

• Proficient in using the ELK stack (Elasticsearch, Logstash, Kibana) to build search over unstructured data held in NoSQL databases and HDFS.

• Sound knowledge in developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party integrations using Informatica.

• Highly skilled in all aspects of SDLC using Waterfall and Agile Scrum methodologies.

• Familiarity with NLP, image detection, and MapR.
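
The Kafka-to-HDFS streaming pattern mentioned above can be illustrated with a minimal PySpark Structured Streaming sketch. The broker address, topic name, payload schema, and HDFS paths below are illustrative assumptions, not values from any specific engagement.

```python
# Minimal sketch of the Kafka -> HDFS streaming pattern described above.
# Broker, topic, schema, and path names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Assumed JSON payload schema for the incoming events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", StringType()),
])

# Read the raw stream from Kafka (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; parse them into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Write the parsed stream to HDFS as Parquet with checkpointing.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```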

EDUCATION

Master’s Degree in Computer and Information Science, Southern Arkansas University, Jan 2024 to May 2025

Bachelor’s Degree in Computer Science and Engineering, Jul 2016 to Mar 2020

SKILLS

Big Data Technologies: Kafka, Scala, Apache Spark, HDFS, YARN, MapReduce, Hive, HBase, Cassandra

Cloud Services: AWS (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS); Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory)

Programming Languages: Python, SQL, Scala

Databases: Snowflake, MS SQL Server, Oracle, MySQL, PostgreSQL, DB2

Reporting/ETL Tools: Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI

Methodologies: Agile/Scrum, Waterfall

Others: Machine Learning, NLP, Airflow, Jupyter Notebook, Docker, Kubernetes, Jenkins, Jira

EXPERIENCE

Capital One, TX Jan 2025 - Present

Data Engineer

• Develop and maintain a robust data pipeline architecture that efficiently extracts, transforms, and loads data from diverse sources into GCP big data services, ensuring data accuracy and integrity.

• Built multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, coordinating tasks across the team.

• Designed and implemented the layers of the data lake and designed star schemas in BigQuery.

• Created, managed, and configured Dataproc clusters on GCP.

• Created real-time data streaming pipelines with GCP Pub/Sub and Dataflow.

• Used GCP's secure storage services for the organization's finance data: Google Cloud Storage for scalable, durable object storage, and Cloud Bigtable and Firestore as NoSQL stores for structured and semi-structured finance records and related data.

• Used BigQuery, a fully managed, serverless data warehouse, for analytics and reporting on finance data.

• Wrote Cloud Functions in Python to load data into BigQuery as CSV files arrive in GCS buckets (see the sketch at the end of this section).

• Created a data pipeline using Dataflow to import all the UDF files and Parquet files and load the data into BigQuery.

• Used Dataflow templates for processing batch and streaming data.

• Worked with GCP services such as Cloud Storage, Compute Engine, App Engine, Cloud SQL, Cloud Functions, Cloud Run, Cloud Dataflow, Cloud Composer, Cloud Bigtable, and Pub/Sub to process data for downstream customers.

• Developed production-ready Spark applications using Spark SQL, Matplotlib, GraphX, DataFrames, Datasets, Spark ML, and Spark Streaming, enabling advanced data analysis and real-time processing.

• Configured Spark Streaming to handle real-time data from sources like Apache Kafka and store stream data in various formats (JSON, Parquet, and Hive) for immediate insights and decision support.

• Managed and maintained the Hadoop ecosystem, including HDFS, MapReduce, Hive, and Spark, to ensure efficient data processing and storage.

• Developed data pipelines in Python for medical-image pre-processing, training, and testing.

• Optimized data storage and retrieval using Hadoop HDFS, reducing storage costs by 15%.

• Configured Spark Streaming to handle real-time data from Apache Kafka, storing streaming data in HDFS, and leveraging Spark-SQL for real-time analysis, providing actionable insights to stakeholders.

• Leveraged GCP's data storage and compute capabilities to build and optimize data pipelines, ensuring data accuracy and accessibility for analytics and decision-making.
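
A minimal sketch of the GCS-to-BigQuery Cloud Function pattern referenced above, using the google-cloud-bigquery client. The project, dataset, and table names are hypothetical placeholders, and a first-generation, GCS-triggered deployment is assumed.

```python
# Sketch of a GCS-triggered Cloud Function that loads an arriving CSV file
# into BigQuery. Dataset and table names are hypothetical placeholders.
from google.cloud import bigquery

def load_csv_to_bigquery(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    bucket = event["bucket"]
    name = event["name"]

    # Only react to CSV files.
    if not name.endswith(".csv"):
        return

    client = bigquery.Client()
    table_id = "my-project.analytics.raw_events"  # hypothetical target table

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,       # assume a header row
        autodetect=True,           # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
```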

Mphasis, India Apr 2021 - Jan 2023

Data Engineer

• Worked in a Databricks Delta Lake environment on AWS, leveraging Spark for efficient data processing and management, ensuring optimal performance and scalability in big data operations.

• Developed a Spark-based ingestion framework for loading data into HDFS and creating tables in Hive, executing complex computations and parallel data processing to enhance data analysis capabilities (see the sketch at the end of this section).

• Developed PowerShell scripts to monitor and manage cloud infrastructure resources on AWS, optimizing resource allocation and ensuring cost-effectiveness by automating infrastructure management tasks.

• Ingested real-time data from flat files and APIs into Apache Kafka, enabling data streaming for immediate processing and analysis, facilitating timely insights and decision-making.

• Developed a data ingestion pipeline from HDFS into AWS S3 buckets using Apache NiFi, ensuring seamless data movement and storage in a scalable and secure cloud environment.

• Created external and permanent tables in Snowflake on AWS, enabling efficient data storage and querying in a cloud-based data warehouse and supporting advanced analytics and business intelligence.

• Created Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data between Oracle and MySQL databases and the Hadoop ecosystem.

• Implemented Spark to migrate MapReduce jobs into Spark RDD transformations and Spark streaming, optimizing big data processing for speed and efficiency.

• Developed an application to clean semi-structured data like JSON into structured files before ingesting them into HDFS, ensuring data quality and consistency for downstream processing.

• Automated transformation and ingesting of terabytes of monthly data using Kafka, S3, AWS Lambda, and Oozie, streamlining data pipelines and reducing manual intervention.

• Integrated Apache Storm with Kafka to perform web analytics and to process clickstream data from Kafka to HDFS, enabling real-time analytics on user interactions and website traffic.

• Automated the CI/CD pipeline with AWS CodePipeline, Jenkins, and AWS CodeDeploy, ensuring seamless and efficient deployment of data applications and services.

• Created an internal tool for comparing RDBMS and Hadoop data, verifying that all data in the source and target matches, reducing the complexity of moving data, and ensuring data accuracy.
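
A minimal PySpark sketch of the ingestion pattern referenced above (land raw files in HDFS, clean them, and expose them as a Hive table). The input path, database, table, and column names are illustrative assumptions.

```python
# Sketch of the Spark-based ingestion pattern: land raw files in HDFS and
# expose them as a Hive table. Paths, database, table, and column names
# (order_id, order_date) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-hive-ingestion-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read semi-structured JSON landed by an upstream feed (path is a placeholder).
raw = spark.read.json("hdfs:///landing/orders/2023-01-01/")

# Basic cleanup: drop fully empty records and de-duplicate on the business key.
clean = raw.dropna(how="all").dropDuplicates(["order_id"])

# Persist as a partitioned Hive table for downstream Hive/Spark SQL queries.
spark.sql("CREATE DATABASE IF NOT EXISTS staging")
(clean.write
 .mode("overwrite")
 .partitionBy("order_date")
 .format("parquet")
 .saveAsTable("staging.orders"))

# Downstream consumers can now query the table with Spark SQL or Hive.
spark.sql(
    "SELECT order_date, COUNT(*) AS n FROM staging.orders GROUP BY order_date"
).show()
```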

Adani, India May 2019 - Apr 2021

Data Analyst

• Leveraged Python libraries and frameworks such as Pandas, NumPy, and SciPy for advanced data analysis and statistical modeling, resulting in more accurate business insights.

• Leveraged Tableau and Power BI to create interactive data visualizations and dashboards, facilitating data-driven decision-making for business stakeholders.

• Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python (see the sketch at the end of this section).

• Managed and optimized SQL databases, creating complex queries and performing data transformations.

• Conducted performance tuning of SQL queries and database operations, leading to a 20% improvement in query response times.

• Developed interactive visualizations and reports using Tableau and MS-Excel to effectively communicate findings to stakeholders.

• Utilized SQL for data extraction and manipulation, performing complex queries for data analysis and reporting.

• Prepared detailed monthly reports, highlighting project KPIs and progress, for management and key stakeholders.

• Conducted software reconciliations for vendors, ensuring compliance with licensing agreements and optimizing costs.

• Developed and maintained Python-based ETL pipelines to process and analyze large datasets.

• Worked with Apache Spark to build scalable data processing solutions, accelerating data analysis and reporting capabilities.
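
A small pandas/NumPy sketch of the cleaning and feature-scaling steps referenced above. The input file and column names are hypothetical.

```python
# Minimal pandas/NumPy sketch of cleaning and feature scaling.
# The input file and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical input

# Cleaning: drop duplicate rows and fill missing numeric values with the median.
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Feature scaling: min-max scale numeric columns to the [0, 1] range.
mins, maxs = df[numeric_cols].min(), df[numeric_cols].max()
df[numeric_cols] = (df[numeric_cols] - mins) / (maxs - mins).replace(0, 1)

# Feature engineering example: log-transform a skewed amount column, if present.
if "amount" in df.columns:
    df["log_amount"] = np.log1p(df["amount"])

print(df.describe())
```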

Projects

Sentiment Analysis for Product Reviews

• Developed Python code to perform sentiment analysis of product reviews on an e-commerce website and recommend products to users with similar interests based on ratings (a minimal sketch follows below).

• Cleaned and analyzed the data using the pandas, Matplotlib, NumPy, math, and seaborn libraries.
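
A minimal sketch of the review-scoring step. The project's exact method is not specified above, so NLTK's VADER analyzer is used here as one reasonable choice, and the input file and columns are assumptions.

```python
# Illustrative sketch of review sentiment scoring with pandas. NLTK's VADER
# analyzer is an assumed choice; the file and column names are placeholders.
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

# Assumed columns: product_id, review_text, rating.
reviews = pd.read_csv("reviews.csv")

sia = SentimentIntensityAnalyzer()
# Compound score in [-1, 1]: overall negative vs. positive sentiment per review.
reviews["sentiment"] = reviews["review_text"].astype(str).apply(
    lambda text: sia.polarity_scores(text)["compound"]
)

# Aggregate per product and combine sentiment with the average rating.
summary = (reviews.groupby("product_id")
           .agg(avg_rating=("rating", "mean"),
                avg_sentiment=("sentiment", "mean"),
                n_reviews=("review_text", "count"))
           .sort_values(["avg_sentiment", "avg_rating"], ascending=False))

print(summary.head(10))  # top candidates to recommend
```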

Online Social Networking Platform – DBMS

• Created a backend for an online social networking portal that loads data from CSV files into a database using a Python script (see the sketch after this list).

• Designed a relational database with tables for the online social networking portal using an SQL server, incorporating strong and weak entities with non-key attributes and surrogate keys.
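
A minimal sketch of the CSV-to-database loading script referenced above, using pyodbc as one common SQL Server driver. The connection string, table, and column names are placeholders rather than the project's actual schema.

```python
# Sketch of the CSV-to-database loading step. pyodbc is used here as one
# common SQL Server driver; connection details, table, and column names
# are placeholders, not the project's actual schema.
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=social_network;UID=app_user;PWD=***"
)
cursor = conn.cursor()

# Read the CSV rows into tuples matching the target table's columns.
with open("users.csv", newline="", encoding="utf-8") as f:
    rows = [(r["user_id"], r["full_name"], r["email"]) for r in csv.DictReader(f)]

# Bulk-insert the rows with a parameterized statement.
cursor.fast_executemany = True
cursor.executemany(
    "INSERT INTO users (user_id, full_name, email) VALUES (?, ?, ?)", rows
)
conn.commit()
conn.close()
```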

Data Integration Using Azure Services

• Utilized the Twitter API to extract data such as tweets, user profiles, and hashtags, retrieving the data in JSON format (see the sketch after this list).

• Implemented data transformation tasks within Azure Data Factory (ADF) to process the raw Twitter data, using ADF's data flow capabilities to cleanse, filter, and enrich the extracted tweets, extracting essential information like tweet text, user location, timestamps, and user mentions.
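
A minimal sketch of the Twitter extraction step referenced above, calling the Twitter API v2 recent-search endpoint with a bearer token. The query, requested fields, and output path are illustrative; the landed JSON would then be picked up by the ADF pipeline described above.

```python
# Sketch of the Twitter data-extraction step via the Twitter API v2
# recent-search endpoint. Query, fields, and output path are illustrative.
import json
import requests

BEARER_TOKEN = "***"  # assumed to come from configuration, not hard-coded

def fetch_recent_tweets(query: str, max_results: int = 50) -> dict:
    url = "https://api.twitter.com/2/tweets/search/recent"
    params = {
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,author_id,geo",
        "expansions": "author_id",
        "user.fields": "location,username",
    }
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    payload = fetch_recent_tweets("#dataengineering -is:retweet")
    # Land the raw JSON where the ADF pipeline can ingest it downstream.
    with open("tweets_raw.json", "w", encoding="utf-8") as f:
        json.dump(payload, f)
```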


