
Python Developer Data Engineer

Location:
Kansas City, MO
Salary:
72/hr
Posted:
March 31, 2023


Resume:

Akhil Kumar Komma

Sr. Big Data Engineer

Email: adv9ca@r.postjobfree.com

Phone: 919-***-****

LinkedIn URL: https://www.linkedin.com/in/akhilkumarkomma/

Professional Summary:

Over 6 years of experience with Big Data technologies, Python-based web application development, and Linux administration.

Expertise in installing, configuring, maintaining, and administering Hadoop clusters using the Apache, Cloudera, and Amazon Web Services (AWS) distributions.

Excellent knowledge of Sqoop and YARN, two key components of the Hadoop ecosystem.

Expertise with the tools in Hadoop Ecosystem including Spark, Hive, HDFS, MapReduce, Sqoop, Kafka, Yarn, Pig, Flume, Oozie and HBase.

Established design and implementation guidelines and procedures for Hadoop-based applications.

Experience with NoSQL databases such as MongoDB, Cassandra, and HBase.

Expertise in data modeling and database performance tuning.

Created numerous MapReduce applications using Apache Hadoop for handling Big Data.

Knowledge of Hadoop environments from Hortonworks and Cloudera.

Created automated scripts for database-related tasks including RUNSTATS, REORG, REBIND, COPY, LOAD, BACKUP, IMPORT, and EXPORT using Unix Shell.

Solid understanding of Sqoop, as well as analysis skills using Pig and Hive.

Expertise with job workflow scheduling and monitoring tools such as Oozie.

Extensive knowledge of Talend Big Data Studio.

Practical experience with AWS Redshift for loading large data sets.

A solid grasp of XML techniques, especially Web Services (XML, XSL, and XSD).

Proficient working knowledge of a variety of relational databases, including Oracle, MS SQL Server, Postgres, and MS Access 2000, as well as experience with Hibernate for mapping object-oriented domain models to relational tables.

Proficiency in using time series databases like Graphite and Kairos DB.

Good experience in working with various Python Integrated Development Environments like PyCharm, Jupyter Notebook, VS Code.

Experienced in leveraging Snowflake's cloud data platform and performing various data engineering tasks, including data ingestion, transformation, and analysis using Snowflake.

Implemented cascade parameters on SSRS reports when appropriate to give users of the reports the most flexibility possible.

Participated in log file management, importing logs older than 7 days into HDFS, removing them from the log folder, and retaining them in HDFS for 12 months.

Proficient in setting up, managing, and maintaining Hadoop clusters for the main Hadoop distributions.

Expertise in development support activities including installation, configuration and successful deployment of changes across all environments.

Expertise with Tableau Desktop for data visualization using a variety of charts, including bar charts, line charts, combination charts, pivot tables, scatter plots, pie charts, and packed bubbles, as well as using several comparison measures, including Individual Axis, Blended Axis, and Dual Axis.

Knowledge of and expertise in utilizing ETL and data warehousing tools.

Solid working knowledge of UML and use case design in OOA and OOD.

Used a variety of project management techniques, including Waterfall, Spiral, and Agile/SCRUM.

A thorough understanding of Scrum, Continuous Integration, and Test-driven Development methods.

Utilized various AWS Cloud services like S3, EMR, Redshift, Athena, and Glue Metastore for building data lakes in cloud and creating various data engineering pipelines utilizing all AWS cloud services.

As a Python developer, built numerous projects using the Flask framework and, more recently, FastAPI.

Specialized knowledge of writing Web Services.

Proven ability to build Teradata databases, users, tables, triggers, macros, views, stored procedures, functions, packages, joins, and hash indices.

Recorded test cases and defects using HP Quality Center.

Quick to learn and adapt to new technologies and changing situations; self-motivated, focused team player with exceptional interpersonal, technical, and communication skills.

Technical Skills:

Big Data

Hadoop, Hive, Sqoop, Pig, HBase, MongoDB, Cassandra, PowerPivot, Flume, MapReduce

Operating Systems

Windows, Ubuntu, Red Hat Linux, Linux, UNIX, Centos

Project Management

Plan View, MS-Project

Cloud Services

Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Programming or Scripting Languages

Python, SQL, C, C++, Unix Shell Scripting

Modelling Tools

UML

IDE/GUI

PyCharm, Jupyter Notebook

Framework

Tornado, Flask, Fast API

Database

MySQL, MS SQL Server, Oracle, MS Access, KairosDB, MongoDB, Elasticsearch, Graphite, Redis

Middleware

WebSphere, TIBCO

ETL

Informatica, Talend, AWS Glue

Business Intelligence

OBIEE, Business Objects, Tableau

Testing

Quality Center, Win Runner, Load Runner, QTP

Professional Experience:

Client: Charter Communications, St. Louis, MO Jan 2022 – Present

Role: Senior Big Data Engineer

Responsibilities:

Designed and implemented data pipelines using Hive, Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.

Developed Hive User Defined Functions (UDF) to parse the present raw data to get the transaction times from a specific account for a particular time period.

Used AWS SageMaker to quickly build, train, and deploy Machine Learning (ML) models.

Worked on a migration project, creating scripts to migrate data from on-premises HDFS to AWS S3.

Orchestrated multiple ETL jobs using AWS Glue.

Designed, developed, and maintained ETL workflows using AWS Glue stack components including AWS Glue Data Catalog, ETL Engine, Studio, and Crawler.

Leveraged AWS Glue ETL Engine to extract data from various sources such as Amazon S3, RDS, and Redshift, transformed the data using PySpark or SparkSQL, and loaded it into the desired data target.
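
Illustrative sketch of a Glue PySpark job of this shape; the catalog database, table, and bucket names are placeholders, not the client's actual objects.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Resolve the job name passed in by Glue at runtime
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (placeholder names)
    source = glue_context.create_dynamic_frame.from_catalog(
        database="analytics_db", table_name="raw_transactions")

    # Simple transformation: rename/cast columns before loading
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("txn_id", "string", "transaction_id", "string"),
                  ("amount", "double", "amount", "double")])

    # Write the result back to S3 as Parquet
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/transactions/"},
        format="parquet")

    job.commit()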

Used AWS Glue Crawler to automatically discover and catalog metadata about data sources, ensuring consistency and accuracy of data schema and partitions.

Maintained and optimized AWS Glue Data Catalog, ensuring that data sources and targets were properly registered and managed, and that data schemas were up-to-date.

Consistently delivered successful end-to-end data and business intelligence implementations, demonstrating expertise in architecting robust ETL pipelines.

Used AWS Step Functions to build event-driven serverless applications that respond to changes in real-time data streams.

Integrated AWS Step Functions with AWS services such as Amazon SNS, Amazon SQS, and AWS Glue to create end-to-end data processing pipelines.

Built complex workflows that combine AWS Lambda functions, AWS Batch jobs, and other AWS services to orchestrate distributed applications.

Worked on AWS Lambda to run code without managing servers, triggering functions from S3 and SNS events.

Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS Lambda, creating functions in Python for specific events based on use cases.
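
A simplified sketch of that pattern (table, cluster, and column names are placeholders); a production pipeline would usually stage records to S3 and COPY them, but the row-level version keeps the example short.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    redshift_data = boto3.client("redshift-data")

    def lambda_handler(event, context):
        # Scan the source DynamoDB table (placeholder name)
        table = dynamodb.Table("orders")
        items = table.scan().get("Items", [])

        # Load each record into Redshift via the Redshift Data API
        for item in items:
            redshift_data.execute_statement(
                ClusterIdentifier="analytics-cluster",
                Database="dev",
                DbUser="etl_user",
                Sql="INSERT INTO orders (order_id, amount) VALUES (:id, :amount)",
                Parameters=[
                    {"name": "id", "value": str(item["order_id"])},
                    {"name": "amount", "value": str(item["amount"])},
                ],
            )
        return {"loaded": len(items)}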

Implemented the AWS cloud computing platform using RDS, Python, DynamoDB, S3, and Redshift.

Moved data from traditional databases such as MS SQL Server, MySQL, and Oracle into Hadoop (HDFS).

Worked on developing workflow in Oozie to automate the tasks of loading data into HDFS and preprocessing with Pig.

Involved in the creation of Hive tables, loading and analyzing the data by using hive queries.

Installed and configured EC2 instances on AWS (Amazon Web Services) to establish clusters in the cloud.

Worked on CI/CD solution, using Git, Jenkins, Docker and Kubernetes to setup and configure Big data architecture on AWS cloud platform.

Exposed to all aspects of the software development life cycle (SDLC), including analysis, planning, development, testing, implementation, and post-production analysis. Worked in Waterfall and Scrum/Agile methodologies.

Proficient in utilizing Boto3, the AWS SDK for Python, to interact with AWS services and build, deploy, and manage applications on the AWS Cloud.
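
A minimal example of the Boto3 usage described; the bucket and key names are placeholders.

    import boto3

    # Create a service client (credentials come from the environment or IAM role)
    s3 = boto3.client("s3")

    # Upload a local file and list the bucket contents (placeholder names)
    s3.upload_file("report.csv", "example-bucket", "reports/report.csv")

    response = s3.list_objects_v2(Bucket="example-bucket", Prefix="reports/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])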

Created Tableau reports with complex calculations and worked on Ad-hoc reporting using PowerBI.

Used Pig as ETL tool to do event joins, transformations, filter and some pre-aggregations.

Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.

Proficient in creating and managing Teradata database structures, including tables, views, triggers, stored procedures, and functions, as well as utilizing join operations and hash indices to optimize query performance.

Integrated a Kafka publisher into Spark jobs to capture errors from the Spark application and push them into a database.

Involved in processing log files generated from various sources into HDFS for further processing through Elasticsearch, Kafka, Flume, and Talend, and processed the files using Piggybank.

Loaded and transformed ORC data from S3 into Snowflake and performed queries using SnowSQL.
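
A sketch of loading staged ORC files into Snowflake from Python; the connection parameters, stage, and table names are placeholders.

    import snowflake.connector

    # Connection parameters are placeholders
    conn = snowflake.connector.connect(
        account="xy12345", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC")

    try:
        cur = conn.cursor()
        # Copy ORC files from an external S3 stage into a target table
        cur.execute("""
            COPY INTO raw_events
            FROM @s3_orc_stage/events/
            FILE_FORMAT = (TYPE = ORC)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
        # Quick validation query
        cur.execute("SELECT COUNT(*) FROM raw_events")
        print(cur.fetchone())
    finally:
        conn.close()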

Wrote custom Kafka consumer code and modified existing producer code in Python to push data to Spark Streaming jobs.
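
A hedged sketch of that consumer/producer pattern using the kafka-python library; broker addresses and topic names are placeholders.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Broker addresses and topic names are placeholders
    consumer = KafkaConsumer(
        "raw-events",
        bootstrap_servers=["broker1:9092"],
        group_id="etl-consumers",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")))

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    for message in consumer:
        event = message.value
        # Drop malformed records, then forward the rest to the topic
        # consumed by the Spark Streaming job
        if "event_id" in event:
            producer.send("clean-events", event)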

Configured a MongoDB cluster with high availability and load balancing, and performed CRUD operations.

Loaded data from flat files into the target database using ETL processes, applying business logic to insert and update records during loading.

Performed day-to-day Database Maintenance tasks including Database Monitoring, Backups, Space, and Resource Utilization.

Used Airflow to schedule end-to-end jobs.

Ensured compliance with regulations and internal policies by implementing data governance measures for a migration project, including data classification, encryption, and access controls.

Experienced in utilizing Terraform for infrastructure automation and management across various cloud providers and in-house solutions.

Implemented Apache Airflow and other dependency enforcement and scheduling tools.

Created multiple DAGs in Airflow for daily, weekly, and monthly loads.
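
A minimal sketch of such a daily DAG; the DAG id, task names, and commands are placeholders.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # DAG id, schedule, and commands are placeholders
    with DAG(
        dag_id="daily_load",
        schedule_interval="@daily",
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract",
                               bash_command="python /opt/etl/extract.py")
        load = BashOperator(task_id="load",
                            bash_command="python /opt/etl/load_to_redshift.py")

        extract >> load  # load runs only after extract succeeds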

Solid experience working with CSV, text, SequenceFile, Avro, Parquet, ORC, and JSON datasets.

Used the Tableau data visualization tool for reports and integrated Tableau with Alteryx for data and analytics.

Utilized Amazon Athena to create views and analyze data, maximizing efficiency and allowing for insightful data analysis.

Built NiFi data pipelines in a Docker container environment during the development phase.

Managed and deployed containerized applications on Amazon EKS, utilizing the fully managed Kubernetes service to automate deployment, scaling, and management, while taking advantage of the security and compliance features of AWS.

Developed scripts for extracting and processing data from SFTP server in Hive data warehousing using Linux shell scripting.

Created Docker images to support models built in different formats and developed in different languages.

Used Maven in building and deploying code in YARN cluster.

Used JIRA as the Scrum tool for the Scrum task board and worked on user stories.

Worked on creating Oozie workflows and coordinator jobs to run jobs on time based on data availability.

Experience in building scripts using Maven and working with continuous integration systems like Jenkins.

Environment: Linux, Python, JSON, Jenkins, Git, HBase, HTML, Flume, HDFS, DB2, Airflow, Apache Hive, SQL, Sqoop, Excel, AWS Glue, AWS SageMaker, Redshift, DynamoDB, ElastiCache, Docker, PowerBI, NiFi, Kubernetes, S3, EFS, AWS EKS, Hive, Pig, Lambda, Snowflake, Amazon EMR, Kafka, Tableau, EventHub, Solr, ADLS, Snowball, SonarQube, RDS, Terraform

Client: Mphasis, India June 2020 – Dec 2021

Role: Senior Big Data Engineer / Senior Software Engineer

Responsibilities:

Migrated the existing SQL code to the data lake and sent the extracted reports to consumers.

Created PySpark engines that process huge environmental data loads within minutes, implementing various business logic.

Worked extensively on Data mapping to understand the source to target mapping rules.

Developed data pipelines using Python, PySpark, Hive, Pig, and HBase.

Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).

Managed and audited extensive financial data sets, demonstrating strong attention to detail and ability to handle complex financial information.

Developed PySpark scripts to encrypt specified set of data by using hashing algorithms concepts.
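
A sketch of masking a sensitive column with a built-in hash function in PySpark; the paths, column names, and salt handling are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sha2, col, concat_ws, lit

    spark = SparkSession.builder.appName("mask-pii").getOrCreate()

    df = spark.read.parquet("/data/raw/customers")  # path is a placeholder

    # Replace a sensitive column with a salted SHA-256 digest
    salt = "static-salt"  # in practice the salt would come from a secret store
    masked = (df
              .withColumn("email_hash",
                          sha2(concat_ws("|", col("email"), lit(salt)), 256))
              .drop("email"))

    masked.write.mode("overwrite").parquet("/data/masked/customers")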

Designed and implemented part of the Knowledge Lens product, which ingests environmental data in real time from all the industries.

Performed data analysis and data profiling using SQL on various extracts.

Effectively interacted with business experts and data scientists and defined planning documents and the configuration process for different sources and targets.

Created reports of analyzed and validated data using Apache Hue and Hive, generated graphs for data analytics.

Worked on data migration into HDFS and Hive using Sqoop.

Worked in complete Software Development Life Cycle (analysis, design, development, testing, implementation and support) using Agile Methodologies (Jira).

Performed Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell.

Involved in business requirements gathering for successful implementation and POC (proof-of-concept) of Hadoop and its ecosystem.

Performed database cloning for testing purposes.

Wrote multiple batch processes in Python and PySpark to process large volumes of time-series data, generate reports, and schedule delivery of those reports to industries.

Created analytical reports on this real time environmental data using Tableau.

Generated final reporting data using Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.

Responsible for HBase bulk load process, created HFiles using MapReduce and then loaded data to HBase tables using complete bulk load tool.

Provided commit and rollback methods for transaction processing.

Fetched data to/from HBase using MapReduce jobs.

Used HBase to store file metadata and maintain file patterns.

Framed the business logic based on customer requirements and implemented it using Talend. Worked on the design, development, and testing of Talend mappings. Created an ETL job infrastructure using Talend Open Studio.

Designed and implemented SQL queries for data analysis and data validation and to compare data between the test and production environments.

Created tables, loaded data, and wrote Hive queries that run internally as MapReduce jobs.

Created partitioned tables in Hive and analyzed the partitioned and bucketed data and compute various metrics for reporting.

Documented requirements in Jira as a backlog of user stories for team.

Led grooming sessions, sprint planning, retrospectives, and daily stand-ups with the teams during the absence of the Scrum Master.

Scheduled and managed jobs to remove duplicate log data files in HDFS using Oozie.

Used Flume extensively in gathering and moving log data files from Application Servers to a central location in Hadoop Distributed File System (HDFS).

Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.

Created views in Tableau Desktop that were published to the internal team for review and further data analysis and customization using filters and actions.

Designed the Azure Cloud relational databases analyzing the clients and business requirements.

Worked on environmental data migration from On-Premises to cloud databases (Azure SQL DB).

Implemented disaster recovery and failover servers in cloud by replicating the environmental data across multiple regions.

Extensive experience with Azure Kubernetes Service (AKS).

Have extensive experience in creating pipeline jobs and implementing scheduling triggers.

Mapped data flows using Azure Data Factory (V2) and used Key Vault to store credentials.
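
A sketch of pulling a credential from Key Vault in supporting Python code instead of hard-coding it; the vault URL and secret name are placeholders.

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # Vault URL and secret name are placeholders
    credential = DefaultAzureCredential()
    client = SecretClient(vault_url="https://example-vault.vault.azure.net/",
                          credential=credential)

    # Retrieve a connection string at runtime
    sql_conn_str = client.get_secret("sql-connection-string").value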

Ample experience working on Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).

Experience in creating Elastic pool databases and scheduled Elastic jobs for executing T-SQL procedures.

Developed business intelligence solutions using SQL Server and loaded data into SQL and Azure cloud databases.

Worked on creating tabular models on Azure Analysis Services to meet customers' reporting requirements.

Built pipelines using Azure Data Factory, Azure Databricks and loading data into Azure data lake, Azure data warehouse and monitored databases.

Created tabular models on Azure Analysis Services based on business requirements.

Experienced in building modern data warehouses in Azure Cloud, including building reporting in PowerBI.

Experienced in designing and implementing data integration solutions using Talend Catalog, with a strong focus on optimizing data flow, ensuring data quality and integrating various data sources.

Proposed designs with Azure cost and spend in mind and developed recommendations to right-size the data infrastructure.

Worked on creating correlated sub-queries to resolve complex business questions involving multiple tables across various data sets.

Developed ETL jobs in Python to extract, validate, transform, and load data into various modules.

Developed a large number of backend modules using the Python Flask web framework with ORM models.
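
A minimal sketch of a Flask backend module with an ORM model (Flask-SQLAlchemy assumed); the model, route, and database URI are illustrative placeholders.

    from flask import Flask, jsonify
    from flask_sqlalchemy import SQLAlchemy

    app = Flask(__name__)
    app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///demo.db"  # placeholder
    db = SQLAlchemy(app)

    class Station(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        name = db.Column(db.String(80), nullable=False)

    @app.route("/stations")
    def list_stations():
        # Query through the ORM and return JSON
        return jsonify([{"id": s.id, "name": s.name} for s in Station.query.all()])

    if __name__ == "__main__":
        with app.app_context():
            db.create_all()
        app.run()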

Worked on projects designed with waterfall and agile methodologies, delivered high-quality deliverables on time.

Extensively worked on tuning and optimizing SQL queries to reduce run time, mainly by working on indexes.

Implemented Web-Services for several modules using RESTful services.

Used various debuggers like pdb, gdb and tools like pylint and Coverity for static analysis.

Developed the XML Schema documents as well as framework for parsing XML documents.

Worked on Dynamic and static queries for Microsoft SQL server, developed complex inner and outer joins. Analyzed and enhanced the performance of various queries and stored procedures.

Expertise in configuration, logging, and exception handling.

Oversaw and evaluated Hadoop log files, analyzed SQL scripts, and designed the solution for the process using PySpark.

Used PyCharm, Jupyter Notebook as IDE and Git for version control. Testing and deploying the application on Tomcat and NGINX. Used Jenkins for continuous integration of the code.

Performed stress and performance testing and benchmarking for the cluster.

Environment: Linux, Windows 2000/XP, NGINX, Python, Flask framework, Web Services, pdb, gdb, pylint, PySpark, Tomcat, Jenkins, Selenium, JSON, PyCharm, Jupyter, Git, HBase, HTML, Flume, HDFS, REST, Oozie, DB2, Apache Hive, SQL, Talend, Sqoop, Apache Hue, Excel, pivot tables, Talend Open Studio, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Data Warehouse, Azure Blob, AKS

Client: Arrow Technologies, Chennai, India July 2018 – April 2020

Role: Hadoop Developer

Responsibilities:

Imported modules like NumPy and Keras on Spark session and created directories for main data and output data.

Developed a patient health data analysis and visualization platform on GCP, enabling healthcare providers to securely store, analyze, and visualize patient data for improved decision-making and patient care.

Analyzed and read training and testing datasets into the data directory and into Spark variables for easy access, and trained models based on a sample submission.

Hands on experience with big data tools like Hadoop, Spark, Hive, Pig.

Worked on databases like Cassandra, SQL.

Developed multiple MapReduce jobs in Java for data cleaning.

Worked on NoSQL databases including HBase, MongoDB.

Created shell scripts for automating day-to-day processes.

Created Hadoop Streaming jobs using Python.
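
The general shape of a Hadoop Streaming job in Python, shown here as a word-count style mapper and reducer reading stdin; the actual jobs processed different data, so this is only the pattern. Jobs like this are submitted through the hadoop-streaming jar with -mapper and -reducer pointing at the two scripts.

    #!/usr/bin/env python
    # mapper.py -- reads raw lines from stdin and emits tab-separated <key, 1> pairs
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by key; sum the counts per key
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            print("%s\t%d" % (current_key, total))
            total = 0
        current_key = key
        total += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, total))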

Involved in story-driven agile development methodology and actively participated in daily scrum meetings.

Worked on reading images and for easier data manipulation transformed all the images to NumPy arrays.

Created web services for the internal product and worked on optimizing the performance of the product.

Monitored Apache Hadoop cluster connectivity and security.

Managed and analyzed Apache Hadoop log files.

Created shell scripts to monitor the health of Hadoop daemon services and created contingencies based on any warnings or failure conditions.

Proficient in developing, deploying, and supporting scalable and high-performance ETL pipelines leveraging distributed data movement technologies and approaches, with hands-on experience in streaming ingestion and processing.

Proficient in utilizing Informatica PowerCenter to design and implement ETL solutions for cloud-based data platforms, including Google Cloud Platform (GCP) tools such as Cloud Storage, BigQuery, and DataProc.

Provided ad-hoc queries and data metrics to the Business Users using Hive, Pig.

Performed server-side data processing and data validation.

Worked on a POC to evaluate various cloud platforms, including Google Cloud Platform (GCP).

Experience working with technologies like GCP Dataproc, BigQuery, and GCS.
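
A minimal sketch of querying BigQuery from Python; the project, dataset, and column names are placeholders.

    from google.cloud import bigquery

    # Credentials come from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS)
    client = bigquery.Client(project="example-project")

    query = """
        SELECT patient_id, COUNT(*) AS visit_count
        FROM `example-project.health.visits`
        GROUP BY patient_id
        ORDER BY visit_count DESC
        LIMIT 10
    """

    # Run the query and iterate over the result rows
    for row in client.query(query).result():
        print(row.patient_id, row.visit_count)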

Worked on Cloud Dataflow and Apache Beam.

Designed and implemented high-availability and self-recovering solutions for web and database applications within GCP.

Managed infrastructure with Cloudera Manager.

Proficient in C and C++ programming with hands-on experience on Unix/Linux operating systems.

Experience in using Cloud Shell for multiple tasks and for deploying services.


Environment: Hadoop, MapReduce, Spark, Hive, HDFS, Pig, Cloudera, Flume, C, C++, HBase, MongoDB, Cassandra, Oracle, NoSQL, Unix/Linux, Amazon Web Services, Google Cloud Platform, Apache Beam, Cloud Shell, GCP Dataproc, BigQuery, Unix shell.

Client: Plumsoft Solutions, Hyderabad, India Jan 2017 – Jun 2018

Role: Python Developer

Responsibilities:

Query optimization through SQL tools for quick response time.

Worked on Database views, tables and objects.

Worked on various phases of projects like design, development and testing and deployment phases.

Worked on various bug fixes in the dev environment.

Used GitHub as a version control tool.

Developed backend modules using the Tornado framework, later moving to the Flask framework.

Utilized Python web frameworks such as Flask and Django to develop and deploy RESTful web services and clean APIs.

Proficient in Python asyncio, an event-driven framework for writing efficient and scalable concurrent code, utilizing coroutines, event loops, and asynchronous I/O to achieve high performance in network applications.
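
A small illustration of the asyncio pattern described above: coroutines scheduled concurrently on one event loop; the task names and delays are made up.

    import asyncio

    async def fetch(name, delay):
        # Simulate non-blocking I/O (e.g., a network call)
        await asyncio.sleep(delay)
        return "%s done after %.1fs" % (name, delay)

    async def main():
        # Run several coroutines concurrently and gather their results
        results = await asyncio.gather(
            fetch("job-a", 1.0), fetch("job-b", 0.5), fetch("job-c", 0.2))
        for r in results:
            print(r)

    asyncio.run(main())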

Wrote shell scripts to trigger scheduled jobs.

Wrote Python services to extract information from XML documents.

Wrote Python parsers to extract useful information from the design database.
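
A sketch of the XML-extraction pattern using the standard library; the file, element, and attribute names are placeholders.

    import xml.etree.ElementTree as ET

    # File name and element/attribute names are placeholders
    tree = ET.parse("design_export.xml")
    root = tree.getroot()

    # Pull selected fields out of each <component> element
    for component in root.findall(".//component"):
        name = component.get("name")
        version = component.findtext("version", default="unknown")
        print(name, version)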

Utilized Linux/Unix platforms for developing and deploying Python applications.

Experienced working on Python versions 2.7 and 3.

Have in-depth knowledge and experience in using the leading Python data analysis libraries such as Pandas, Xarray, and NumPy for data cleaning, manipulation, and analysis.
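
A small example of the kind of cleaning and analysis work described, using Pandas and NumPy; the file and column names are placeholders.

    import numpy as np
    import pandas as pd

    # File and column names are placeholders
    df = pd.read_csv("readings.csv", parse_dates=["timestamp"])

    # Basic cleaning: drop duplicates, fill gaps, clip outliers
    df = df.drop_duplicates(subset=["sensor_id", "timestamp"])
    df["value"] = df["value"].fillna(df["value"].median())
    df["value"] = np.clip(df["value"], 0, df["value"].quantile(0.99))

    # Simple analysis: daily mean per sensor
    daily = (df.set_index("timestamp")
               .groupby("sensor_id")["value"]
               .resample("D").mean())
    print(daily.head())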

Experienced in developing web-based applications using Python, SQL, REST, XML, and AWS.

Hands on experience with multi-threading and multi-processing.

Developed SQL scripts for automation purposes.

Automated the creation of pdfs and word documents and streamlined the process of sending these documents to the respective clients.

Created a registration portal using MySQL as the database for over 3500 clients.

Worked on creating sample scripts for testing purposes.

Wrote RESTful services to query, update, and delete data in SQL.

Environment: Python 2.7, Python 3, SQL, Excel, GitHub, Oracle, AWS, Linux/Unix, Flask, Django, Web Services, Selenium, HTML.

Certifications:

Microsoft Certified: Azure Data Engineer Associate


