
Sr. Big Data Engineer

Location:
Tappahannock, VA
Posted:
August 04, 2023

Resume:

Jesus Flores

Big Data/ Cloud Engineer

Email: adypjh@r.postjobfree.com; Phone: 347-***-****

PROFESSIONAL PROFILE

•A seasoned professional offering over 8 years of experience in the development of custom Big Data Solutions both on-premises and in the cloud.

•Hands-on experience using HiveQL and SQL databases (SQL Server, SQLite, MySQL, Oracle), as well as data lakes and cloud repositories, to pull data for analytics.

•Optimized data pipelines by 44% using managed cloud services.

•Highly knowledgeable in data concepts and technologies, including AWS data pipelines and cloud repositories (Amazon AWS, Azure, Cloudera).

•Proficient in working with AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda).

•Used AWS services such as S3 and Redshift for storage of large datasets.

•Imported data from AWS S3 into Spark on EMR, preparing the data for ingestion into other systems (see the illustrative sketch at the end of this profile).

•Reduced query response time by 40% through query optimization and indexing techniques.

•Familiarity with integrating Docker and Kubernetes into CI/CD pipelines for automated builds, tests, and deployments.

•Strong understanding of data governance principles and experience in implementing security controls and access policies using Azure Active Directory, Azure Key Vault, or Azure Private Link.

•Expertise in selecting and implementing appropriate Azure data storage solutions, such as Azure Blob Storage, Azure Data Lake Storage, or Azure Cosmos DB, based on data characteristics and usage patterns.

•Proficient in data modeling techniques such as star schema, snowflake schema, and data normalization.

•Expertise in selecting and implementing appropriate GCP data storage solutions, such as Google Cloud Storage, Google Cloud Bigtable, or Google Cloud Firestore, based on data characteristics and usage patterns.

•Proficient in using BigQuery's SQL capabilities to perform complex data analysis and generate actionable insights.

•Demonstrated proficiency in deploying, managing, and optimizing Apache Hadoop and Apache Spark clusters using Dataproc.

•Skilled in building end-to-end data pipelines, from data ingestion to data transformation and data loading, using Azure Data Factory and Azure Databricks, ensuring seamless and reliable data flow.

•Experience in utilizing Databricks in multi-cloud or hybrid cloud environments for data engineering and analytics workloads.

•Experience in performance tuning, including SQL and Spark query tuning.

•Strong technical skills in Scala, Python, and experience with big data technologies such as Hadoop and Spark.

•Expert at working with HDFS and importing data from RDBMSs into HDFS using the Sqoop tool.

•In-depth knowledge of real-time ETL and Spark analytics using Spark SQL, with visualization.

•Strong analytical skills for troubleshooting and problem-solving.

•Experience using Hadoop clusters, HDFS, Hadoop tools, Spark, Kafka, and Hive for social, media, and financial analytics within the Hadoop ecosystem.

•Communicates with multiple stakeholders and project managers.
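
A minimal, illustrative PySpark sketch of the S3-to-Spark-on-EMR ingestion pattern noted in the profile above; the bucket names, paths, and columns are hypothetical placeholders, not actual project values.

# Minimal PySpark sketch: read raw CSV data from S3 on an EMR cluster,
# apply a light cleanup, and write the result back to S3 as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-ingest-example").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .csv("s3://example-raw-bucket/events/2023/*.csv"))

cleaned = (raw
           .dropDuplicates()
           .withColumn("event_ts", F.to_timestamp("event_ts")))

(cleaned.write
 .mode("overwrite")
 .parquet("s3://example-curated-bucket/events/"))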

TECHNICAL SKILLS

IDE: Eclipse, IntelliJ, PyCharm

PROJECT METHODS: Agile, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Patterns

HADOOP DISTRIBUTIONS: Hadoop, Cloudera Hadoop

CLOUD PLATFORMS: Amazon AWS, Azure, GCP

AZURE: Data Factory, Databricks, SQL Database, Synapse Analytics, HDInsight, Data Lake Storage and Cosmos DB

AWS: RDS, EMR, Glue, Redshift, S3, Lambda, Kinesis, Elasticsearch (ELK), IAM, CloudFormation.

GCP: BigQuery, Dataflow, Dataproc, Google Cloud Storage, Data Catalog, Cloud Pub/Sub, Data Transfer Service.

DATABASES & DATA WAREHOUSES: Apache HBase, MySQL, SQL Server, BigQuery, Hive.

FILE SYSTEMS: HDFS, S3, GCS.

DATA VISUALIZATION TOOLS: Tableau, Power BI.

PROGRAMMING: Python, Java, Shell Scripting.

ORCHESTRATION: Apache Airflow, Oozie

CI/CD: Git, GitHub, Jenkins, Docker, Kubernetes.

PROFESSIONAL EXPERIENCE

Jan 2022 - Present

Infosys, New York, NY

Infosys is a global leader in next-generation digital services and consulting, enabling clients in more than 50 countries to navigate their digital transformation.

Designation: Sr. Big Data Engineer

•Used Spark SQL to process huge amounts of structured data.

•Developed multiple Spark Streaming and batch Spark jobs using Python on AWS.

•Developed a Kafka queue system to collect log data without data loss and publish it to various downstream sources.

•Programmed Spark code using Python and Spark-SQL/Streaming for faster processing of data.

•Developed AWS CloudFormation templates to create a custom infrastructure for the pipeline.

•Implemented AWS IAM user roles and policies to authenticate and control access.

•Utilized a cluster of three Kafka brokers to handle replication needs and allow for fault tolerance.

•Monitored resources such as Amazon database services, CPU, memory, and EBS volumes.

•Automated, configured, and deployed instances on AWS and in data centers.

•Developed and maintained build and deployment scripts for Test, Staging, and Production environments using Maven and Shell.

•Worked with JSON responses in Kafka consumer written in Python.

•Implemented ETL pipelines using AWS Glue to consume data from S3 buckets and RDS.

•Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (see the sketch at the end of this section).

•Decoded raw data into JSON before sending batched streaming files to the Kafka producer.

•Specified nodes and performed data analysis queries on Amazon Redshift Clusters on AWS.

•Worked on AWS Kinesis for processing huge amounts of real-time data.

•Populated database tables via AWS Kinesis and AWS Redshift.
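
A minimal sketch of the event-driven Lambda pattern described in the bullets above, assuming an S3 put-event trigger; the bucket, table, and field names are hypothetical.

# Minimal AWS Lambda sketch: triggered by an S3 put event, it reads object
# metadata and records an audit row in a DynamoDB table. Names are hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-ingest-audit")  # hypothetical table name

def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        # Write a simple audit item for each new object that lands in S3.
        table.put_item(Item={"object_key": key, "bucket": bucket, "size_bytes": size})
    return {"processed": len(records)}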

May 2020 – Dec 2021

SAP, Newtown Square, PA

SAP is the market leader in enterprise application software, helping companies of all sizes and in all industries run better by redefining ERP and creating networks of intelligent enterprises that provide transparency, resiliency, and sustainability across supply chains.

Sr Big Data Engineer

•Worked closely with customers and stakeholders to optimize their SQL queries, enabling a move to a more efficient data structure and better performance.

•Designed data models to optimize the querying performance in the AWS Redshift data warehouse.

•Orchestrated workflows in Apache Airflow to run ETL pipelines using tools in AWS (see the sketch at the end of this section).

•Exported data from on-premises data sources to AWS RDS SQL databases.

•Created Spark jobs to migrate data from on-premises systems to the AWS cloud.

•Developed Spark jobs for data processing and cleaning using Python as the programming language.

•Created test data and unit-testing scripts to test Python scripts and validate data as part of the CI/CD pipeline.

•Implemented AWS Secrets Manager into Glue jobs to help encrypt account numbers and other private information for client hashing and protection.

•Hands-on application with AWS Cloud (PaaS & IaaS).

•Developed large-scale enterprise applications using Spark for data processing in the AWS cloud platform.

•Scheduled workflows and ETL processes for constant monitoring and support with AWS Step Functions to orchestrate lambda functions in Python.

•Created Amazon SNS topics for email notifications on several Lambda functions, Glue jobs, and tables, using Python as the programming language.

•Implemented SQL queries in AWS Athena to view table contents and data variables in multiple datasets for data profiling.

•Partitioned and bucketed log file data into ranges to organize the information according to business needs.

•Communicated with the team over MS Teams, with project tracking in Jira.

•Followed Agile methodologies, with daily team stand-ups, weekly direct-report meetings, and weekly customer touchpoints throughout the software development life cycle.

•Served in an advisory role regarding Spark best practices.
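
A minimal Airflow sketch of the orchestration pattern described above: a daily DAG that starts an AWS Glue job and runs a follow-up validation step. The DAG, job, and task names are hypothetical, not the production pipeline.

# Minimal Airflow 2.x sketch: a daily DAG that starts an AWS Glue job via boto3
# and then runs a placeholder validation step. All names are hypothetical.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

def start_glue_job():
    glue = boto3.client("glue")
    run = glue.start_job_run(JobName="example-curation-job")  # hypothetical Glue job
    print("Started Glue run:", run["JobRunId"])

def validate_output():
    # Placeholder for row-count / schema checks (e.g., via Athena or Redshift).
    print("Validation step would run data-quality checks here.")

with DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_glue = PythonOperator(task_id="run_glue_job", python_callable=start_glue_job)
    validate = PythonOperator(task_id="validate_output", python_callable=validate_output)
    run_glue >> validate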

Apr 2019 – May 2020

Qubole, Santa Clara, CA

Qubole, an Idera Inc. company, provides a simple and secure data lake platform for machine learning, streaming, and ad-hoc analytics. Market leaders throughout the world have one common denominator: they are all winning with data, leveraging their data assets to learn from the past using business intelligence tools.

Sr Big Data Engineer

•Migrated on-premises Hadoop Ecosystem to GCP Cloud and designed a new data architecture using Google Cloud Big Data Services.

•Designed and developed data pipelines to ingest data from various sources into Google Cloud Storage and Google BigQuery for Data-at-Rest processing.

•Optimized data processing using Google Cloud Dataproc, Google Cloud Dataflow, and Google BigQuery.

•Developed ETL jobs using Google Cloud Dataflow and Google Cloud Dataprep for data cleansing and transformation.

•Developed batch processing of data sources using Apache Spark on Google Cloud Dataproc.

•Implemented machine learning models using Google Cloud ML Engine and TensorFlow for predictive analytics.

•Created external tables in Google BigQuery, loaded data into the tables, and queried the data using SQL (see the sketch at the end of this section).

•Used Google Cloud Data Transfer Service for efficient transfer of data between Google Cloud Storage, Google BigQuery, and external databases.

•Developed Spark code using Python and Spark-SQL for faster testing and data processing on Google Cloud Dataproc.

•Imported structured data from relational databases using Google Cloud Data Transfer Service to process using Spark on Google Cloud Dataproc and stored data into Google Cloud Storage in CSV format.
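
A minimal sketch of the BigQuery external-table workflow mentioned above, using the google-cloud-bigquery Python client; the dataset, table, columns, and bucket names are hypothetical placeholders.

# Minimal BigQuery sketch: define an external table over CSV files in GCS via
# DDL, then query it with SQL. Dataset, table, and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS example_dataset.events_ext (
  event_id STRING,
  event_ts TIMESTAMP,
  amount FLOAT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://example-bucket/events/*.csv'],
  skip_leading_rows = 1
)
"""
client.query(ddl).result()  # run the DDL and wait for completion

rows = client.query("SELECT COUNT(*) AS n FROM example_dataset.events_ext").result()
for row in rows:
    print("external table row count:", row.n)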

Apr 2017 – Apr 2019

VMWare Inc., Palo Alto, CA

VMware is an American cloud computing and virtualization technology company headquartered in Palo Alto, California. VMware was the first commercially successful company to virtualize the x86 architecture.

Designation: Sr Hadoop Data Developer

•Deployed and managed Hadoop clusters on Azure using HDInsight.

•Designed and implemented data ingestion solutions using Azure Data Factory to move data from various sources to Azure Blob Storage and Azure Data Lake Storage.

•Developed ETL jobs using Azure Data Factory and Databricks.

•Used Azure Databricks for data processing and machine learning tasks.

•Developed Spark jobs using Scala and PySpark on Azure Databricks for batch processing of data sources (see the sketch at the end of this section).

•Implemented Azure Stream Analytics for real-time processing of streaming data.

•Used Azure Data Lake Storage and Azure Synapse Analytics for storing and analyzing big data.

•Designed and developed data models using Azure Synapse Analytics and SQL for data warehousing and reporting.

•Performed data migrations from on-premises databases to Azure using Azure Database Migration Service.

•Worked with Azure Cosmos DB, a NoSQL database, for data storage and processing.

•Used Azure HDInsight with HBase and Hive for querying and analyzing big data.

•Created and managed Azure Data Lake Analytics jobs for large-scale data processing.

•Used Azure Stream Analytics with Event Hubs for processing real-time data streams.

•Designed and implemented data governance policies for data security and compliance.

•Used Azure Log Analytics for monitoring and troubleshooting data processing pipelines.
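
A minimal Databricks notebook sketch of the batch pattern referenced above: read Parquet from Azure Data Lake Storage Gen2, aggregate, and write the result as Delta. The storage account, containers, columns, and secret scope are hypothetical; spark and dbutils are provided by the Databricks runtime.

# Minimal Azure Databricks sketch: read raw Parquet from ADLS Gen2 over abfss://,
# aggregate daily totals, and write the result back as a Delta table.
# Storage account, container, and secret names are hypothetical placeholders.
from pyspark.sql import functions as F

# Authenticate to the storage account with a key kept in a Databricks secret scope.
spark.conf.set(
    "fs.azure.account.key.examplestorageacct.dfs.core.windows.net",
    dbutils.secrets.get(scope="example-scope", key="storage-key"),
)

raw = spark.read.parquet(
    "abfss://raw@examplestorageacct.dfs.core.windows.net/sales/2023/"
)

daily_totals = (raw
                .groupBy(F.to_date("order_ts").alias("order_date"))
                .agg(F.sum("amount").alias("total_amount")))

(daily_totals.write
 .format("delta")
 .mode("overwrite")
 .save("abfss://curated@examplestorageacct.dfs.core.windows.net/sales_daily/"))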

March 2015 – Apr 2017

Reltio, Redwood City, CA

Reltio delivers a contextual customer view in real time based on business roles and objectives, enriching profiles with transactions and interactions to create richer customer profiles.

Designation: Data Engineer

•Worked with Spark to create structured data from the pool of unstructured data received.

•Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

•Implemented advanced procedures like text analytics and processing using in-memory computing capabilities like Apache Spark written in Scala.

•Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, and Elasticsearch.

•Maintained ELK (Elasticsearch, Kibana) and wrote Spark scripts using the Scala shell.

•Implemented Spark using Scala and utilizing Data frames and Spark SQL API for faster processing of data.

•Fine-tuned resources for long-running Spark Applications to utilize better parallelism and executor memory for more caching.

•Used Apache Spark with Scala to process real-time data on large datasets.

•Transferred streaming data from different data sources into HDFS and HBase using Apache Flume.

•Fetched live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.

•Involved in the complete big data flow of the application, from upstream data ingestion into HDFS to processing the data in HDFS using Spark Streaming.

•Developed ETL pipelines using Spark and Hive for performing various business-specific transformations.

•Automated the pipelines in Spark for bulk loads as well as incremental loads of various datasets.

•Worked on building input adapters for data dumps from FTP servers using Apache Spark.

•Integrated Kafka with Spark for real-time data processing.

•Performed streaming data ingestion into the Spark distributed environment using Kafka.

•Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into Cassandra (see the sketch at the end of this section). Performed Elasticsearch and Logstash performance and configuration tuning.

•Responsible for designing and deploying new ELK clusters (Elasticsearch, Logstash, Kibana, Beats, Kafka, ZooKeeper, etc.).

•Queried source databases with Bash scripts and created ETL pipelines into Kibana and Elasticsearch. Involved in data acquisition, data pre-processing, and data exploration for a project in Scala.

•Developed Spark applications for the entire batch processing by using Scala.

•Developed Spark scripts by using Scala shell commands as per the requirement and used PySpark for proof of concept.

•Worked on continuous integration with Jenkins, automating jar builds at the end of each day.
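
A minimal sketch of the streaming path described above, written with Spark Structured Streaming (rather than the RDD-based DStream API used at the time) and the Spark Cassandra Connector; broker, topic, keyspace, and table names are hypothetical.

# Minimal PySpark Structured Streaming sketch: consume JSON events from Kafka,
# parse them into a DataFrame, and append each micro-batch to a Cassandra table.
# Requires the spark-sql-kafka and spark-cassandra-connector packages at submit time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra-example")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "example-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the target Cassandra table.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(table="events", keyspace="example_ks")
     .mode("append")
     .save())

events.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()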

Jan 2014 – Mar 2015

Accenture, Chicago, IL

Accenture (formerly known as Andersen Consulting) is a provider of strategy, consulting, interactive, technology, and operations services with digital capabilities. The company operates in five segments: Communications, Media & Technology; Financial Services; Health & Public Services; Products; and Resources.

•Involved in meetings with a cross-functional team of key stakeholders to derive a set of functional specifications, requirements, use cases, and project plans.

•Involved in researching various available technologies, industry trends, and cutting-edge applications.

•Designed and set up POCs to test various tools, technologies, and configurations, along with custom applications.

•Used instance image files to create instances with Hadoop installed and running.

•Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.

•Connected various data centers and transferred data between them using Sqoop and various ETL tools.

•Extracted data from RDBMSs (Oracle, MySQL) to HDFS using Sqoop (see the sketch at the end of this section).

•Used the Hive JDBC to verify the data stored in the Hadoop cluster.

•Worked with the client to reduce churn rate by reading and translating data from social media websites.

•Integrated Kafka with Spark Streaming for real-time data processing.

•Imported data from disparate sources into Spark RDDs for processing.

•Built a prototype for real-time analysis using Spark Streaming and Kafka.

•Collected the business requirements from the subject matter experts like data scientists and business partners.

•Involved in the Design and Development of technical specifications using Hadoop technologies.

•Loaded and transformed large sets of structured, semi-structured, and unstructured data.

•Used different file formats like Text files, Sequence Files, and Avro.

•Loaded data from various data sources into HDFS using Kafka.

•Tuned and operated Spark and related technologies such as Spark SQL.

•Used shell scripts to dump the data from MySQL to HDFS.

•Used NoSQL databases like MongoDB in implementation and integration.

•Streamed analyzed data to Hive tables using Storm, making it available for visualization and report generation by the BI team.

•Configured the Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs.

•Consumed the data from the Kafka queue using Spark Streaming.

•Used Oozie to automate/schedule business workflows that invoke Sqoop and Pig jobs.
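
A small sketch of the RDBMS-to-HDFS import pattern referenced above. The imports on this project were done with the Sqoop CLI; this shows the equivalent idea as a PySpark JDBC read, with hypothetical hosts, credentials, tables, and paths.

# Minimal PySpark sketch of an RDBMS-to-HDFS import (a Spark JDBC analogue of a
# Sqoop import). Requires the MySQL JDBC driver on the classpath; all host names,
# credentials, tables, and HDFS paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-to-hdfs-example").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://example-db-host:3306/sales")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "example-password")  # use a secrets store in practice
          .option("numPartitions", "4")            # parallel reads, like Sqoop mappers
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .load())

orders.write.mode("overwrite").parquet("hdfs:///data/raw/sales/orders/")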

EDUCATION

Bachelor's Degree in Multimedia and Digital Animation

Faculty of Mathematical Physics Sciences, U.A.N.L


