Big Data Engineering Director

Location:

Fremont, CA, 94538

Posted:

November 06, 2022

Resume:

SANHITA SARKAR

Fremont, California *****

Phone : 650-***-****

*******@*****.***

https://www.linkedin.com/in/sanhita-sarkar-b6bb18/

CAREER SUMMARY

Engineering Director, Innovation Enthusiast, and Agent of Change with 24+ years of industry experience in building software products and solutions and driving teams to deliver on business needs. Oversee design, implementation and delivery of optimized software features and solutions on diverse GPU/CPU servers, storage, and Data Lakehouse architectures. These solutions involve Artificial Intelligence, Machine Learning, Blockchain Smart Contracts, Big Data, Hadoop, Graph Analytics, Data Science, SQL and NoSQL Databases, Metadata Indexing/Search, Event-driven Architectures, Business Intelligence, Visualization, and IoT Applications from edge to cloud, and are aligned with industry verticals like Media and Entertainment, Healthcare, Industrial Manufacturing, and Defense and Intelligence.

Specialties: Engineering Management; Agile SAFe planning; Operations; Strategic Vision and Implementation to meet new industry trends and technology standards; ISV Partnership; Entrepreneurship; Team building and people management skills to innovate, mentor and deliver software products/solutions based on open source technologies and partner solutions that are aligned with market trends and customer needs; Defining Product Value Proposition for Marketability; Driving ROI-based Metrics and on-budget projects.

PROFESSIONAL ACCOMPLISHMENTS

• Enabled new markets and customers in IoT, AI and SQL/NoSQL database technologies by delivering performance-optimized software features and solutions on both converged and disaggregated GPU/CPU compute and data infrastructures offered by Western Digital. Delivered these capabilities in Program Increments within an Agile SAFe environment to meet customer needs from edge to cloud, ranging from car manufacturing to media and entertainment, cloud storage file hosting, internet search, telecom, and genomics and healthcare. Authored patents and technical papers, and actively speak at various conferences, including corporate-sponsored conferences (refer to link at the end).

• Delivered big data analytics software/solutions within SGI R&D division, a traditional player in High Performance Computing markets and executed a similar role at Teradata Corporation. Evaluated current and emerging data analytics software, platforms/storage architectures, database and application features, product release / development readiness for emerging platforms; drove implementation to delivery of analytical software, solutions, and cloud offerings like software as a service and platform as a service for various customers.

• Hired top talent and motivated teams within Engineering to develop optimized, revenue-focused solutions for horizontal and vertical markets. These solutions were based on the core product, and fully integrated with advanced software technologies, either open source or partner based. Solutions were built with real-world data to deliver significant business value to various customer segments.

• Drove cross-functional activities, both internally and with Partners, throughout the solution/product lifecycle, staying within the allocated budget.

• Completed extensive company-sponsored management and leadership training, including the leader essentials program, commitment-based management, agile/scrum methodologies, and others.

• Became one of the honorary “Agents of Change” for augmenting corporate processes with a capability to use AI/ML/Big Data technologies for actionable insight and foresight. Delivered sales operational intelligence software as a service in the private cloud, allowing for analysis of sales data in near-real time to predict revenue of various corporate products.

• Authored patents involving AI architectures leveraging varied systems infrastructures; smart analytics for IoT data pipelines; intelligent data tiering using machine learning; software as a service and platform as a service in the cloud; and graph visualization. Authored several white papers; active speaker and panel/session chair at various AI, Big Data, and Storage conferences.

• Organizer of the Artificial Intelligence and Big Data Sciences Meetup Group: organized several sponsored Meetup events at DataWorks Summit and Strata Data Conference, as well as hands-on sessions on AI/ML, big data analytics, cancer genomics, IoT and ICT Applications.

EDUCATION

• Ph.D. 1996. Electrical Engineering and Computer Science. University of Minnesota, Minneapolis, MN. Focus: Parallel and Distributed Databases; Performance and Scalability; Data Mining. Worked with corporations during research and implementation. Authored papers and presented at conferences.

PROFESSIONAL EXPERIENCE

WESTERN DIGITAL CORPORATION, Milpitas, California (July 2015 - Present)

Global Director of Software Development Engineering: AI/ML, Big Data, Databases, and Systems Architecture

• Driving an engineering team using Agile SAFe methodologies to define, design, develop and deliver performance-optimized software features and solutions on converged and disaggregated architectures and Data Lakehouses involving GPU/CPU compute and composable data infrastructures, ranging from NVMe/NVMe-oF all-flash arrays and object storage systems to Zoned Namespace systems (e.g., NVIDIA Tesla V100/A100 GPU servers, OpenFlex, NVMe-oF platforms, ZNS systems, IntelliFlash, ActiveScale and others).

o Ongoing design, development and optimization of the key components of a Data Lakehouse on next-gen storage systems involving MongoDB and Cassandra NoSQL databases in the streaming layer, Delta Lake and Data Lake in the batch layer, and AI/ML data pipelines in the serving layer.

o Delivered an end-to-end optimal solution stack and a Reference Architecture for a Percona MySQL database utilizing a MyRocks storage engine on Zoned Storage NVMe SSDs.

o Delivered features for fast ingestion (with Kafka), metadata indexing and search (with Elasticsearch), new file and object notifications (with Kafka streaming framework), SQL query capability (with Presto and Hive), real-time/smart analytics (with Spark Streaming and Cassandra), data tiering and fast retrievals (using ML clustering algorithms and a metadata management technique), image/text/video analytics (with TensorFlow, PaddlePaddle and PyTorch), and optimal data provisioning and smart data placement strategy for multi-tenant applications on MySQL/MyRocks stack utilizing zoned namespace storage systems.

o Delivered various end-to-end AI data pipelines on a disaggregated architecture comprising GPUs, flash arrays and object storage; demonstrated ease of resource provisioning and independent scalability, scalable AI model training throughput and fast inference response times without compromising on data persistence, durability and cost. Used AI frameworks like TensorFlow, PaddlePaddle and PyTorch for image and text analytics, and video object detections on incoming streaming data, followed by file/object metadata indexing/search and visualization using Elasticsearch and Kibana.

• Delivered benchmark results, reference configurations, best practices and value propositions for the above features/solutions for ease of customer deployments.

• Developed a software-defined prototype system for smart archiving and analytics for WD Manufacturing, comprising a Six Sigma DMAIC process framework. Used a disaggregated compute cluster running Hadoop, with an S3 connector to object storage, within an OpenStack framework. Designed an innovative data partitioning scheme for manufacturing process data that reduced query response times for detecting defect patterns by 17x. Deployed the system both on-premise and in a private cloud and optimized performance end-to-end from ingestion to analytics to visualization on an hourly dashboard, using real-world manufacturing data. Demonstrated this actionable solution to reduce the TCO by th of the original and yield a higher ROI for the business, compared to the existing methodology. This allowed IT and Manufacturing teams to adopt the strategy in production, in the long run.

• Explored Blockchain Smart Contracts and their applicability to industry verticals, like healthcare, for future implementations in products.

• Work closely with Product Management and Marketing to define use cases for industry verticals and help promote existing and upcoming products, aligned with customer needs.

• Ongoing work on AI data pipelines and software ecosystems from edge to core to cloud by combining various products and demonstrating price/performance benefits to various IoT customers in car manufacturing, defense and intelligence, media and entertainment and genomics/healthcare.

• Author patents and technical white papers; regularly speak and chair panels/sessions at conferences and meetups (AI Dev World, Global AI/Big Data Conference, Flash Memory Summit, SNIA SDC, Strata Data Conference, AI and Big Data Sciences Meetup, and others; refer to link at the end). Active member of WD Invention Review Board to ensure the quality of patent applications in big data/analytics and systems architecture domains.

• Completed several company-sponsored leadership trainings.

TERADATA CORPORATION, Santa Clara, California (Dec 2014 - July 2015)

Director of Engineering, Big Data/ML, Performance

• Responsible for delivering performance and scalability of the distributed Teradata Aster Analytics Discovery Platform (SQL-MapReduce, SQL-Graph, R, Machine Learning Algorithms) / Big Data / Hadoop / UDA (Unified Data Architecture) features, infrastructure and real-world big data application workloads for new releases and emerging technologies.

• Big Data Application workloads (Petabyte scale) explored were aligned with Retail, social media, Healthcare, Telecom and Financial vertical markets, as follows:

o GenBase (data management and complex analytics for Healthcare market); BigBench (end-to-end retail workflow comprising ETL, processing and analytics of structured (table joins) data, unstructured (customer behavioral/sentiment) data and semi-structured (web clickstream) data, and reporting queries); Facebook/Twitter Graph Analytics; and TPCx-HS (supporting analytics atop Hadoop/HDFS).

• Explored Machine Learning models (K-Means, Decision Tree, Support Vector Machine, etc.) for performance, scalability, and accuracy (e.g., network anomaly / intrusion detection and prediction, sensor data analytics, etc.).

• Contributed to software design and development to meet performance requirements, e.g., building an analytics profiling system and performance advisory features into the product.

• Defined the Performance Engineering Life Cycle (PELC) to be aligned with the Software Development Life Cycle (SDLC) in an ecosystem involving Continuous Integration (CI) and Appliance platforms. The approach included software design, development and automation involving technologies for Big Data and Analytics like Accumulo, Docker containers, Hadoop/YARN, Kafka, Storm, Spark and Machine Learning Algorithms for real-time Actionable Intelligence.

• Fostered industry collaborations and partnerships based on emerging Big Data, Cloud and Analytical market trends for competitive product advantages.

• Organized Big Data Science Meetups (at corporate facilities, Strata conference and Hadoop Summit) focusing on models and solutions for IoT, ICT, Machine Learning and Cloud-based infrastructures.

Consultant to Huawei Technologies, Santa Clara, California (Sep 2014 – Nov 2014)

Chief Architect / Director - Big Data, Cloud Infrastructure & Platforms (R&D)

• Built and led a team of Architects and designed the Huawei R&D Big Data Insight Platform for private cloud, using OpenStack (project Sahara). The platform comprised Hadoop/YARN, Kafka, Ambari/OpenStack, Storm, Spark, NoSQL database, Graph analytics and Machine Learning. Designed 3 use cases for the Big Data Insight Platform hosted in the private cloud. Use cases involved R&D Organizational KPIs, R&D Development Quality Metrics and Data Center Operational Efficiency.

• Organized two Big Data Science Meetups, sponsored by Huawei along with presentations on ICT, IoT and IoE Applications.

SGI (Silicon Graphics International), Milpitas, California (Feb 2007 – June 2014)

Director of Engineering, Big Data Solutions and Performance

• Built and led a team of engineers & architects to design and deliver performance-optimized SaaS and PaaS offerings for big data analytics. This involved machine learning, BI tools, in-memory databases, and Hadoop aligned with latest market trends to meet corporate verticals like Financial Services, Healthcare and Defense/Intelligence. Delivered -

o Optimized Hadoop/YARN bundled appliances (large petabyte-scale) and software as a service (SaaS) offering involving Cloudera, Pivotal HAWQ, Impala, Hive, HBase, Spark, Crossbow and Tableau for visualization. Application use cases were Cancer Genomics and Financial fraud detection.

o Terabyte-scale in-memory database solutions (Oracle 11g, Oracle 12c, and MS SQL Server); SAP HANA solution and Apps certifications on a large-memory, scale-up platform like SGI UV; Graph database solutions (Neo4j, AllegroGraph) along with Graph visualization (Keylines, Gephi, Relfinder); NoSQL database appliances (e.g., MarkLogic XQuery) on scale-out clusters.

o SaaS and PaaS offerings (Data Science as a Service) involving Machine Learning and Predictive Analytics for fraud detection, credit risk management and cancer genomics in the Cloud.

o Optimized Linux KVM and OpenStack solutions on SGI UV NUMA platform for the Cloud. Delivered best practices for optimal performance and manageability of virtual machines on multi-core and large memory platforms in the Cloud.

o Several TPC and SPEC benchmarks (like SPECjbb, SPECpower, SPECsfs); Hadoop benchmarks; internally developed benchmarks for Big Data Applications; enabled profiling tools and methodologies for big data applications.

• Led IaaS, PaaS and SaaS operations in the data center to implement, automate and deliver various enterprise solutions, and support partner and customer POCs, as applicable.

• Defined Big Data Strategy and ecosystem aligned with market trends and voice of customers. Organized and hosted Big Data Meetups within corporate premises and at external conferences like Hadoop Summit, Strata, etc. Presented at annual SGI User Group conferences, Sales conferences, and external conferences.

• Authored patents involving design and optimization of big data ecosystems on multi-core, large memory architectures in the Cloud; and graph visualization.

PANTA SYSTEMS, INC., Cupertino, California (Aug 2005 – Feb 2006)

Sr. Manager, Enterprise Performance Group

• Built, mentored, and managed a team of Senior Principal Engineers to design, develop and deliver high-performance, scalable Grid solutions on Linux Blade clusters using Oracle RAC. Delivered TPC performance benchmarks on AMD x86-64 Linux clusters configured with IB-attached storage arrays for Grid Computing.

ORACLE CORPORATION, Redwood Shores, California (July 1997 - May 2005)

Platform Manager and Principal Member of Technical Staff

• Managed and led a team of Sr. software engineers to design, develop and deliver performance optimizations for Oracle code releases (8i, 9i, 10gR1 and R2) on scale-up and scale-out Unix and Linux platforms. Optimizations in Oracle involved asynchronous I/O execution, Oracle profiling, compiler enhancements, InfiniBand latency improvements for RAC, NUMA locality, parallel query execution and others. Worked closely with Business/Alliance Partners, Release Engineering, Build/Packaging, Product Management and Customers. Worked within a matrixed, geographically dispersed organization involving teams in US, APAC, and European Development Centers.

• Delivered several TPC-C, TPC-H and Application Benchmarks demonstrating leadership of Oracle performance in the database industry.

• Authored several papers, presented at external conferences, and served on conference committees, like HP World and others.

• Completed Oracle-sponsored management and leadership trainings.

INSTITUTE OF MATHEMATICS AND ITS APPLICATIONS, Minnesota (Aug 1996 – June 1997)

Post-Doctoral Research Fellow

• Performed research (funded by National Science Foundation) on Data Mining algorithms, Parallel and Distributed Databases. Collaborated with companies during research and implementation. Published several papers.

CONFERENCE PRESENTATIONS, PUBLICATIONS, PATENTS

• Link to Publications
