SOUBHIK CHAKRABORTY
Computer Programmer & Technologist
********.*@*****.***
+91-880-***-****
G-302, Rohan Nilay-1, Aundh, Pune, India
github.com/soubhik-c
linkedin.com/in/shoubhik-chakraborty-461aa424
CAREER PHILOSOPHY
“What we have to learn to do, we learn
by doing.”
- Aristotle
STRENGTHS
Hard-working (16/24) Fast Learner
Customer Interaction Analytical Skills
TECHNICAL SKILLS
Concurrent Programming
Data Structures
ML/DL/OR
Cloud Computing
distributed systems bigdata nosql
scala core java c++ python
spark elasticsearch syslogng
kafka docker swarm k8s helm
javacc antlr lex/yacc
aws azure packet
linux windows macintosh solaris
RECOGNITIONS
VMware Individual Excellence Award
for outstanding contributions in the
India Center
VMware Hackday Competition
Placed 2nd with a graphical
navigation of distributed query plans
using d3js.
Opus
Appreciated for
outstanding contribution in product
implementation on behalf of Janney
Montgomery Scott, Wall Street, NY
EDUCATION
Diploma in Advanced Computing
CDAC - Chennai
Aug 1999 – Jan 2000
B.Sc. in Mathematics
Government Autonomous Science College
Mar 1994 – Mar 1997
EXPERIENCE
18+ years of programming expertise
BigData Consulting
Freelancing
April 2018 – Present Pune
Scaling the EFK stack with syslog-ng/rsyslog to ingest 500k to 1000k msgs/sec on aws using docker swarm or kubernetes, handling cloud bursts with zero message loss, parallel streaming to data lakes, DR & WAN replication.
Analysis & tuning of ScyllaDB for 100K writes/sec & 3-4K scans/sec of 1-2TB datasets, used in an ETL pipeline’s write-ahead logging.
Staff Engineer
VMware (rejoin)
Jun 2017 – Feb 2018 Pune
Implementation of Distributed Network Encryption (NSX security).
Staff Engineer
SnappyData Technologies Pvt. Ltd.
Mar 2016 – Jun 2017 Pune
One of the first few lead engineers developing an HTAP database.
Thought leadership in Approximate Query Processing.
Implemented several Spark optimizations such as adding CBO, extending whole stage code generation and min-max column statistics in Spark 1.6/2.0.
Staff Engineer
VMware India Pvt. Ltd.
Apr 2013 – Feb 2016 Pune
Design & implementation of global index (total ordering) & distributed joins.
Off-heap slab allocator for the database using java unsafe apis.
Sr. Member of Technical Staff
VMware India Pvt. Ltd.
May 2010 – Mar 2013 Pune
Implemented lock-free distributed transactions with early conflict detection in the database.
Developed eventual consistency based active-passive WAN replication.
MY WORK DAY
Productivity Improvement
Architecture & Design
Play or Hobbies
Reading technical papers & study academic research
Resolving issues & requirement analysis of Customers
Sleeping & family time
Learning on the go
Administration
Coding
Mentoring
Reviewing
Bug analysis
Gathering code insights & new ideas
Created in LaTeX
PUBLICATIONS
SnappyData: Streaming, Transactions, and
Interactive Analytics in a Single Cluster
SIGMOD 2016, San Francisco, CA, ACM
full paper: snappydata.io/snappy-industrial
SnappyData: A Unified Cluster for
Streaming, Transactions, and Interactive
Analytics
CIDR 2017
PATENTS
Participant ———
GII (Get Initial Image): non-blocking replication of data to a node that is coming up while operations continue on the source copy.
Efficient write-through, write-behind to hadoop.
Co-Inventor ———
Applying a DDL statement atomically in a cluster.
Criteria based data eviction from in-memory cache.
Others that couldn’t be filed ———
Early conflict detection in a distributed transaction.
Query optimization by data co-location.
RECENT BIGDATA/CLOUD CONSULTATION
Overview ———
Apr’ 2018 - present
Experience with big data management up to 100TB & high performance computing on clusters of up to 10 nodes.
Proficient in cloud deployments with aws (awscli, root account administration), azure & packet clouds.
Intermediate in provisioning and automation with ansible, vagrant, packer.
Advanced user of containerization using docker, docker-swarm, kubernetes (calico, flannel etc. network optimization with ENA & sriov/dpdk support, block device volume optimization), daemon sets, statefulsets & helm charts.
Open source developer and contributor to gemfire, apache spark (cbo), snappydata & syslogng.
DCEngines ———
Apr’ 2018 (3 months)
Analysis & tuning of scylla database for 100K writes/sec & 3-4K scans/sec of 1-2TB datasets, used in an ETL pipeline’s write-ahead logging, deployed on packet cloud.
Next Gen HTAP database product architecture & design.
Bigdata infrastructure setup on packet cloud, firewalling, virtual machine performance tuning, aligning networking configuration with underlying hyper-v.
Nevis Networks ———
Sep’2018 (7 months)
Scaling elasticsearch, kibana & syslog-ng/rsyslog/fluentd to ingest 500k-1000k msgs/sec with no jitter as part of the OLS (Open Log Stack) centralized logging infrastructure for network device monitoring and troubleshooting, deployable on baremetal/amazon/azure, private & hybrid clouds.
Optimized syslog-ng’s elasticsearch java-plugin for high throughput; see pull request 2728 for details.
Tuned kernel settings in aws vms for udp log collection in syslog-ng/rsyslog with zero packet loss.
Devised a custom kibana dashboard giving a unified view of the system by gathering log collectors’ (syslog-ng/rsyslog/fluentd) statistics and elasticsearch index statistics. This provided self monitoring of the product at various levels. Also documented quick bottleneck identification guidelines and deterministic troubleshooting steps according to the deployment environment.
Guided aws support in identifying the underlying hypervisor tuning, based on VM level bottlenecks related to networking & kernel mainline upgrade on centos.
Identified various issues in docker-swarm overlay networking affecting scalability and performance of the virtual machines and docker containers.
Engaged with the syslog-ng team to help evaluate native http destination plugins and identify udp socket source weaknesses.
Learned docker & docker swarm alongside the above mentioned log ingestors while delivering the desired performance results.
Lumina Networks ———
May’18 (till now)
Enhancing deployment of the same OLS (Open Log Stack) solution on a kubernetes cluster using helm charts at a very large scale (1000s of network devices & network service agents and 10s of ES nodes), providing centralized log analytics.
Targeting OLS to support adhoc and analytical long-running, scan-oriented querying of up to 100TB of data as per various data retention policies in elasticsearch without affecting the ’zero jitter ingestion rate’.
Tuning of calico kubernetes overlay networking on aws using sriov and dpdk support from xen/kvm hypervisors deployed on dedicated h/w.
Evaluating prometheus for realtime monitoring of the deployment infrastructure, complementing ols centralized logging & its self monitoring, seamlessly integrated with other lumina products.
Working on provisioning of 1000s of nodes & 10s of ES data nodes using packer & ansible generated images, vagrant & ansible based baremetal and aws/azure cloud deployments.
Providing technical assistance to define enterprise-wide processes for configuration management, CI/CD of various lumina products, jenkins integration, artifact repository management, product rollout & production deployment.
Instrumental in setting up inhouse artifactory, git, helm repositories catering to large scale lumina customers like AT&T, TMobile, Verizon as part of leap-infra team.
WORK HISTORY
Staff Engineer
SnappyData Technologies
Mar 2016 – Jun 2017 Pune
scala microsoft sql server aws linux
Features:
Integrated Spark streaming & provided parallel ingestion over microsoft CDC, as described in a blog post.
Performance Enhancements:
Implemented cost based optimizer in spark to prefer colocated join between tables & indexes and optimal join order.
Contributed to optimizing WholeStageCodeGen to emit vectorized scan/filter/project plans operating on 60GB+ of columnar table data.
Replaced spark’s hash aggregation and hash join operators with versions that use vectorization and column encodings, delaying unsafe row creation and thus alleviating gc pressure.
Participated in min-max scan optimization for column batches.
Benchmarked and improved performance of a mixed TPC-C & TPC-H workload by 3X at 100GB scale.
Generated execution profiles of YCSB comparing memsql, cassandra & snappydata on 1 to 100 GB data sizes.
Staff Engineer
VMware India Pvt. Ltd.
May 2010 – Feb 2016 Pune
SnappyData (in stealth mode) ———
snappydata.io
4 months
scala linux
Product vision and idea formation centering around MPP databases.
Evaluated multiple open source products suitable to the purpose.
Finished proof of concept implementation in 2 months while learning functional programming & apache spark.
Extended spark with the combinatorial parser making it suitable for realtime query analysis.
Implemented basic unification of the two products via DataSource apis exposed in spark.
Enriched DataFrame/RDDs with stratified sampling over spark streaming.
Compared spark’s builtin MLlib with our approximation.
Feasibility study of supporting MLlib apis through a sql interface applied on the unified storage in gemfirexd.
GemFireXD ———
Re-positioned as clustered database – gemfirexd.docs.pivotal.io
5 years & 5 months
scala core java c++ linux
Features:
Participated in thrift based ODBC driver implementation.
Technical guide for GemFireXD monitoring system – Pulse.
Designed & implemented "explain" a.k.a. query execution tracing & query-plan visualization in command line for troubleshooting perf issues. Optionally exposed as javaagent instrumentation for more indepth profiling.
Implemented session monitoring for administrators showing client details of long running resource intensive queries.
Query cancellation through jdbc api & builtin procedures.
User Defined Functions and cluster wide dynamic jar installation & loading/unloading.
Parallel WAN replication of colocated tables.
Memory Optimizations:
ResultSet streaming from remote nodes for queries returning large number of rows.
Fine tuned memory footprint of GemFire PartitionedRegion to reduce per entry overhead from 1024 bytes to 128 bytes.
Implemented CompactHashMap for extremely specialized global index adaptive caching on local node’s PartitionResolver with 4-8 bytes of per entry overhead.
Performance Enhancements:
Avoided System.nanoTime 2.x kernel bug by implementing low level performance counter using jna/jni for query tracing.
Enriched DRDA protocol for single hop primary key based point queries.
Wrote a very lightweight parser for query pattern matching of statements to capitalize on query-plan caching.
Accommodated this part-constant-part-variable token in the Cost Based Optimizer for optimized plan.
Introduced Rule Based Optimization to reduce compile time of nested views from 20 mins to 5-10 seconds.
Introduced local index persistence that expedited cold restarts of the cluster from a few hours to 10 minutes per index.
Modified ConcurrentSkipList to intake partial sorted batches of rows while populating back local indexes from disk.
Introduced EntrySet iterator to PartitionedRegion that utilized memory optimized key/value storage.
Modified entrySet to use DiskSavyIterator helping sequential disk reads reducing kernel pressure.
Used DirectByteBuffers & java NIO for zero copy network IO.
Helped implement a selector model on java sockets, decoupling 1-1 reader threads.
GC Pause Reduction:
Came up with byte-to-byte comparison for massive scans in order to reduce garbage bursts avoiding long gc pauses.
Optimized low level row de-serialization functions to be more jit friendly avoiding frequent decompilations.
Plugged in our own finalizer instead of java’s interface that invariably takes 3 gc cycles to finally cleanup dependent objects.
Modified ConcurrentHashMap with double dispatch mechanism to avoid unnecessary object creation for apis like putIfAbsent.
Modified ConcurrentSkipList internal Node classes to reduce object overhead in local indexes.
Altered BigDecimal/BigInteger/DateTime/String classes via reflection to mutate and reuse shell objects while scanning large datasets.
Senior Technical Lead
GemStone Systems
Feb 2007 – May 2010 Pune
GemFireXD ———
SQL on top of distributed cache – gemfirexd.docs.pivotal.io
2 years & 6 months
core java linux
Modifying Apache Derby:
Started with a simple idea of providing SQL interface over GemFire distributed caching api to reduce usage complexity and learning curve of the application developers.
Modified sql parser to add new DDL extensions related to additional GemFire capabilities of Region.
Applying DDL atomically across the grid with higher read priority over metadata writes & backing off capability in case of partially applied DDL.
Added new distributed query plan to AST & query-plan shipping for servicing select queries on the cluster.
Enriched optimizer with colocated join-order, global hash index Vs local sorted index selection, distribution cost considerations.
Implemented delta update propagation based on table metadata instead of full row transmission.
Support for efficient jdbc batch insert api.
Enhanced code generation for n-way merge of "order by" & "group by" queries, optimized top-N queries, avoided double de-serialization of rows when applying projections.
For performance benefit of point queries came up with a concept of global index (auxiliary distributed hashmap).
Modified core mechanics of GemFire PartitionedRegion for consistency guarantees between primary table and global indexes without transactions in place.
Consistency guarantees of After Triggers in DMLs without transactions.
GemFire ———
pivotal.io/pivotal-gemfire, open sourced as Apache Geode
9 months
c++ java linux windows solaris
Client/Server Security feature:
Introduced AUTH layer in jgroups membership management protocol used for both peer-to-peer and wan gateway interfaces.
Implemented mutual authentication between gemfire clients and server.
Client side querying:
Extended server side OQL processor to c++ client for sub-millisecond response time.
Senior Technical Consultant
Opus Software Solutions
Oct 2000 – Jan 2007 Chennai
m27 ———
a framework that helped focus on “design by contract” methodologies
c++ microsoft windows
Developed a custom c++ compiler delegating finally to microsoft visual c++ compiler after 10 transformation stages. This enabled new keywords in application code to treat classes as business objects, model the system and incorporate change requirements easily.
The compiler generated low level c++ api calls for object-relational mapping, auto-remoting of methods (RMI), publish-subscribe for notifications, automatic object caching & lifetime management.
mTalk - It took care of data-dependent routing, connection load balancing, message persistence and publish-subscribe asynchronous notifications.
Trading Systems ———
Realtime Order & Risk Management system
vc++ microsoft windows microsoft sql server
Developed online transaction processing trading applications interfacing with various stock exchanges viz. NSE, BSE, NYSE, AMEX, HKSE. Some applications involved standard protocols like Financial Information eXchange (FIX) and the widely used NYSE’s CMS.
Application development involved Business Process Modeling Language, Unified Modeling Language, hierarchical state models and finite state models (state machines).
Deployed TradeNow, DirectXchange and VelocityXchange systems that had order routing engine, realtime risk management, post trade management and back office interfaces.
Programmer Analyst
Netlink Technologies
Mar 2000 – Oct 2000 Chennai
visual basic oracle asp vbscript
IZone - Satyam ISP browsing terminal
Developed the server side of the product, offering session management of internet users, presenting their internet usage time, providing re-charge options, billing & MIS reporting, user rights management etc. as key features.
Programmer Analyst
Risha Software Solutions
Aug 1997 – Sep 1999 Jabalpur
foxpro (for dos & unix)
Inventory Management and Resource Scheduling
Proofing management system, a defence project from the Directorate General of Quality Assurance (DGQA), deployed over a 5km radius intranet using Novell NetWare and SCO OpenServer.
The system provided an optimal schedule for proofing thousands of ammunitions (a quality check procedure by firing them) based on 42 factors.
In other words, it’s an Operations Research problem with 42 mutually dependent parameters, each parameter derived from multiple other factors and input parameters.
Provided an inventory adjustment module implementing rollback feature in FoxPro.