GUNASURIYA.B
Senior Data Engineer / Architect · Cross-cloud Streaming · Voice AI

Pipelines that pay rent.

Cross-cloud streaming infrastructure for the Voice-AI era.

By Gunasuriya Balasubramani  ·  Senior Data Engineer / Architect
01 · Impact
$1M annual run-rate savings
▼ −87%

A 26-node Cloudera CDH estate retired in favor of a GCP-native, autoscaling Unified Data Platform. 5× faster, hardware failures eliminated, on-call finally sleeps.

  • Run-rate cut
    −87%
    ~$1.14M → ~$144K · per year
  • Latency
    5.0× lower
    15-min batch → 3-min stream
  • Clouds unified
    4
    GCP · OCI · AWS · on-prem
  • Tenure
    7+ yrs
    IoT · MarTech · Voice AI
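
The headline figures follow directly from the rounded numbers above. A minimal sketch of the arithmetic, assuming the approximate annual costs and latency windows quoted in this section (variable names are illustrative only):

```python
# Approximate, rounded figures quoted in the Impact section.
legacy_cost = 1_140_000   # ~$1.14M per year on the Cloudera estate
new_cost = 144_000        # ~$144K per year on the GCP platform

savings = legacy_cost - new_cost
cut_pct = savings / legacy_cost * 100

legacy_latency_min = 15   # batch window, in minutes
new_latency_min = 3       # streaming average, in minutes
speedup = legacy_latency_min / new_latency_min

print(f"savings ≈ ${savings:,} per year")
print(f"run-rate cut ≈ {cut_pct:.0f}%")
print(f"speed-up = {speedup:.1f}×")
```

The savings come to just under $1M per year, which is where the rounded "$1M" headline comes from.
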
02 · About

Seven years building data systems that pay rent.

— Profile

Senior Data Engineer with seven-plus years across IoT telemetry, marketing analytics, and now enterprise Conversational AI. Currently the technical authority for a six-engineer pod in Bangalore at SoundHound AI — consolidating GCP, OCI, AWS and on-prem sources into a single source of truth for Voice-AI workloads.

Architecture, code review, design approval, mentorship; daily collaboration with engineering teams in the US, UK and Ukraine. Previously responsible for marketing-platform instrumentation at eDirectSys and the SolarPulse IoT telemetry product at Mahindra Teqo.

Open to remote Staff or Principal roles — Bangalore or UAE. Available for client meetings and conferences in Dubai.

03 · Featured Case Study

A 26-node Hadoop estate, retired in favor of an architecture that pays for itself.

The migration that defines my recent work. Designed and led the cutover off a static on-prem Cloudera cluster onto a GCP-native, autoscaling Unified Data Platform — without losing a byte and without a single incident-night during transition.

▼ Decommissioned

Cloudera On-Prem CDH

  • Compute: 26 nodes · static
  • Latency: 15-min batch windows
  • Reliability: HDD failures, regular outages
  • Toil: Hardware ops, legacy overhead
  • Scaling: Manual, capped

● Live in production

GCP Unified Data Platform

  • Compute: 1 master / 2 workers on Dataproc + autoscaling
  • Latency: ~3-min streaming average
  • Reliability: Zero hardware failures, high uptime
  • Toil: GitOps-managed, fully observable
  • Scaling: Elastic, demand-driven

— Outcome

~$1M in annual savings. 5× faster. Same data, same SLAs, and an on-call rotation that finally sleeps.

04 · Reference Architecture

One source of truth — fed by every cloud you already run.

/* Conceptual cross-cloud streaming fabric — abstracted to respect NDA. Each source cloud owns its regional Kafka and processes locally; writes converge in a unified BigQuery hub via the Storage Write API. */

Source clouds
  • GCP / Native: Kafka + NiFi (spark · dataproc)
  • OCI: Kafka regional (pyspark · oci dataflow)
  • AWS: Kafka regional (spark · emr)
  • On-prem: Kafka VM (spark · self-managed)

Hub
  • Unified Data Platform: BigQuery · CloudSQL · GCS (Storage Write API · BQ DTS)

Consumers
  • Research BU: Notebooks · ML
  • Analytics BU: Dashboards · BI
  • Product BU: APIs · GKE Services
  • Hot path: ClickHouse · Realtime
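
The fan-in pattern in the diagram can be sketched in plain Python. This is an illustration only, not the production code: the `Event` type, the `tag_source` and `converge` helpers, and the sample records are all hypothetical, and in-memory lists stand in for the regional Kafka topics and the BigQuery hub.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True, order=True)
class Event:
    event_time: int   # epoch seconds; first field, so it drives ordering
    source: str       # which cloud produced the record
    payload: str


def tag_source(source: str, rows: Iterable[tuple[int, str]]) -> Iterator[Event]:
    """Local processing step: stamp each record with its source cloud."""
    for event_time, payload in rows:
        yield Event(event_time, source, payload)


def converge(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge per-cloud streams (each already time-ordered) into one hub stream."""
    return heapq.merge(*streams)


# Hypothetical stand-ins for the four regional topics.
gcp = tag_source("gcp", [(1, "a"), (4, "d")])
oci = tag_source("oci", [(2, "b")])
aws = tag_source("aws", [(3, "c")])
onprem = tag_source("on-prem", [(5, "e")])

hub = list(converge(gcp, oci, aws, onprem))
print([(e.event_time, e.source) for e in hub])
```

Each source keeps its own ordering and processing, and only the merged result lands in the hub. That mirrors the "process locally, converge centrally" idea without implying anything about the real topic layout.
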

  • Auth · Secrets: Vault · Workload Identity Federation
  • Network: Private interconnect (cross-cloud)
  • Orchestration: Airflow · Terraform · ArgoCD · Jenkins
  • Migration mode: Parallel build → dual-write → validate → cutover

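
The validate step of the migration mode can be sketched as a parity check between the legacy and new sinks during dual-write. This is a minimal sketch under stated assumptions: `fingerprint`, `validate_window`, and the sample rows are hypothetical, and lists of dicts stand in for query results from the two platforms.

```python
import hashlib
import json
from collections import Counter


def fingerprint(row: dict) -> str:
    """Stable hash of a row, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_window(legacy_rows: list[dict], new_rows: list[dict]) -> dict:
    """Compare one time window that was written to both platforms.

    Multisets of row fingerprints are used so that duplicates and
    ordering differences are handled correctly.
    """
    legacy = Counter(fingerprint(r) for r in legacy_rows)
    new = Counter(fingerprint(r) for r in new_rows)
    missing = sum((legacy - new).values())   # in legacy, absent from new
    extra = sum((new - legacy).values())     # in new, absent from legacy
    return {
        "legacy_count": len(legacy_rows),
        "new_count": len(new_rows),
        "missing_in_new": missing,
        "extra_in_new": extra,
        "ok": missing == 0 and extra == 0,
    }


# Hypothetical sample: same data, different row and key order.
legacy_rows = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}]
new_rows = [{"v": "y", "id": 2}, {"id": 1, "v": "x"}]
report = validate_window(legacy_rows, new_rows)
print(report["ok"])
```

A natural gate would be to cut over only once every window in the dual-write period reports `ok`. That gate is an assumption about the process, not a detail taken from the case study.
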
05 · Career

Three companies. One throughline.

2022 — Now
3.5+ yrs

Interactions LLC (now SoundHound AI)

Senior Data Engineer · Voice & Conversational AI

Architecting the cross-cloud Unified Data Platform consolidating streaming, analytical and transactional sources from GCP, OCI, AWS and on-prem into a single source of truth.

Acting as technical authority for a 6-engineer pod in Bangalore: designing the architecture, approving the implementation path and reviewing code, and coordinating daily with US, UK and Ukraine engineering teams. Interactions LLC was acquired by SoundHound AI in September 2025.

Stack
GCP · OCI · AWS
Kafka · NiFi · Spark
Dataproc · BigQuery · Dataflow
Terraform · Airflow · ArgoCD
2021 — 2022
~2 yrs

eDirectSys

Full-stack & Data Engineer · Digital Marketing

Built data pipelines and the analytics surface for a high-volume email marketing platform: bulk-send infrastructure, instrumentation for opens / clicks / conversions, and the dashboards that helped clients see — in near-real-time — how each campaign was performing.

Stack
Python · SQL
ETL pipeline design
Campaign instrumentation
Visualization & dashboards
2019 — 2021
~2 yrs

Mahindra Teqo

Data Engineer · Renewable-Energy IoT

Worked on SolarPulse — a real-time telemetry product for offshore solar plants. Streamed live data from inverter, grid and environmental sensors via gateway devices into a central time-series store, powering live operational dashboards and forecasting models that operators relied on to keep plants running.

Stack
IoT ingestion · Streaming
Time-series storage
Live dashboards
Forecasting pipelines
06 · Technical Stack

The toolbox, unsentimental.

Core
  • Apache Spark: pyspark · sql
  • Python: 3.11+
  • SQL: bq · psql
  • Structured Streaming: spark
Cloud
  • Google Cloud: primary
  • Oracle Cloud: deep
  • AWS: working
  • On-prem / Hybrid: experienced
Streaming & Storage
  • Apache Kafka: broker · cdc
  • Apache NiFi: routing
  • BigQuery: warehouse
  • Dataflow · Dataproc · EMR: compute
  • CloudSQL · GCS · OCI Object Storage: store
  • ClickHouse: hot path
Ops · GitOps · Sec
  • Terraform: iac
  • Airflow: orch
  • ArgoCD · Jenkins: cd
  • GKE: k8s
  • Vault · WIF: secrets · auth
07 · Certifications

Two Google Cloud certifications, in flight.

Google Cloud Associate Cloud Engineer
▲ Sitting · May 2026
Google Professional Data Engineer
▲ Sitting · June 2026