Posted May 2026
Platform Engineer - Infrastructure, Backend & Data
Core Platform - embedded across Infra, Backend, and Data Engineering
This is not a classical DevOps role.
It is DevOps reimagined for the AI era, where the boundary between infrastructure, backend, and data pipelines is not a clean handoff but a working surface you operate across every day.
What we are building
We are building an AI-native platform designed around LLMs, agents, RAG pipelines, vector stores, queues, and bursty workloads.
Our system runs LLM and inference workloads continuously, handles quiet-then-bursty traffic patterns, has agents calling agents and tools calling tools, will increasingly involve GPUs, and cannot afford runaway infra cost or vendor lock-in.
You do not need to be an AI infrastructure expert on day one. You do need to be curious about it and willing to learn token economics, GPU utilization, and inference patterns as the role demands.
The kind of engineer we want
- Reads code and does not wait for a Jira ticket with infra requirements before sizing a service or debugging an issue.
- Writes real Python for internal tooling, deployment automation, and service-level fixes.
- Thinks in runtime behavior, not diagrams: async patterns, file descriptors, Spark partitioning, retry loops, database pressure, and resource sizing.
- Treats infra cost as a function of code quality and avoids writing the expensive version in the first place.
- Is comfortable being wrong: holds opinions loosely and fixes mistakes through thorough debugging.
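To give a concrete flavor of "cost as a function of code quality": the same feature can be written as one network round-trip or as hundreds. The sketch below is purely illustrative (the functions and counts are hypothetical, not from our codebase), but it is the shape of problem we mean.

```python
# Hypothetical illustration: the "expensive version" issues one call per item,
# the cheap version batches. At cloud scale, that difference is an infra bill.
requests_made = {"per_item": 0, "batched": 0}

def fetch_one(item_id):
    # Simulates one network round-trip per item.
    requests_made["per_item"] += 1
    return {"id": item_id}

def fetch_many(item_ids):
    # Simulates a single batched round-trip.
    requests_made["batched"] += 1
    return [{"id": i} for i in item_ids]

ids = list(range(500))

# Expensive: 500 round-trips (and 500 chances to retry-storm under load).
rows_expensive = [fetch_one(i) for i in ids]

# Cheap: one round-trip for the same data.
rows_cheap = fetch_many(ids)

print(requests_made)  # → {'per_item': 500, 'batched': 1}
```

Same output, two orders of magnitude fewer requests; the engineer we want notices this while reading the code, not after the bill arrives.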
What you will actually do
Run cloud infrastructure across multiple providers
Own provisioning, networking, cluster lifecycle, and cost through Terraform and Helm.
Operate Kubernetes seriously
Manage production namespaces, stateless services, async workers, Spark jobs, AI inference workloads, Helm charts, upgrades, and platform stability.
Contribute to the data platform
Work with Spark, Airflow, Iceberg, and Trino on Kubernetes. Size jobs, debug performance, improve DAG infrastructure, and partner with Data Engineering on capacity and reliability.
Sit close to the backend team
Read services, understand connection pooling, async, queues, caching, and Python gotchas. Pair on changes that affect runtime cost and behavior.
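"Thinking in runtime behavior" often comes down to small details like this: a retry loop without backoff and jitter is how a brief outage becomes a retry storm. A minimal sketch of the safer pattern (names and parameters are illustrative, not from our codebase):

```python
import asyncio
import random

async def retry_with_backoff(fn, *, attempts=4, base_delay=0.05, max_delay=2.0):
    """Retry an async callable with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep a random fraction of the capped exponential delay,
            # so a fleet of clients does not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, delay))

# Usage: a flaky call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(retry_with_backoff(flaky))
print(result, calls["n"])  # → ok 3
```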
Build CI/CD that engineers actually like
Create fast feedback, safe rollouts, easy rollback, and sensible secret management using GitHub Actions and practical pipeline design.
Own observability across the stack
Use Prometheus, Grafana, and Loki for core signals, plus AI-specific metrics like token spend, model latency, and cache hit rates.
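To make "token spend" less abstract, here is a minimal sketch of the kind of per-service accounting we would export as metrics. The model names and per-1K-token prices are hypothetical placeholders, not real vendor pricing:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices; real numbers depend on vendor and model.
PRICE_PER_1K = {
    "small-model": {"in": 0.0005, "out": 0.0015},
    "large-model": {"in": 0.0100, "out": 0.0300},
}

class TokenSpendTracker:
    """Accumulates per-service token spend: the kind of signal exported
    alongside model latency and cache hit rates."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, service, model, tokens_in, tokens_out):
        p = PRICE_PER_1K[model]
        cost = tokens_in / 1000 * p["in"] + tokens_out / 1000 * p["out"]
        self.spend[service] += cost
        return cost

tracker = TokenSpendTracker()
tracker.record("rag-api", "large-model", tokens_in=2000, tokens_out=500)
tracker.record("rag-api", "small-model", tokens_in=1000, tokens_out=1000)
print(round(tracker.spend["rag-api"], 4))  # → 0.037
```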
Help us think about cost
Apply right-sizing, consolidation, spot strategies, and migrations to make our multi-cloud credits last.
Influence design, not just operations
Act as a peer in system design, push back on bad architecture, and ask teams to restructure flows when runtime behavior demands it.
Required skills
- 3-10 years building production systems as a platform, DevOps, SRE, backend engineer, data engineer, or a mix.
- Linux, networking, and cloud fundamentals on at least one major provider.
- Production Kubernetes experience beyond bootstrap tutorials.
- Comfort with Terraform or equivalent infrastructure as code for real infrastructure.
- Strong Python experience on backend services, not just scripts.
- Working knowledge of distributed computing such as Spark, Flink, Dask, Beam, or similar.
- Experience building or maintaining CI/CD pipelines that real teams depended on.
- Observability done right: dashboards and alerts that someone other than you actually used.
- Clear written communication.
Nice to have
- GPU workloads or AI inference serving in production, such as vLLM, TGI, Triton, SageMaker, or Bedrock.
- Depth with Iceberg, Trino, Airflow, or Kafka.
- Cost-attribution work and the ability to explain spend per service or tenant.
- Backend experience in Rust or Go.
- Open-source contributions or a homelab we can geek out about.
Meta-skills
- Reads code and is not afraid of an unfamiliar codebase.
- Uses AI tools daily, including in this role.
- Comfortable with ambiguity and fast iteration.
- Strong engineering intuition.
Failure, resilience, and chaos thinking
AI systems fail in new ways: partial LLM failures, vendor API rate limits, retry storms from agents, streaming interruptions mid-token, and model degradation. You should be comfortable with circuit breakers, graceful degradation, fallback paths, and feature flags at the infra level, and willing to learn AI-specific patterns as we encounter them.
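If "circuit breaker with graceful degradation" is new to you, this is the essential shape, sketched in plain Python (the vendor call and fallback here are illustrative; production versions add per-endpoint state, metrics, and configuration):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls fail fast until `cooldown` seconds pass, when
    one trial call is allowed through (half-open)."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Open: fail fast without hammering the vendor;
                # degrade gracefully via the fallback path if one exists.
                if fallback is not None:
                    return fallback()
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        self.failures = 0
        return result

# Usage: a rate-limited vendor API, with a stale-cache fallback.
calls = {"n": 0}

def flaky_vendor():
    calls["n"] += 1
    raise TimeoutError("vendor rate limit")

def serve_cached():
    return "stale-but-served"

breaker = CircuitBreaker(threshold=2, cooldown=60)
for _ in range(4):
    result = breaker.call(flaky_vendor, fallback=serve_cached)
print(result, calls["n"])  # → stale-but-served 2
```

Note that after the second failure the breaker opens, so the last two calls never touch the vendor; this is what stops an agent retry loop from amplifying a rate-limit incident.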
How you will know you are succeeding
- The platform stays stable through real production load and you are the reason it does.
- Deployments get faster and rollbacks become boring.
- A Spark or Airflow problem lands on you and gets fixed without waiting for the Data Engineering queue.
- A backend bottleneck lands on you and you root-cause it through code, not just metrics.
- Infra cost per request goes in the right direction and you can explain why.
- Engineers across infra, backend, and data come to you when they are designing something because they want to, not because they have to.
Why this role
- You will not be siloed. Most DevOps roles stop at YAML and dashboards; this one puts you across infra, backend, and data with ownership to fix things wherever they break.
- AI-native means new problems: token economics, GPU utilization, and agent retry storms are genuinely new challenges we are figuring out together.
- Modern tooling and culture: AI-assisted coding, async-first communication, low ceremony, and high ownership.
- Hybrid work: about 3 days in office in Bangalore or Chandigarh, 2 from wherever works for you. Full remote can also be discussed.
- Early enough that decisions matter, stable enough that you will not be firefighting weekends.
