Core area

AI Inference

Model serving, latency, throughput, KV-cache pressure, GPU/cloud economics, and AI factory architecture.

Thesis

Inference is where AI product quality, user experience, margin, and infrastructure reality meet. Good systems require measured tradeoffs across latency, throughput, memory, batching, scheduling, hardware, and deployment topology.
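To make the tradeoff concrete, here is a minimal sketch of how batch size pulls latency, throughput, and cost in different directions. It assumes a toy linear model of a decode step; the class `DecodeModel`, its fields, and every number below are hypothetical placeholders, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class DecodeModel:
    """Toy linear model of one decode step. All values are assumptions."""
    fixed_ms: float          # per-step overhead that does not depend on batch size
    per_seq_ms: float        # extra step time added by each sequence in the batch
    gpu_usd_per_hour: float  # assumed hourly price of the accelerator

    def step_ms(self, batch: int) -> float:
        # Per-token latency seen by each user grows with batch size.
        return self.fixed_ms + self.per_seq_ms * batch

    def tokens_per_second(self, batch: int) -> float:
        # Each step emits one token per sequence in the batch.
        return batch * 1000.0 / self.step_ms(batch)

    def usd_per_million_tokens(self, batch: int) -> float:
        usd_per_second = self.gpu_usd_per_hour / 3600.0
        return usd_per_second / self.tokens_per_second(batch) * 1_000_000


model = DecodeModel(fixed_ms=20.0, per_seq_ms=0.5, gpu_usd_per_hour=2.0)
for batch in (1, 8, 64):
    print(
        batch,
        round(model.step_ms(batch), 1),
        round(model.tokens_per_second(batch), 1),
        round(model.usd_per_million_tokens(batch), 2),
    )
```

Under these assumptions, larger batches lower the cost per token while raising the per-token latency each user sees. Real serving systems are not linear in batch size, so the shape of this curve is something to measure, not assume.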

Core areas

Serving Stacks

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

Latency and Throughput

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

KV Cache and Memory Pressure

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.
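As one example of the kind of estimate this area covers, the sketch below computes KV-cache size for a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the model shape in the usage line are hypothetical and are not taken from any specific model.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int,
    bytes_per_elem: int = 2,  # assumes 16-bit cache entries (fp16/bf16)
) -> int:
    """Estimate KV-cache size in bytes for a batch of sequences."""
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical shape: 32 layers, 8 KV heads of dimension 128,
# 16 concurrent sequences at 4096 tokens each.
total = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch=16)
print(total / 2**30, "GiB")
```

Because the estimate scales linearly with both sequence length and batch size, doubling either one doubles the memory the cache needs. That is why long contexts and high concurrency compete for the same GPU memory.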

Batching and Scheduling

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

Cost Models

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

GPU Cloud and AI Factory Notes

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

Benchmarking Methodology

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

Reliability and Observability

Notes and artifacts will collect practical tradeoffs, measurement patterns, and architecture implications.

Projects and artifacts

public repo

ai-hub

Practical learning and building hub for modern AI systems, including inference engineering, agents, security, EvalOps, and model architecture notes.

AI systems, Inference, EvalOps

What it demonstrates: How a broad AI systems knowledge base can organize production patterns, labs, and technical reading paths.

public repo

ga-gcp-ai-journal

Working journal for learning, building, and documenting generative AI workflows on Google Cloud.

GCP, Journal, AI workflows

What it demonstrates: How daily technical notes can capture architecture tradeoffs and experiment results without becoming a scratchpad.

public repo

ga-gcp-ai-labs

Hands-on Google AI and GCP lab projects, experiments, and reference implementations.

GCP, Vertex AI, Labs

What it demonstrates: How cloud AI labs can stay structured around validation, repeatable setup, and deployable patterns.

public repo

ga-gcp-ai-patterns

Cloud AI patterns and labs around production deployment.

GCP, AI patterns, Cloud

What it demonstrates: How cloud primitives shape deployable AI systems.

public repo

inference-engineering

Inference engineering notes and examples around serving, latency, throughput, and deployment topology.

Inference, Serving, GPU economics

What it demonstrates: Inference as the point where product quality, user experience, and infrastructure economics meet.

public repo

inference-engineering-book

Book workspace for AI Inference Engineering.

Book, Inference, Architecture

What it demonstrates: Long-form systems treatment of model serving and AI factory architecture.

Reading path

1. How to think about inference latency

2. Serving stack map

3. Benchmarking methodology

4. Cost/latency calculator

5. AI factory implications