Saturday, April 4, 2026

Deep Dive into Kubernetes Networking

 


A Kubernetes cluster is a dynamic network. Pods are ephemeral: a Pod's IP address changes on every restart.

Containers within a Pod share a single network namespace.

The Kubernetes Networking Model:

1) Every Pod receives a unique, cluster-wide IP address.

2) All Pods on the same node can communicate directly without NAT.

3) All Pods on different nodes can communicate directly without NAT.

4) The IP a Pod sees for itself is the same IP other Pods use to reach it [flat network].

Kubernetes specifies what is required; CNI plugins decide how to implement it.

Communication Patterns in Kubernetes:

Container to Container - within the same Pod, via loopback [127.0.0.1]

Pod to Pod - direct IP communication across nodes without address translation

Pod to Service - kube-proxy intercepts traffic and load-balances it to healthy endpoints

External to Service - exposed via NodePort, the LoadBalancer Service type, or an Ingress controller

Node to Pod - kubelet health checks and monitoring agents
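The first pattern can be illustrated outside a cluster. Because two containers in one Pod share a network namespace, they reach each other on 127.0.0.1; this sketch simulates that with two threads in a single process (a "sidecar" listener and an "app" client standing in for the two containers):

```python
# Two containers in one Pod share a network namespace, so they can talk
# over loopback. Here a "sidecar" thread listens on 127.0.0.1 and an
# "app" thread's role is played by the main thread connecting to it.
import socket
import threading

def sidecar(server: socket.socket) -> None:
    conn, _ = server.accept()          # accept the "app" container's connection
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"pong:" + data)  # echo back with a prefix

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # loopback only, ephemeral port
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=sidecar, args=(server,))
t.start()

# The "app" reaches the sidecar on localhost; no Pod IP is needed.
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b"ping")
    reply = c.recv(1024)

t.join()
server.close()
print(reply.decode())  # -> pong:ping
```

In a real Pod the two sides would be separate containers, but the loopback mechanics are identical.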


Kube-Proxy:

kube-proxy runs on every node, typically as a DaemonSet. It watches the API server for changes to Services and their endpoints; when a Service's selector matches running Pods, the control plane creates the corresponding endpoint objects. In iptables mode, kube-proxy maintains chains of iptables rules on each node that intercept Service traffic and forward it to backend Pods. In IPVS mode it uses the kernel's IP Virtual Server, a kernel-level virtual load balancer that can handle thousands of Services and route their traffic at the same time with better performance than long iptables chains.
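The proxy mode is selected in the kube-proxy configuration, usually stored in the kube-proxy ConfigMap in the kube-system namespace. A sketch of switching to IPVS mode (defaults vary by distribution):

```yaml
# Excerpt of a KubeProxyConfiguration; an empty mode defaults to iptables.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"       # "iptables" or "ipvs"
ipvs:
  scheduler: "rr"  # round-robin; IPVS supports other schedulers too
```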

Pod-to-Service traffic is handled by kube-proxy; Pod-to-Pod communication is handled by the CNI plugin.
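As an illustration of the Service side, a minimal manifest (the names are hypothetical): when the selector below matches running Pods, the control plane records their Pod IPs as endpoints, and kube-proxy programs each node to forward Service traffic to them.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web            # hypothetical Service name
spec:
  selector:
    app: web           # Pods with this label become endpoints
  ports:
    - port: 80         # ClusterIP port clients connect to
      targetPort: 8080 # container port on the backend Pods
```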

CoreDNS:

CoreDNS is the cluster DNS server, deployed as a Deployment in the kube-system namespace.
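CoreDNS behavior is driven by its Corefile, stored in the coredns ConfigMap in kube-system. A typical default looks roughly like this (exact plugins vary by version and distribution):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
```

The kubernetes plugin answers queries for Service and Pod records; everything else is forwarded upstream.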

Every Pod's /etc/resolv.conf is injected by the kubelet to point at CoreDNS.
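With the default dnsPolicy (ClusterFirst), the injected /etc/resolv.conf typically looks like the following; the nameserver is the kube-dns Service ClusterIP (commonly 10.96.0.10), and the search domains here assume a Pod in the default namespace of a cluster using the default cluster.local domain:

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```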

Pod Networking:

Each Pod has its own network namespace and a fully isolated network stack. The namespace contains virtual NICs, a routing table, and iptables rules.

The infra [pause] container creates and owns the network namespace for the Pod. All application containers in the Pod join the infra container's namespace at startup.

Virtual Ethernet [veth] pair: two virtual NICs connecting the Pod side and the node side. One end lives inside the Pod's network namespace [eth0] and the other end is attached to the node, typically to a Linux bridge [cbr0].

Traffic flow: Pod [eth0] -> veth pair -> host bridge -> node routing table -> destination

Cross-Node Communication:

Pod traffic between nodes uses either an overlay approach or an underlay approach.

The overlay approach encapsulates traffic on the source node and decapsulates it on the destination node.

The underlay approach routes Pod IPs directly over the physical network.

Modern CNIs like Calico and Cilium support both approaches.

Overlay (VXLAN/Geneve) - universally compatible and cloud-friendly. Encapsulation adds roughly 50 bytes of overhead per packet, so the Pod MTU is typically lowered to 1450 on a standard 1500-byte network.

Underlay - requires the physical network to accept and route Pod IPs, typically via BGP routing.
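In Calico, for example, the choice between overlay and underlay is made per IP pool. A sketch, with a hypothetical Pod CIDR:

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16   # hypothetical Pod CIDR
  vxlanMode: Always     # overlay: encapsulate all cross-node traffic
  ipipMode: Never       # IP-in-IP is an alternative encapsulation
  natOutgoing: true     # SNAT traffic leaving the cluster
```

Setting vxlanMode and ipipMode to Never and peering Calico with the fabric over BGP gives the underlay approach instead.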
