
pod watches stop sending data after a while, but remain connected in 1.18.4 #1755

Closed
blushingpenguin opened this issue Jul 29, 2020 · 21 comments
Assignees
Labels
Feedback General feedback networking question resolution/answer-provided Provided answer to issue, question or feedback.

Comments

@blushingpenguin

What happened:

After upgrading from 1.16.10 to 1.18.4, watches stop working after a period of time. The connection from the pod to the API server remains open (i.e. it can be seen with lsof), but no further watch data is sent.

What you expected to happen:

With versions below 1.16.10 (back to 1.13), the API server connection would periodically drop after a period of time (which I believe was expected) -- in that case we'd just reconnect.

How to reproduce it (as minimally and precisely as possible):

I don't have an easy reproduction (the code is a custom job scheduler written in .NET using the k8s csharp client) -- I can probably boil it down if reproduction is necessary.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    1.18.4
  • Size of cluster (how many worker nodes are in the cluster?)
    3 x b2ms
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
    dotnet / node microservices
@ghost ghost added the triage label Jul 29, 2020
@ghost

ghost commented Jul 29, 2020

Triage required from @Azure/aks-pm

@ghost ghost added the action-required label Jul 31, 2020
@ghost ghost removed the triage label Aug 3, 2020
@ghost

ghost commented Aug 8, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Aug 8, 2020
@palma21 palma21 added Feedback General feedback question and removed Needs Attention 👋 Issues needs attention/assignee/owner Top 10 Issue action-required labels Aug 10, 2020
@ghost ghost added the stale Stale issue label Oct 10, 2020
@ghost

ghost commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@blushingpenguin
Author

this was still happening as of a few days ago

@ghost ghost removed the stale Stale issue label Oct 10, 2020
@arsnyder16

@blushingpenguin I was experiencing this as well. At one point it was very intermittent and then it started happening consistently. I was advised by the AKS team to configure my watches to be shorter than 4 minutes, which is the default request timeout for the Azure load balancer; this did resolve my issue. Unfortunately these silent failures are problematic in that there are internal workloads within the cluster that have watches longer than 4 minutes.

From my understanding this is the load balancer on the control plane side, so its configuration is out of the user's control, but maybe we could get some clarification from someone on the AKS team @juan-lee @palma21

FWIW, I was on 1.17 and using Node.js, and I would see it very infrequently in previous versions as well. Not sure if/how the k8s version correlates to the issue.

Here are some related issues
#1877
#1052

@thomasfrederikhoeck

thomasfrederikhoeck commented Oct 12, 2020

@blushingpenguin I'm experiencing it with long-lived applications in the cluster that deploy k8s Jobs using in-cluster comms. The connection times out silently after 4 minutes.

I think the reason is that the load balancer AKS uses for in-cluster calls to the API has "TCP reset" disabled, but I don't think there is any way to change this.

@bergeron

bergeron commented Oct 13, 2020

Seeing this too. I think it happens when the watcher receives no events for > 4 minutes.

I wrote a C# program that reproduces the problem: https://gist.github.com/bergeron/10c88fd26aa683619fc75cfd85f63acf

When I modify the watched resource at least once every 4 minutes, everything works:

  • The watcher receives the modification events.
  • The watcher is closed after ~30 minutes (expected, the application should re-open the watch).

But when I modify the watched resource less often than every 4 minutes:

  • The watcher receives no modification events.
  • The watcher is never closed (no OnClosed event or exception).

This happens both when running the watcher outside the Kubernetes cluster (connecting to the API server via *.azmk8s.io:443), and also when running the watcher inside a pod in the Kubernetes cluster (connecting via KUBERNETES_SERVICE_HOST:KUBERNETES_SERVICE_PORT or 10.0.0.1:443).

I didn't find the AKS version to have any effect. It reproduced on AKS 1.16.7, 1.16.13, 1.16.15, 1.17.9, 1.17.11, 1.18.6, 1.18.8, 1.19.0. I don't see this problem on kubeadm 1.15. I haven't tested other Kubernetes environments.

Setting timeoutSeconds to < 4 minutes on the watch call seems to work though. The connection closes before it starts missing events and our application can restart the watch.
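For reference, here is a minimal sketch of the same workaround using the kubernetes Python client rather than the C# client from the gist (the namespace and event handling are placeholders): the watch asks the API server to end the stream after 230 seconds, comfortably under the ~4-minute idle timeout, so the client re-opens the connection before it can silently go dead.

from kubernetes import client, config, watch

# Sketch only: keep watch lifetimes below the ~4-minute load balancer idle
# timeout and re-open the watch in a loop.
config.load_incluster_config()  # use config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
while True:
    # timeout_seconds asks the API server to close the watch after 230 s,
    # i.e. before the ~4-minute idle timeout can drop the connection silently.
    for event in w.stream(v1.list_namespaced_pod, namespace="default",
                          timeout_seconds=230):
        print(event["type"], event["object"].metadata.name)
    # stream() returning means the watch ended cleanly; loop to restart it.
    # (A real client would also track resourceVersion to avoid replaying events.)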

@thomasfrederikhoeck

@bergeron Someone created a very similar script (Python though) on #1052 and only experienced it on AKS, not on AWS EKS.

@nilo19
Member

nilo19 commented Oct 15, 2020

I used the simple C# app from @bergeron and noticed that there is a TCP RST packet after ~5 min of idle time. Actually, this behavior is by design: the AKS underlay SLB sends a RST signal to the client after ~4 min of idle time. The client must handle the signal and re-establish the connection. There are a couple of ways to handle this:

  1. turn the watch timeout down to less than 4 min
  2. configure TCP keepalive

@blushingpenguin
Author

For #2, with the k8s C# client there is no way of turning on TCP keepalives as it uses HttpClient: dotnet/runtime#31267.
I'm not sure what you mean by #1.

I also don't see a TCP RST (if there were a RST, the connection would fail on the client -- that would be fine as it just needs to restart the watch -- however, as no RST is sent, the connection actually stays open from the client's perspective).

@thomasfrederikhoeck

thomasfrederikhoeck commented Oct 16, 2020

@nilo19 I never see a TCP RST. It almost seems like the SLB doesn't have RST enabled. I'm on AKS 1.17.7. NOTE: all of this is in-cluster comms.

If I apply the following Python script

import logging
import time

from kubernetes import client, config

logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.DEBUG)

config.load_incluster_config()

v1 = client.CoreV1Api()
logging.info('Calling 1st time')
v1.list_namespaced_pod('default')
logging.info('Sleeping 5 minutes')
time.sleep(300)
logging.info('Calling 2nd time')

# this call will time out after 15 minutes
v1.list_namespaced_pod('default')
logging.info('OK')

I get the following logs.

2020-10-16 07:41:29,261 INFO     Calling 1st time
2020-10-16 07:41:29,344 DEBUG    response body: {Removed my AKS specifics}
2020-10-16 07:41:29,352 INFO     Sleeping 5 minutes
2020-10-16 07:46:29,447 INFO     Calling 2nd time
2020-10-16 08:02:07,806 WARNING  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)")': /api/v1/namespaces/prefect/pods
2020-10-16 08:02:07,838 DEBUG    response body: {Removed my AKS specifics}

The following happens:

  1. The first call succeeds.
  2. The script waits for 5 min.
  3. A second call is made on the same connection.
  4. That call times out after ~15 min because no response is ever received (a sketch of a client-side timeout that would surface this sooner follows below).
  5. The retry then reconnects and succeeds.
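The ~15-minute hang in step 4 happens because urllib3 has no read timeout by default, so the dead connection is only noticed once the OS gives up on it. A minimal sketch of a mitigation, assuming the generated API methods of the Python client (they accept a client-side _request_timeout of (connect, read) seconds):

import logging
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

try:
    # Fail after 30 s of silence instead of hanging for ~15 minutes on a
    # connection the load balancer has already dropped.
    v1.list_namespaced_pod('default', _request_timeout=(5, 30))
except Exception as exc:
    logging.warning('pod list timed out, retrying: %s', exc)
    v1.list_namespaced_pod('default', _request_timeout=(5, 30))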

@marwanad

@thomasfrederikhoeck are you capturing the tcpdumps on the node or the pod itself?

@thomasfrederikhoeck

@marwanad sorry, I don't understand what you mean. I run it as a Deployment on the cluster, using the python:3.7 image as the base image. I then look at the logs and get what I have described above. You might need to set up a Role and RoleBinding to allow the pod to connect to the k8s API.

@marwanad

marwanad commented Oct 19, 2020

@marwanad sorry, I don't understand what you mean. I run it as a Deployment on the cluster, using the python:3.7 image as the base image. I then look at the logs and get what I have described above. You might need to set up a Role and RoleBinding to allow the pod to connect to the k8s API.

My bad if I wasn't clear enough. I was curious about the part "I never see a TCP RST." - how are you validating that?

edit: All I'm saying is that looking at those logs isn't enough to tell you whether you receive a RST packet or not -- see kubernetes-client/python#1132. You need to capture a tcpdump from both the host and the pod.

@thomasfrederikhoeck

@marwanad pardon my ignorance, but what I meant was that the client never receives a message that the connection has been closed.

Isn't the issue you are pointing to the other way around, where the connection is intentionally closed by the client using the Python kubernetes package (but then never really closed)?

I guess it would be pretty easy for someone from the AKS team to check whether the SLB used for the k8s API has TCP Reset enabled or not, right?

@nilo19
Member

nilo19 commented Oct 20, 2020

@thomasfrederikhoeck the TCP reset on our LB is on by default and the idle timeout is 4 min. I created a watch and let it idle for 4 min, after which I could capture a TCP RST using tcpdump. I believe the issue arises because the k8s client fails to handle the RST packet.

@thomasfrederikhoeck

@nilo19 Okay, that makes sense. Thank you (and @marwanad ) for the answers 😄

@yangl900

yangl900 commented Oct 27, 2020

Hi @thomasfrederikhoeck, just want to let you know that I picked up this issue and did some testing. I'm still doing more testing, but I posted the current test code and results in this repo: https://github.com/yangl900/knet#client-python-tests.

My findings so far:

  • The k8s Python client specifically does NOT have TCP keepalive turned on.
  • AKS recently deployed an improvement so that a RST packet is sent to the client after 4 min 10 s of idle time. In my testing I can observe the RST packet, and I can see that the k8s Python client handles it well. I used the Python script you shared and it passed my testing consistently.

I'm still trying to see if there is a workaround to enable keepalive for urllib3; I'm not an expert on Python :)
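One possible shape for that workaround, as a sketch only (untested against AKS here, and the keepalive intervals are illustrative): urllib3 builds its sockets from HTTPConnection.default_socket_options, so extending that list with TCP keepalive options before the kubernetes client opens any connections should keep the connection from sitting idle for 4 minutes.

import socket
import urllib3

# Sketch: enable TCP keepalive for every connection urllib3 creates, so the
# connection to the API server never sits idle long enough to be dropped.
keepalive_opts = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
if hasattr(socket, 'TCP_KEEPIDLE'):  # the fine-grained knobs are Linux-only
    keepalive_opts += [
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # idle seconds before probing
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # seconds between probes
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5),     # failed probes before giving up
    ]

urllib3.connection.HTTPConnection.default_socket_options = (
    urllib3.connection.HTTPConnection.default_socket_options + keepalive_opts
)

# Configure the kubernetes client only after the socket options are in place,
# so its connection pool picks them up.
from kubernetes import client, config
config.load_incluster_config()
v1 = client.CoreV1Api()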

You can deploy the same script to test it yourself:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["configmaps"]
  verbs: ["get", "watch", "list", "update", "patch", "delete", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
subjects:
# You can specify more than one "subject"
- kind: ServiceAccount
  name: default # "name" is case sensitive
  apiGroup: ""
roleRef:
  # "roleRef" specifies the binding to a Role / ClusterRole
  kind: Role #this must be Role or ClusterRole
  name: pod-reader # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: knet-apiserver-watcher-py
spec:
  selector:
    matchLabels:
      knet: apiserver-watcher-py
  replicas: 1
  template:
    metadata:
      labels:
        knet: apiserver-watcher-py
    spec:
      containers:
        - name: watcher
          image: yangl/apiserver-watcher-py
          imagePullPolicy: Always
        - name: tcpdump
          image: corfr/tcpdump
          command:
            - "/usr/sbin/tcpdump"
            - "-i"
            - "any"
            - "-nn"

@yangl900 yangl900 self-assigned this Oct 30, 2020
@yangl900

Hi @blushingpenguin @bergeron @arsnyder16 - just want to let you know that with the recent AKS infrastructure release, you should now always see a RST. We still recommend that the client side turn on TCP keepalive, but I understand not all k8s clients do that, e.g. Python does not have it today. Please let me know if you still experience any issues with watches. I have some testing results posted at https://github.com/yangl900/knet and will keep posting.

@thomasfrederikhoeck

@yangl900 sorry for the late reply. I had a chance to check it out now and everything is working smoothly for me. Thank you! :-)

@palma21 palma21 added the resolution/answer-provided Provided answer to issue, question or feedback. label Nov 3, 2020
@ghost

ghost commented Nov 5, 2020

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

This issue was closed.