Kubernetes – Troubleshooting cert-manager and Ingress Controllers (any cluster, with a focus on EKS and AKS)

Having trouble with cert-manager? The checks below may guide you to the solution.

The examples use resource names such as ingress-nginx for the namespace of the nginx ingress controller. Before proceeding, update all the commands below to match the names of your actual resources.

Troubleshooting your Cluster and Cert-Manager installation:

1 – Check which ingress controller is installed. If none is installed, that may be the problem. Example installation:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx <+ params...>
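To confirm that a controller is actually running, a small helper like the one below can list the IngressClass resources and the controller pods. This is only a sketch: the function name check_ingress_controller is made up for this article, and the ingress-nginx default namespace is the one assumed throughout this post.

```shell
# Hypothetical helper: list IngressClass resources and the controller pods.
# The namespace defaults to ingress-nginx (the name assumed in this article).
check_ingress_controller() {
  local ns="${1:-ingress-nginx}"
  # An empty list here usually means no ingress controller is installed.
  kubectl get ingressclass
  # The controller pods should be in the Running state.
  kubectl get pods --namespace "$ns"
}
```

Call it as check_ingress_controller, or pass your own namespace as the first argument.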

2 – Which kind of issuer are you using? Depending on the kind, you may need to assign the correct roles to the cluster or, in the case of AWS Route53, create a credential with the necessary permissions.

Example of an IAM policy for Route53:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "route53:GetChange",
      "Resource": "arn:aws:route53:::change/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "arn:aws:route53:::hostedzone/*"
    },
    {
      "Effect": "Allow",
      "Action": "route53:ListHostedZonesByName",
      "Resource": "*"
    }
  ]
}

Example of permissions assignment for AKS using Terraform:

# Create an identity to be used by your AKS cluster:
resource "azurerm_user_assigned_identity" "aks_01" {
  name                = "${local.client}-${lower(local.environment)}-aks-01-identity"
  resource_group_name = data.azurerm_resource_group.rg_01.name
  location            = local.location
}
# Assign permission to your private DNS Zone:
resource "azurerm_role_assignment" "role_assign_aks_01_dns" {
  scope                = data.azurerm_private_dns_zone.dns_zone_01.id
  role_definition_name = "Private DNS Zone Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
# Assign permission to your AKS Network:
resource "azurerm_role_assignment" "role_assign_nw_01" {
  scope                = azurerm_virtual_network.aks_01_nw_01.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
# Assign permission to your MAIN Network (needed when using private IPs that reside in another Resource Group; set the scope to the network where the Public IP or Prefix resides):
resource "azurerm_role_assignment" "role_assign_aks_01_nw_01" {
  scope                = data.azurerm_virtual_network.nw_01.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
# Assign permission to your MAIN Resource Group (Microsoft.Network/routeTables/write is also necessary for the AKS cluster to reach resources in the other group):
resource "azurerm_role_assignment" "role_assign_aks_01_rg" {
  scope                = data.azurerm_resource_group.rg_01.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}

3 – Did you disable cert-manager validation for the namespace? The label must be set so that the system resources cert-manager requires to bootstrap TLS can be created in its own namespace. Run the following:

kubectl label namespace ingress-nginx cert-manager.io/disable-validation=true --overwrite

4 – Double-check that you are installing the correct cert-manager for your environment. Example for EKS/AKS clusters:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager <+ params...>
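After the install, it may help to confirm that the cert-manager pods are running and that its CRDs exist. The helper below is a sketch under assumptions: check_cert_manager is a made-up name, and you must pass the namespace where you installed the chart.

```shell
# Hypothetical helper: verify the cert-manager pods and CRDs.
# Usage: check_cert_manager <namespace-where-cert-manager-is-installed>
check_cert_manager() {
  local ns="${1:?usage: check_cert_manager <namespace>}"
  # The controller, cainjector and webhook pods should all be Running.
  kubectl get pods --namespace "$ns"
  # These CRDs must exist before ClusterIssuers or Certificates can be created.
  kubectl get crd \
    certificates.cert-manager.io \
    clusterissuers.cert-manager.io \
    orders.acme.cert-manager.io \
    challenges.acme.cert-manager.io
}
```

If the second command reports that the CRDs are not found, the chart was probably installed without its CRDs.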

5 – After installing cert-manager, did you create your ClusterIssuer? If not, here is an example using AWS Route53 with access/secret keys (valid for clusters running anywhere; on an EKS cluster you may use IAM roles instead):

clusterissuer.yaml (Example with wildcard certificate and Production/Staging Let’s Encrypt issuers)

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-dns-aws
spec:
  acme:
    email: your@email.com
    privateKeySecretRef:
      name: letsencrypt-prod-dns-aws
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          # cnameStrategy: Follow
          route53:
            region: us-west-2
            hostedZoneID: ZXXXXXXXXXXXXXXXXX # Optional; needed only if you did not allow route53:ListHostedZonesByName
            # accessKeyID: AKIXXXXXXXXXXXXXXX # Hard-code the access key ID here, or reference a Secret as shown below
            accessKeyIDSecretRef:
              name: route53-credentials
              key: accessKey
            secretAccessKeySecretRef:
              name: route53-credentials
              key: secretKey
        selector:
          dnsZones:
            - yourdomain.com
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-stag-dns-aws
spec:
  acme:
    email: your@email.com
    privateKeySecretRef:
      name: letsencrypt-stag-dns-aws
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          # cnameStrategy: Follow
          route53:
            region: us-west-2
            hostedZoneID: ZXXXXXXXXXXXXXXXXX # Optional; needed only if you did not allow route53:ListHostedZonesByName
            accessKeyIDSecretRef:
              name: route53-credentials
              key: accessKey
            secretAccessKeySecretRef:
              name: route53-credentials
              key: secretKey
        selector:
          dnsZones:
            - yourdomain.com
          #dnsNames:
           # - "*.yourdomain.com"
           # - "yourdomain.com"

Now apply:

kubectl apply -f clusterissuer.yaml
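Both issuers reference a Secret named route53-credentials with the keys accessKey and secretKey, so that Secret has to exist or the DNS01 solver cannot authenticate. A minimal sketch for creating it follows. The function name create_route53_secret is hypothetical, and the namespace is an assumption: for a ClusterIssuer the Secret is normally read from the namespace where cert-manager itself runs.

```shell
# Hypothetical helper: create the Secret referenced by the ClusterIssuers.
# Usage: create_route53_secret <cert-manager-namespace> <access-key-id> <secret-access-key>
create_route53_secret() {
  local ns="$1" access_key="$2" secret_key="$3"
  # The key names (accessKey, secretKey) must match the
  # accessKeyIDSecretRef / secretAccessKeySecretRef entries in clusterissuer.yaml.
  kubectl create secret generic route53-credentials \
    --namespace "$ns" \
    --from-literal=accessKey="$access_key" \
    --from-literal=secretKey="$secret_key"
}
```

Avoid leaving real keys in your shell history; reading them from environment variables is usually safer.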

6 – Did you create the Ingress resource? It is not created automatically; you also need this step after installing cert-manager.
As soon as you apply it, you will be able to find it with kubectl get ingress --all-namespaces. Example:

ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
  namespace: prd
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod-dns-aws
    # cert-manager.io/cluster-issuer: letsencrypt-stag-dns-aws # Uncomment to use staging LE for this ingress
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS" # Use HTTPS (443). If you want to use HTTP (80), comment out this line
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "off" # Disable proxy ssl verification. Useful when you are using self-signed certificates for internal resources.
    ######### USE THIS TO GROUP INGRESS CONTROLLERS ##########
    ### FROM DIFFERENT NAMESPACES USING THE SAME INGRESS IP ###
    # alb.ingress.kubernetes.io/group.name: "group"
    # alb.ingress.kubernetes.io/group.order: "1"
    #######################################################
    # Some parameters you may need:
    # nginx.ingress.kubernetes.io/proxy-body-size: "64m"
    # nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    # nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    # nginx.ingress.kubernetes.io/proxy-buffer-size: "128k"
    # nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
    # nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
spec:
  ingressClassName: nginx
  rules:
    - host: yourdomain.com
      http:
        paths:
          - pathType: Prefix
            backend:
              service:
                name: your-nginx-alb
                port:
                  number: 443
            path: /
    - host: www.yourdomain.com
      http:
        paths:
          - pathType: Prefix
            backend:
              service:
                name: your-nginx-alb
                port:
                  number: 443
            path: /
  tls:
    - hosts:
      - "yourdomain.com"
      - "*.yourdomain.com"
      secretName: yourdomain-tls-secret

Now apply:

kubectl apply -f ingress.yaml
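Issuance is not instant, so it can be useful to wait for the Certificate to become Ready instead of polling by hand. The sketch below assumes the Certificate has the same name as the secretName in the Ingress (yourdomain-tls-secret in this example); wait_for_certificate is a made-up helper name.

```shell
# Hypothetical helper: block until a Certificate reports Ready, or time out.
# Usage: wait_for_certificate <namespace> <certificate-name> [timeout]
wait_for_certificate() {
  local ns="$1" name="$2" timeout="${3:-300s}"
  kubectl wait --namespace "$ns" \
    --for=condition=Ready "certificate/$name" \
    --timeout="$timeout"
}
```

For the Ingress above, that would be wait_for_certificate prd yourdomain-tls-secret. A timeout here is the cue to move on to the troubleshooting checks.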

7 – Your certificates should now be ready to use, and you can deploy your services, load balancers, and pods.

Still not working?

Well, let's dig into the problem!

Troubleshooting your Cluster and Cert-Manager configuration:

PS: Please run all the checks below before deleting any resources. The problem may be external, such as a missing AWS permission or a port that needs to be opened. That is why it is always better to check all the logs before deleting anything.

Check for any existing Network Policy:

kubectl describe networkpolicy -n prd

Check Ingress

kubectl get ingress --all-namespaces
kubectl describe ingress xxxxxxx

Check the Service selectors and write them down:

kubectl get services --all-namespaces
# or specific namespace -> kubectl get svc -n prd
kubectl describe svc -n prd nginx-edge-alb

Check that your DNS is responding properly. (Note that you may need to set “cnameStrategy: Follow” in the ClusterIssuer if your DNS record is a CNAME.)

nslookup yourdomain.com
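With a DNS01 solver, the record that matters during validation is the _acme-challenge TXT record, not the A record. The helper below only builds that record name (stripping a leading wildcard), so you can query it while a challenge is pending. The function name acme_challenge_record is made up for this sketch.

```shell
# Hypothetical helper: print the DNS01 challenge record name for a domain.
# A leading "*." is stripped, because wildcard and apex names share one record.
acme_challenge_record() {
  local domain="${1#\*.}"
  printf '_acme-challenge.%s\n' "$domain"
}
```

For example, dig +short TXT "$(acme_challenge_record '*.yourdomain.com')" queries _acme-challenge.yourdomain.com. An empty answer while a challenge is pending suggests the record was never created, which often points back to the Route53 permissions.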

Now check whether the Pod labels match the Service/LoadBalancer selectors:

kubectl get pods -n prd
kubectl describe pod -n prd nginx-86657c549f-tkgsx
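Instead of comparing labels and selectors by eye, you can let kubectl do the matching. The sketch below reads the selector of a Service and lists the pods it selects; pods_for_service is a hypothetical name, and the Service is assumed to have a non-empty selector.

```shell
# Hypothetical helper: list the pods matched by a Service's selector.
# Usage: pods_for_service <namespace> <service-name>
pods_for_service() {
  local ns="$1" svc="$2" selector
  # Render the selector map as "key=value,key=value,".
  selector="$(kubectl get service "$svc" --namespace "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  # Drop the trailing comma left by the template.
  selector="${selector%,}"
  # No pods listed here means the selector matches nothing.
  kubectl get pods --namespace "$ns" --selector "$selector" --show-labels
}
```

With the names used in this article, pods_for_service prd nginx-edge-alb should list the nginx pods. If it lists none, the Service will have no endpoints.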

Check the ClusterIssuer to see whether your ACME account is registered. (Remember: a ClusterIssuer is not namespaced.)

kubectl describe clusterissuer letsencrypt-prod-dns-aws

Example of a good return:

Status:
  Acme:
    Last Registered Email:  your@email.com
    Uri:                    https://acme-v02.api.letsencrypt.org/acme/acct/XXXXXXXXXXX
  Conditions:
    Last Transition Time:  2023-04-21T17:22:04Z
    Message:               The ACME account was registered with the ACME server
    Observed Generation:   1
    Reason:                ACMEAccountRegistered
    Status:                True
    Type:                  Ready
Events:                    <none>
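If you only need the Ready status, for example in a script, a JSONPath query is shorter than reading the full describe output. This is a sketch; issuer_ready is a made-up helper name.

```shell
# Hypothetical helper: print the Ready condition of a ClusterIssuer and
# return success only when it is "True".
# Usage: issuer_ready <clusterissuer-name>
issuer_ready() {
  local status
  status="$(kubectl get clusterissuer "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  echo "$status"
  [ "$status" = "True" ]
}
```

For example: issuer_ready letsencrypt-prod-dns-aws && echo "issuer is ready".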

Check Certificates

kubectl get certificate --all-namespaces
kubectl describe certificate -n prd yourdomain-tls-secret

Check Orders

kubectl get orders -n prd
kubectl describe order -n prd yourdomain-tls-secret-l5q5v-3169669692

# As a last resort, if you cannot find the error, try deleting the order to see whether that solves the problem:
kubectl delete order -n prd yourdomain-tls-secret-l5q5v-3169669692

Check Endpoints

kubectl get endpoints -n prd
kubectl describe endpoints -n prd nginx-alb

# As a last resort, if you cannot find the error, try deleting the endpoint to see whether that solves the problem:
kubectl delete endpoints -n prd nginx-alb

Check cert-manager logs

kubectl get pods -n ingress-nginx
kubectl logs -f -n ingress-nginx cert-manager-7964b89d66-st42t
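Pod names change on every restart, so it may be easier to address the Deployment instead. The sketch below assumes the Deployment is called cert-manager, which is typical when the Helm release is named cert-manager; cert_manager_logs is a made-up helper name.

```shell
# Hypothetical helper: show recent cert-manager controller logs
# without looking up the pod name first.
# Usage: cert_manager_logs <namespace> [number-of-lines]
cert_manager_logs() {
  local ns="$1" lines="${2:-200}"
  kubectl logs --namespace "$ns" deployment/cert-manager --tail="$lines"
}
```

With the namespace used in this article, that is cert_manager_logs ingress-nginx.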

Check challenges

kubectl get challenges --all-namespaces
kubectl describe challenge -n prd yourdomain-tls-secret-ppb5j-734007623-184628386

# As a last resort, if you cannot find the error, try deleting the challenge to see whether that solves the problem:
kubectl delete challenge -n prd yourdomain-tls-secret-ppb5j-734007623-184628386
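The resources checked above form a chain: a Certificate creates a CertificateRequest, which creates an Order, which creates one Challenge per domain. Listing them together shows where the chain stops. This is a sketch; acme_chain is a made-up helper name.

```shell
# Hypothetical helper: list the whole issuance chain in one namespace.
# Usage: acme_chain <namespace>
acme_chain() {
  local ns="$1"
  kubectl get certificate,certificaterequest,order,challenge --namespace "$ns"
}
```

For example, acme_chain prd. The last resource that exists but is not ready or valid is usually the one to describe first.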
