Having trouble with cert-manager? The checks below may guide you to the solution.
This example uses resource names such as ingress-nginx for the NGINX ingress controller namespace. Before proceeding, update all the commands below to match the names of your actual resources.
Troubleshooting your Cluster and Cert-Manager installation:
1 – Check whether an ingress controller is installed. If none is, that may be the problem. Example installation:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx <+ params...>
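Before installing anything, it may help to confirm whether a controller is already present. The helper below is only a sketch: the function name is hypothetical, and the label selector assumes the controller was installed from the official ingress-nginx chart, which labels its pods with app.kubernetes.io/name=ingress-nginx.

```shell
# Hypothetical helper: show signs of an installed ingress controller.
find_ingress_controller() {
  # IngressClass objects are cluster-scoped; an empty list suggests no controller.
  kubectl get ingressclass
  # Assumption: pods carry the label set by the official ingress-nginx Helm chart.
  kubectl get pods --all-namespaces -l app.kubernetes.io/name=ingress-nginx
}

# Usage: find_ingress_controller
```

If both commands return nothing, the controller is most likely missing or was installed with different labels.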
2 – Which kind of issuer are you using? Depending on the kind, you may need to assign the correct roles to the cluster or, in the case of AWS Route53, create a credential with the necessary permissions.
Example IAM policy for Route53:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "route53:GetChange",
      "Resource": "arn:aws:route53:::change/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "arn:aws:route53:::hostedzone/*"
    },
    {
      "Effect": "Allow",
      "Action": "route53:ListHostedZonesByName",
      "Resource": "*"
    }
  ]
}
Example of permissions assignment for AKS using Terraform:
# Create an identity to be used by your AKS cluster:
resource "azurerm_user_assigned_identity" "aks_01" {
  name                = "${local.client}-${lower(local.environment)}-aks-01-identity"
  resource_group_name = data.azurerm_resource_group.rg_01.name
  location            = local.location
}
# Assign permission on your private DNS zone:
resource "azurerm_role_assignment" "role_assign_aks_01_dns" {
  scope                = data.azurerm_private_dns_zone.dns_zone_01.id
  role_definition_name = "Private DNS Zone Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
# Assign permission on your AKS network:
resource "azurerm_role_assignment" "role_assign_nw_01" {
  scope                = azurerm_virtual_network.aks_01_nw_01.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
# Assign permission on your MAIN network (needed when using private IPs that reside in another resource group; set the scope to the network where the public IP or prefix resides):
resource "azurerm_role_assignment" "role_assign_aks_01_nw_01" {
  scope                = data.azurerm_virtual_network.nw_01.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
# Assign permission on your MAIN resource group (Microsoft.Network/routeTables/write; also needed so the AKS cluster can reach resources in the other group):
resource "azurerm_role_assignment" "role_assign_aks_01_rg" {
  scope                = data.azurerm_resource_group.rg_01.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.aks_01.principal_id
}
3 – Did you disable cert-manager validation on the namespace? This label must be set so that the system resources cert-manager requires to bootstrap TLS can be created in its own namespace. Run the following:
kubectl label namespace ingress-nginx cert-manager.io/disable-validation=true --overwrite
4 – Double-check that you are installing the correct cert-manager for your environment. Example for EKS/AKS clusters:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager <+ params...>
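After the install, one thing worth verifying is that the cert-manager CustomResourceDefinitions exist. Depending on the chart version, the Helm chart only installs them when the corresponding chart value is set. The sketch below uses a hypothetical function name and checks the six core CRDs one by one.

```shell
# Hypothetical helper: verify that the cert-manager CRDs are registered.
# Prints each missing CRD and returns 1 if any is absent, 0 otherwise.
check_cert_manager_crds() {
  missing=0
  for crd in \
    certificates.cert-manager.io \
    certificaterequests.cert-manager.io \
    issuers.cert-manager.io \
    clusterissuers.cert-manager.io \
    orders.acme.cert-manager.io \
    challenges.acme.cert-manager.io
  do
    if ! kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "missing CRD: $crd"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_cert_manager_crds && echo "all cert-manager CRDs present"
```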
5 – After installing cert-manager, did you create your ClusterIssuer? If not, here is an example using AWS Route53 with access/secret keys (valid for clusters running anywhere; on an EKS cluster you may use IAM roles instead):
clusterissuer.yaml
(Example with wildcard certificate and Production/Staging Let’s Encrypt issuers)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-dns-aws
spec:
  acme:
    email: your@email.com
    privateKeySecretRef:
      name: letsencrypt-prod-dns-aws
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          # cnameStrategy: Follow
          route53:
            region: us-west-2
            hostedZoneID: ZXXXXXXXXXXXXXXXXX # Optional; needed only if you did not allow route53:ListHostedZonesByName
            # accessKeyID: AKIXXXXXXXXXXXXXXX # Either hard-code the access key ID here or reference a Secret as shown below
            accessKeyIDSecretRef:
              name: route53-credentials
              key: accessKey
            secretAccessKeySecretRef:
              name: route53-credentials
              key: secretKey
        selector:
          dnsZones:
            - yourdomain.com
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-stag-dns-aws
spec:
  acme:
    email: your@email.com
    privateKeySecretRef:
      name: letsencrypt-stag-dns-aws
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          # cnameStrategy: Follow
          route53:
            region: us-west-2
            hostedZoneID: ZXXXXXXXXXXXXXXXXX # Optional; needed only if you did not allow route53:ListHostedZonesByName
            accessKeyIDSecretRef:
              name: route53-credentials
              key: accessKey
            secretAccessKeySecretRef:
              name: route53-credentials
              key: secretKey
        selector:
          dnsZones:
            - yourdomain.com
          #dnsNames:
          #  - "*.yourdomain.com"
          #  - "yourdomain.com"
Now apply:
kubectl apply -f clusterissuer.yaml
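Both ClusterIssuers reference a Secret named route53-credentials with the keys accessKey and secretKey, so that Secret has to exist as well. A minimal sketch for creating it follows; the function name is hypothetical, and it assumes cert-manager reads ClusterIssuer Secrets from the namespace it was installed in, which is the usual behavior with the Helm chart but may differ in your setup.

```shell
# Hypothetical helper: create the Secret referenced by the ClusterIssuers.
# $1 = namespace cert-manager reads ClusterIssuer Secrets from
#      (assumption: the namespace where cert-manager is installed)
# $2 = AWS access key ID
# $3 = AWS secret access key
create_route53_secret() {
  kubectl create secret generic route53-credentials \
    --namespace "$1" \
    --from-literal=accessKey="$2" \
    --from-literal=secretKey="$3"
}

# Usage: create_route53_secret ingress-nginx "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY"
```

Passing the values through environment variables, as in the usage line, keeps the keys out of the script itself.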
6 – Did you create the Ingress resource? It is not created automatically; you also need this step after installing cert-manager.
As soon as you do this, you will be able to find it with kubectl get ingress --all-namespaces. Example:
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
  namespace: prd
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod-dns-aws
    # cert-manager.io/cluster-issuer: letsencrypt-stag-dns-aws # Uncomment to use staging LE for this ingress
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS" # Use HTTPS (443). To use HTTP (80), comment out this line
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "off" # Disable proxy SSL verification. Useful when you are using self-signed certificates for internal resources.
    ######### USE THIS TO GROUP INGRESS CONTROLLERS ##########
    ### FROM DIFFERENT NAMESPACES USING THE SAME INGRESS IP ###
    # alb.ingress.kubernetes.io/group.name: "group"
    # alb.ingress.kubernetes.io/group.order: "1"
    #######################################################
    # Some parameters you may need:
    # nginx.ingress.kubernetes.io/proxy-body-size: "64m"
    # nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    # nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    # nginx.ingress.kubernetes.io/proxy-buffer-size: "128k"
    # nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
    # nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
spec:
  ingressClassName: nginx
  rules:
    - host: yourdomain.com
      http:
        paths:
          - pathType: Prefix
            backend:
              service:
                name: your-nginx-alb
                port:
                  number: 443
            path: /
    - host: www.yourdomain.com
      http:
        paths:
          - pathType: Prefix
            backend:
              service:
                name: your-nginx-alb
                port:
                  number: 443
            path: /
  tls:
    - hosts:
        - "yourdomain.com"
        - "*.yourdomain.com"
      secretName: yourdomain-tls-secret
Now apply:
kubectl apply -f ingress.yaml
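Issuance through DNS-01 can take a few minutes, so rather than polling by hand you may prefer to wait on the Certificate. The sketch below relies on cert-manager creating a Certificate named after the secretName in the Ingress tls section; the function name and the default timeout are assumptions.

```shell
# Hypothetical helper: block until the Certificate reports Ready, or time out.
# $1 = namespace, $2 = Certificate name, $3 = timeout (optional, default 300s)
wait_for_certificate() {
  kubectl wait --for=condition=Ready "certificate/$2" -n "$1" --timeout="${3:-300s}"
}

# Usage: wait_for_certificate prd yourdomain-tls-secret
```

A timeout here is a hint to continue with the troubleshooting checks rather than proof of a specific failure.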
7 – Your certificates should now be ready to use, and you can deploy your services, load balancers, and pods.
Still not working?
Well, let's dig into the problem!
Troubleshooting your Cluster and Cert-Manager configuration:
PS: Please run all the checks below before deleting any resources. The problem may turn out to be external, such as an AWS permission or a port that needs to be opened. That is why it is always better to check all the logs before proceeding with deletes.
Check for any existing Network Policy:
kubectl describe networkpolicy -n prd
Check Ingress
kubectl get ingress --all-namespaces
kubectl describe ingress xxxxxxx
Check the Service selectors and write them down:
kubectl get services --all-namespaces
# or specific namespace -> kubectl get svc -n prd
kubectl describe svc -n prd nginx-edge-alb
Check that your DNS is responding properly (note that you may need to set "cnameStrategy: Follow" in the ClusterIssuer if your DNS record is a CNAME):
nslookup yourdomain.com
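With a DNS-01 solver, the record that matters during validation is the TXT record at _acme-challenge.<domain>, and a wildcard name is validated through the record of its base domain. The small helper below (hypothetical name) derives that record name so it can be queried directly; using dig for the lookup is an assumption, and nslookup -type=TXT works as well.

```shell
# Hypothetical helper: print the DNS-01 challenge record name for a domain.
# $1 = DNS name from the certificate, for example "*.yourdomain.com"
acme_challenge_record() {
  # Strip a leading "*." so a wildcard maps to the record of its base domain.
  name=${1#\*.}
  printf '_acme-challenge.%s\n' "$name"
}

# Usage while a challenge is pending:
#   dig +short TXT "$(acme_challenge_record '*.yourdomain.com')"
```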
Now check whether the Pod labels match the Service/LoadBalancer selectors:
kubectl get pods -n prd
kubectl describe pod -n prd nginx-86657c549f-tkgsx
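Instead of comparing the describe output by eye, the selector can be read from the Service and then used to list the pods it matches. This is a sketch with hypothetical function names; app=nginx in the usage lines is a placeholder label, not one taken from this setup.

```shell
# Hypothetical helper: print the selector of a Service.
# $1 = namespace, $2 = Service name
service_selector() {
  kubectl get svc "$2" -n "$1" -o jsonpath='{.spec.selector}'
}

# Hypothetical helper: list the pods matched by a label selector.
# $1 = namespace, $2 = label selector such as app=nginx (placeholder)
pods_for_selector() {
  kubectl get pods -n "$1" -l "$2" --show-labels
}

# Usage:
#   service_selector prd nginx-edge-alb
#   pods_for_selector prd app=nginx
```

An empty pod list for the Service's selector means the Service has nothing to route to.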
Check in the ClusterIssuer that your ACME account is registered (remember: a ClusterIssuer is not namespaced):
kubectl describe clusterissuer letsencrypt-prod-dns-aws
Example of a good return:
Status:
  Acme:
    Last Registered Email: your@email.com
    Uri: https://acme-v02.api.letsencrypt.org/acme/acct/XXXXXXXXXXX
  Conditions:
    Last Transition Time: 2023-04-21T17:22:04Z
    Message: The ACME account was registered with the ACME server
    Observed Generation: 1
    Reason: ACMEAccountRegistered
    Status: True
    Type: Ready
Events: <none>
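The same Ready condition can be read programmatically, which is handy in scripts. The sketch below (hypothetical function name) queries only the status of the condition whose type is Ready and succeeds when it is "True".

```shell
# Hypothetical helper: succeed only if the ClusterIssuer's Ready condition is "True".
# $1 = ClusterIssuer name
clusterissuer_ready() {
  status=$(kubectl get clusterissuer "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}

# Usage: clusterissuer_ready letsencrypt-prod-dns-aws && echo "issuer is ready"
```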
Check Certificates
kubectl get certificate --all-namespaces
kubectl describe certificate -n prd yourdomain-tls-secret
Check Orders
kubectl get orders -n prd
kubectl describe order -n prd yourdomain-tls-secret-l5q5v-3169669692
# As a last resort, if you cannot find the error, you could try deleting the order to see if that solves the problem:
kubectl delete order -n prd yourdomain-tls-secret-l5q5v-3169669692
Check Endpoints
kubectl get endpoints -n prd
kubectl describe endpoints -n prd nginx-alb
# As a last resort, if you cannot find the error, you could try deleting the endpoint to see if that solves the problem:
kubectl delete endpoints -n prd nginx-alb
Check cert-manager logs
kubectl get pods -n ingress-nginx
kubectl logs -f -n ingress-nginx cert-manager-7964b89d66-st42t
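Pod names change on every restart, so selecting the pod by label can be more convenient than copying its name. This sketch assumes the controller pod carries the label app.kubernetes.io/name=cert-manager, as set by the Helm chart; the function name and the default line count are hypothetical.

```shell
# Hypothetical helper: print recent cert-manager controller logs, selected by label.
# $1 = namespace where cert-manager runs, $2 = number of lines (optional, default 100)
cert_manager_logs() {
  kubectl logs -n "$1" -l app.kubernetes.io/name=cert-manager --tail="${2:-100}"
}

# Usage: cert_manager_logs ingress-nginx 200
```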
Check challenges
kubectl get challenges --all-namespaces
kubectl describe challenge -n prd yourdomain-tls-secret-ppb5j-734007623-184628386
# As a last resort, if you cannot find the error, you could try deleting the challenge to see if that solves the problem:
kubectl delete challenge -n prd yourdomain-tls-secret-ppb5j-734007623-184628386
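The resources checked above form a chain: a Certificate produces a CertificateRequest, which produces an Order, which produces one Challenge per DNS name. Listing them together shows at a glance where issuance is stuck. The function name below is hypothetical.

```shell
# Hypothetical helper: list the whole issuance chain in one namespace.
# $1 = namespace of the Ingress and its Certificate
issuance_chain() {
  kubectl get certificate,certificaterequest,order,challenge -n "$1"
}

# Usage: issuance_chain prd
```

The first resource in the chain that is not ready or not valid is usually the one worth describing in detail.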