Secure Self-Service Custom Domains for Dev Portals
Vincent Le Goff
In the Dev Portal world, offering users the ability to use their own domain is a milestone on our way to fully customized Dev Portals. Since Konnect-hosted portals are fronted by a Kong gateway, we looked to our own plugins to build this feature.
Leveraging the Open-Source ACME Plugin at SaaS Scale
The ACME plugin is an open-source Kong plugin which allows the user to integrate the gateway with Let's Encrypt or any other similar ACMEv2 service to obtain TLS certificates dynamically. This makes TLS certificate handling easy and sustainable as the plugin handles renewal of certificates automatically.
When we first looked at the plugin, it had a couple of drawbacks that made it fall short of our needs. The first issue was a guardrail put in place by the plugin authors to prevent some cases of abuse, limiting it to function with only a small handful of common top-level domains (TLDs).
In our case, with a global and creative customer base, we needed this list to be either dynamically configurable or disabled completely. We don't want to limit which TLDs our customers can use for their Konnect Dev Portals, not to mention the performance cost of parsing a large allow-list of TLD patterns. This pull request was implemented to disable the allow-list check.
A second drawback, compounded by disabling the allowed-TLD check, is that the plugin would allow certificates that use an IPv4 address as the certificate common name, which some providers don't support. Some Kong users, like us, do not want this to be possible with the ACME plugin. This pull request added an option to deny these certificates by checking whether the SNI is an IPv4 address, and we enable it on the Kong gateway fronting our Konnect Dev Portals.
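The IPv4 check described above amounts to asking whether the SNI parses as a literal IPv4 address. The following is a minimal Python sketch of that idea, not the plugin's actual code; the function name `is_ipv4_sni` is hypothetical.

```python
import ipaddress


def is_ipv4_sni(sni: str) -> bool:
    """Return True when the SNI value is a literal IPv4 address."""
    try:
        ipaddress.IPv4Address(sni)
    except ValueError:
        # Hostnames and malformed addresses fail to parse and end up here.
        return False
    return True


print(is_ipv4_sni("192.0.2.10"))
print(is_ipv4_sni("portal.example.com"))
```

A gateway applying this rule would refuse to start certificate provisioning whenever the function returns true for the incoming SNI.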
Guarding Against Certificate Issuing Abuses
While these changes make the ACME plugin flexible enough for our requirements, this configuration also opens the door to abuse: a malicious user could deliberately use the gateway to flood our certificate provider with requests and ACME challenges, which would damage our relationship with that provider. For example:
With the current state and configuration of the ACME plugin, this script would cause the gateway to provision certificates for the fake domains and would likely disrupt service for legitimate traffic in the process. This attack cannot be mitigated by application logic in the upstream implementation either, since the ACME plugin intervenes using SNI while the TLS connection is still being negotiated – all in the gateway.
To fix this, we created a custom plugin that checks whether the custom domain (SNI) exists in our Konnect portal database before a certificate is provisioned for the connection. For the plugin to work, it needs to run during TLS connection negotiation but before the ACME plugin. To achieve this, we use a gateway feature that lets us set the relative priority of plugins that run in the same phase. Since we know the priority of the ACME plugin is 1705, we configure our plugin with a higher value:
We continue to implement the plugin by hooking in a function to run during the `certificate` phase, one of many phases defined in the gateway. The certificate phase is triggered when the TLS handshake is initiated between the client and the server. This is when the ACME plugin reaches out to the certificate provider to provision a certificate, which is what we need to prevent for fake domains.
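The ordering rule relied on here is that, within a phase, a plugin with a higher priority runs before one with a lower priority. A small Python sketch of that rule follows. The plugin name `portal-domain-guard` and the value 1710 are hypothetical, chosen only to sit above the ACME plugin's 1705.

```python
# 1705 comes from the text above; 1710 is an illustrative value only.
PLUGIN_PRIORITIES = {
    "acme": 1705,
    "portal-domain-guard": 1710,  # hypothetical name for the custom plugin
}


def execution_order(priorities: dict) -> list:
    """Order plugin names so that higher priorities run first."""
    return sorted(priorities, key=priorities.get, reverse=True)


print(execution_order(PLUGIN_PRIORITIES))
```

Any value greater than 1705 would place the domain check ahead of certificate provisioning under this rule.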
The hook is very simple and defined like this:
This function runs in the `certificate` phase whenever the SNI is not already cached, and it runs before the ACME plugin thanks to our priority setting. It simply checks the SNI with `lookupUrl`, which verifies that the name exists in our Konnect database. If the customer configured things correctly, the entry exists and the function returns with no effect, allowing the ACME plugin to proceed and cache the SNI and its associated certificate. Otherwise, we close the connection without any HTTP error code, because no TLS connection was ever fully established.
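The decision the hook makes can be summarized in a few lines. The sketch below models it in Python rather than in the gateway's own plugin code. The function `certificate_hook`, its return values, and the in-memory `KNOWN_DOMAINS` set standing in for the Konnect database are all assumptions made for illustration.

```python
from typing import Callable, Set


def certificate_hook(
    sni: str,
    lookup_url: Callable[[str], bool],
    cached_snis: Set[str],
) -> str:
    """Return "proceed" to let the ACME step run, or "abort" to drop the handshake."""
    if sni in cached_snis:
        # A certificate is already cached for this name; no lookup needed.
        return "proceed"
    if lookup_url(sni):
        # Known custom domain: let ACME provision and cache the certificate.
        return "proceed"
    # Unknown name: stop before the certificate provider is ever contacted.
    return "abort"


# Hypothetical stand-in for the portal database lookup.
KNOWN_DOMAINS = {"docs.example.com"}


def lookup(name: str) -> bool:
    return name in KNOWN_DOMAINS


print(certificate_hook("docs.example.com", lookup, set()))
print(certificate_hook("fake-1234.example.net", lookup, set()))
```

Keeping the cache check first means the database is consulted only for names the gateway has not yet served.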
How it looks in the end:
In this case, what happens if I try to request a non-existent portal?
In this example we try to access a portal that does not exist. With the verbose argument, we can see that the client initiates the TLS handshake with its "hello" message and then fails with an error because the server does not answer properly. This happens because we call the `ngx.exit` routine after `lookupUrl` fails, which breaks the handshake and then the connection.
End User Tasks
Now, with all the above features implemented, the service administrator has only two tasks:
Add a CNAME record in the domain's DNS configuration that points to the Konnect-generated portal URL, which is found in the "Portal URL" settings page for the Dev Portal (allowing the URL to resolve to the Konnect IP):
Enter the custom domain in the "Portal URL" setting for the Dev Portal (allowing `lookupUrl` to work and the certificate to be provisioned):
Once the DNS settings are propagated, the custom portal is accessible!