Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove plaintext NATS #932

Merged
merged 3 commits into from
Oct 19, 2021
Merged

Remove plaintext NATS #932

merged 3 commits into from
Oct 19, 2021

Conversation

46bit
Copy link
Contributor

@46bit 46bit commented Jul 28, 2021

WHAT is this change about?

This commit removes the plaintext NATS nodes. All the software used in cf-deployment has been modified to handle this and use NATS over mTLS.

Without the nats job, and without the nats link, everything standard in cf-deployment will use the nats-tls job and link. Any users with custom components that still require plaintext NATS can re-enable it by keeping the plaintext NATS job.

What customer problem is being addressed?

Until now cf-deployment has deployed a NATS cluster that contained both plaintext and TLS nodes. Internal communication was over mTLS, but clients could choose to connect over plaintext.

Communicating over plaintext is undesirable, as discussed by many users in #906. Previous work on this issue has stopped components relying on plaintext NATS, but we should carry on and remove it entirely.

Please provide any contextual information.

This PR addresses this issue: #929

We have already merged all prerequisite PRs:

Has a cf-deployment including this change passed cf-acceptance-tests?

  • YES
  • NO

Does this PR introduce a breaking change? Please take a moment to read through the examples before answering the question.

YES - removes the existing nats job which could impact operators with custom components using NATS

All route registrars will also need a change introduced in cloudfoundry/routing-release#214.

How should this change be described in cf-deployment release notes?

As an operator I now know that all NATS traffic is sent over mTLS.

Does this PR introduce a new BOSH release into the base cf-deployment.yml manifest or any ops-files?

No

Does this PR make a change to an experimental or GA'd feature/component?

GA'd feature/component

Please provide Acceptance Criteria for this change?

Deploy this change, run the acceptance tests, and see how they pass despite the lack of plaintext NATS.

What is the level of urgency for publishing this change?

Slightly Less than Urgent

Tag your pair, your PM, and/or team!

Collaborating with @ameowlia on this

Michael Mokrysz added 3 commits July 27, 2021 12:18
This commit enables mTLS between service-discovery-controller and NATS.
Only the client auth has to be specified in YAML; the NATS CA gets into
service-discovery-controller via the nats-tls BOSH link.
Connections to TLS NATS need to be via hostname, not IP. The NATS TLS cert
is for the hostname, not the IP. The lines removed by this PR ensure that
components using NATS will get the hostname instead of the IP.

However, this is the default behaviour [1]. We already have other components
using TLS NATS without this line, so it seems perfectly safe.

[1] https://bosh.io/docs/links/#consumers under "ip_addresses"
Until now cf-deployment has deployed a NATS cluster that contained both
plaintext and TLS nodes. Internal communication was over mTLS, but clients
could choose to connect over plaintext.

This commit removes the plaintext nodes. All the software used in
cf-deployment has been modified to handle this and use NATS over mTLS.

Without the nats job, and without the nats link, everything will use the
nats-tls job and link.

Any users requiring plaintext NATS could re-enable it by keeping this
plaintext NATS job.
@cf-gitbot
Copy link

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/179038645

The labels on this github issue will be updated when the story is started.

@46bit 46bit changed the title Remove plaintext NATS job Remove plaintext NATS Jul 28, 2021
Copy link
Member

@davewalter davewalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I just had one question about the change to the route_emitter configuration.

cf-deployment.yml Show resolved Hide resolved
@46bit
Copy link
Contributor Author

46bit commented Jul 29, 2021 via email

@46bit 46bit marked this pull request as ready for review September 6, 2021 12:44
@46bit
Copy link
Contributor Author

46bit commented Sep 6, 2021

This is now only waiting on #931. Once that is merged then we can merge this.

@ameowlia
Copy link
Member

Hi all!

@46bit has been doing amazing work to make this possible and we would really love to get this change in soon if possible. If there is anything @46bit can do to unblock it, please let them know.

Thank you!

@46bit
Copy link
Contributor Author

46bit commented Oct 8, 2021

Thanks everyone that's been working to get this work merged! Just this one PR left :)

@davewalter
Copy link
Member

@46bit Can I go ahead and merge this on top of #931 or should I cut a release (once CI is green) and then merge this to avoid downtime during upgrade between releases?

@46bit
Copy link
Contributor Author

46bit commented Oct 9, 2021

@davewalter Unfortunately yes you are best to do two releases to avoid downtime. Embarrassingly I've forgotten what service-discovery-controller does so I'm not sure what the downtime impact would be. Maybe @ameowlia knows?

@davewalter
Copy link
Member

It's a container networking component that reconciles internal routes emitted by the route emitter via the NATS message queue.

I am happy to cut a release, merge the last PR, and then cut a second fast-follow release.

@plowin
Copy link

plowin commented Oct 18, 2021

Hi @davewalter, thanks a lot!
Do we roughly know when this could be done?

@davewalter davewalter merged commit 52b2991 into cloudfoundry:develop Oct 19, 2021
@davewalter davewalter self-assigned this Oct 19, 2021
@46bit 46bit deleted the remove-non-tls-nats branch October 19, 2021 00:30
@b1tamara
Copy link
Contributor

Hi @davewalter,
is there a concrete plan when the new release with this PR will be created in order to make the change productive?

@davewalter
Copy link
Member

Hi @b1tamara,

Unfortunately, this change has caused a number of failures in the CI pipeline that I need to resolve before I can cut a new release. I should have time to look at them tomorrow to figure out what changes need to be made to make everything work.

Regards,
Dave

cc @46bit

@davewalter
Copy link
Member

I managed to look at the failure in the Windows test:

Task 74 | 21:49:49 | Error: Failed to resolve links from deployment 'cf'. See errors below:
  - Failed to resolve link 'nats' with type 'nats' from job 'route_emitter_windows' in instance group 'windows2019-cell'. Details below:
    - No link providers found

This is puzzling to me since both the nats and nats-tls BOSH links are marked as optional in the route_emitter_windows job spec.

I will keep digging tomorrow.

@davewalter
Copy link
Member

The other failures are the clean install with isolation segments and upgrade install with isolated diego cells:

Error: Unable to render instance groups for deployment. Errors are:
  - Unable to render jobs for instance group 'isolated-diego-cell'. Errors are:
    - Unable to render templates for job 'route_emitter'. Errors are:
      - Error filling in template 'route_emitter.json.erb' (line 54: undefined method `collect' for nil:NilClass)
  - Unable to render jobs for instance group 'iso-seg-router'. Errors are:
    - Unable to render templates for job 'gorouter'. Errors are:
      - Error filling in template 'gorouter.yml.erb' (line 12: NATS server port number not found in properties nor in "nats" link. This value can be specified using the "nats.port" property.)

These appear to be more straightforward.

The route_emitter jobs route_emitter.json.erb template requires the diego.route_emitter.nats.tls.enabled property to be set to true for it to switch to "NATS TLS" mode. While this was set for the diego cells in the main manifest, it was not set for the isolation segment diego cells we test in CI. Fixed via a0dc959.

Similarly, the gorouter job's gorouter.yml.erb template requires the nats.tls_enabled property to be set for it to switch to "NATS TLS" mode. While this was set for the routers in the main manifest, it was not set for the isolation segment routers we test in CI. Fixed via 5988a3c.

@davewalter
Copy link
Member

I've also fixed the Windows test failure, via 56860d5. The windows2019-cell.yml ops-file still had the entries to exclude the ip_addresses from the nats and nats-tls links, which was causing BOSH to require the nats link. At this point, CI should go green over the weekend, which will allow me to cut a new release next week.

@davewalter
Copy link
Member

CI is finally green so I cut v17.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants