Tcpdump for Everyone: Proposal to add pcap-release to cf-deployment #980
Comments
I think this is a very useful enhancement. It would be nice if the platform operator were able to set limits:
Also an operator option to "kill" specific or all captures immediately (for emergency cases) |
This looks pretty cool! It'll definitely make a lot of people's lives a lot easier. I can see people also being wary of something like this, since it could theoretically dump all traffic coming into all app containers to someone who compromises pcap-api or pcap-server. Some questions I have:
|
A new community ops file would seem to me to be the best way to make this available to operators until/if it is accepted into the cf-deployment manifest properly. |
FWIW here is a recording of the prototype in action: https://youtu.be/XG28EYq_kaw?t=514 |
The captured packets are in pcap format, i.e. their timestamps are UNIX timestamps (seconds and microseconds elapsed since 1/1/1970). So even if you mix two streams together that happened at different points in time, the resulting timestamps will be correct.
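To illustrate why absolute timestamps make mixing streams safe, here is a minimal sketch that merges two separately captured record streams by timestamp. It assumes the classic little-endian pcap record header (`ts_sec`, `ts_usec`, `incl_len`, `orig_len`); the helper names and the sample packets are made up for illustration and are not part of pcap-release.

```python
import heapq
import struct

# Classic pcap per-packet record header: ts_sec, ts_usec, incl_len, orig_len.
# Little-endian layout is assumed here; real files announce it via the magic number.
RECORD_HEADER = struct.Struct("<IIII")


def make_record(ts_sec, ts_usec, payload):
    """Pack one packet record the way a little-endian pcap file stores it."""
    return RECORD_HEADER.pack(ts_sec, ts_usec, len(payload), len(payload)) + payload


def read_records(blob):
    """Yield (ts_sec, ts_usec, payload) tuples from concatenated records."""
    offset = 0
    while offset < len(blob):
        ts_sec, ts_usec, incl_len, _orig_len = RECORD_HEADER.unpack_from(blob, offset)
        offset += RECORD_HEADER.size
        yield ts_sec, ts_usec, blob[offset:offset + incl_len]
        offset += incl_len


def merge_streams(*streams):
    """Merge already time-ordered record streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda rec: (rec[0], rec[1]))


# Two hypothetical app instances, captured separately.
instance_a = make_record(1660000000, 500, b"a1") + make_record(1660000002, 0, b"a2")
instance_b = make_record(1660000001, 250, b"b1") + make_record(1660000003, 900, b"b2")

merged = list(merge_streams(read_records(instance_a), read_records(instance_b)))
```

Because every record carries its own absolute time, the merge only needs to compare `(ts_sec, ts_usec)` pairs; no per-stream offset has to be tracked.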
That's also a question I would like to answer with the community. Current approach is to give
Yes! You can see all traffic going in and out of your container on any interface.
Yes. It works by asking runC for the container PID and then entering the network namespace of that pid. |
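The two steps in the answer above (get the container PID from runC, then enter that PID's network namespace) can be sketched roughly as follows. This is not the pcap-release code, which is written in Go; it assumes the JSON printed by `runc state <container-id>` with its `pid` field, and `os.setns`, which needs Python 3.12+ on Linux plus root privileges. The sample state document is trimmed down and hypothetical.

```python
import json
import os


def container_pid(runc_state_json):
    """Extract the init process PID from the JSON printed by `runc state <id>`."""
    return int(json.loads(runc_state_json)["pid"])


def netns_path(pid):
    """Path of the network namespace handle for a process on Linux."""
    return f"/proc/{pid}/ns/net"


def enter_netns(pid):
    """Switch the calling thread into the network namespace of `pid`.

    Needs root (CAP_SYS_ADMIN) and Python 3.12+ on Linux for os.setns.
    """
    fd = os.open(netns_path(pid), os.O_RDONLY)
    try:
        os.setns(fd, os.CLONE_NEWNET)
    finally:
        os.close(fd)


# Hypothetical, trimmed-down output of `runc state`.
sample_state = '{"id": "app-container", "pid": 4242, "status": "running"}'
pid = container_pid(sample_state)
path = netns_path(pid)
```

Only the parsing and path steps run here; `enter_netns` is defined but not called, since it would require a real container and elevated privileges.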
Created a request to create a new repository to continue work on this topic: |
Like the others above, I think this is a great idea. I think this would help make common debugging scenarios easier. I have thoughts around two main topics: security and performance.
Security
Permissions
@geofffranks suggested considering adding a "new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this". Sadly Cloud Foundry does not have custom RBAC rules. I think adding this permission to space dev is okay. As a space dev you can push code, and an app already has access to all unencrypted traffic. So in a way, a space dev has always had the ability to view this information; this feature just makes it easier. If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.
The ability to easily get unencrypted traffic makes me nervous
Admin users have always been able to get into app containers as root and inspect unencrypted traffic. (It is not easy, but it is doable.) Space developers have always been able to write apps that obviously get that unencrypted data. But something about being able to easily get access to unencrypted traffic makes me nervous. A specific concern I have is: does this violate Europe's General Data Protection Regulation (GDPR)?
Performance
My deploy failed :(
When I followed the deploy steps in the release, I got an error saying that the release doesn't exist.
When I tried to create the release myself, I couldn't read from the blobstore.
Other questions
Next Steps
Thank you so much for all of your work on this @domdom82 and @plowin!! |
What about CLI support similar to |
I like that too. |
I was discussing the same option with @stephanme the other day. I think this may be the best approach as a per-space / app feature-toggle. |
Adding a "tcpdump" feature to https://v3-apidocs.cloudfoundry.org/version/3.122.0/#supported-app-features might be a straightforward option. |
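As a rough sketch of what such a toggle could look like from a client's point of view: the CF v3 API updates app features with `PATCH /v3/apps/:guid/features/:name` and an `enabled` flag in the body. The feature name "tcpdump" is only the proposal from the comment above and does not exist in the API; the helper name and the GUID are placeholders.

```python
import json


def feature_toggle_request(app_guid, feature, enabled):
    """Describe the HTTP request that updates an app feature in the CF v3 API."""
    return {
        "method": "PATCH",
        "path": f"/v3/apps/{app_guid}/features/{feature}",
        "body": json.dumps({"enabled": enabled}),
    }


# "tcpdump" is the hypothetical feature name proposed above; the GUID is a placeholder.
request = feature_toggle_request("my-app-guid", "tcpdump", True)
```

A per-app toggle like this would mirror how the existing `ssh` app feature is switched on and off.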
@ameowlia I'm sorry your deploy failed 😞 The YAMLs haven't been cleaned up yet to make them deployable from a non-dev release. For the blobstore issue, I think we were using the SAP S3 here (I didn't dare use the CF S3 before this got accepted).
As for the data privacy / GDPR concerns:
I hope this won't be an issue if we can make sure that only people who could already see unencrypted traffic to their own apps can now do so more conveniently, and no one else. So there must be no attack surface that breaches privacy, of course. For example, even if you tap into unencrypted HTTP traffic on your app, the pcap data that gets transferred to you will still be encrypted end-to-end.
Storing pcap data in a storage bucket
This scenario has only been explored superficially so far. For testing, I started writing a small Go client that creates a bucket with all the security standards in place that are required at SAP to fulfil data privacy (encrypted, non-public, forced TLS access, data retention policy etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket, and our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning, though.
Performance / Impact of the feature on a Diego Cell
I fully agree here. I could think of the following limits:
Why is the pcap API deployed as a separate deployment?
To be honest, I didn't know if cf-deployment was cool with adding new deployments or not. In my mind, cf-deployment contains only the core CF deployments like CC, Diego, UAA, etc. so I didn't dare cram my pcap API in there, too 😏 |
Since it sounds like the offline-storage aspect of |
That's an interesting thought. The current approach would be to just not use the offline feature if direct streaming is sufficient. However, it makes sense to explore the intersection of shared responsibility between customer and operator further.
The current design puts this responsibility on the operator of the platform, as the bucket is only a means to improve bandwidth during capture, less as a "pcap library" where customers could put their dumps. If the customer provided a bucket, they would also have to provide credentials to write to said bucket. Those credentials would have to be stored somewhere, which would introduce the need for a DB - something I wanted to avoid from the start.
Yes. The current design assumes a shared bucket where objects are tagged by user_id. This is a very simple model, where a URL couldn't be shared even among users of the same org/space. It could be extended by adding further tags such as org_id and/or space_id. The pcap server would be responsible for checking the user token, comparing its user_id with the one stored on the bucket object, and granting or denying access accordingly.
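The ownership check described above could look roughly like this. It is a sketch only: it assumes the token has already been verified and decoded into a dict of claims that contains `user_id`, and that the object's tags are available as a dict. The function name and sample values are hypothetical.

```python
def may_download(token_claims, object_tags):
    """Grant access only if the requesting user owns the stored capture."""
    user_id = token_claims.get("user_id")
    # A missing claim must never match an untagged object.
    return user_id is not None and object_tags.get("user_id") == user_id


# Hypothetical claims and object tags.
owner_allowed = may_download({"user_id": "alice"}, {"user_id": "alice"})
other_denied = may_download({"user_id": "bob"}, {"user_id": "alice"})
anonymous_denied = may_download({}, {})
```

Extending the model to org_id or space_id would mean comparing additional tags in the same way.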
S3 buckets can be configured with a "lifecycle rule" that would put a retention policy on every object in the bucket. This could be made configurable; the default would be 1 day. So after pulling a tcpdump you have 24h to download the pcap file using the URL given. The bucket would clean up the pcaps automatically. Enumerating buckets is an interesting idea. Initially, I thought about it only in the sense of single URLs with unique names (i.e. like a one-off token you can only use once to download your pcap). Enumerating by org/space/user goes in the direction of "pcap library for users" again, which is an interesting thought, yet unexplored in this design. The design aims only to replace your thin local bandwidth with a thick remote bandwidth of > 100Gbit/s when uploading to S3 instead of streaming to you directly.
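For reference, a one-day retention rule of the kind described above can be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration document; the helper name and rule ID are made up, and the empty prefix (apply to every object) is an assumption.

```python
def retention_lifecycle(days=1):
    """Build an S3 lifecycle configuration that expires every object after `days`."""
    return {
        "Rules": [
            {
                "ID": f"expire-pcaps-after-{days}d",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: applies to all objects
                "Expiration": {"Days": days},
            }
        ]
    }


default_policy = retention_lifecycle()
```

With boto3, a document of this shape could be applied via `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)`; the bucket name is left out here because the design above does not fix one.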
I agree. The current goal is to have the "direct streaming" solution ready for everyone to use by Q4 2022. We may iterate further on the design afterwards. |
Hi everyone, thanks a lot for the initial feedback and discussions. The current version enables tcpdump streaming (in a hacky way) for CF apps. Our current work addresses (very agile though):
Out of scope (for now):
|
Made slight updates to terms and use-cases.
pcap-release has been re-planned to use gRPC instead of plain HTTP for better streaming performance and improved control flow. This allows us to send messages to the user while capturing, and to manage traffic using back pressure and other options. |
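To make the idea of back pressure plus user-facing messages more concrete, here is a minimal sketch that does not use gRPC at all: a bounded buffer sits between the packet source and a slow client, drops packets once it is full, and produces a status text that could be sent to the user. Dropping is just one possible reaction; the class and its names are hypothetical and not taken from pcap-release.

```python
import queue


class CaptureBuffer:
    """Bounded buffer between a fast packet source and a slow client."""

    def __init__(self, max_packets):
        self._packets = queue.Queue(maxsize=max_packets)
        self.dropped = 0

    def offer(self, packet):
        """Queue a packet, or drop it when the client has fallen behind."""
        try:
            self._packets.put_nowait(packet)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Return everything the client can consume right now."""
        packets = []
        while True:
            try:
                packets.append(self._packets.get_nowait())
            except queue.Empty:
                return packets

    def status_message(self):
        """Text that could be sent to the user alongside packet data."""
        return f"{self.dropped} packet(s) dropped: client too slow"


buffer = CaptureBuffer(max_packets=2)
accepted = [buffer.offer(p) for p in (b"p1", b"p2", b"p3")]
delivered = buffer.drain()
```

A live capture cannot slow down the network it observes, so a bounded buffer with explicit drop reporting is a common compromise; the actual release may handle this differently.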
What is this issue about?
At SAP BTP networking and routing we regularly face problems such as these:
Operators on the other hand get complaints like these:
If logs are not helping, the issue is usually resolved by helping the customer or the operator run a tcpdump of their application and analyzing the pcap files in Wireshark.
Of course, this means a lot of work for operations and development, but what if the users themselves were able to capture their app's traffic?
Enter Pcap-Release
We have started working on a solution that allows regular CF users as well as BOSH operators to debug into the network traffic of apps. The system is composed of three parts:
on every Diego Cell inside CF app containers as well as BOSH VMs. It can enter the network namespace of a CF app container and tap into its network devices using libpcap and BPF filters just like tcpdump does. It leverages gopacket, a Go pcap library by Google.
The project was initially hosted under my org as pcap-server-release and has moved to a permanent location, cloudfoundry/pcap-release.
The repository provides an ops-file that integrates the pcap-release with Diego as well as an example manifest that deploys the pcap-API onto its own VMs.
Architecture
Explanation: Stream to User
Explanation: Stream to Storage / Download later
This is needed if the traffic is too much for the end user's connection to handle. The traffic is instead streamed to an object store (like AWS S3) and tagged with the user's id.
Current status / next steps of the project
The project is considered pre-alpha. Basic use cases are working, some authentication and authorization is done. Pcap-API URL is registered using route-registrar. Connection between Pcap-API and Pcap-Agent is secured using mTLS.
Use Cases Complete / Missing:
Next steps will likely be:
The goal of this issue
We recently showed a demo of the release to the app runtime platform wg audience. It was well received, and it was suggested that we bring it to cf-deployment to discuss options for integrating it.
We would like to use this issue to answer the following questions:
Feel free to reach out to me on CF-Community Slack also! Handle is @domdom82