How to Avoid CIDR Conflicts in AWS Sagemaker Notebooks
Networking can sometimes be quite complicated. Despite the oft-repeated joke that “It’s always DNS”, sometimes your problem is even more difficult to diagnose than DNS.
According to Wikipedia, Classless Inter-Domain Routing (or CIDR) “is a method for allocating IP addresses and for IP routing” on the internet and on private networks. If two networks’ CIDR ranges conflict, it can cause headaches that make DNS problems look like child’s play.
This is a story about how I unknowingly created a CIDR conflict, and I hope it will be useful to help you avoid CIDR conflicts in Sagemaker Notebooks in the future.
What is AWS Sagemaker?
AWS Sagemaker is a popular Machine Learning platform that provides preconfigured environments which allow you to begin training ML models quickly.
Sagemaker Notebooks provide the platform to build and train your models. Under the covers, an AWS Sagemaker Notebook is an EC2 instance running a Jupyter notebook packaged with numerous libraries and algorithms. This saves users from having to configure compute, storage, and libraries themselves.
A benefit of Sagemaker Notebooks is that you can run generic Python commands alongside your more complex ML code within the Jupyter notebooks. For example, while troubleshooting some connectivity issues recently, I was able to use the ubiquitous Python Requests library to perform an HTTP GET request within the Sagemaker Notebook to verify I could communicate with a webserver I launched in another AWS account.
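That kind of quick check looks something like the sketch below; the URL is a placeholder rather than the actual webserver I was testing against.

# Run in a Sagemaker Notebook cell to verify basic HTTP connectivity.
# The address below is a placeholder; substitute the host you want to reach.
import requests

try:
    response = requests.get("http://203.0.113.10", timeout=5)
    print(response.status_code)
except requests.exceptions.RequestException as err:
    # A CIDR conflict like the one described below tends to show up here
    # as a timeout or connection error rather than an HTTP error.
    print("Connection failed:", err)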
Creating a VPC
When I create new projects or resources in AWS accounts, I find it useful to create a new VPC just for the resources in that specific project. This provides network isolation, security and connectivity to the specific components needed for the project. During VPC creation you need to provide a CIDR range that the VPC will use.
CIDR ranges must be built using the private address spaces that IETF RFC 1918 dictates. This means that you choose the CIDR range, but it must be in one of the following ranges:
192.168.0.0/16 for 256 Class C networks, or
172.16.0.0/12 for 16 Class B networks, or
10.0.0.0/8 for a single Class A network.
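For reference, creating a VPC with an explicit CIDR block is a one-liner with the AWS CLI; the 10.0.0.0/16 range and the Name tag below are just illustrative choices, not what I used that day.

# Sketch: create a project VPC with an explicit RFC 1918 CIDR block.
aws ec2 create-vpc \
    --cidr-block 10.0.0.0/16 \
    --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=sagemaker-project-vpc}]'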
The day I was working on these Sagemaker resources, I randomly chose the range 172.17.0.0 for this VPC, and this fateful decision would come to haunt me.
As described briefly above, Sagemaker Notebooks run on an AWS EC2 instance, and provide preconfigured tools for ML use cases. Part of that tooling is a wonderful open source product called Jupyter Notebook, which is a web-based Python experimentation environment. Jupyter Notebooks run inside Docker on the Sagemaker Notebook EC2 instance.
When you configure Docker on an EC2 instance, or on any workstation, laptop, or Linux machine, it sets up virtual networking of its own to provide network connectivity to the Docker containers. This local virtual networking sets up a bridge network interface that allows communication between the Docker network and the local machine, as well as the internet if allowed. The default Docker bridge network uses the range 172.17.0.0/16 for the docker0 network interface.
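You can see that default on any machine running Docker. As a quick sketch (standard commands, nothing Sagemaker-specific):

# Show the address assigned to the default Docker bridge interface;
# docker0 typically sits at 172.17.0.1/16.
ip addr show docker0

# Or inspect the default "bridge" network and look for the IPAM
# "Subnet" entry, typically 172.17.0.0/16.
docker network inspect bridge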
CIDR Range Conflict Causes Connectivity Problem
Now that the background is set, you can see why a connectivity problem was destined to occur on the Sagemaker Notebook instance. Since the VPC CIDR range was 172.17.0.0, every EC2 instance or network interface created in that VPC would be provided with an IP address within the 172.17.0.0 range.
Because the Jupyter Notebook running in Docker on the Sagemaker Notebook EC2 instance used the Docker bridge network 172.17.0.0/16, the kernel routed all traffic destined for that range to the docker0 interface. That route is more specific than the default route (0.0.0.0/0) via the default gateway (172.17.112.1), so packets that should have left the instance for other hosts in the 172.17.0.0 range were instead intercepted by the Docker bridge network and never sent outside it.
This is shown via the route -n command on the Sagemaker Notebook EC2 instance terminal:
sh-4.2$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.112.1    0.0.0.0         UG    10002  0        0 eth2
169.254.0.0     0.0.0.0         255.255.255.0   U     0      0        0 veth_def_agent
169.254.169.254 0.0.0.0         255.255.255.255 UH    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.17.112.0    0.0.0.0         255.255.240.0   U     0      0        0 eth2
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-9bb29e923d05
192.168.0.0     0.0.0.0         255.255.0.0     U     0      0        0 bridge0
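Note the 172.17.0.0 route pointing at docker0 with a 255.255.0.0 genmask. A quick way to confirm which interface the kernel would use for a given destination (not from the original transcript, just an illustrative check) is ip route get; the address below is an arbitrary pick from 172.17.0.0/16 that falls outside the more specific 172.17.112.0/20 route, so with this table it resolves to docker0 rather than eth2.

# Ask the kernel which interface would carry a packet to this destination.
ip route get 172.17.5.10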
Was This Unexpected?
What could have been done to avoid this, you ask?
As I mentioned earlier, the default Docker bridge network uses the range 172.17.0.0/16 for the docker0 network interface. It’s fine that AWS simply used that default without modifying it, but that implementation detail should have been documented in the Sagemaker Notebook documentation.
Ideally, though, my suggested resolution is different. If the AWS end user (aka customer) tries creating a Sagemaker Notebook in a VPC that has the same CIDR range as the default Docker bridge range (172.17.0.0/16), Sagemaker should change the Docker bridge network range to something else to avoid a conflict.
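If you control the Docker daemon yourself (which is harder on a managed Sagemaker Notebook instance), moving the default bridge off 172.17.0.0/16 is a small configuration change. A sketch, assuming root access and a stock Docker install; the replacement range is just an example:

# /etc/docker/daemon.json -- move the default docker0 bridge off 172.17.0.0/16.
{
  "bip": "192.168.200.1/24"
}

# Restart the Docker daemon so the new bridge address takes effect.
sudo systemctl restart docker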
Ultimately the responsibility was on me to create the VPC and the Sagemaker Notebook and make sure they worked correctly. But this was made infinitely more difficult by choices AWS made on my behalf. Because AWS Sagemaker did not document the Docker bridge network range it used, and did not programmatically configure an alternate Docker bridge network range when the customer selected the same CIDR range for their VPC as the default Docker bridge network, I was left troubleshooting a nasty problem.
Please be aware this is still an open issue. (I say that figuratively, not literally: there is no AWS Support ticket open, and AWS has not committed to fixing this in any way.)
If you want to avoid CIDR conflicts on Sagemaker Notebooks in your AWS VPC, make sure you use a VPC CIDR range other than 172.17.0.0/16, the default Docker bridge network range.
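A small pre-flight check using Python's standard ipaddress module makes this easy to verify before you create anything; the candidate range below is just an example.

# Check a candidate VPC CIDR against Docker's default bridge range.
import ipaddress

DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")
candidate_vpc_cidr = ipaddress.ip_network("10.42.0.0/16")  # example candidate

if candidate_vpc_cidr.overlaps(DOCKER_DEFAULT_BRIDGE):
    print("Conflict: pick a different VPC CIDR range")
else:
    print("No overlap with the default Docker bridge network")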