How to Avoid CIDR Conflicts in AWS Sagemaker Notebooks

Networking can be quite complicated. Despite the oft-repeated joke that “It’s always DNS”, sometimes your problem is even more difficult to diagnose than DNS.

According to Wikipedia, Classless Inter-Domain Routing (or CIDR) “is the method for allocating IP addresses and for IP routing” on the internet and on private networks. When two networks’ CIDR ranges conflict, the resulting headaches make DNS problems look like child’s play.

This is a story about how I unknowingly created a CIDR conflict, and I hope it helps you avoid CIDR conflicts in Sagemaker Notebooks in the future.

What is AWS Sagemaker?

AWS Sagemaker is a popular Machine Learning platform that provides preconfigured environments which allow you to begin training ML models quickly.

Sagemaker Notebooks provide the platform to build and train your models. Under the covers, an AWS Sagemaker Notebook is an EC2 instance running a Jupyter notebook packaged with numerous libraries and algorithms. This saves users from having to configure compute, storage and libraries themselves.

A benefit of Sagemaker Notebooks is that you can run generic Python commands alongside your more complex ML code within the Jupyter notebooks. For example, while testing out some connectivity issues recently I was able to use the ubiquitous Python Requests library to perform an HTTP GET request within the Sagemaker Notebook to verify I was able to communicate with a webserver I launched in another AWS account.
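A connectivity check like the one described above might look like the following sketch. It uses the standard library's urllib instead of Requests so that it runs without third-party packages. The function name `can_reach` and the example URL are hypothetical and do not come from the original notebook.

```python
import urllib.error
import urllib.request


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` gets any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or no route to the host.
        return False


# Hypothetical usage from a notebook cell:
# can_reach("http://webserver.example.internal/")
```

A result of False only says that no HTTP response came back within the timeout. It does not say whether DNS, routing, or a security group is to blame.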

Creating a VPC

When I create new projects or resources in AWS accounts, I find it useful to create a new VPC just for the resources in that specific project. This provides network isolation, security and connectivity to the specific components needed for the project. During VPC creation you need to provide a CIDR range that the VPC will use.

CIDR ranges should generally come from the private address spaces defined in IETF RFC 1918. You choose the range, but it should fall within 10.0.0.0/8 (a single Class A network), 172.16.0.0/12 (16 contiguous Class B networks), or 192.168.0.0/16 (256 contiguous Class C networks).
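To make the private ranges concrete, here is a minimal sketch that reports which RFC 1918 block (10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16) a candidate CIDR range belongs to. The helper name `rfc1918_block` is hypothetical, and the /16 and /24 prefix lengths in the usage lines are assumptions chosen for illustration.

```python
import ipaddress

# The three private IPv4 blocks reserved by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def rfc1918_block(cidr: str):
    """Return the RFC 1918 block that fully contains `cidr`, or None."""
    candidate = ipaddress.ip_network(cidr)
    for block in RFC1918_BLOCKS:
        if candidate.subnet_of(block):
            return block
    return None


print(rfc1918_block("172.17.0.0/16"))  # inside 172.16.0.0/12
print(rfc1918_block("8.8.8.0/24"))     # public address space, so None
```

Note that a range can be perfectly valid under RFC 1918 and still collide with something else on the instance, which is exactly what happens later in this story.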

The day I was working on these Sagemaker resources, I randomly chose the range 172.17.0.0 for this VPC, and this fateful decision would come back to haunt me.

Sagemaker Notebooks

As described briefly above, Sagemaker Notebooks run on an AWS EC2 instance and provide preconfigured tools for ML use cases. Part of that tooling is a wonderful open source product called Jupyter Notebook, which is a web-based Python experimentation environment. Jupyter Notebooks run inside Docker on the Sagemaker Notebook EC2 instance.

When you configure Docker on an EC2 instance or any workstation/laptop/Linux machine, it sets up virtual networking of its own to provide network connectivity to the Docker containers. This local virtual networking creates a bridge network interface that allows communication between the Docker network and the local machine, as well as the internet if allowed. The default Docker bridge network uses the range 172.17.0.0/16 for the docker0 network interface.

CIDR Range Conflict Causes Connectivity Problem

Now that the background is set, you can see why a connectivity problem was destined to occur on the Sagemaker Notebook instance. Since the VPC CIDR range utilized 172.17.0.0, that meant all EC2 instances or network interfaces created in that VPC would be provided with an IP address within the 172.17.0.0 range.
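The overlap itself is easy to test for ahead of time. The sketch below compares a VPC CIDR range against Docker's default bridge range of 172.17.0.0/16. The function name `conflicts_with_docker_bridge` is hypothetical, and the /16 prefix on the VPC range is an assumption, since only the base address 172.17.0.0 is given here.

```python
import ipaddress

# Docker's default bridge network for the docker0 interface.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def conflicts_with_docker_bridge(vpc_cidr: str, bridge=DOCKER_DEFAULT_BRIDGE) -> bool:
    """Return True if the VPC range shares any address with the bridge range."""
    return ipaddress.ip_network(vpc_cidr).overlaps(bridge)


print(conflicts_with_docker_bridge("172.17.0.0/16"))  # True: identical ranges
print(conflicts_with_docker_bridge("10.0.0.0/16"))    # False: no shared addresses
```

A partial overlap counts too: a smaller subnet carved out of 172.17.0.0/16 would also return True.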

Because the Jupyter Notebook running in Docker on the Sagemaker Notebook EC2 instance used the Docker bridge network 172.17.0.0 and listened for all traffic sent to that range, it superseded traffic bound for the outside world. Network packets sent to the default route (0.0.0.0) via the default gateway (172.17.112.1) were intercepted by the Docker bridge network and never left it.

This is shown via the route -n command on the Sagemaker Notebook EC2 instance terminal:

sh-4.2$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.112.1    0.0.0.0         UG    10002  0        0 eth2
169.254.0.0     0.0.0.0         255.255.255.0   U     0      0        0 veth_def_agent
169.254.169.254 0.0.0.0         255.255.255.255 UH    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.17.112.0    0.0.0.0         255.255.240.0   U     0      0        0 eth2
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-9bb29e923d05
192.168.0.0     0.0.0.0         255.255.0.0     U     0      0        0 bridge0
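To see how the kernel would choose among these routes, the following sketch applies longest-prefix matching to the table above, with each Genmask converted to a prefix length. The function name `pick_interface` and the sample destination 172.17.5.10 (standing in for another host in the VPC range) are hypothetical. This models only the routing table lookup, not any iptables or NAT rules that Docker may also install.

```python
import ipaddress

# (destination network, interface) pairs copied from the route -n output.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "eth2"),
    (ipaddress.ip_network("169.254.0.0/24"), "veth_def_agent"),
    (ipaddress.ip_network("169.254.169.254/32"), "eth0"),
    (ipaddress.ip_network("172.17.0.0/16"), "docker0"),
    (ipaddress.ip_network("172.17.112.0/20"), "eth2"),
    (ipaddress.ip_network("172.18.0.0/16"), "br-9bb29e923d05"),
    (ipaddress.ip_network("192.168.0.0/16"), "bridge0"),
]


def pick_interface(destination: str) -> str:
    """Return the interface of the most specific route matching `destination`."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, iface) for net, iface in ROUTES if addr in net]
    # The default route matches every IPv4 address, so `matches` is never empty.
    _, iface = max(matches, key=lambda match: match[0].prefixlen)
    return iface


print(pick_interface("172.17.5.10"))  # docker0: stays on the local bridge
print(pick_interface("8.8.8.8"))      # eth2: leaves via the default route
```

The first result is the conflict in miniature: an address that belongs to the VPC is claimed by the docker0 route instead of being sent out through eth2.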

Was This Unexpected?

What could have been done to avoid this, you ask?

As I mentioned earlier, the default Docker bridge network uses the range 172.17.0.0 for the docker0 network interface. It’s fine that AWS simply used that default without modifying it, but that implementation detail should have been documented in the Sagemaker Notebook documentation.

Ideally, though, my suggested resolution goes further. If an AWS end user (aka customer) creates a Sagemaker Notebook in a VPC whose CIDR range matches the default Docker bridge range (172.17.0.0), Sagemaker should move the Docker bridge network to a different range to avoid the conflict.
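As a rough illustration of what moving the bridge could involve, the sketch below picks a bridge range that avoids a list of ranges already in use and renders it as a Docker daemon.json fragment using Docker's `bip` setting. The function names and the candidate ranges are hypothetical. Whether this file can be changed on a Sagemaker Notebook instance is an assumption I have not verified, so treat this as a sketch and not a tested fix.

```python
import ipaddress
import json

# Hypothetical candidate ranges, tried in order.
CANDIDATES = ("172.17.0.0/16", "172.30.0.0/16", "10.255.0.0/16")


def suggest_bridge_ip(avoid, candidates=CANDIDATES):
    """Return a bridge address such as '172.30.0.1/16' that avoids all ranges in `avoid`."""
    in_use = [ipaddress.ip_network(cidr) for cidr in avoid]
    for cidr in candidates:
        net = ipaddress.ip_network(cidr)
        if not any(net.overlaps(other) for other in in_use):
            # `bip` expects the bridge's own address plus the prefix length.
            return f"{net.network_address + 1}/{net.prefixlen}"
    return None


def daemon_json(bip: str) -> str:
    """Render a daemon.json fragment that sets the default bridge address."""
    return json.dumps({"bip": bip}, indent=2)


# The VPC range (assumed /16) plus the other bridge ranges from the route table.
bip = suggest_bridge_ip(["172.17.0.0/16", "172.18.0.0/16", "192.168.0.0/16"])
print(daemon_json(bip))
```

The `avoid` list matters because the instance already had other bridge networks on 172.18.0.0 and 192.168.0.0, so a replacement range has to steer clear of those as well as the VPC.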

Wrap Up

Ultimately, the responsibility was on me to create the VPC and the Sagemaker Notebook and make sure they worked correctly. But this was made far more difficult by choices AWS made on its own. Because AWS Sagemaker did not document the Docker bridge network range it used, and did not programmatically configure an alternate Docker bridge network range when the customer selected the same CIDR range for their VPC as the default Docker bridge network, I was left troubleshooting a nasty problem.

Please be aware this is still an open issue. (I mean that figuratively, not literally: there is no AWS Support ticket open, and AWS has not committed to fixing this in any way.)

If you want to avoid CIDR conflicts on Sagemaker Notebooks in your AWS VPC, make sure you use a CIDR range other than 172.17.0.0.
