About How A Single Character Broke Alibaba Cloud’s Container Registry
You load the site… it doesn’t work. You check the console… some weird error. You examine the code… nothing obviously wrong. The debugging nightmare begins! This post starts with something all “too” common in programming, am I right? Yes, I am :D
There are just too many cases where production breaks “just because” and it turns out to be just a stupid character or an extra semicolon. Sometimes it’s our fault, sometimes someone else’s, sometimes nobody’s ¯\_(ツ)_/¯.
What’s going on here?
This story happened with the Enterprise Edition of the Alibaba Cloud Container Registry. The coolest feature, from my point of view, of this so-called “Enterprise Edition” is the ability for users to replicate Docker images across regions. This comes in especially handy when using exactly the same Docker image is vital for deployments in several regions, as consistency is king there.
Another handy use case for cross-border image replication (and, at least for me, the most interesting one) is the ability to use Docker images inside Mainland China in a reliable way, as pulling Docker images from China is a painful experience.
So here I was, trying to replicate some Docker images from the United States to Mainland China, just a simple (or so I thought) task. But the image never showed up on the other end, hmm… Our problem could be summed up as “There is something weird going on with how Alibaba Cloud Container Registry saves the Docker image’s layers”.
Basically a layer, or image layer, is a change on an image, or an intermediate image. Every command you specify (FROM, RUN, COPY, etc.) in your Dockerfile causes the previous image to change, thus creating a new layer.
See that “â€”? Yes, my experience also tells me that’s a red flag; I knew something was off the moment I saw it. I was inspecting the image layers because I was getting desperate and all my efforts to find the problem had been useless.
In the end, after inspecting the original base Docker image used in this project, and after landing on the original Dockerfile, I found out that the author was using an unusual (but perfectly valid Unicode) character. The offending character was “U+201D”, also called “RIGHT DOUBLE QUOTATION MARK”. Yes, as I said, it is a valid Unicode character, but an unusual one to use when programming, where we just use the plain “QUOTATION MARK”, a.k.a. “U+0022”.
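A plausible explanation for that “â€” (an assumption on my side, not something confirmed by Alibaba Cloud) is classic mojibake: the UTF-8 bytes of U+201D being read back as Windows-1252. Here is a minimal Python sketch of that idea; the helper names describe and mojibake are made up for illustration.

```python
import unicodedata

RIGHT_DOUBLE_QUOTE = "\u201d"  # the character found in the Dockerfile
PLAIN_QUOTE = "\u0022"         # the quote we normally type when programming


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252.

    Bytes that Windows-1252 leaves undefined (such as 0x9D) are dropped.
    """
    return text.encode("utf-8").decode("cp1252", errors="ignore")


if __name__ == "__main__":
    print(describe(RIGHT_DOUBLE_QUOTE))
    print(describe(PLAIN_QUOTE))
    # U+201D takes three bytes in UTF-8
    print(RIGHT_DOUBLE_QUOTE.encode("utf-8").hex(" "))
    # The same three bytes, misread one byte at a time
    print(mojibake(RIGHT_DOUBLE_QUOTE))
```

Under that assumption, the first two bytes turn into “â” and “€”, and the third byte has no meaning in Windows-1252, which would match the garbage I saw in the layers.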
Alibaba Cloud’s internal Operating System didn’t expect that character to be there, apparently, so when it was replicating the image internally from one region to another region, it got stuck and the image never got to the “other side” alive.
Can you replicate this?
I published this article after the bug was fixed, obviously, but the following Dockerfile served as a guinea pig for me and Alibaba Cloud to further explore and fix the bug. Problem solved!
FROM httpd
LABEL org.label-schema.build-date="”2020-06-25T12:10:51Z”"
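If you want to catch this kind of character before it reaches a registry, a small check like the one below could help. It is only a sketch: the function name find_non_ascii is hypothetical, the Dockerfile is embedded as a string for the sake of the example, and flagging every non-ASCII character is a deliberately blunt rule that you may want to relax.

```python
import unicodedata

# The test Dockerfile from above; \u201d is the RIGHT DOUBLE QUOTATION MARK
DOCKERFILE = (
    "FROM httpd\n"
    'LABEL org.label-schema.build-date="\u201d2020-06-25T12:10:51Z\u201d"\n'
)


def find_non_ascii(text: str) -> list[tuple[int, int, str, str]]:
    """Return (line, column, code point, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    for line_no, col, code_point, name in find_non_ascii(DOCKERFILE):
        print(f"line {line_no}, column {col}: {code_point} {name}")
```

To run it against a real file, read the Dockerfile with encoding="utf-8" and pass its contents to the function.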
When there’s a will there’s a way
A classic, right? But it’s true. I spent 5 days at my full-time job back then trying to figure out what was happening, seeing my role and my own sanity in serious danger.
In the end, it wasn’t even my fault! It was one of those once-in-a-lifetime bugs, and this research helped Alibaba Cloud to further improve their product :D