RSS Gardening
RSS feed parsing is pretty messy, and that's well-known. Feed producers and libraries regularly insert random tags, use bizarre timestamp formats, and do all sorts of 'illegal' things in their feeds. And they're constantly changing item publication timestamps (a personal pet peeve).
You're pretty much dead in the water if you don't embrace Postel's law: "Be conservative in what you do, be liberal in what you accept from others."
So, while it was ironic to hit an issue on launch day (yesterday), it was unsurprising.
A bunch of people tried out the app (cool), and I suspected the new errors came from one of their feeds. No, it turned out to come from a feed that an existing customer added three months ago.
Some background: Yupdates looks through all <a> and <img> links in RSS/Atom content and turns relative links into absolute ones (clicking a link in item content should take you to the intended target). Sounds straightforward, but there have been many snags here.
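For flavor, here's a minimal sketch of the general idea. This is illustration only, not Yupdates' actual code; it leans on BeautifulSoup and urljoin to keep things short.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def absolutize_links(html: str, base_url: str) -> str:
    """Rewrite relative href/src attributes against the item's base URL."""
    soup = BeautifulSoup(html, "html.parser")
    for tag, attr in (("a", "href"), ("img", "src")):
        for el in soup.find_all(tag):
            if el.has_attr(attr):
                # urljoin leaves already-absolute URLs untouched
                el[attr] = urljoin(base_url, el[attr])
    return str(soup)

# absolutize_links('<a href="/about">hi</a>', "https://example.com/post")
# -> '<a href="https://example.com/about">hi</a>'
```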
This time, an issue with an image srcset value left a single letter as one of the candidate URLs to absolutize. Every instance of that letter in the content was replaced with the absolute URL version, and this happened in a loop (the srcset contained 9 images). The item size really blew up, well beyond our 100KB limit.
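To make that concrete, here's a contrived sketch of the failure mode. The srcset, the base URL, and the loop shape are all made up for illustration; the point is that the replacement text ends in the same letter it replaces, so every pass grows the content again.

```python
base = "https://example.com/images/"
# A malformed srcset whose last "candidate" parses down to one letter.
content = '<p>now we know</p><img srcset="a.jpg 1x, b.jpg 2x, w">'

for _ in range(9):  # one pass per image in the srcset, as in the incident
    # The naive parse decided "w" was a relative URL to absolutize, and a
    # blind string replacement hits every "w" in the content, not just
    # the one inside the srcset value.
    content = content.replace("w", base + "w")

print(len(content))  # grows on every pass; the real item blew up ~50x
```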
Encountering large items is fine (it's rare, and they get truncated), but, oops, that size limit is checked before link replacement runs in the intake pipeline (the idea being that we can cut some slack there as long as the original content was within the limit). This issue increased the size by 50x and blew past DynamoDB's 400KB item limit.
Deeper in the stack, Postel's law is off the table and expectations are constantly asserted, but I just wasn't paranoid enough (until the final DB write). Yes, increasing item size is OK later in the pipeline, but I should have added a sanity check on just how many bytes that transformation can add. Thankfully, each input is processed in isolation.
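In code, the missing guard might look something like this. It's a hedged sketch: the names and the growth allowance are hypothetical, not the actual fix.

```python
MAX_ITEM_BYTES = 100 * 1024      # the intake limit mentioned above
MAX_GROWTH = 4                   # generous slack for absolutized links
DYNAMO_ITEM_LIMIT = 400 * 1024   # DynamoDB's hard per-item cap

def assert_sane_growth(before: bytes, after: bytes) -> bytes:
    """Fail loudly if a transform grew an item more than it ever should."""
    cap = min(MAX_ITEM_BYTES * MAX_GROWTH, DYNAMO_ITEM_LIMIT)
    if len(after) > cap:
        raise ValueError(
            f"transform grew item from {len(before)} to {len(after)} bytes"
        )
    return after
```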
This won't be the last time I do some RSS gardening (weed pulling). It's not the most gratifying work, but I do love seeing errors disappear and getting people the correct content.
Each time it happens, I end up wondering about general solutions. As with browsers and sloppy HTML, I don't think there's a way to change so many people's behavior on the content-producing side. Can we build parsers that don't need so much special-casing?