> Is the Unicode character U+1F4A9 used in the conformance checker test
> suite for URLs really invalid?
No, it's valid. Thanks for catching this and taking the time to report it.
> It’s marked as novalid in test suite files like:
Yeah, I'll need to fix that. But before I do, I'll wait for a fix to the
upstream code of the URL parsing library the validator uses, called
galimatias. I've already filed a pull request with a proposed fix:
> In RFC 3987 this character is listed in the 10000-1FFFD range in the
> iuserinfo -> iunreserved -> ucschar production:
> iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" )
> iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
> ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
> / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
> / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
> / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
> / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
> / %xD0000-DFFFD / %xE1000-EFFFD
> In the Whatwg URL standard it’s listed as a valid URL code point, and
> will be converted to percent encoding during the normalisation process,
> but won’t flag an error. See
> https://url.spec.whatwg.org/#url-code-points
> https://url.spec.whatwg.org/#authority-state
Yup. Your reading of the spec is right. I'd made the mistake of being lazy
and having the test suite just follow the (buggy in this particular case)
behavior of galimatias on this, rather than checking it against the spec.
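
For anyone who wants to double-check the range membership by hand, here is a
minimal sketch in Python. It is not code from galimatias or the validator;
the names UCSCHAR_RANGES and is_ucschar are made up for illustration, and the
ranges are copied from the RFC 3987 ucschar production quoted above. The last
line uses Python's urllib.parse.quote as a stand-in to show the UTF-8
percent-encoded form; it is not the WHATWG percent-encoding algorithm itself.

```python
from urllib.parse import quote

# ucschar ranges, copied from the RFC 3987 production quoted above.
# (Hypothetical helper names; not part of any library.)
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]


def is_ucschar(ch: str) -> bool:
    """Return True if the single character ch falls in an RFC 3987 ucschar range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


pile = "\U0001F4A9"  # U+1F4A9

print(is_ucschar(pile))      # True: it sits in the 10000-1FFFD range
print(quote(pile, safe=""))  # %F0%9F%92%A9 (UTF-8 bytes, percent-encoded)
```

So the character is allowed by the ucschar production, and percent-encoding it
yields the four UTF-8 bytes F0 9F 92 A9.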