Let's stop using [a-zA-Z]
If you, like me, regularly (see what I did here?) validate alphanumeric fields using Regex, you probably learned to do it like this:
'Till'.match(/[a-zA-Z0-9]+/gu)
This is technically correct, of course. And it's what most validation libraries will do when you tell them a field is alpha / alphanumeric / etc.
However, I have a problem with this approach, and a lot (!) of other people do, too. Because I'm from Germany. More specifically, from a town called Lüdenscheid. And Lüdenscheid won't match the regular expression above because of the umlaut. The same applies to languages like French, Spanish, and Czech, just to name a few.
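To see the failure concretely, here is a small sketch. The town name is written with the escape \u00FC for the ü so the example does not depend on how the file is encoded, and the variable names are only for illustration.

```javascript
// 'Lüdenscheid' with a precomposed ü (U+00FC), written as an escape.
const town = 'L\u00FCdenscheid';
const asciiOnly = /[a-zA-Z0-9]+/gu;

// The ü is not in a-z, so the name is split around it into two matches.
const parts = town.match(asciiOnly);
console.log(parts);

// An anchored check, as a validator would typically use, rejects the name.
const isAsciiAlphanumeric = /^[a-zA-Z0-9]+$/u.test(town);
console.log(isAsciiAlphanumeric); // false
```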
So how can we as developers be more inclusive towards languages other than English? Do we have to include all possible variations of the Latin alphabet? That's a common suggestion, but of course, it doesn't scale well.
Luckily, Unicode has us covered:
'Lüdenscheid'.match(/[\p{Letter}\p{Mark}]+/gu)
The \p escape (available when the regular expression uses the u flag) allows us to pick a so-called Unicode General Category. In Unicode, every character is assigned to a category that we can use in our regular expression. The Letter category includes letters from all kinds of languages, not just A-Z. But it does not include, e.g., <, >, + or $, which is important for security. The Mark category, as lionelrowe pointed out in the comments (thanks), contains combining marks. In Unicode, a letter like ü can be either one code point or two combined code points. So depending on how the character is encoded, we need the Mark category.
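For actual validation, the pattern usually needs to cover the whole input, not just find letters somewhere inside it. The following is a minimal sketch under that assumption; isAlphabetic and LETTERS_ONLY are hypothetical names, not part of any library.

```javascript
// Hypothetical helper: true only if the whole string consists of
// letters and combining marks.
// ^ and $ anchor the pattern to the full input. The g flag is left out
// on purpose, because it makes RegExp.prototype.test stateful (lastIndex).
const LETTERS_ONLY = /^[\p{Letter}\p{Mark}]+$/u;

function isAlphabetic(value) {
  return LETTERS_ONLY.test(value);
}

console.log(isAlphabetic('Till'));             // true
console.log(isAlphabetic('L\u00FCdenscheid')); // true
console.log(isAlphabetic('<script>'));         // false
console.log(isAlphabetic('a$'));               // false
console.log(isAlphabetic(''));                 // false
```

Anchoring matters here: without ^ and $, a value like '<script>' would still produce a match for the letters inside it.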
More details on the Mark category
If we omit the Mark category and run the following regex: 'Lüdenscheid'.match(/[\p{Letter}]+/gu), it will match Lüdenscheid in full if the ü is encoded as a single code point. On the other hand, if the ü is encoded as a letter-mark combination (u followed by the combining diaeresis U+0308), the match breaks at the mark: the regex matches Lu and denscheid separately instead of the whole name.
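Both encodings can be produced from the same string with String.prototype.normalize, which makes the difference easy to reproduce. This is a small sketch; the variable names are only for illustration.

```javascript
const composed = 'L\u00FCdenscheid';          // ü as one code point (NFC form)
const decomposed = composed.normalize('NFD'); // u followed by U+0308

// Same visible text, different number of code units.
console.log(composed.length, decomposed.length); // 11 12

const lettersOnly = /[\p{Letter}]+/gu;
const lettersAndMarks = /[\p{Letter}\p{Mark}]+/gu;

// Precomposed form: one match covering the whole name.
const composedMatches = composed.match(lettersOnly);

// Decomposed form without Mark: the combining diaeresis splits the match.
const splitMatches = decomposed.match(lettersOnly);

// Decomposed form with Mark: one match again.
const fullMatches = decomposed.match(lettersAndMarks);

console.log(composedMatches.length, splitMatches.length, fullMatches.length); // 1 2 1
```

Normalizing input to NFC before validating is another option for cases like ü, but some scripts use marks that have no precomposed form, so keeping Mark in the pattern is likely the safer default.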
Browser support
Browser support for this feature is good, IE (not Edge) being the only exception.
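If older engines are a concern, support can be checked at runtime. The sketch below assumes a hypothetical helper named supportsUnicodePropertyEscapes; it builds the pattern from a string so that an unsupported engine throws inside the try block instead of failing to parse the whole script.

```javascript
// Hypothetical helper: feature-detect Unicode property escapes.
// Engines without the u flag or without \p{...} throw a SyntaxError
// from the RegExp constructor, which is caught here.
function supportsUnicodePropertyEscapes() {
  try {
    new RegExp('\\p{Letter}', 'u');
    return true;
  } catch (error) {
    return false;
  }
}

console.log(supportsUnicodePropertyEscapes());
```

Where this returns false, the validation would have to fall back to something else, such as an explicit character-range pattern.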
Bonus
// Match only letters
'Lüdenscheid'.match(/[\p{Letter}\p{Mark}]+/gu)
// Match letters and spaces
'Pražští filharmonici'.match(/[\p{Letter}\p{Mark}\s]+/gu)
// Match letters and hyphens
'Île-de-France'.match(/[\p{Letter}\p{Mark}-]+/gu)
// Match letters, hyphens and spaces
'Île-de-France'.match(/[\p{Letter}\p{Mark}\s-]+/gu)
Original Link: https://dev.to/tillsanders/let-s-stop-using-a-za-z-4a0m