July 18, 2021 04:17 am GMT

Getting Started with Regular Expressions

Regular expressions (regex) are one of those things that folks tend to make fun of, usually because they don't understand them or only partially understand them.

I decided to write this post after Ben Hong Tweeted out asking for good regex resources.

Is this post going to make you a regex guru? No, but it will teach you some of the pitfalls that developers succumb to. The example code in this post uses regular expressions in JavaScript, but you should be able to use them in your language of choice, or at least apply the concepts if the syntax is slightly different.

Be Specific

Know what you're trying to look for. This may sound obvious on the surface, but it's not always the case. Let's say you want to find instances of three in a text file because you need to replace all instances of three with the number 3. You've done a bit of Googling and/or checked out regex101.com. You're feeling pretty good, so you write out this regular expression.

const reMatchThree = /three/g

Note: If you're new to regular expressions, everything between the starting / and the ending / is the regular expression. The g after the last / means global, as in find all instances.
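To see what the g flag changes in practice, here is a small sketch. The sample sentence is made up for illustration and isn't from the text file in this post.

```javascript
// Hypothetical sample text, used only to illustrate the g flag.
const sample = 'three blind mice and three little pigs';

// Without the g flag, replace() stops after the first match.
const firstOnly = sample.replace(/three/, '3');

// With the g flag, every match is replaced.
const all = sample.replace(/three/g, '3');

console.log(firstOnly); // 3 blind mice and three little pigs
console.log(all);       // 3 blind mice and 3 little pigs
```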

You run the regular expression to match all instances of three so it can be replaced with 3. You look at what got replaced in the text and you're a little perplexed.

- There were three little pigs that live in their own houses to keep safe from the big bad wolf who is thirty-three years old.
+ There were 3 little pigs that live in their own houses to keep safe from the big bad wolf who is thirty-3 years old.
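If you'd like to reproduce that result yourself, something like the following should do it. This sketch assumes the replacement is done with String.prototype.replace, which the post doesn't actually spell out.

```javascript
const text =
  'There were three little pigs that live in their own houses to keep safe ' +
  'from the big bad wolf who is thirty-three years old.';

// The naive regex: matches the letters "three" anywhere, even inside other words.
const reMatchThree = /three/g;

const replaced = text.replace(reMatchThree, '3');

console.log(replaced);
// There were 3 little pigs that live in their own houses to keep safe from the big bad wolf who is thirty-3 years old.
```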

three got replaced by 3 everywhere in the file, but why was thirty-three replaced? You only wanted threes replaced. And here we have our first lesson. Be specific. We only want a match when three is a word on its own, so we need to beef up this regex a little. We only want to find three when it's the first word in the text, has white space before and after it, or is followed by punctuation or the end of the text. With those criteria, the regex might look like this now.

const reMatchThree = /(?:\s|^)(three)(?:\s|\.|,|;|:|'|"|!|$)/g

Note: Don't worry if you're not familiar with all the syntax. The ^ character means the beginning of a line of text. The $ character means the end of a line of text. For more information about ^ and $, see Start of String and End of String Anchors.

When part of a regex is wrapped in parentheses, it forms a capturing group, and whatever that group matched is returned as a group in the match result. If part of a regex is wrapped in (?: and ), it is a non-capturing group: it still has to match, but it won't show up as a group in the match result.
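Here's a rough sketch of the difference between the two kinds of groups. The variable names and the tiny input string are made up for the example.

```javascript
// Both regexes match the same text; they differ only in what they report back.
const withCapture = /(\s|^)(three)/;      // two capturing groups
const withoutCapture = /(?:\s|^)(three)/; // one non-capturing, one capturing

const a = ' three'.match(withCapture);
const b = ' three'.match(withoutCapture);

// Index 0 is always the full match; the groups follow.
console.log(a[0], '|', a[1], '|', a[2]); // full match ' three', then ' ' and 'three'
console.log(b[0], '|', b[1]);            // full match ' three', then only 'three'
console.log(a.length, b.length);         // 3 2
```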

Another thing to note is that . means match any character (except a line break, by default). In our case, we're looking for a literal period in the text, so we need to escape it with a backslash: \.
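A quick way to convince yourself of the difference between . and \. is to test both against a string that has no period in it. The patterns and inputs below are hypothetical, chosen just for the demonstration.

```javascript
const anyChar = /3.14/;     // . stands for any single character
const literalDot = /3\.14/; // \. stands for an actual period

console.log(anyChar.test('3.14'));    // true
console.log(anyChar.test('3x14'));    // true, because x counts as "any character"
console.log(literalDot.test('3.14')); // true
console.log(literalDot.test('3x14')); // false
```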

The grouped sections of the regex contain the | character. In regex, that means or. For example, (?:\s|^) means find white space (\s) or the start of the text (^).

Another special regex character that you may have noticed in the first group is \s. This means a whitespace character, such as a space, a tab, or a line break.
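Putting the pieces together, here is one way the replacement could be done. Note that this is a variant of the regex above, not the exact one: the outer groups are made capturing (my assumption for this sketch) so the characters around three can be put back by a replacer function instead of being swallowed by the match.

```javascript
const text =
  'There were three little pigs that live in their own houses to keep safe ' +
  'from the big bad wolf who is thirty-three years old.';

// Variant for illustration: (\s|^) and the trailing group capture what
// surrounds "three" so the replacer can restore it.
const reMatchThree = /(\s|^)three(\s|\.|,|;|:|'|"|!|$)/g;

const replaced = text.replace(
  reMatchThree,
  (match, before, after) => `${before}3${after}`
);

console.log(replaced);
// There were 3 little pigs that live in their own houses to keep safe from the big bad wolf who is thirty-three years old.
```

One limitation of this approach: the match consumes the whitespace after three, so in text like "three three" the second word would not be matched in the same pass.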

Don't Be Too Greedy

Greed is usually not a good thing, and greed in regex is no exception. Let's say you're tasked with finding all the text snippets between double quotes. If you remember from the previous section, I mentioned that . means any character. Another special character is +. It means one or more of whatever comes right before it, so .+ means at least one character. With this knowledge, you set out to build your regex. Also, for the sake of this example, we are going to assume the happy path, i.e. no double-quoted strings within double-quoted strings.

const reMatchBetweenDoubleQuotes = /"(.+)"/g

You're feeling good and you run this regex over the file you need to extract the texts from.

Hi there "this text is in double quotes". As well, "this text is in double quotes too".

The results come in and here are the texts that the regex matched for texts within double quotes:

  • this text is in double quotes". As well, "this text is in double quotes too

Wait a minute!? That's not what you were expecting. There are clearly two sets of text within double quotes, so what went wrong? Lesson number two. Don't be greedy.

If we look again at the regex you created, it contains .+, which literally means match any character as many times as possible. That's why we end up with a single match, this text is in double quotes". As well, "this text is in double quotes too: the " character also counts as any character, so .+ runs past the inner quotes and only stops at the last " in the text.
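The greedy behaviour is easy to reproduce. This sketch assumes the snippets are collected with matchAll, which is just one of several ways to pull capture groups out of a global regex.

```javascript
const text =
  'Hi there "this text is in double quotes". As well, "this text is in double quotes too".';

// Greedy: .+ grabs as much as it can, then backs up to the last " it can find.
const reMatchBetweenDoubleQuotes = /"(.+)"/g;

// Collect capture group 1 from every match.
const matches = [...text.matchAll(reMatchBetweenDoubleQuotes)].map((m) => m[1]);

console.log(matches.length); // 1
console.log(matches[0]);
// this text is in double quotes". As well, "this text is in double quotes too
```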

There are two ways to approach this in our simplified scenario. We can use the non-greedy version of + by replacing it with +?

const reMatchBetweenDoubleQuotes = /"(.+?)"/g

Which means find a ", start a capturing group, then match as few characters as possible until you hit the next ".
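Running the non-greedy version over the same sample text gives the two snippets you were after. As before, using matchAll to collect the results is an assumption of this sketch.

```javascript
const text =
  'Hi there "this text is in double quotes". As well, "this text is in double quotes too".';

// Non-greedy: .+? stops at the first " that lets the whole pattern match.
const reMatchBetweenDoubleQuotes = /"(.+?)"/g;

const matches = [...text.matchAll(reMatchBetweenDoubleQuotes)].map((m) => m[1]);

console.log(matches);
// [ 'this text is in double quotes', 'this text is in double quotes too' ]
```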

Another approach, which I prefer is the following:

const reMatchBetweenDoubleQuotes = /"([^"]+)"/g

Which means find a ", start a capturing group then find as many characters as possible that aren't " before you hit a ".

Note: Some new syntax. [ and ] are a way to say match any one of the characters listed inside. In our case though, we're using it with ^, i.e. [^, to say match any character that is not listed inside. In our case, we're saying to not match the " character.
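Here is the negated character class version in action, along with one difference from the non-greedy version that may matter in practice: . does not match line breaks by default, while [^"] does. The multi-line input is a made-up example.

```javascript
const text =
  'Hi there "this text is in double quotes". As well, "this text is in double quotes too".';

const reMatchBetweenDoubleQuotes = /"([^"]+)"/g;
const matches = [...text.matchAll(reMatchBetweenDoubleQuotes)].map((m) => m[1]);

console.log(matches);
// [ 'this text is in double quotes', 'this text is in double quotes too' ]

// Hypothetical input with a line break inside the quotes.
const multiline = '"line one\nline two"';

const lazyResult = multiline.match(/"(.+?)"/);    // . can't cross the line break
const classResult = multiline.match(/"([^"]+)"/); // [^"] can

console.log(lazyResult);                            // null
console.log(classResult[1] === 'line one\nline two'); // true
```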

That's all for now! If you have questions about regexes, drop a comment!

Original Link: https://dev.to/nickytonline/getting-started-with-regular-expressions-11dg
