Regular Expressions Tutorial: From Zero to Regex Hero
Regular expressions seem cryptic at first, but they are one of the most powerful text processing tools available. This beginner-friendly tutorial teaches you regex step by step with practical examples.
Introduction
Regular expressions (often shortened to regex or regexp) are patterns that describe sets of strings. They are used in virtually every programming language and many text editors for searching, matching, extracting, and replacing text. Despite their reputation for being cryptic and confusing, the core concepts are straightforward once you understand the building blocks.
By the end of this tutorial, you will be able to write regular expressions to validate email addresses, extract data from text, find and replace patterns in code, and much more. We will start from the absolute basics and build up to practical, real-world patterns.
What Are Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. At its simplest, a regex is just a literal string — the pattern hello matches the text hello. But the power of regex comes from special characters (called metacharacters) that represent classes of characters, repetitions, positions, and groupings.
Think of regex as a very precise search language. Instead of searching for a specific word, you can search for patterns like any word that starts with a capital letter or any sequence of digits that looks like a phone number.
Literal Characters
The simplest regex patterns are literal characters. The pattern cat matches the text cat wherever it appears. It would match in the words cat, concatenate, and caterpillar. This is essentially what your browser's Ctrl+F find feature does — it matches the exact characters you type.
Most characters in regex match themselves literally. The letters a through z, A through Z, digits 0 through 9, and many symbols are literal. However, certain characters have special meaning and need to be escaped with a backslash if you want to match them literally. These special characters are the dot (.), asterisk (*), plus (+), question mark (?), caret (^), dollar sign ($), brackets ([ ]), braces ({ }), parentheses (( )), pipe (|), and backslash itself.
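As a small sketch of the two ideas above, here is how literal matching and escaping look in Python's standard re module. The sample strings are made up for illustration, and re.escape is one convenient way (not the only one) to add the backslashes for you.

```python
import re

# A literal pattern matches the same characters wherever they appear.
words = ["cat", "concatenate", "caterpillar", "dog"]
contains_cat = [w for w in words if re.search(r"cat", w)]

# Unescaped, "+" and "?" act as quantifiers, so this pattern does not
# find the literal text "1+1=2?".
unescaped_hit = re.search(r"1+1=2?", "is 1+1=2? yes") is not None

# re.escape puts a backslash in front of the metacharacters.
escaped_pattern = re.escape("1+1=2?")
escaped_hit = re.search(escaped_pattern, "is 1+1=2? yes") is not None

print(contains_cat)
print(unescaped_hit, escaped_hit)
```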
Character Classes
Character classes let you match any one character from a set. Square brackets define a character class. The pattern [aeiou] matches any single vowel. The pattern [0-9] matches any single digit. The pattern [a-zA-Z] matches any single letter, uppercase or lowercase.
You can negate a character class by placing a caret at the start. The pattern [^0-9] matches any character that is not a digit. The pattern [^aeiou] matches any character that is not a lowercase vowel.
Regex also provides shorthand character classes for common patterns. The shorthand \d matches any digit (equivalent to [0-9] in ASCII-only matching). The shorthand \w matches any word character, meaning letters, digits, and underscore (equivalent to [a-zA-Z0-9_] in ASCII-only matching). The shorthand \s matches any whitespace character, including spaces, tabs, and newlines. The uppercase versions \D, \W, and \S match the opposite: non-digit, non-word, and non-whitespace characters respectively. Be aware that some engines extend \d and \w to non-ASCII digits and letters when Unicode matching is enabled.
The dot (.) is a special metacharacter that, by default, matches any single character except a newline. The pattern c.t matches cat, cot, cut, c1t, and c@t: any three-character string starting with c and ending with t.
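The character classes described above can be tried out with a short Python sketch. The sample text is invented for illustration; re.findall returns every non-overlapping match, and re.fullmatch requires the whole string to match.

```python
import re

text = "Room 42, Floor 7: call_me at 9am!"

vowels = re.findall(r"[aeiou]", "regex")        # each vowel, one at a time
digits = re.findall(r"[0-9]", text)             # each digit, one at a time
non_digits = re.findall(r"[^0-9]", "a1b2")      # negated class
word_runs = re.findall(r"\w+", text)            # runs of word characters

# The dot matches any single character except a newline (by default).
candidates = ["cat", "cot", "c@t", "ct", "c\nt"]
dot_matches = [s for s in candidates if re.fullmatch(r"c.t", s)]

print(vowels, digits, non_digits)
print(word_runs)
print(dot_matches)
```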
Quantifiers
Quantifiers specify how many times a preceding element should be repeated. The asterisk (*) means zero or more times. The plus (+) means one or more times. The question mark (?) means zero or one time (making the element optional).
The pattern ab*c matches ac (zero b's), abc (one b), abbc, abbbc, and so on. The pattern ab+c matches abc, abbc, abbbc — but not ac, because plus requires at least one occurrence. The pattern colou?r matches both color and colour — the u is optional.
For specific repetition counts, use curly braces. The pattern a{3} matches exactly three a's. The pattern a{2,4} matches two, three, or four a's. The pattern a{2,} matches two or more a's.
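To see the quantifiers side by side, the following Python sketch checks each pattern against a few candidate strings. The helper name full_matches is our own invention for this example; it keeps only the candidates that the pattern matches from start to end.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"ab*c", ["ac", "abc", "abbbc", "adc"])
plus = full_matches(r"ab+c", ["ac", "abc", "abbbc"])
optional = full_matches(r"colou?r", ["color", "colour", "colouur"])
exact = full_matches(r"a{3}", ["aa", "aaa", "aaaa"])
ranged = full_matches(r"a{2,4}", ["a", "aa", "aaaa", "aaaaa"])
at_least = full_matches(r"a{2,}", ["a", "aa", "aaaaaa"])

print(star, plus, optional)
print(exact, ranged, at_least)
```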
Anchors
Anchors match positions rather than characters. The caret (^) matches the start of the string, and the dollar sign ($) matches the end. In multiline mode, which most engines offer as a flag, they also match at the start and end of each line. The combination ^hello$ matches only the exact string hello with nothing before or after it.
Word boundaries (\b) mark the transition between a word character and a non-word character. The pattern \bcat\b matches cat as a whole word but not the cat in concatenate or caterpillar. This is essential for precise word matching.
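Here is a brief Python sketch of anchors and word boundaries. The sample strings are made up, and re.MULTILINE is Python's name for the multiline flag; other languages spell it differently.

```python
import re

# ^hello$ only accepts the string "hello" on its own.
samples = ["hello", "hello world", "say hello"]
exact = [s for s in samples if re.search(r"^hello$", s)]

# Without a flag, ^ and $ refer to the whole string, not to each line.
lines = "first\nhello\nlast"
default_hit = re.search(r"^hello$", lines) is not None
multiline_hit = re.search(r"^hello$", lines, re.MULTILINE) is not None

# \b keeps "cat" from matching inside longer words.
whole_words = re.findall(r"\bcat\b", "cat concatenate caterpillar cat.")

print(exact)
print(default_hit, multiline_hit)
print(whole_words)
```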
Groups and Alternation
Parentheses create groups that can be quantified as a unit and captured for extraction. The pattern (ab)+ matches ab, abab, ababab — one or more repetitions of the entire ab sequence, not just the b.
The pipe (|) provides alternation — matching one pattern or another. The pattern cat|dog matches either cat or dog. Combined with groups, (cat|dog)s matches cats or dogs.
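Groups and alternation can be sketched in Python as follows. The example sentences are invented; group(0) is the whole match and group(1) is the text captured by the first pair of parentheses.

```python
import re

# The quantifier applies to the whole group, not just the last letter.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None
not_repeated = re.fullmatch(r"(ab)+", "abb") is not None

# Alternation inside a group, followed by a literal "s".
sentence = "cats, dogs, and birds"
plurals = [m.group(0) for m in re.finditer(r"(cat|dog)s", sentence)]

# The group also captures which alternative matched.
m = re.search(r"(cat|dog)s", "I like dogs")
whole, captured = m.group(0), m.group(1)

print(repeated, not_repeated)
print(plurals)
print(whole, captured)
```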
Practical Examples
Let us apply these concepts to real-world patterns. To validate an email address, a basic pattern is ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. Breaking this down, [a-zA-Z0-9._%+-]+ matches one or more characters in the local part, @ matches the literal at sign, [a-zA-Z0-9.-]+ matches the domain name, \. matches a literal dot, and [a-zA-Z]{2,} matches the top-level domain with at least two letters.
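As a minimal sketch, the email pattern above can be wrapped in a small Python function. The function name looks_like_email and the sample addresses are our own. Keep in mind that this is the basic pattern from this tutorial, so it is a rough plausibility check rather than a complete validator.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(address: str) -> bool:
    """Return True if the address fits the basic pattern."""
    # fullmatch also rejects a trailing newline, which "$" alone
    # would tolerate in Python.
    return EMAIL_RE.fullmatch(address) is not None

samples = [
    "user@example.com",
    "first.last+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@localhost",
    "user@example.c",
]
results = {s: looks_like_email(s) for s in samples}
print(results)
```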
To match a US phone number in various formats, the pattern \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} matches formats like (555) 123-4567, 555-123-4567, 555.123.4567, and 5551234567. The \(? and \)? make the parentheses optional, [-.\s]? makes the separator optional, and \d{3} and \d{4} match the digit groups.
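The phone pattern can be checked against the formats listed above with a short Python sketch. The extra samples are our own additions; the last one shows a limitation, since the two optional parentheses are independent of each other.

```python
import re

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

samples = [
    "(555) 123-4567",
    "555-123-4567",
    "555.123.4567",
    "5551234567",
    "555-1234",        # too short, should not match
]
matched = [s for s in samples if PHONE_RE.fullmatch(s)]

# Caveat: an unbalanced parenthesis is accepted too.
unbalanced_ok = PHONE_RE.fullmatch("(555 123-4567") is not None

print(matched)
print(unbalanced_ok)
```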
To extract URLs from text, the pattern https?://[^\s]+ matches any http or https URL. The s? makes the s optional, :// matches the literal protocol separator, and [^\s]+ matches one or more non-whitespace characters, capturing everything up to the next whitespace character. Note that this includes any punctuation that directly follows the URL, such as a trailing comma or period.
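A quick Python sketch of the URL pattern, using made-up example.com addresses. Because the pattern grabs every non-whitespace character, punctuation right after a URL ends up in the match; stripping a few trailing characters afterwards is one simple workaround, not the only one.

```python
import re

text = (
    "Docs at https://example.com/guide and http://example.org/a?b=1, "
    "see also ftp://old.example.net"
)

urls = re.findall(r"https?://[^\s]+", text)

# Optional cleanup: drop punctuation that trails the URL in the sentence.
cleaned = [u.rstrip(".,;:!?)") for u in urls]

print(urls)
print(cleaned)
```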
To find all HTML tags, the pattern <[^>]+> matches any opening or closing tag. The < matches the literal opening bracket, [^>]+ matches one or more characters that are not a closing bracket, and > matches the literal closing bracket.
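The tag pattern can be tried on a small snippet in Python. The HTML strings are invented for this example; the second one shows where the pattern goes wrong, because a > inside an attribute value ends the match early.

```python
import re

TAG_PATTERN = r"<[^>]+>"

html = '<p class="note">Hello <b>world</b></p>'
tags = re.findall(TAG_PATTERN, html)

# Caveat: ">" inside a quoted attribute cuts the tag short.
tricky = '<a title="x > y">link</a>'
tricky_tags = re.findall(TAG_PATTERN, tricky)

print(tags)
print(tricky_tags)
```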
Lookahead and Lookbehind
Lookahead and lookbehind are advanced features that match a position based on what comes next or before, without including those characters in the match. A positive lookahead (?=pattern) asserts that what follows matches the pattern. A negative lookahead (?!pattern) asserts that what follows does not match.
For example, \d+(?= dollars) matches one or more digits only if they are followed by the text "dollars" — but the word "dollars" is not included in the match result. This is extremely useful for precise extraction.
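Lookbehind works the same way but checks what comes before the current position. A positive lookbehind (?<=pattern) asserts that what precedes matches the pattern, and a negative lookbehind (?<!pattern) asserts that what precedes does not match.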
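The four lookaround forms, including the negative lookbehind (?<!pattern), can be sketched in Python like this. The sentence is made up. The \b anchors in the negative examples are our addition: without them the engine could backtrack and match only part of a number.

```python
import re

text = "It costs 40 dollars or 35 euros, down from $60."

# Positive lookahead: digits followed by " dollars".
before_dollars = re.findall(r"\d+(?= dollars)", text)

# Negative lookahead: whole numbers NOT followed by " dollars".
not_before_dollars = re.findall(r"\b\d+\b(?! dollars)", text)

# Positive lookbehind: digits preceded by a dollar sign.
after_sign = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_after_sign = re.findall(r"(?<!\$)\b\d+", text)

print(before_dollars, not_before_dollars)
print(after_sign, not_after_sign)
```

In Python's re module the pattern inside a lookbehind must have a fixed width, so (?<=\$) is fine but a lookbehind containing + or * is rejected.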
Common Mistakes
One of the most common regex mistakes is being too greedy. The pattern <.*> applied to the text <b>bold</b> matches the entire string from the first < to the last > because the dot-star is greedy: it matches as much as possible. Using the lazy quantifier <.*?> instead matches <b> and </b> separately, as intended.
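The difference between a greedy and a lazy quantifier is easy to check in Python. The sample text <b>bold</b> is used here for illustration.

```python
import re

html = "<b>bold</b>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)

print(greedy)
print(lazy)
```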
Another common mistake is forgetting to escape special characters. If you want to match a literal dot (like in a filename extension), you must use \. instead of just a dot, which matches any character.
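A short Python sketch shows what goes wrong when the dot in a filename extension is left unescaped. The filenames are invented for this example.

```python
import re

names = ["report.txt", "report_txt", "notes.txt", "txt"]

# Unescaped dot: any character may stand before "txt".
unescaped = [n for n in names if re.search(r".txt$", n)]

# Escaped dot: only a literal "." is accepted.
escaped = [n for n in names if re.search(r"\.txt$", n)]

print(unescaped)
print(escaped)
```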
Over-relying on regex for complex parsing tasks is also a pitfall. HTML, JSON, and other structured formats are better parsed with proper parsers. The famous advice applies: if you try to solve a problem with regex, now you have two problems. Use regex for pattern matching in flat text, not for parsing nested structures.
Testing Your Regex
Always test your regex against a variety of inputs — valid cases, edge cases, and invalid cases that should not match. Our Regex Tester tool lets you enter a pattern and test text, showing matches highlighted in real time with detailed match information. This immediate feedback accelerates learning and debugging significantly.
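If you prefer to keep such checks in code, one possible approach is a small table of inputs that should and should not match. The helper name check_pattern below is hypothetical and written just for this sketch; it uses the colou?r pattern from earlier in this tutorial.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return a list of failure messages; an empty list means all cases passed."""
    compiled = re.compile(pattern)
    failures = []
    for s in should_match:
        if not compiled.fullmatch(s):
            failures.append(f"expected a match: {s!r}")
    for s in should_not_match:
        if compiled.fullmatch(s):
            failures.append(f"unexpected match: {s!r}")
    return failures

failures = check_pattern(
    r"colou?r",
    should_match=["color", "colour"],                    # valid cases
    should_not_match=["", "colr", "colouur", "Color"],   # edge and invalid cases
)
print(failures or "all cases passed")
```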
Performance Considerations
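Regex engines can be fast or slow depending on the pattern. Catastrophic backtracking occurs when a pattern has ambiguous paths that cause a backtracking engine to try exponentially many combinations before it can report a failed match. Patterns like (a+)+ or (a|a)* are dangerous because the same input can be matched in many different ways: the first nests one quantifier inside another, and the second has overlapping alternatives.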
To avoid performance issues, be as specific as possible in your patterns. Use character classes instead of dots when you know what characters to expect. Avoid nested quantifiers on overlapping patterns. Test your patterns against long inputs to check for slowness.
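To get a feel for the problem, the following Python sketch times a nested quantifier against a simpler equivalent on an input that almost matches. The helper name time_search and the input length of 18 are our own choices; the length is kept small on purpose, because each extra a roughly doubles the work for the nested pattern in a backtracking engine such as Python's re. Exact timings depend on your machine.

```python
import re
import time

def time_search(pattern, text):
    """Run re.search once and return (match_or_None, elapsed_seconds)."""
    start = time.perf_counter()
    result = re.search(pattern, text)
    return result, time.perf_counter() - start

# Many a's followed by a b: both patterns fail, but only at the very end.
text = "a" * 18 + "b"

nested_result, nested_seconds = time_search(r"^(a+)+$", text)
simple_result, simple_seconds = time_search(r"^a+$", text)

print(f"nested: {nested_seconds:.4f}s, simple: {simple_seconds:.6f}s")
```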
Conclusion
Regular expressions are an essential tool in every developer's toolkit. The key concepts — literal characters, character classes, quantifiers, anchors, groups, and alternation — combine to create patterns that can match virtually any text structure. Start with simple patterns, test thoroughly, and build up to more complex expressions as your confidence grows. With practice, regex will become one of your most powerful and frequently used skills.