Complete Guide to Regular Expressions for Text Processing
March 2025
Regular expressions (regex) match patterns in text. They power search-and-replace, validation, parsing, and extraction. A single regex can find every email in a document, validate a phone number, or extract dates. This guide covers core concepts and practical use with our regex tester, regex generator, and find and replace tool.
Regex syntax varies slightly between languages (JavaScript, Python, Perl, etc.), but the core concepts are portable. We focus on patterns that work in JavaScript and most modern engines. For validation and data extraction, regex pairs well with our find dates and times tool and extract URLs tool.
Regex basics
A regex is a sequence of characters that defines a search pattern. Literal characters match themselves: cat matches "cat". Special characters (metacharacters) have meaning: . matches any single character except a line break, and * means "zero or more of the previous item". Test patterns in the regex tester.
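As a quick illustration of literals, the dot, and the star, here is a minimal JavaScript sketch. The sample strings and variable names are made up for the example.

```javascript
// Literal characters match themselves anywhere in the input.
const hasCat = /cat/.test("concatenate");   // true: "cat" appears inside the word

// . stands for any single character (but not a line break by default).
const dotMatchesU = /c.t/.test("cut");      // true: . matched "u"
const dotAcrossLines = /c.t/.test("c\nt");  // false: . does not match "\n"

// * repeats the previous item zero or more times.
// The anchors ^ and $ are added here so the whole string must match.
const starResults = ["ct", "cat", "caaat"].map((s) => /^ca*t$/.test(s));

console.log(hasCat, dotMatchesU, dotAcrossLines, starResults);
```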
Anchors: ^ matches the start of the string, $ matches the end. ^Hello matches "Hello" only at the beginning. world$ matches "world" only at the end. Use both to require that the whole string matches: ^cat$ matches "cat" and nothing else. In multiline mode, ^ and $ match the start and end of each line. The \b word boundary matches the position between a word character and a non-word character, which is useful for whole-word matches.
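The anchor behavior described above can be checked with a short JavaScript sketch. The input strings are illustrative only.

```javascript
// ^ and $ pin the match to the start and end of the string.
const startsWithHello = /^Hello/.test("Hello world"); // true
const helloLater = /^Hello/.test("Say Hello");        // false: not at the start
const endsWithWorld = /world$/.test("Hello world");   // true

// Both anchors together require the whole string to match.
const exactCat = /^cat$/.test("cat");   // true
const exactCats = /^cat$/.test("cats"); // false

// With the m (multiline) flag, ^ and $ apply to every line.
const firstLetters = "one\ntwo\nthree".match(/^\w/gm); // ["o", "t", "t"]

// \b keeps the match from landing inside a longer word.
const wholeWords = "cat concat cats".match(/\bcat\b/g); // ["cat"]

console.log(firstLetters, wholeWords);
```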
Escaping metacharacters
To match a literal dot, asterisk, or other metacharacter, escape with backslash: \., \*, \[. In character classes, some metacharacters lose meaning: . inside [] is literal. The regex tester highlights matches so you can verify behavior before using in the find and replace tool.
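When the text to search for comes from user input, escaping by hand is error-prone. One common approach is a small helper that backslash-escapes every metacharacter. The sketch below assumes a helper named escapeRegExp, which is a hypothetical name for this example and not part of any tool mentioned here.

```javascript
// Hypothetical helper: put a backslash before every regex metacharacter.
// "$&" in the replacement stands for the matched text itself.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Without escaping, the dot in "3.14" matches any character.
const unescaped = new RegExp("3.14");
const looseHit = unescaped.test("3514");  // true, probably not intended

// With escaping, only a literal dot is accepted.
const escaped = new RegExp(escapeRegExp("3.14"));
const strictMiss = escaped.test("3514");  // false
const strictHit = escaped.test("3.14");   // true

// Inside a character class, the dot is already literal.
const classDot = /[.]/.test("a");         // false

console.log(escapeRegExp("3.14"), looseHit, strictMiss, strictHit, classDot);
```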
Character classes
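[abc] matches exactly one character: a, b, or c. [0-9] matches any digit. [a-zA-Z] matches any ASCII letter. [^abc] matches any single character except a, b, or c. Shorthand: \d = digit, \w = word character (letter, digit, or underscore), \s = whitespace. The uppercase form inverts the class: \D = non-digit, \W = non-word character, \S = non-whitespace.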
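A few character classes in action, as a minimal JavaScript sketch with made-up inputs:

```javascript
// A bracketed class matches one character from the set.
const vowels = "regex".match(/[aeiou]/g);        // ["e", "e"]

// A leading ^ inside the brackets negates the set.
const nonVowels = "regex".match(/[^aeiou]/g);    // ["r", "g", "x"]

// Ranges and shorthand classes.
const digits = "Room 42B".match(/[0-9]/g);       // ["4", "2"]
const digitsOnly = "Room 42B".replace(/\D/g, ""); // "42": every non-digit removed
const wordChars = "a_1-b".match(/\w/g);          // ["a", "_", "1", "b"]

// \s covers spaces and tabs alike, so it works well for splitting.
const fields = "a  b\tc".split(/\s+/);           // ["a", "b", "c"]

console.log(vowels, nonVowels, digits, digitsOnly, wordChars, fields);
```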
Quantifiers
* = zero or more. + = one or more. ? = zero or one. {n} = exactly n. {n,} = n or more. {n,m} = between n and m, inclusive. By default, quantifiers are greedy: they match as much as possible. Add ? after a quantifier to make it lazy: .*? matches as little as possible.
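The difference between greedy and lazy matching is easiest to see side by side. The following JavaScript sketch uses an invented HTML fragment and a few counted quantifiers.

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .* runs to the last ">" in the string.
const greedy = html.match(/<.*>/)[0];   // "<b>bold</b> and <i>italic</i>"

// Lazy: .*? stops at the first ">" it can.
const lazy = html.match(/<.*?>/)[0];    // "<b>"

// ? makes the previous item optional.
const spellings = ["color", "colour"].map((s) => /^colou?r$/.test(s)); // [true, true]

// {n} and {n,m} count repetitions.
const fiveDigits = /^\d{5}$/.test("90210");                  // true
const twoOrThree = "a aa aaa aaaa".match(/\ba{2,3}\b/g);     // ["aa", "aaa"]

console.log(greedy, lazy, spellings, fiveDigits, twoOrThree);
```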
Capture groups
Parentheses create capture groups: applied to "555-1234", (\d{3})-(\d{4}) captures the three-digit prefix and the four-digit line number separately. In a replacement, $1 and $2 refer to the first and second captures. Use (?:...) for a non-capturing group when you need grouping but not the captured text. The regex replace toolkit supports capture groups in replacement.
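Here is a minimal JavaScript sketch of reading and reusing captures. The phone-style number is sample data.

```javascript
const pattern = /(\d{3})-(\d{4})/;

// match() returns the full match at index 0 and each capture after it.
const found = "Call 555-1234 today".match(pattern);
const whole = found[0];   // "555-1234"
const first = found[1];   // "555"
const second = found[2];  // "1234"

// $1 in the replacement string reinserts the first capture.
const masked = "555-1234".replace(pattern, "$1-****");  // "555-****"

// (?:...) groups without capturing, so the four digits become group 1.
const nonCapturing = "555-1234".match(/(?:\d{3})-(\d{4})/);
const onlyGroup = nonCapturing[1];  // "1234"

console.log(whole, first, second, masked, onlyGroup);
```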
Find and replace
Regex replace transforms matched text. (\w+)\s+(\w+) with replacement $2, $1 reorders a first and last name, so "Jane Doe" becomes "Doe, Jane". The find and replace tool supports regex mode. Use the pattern rewriter for batch transformations.
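In JavaScript the same replacement can be written with String.prototype.replace. This is a sketch with sample names; the g and m flags are added to process several lines at once.

```javascript
// Single replacement: $1 is the first word, $2 the second.
const swapped = "Jane Doe".replace(/(\w+)\s+(\w+)/, "$2, $1");  // "Doe, Jane"

// g replaces every match; m lets ^ and $ anchor to each line.
const roster = "Jane Doe\nJohn Smith";
const swappedAll = roster.replace(/^(\w+)\s+(\w+)$/gm, "$2, $1");
// "Doe, Jane\nSmith, John"

// A function replacement receives the captures as arguments,
// which allows transformations a plain string cannot express.
const shouted = "Jane Doe".replace(
  /(\w+)\s+(\w+)/,
  (_match, firstName, lastName) => `${lastName.toUpperCase()}, ${firstName}`
);  // "DOE, Jane"

console.log(swapped, swappedAll, shouted);
```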
Common patterns
Email (simplified): [\w.-]+@[\w.-]+\.\w+. URL: https?://[^\s]+. Phone: \d{3}[-.]?\d{3}[-.]?\d{4}. The regex generator can help build patterns from examples.
| Pattern | Matches |
|---|---|
| \d+ | One or more digits |
| \w+ | One or more word characters (letters, digits, underscore) |
| [A-Za-z]+ | One or more ASCII letters |
| ^.+$ | Entire non-empty line (each line in multiline mode) |
| \s+ | One or more whitespace characters (spaces, tabs, line breaks) |
| \bword\b | Whole word "word" |
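To see the simplified email, URL, and phone patterns at work, the following JavaScript sketch runs them over a small invented text with the g flag. Note that inside a JavaScript regex literal the forward slashes of the URL pattern must be escaped. These patterns are deliberately loose and will accept some invalid values.

```javascript
// Sample text made up for this example.
const sample = [
  "Contact ana@example.com or bob.smith@mail.example.org.",
  "Docs: https://example.com/help",
  "Phone: 555-123-4567 or 555.987.6543",
].join("\n");

// Simplified email pattern from the text above.
const emails = sample.match(/[\w.-]+@[\w.-]+\.\w+/g);
// ["ana@example.com", "bob.smith@mail.example.org"]

// URL pattern; "/" is written as "\/" inside a regex literal.
const urls = sample.match(/https?:\/\/[^\s]+/g);
// ["https://example.com/help"]

// Phone pattern with optional "-" or "." separators.
const phones = sample.match(/\d{3}[-.]?\d{3}[-.]?\d{4}/g);
// ["555-123-4567", "555.987.6543"]

console.log(emails, urls, phones);
```

The URL pattern stops only at whitespace, so a URL followed directly by a comma or period would include that punctuation in the match.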
For extracting structured data, combine regex with our extract quoted text or extract hashtags and mentions tools. For pattern-based rewrites, the pattern rewriter applies transformations across matches.
Use cases
Validation: check if input matches a format (email, phone, postal code). Search and replace: swap "last, first" to "first last", strip HTML tags, normalize whitespace. Extraction: pull emails, URLs, or dates from logs. Parsing: split CSV lines, parse log formats. The mark words based on rules tool uses pattern matching for conditional highlighting. For batch text processing, see if-then text rules.
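Each of these use cases comes down to a line or two of code. The JavaScript sketch below is illustrative only: the postal code format (five digits with an optional four-digit suffix) and the log line layout are assumptions chosen for the example, and the tag-stripping pattern is a naive one that is not a substitute for a real HTML parser.

```javascript
// Validation: assumed format of 5 digits, optionally followed by -1234.
const isPostalCode = (s) => /^\d{5}(-\d{4})?$/.test(s);

// Search and replace: "last, first" becomes "first last".
const renamed = "Doe, Jane".replace(/^(\w+),\s*(\w+)$/, "$2 $1");  // "Jane Doe"

// Strip HTML tags (naive: removes anything between < and >).
const stripped = "<p>Hello <b>world</b></p>".replace(/<[^>]+>/g, "");  // "Hello world"

// Normalize whitespace: collapse runs to one space, then trim the ends.
const normalized = "  too   many\t spaces \n".replace(/\s+/g, " ").trim();
// "too many spaces"

// Parsing: assumed log layout "DATE LEVEL MESSAGE".
const logLine = "2025-03-14 ERROR Disk full";
const [, date, level, message] = logLine.match(/^(\d{4}-\d{2}-\d{2})\s+(\w+)\s+(.*)$/);

console.log(isPostalCode("90210"), renamed, stripped, normalized, date, level, message);
```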
Common mistakes
- Forgetting to escape. . matches any character; \. matches a literal dot. In URLs and file paths, escape dots and slashes as needed.
- Greedy vs lazy. .* matches as much as possible. Use .*? for minimal matches when extracting content between delimiters.
- Anchors in the wrong place. ^cat matches "cat" at the start. cat^ is rarely what you want: in JavaScript it is accepted but can never match, while some engines (such as POSIX basic syntax) treat the trailing ^ as a literal character.
- Character class confusion. [^abc] means "not a, b, or c". The ^ inside [] negates the class. Outside [], ^ is the start anchor.
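The two roles of ^ are easy to mix up, so here is a short JavaScript sketch of the anchor and character class cases. The behavior shown for a trailing ^ is specific to JavaScript; other engines may differ.

```javascript
// ^ at the start of a pattern is the start anchor.
const anchoredHit = /^cat/.test("catalog");   // true
const anchoredMiss = /^cat/.test("bobcat");   // false: "cat" is not at the start

// ^ after "cat" is still an anchor in JavaScript, so the pattern
// asks for the start of input after three characters, which cannot happen.
const misplaced = /cat^/.test("cat^");        // false

// To match a literal caret, escape it.
const literalCaret = /cat\^/.test("cat^");    // true

// ^ as the first character inside [] negates the class instead.
const hasNonA = /[^a]/.test("banana");        // true: "b" is not "a"
const onlyAs = /[^a]/.test("aaa");            // false: nothing but "a"

console.log(anchoredHit, anchoredMiss, misplaced, literalCaret, hasNonA, onlyAs);
```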
Tools
Regex Tester – test patterns. Regex Generator – build from examples. Regex Replace Toolkit – find and replace. Find and Replace – general search/replace with regex. Developer Tools – full catalog.