Every developer has a regex moment. You need to validate an email address, extract a date from a log file, or strip HTML tags from a string — and someone suggests a regular expression. You find one on Stack Overflow, it works, you move on. But you have no idea what the pattern actually means, and you're too nervous to touch it.
This is a shame, because regular expressions are one of the most powerful and portable tools in any developer's toolkit. The same concepts work in JavaScript, Python, Java, Go, Ruby, and most text editors. Once you understand the building blocks, you can read and write them confidently — and solve in one line what would otherwise take twenty.
What Are Regular Expressions?
A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. That pattern can be applied to a string to test whether the string matches, to find substrings that match, or to replace matching substrings with something else. Regex is supported natively in virtually every programming language and most code editors, with minor syntax differences between flavours (more on that later).
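As a minimal sketch of those three operations (test, find, replace) in JavaScript; the sample log line and variable names are invented for illustration:

```javascript
// Hypothetical sample input, invented for this illustration.
const line = "2024-03-15 ERROR disk full; 2024-03-16 INFO rebooted";
const datePattern = /\d{4}-\d{2}-\d{2}/g;

// 1. Test: does the string contain a match at all?
// A separate non-global regex avoids the lastIndex state that /g carries between test() calls.
const hasDate = /\d{4}-\d{2}-\d{2}/.test(line);

// 2. Find: collect every matching substring.
const dates = line.match(datePattern);

// 3. Replace: substitute each match with something else.
const redacted = line.replace(datePattern, "<date>");

console.log(hasDate);  // true
console.log(dates);    // ["2024-03-15", "2024-03-16"]
console.log(redacted); // "<date> ERROR disk full; <date> INFO rebooted"
```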
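A regex engine compiles the pattern into an internal state machine that scans your input string, making transitions based on each character. Most engines (JavaScript, Python, Java, PCRE) run that machine with backtracking: when one path fails, they rewind and try another. A few, such as Go's RE2, simulate a true finite automaton and never backtrack. Understanding this model is not required to use regex effectively, but it helps explain why some patterns are fast and others catastrophically slow.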
Character Classes Explained
Character classes are the basic building block of regex. They define a set of characters, and the pattern matches any single character from that set. Square brackets create a custom class; shorthand escapes provide commonly used classes:
- [abc] matches exactly one of a, b, or c
- [a-z] is a range: any lowercase ASCII letter
- [A-Za-z0-9] matches any alphanumeric character
- [^abc] is a negated class: any character except a, b, or c
- \d matches any digit; equivalent to [0-9]
- \D matches any non-digit; equivalent to [^0-9]
- \w matches any word character (letters, digits, and underscore); equivalent to [A-Za-z0-9_]
- \W matches any non-word character
- \s matches any whitespace: space, tab, newline, carriage return, form feed
- \S matches any non-whitespace character
- . matches any character except newline (use the s flag to include newlines)
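Note that whether \d, \w, and \s are ASCII-only depends on the engine. In JavaScript, \d and \w stay ASCII-only even with the u flag (\s already matches Unicode whitespace); to match letters in any script, use Unicode property escapes such as \p{L}, which require the u flag. In Python 3, these classes are Unicode-aware by default for str patterns, and re.ASCII restricts them to ASCII. Java is ASCII-only unless you pass Pattern.UNICODE_CHARACTER_CLASS. This matters if you're validating text in non-Latin scripts.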
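A short JavaScript sketch of these classes in action, including the ASCII-only behaviour of \w. The sample strings are made up for illustration:

```javascript
// Custom class, shorthand classes, and a negated shorthand.
const spellings = "gray grey".match(/gr[ae]y/g);
const numbers = "a1b22c333".match(/\d+/g);
const compact = "key = value".replace(/\s+/g, "");
const nonWord = "x-y_z".match(/\W/g); // underscore counts as a word character

console.log(spellings); // ["gray", "grey"]
console.log(numbers);   // ["1", "22", "333"]
console.log(compact);   // "key=value"
console.log(nonWord);   // ["-"]

// \w is ASCII-only in JavaScript, even with the u flag.
const asciiWord = /^\w+$/u.test("Москва");
// Unicode property escapes (ES2018+, u flag required) match letters in any script.
const unicodeLetters = /^\p{L}+$/u.test("Москва");

console.log(asciiWord);      // false
console.log(unicodeLetters); // true
```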
Quantifiers: Controlling Repetition
Quantifiers attach to the preceding element (a character, class, or group) and specify how many times it must appear for the pattern to match:
- * matches zero or more times (greedy)
- + matches one or more times (greedy)
- ? matches zero or one time; makes the preceding element optional
- {3} matches exactly 3 times
- {2,5} matches between 2 and 5 times (inclusive)
- {3,} matches 3 or more times
By default, quantifiers are greedy — they match as many characters as possible while still allowing the overall pattern to match. Adding a ? after a quantifier makes it lazy (reluctant): it matches as few characters as possible. Compare <.*> (greedy, matches from the first to the last angle bracket on a line) with <.*?> (lazy, matches each tag individually). This distinction is critical when parsing HTML-like or delimited content.
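The greedy and lazy behaviour can be seen side by side in a small JavaScript sketch. The sample markup is invented, and the negated-class variant is a common alternative rather than something the text above prescribes:

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .* runs to the end of the line, then backs off to the last ">".
const greedy = html.match(/<.*>/g);

// Lazy: .*? stops at the first ">" it can reach.
const lazy = html.match(/<.*?>/g);

// Alternative: a negated class cannot run past ">" in the first place.
const negated = html.match(/<[^>]*>/g);

console.log(greedy);  // ["<b>bold</b> and <i>italic</i>"]
console.log(lazy);    // ["<b>", "</b>", "<i>", "</i>"]
console.log(negated); // ["<b>", "</b>", "<i>", "</i>"]
```

The negated-class form often reads more clearly than a lazy quantifier because the stopping condition is part of the class itself.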
Anchors: Matching Positions, Not Characters
Anchors are zero-width assertions — they match a position in the string rather than a character. This makes them essential for validation patterns where you need to match the entire input, not just a substring within it.
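- ^ matches the start of the string (or the start of a line in multiline mode)
- $ matches the end of the string (or the end of a line in multiline mode)
- \b matches a word boundary: the position between a word character and a non-word character, or between a word character and the start or end of the string
- \B matches any position that is not a word boundary, such as inside a word
- \A matches the absolute start of the string (Python, Java, Ruby; not JS)
- \Z matches the end of the string (Python, Java, Ruby; not JS); in Java and Ruby it also matches just before a final line terminator, and \z is the strict end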
The most common anchoring mistake: writing a validation pattern without ^ and $. The pattern \d{10} matches any 10 consecutive digits — it will happily match inside the string "abc12345678901xyz". The correct validation pattern is ^\d{10}$. Always anchor your validation patterns.
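A quick JavaScript check of that mistake, along with \b and multiline mode. The inputs are illustrative only:

```javascript
const unanchored = /\d{10}/;
const anchored = /^\d{10}$/;

const looseHit = unanchored.test("abc12345678901xyz"); // finds 10 digits inside the string
const strictMiss = anchored.test("abc12345678901xyz");
const strictHit = anchored.test("9876543210");
const tooLong = anchored.test("98765432101"); // 11 digits

console.log(looseHit, strictMiss, strictHit, tooLong); // true false true false

// \b keeps "cat" from matching inside "concat".
const words = "cat concat cat.".match(/\bcat\b/g);
console.log(words); // ["cat", "cat"]

// Without the m flag, ^ and $ only match at the ends of the whole string.
const wholeString = /^\w+$/.test("one\ntwo");
const perLine = "one\ntwo".match(/^\w+$/gm);
console.log(wholeString); // false
console.log(perLine);     // ["one", "two"]
```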
Capture Groups vs Non-Capturing Groups
Parentheses serve two purposes in regex: grouping (applying quantifiers to a sub-expression or using alternation) and capturing (saving the matched text for later use). Understanding the difference lets you write patterns that extract exactly the data you need.
// Capturing group: saves the matched text
const match = "2024-03-15".match(/^(\d{4})-(\d{2})-(\d{2})$/);
// match[1] = "2024", match[2] = "03", match[3] = "15"
// Non-capturing group: groups without saving
const match2 = "colour color".match(/col(?:ou|o)r/g);
// match2 = ["colour", "color"]; use (?:...) when you only need grouping, not extraction
// Named capturing group: self-documenting
const match3 = "2024-03-15".match(/^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/);
// match3.groups.year = "2024", match3.groups.month = "03"
Named capture groups (supported in JavaScript ES2018+, Python, Java, .NET, and others) are worth using whenever you extract more than one piece of data. They make the pattern self-documenting and protect against positional index bugs when you add or reorder groups.
Non-capturing groups (?:...) are purely structural — they allow you to apply quantifiers or alternation to a sub-pattern without polluting your capture group indices. Using non-capturing groups by default and capturing groups only when you need the data is good regex hygiene.
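Named groups are also available in replacement strings and when iterating over several matches. A small sketch, assuming an ES2020+ runtime for matchAll; the sample sentence is invented:

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
const text = "Released 2024-03-15, patched 2024-04-02.";

// $<name> in the replacement string refers to a named group.
const reordered = text.replace(isoDate, "$<day>/$<month>/$<year>");
console.log(reordered); // "Released 15/03/2024, patched 02/04/2024."

// matchAll yields one match object per hit, each with its own groups.
const months = [...text.matchAll(isoDate)].map((m) => m.groups.month);
console.log(months); // ["03", "04"]
```

Because the replacement refers to groups by name, adding another group to the pattern later does not silently shift the output.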
Lookahead and Lookbehind Assertions
Lookahead and lookbehind are zero-width assertions that match a position in the string based on what comes before or after, without consuming characters. This makes them powerful for patterns like "match this word only if it is followed by that word" without including the following word in the match.
- (?=...) is a positive lookahead: matches if the pattern ahead matches
- (?!...) is a negative lookahead: matches if the pattern ahead does not match
- (?<=...) is a positive lookbehind: matches if the pattern behind matches
- (?<!...) is a negative lookbehind: matches if the pattern behind does not match
// Match a number only if followed by "px"
/\d+(?=px)/
// Match a word only if NOT preceded by "no "
/(?<!no )\bword\b/
// Password validation: at least 8 characters, including at least one digit
/^(?=.*\d).{8,}$/
Lookahead is supported by every backtracking engine discussed here (JavaScript, Python, Java, .NET, PCRE). Lookbehind is supported in modern JavaScript (ES2018+), Python, Java, and .NET, but not in older JavaScript engines. Go's RE2-based engine supports neither. If you rely on lookbehind, test it against your minimum supported runtime.
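To see that lookarounds check context without consuming it, here is a small JavaScript sketch. It assumes an ES2018+ runtime for lookbehind, and the sample strings are invented:

```javascript
// Positive lookahead: the number is matched, "px" is only checked.
const pixels = "width: 100px; height: 50%; margin: 8px".match(/\d+(?=px)/g);
console.log(pixels); // ["100", "8"]

// Positive and negative lookbehind on the same input.
const order = "$30 and 40 items";
const prices = order.match(/(?<=\$)\d+/g);
const counts = order.match(/(?<!\$)\b\d+\b/g);
console.log(prices); // ["30"]
console.log(counts); // ["40"]

// The lookahead in the password pattern asserts "a digit exists somewhere"
// and then .{8,} matches the whole input from the start.
const password = /^(?=.*\d).{8,}$/;
const ok = password.test("password1");
const noDigit = password.test("password");
const tooShort = password.test("pass1");
console.log(ok, noDigit, tooShort); // true false false
```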
Common Real-World Regex Patterns
These patterns cover the most frequent validation use cases. Each is annotated with the tradeoffs involved — the perfect email regex does not exist, but a pragmatic one does.
- Email (pragmatic): ^[\w.+\-]+@(?:[\w\-]+\.)+[a-zA-Z]{2,}$ (catches the vast majority of invalid inputs and accepts subdomains such as mail.example.com; not RFC 5322 compliant, and nothing useful is)
- Indian mobile number: ^(?:\+91|0)?[6-9]\d{9}$ (handles an optional +91 or 0 prefix and validates that the number starts with 6-9)
- GST number: ^\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]$
- URL (basic): ^https?:\/\/[\w\-]+(\.[\w\-]+)+([\w\-._~:/?#\[\]@!$&'()*+,;=%]*)?$
- IPv4 address (strict): ^(25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)(\.(25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)){3}$ (validates that each octet is 0-255; the simple (\d{1,3}\.){3}\d{1,3} allows 999.999.999.999)
- Date (YYYY-MM-DD): ^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$ (checks the shape only; it still accepts impossible dates such as 2024-02-31)
- Hex colour: ^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$
- URL-safe slug: ^[a-z0-9]+(?:-[a-z0-9]+)*$
- Strong password: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$ (requires lowercase, uppercase, digit, and special character)
- PIN code (India): ^[1-9][0-9]{5}$
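A regex can check the shape of a value but not always its meaning. As one possible sketch, the date pattern above can be paired with a calendar check. The helper name isRealDate is hypothetical, and the pattern is the one from the list with the year wrapped in a capture group:

```javascript
// Same shape check as the list above, with the year captured as well.
const DATE_SHAPE = /^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/;

// isRealDate is a hypothetical helper name, not part of any library.
function isRealDate(input) {
  const m = DATE_SHAPE.exec(input);
  if (!m) return false;
  const [year, month, day] = m.slice(1).map(Number);
  // Date.UTC rolls an impossible day over into the next month,
  // so build the date and compare it with what was asked for.
  // Caveat: Date.UTC maps years 0 to 99 onto 1900 to 1999, so this sketch rejects them.
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

console.log(DATE_SHAPE.test("2024-02-31")); // true: the shape is fine
console.log(isRealDate("2024-02-31"));      // false: February has no 31st
console.log(isRealDate("2024-02-29"));      // true: 2024 is a leap year
console.log(isRealDate("2023-02-29"));      // false
```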
Regex Performance Pitfalls
Catastrophic Backtracking
Catastrophic backtracking is a class of performance bug where certain patterns applied to certain inputs cause the regex engine to explore an exponential number of possible matches. The canonical example is ^(a+)+$ applied to a string of a's followed by a character that doesn't match. The engine backtracks through every possible grouping of the a's, on the order of 2^n paths for n characters, so each extra character roughly doubles the running time. Thirty a's already mean about a billion paths; a few dozen more and the match will not finish in any practical amount of time.
The root cause is nested quantifiers applied to overlapping expressions. The patterns (a+)+, (\w+\s?)+, and ([a-zA-Z]+)* are all vulnerable. The fix is to ensure that nested quantified groups cannot match the same characters in multiple ways. Possessive quantifiers (a++) and atomic groups ((?>a+)) prevent backtracking into the group. Java and PCRE support both, .NET supports atomic groups only, and Python supports both from version 3.11; JavaScript supports neither. In JavaScript, the solution is to restructure the pattern.
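A minimal JavaScript sketch of restructuring, under the assumption that the intent of the vulnerable pattern is simply "one or more a's". The lookahead-plus-backreference form is a commonly used way to imitate an atomic group in engines that lack one; it is shown as an option, not as the only fix. The failing input is kept short so the slow pattern still finishes quickly:

```javascript
// Nested quantifier: the a's can be split between the inner and outer + in many ways.
const vulnerable = /^(a+)+$/;
// Restructured: one quantifier, so there is only one way to match each "a".
const restructured = /^a+$/;
// Imitated atomic group: a lookahead is not re-entered once it has succeeded,
// and \1 then consumes exactly what the lookahead captured.
const atomicLike = /^(?:(?=(a+))\1)+$/;

// All three accept and reject the same strings.
const samples = ["a", "aaaa", "", "aab", "baa"];
const agree = samples.every(
  (s) =>
    vulnerable.test(s) === restructured.test(s) &&
    restructured.test(s) === atomicLike.test(s)
);
console.log(agree); // true

// Rough wall-clock timing on a failing input (results vary by machine and engine).
function timeMs(re, input) {
  const start = Date.now();
  re.test(input);
  return Date.now() - start;
}
const failing = "a".repeat(20) + "!";
console.log("vulnerable:", timeMs(vulnerable, failing), "ms");
console.log("restructured:", timeMs(restructured, failing), "ms");
```

Increasing the repeat count by a few characters at a time makes the doubling visible; going far beyond 30 is not recommended.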
Anchoring for Performance
An unanchored pattern must be tried at every position in the input string. For short strings this is irrelevant; for multi-megabyte log files or strings processed in a tight loop, it matters. If your pattern must match at the start of the string, include ^. The engine can then fail immediately on any string that doesn't begin with the expected pattern, rather than scanning through the entire string looking for a match position.
Language-Specific Differences
Most regex syntax is portable across languages, but there are meaningful differences that will catch you when switching contexts:
- JavaScript uses /pattern/flags literal syntax or new RegExp(). Flags: g (global), i (case-insensitive), m (multiline), s (dotAll, makes . match newlines), u (Unicode). Lookbehind requires ES2018+. No possessive quantifiers.
- Python uses the re module. Raw strings (r"pattern") avoid double-escaping backslashes. Supports named groups, lookahead/behind. Possessive quantifiers and atomic groups are in the stdlib from Python 3.11; on older versions the third-party regex module adds them.
- Java uses Pattern.compile() and Matcher. Backslashes in Java string literals must be doubled: \\d in the string becomes \d in the pattern. Supports possessive quantifiers (a++) and atomic groups.
- Go uses the regexp package, which implements RE2 syntax: deliberately no backtracking, which means no lookahead/lookbehind and no backreferences, but guaranteed linear-time matching. Catastrophic backtracking is impossible in Go.
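The backslash-doubling issue described for Java also applies to JavaScript whenever a pattern is built from a string instead of a literal. A short sketch of the difference; the decimal-number pattern is only an example:

```javascript
// A regex literal: backslashes are written once.
const literal = /\d+\.\d+/;

// The same pattern built from a string: each backslash must be doubled,
// because the string literal consumes one level of escaping first.
const fromString = new RegExp("\\d+\\.\\d+");
console.log(literal.source === fromString.source); // true

// Forgetting to double them: "\d" in a string is just "d" and "\." is ".".
const broken = new RegExp("\d+\.\d+");
console.log(broken.source);       // d+.d+
console.log(broken.test("3.14")); // false

// String.raw leaves backslashes untouched, much like Python's r"...".
const fromRaw = new RegExp(String.raw`\d+\.\d+`);
console.log(fromRaw.test("3.14")); // true
```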
Testing and Debugging Regex
Regex patterns should always be tested against a representative set of both valid inputs (that must match) and invalid inputs (that must not match). Edge cases to specifically test include: empty strings, strings that are one character too short or long, strings with Unicode characters, strings with newlines, and inputs designed to trigger backtracking.
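One lightweight way to keep such a set of inputs next to the pattern is a table of cases. The sketch below uses the Indian PIN code pattern from earlier; the case list is illustrative, not exhaustive:

```javascript
const PIN_CODE = /^[1-9][0-9]{5}$/;

// Each case pairs an input with whether it should match.
const cases = [
  ["560001", true],
  ["060001", false],        // leading zero
  ["56000", false],         // one character too short
  ["5600011", false],       // one character too long
  ["", false],              // empty string
  ["560001\n", false],      // trailing newline
  ["５６０００１", false],   // full-width Unicode digits
];

const failures = cases.filter(
  ([input, expected]) => PIN_CODE.test(input) !== expected
);

console.log(failures.length === 0 ? "all cases pass" : failures);
```

The trailing-newline case is worth keeping when porting a pattern: in Python, for example, $ also matches just before a final newline, so the same pattern would accept that input there.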
When a regex produces unexpected results, the fastest debugging approach is a real-time tester where you can see exactly which parts of the input matched, which capture groups were populated, and where the match failed. Trying to reason about complex patterns in your head rarely works — you need visual feedback from the engine itself.
Use Tanvrit's Regex Tester to write your pattern, test it against multiple sample strings simultaneously, and see match positions and capture group values highlighted in real time, all in your browser, with no server involved and no account required.