Regular expressions are a computer science concept where simple patterns
describe the format of text. Pattern matching is the process of applying
these patterns to actual text to look for matches. Most modern regular
expression facilities are more powerful than traditional regular expressions
due to the influence of languages such as Perl, but the short-hand term
regex has stuck and continues to mean "regular expression-like pattern
matching". In Perl 6, though the specific syntax used to describe the
patterns is different from PCRE[5] and POSIX[6], we continue to call them regex.
A common writing error is to duplicate a word by accident. It is hard to
catch such errors by rereading your own text, but Perl can do it for you
using regex:
my $s = 'the quick brown fox jumped over the the lazy dog';
if $s ~~ m/ « (\w+) \W+ $0 » / {
    say "Found '$0' twice in a row";
}
The simplest case of a regex is a constant string. Matching a string against that regex searches for that string:
if 'properly' ~~ m/ perl / {
    say "'properly' contains 'perl'";
}
The construct m/ ... / builds a regex. A regex on the right hand side of
the ~~ smart match operator applies against the string on the left hand
side. By default, whitespace inside the regex is irrelevant for the matching,
so m/ perl /, m/perl/, and m/ p e rl/ all have exactly the same
semantics, although the first form is probably the most readable.
Only word characters (letters, digits, and the underscore) match literally in a regex. All other characters may have a special meaning. If you want to search for a comma, an asterisk, or another non-word character, you must quote or escape it[7]:
my $str = "I'm *very* happy";
# quoting
if $str ~~ m/ '*very*' / { say '\o/' }
# escaping
if $str ~~ m/ \* very \* / { say '\o/' }
Searching for literal strings gets boring pretty quickly. Regexes support
special (also called metasyntactic) characters. The dot (.) matches a
single, arbitrary character:
my @words = <spell superlative openly stuff>;
for @words -> $w {
    if $w ~~ m/ pe.l / {
        say "$w contains $/";
    } else {
        say "no match for $w";
    }
}
This prints:
spell contains pell
superlative contains perl
openly contains penl
no match for stuff
The dot matched an l, r, and n, but it will also match a space in the
sentence "the spectroscope lacks resolution", because regexes ignore word
boundaries by default. The special variable $/ stores, among other things,
the part of the string that matched the regular expression. The value it
holds is a so-called match object.
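As a minimal sketch of what the match object offers beyond the matched text, the following applies the same regex to the sentence mentioned above. The from and to methods report where the match starts and where it ends (to is the position just past the last matched character); the prefix ~ turns the match object into a plain string.

```perl6
my $sentence = 'the spectroscope lacks resolution';
if $sentence ~~ m/ pe.l / {
    # the dot matches the space between "spectroscope" and "lacks"
    say "matched '{~$/}'";                        # matched 'pe l'
    say "from position {$/.from} to {$/.to}";     # from position 14 to 18
}
```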
Suppose you want to solve a crossword puzzle. You have a word list and want to
find words containing pe, then an arbitrary letter, and then an l (but
not a space, as your puzzle has extra markers for those). The appropriate
regex for that is m/pe \w l/. The \w control sequence stands for a
"word" character: a letter, a digit, or an underscore. This chapter's example
uses \w to build the definition of a "word".
Several other common control sequences each match a single character:
Table 9.1. Backslash sequences and their meaning
| Symbol | Description | Examples |
|---|---|---|
| \w | word character | l, ö, 3, _ |
| \d | digit | 0, 1 |
| \s | whitespace | (tab), (blank), (newline) |
| \t | tabulator | (tab) |
| \n | newline | (newline) |
| \h | horizontal whitespace | (space), (tab) |
| \v | vertical whitespace | (newline), (vertical tab) |
Invert the sense of each of these backslash sequences by uppercasing its
letter: \W matches a character that's not a word character and \N
matches a single character that's not a newline.
These matches extend beyond the ASCII range: \d matches Latin,
Arabic-Indic, Devanagari and other digits, \s matches non-breaking
whitespace, and so on. These character classes follow the Unicode
definition of what is a letter, a number, and so on.
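The following sketch illustrates both points with made-up sample strings: an uppercased sequence (\D) matching non-digits, and \d matching a digit outside the ASCII range. The code point U+0663 (ARABIC-INDIC DIGIT THREE) is written with an \x escape so that the example does not depend on the encoding of the source file.

```perl6
# \D is the negation of \d: one or more characters that are not digits
if 'route 66' ~~ m/ \D+ / {
    say "non-digits: '{~$/}'";    # non-digits: 'route '
}

# \d is not limited to 0..9: an Arabic-Indic digit matches as well
if "\x[0663]" ~~ m/ \d / {
    say 'matched a non-ASCII digit';
}
```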
To define your own custom character classes, list the appropriate
characters inside nested angle and square brackets <[ ... ]>:
if $str ~~ / <[aeiou]> / {
    say "'$str' contains a vowel";
}
# negation with a -
if $str ~~ / <-[aeiou]> / {
    say "'$str' contains something that's not a vowel";
}
Rather than listing each character in the character class individually, you
may specify a range of characters by placing the range operator .. between
the beginning and ending characters:
# match a, b, c, d, ..., y, z
if $str ~~ / <[a..z]> / {
    say "'$str' contains a lower case Latin letter";
}
You may add characters to or subtract characters from classes with the +
and - operators:
if $str ~~ / <[a..z]+[0..9]> / {
    say "'$str' contains a letter or number";
}
if $str ~~ / <[a..z]-[aeiou]> / {
    say "'$str' contains a consonant";
}
The negated character class is a special application of this idea.
A quantifier specifies how often something has to occur. A question mark
? makes the preceding unit (be it a letter, a character class, or something
more complicated) optional, meaning it can be present either zero or
one times. m/ho u? se/ matches either house or hose. You can also
write the regex as m/hou?se/ without any spaces, and the ? will still
quantify only the u.
The asterisk * stands for zero or more occurrences, so m/z\w*o/ can
match zo, zoo, zero and so on. The plus + stands for one or more
occurrences, so \w+ usually matches what you might consider a word (though
it matches only the first three characters of isn't, because ' is not a
word character).
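A short sketch of these quantifiers in action, using a made-up word list. The last word contains no o and therefore fails to match; the final test shows \w+ stopping at the apostrophe.

```perl6
for <zo zoo zero zap> -> $w {
    if $w ~~ m/ z \w* o / {
        say "$w matches";
    } else {
        say "$w does not match";
    }
}

# \w+ stops at the apostrophe, which is not a word character
if "isn't" ~~ m/ \w+ / {
    say ~$/;    # isn
}
```

The loop reports a match for zo, zoo, and zero, and no match for zap.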
The most general quantifier is **. When followed by a number, it matches
that many times. When followed by a range, it can match any number of times
that the range allows:
# match a date of the form 2009-10-24:
m/ \d**4 '-' \d\d '-' \d\d /
# match at least three 'a's in a row:
m/ a ** 3..* /
If the right hand side is neither a number nor a range, it becomes a
delimiter, which means that m/ \w ** ', '/ matches a list of characters
each separated by a comma and whitespace.
If a quantifier has several ways to match, Perl will choose the longest one. This is greedy matching. Appending a question mark to a quantifier makes it non-greedy[8].
For example, you can parse HTML very badly[9] with the code:
my $html = '<p>A paragraph</p> <p>And a second one</p>';
if $html ~~ m/ '<p>' .* '</p>' / {
    say 'Matches the complete string!';
}
if $html ~~ m/ '<p>' .*? '</p>' / {
    say 'Matches only <p>A paragraph</p>!';
}
To apply a quantifier to more than just one character or character class, group items with square brackets:
my $ingredients = 'milk, flour, eggs and sugar';
# prints "milk, flour, eggs"
$ingredients ~~ m/ [\w+] ** [\,\s*] / && say $/;
Separate alternatives (parts of a regex of which any one can match) with vertical bars. One vertical bar between multiple parts of a regex means that the alternatives are tried in parallel and the longest matching alternative wins. Two bars make the regex engine try each alternative in order, and the first matching alternative wins.
$string ~~ m/ \d**4 '-' \d\d '-' \d\d | 'today' | 'yesterday' /
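The difference between one bar and two shows up only when more than one alternative can match at the same position. The following sketch uses a made-up string for which both alternatives match, so the two operators pick different results.

```perl6
my $string = 'foobar';

# one bar: the longest matching alternative wins
if $string ~~ m/ foo | foobar / {
    say ~$/;    # foobar
}

# two bars: alternatives are tried in order, the first match wins
if $string ~~ m/ foo || foobar / {
    say ~$/;    # foo
}
```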
[5] Perl Compatible Regular Expressions
[6] Portable Operating System Interface for Unix. See IEEE standard 1003.1-2001
[7] To search for a literal string without using the pattern matching
features of regex, consider using index or rindex instead.
[8] The non-greedy general quantifier is $thing **? $count, so the
question mark goes directly after the second asterisk.
[9] Using a proper stateful parser is always more accurate.