Commit 0427871f authored by root

Update for 19/12

parent 99118f0a
#!/bin/bash
#SBATCH --partition=physical
#SBATCH --constraint== physg4
#SBATCH --constraint=physg4
#SBATCH --ntasks=72
# Load modules, show commands etc
echo $(hostname ) $SLURM_JOB_NAME running $SLURM_JOBID >> hostname.txt
......@@ -5,7 +5,6 @@
#SBATCH --nodes=1
#SBATCH --ntasks=4
module load ORCA/3_0_3-linux_x86-64-OpenMPI-1.6.5
source ~/.bashrc
module load ORCA/4_1_0-linux_x86-64-OpenMPI-3.1.3
$EBROOTORCA/orca orca.in 1> orcaNEW3.out
......@@ -67,15 +67,19 @@ and compare to
sed -E 's/QQLQ?//' gattaca.txt
Backreferences
=============
==============
Regular expressions can also backreference (which is technically beyond being a regular language), that is match a previous sub-expression with the values \1, \2, etc. The following is a useful example where one can search for any word (case-insensitive, line-numbered) and see if that word is repeated, catching common typing errors such as 'The the'.
Regular expressions can also backreference (which is technically beyond being a regular language), that is match a previous sub-expression with the values \1, \2, etc. The following is a useful example where one can search for any word (case-insensitive, line-numbered) and see if that word is repeated, catching common typing errors such as 'The the'. Note the multiple ways of doing this.
grep -inw '\([a-z]\+\) \+\1' files
grep -inw '\([a-z]\+\) \+\1' file
grep -Einw '([a-z]+) +\1' files
grep -Einw '([a-z]+) +\1' file
grep -Ein '\<([a-z]+) +\1\>' files
grep -Ein '\<([a-z]+) +\1\>' file
Another similar example, matching repeated characters rather than words (hence no -w).
grep '\([A-Z]\)\1' gattaca.txt
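For comparison, the same backreference idea can be sketched in Python's `re` module. This is only an illustration: the sample strings are invented and are not part of the course files.

```python
import re

# Hypothetical sample text containing a repeated word.
text = "The the cat sat"

# \b(\w+)\s+\1\b : capture a word, then require the same word again.
# re.IGNORECASE plays the role of grep's -i option.
repeated_word = re.search(r"\b(\w+)\s+\1\b", text, re.IGNORECASE)
print(repeated_word.group(1))

# ([A-Z])\1 : an upper-case letter immediately followed by itself.
doubled = re.findall(r"([A-Z])\1", "GATTACA")
print(doubled)
```

As with grep, `\1` refers to whatever the first parenthesised group actually matched, not to the pattern inside it.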
An example to append the string "EXTRA TEXT" to each line.
......@@ -93,4 +97,3 @@ For example, if an escape character precedes a character that is not a regular e
For example:
`awk '/QL\QA$/' gattaca.txt` will be treated like `awk '/QLQA$/' gattaca.txt` in most contemporary versions of awk. Some however will treat `awk '/QL\QA$/' gattaca.txt` and `awk '/QL\\QA$/' gattaca.txt` as equivalent.
// Example derived from https://www.tutorialspoint.com
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    // \b on either side restricts the match to the whole word "rat".
    private static final String REGEX = "\\brat\\b";
    private static final String INPUT = "rat rat rat rattie rat";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // get a matcher object
        int count = 0;
        while (m.find()) {
            count++;
            System.out.println("Match number " + count);
            System.out.println("start(): " + m.start());
            System.out.println("end(): " + m.end());
        }
    }
}
......@@ -8,7 +8,7 @@ Variables
Variable assignment in awk is done with the -v option.
`awk -v filename="test-high-gc-1.fasq" -v linecount="4000" 'BEGIN{print filename, linecount}'`
`awk -v filename="test-high-gc-1.fastq" -v linecount="4000" 'BEGIN{print filename, linecount}'`
The shell function `eval` can be used for output e.g.,
......@@ -17,6 +17,9 @@ Integrating shell commands with awk for output.
`eval $(echo 2 3 | awk '{ result=$1+$2; printf "result='\'%s\''\n", result }')`
`echo "$result"`
`eval $(wc -l ./test-high-gc-1.fastq | awk '{ linecount=$1; printf "linecount='\'%s\''\n", linecount }')`
`echo "$linecount"`
Any global variables and their values being used by awk can be accessed by the --dump-variables command, with a default filename of awkvars.out.
`awk --dump-variables ''`
......@@ -60,7 +63,6 @@ Logical "or" is represented by `||`
Logical "not" is represented by `!`
Arithmetic Functions
===================-
......@@ -99,6 +101,3 @@ Conditional Statements
The awk language has a well-known control statement, if-then-else. If the condition is true, then-body is executed; otherwise, else-body is executed.
awk -v oddball=7 'BEGIN { if (oddball % 2 == 0) print "oddball is even"; else print "oddball is odd" }'
......@@ -6,10 +6,13 @@ It is good practise to enclose the regular expression in single quotes, to preve
e.g., `grep 'ATEK' gattaca.txt` rather than `grep ATEK gattaca.txt`
Grep with Metacharacters
------------------------
========================
Some examples of grep options with metacharacters.
* Print only the matched parts of a matching line, with each match on a separate output line.
grep -o '"[^"]\+"' filename
* Count the number of empty lines in a file, '-c' is count.
grep -c '^$' filename
......@@ -20,7 +23,7 @@ grep -v '[aeiou]' /usr/share/dict/words
grep -iwn '^QANTAS' /usr/share/dict/*
Faster Grep
-----------
===========
* Localisation settings might slow down your grep search. Prefixing the search with LC_ALL forces applications to use the same language (US English). This speeds up grep in many cases because it is a single-byte comparison rather than multi-byte (as is the case with UTF-8). It is not appropriate if you are searching files with non-ASCII characters!
......
==============
Metacharacters
==============
Introduction
------------
============
Metacharacters are heavily used in regular expressions.
......@@ -10,14 +14,14 @@ Note that metacharacters will *vary* according to the context they are used. The
Like other *nix-like systems, this environment is case-sensitive. Metacharacters that are used outside of their context, even within a regular expression, will be treated as literal characters.
Basic Examples
--------------
=============
Note that the examples use strong quotes for the search term to prevent the possibility of inadvertent shell expansion. If shell expansion, variable substitution, etc. is desired then use weak quotes.
| Metacharacter | Explanation | Example |
|:--------------|:--------------------------------------|-----------------------------------------------|
| ^ | Beginning of line anchor | `grep '^row' /usr/share/dict/words` |
| $ | End of line anchor | `grep 'row$' /usr/share/dict/words` |
| ^ | Beginning of string anchor | `grep '^row' /usr/share/dict/words` |
| $ | End of string anchor | `grep 'row$' /usr/share/dict/words` |
| . | Any single character | `grep '^...row...$' /usr/share/dict/words` |
| * | Match zero plus preceding characters | `grep '^...row.*' /usr/share/dict/words` |
| [ ] | Matches one in the set | `grep '^[Pp].row..$' /usr/share/dict/words` |
......@@ -29,8 +33,17 @@ Note that the examples use strong quotes for search term to prevent the possibil
| \< \> | Beginning and end word boundaries | `grep '\<cat\>' /usr/share/dict/words` |
Remember that the `.` is _lazy_ (it matches _any_ character, including those you might not want or need) and the `*` is _greedy_ (it matches 0 or more of the preceding character, as many as it can).
Try the following:
grep '".*"' problem.txt
grep -o '"[^"]\+"' problem.txt
The second example (grep with `-o` to print the matching part only, a negated set that matches any character except a quote, and `\+` to match at least once) is more complex, but gives the right answer.
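The difference between the two patterns can also be sketched in Python. The sample line below is invented for illustration and stands in for a line of `problem.txt`, whose actual contents are not shown here.

```python
import re

# Hypothetical line containing two quoted strings.
line = 'say "string one" and "string two" now'

# Greedy: .* runs from the first quote to the LAST quote on the line.
greedy = re.findall(r'".*"', line)

# Negated set: [^"]+ cannot cross a quote, so each string is matched separately.
bounded = re.findall(r'"[^"]+"', line)

print(greedy)
print(bounded)
```

The greedy pattern returns one long match spanning both strings, while the negated set returns each quoted string on its own.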
Range Statements
----------------
================
A range statement, e.g., [[:alpha:]], is the equivalent of the range [A-Za-z]; [[:punct:]] is the equivalent of '][!"#$%&'()*+,./:;<=>?@\^_`{|}~-].
......
......@@ -58,10 +58,10 @@ Adding the `w` option will print out the selection line to a new file.
sed 's/ELM/LUV/gw selection.txt' gattaca.txt
Quoting & Variables
===================
Quoting
=======
Generally strong quotes are recommended for regular expressions. However, sed often wants to use weak quotes to include (for example) variable substitution.
Generally strong quotes are recommended for regular expressions. However, sed often wants to use weak quotes to include (for example) variable substitution.
Consider for example, a job run which searches for a hypothetical element and comes with the result of Unbihexium. This is the equivalent of:
......@@ -72,6 +72,12 @@ A file where the search term UnknownElement exists could be replaced with Unbihe
`sed "s/UnknownElement/$UnknownElement/g" filename`
Another issue is conducting substitutions with single quotes in the stream, in which case double-quotes are an appropriate tool.
sed "s/ones/one's/" <<< 'ones thing'
Another method is to replace each embedded single quote with the sequence: '\'', i.e., quote backslash quote quote, which closes the string, appends an escaped single quote and reopens the string.
The sed metacharacter `&` can be used as a variable for the selection term. For example, a list of telephone numbers could have the first two digits selected and then surrounded by parentheses for readability.
`sed 's/^[[:digit:]][[:digit:]]/(&) /' phonelist.txt`
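As a rough Python analogue (an illustration only, with invented phone numbers), the whole match that sed calls `&` is available in `re.sub` as `\g<0>`:

```python
import re

# Hypothetical phone numbers; phonelist.txt itself is not reproduced here.
numbers = ["0390001234", "0255509876"]

# \g<0> in the replacement stands for the entire matched text, like & in sed.
formatted = [re.sub(r"^\d\d", r"(\g<0>) ", n) for n in numbers]

for entry in formatted:
    print(entry)
```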
......@@ -114,10 +120,6 @@ One can add new material to a file in such a manner with the insert option:
`sed '1,2 i\foo' file` or `sed '1,2i foo' file`
Select duplicate words in a line and remove.
`sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g' file`
Multiple Commands
=================
......
Talking chamber foxtrot@example.com as shewing an it minutes. Trees fully of blind do. Exquisite favourite at do extensive listening. Improve up
musical welcome he. Gay attended vicinity prepared now diverted. Esteems it ye sending reached lima@example.com as. Longer lively her design settle
tastes advice mrs off who.indigo@example.com kilo@example.com May indulgence difficulty ham can put especially. Bringing remember echo@example.com for
supplied her why was confined. Middleton principle did she procuring extensive believing add. Weather adapted prepare oh is calling. bravo@example.com
Far advanced settling say finished raillery. Offered chiefly farther of my no colonel shyness. hotel@example.com juliet@example.com Inhabit hearing
perhaps on ye do no. It maids decay as there he. Smallest on suitable disposed do although blessing he juvenile in. Society or if excited forbade.
Here name off yet delta@example.com she long sold easy whom. Differed oh cheerful procured pleasure securing suitable in. Hold rich on an he oh fine. vpac.org
Chapter ability shyness alpha@example.com Inquietude simplicity terminated she compliment remarkably few her nay. The weeks are ham mike@example.com. asked
jokes. Neglected perceived shy nay concluded. Not mile draw plan snug charlie@example.com ext all. Houses latter an valley be indeed wished mere
golf@example.com In my. Money doubt oh drawn every or an china
========================
Java Regular Expressions
========================
The java.util.regex Package
===========================
Java uses the java.util.regex package for regular expressions and pattern-matching.
The package consists of three main classes:
* Pattern Class. A Pattern object that is a compiled representation of a regular expression.
* Matcher Class. A Matcher object that interprets the pattern and performs match operations.
* PatternSyntaxException. An exception for noting syntax errors in a regular expression.
Typically Java regex will start with:
```
import java.util.regex.Matcher;
import java.util.regex.Pattern;
```
Metacharacters & Sequences
==========================
Metacharacters
--------------
The following RegEx metacharacters are used in Java. They are very similar to those used in other languages.
^ Matches the beginning of the line.
$ Matches the end of the line.
. Matches any single character except newline. Using the DOTALL flag, (?s), allows it to match the newline as well.
[abc] Matches any one character in a set (e.g., a or b or c)
[ab][cd] Matches one character from each set in turn (e.g., a or b followed by c or d)
[...] Matches any single character in brackets.
[^...] Matches any single character not in brackets.
a|b Matches a or b
Sequences
---------
\A Beginning of the entire string.
\z End of an entire string.
\Z End of an entire string except allowable final line terminator.
re* Matches 0 or more occurrences of the preceding expression.
re+ Matches 1 or more of the previous expression.
re? Matches 0 or 1 occurrences of the previous expression.
re{n} Matches exactly n occurrences of the preceding expression.
re{n,} Matches n or more occurrences of the preceding expression.
re{n,m} Matches at least n and at most m occurrences of the preceding expression.
(re) Groups regular expressions and remembers the matched text.
(?: re) Groups regular expressions without remembering the matched text.
(?> re) Matches the independent pattern without backtracking.
\w Matches word characters.
\W Matches nonword characters.
\s Matches whitespace. Equivalent to [ \t\n\x0B\f\r].
\S Matches nonwhitespace.
\d Matches digits. Equivalent to [0-9].
\D Matches nondigits.
\A Matches the beginning of the string.
\Z Matches the end of the string, not including the newline.
\z Matches the end of the string.
\G Matches the point where the last match finished.
\n Back-reference to capture group number "n".
\b Matches the word boundaries when outside brackets. Matches the backspace (0x08) when inside brackets.
\B Matches the nonword boundaries.
\n, \t, etc. Matches newlines, carriage returns, tabs, etc.
\Q Escape (quote) all characters up to \E.
\E Ends quoting begun with \Q.
Quantifiers
-----------
* Matches zero or more elements.
+ Matches one or more elements. Equivalent to {1,}
? Matches none or one elements. Equivalent to {0,1}
{X} Matches X number of elements. e.g., \d{3}, matches three digits.
{X,Y} Matches between X and Y elements. e.g., \d{1,4} matches at least 1 and up to 4 digits.
*? Non-greedy quantifier. Use of ? after a quantifier means it attempts the smallest match first.
Methods
=======
The start() and end() method
----------------------------
Example;
`javac $(pwd)/RegexMatches.java; java -cp . RegexMatches`
The start() and end() method example uses a word boundary to ensure that an entire word, and not a substring (e.g., rattie), is being counted.
Java Strings have built-in support for regexes through four methods: matches(), split(), replaceFirst() and replaceAll().
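import re

# Read the whole file and search for the first occurrence of the sequence.
# The result is named "match" to avoid shadowing Python's built-in "object".
with open('gattaca.txt', 'r') as f:
    match = re.search(r'ATVK', f.read())

print(match)                       # the Match object itself
print(match.start(), match.end())  # positions where the match starts and ends
print(match.span())                # the same two positions as a tuple
print(match.string)                # the string passed to search()
print(match.group())               # the part of the string that matched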
print "Please enter your firstname and surname (two names only)";
chomp ($name = <STDIN>);
if ($name =~ /^\s*(\S+)\s+(\S+)\s*$/) {
# Ignore preceding white space, capture non-whitespace, have at least one whitespace character, capture second non-white space, ignore trailing whitespace.
print "Hi $1. Your Surname is $2.";
# print non-white space group $1, print non-whitespace group $2
} else {
print "Sorry, I could not work out your name. I am just a little script and I am not very smart.";
}
print "\n";
import re
regexsearch = re.compile('CGCCTGCCCCCTCCGCGCCGGCCTGCCGGTGATAAAGTCG', re.IGNORECASE)
print(regexsearch)
===================================
Perl-Compatible Regular Expressions
===================================
Perl Compatible Regular Expressions (PCRE) is a C library that implements a regular expression engine that attempts to match the capabilities and syntax of Perl.
PCRE's syntax is more powerful and flexible than the POSIX regular expression flavors.
Some Features
=============
1. Extended character classes. Like Perl, PCRE offers extension to POSIX classes so that `\d` is the equivalent to [[:digit:]].
2. Minimal matching. Like Perl, PCRE offers "ungreedy" or minimal matches. A `?` placed after any repetition quantifier indicates that the shortest match should be used.
3. Unicode matching.
4.
Special Characters
==================
Character Meaning
\n newline
\r carriage return
\t tab
\0 null character
\YYY octal character YYY
\xYY hexadecimal character YY
\cY control character Y (e.g., \cC)
Assertions
==========
Character Meaning
^ Start of a string
$ End of a string
\b Word boundary
\B Non-word boundary
(?=..) Positive lookahead
(?!..) Negative lookahead
(?<=..) Positive lookbehind
(?<!..) Negative lookbehind
\Q..\E Remove metacharacter meaning
Quantifiers
===========
Character Meaning
* 0 or more
+ 1 or more
? 0 or 1
{2} exactly 2
{2,5} between 2 and 5
{2,} 2 or more
Classes
=======
PCRE uses character classes and POSIX classes e.g.,
Character Meaning
[ab-d] One character of a,b,c,d
[^ab-d] One character that is not a,b,c,d
\d One digit
\D One non-digit
\s One whitespace
\S One non-whitespace
Differences from Perl
=====================
Perl and PCRE are very close but not quite the same in the way they express and manage regular expressions. They are quite advanced! These are *some* of the differences:
Good news! Since PCRE 7.x and Perl 5.9.x the two projects are working together to maximise compatibility!
1. PCRE uses a subset of Perl's Unicode support and Perl escape sequences \p, \P, and \X are supported only if PCRE2 is built with Unicode support (the default).
2. PCRE uses atomicity for recursive matches and Perl does not. So (?!a){3} in PCRE does not mean match if the next three characters are not a, but rather is that the character is not a three times. A more obscure example (taken from Wikipedia):
"<<!>!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ will match in Perl but not in PCRE.
3. The following escape sequences are supported in Perl, but not in PCRE: \F, \l, \L, \u, \U, and \N when followed by a character name.
4. PCRE supports \Q .. \E for escaping substrings, and characters are treated as literals. In Perl `$` and `@` variable substitution occurs. e.g.,
Pattern PCRE2 matches Perl matches
------------------|-----------------|------------------------------------
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
\QA\B\E A\B A\B
\Q\\E \ \\E
Taken and derived from: pcre2compat man page by Philip Hazel
https://www.pcre.org/current/doc/html/pcre2compat.html
PCRE Tester
===========
Test your code!
`https://regex101.com/`
<?php
$filename = "hidden.txt";
$handle = fopen($filename, "r"); //open file in read mode
$contents = fread($handle, filesize($filename)); //read file
if (preg_match("/\.org/i", $contents)) // Marked as case-insensitive for illustrative purposes.
{
echo "Found an org\n\n";
}
else
{
echo "Did not find an org\n\n";
}
$contents = preg_replace("/\.org/i", ".org.au", $contents);
print $contents;
$contents = preg_split ("/(\.com | \.org)/", $contents);
print "$contents[0]\n";
print "$contents[1]\n";
print "$contents[2]\n";
// etc.
fclose($handle); //close file
?>
# Basics of Perl
Basics of Perl
==============
Simple variables always begin with a dollar sign, and can hold a number or text.
......@@ -6,22 +7,22 @@ Comments begin with # and continue for the rest of the line.
Variables can appear within a double-quoted string.
Perl has loops, conditionals etc.
# Perl special characters
Perl special characters
========================
\t tab character
\n newline character
\r carriage return character
\s matches any “whitespace” character (space, tab, newline, etc)
\S anything not \s
\w [a-zA-Z0-9_] Useful as in \w+ to match a word.
\s matches any one whitespace character (space, tab, newline, etc)
\S matches any one not \s
\w matches any one word character; [a-zA-Z0-9_] Useful as in \w+ to match a word.
\W anything not \w , i.e., [^a-zA-Z0-9_]
\d [0-9] , a digit
\d [0-9] , any one digit
\D anything not \d , i.e., [^0-9]
# Perl Metacharacters
Perl Metacharacters
===================
In Perl regular expressions, the following are metacharacters:
......@@ -42,6 +43,7 @@ In Perl regular expressions, the following are metacharacters:
| [^..] | Matches characters not in the brackets |
| | | Matches either, an "or" statement |
| \b | Matches a word boundary |
| \B | Matches where \b doesn't match |
| \w | Matches alphanumeric and underscore |
| \W | Matches non-alphanumeric |
| \s | Matches whitespace |
......@@ -50,22 +52,34 @@ In Perl regular expressions, the following are metacharacters:
| ^ | Matches beginning of a line or string |
| $ | Matches end of a line or string |
See metaperl.pl
See perl -w metaperl.pl
Word Boundaries
===============
The metacharacter \b is like an anchor, matching a zero-length boundary before the first character in a word, after the last character in a word, or between two characters where one is a word character and the other is not.
When seeking to match whole words, one should use \bword\b ; an old method was /(^|\W)word(\W|$)/, which is roughly equivalent.
The position of the \b can alter the search term.
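The effect of moving the `\b` can be sketched in Python, which uses the same metacharacter. The sample text is invented for illustration and is not taken from `radio.pl`.

```python
import re

# Hypothetical text with "radio" as a whole word, a prefix, and a suffix.
text = "radio radiology preradio radio-active"

whole = re.findall(r"\bradio\b", text)   # boundary on both sides
prefix = re.findall(r"\bradio", text)    # boundary only at the start
suffix = re.findall(r"radio\b", text)    # boundary only at the end

print(len(whole), len(prefix), len(suffix))
```

Note that the hyphen in "radio-active" is a non-word character, so a boundary exists between "radio" and "-".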
# Perl Matching RegEx
See perl -w radio.pl
Perl Matching RegEx
===================
A variable, for example '$searchterm' is linked, '=~', to a pattern match, for example 'm/^[0-9]+$/'.
The "/i" modifier, appended after the closing delimiter of a match operator, makes the test case-insensitive.
After a successful match, Perl provides the variables $1 , $2 , $3 , etc., which hold the text matched by their respective ( ) parenthesized subexpr essions in the regex. Sub-expressions can be nested.
After a successful match, Perl provides the variables $1 , $2 , $3 , etc., which hold the text matched by their respective ( ) parenthesized subexpressions in the regex. Sub-expressions can be nested.
The non-capturing parentheses "(?: )" will group without capturing. For example '(?:ab)+' will repeat the match "ab" without creating any separate sub-expressions.
Review the code tempconv.pl for an example of the above.
Review the code in names.pl and tempconv.pl for an example of the above.
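Capturing and non-capturing groups behave the same way in Python's `re` module, so the idea can be sketched there too. This is an illustration with invented input, not a translation of names.pl or tempconv.pl.

```python
import re

# Two capturing groups, equivalent to Perl's $1 and $2.
name = re.match(r"^\s*(\S+)\s+(\S+)\s*$", "  Ada Lovelace ")
first, surname = name.group(1), name.group(2)
print(first, surname)

# (?:ab)+ repeats "ab" without creating a group, so (c) is group 1.
grouped = re.match(r"(?:ab)+(c)", "ababc")
print(grouped.groups())
```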
# Repetition
Repetition
==========
Any expression (a single character, a marked sub-expression, or a character class) can be repeated with the *, +, ?, and {} operators.
......@@ -85,7 +99,8 @@ Expressions can also be repeated:
Don't use 'a(*)'!
# Perl Substitution RegEx
Perl Substitution RegEx
=======================
A variable, for example '$searchterm' is linked, '=~', to a substitution value, for example 's/regex/replacement/'.
......@@ -97,13 +112,12 @@ Perl has the capability to "lookaround" for patterns in a regex and insert repla
Consider also the population of Australia, 25323300 (2016 estimate). If one wanted to present this as comma-separated in groups of three (25,323,300), the following could be used: '$pop =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g;'
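The same lookaround expression also works in Python's `re` module, which makes it easy to check. This sketch assumes Python 3.7 or later, where substitutions on zero-width matches behave as shown.

```python
import re

pop = "25323300"

# Insert a comma at every position that has a digit to the left and a
# multiple of three digits (and then no further digit) to the right.
formatted = re.sub(r"(?<=\d)(?=(\d\d\d)+(?!\d))", ",", pop)
print(formatted)
```

Nothing is consumed by the pattern: both the lookbehind and the lookahead are zero-width, so the comma is inserted between digits rather than replacing any of them.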
# Lookahead and lookbehind summary table:
Lookahead and lookbehind summary table
======================================
| Type | Regex | Success conditions |
|-----------------------|---------------|---------------------------------------|
| Positive Behind | (?<=..) | if regex matches to the left |
| Negative Behind | (?<!..) | if regex does not match to the left |
| Postivfe Ahead | (?=..) | if regex matches to the right |
| Positive Ahead | (?=..) | if regex matches to the right |
| Negative Ahead | (?!..) | if regex does not match to the right |
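The four lookaround types in the table can be tried out with a short Python sketch. The sample text and the patterns are invented for illustration only.

```python
import re

# Hypothetical text with three numbers in different contexts.
text = "price: $100, cost: 200USD, id 300"

# Positive behind: digits with a dollar sign to the left.
pos_behind = re.findall(r"(?<=\$)\d+", text)

# Negative behind: digits with neither a dollar sign nor a digit to the left.
neg_behind = re.findall(r"(?<![\$\d])\d+", text)

# Positive ahead: digits with "USD" to the right.
pos_ahead = re.findall(r"\d+(?=USD)", text)

# Negative ahead: digits with neither a digit nor "U" to the right.
neg_ahead = re.findall(r"\d+(?![\dU])", text)

print(pos_behind, neg_behind, pos_ahead, neg_ahead)
```

The extra `\d` inside the negative lookarounds stops the engine from matching only part of a number.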
===================================
PHP: Hypertext Preprocessor RegExes
===================================
PHP is a general purpose programming language often used in database-driven websites. It has two types of regular expressions: POSIX-extended and Perl-compatible. The Perl-compatible regexes must be enclosed in delimiters, but are more powerful.
Note that the POSIX methods are deprecated from PHP 5.3 and removed from PHP 7.
String Functions
================
PHP has a number of string functions some of which can mimic some of the basic stream editing functions. Note that these are literal and do not have special metacharacters, apart from usual PHP syntax rules.
For example:
str_replace(find,replace,string,count)
Replaces characters with specified characters. Count is optional for the number of replacements.
Example; simple.php
str_ireplace() acts the same but is case-insensitive in the search.
POSIX-type Metacharacters
=========================
Quantifiers
-----------
. Matches any single character
+ Matches any string containing at least one of the preceding expression.
* Matches any string containing zero or more of the preceding expression.
? Matches any string containing zero or one of the preceding expression.
{N} Matches any string containing a sequence of N of the preceding expression.
{2,3} Matches any string containing a sequence of two or three of the preceding expression.
{2, } Matches any string containing a sequence of at least two of the preceding expression.
$ Matches any string with the preceding expression at the end.
^ Matches any string with the following expression at the beginning.
Sets
----
[0-9] Matches any digit 0 through 9.
[a-z] Matches any lower-case character from a to z.
[A-Z] Matches any upper-case character from A to Z.
[a-zA-Z] Matches any character from lowercase a to uppercase Z.
Range Statements
----------------
[[:alpha:]] Matches any alphabetic characters.
[[:digit:]] Matches any digits.
[[:alnum:]] Matches any alphanumeric characters.
[[:space:]] Matches any space.
Set and Quantifier Examples
---------------------------
p.p Matches any string containing p, followed by any character, in turn followed by another p.
^.{5}$ Matches any string containing five characters exactly.
<b>(.*)</b> Matches any string enclosed within <b> and </b>.
POSIX Methods
=============
Search
------
ereg() The ereg() function searches a string specified by string for a string specified by pattern, returning true if the pattern is found, and false otherwise.
Search, Case Insensitive
------------------------
eregi()
The eregi() function searches a string specified by string for a string specified by pattern. The search is not case sensitive.
Replace
-------
ereg_replace()
The ereg_replace() function searches for string specified by pattern and replaces pattern with replacement if found.
Replace, insensitive
--------------------
eregi_replace()
The eregi_replace() function operates exactly like ereg_replace(), except that the search for pattern in string is not case sensitive.
Split
-----
split()
The split() function will divide a string into various elements, the boundaries of each element based on the occurrence of pattern in string.
Split, insensitive
------------------
spliti()
The spliti() function operates exactly in the same manner as its sibling split(), except that it is not case sensitive.
For a simple example, see posix.php
Perl-Compatible Regular Expressions
===================================
You should use these now.
The POSIX-style syntax can be used almost interchangeably with PCRE regular expression functions, including any of the quantifiers listed above.
PCRE Metacharacters and Modifiers
=================================
Metacharacters
--------------
Metacharacter Meaning
. A single character
\s A whitespace character (space, tab, newline)
\S A non-whitespace character
\d Any digit (0-9)
\D Any non-digit
\w Any "word" character (a-z, A-Z, 0-9, _)
\W Any non-word character
Sets
----
[aeiou] Matches a single character in the set
[^aeiou] Matches a single character not in the set
(foo|bar|baz) "Or" statement; match any alternative.
Modifiers
---------
Modifier Description
i Case insensitive match
m If the string has newline or carriage return characters, the ^ and $ operators match against a newline boundary, instead of a string boundary
o Evaluates the expression only once
s Use . to match a newline character
x Use white space in the expression
g Globally find all matches
cg Continue search even after a global match fails
PCRE Methods
============
Match
-----
preg_match()
Searches for a pattern, returning true if pattern exists, and false otherwise.
Match All
---------
preg_match_all()
Matches all occurrences of pattern in string.
Replace
-------
preg_replace()
Searches for a pattern and replaces pattern with replacement if found. Unlike ereg_replace(), regular expressions can be used in the pattern and replacement parameters.
Split
-----
preg_split()
Divides a string into various elements, the boundaries based on the pattern. Unlike split(), regular expressions are accepted as pattern parameters.
Grep
----
preg_grep()
Searches all elements of input_array, returning all elements matching the RegEx.
Quote
-----
preg_quote()
Quote regular expression characters
For a simple example, see pcre.php
<?php
$filename = "hidden.txt";
$handle = fopen($filename, "r"); //open file in read mode
$contents = fread($handle, filesize($filename)); //read file
$searchorg = ereg("(\.)(org)", $contents);
if( $searchorg == true )
{
echo "Found a .org\n\n";
} else {
echo "Could not find a .org\n\n";
}
$contents = ereg_replace("(\.)(org)", ".org.au", $contents);
print $contents;
fclose($handle); //close file
?>
Suppose you want to grep for all content in quotes such as "string one" and "string two". How would you do this?
......@@ -3,7 +3,8 @@ The re Package and Metacharacters
Python's re package can be used for regular expressions, and can be tested with similar metacharacters to those used in POSIX. See `startend.py`
The following is a list of the most common metacharacters used in `re`.
Metacharacters
--------------
Metacharacter | Meaning
:---------------|----------------------------------------------------------:
......@@ -23,32 +24,44 @@ $ End of a string
(...) Match RegEx inside parentheses as a group. Use \ to escape and match literal parentheses, or enclose them in a class.
Special Sequences
-----------------
The following is a list of common special sequences
\d Matches any decimal digit, equivalent to the class [0-9].
\D Matches any non-digit character, equivalent to the class [^0-9].
\s Matches any whitespace character, equivalent to the class [ \t\n\r\f\v].
\S Matches any non-whitespace.
\w Matches any alphanumeric, equivalent to the class [a-zA-Z0-9_].
\W Matches any non-alphanumeric
\d Matches any decimal digit, equivalent to the class [0-9].
\D Matches any non-digit character, equivalent to the class [^0-9].
\s Matches any whitespace character, equivalent to the class [ \t\n\r\f\v].
\S Matches any non-whitespace.
\w Matches any alphanumeric, equivalent to the class [a-zA-Z0-9_].
\W Matches any non-alphanumeric
Pattern Objects
===============
Pattern and Match Objects
=========================
In Python RegExes are compiled into pattern objects with `re.compile` which have various methods. Flags can be added (e.g., IGNORECASE)
In Python RegExes are compiled into pattern objects with `re.compile` which have various methods.
```
import re
regexsearch = re.compile('CGCCTGCCCCCTCCGCGCCGGCCTGCCGGTGATAAAGTCG', re.IGNORECASE)
print(regexsearch)
```
Various compilation flags can be added to re.compile, with the re.IGNORECASE example above. These include:
DOTALL, S The `.` matches any character, including newlines
IGNORECASE, I Case-insensitive matches
LOCALE, L Locale-aware match
MULTILINE, M Multi-line matching; this affects ^ and $
VERBOSE, X Enable verbose RegExes.
UNICODE, U Makes several escapes like \w, \b, \s and \d dependent on the Unicode character database.
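A few of these flags can be seen in action in the following sketch. The sample text and patterns are invented for illustration.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the start of the string.
start_only = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches after each newline.
each_line = re.findall(r"^\w+", text, re.MULTILINE)

# Without DOTALL, . stops at the newline, so this finds nothing.
no_dotall = re.search(r"first.*second", text)
with_dotall = re.search(r"first.*second", text, re.DOTALL)

# VERBOSE allows whitespace and comments inside the pattern.
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
found = phone.search("call 555-0199")

print(start_only, each_line, no_dotall, with_dotall.group(), found.group())
```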
Once a regular expression is compiled into a pattern object functions (see below) can be applied.
Likewise RegEx matches are also expressed as objects which can have functions applied to them. The example `match_objects.py` has print statements to show that there is an object, a .start() and .end() to show what characters are in the match - which is the same as .span(), the string that was passed to the function (gattaca.txt in this case), and the part of the string where there was a match.
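A self-contained sketch of the same idea, using a short invented string instead of reading gattaca.txt, might look like this:

```python
import re

# Hypothetical sequence standing in for the contents of a file.
sequence = "MKTLATVKQLR"

match = re.search(r"ATVK", sequence)

# search() returns None when nothing matches, so test before using the object.
if match:
    print(match.start(), match.end())  # where the match starts and ends
    print(match.span())                # the same positions as a tuple
    print(match.string)                # the string that was searched
    print(match.group())               # the text that matched
```

Checking for `None` first avoids an `AttributeError` when the pattern is not found.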
Backslash Issues in Python
==========================
As is the norm with RegExes, the backslash is used to escape metacharacters, and a double backslash is used for a literal backslash. However, Python uses the same character for the same purpose in string literals.
As is the norm with our RegExes, the backslash is used to escape metacharacters, and a double backslash is used for a literal backslash. However, Python uses the same character for the same purpose in string literals.
Thus, in Python, to have a regular expression that matches, say, \documentclass (used in LaTeX), the backslash has to be escaped for re.compile() and then both backslashes have to be escaped for a string literal - resulting in *four* backslashes.
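The four-backslash form, and the shorter raw-string form that Python also accepts, can be compared directly. This is a small sketch; the LaTeX line is invented for illustration.

```python
import re

# Ordinary string literal: four backslashes are needed in the source code.
plain = "\\\\documentclass"
# Raw string literal: backslashes are not processed by Python itself.
raw = r"\\documentclass"

# Both literals produce exactly the same pattern text.
print(plain == raw)

# Hypothetical line of LaTeX to search.
latex = r"\documentclass{article}"
found = re.search(raw, latex)
print(found.group())
```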
......@@ -60,18 +73,19 @@ A solution to this issue is to express string literals with a 'r', raw notation.
"\\\\documentclass" becomes r"\\documentclass"
"\\w+\\s+\\1" becomes r"\w+\s+\1"
Methods
========
Functions
=========
The package has the following search methods:
The re package has the following functions:
findall() Returns a list containing all matches
search() Returns a Match object if there is a match anywhere in the string
split() Returns a list where the string has been split at each match
sub() Replaces one or many matches with a string
The findall() Function
======================
----------------------
The findall() function will return a list which contains all matches, or an empty list if no match was found. It can be used on a string or on the contents of a file.
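As a minimal sketch on a plain string (the sequence below is invented for illustration rather than read from a file):

```python
import re

# Hypothetical sequence; a real run would read this from a file.
sequence = "CCAAAGTCGTTAAAGTGG"

found = re.findall(r"AAAGT", sequence)
print(found)        # one list entry per match

missing = re.findall(r"TTTTT", sequence)
print(missing)      # an empty list when nothing matches
```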
......@@ -84,7 +98,7 @@ f.close()
```
The search() Function
=====================
---------------------
The search() function searches the string for a match, and returns a Match Object if there is one. If there is more than one, only the first occurrence is returned.
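A small sketch of this behaviour, again on an invented sequence. Because search() returns None when nothing matches, the result is usually tested before it is used:

```python
import re

# Hypothetical sequence containing the pattern twice.
sequence = "CCAAAGTCGTTAAAGTGG"

first = re.search(r"AAAGT", sequence)
if first:
    # Only the first occurrence is reported.
    print(first.group(), first.span())

nothing = re.search(r"TTTTT", sequence)
print(nothing)      # None, so calling nothing.group() would raise an error
```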
......@@ -113,7 +127,7 @@ print(searchseq.group())
```
The split() function
====================
--------------------
The split() function returns a list where the string has been split at each match.
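A minimal sketch with an invented string. When a match sits at the very start of the input, the first element of the result is an empty string; filtering in Python is one possible way to drop such entries.

```python
import re

# Hypothetical file contents with a separator line before each record.
contents = "Ignore this line\nACGT\nIgnore this line\nTTGA\n"

parts = re.split("Ignore this line\n", contents)
print(parts)

# One option (an assumption, not part of the course scripts): drop empty strings.
non_empty = [part for part in parts if part]
print(non_empty)
```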
......@@ -128,6 +142,13 @@ print(splitseq)
f.close()
```
Note that the output also includes empty strings. You can remove these (for example) with a little bit of sed (recall the rules for single quotes).
```
python split.py > output
sed -i "s/'', //g" output
```
By default, the string is split at every match. However, the maximum number of splits can be controlled by setting the maxsplit parameter.
```
......@@ -135,8 +156,12 @@ splitseq = re.split("", contents, 1)
print(splitseq)
```
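To make the effect of maxsplit concrete, here is a small sketch on an invented comma-separated string:

```python
import re

# Hypothetical input, chosen so the result is easy to read.
contents = "A,B,C,D"

every = re.split(",", contents)
print(every)

# With maxsplit=1 only the first match is used; the remainder stays whole.
limited = re.split(",", contents, maxsplit=1)
print(limited)
```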
The sub() Function
------------------
The sub() function conducts a search-and-replace, substituting each match of the regular expression with a replacement string. See `substitute.py`.
The number of replacements can be limited by specifying the count parameter at the end of the sub() function. See `substitute2.py`.
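A brief sketch of both forms on an invented string (not the contents of `substitute.py` or `substitute2.py`):

```python
import re

# Hypothetical sequence with five Q characters.
sequence = "QQLQAQQ"

all_replaced = re.sub(r"Q", "%", sequence)
print(all_replaced)

# count=2 stops after the first two replacements, working left to right.
two_replaced = re.sub(r"Q", "%", sequence, count=2)
print(two_replaced)
```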
References
==========
......
# From: https://stackoverflow.com/questions/12680767/perl-regular-expression-matching-on-large-unicode-code-points
# Replace all with double quotes
" fullwidth
“ left
” right
„ low
" normal
# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick
# Derived from: https://www.perlmonks.org/?node_id=73692
# Now includes new sentences, the new method, and boundary positions
# Lev Lafayette Dec 2019
my @sentence_list = (
'radiohead',
'turn up your radio',
'the radiology machine is broken',
'telegraph is preradio',
'it is nonradioactive',
);
print "\nOld word boundary method\n";
foreach $sentence (@sentence_list){
    if ($sentence =~ /(^|\W)radio(\W|$)/){
        print "Found in : $sentence\n";
    } else {
        print "Not found\n";
    }
}
print "\nNew word boundary method\n";
foreach $sentence (@sentence_list){
    if ($sentence =~ m/\bradio\b/){
        print "Found in : $sentence\n";
    } else {
        print "Not found\n";
    }
}
print "\nEnd word boundary\n";
foreach $sentence (@sentence_list){
    if ($sentence =~ m/radio\b/){
        print "Found in : $sentence\n";
    } else {
        print "Not found\n";
    }
}
print "\nBeginning word boundary method\n";
foreach $sentence (@sentence_list){
    if ($sentence =~ m/\bradio/){
        print "Found in : $sentence\n";
    } else {
        print "Not found\n";
    }
}
import re
f = open('test-high-gc-1.fastq', 'r')
searchseq= re.findall(r'AAAGT', f.read())
print(searchseq)
f.close()
<?php
$file = 'hidden.txt';
file_put_contents($file,str_replace('mike@...','mike@example.com',file_get_contents($file)))
?>
===========================
Simple Regex Language (SRL)
===========================
Introduction
============
SRL replaces terse abbreviations and various meta-characters with high-level syntax for readability.
Available on GitHub for several languages:
`https://github.com/SimpleRegex`
This file is derived almost entirely from https://simple-regex.com/documentation/ and Linux Magazine, June 2017.
Example RegEx to SRL
====================
Compare:
`/^(?:[0-9]|[a-z]|[\._%\+-])+(?:@)(?:[0-9]|[a-z]|[\.-])+(?:\.)[a-z]{2,}$/i`
With:
```
begin with any of (digit, letter, one of "._%+-") once or more,
literally "@",
any of (digit, letter, one of ".-") once or more,
literally ".",
letter at least 2 times,
must end, case insensitive
```
This is an email validator.
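As a quick check of the regular expression above, the following sketch runs it through Python's re module. Using Python here is an assumption for illustration (SRL itself has several implementations), and the `/.../i` delimiters are replaced by the re.IGNORECASE flag. The candidate addresses are invented.

```python
import re

# The pattern from above, without the /.../i delimiters.
email = re.compile(
    r"^(?:[0-9]|[a-z]|[\._%\+-])+(?:@)(?:[0-9]|[a-z]|[\.-])+(?:\.)[a-z]{2,}$",
    re.IGNORECASE,
)

# Hypothetical test inputs.
results = {}
for candidate in ("mike@example.com", "MIKE@EXAMPLE.COM", "not-an-address"):
    results[candidate] = bool(email.match(candidate))
    print(candidate, results[candidate])
```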
Language Rules
==============
SRL is case-insensitive, although the content in a string is case sensitive, i.e., LITERALLY "TEST" is not the same as literally "test". Commas may be used to separate statements for readability.
Strings are interpreted literally and can be defined with single or double quotation marks.
Only use parentheses when building a sub-query, e.g., a capture group to apply a quantifier.
Characters
----------
Characters are everything that matches the string directly, including "letter", "digit" and "literally".
The syntax of a character is as follows:
`character [specification] [quantifier] [others]`
Some examples:
`literally "string"`
The keyword "literally" means that the string will be exactly as requested, except that backslash is an escape character.
`literally "sample"`
`/(?:sample)/`
`one of "characters"`
Matches on one of the supplied characters as a set.
`one of "a%1"`
`/[a%1]/`
`letter [from a to z]`
Matches one character from a set with a specified range.
`letter from a to f exactly 4 times`
`/[a-f]{4}/`
`uppercase letter [from A to Z]`
Matches a set of uppercase characters with a specified range.
`uppercase letter from A to F`
`/[A-F]/`
`any character`
Matches any word character (i.e., a to z, A to Z, 0 to 9 and _).
`starts with any character once or more, must end`
`/^\w+$/`
`no character`
This will match everything except a to z, A to Z, 0 to 9 and _.
`starts with no character once or more, must end`
`/^\W+$/`
`digit [from 0 to 9]`
Each digit matches only one digit. Use a quantifier to extend, e.g.,
`starts with digit from 5 to 7 exactly 2 times, must end`
`/^[5-7]{2}$/`
`anything`
Matches anything except line breaks, which are matched with `new line`.
`/./`
`whitespace`
Matches any whitespace character. Negative is `no whitespace`. Specific whitespace can be matched with `tab`.
`/\s/`
`/\t/`
`backslash`
An alternative to using `literally "\\"`, since a backslash is an escaping character.
`/\\/`
`raw "expression"`
This character rule allows the inclusion of a standard regex.
`literally "an", whitespace, raw "[a-zA-Z]"`
`/(?:an)\s[a-zA-Z]/`
Quantifiers
-----------
Quantifiers determine how often a statement is allowed to occur.
`exactly x times`
Short forms exist for "once" and "twice" e.g.,
`digit exactly 3 times, letter twice`
`/[0-9]{3}[a-z]{2}/`
`between x and y times`
Specific range of repetitions.
`optional`
Will match if the character type is there, ignore it otherwise.
`digit optional, letter twice`
`/[0-9]?[a-z]{2}/`
`once or more`, `never or more`
Will match if something exists once (or never). May exist multiple times.
`starts with letter once or more, must end`
`/^[a-z]+$/`
`at least x times`
Will match if query matches at least according to quantifier.
`letter at least 10 times`
`/[a-z]{10,}/`
Groups
------
Groups combine matches of characters and quantifiers, applying quantifiers to entire patterns or part of an expression. Groups are like sub-queries.
`capture (condition) [as "name"]`
A capture group allows any condition to be named and returned to an engine.
`capture (anything once or more) as "first", literally " - ", capture "second part" as "second"`
`/(?<first>.+)(?: - )(?<second>(?:second part))/`
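A sketch of the same named groups in Python, offered as an illustration only. Python's re module writes named groups as `(?P<name>...)` rather than `(?<name>...)`, so the pattern is adapted accordingly; the input string is invented.

```python
import re

# Same structure as the regex above, with Python's (?P<name>...) syntax.
pattern = re.compile(r"(?P<first>.+)(?: - )(?P<second>(?:second part))")

# Hypothetical input string.
match = pattern.search("first part - second part")
print(match.group("first"))
print(match.group("second"))
print(match.groupdict())
```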
`any of (condition)`
Every statement supplied in a sub-query could be a match. Use this if you are unsure which part of a condition might match.
`capture (any of (literally "sample", (digit once or more)))`
`/((?:(?:sample)|(?:[0-9]+)))/`
`until (condition)`
Match or capture a specific expression until some other condition is met.
`begin with capture (anything once or more) until "m"`
`/^(.+?)(?:m)/`
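The `?` after `.+` is what makes the capture stop at the first "m". A minimal Python sketch comparing it with the greedy form, on an invented string:

```python
import re

# Hypothetical input containing several "m" characters.
text = "hammer time"

# Lazy: capture as little as possible before an "m".
lazy = re.search(r"^(.+?)(?:m)", text).group(1)
print(lazy)

# Greedy, for comparison: capture as much as possible before an "m".
greedy = re.search(r"^(.+)(?:m)", text).group(1)
print(greedy)
```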
Lookarounds
-----------
Lookarounds define a group to match under certain conditions.
`if [not] followed by`
Match if followed by a condition; a lookahead.
`capture (digit) if not followed by (anything once or more, digit)`
`/([0-9])(?!.+[0-9])/`
`if [not] already had`
Match if preceded by a condition; a lookbehind.
`capture "bar" if already had "foo"`
`/(?<=(?:foo))((?:bar))/`
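Both generated expressions can be tried in Python as a rough check; the input strings below are invented for illustration. Neither lookaround consumes any characters, so only the captured text is returned.

```python
import re

# Negative lookahead: a digit with no later digit after at least one character.
last_digit = re.findall(r"([0-9])(?!.+[0-9])", "a1b2c3d")
print(last_digit)

# Lookbehind: "bar" only when it directly follows "foo".
after_foo = re.search(r"(?<=(?:foo))((?:bar))", "bar foobar")
print(after_foo.group(1), after_foo.span())
```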
Flags and Anchors
-----------------
Flags apply to an entire query in a particular mode e.g., "case insensitive".
`case insensitive`
Regular expressions are case sensitive by default; this flag makes the query match regardless of case.
`letter from a to b twice, case insensitive`
`/[a-b]{2}/i`
`multi line`
To match across more than one line, supply the multi line flag.
`all lazy`
Matching in regular expressions is greedy by default, meaning each quantifier matches as much as possible. "All lazy" makes every quantifier in the query lazy, so each matches as little as possible.
`capture(letter once or more) all lazy`
`/([a-z]+)/U`
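The `/U` modifier is a PCRE feature. Python's re module has no equivalent flag, so the sketch below assumes the usual workaround of marking the quantifier itself as lazy with `?`; the input word is invented.

```python
import re

# Hypothetical input.
word = "regex"

# Greedy: the quantifier takes as many letters as it can.
greedy = re.search(r"([a-z]+)", word).group(1)
print(greedy)

# Lazy: the quantifier takes as few letters as it can, here just one.
lazy = re.search(r"([a-z]+?)", word).group(1)
print(lazy)
```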
Anchors define where a string must start or end.
`begin/starts with`
Ensures that a string starts with a match of some sort (e.g., the '@' of a domain).
`starts with literally "@"`
`/^(?:@)/`
`must end`
Ensure that a string must end with a match of some sort (e.g., the '.' at the end of a standard sentence).
`literally "." must end`
`/(?:\.)$/`
Website Tests
=============
You can build SRL queries on the website and test input.
`https://simple-regex.com/build`
import re
f = open('test-high-gc-1.fastq', 'r')
contents = f.read()
splitseq = re.split("Ignore this line\n", contents)
print(splitseq)
f.close()
import re
string1 = "The quick brown fox jumps over the lazy frog"
startend = re.search("^The.*dog$", string1)
if startend:
    print("String starts with 'The' and ends with 'dog'")
else:
    print("String does not start with 'The' and end with 'dog'")
import re
f = open('gattaca.txt', 'r')
braf= re.sub(r'ATVK', 'ATEK', f.read())
print(braf)
f.close()
import re
f = open('gattaca.txt', 'r')
braf= re.sub(r'Q', '%', f.read(), 4)
print(braf)
f.close()
#!/bin/perl
use warnings;
# Modified from Friedl's "Mastering Regular Expressions"
# Inclusion of perl path, kelvin values, initialised values, use warnings
# by Lev Lafayette. 2019
use warnings;
my ($celsius, $fahrenheit, $kelvin) = (0, 0, 0);
print "Enter a temperature (e.g., 32F, 100C, 373.15K):\n";
......@@ -15,8 +15,9 @@ if ($input =~ m/^([-+]?[0-9]+(?:\.[0-9]*)?)\s*([CFK])$/i) {
# $1 is the number; \s* then allows optional whitespace before the unit
# $2 is "C" or "F" or "K", case insensitive.
# The notation (?:..) will group, but not capture.
$InputNum = $1; # Save to named variables to make the ...
$type = $2; # ... rest of the program easier to read.
$InputNum = $1; # Save to named variables to make the ...
$type = $2; # ... rest of the program easier to read.
if ($type =~ m/c/i) {
# Match c, case insensitive
......