Commit 0427871f authored by root

Update for 19/12

parent 99118f0a
#SBATCH --partition=physical
#SBATCH --constraint== physg4
#SBATCH --constraint=physg4
#SBATCH --ntasks=72
# Load modules, show commands etc
echo $(hostname ) $SLURM_JOB_NAME running $SLURM_JOBID >> hostname.txt
......@@ -5,7 +5,6 @@
#SBATCH --nodes=1
#SBATCH --ntasks=4
module load ORCA/3_0_3-linux_x86-64-OpenMPI-1.6.5
source ~/.bashrc
module load ORCA/4_1_0-linux_x86-64-OpenMPI-3.1.3
$EBROOTORCA/orca 1> orcaNEW3.out
......@@ -67,15 +67,19 @@ and compare to
sed -E 's/QQLQ?//' gattaca.txt
Regular expressions can also backreference (which is technically beyond being a regular language), that is match a previous sub-expression with the values \1, \2, etc. The following is a useful example where one can search for any word (case-insensitive, line-numbered) and see if that word is repeated, catching common typing errors such as 'The the'.
Regular expressions can also backreference (which is technically beyond being a regular language), that is match a previous sub-expression with the values \1, \2, etc. The following is a useful example where one can search for any word (case-insensitive, line-numbered) and see if that word is repeated, catching common typing errors such as 'The the'. Note the multiple ways of doing this.
grep -inw '\([a-z]\+\) \+\1' files
grep -inw '\([a-z]\+\) \+\1' file
grep -Einw '([a-z]+) +\1' files
grep -Einw '([a-z]+) +\1' file
grep -Ein '\<([a-z]+) +\1\>' files
grep -Ein '\<([a-z]+) +\1\>' file
Another similar example, matching repeat characters rather than words (hence no -w).
grep '\([A-Z]\)\1' gattaca.txt
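The same backreference idea carries over to other regex engines. The following is a minimal sketch in Python's `re` module, assuming a made-up sample line rather than the contents of gattaca.txt; the variable names are illustrative only.

```python
import re

# Hypothetical sample line, not taken from the course files.
text = "The the cat sat on on the mat."

# Capture a word, then require the same word again via the backreference \1.
# re.IGNORECASE plays the role of grep's -i, and \b the role of -w.
doubled_word = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)
matches = [m.group(0) for m in doubled_word.finditer(text)]
print(matches)

# Repeated characters rather than words, so no word boundaries are used.
repeated_chars = [m.group(0) for m in re.finditer(r"([A-Z])\1", "GATTACA")]
print(repeated_chars)
```

As with grep, the backreference matches the text that the group actually captured, not the pattern inside the group.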
An example to append the string "EXTRA TEXT" to each line.
......@@ -93,4 +97,3 @@ For example, if an escape character precedes a character that is not a regular e
For example:
`awk '/QL\QA$/' gattaca.txt` will be treated like `awk '/QLQA$/' gattaca.txt` in most contemporary versions of awk. Some however will treat `awk '/QL\QA$/' gattaca.txt` and `awk '/QL\\QA$/' gattaca.txt` as equivalent.
// Example derived from
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches {
private static final String REGEX = "\\brat\\b";
private static final String INPUT = "rat rat rat rattie rat";
public static void main( String args[] ) {
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT); // get a matcher object
int count = 0;
while(m.find()) {
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
......@@ -8,7 +8,7 @@ Variables
Variable assignment in awk is done with the -v option.
`awk -v filename="test-high-gc-1.fasq" -v linecount="4000" 'BEGIN{print filename, linecount}'`
`awk -v filename="test-high-gc-1.fastq" -v linecount="4000" 'BEGIN{print filename, linecount}'`
The shell function `eval` can be used for output e.g.,
......@@ -17,6 +17,9 @@ Integrating shell commands with awk for output.
`eval $(echo 2 3 | awk '{ result=$1+$2; printf "result='\'%s\''\n", result }')`
`echo "$result"`
`eval $(wc -l ./test-high-gc-1.fastq | awk '{ linecount=$1; printf "linecount='\'%s\''\n", linecount }')`
`echo "$linecount"`
Any global variables and their values being used by awk can be written to a file with the --dump-variables option (GNU awk), which uses a default filename of awkvars.out.
`awk --dump-variables ''`
......@@ -60,7 +63,6 @@ Logical "or" is represented by `||`
Logical "not" is represented by `!`
Arithmetic Functions
......@@ -99,6 +101,3 @@ Conditional Statements
The awk language has a well-known control statement, if-then-else. If the condition is true, then-body is executed; otherwise, else-body is executed.
awk -v oddball=7 'BEGIN { if (oddball % 2 == 0) print "oddball is even"; else print "oddball is odd" }'
......@@ -6,10 +6,13 @@ It is good practise to enclose the regular expression in single quotes, to preve
e.g., `grep 'ATEK' gattaca.txt` rather than `grep ATEK gattaca.txt`
Grep with Metacharacters
Some examples of grep options with metacharacters.
* Print only the matched parts of a matching line, with each match on a separate output line.
grep -o '"[^"]\+"' filename
* Count the number of empty lines in a file, '-c' is count.
grep -c '^$' filename
......@@ -20,7 +23,7 @@ grep -v '[aeiou]' /usr/share/dict/words
grep -iwn '^QANTAS' /usr/share/dict/*
Faster Grep
* Localisation settings might slow down your grep search. Prefixing the search with LC_ALL=C forces applications to use the C (POSIX) locale. This speeds up grep in many cases because it is a single-byte comparison rather than multi-byte (as is the case with UTF-8). It is not appropriate if you are searching files with non-ASCII characters!
Metacharacters are heavily used in regular expressions.
......@@ -10,14 +14,14 @@ Note that metacharacters will *vary* according to the context they are used. The
Like other *nix-like systems, this environment is case-sensitive. Metacharacters that are used outside of their context, even within a regular expression, will be treated as literal characters.
Basic Examples
Note that the examples use strong quotes for the search term to prevent the possibility of inadvertent shell expansion. If shell expansion, variable substitution, etc. is desired, then use weak quotes.
| Metacharacter | Explanation | Example |
| ^ | Beginning of line anchor | `grep '^row' /usr/share/dict/words` |
| $ | End of line anchor | `grep 'row$' /usr/share/dict/words` |
| ^ | Beginning of string anchor | `grep '^row' /usr/share/dict/words` |
| $ | End of string anchor | `grep 'row$' /usr/share/dict/words` |
| . | Any single character | `grep '^...row...$' /usr/share/dict/words` |
| * | Match zero or more of the preceding character | `grep '^...row.*' /usr/share/dict/words` |
| [ ] | Matches one in the set | `grep '^[Pp].row..$' /usr/share/dict/words` |
......@@ -29,8 +33,17 @@ Note that the examples use strong quotes for search term to prevent the possibil
| \< \> | Beginning and end word boundaries | `grep '\<cat\>' /usr/share/dict/words` |
Remember that the `.` is _lazy_ (it matches _any_ character, including those you might not want or need) and the `*` is _greedy_ (it matches 0 or more of the preceding character, and as many as it can).
Try the following:
grep '".*"' problem.txt
grep -o '"[^"]\+"' problem.txt
The second example (grep, matching only, negate set of any characters except quote, add match at least once) is much more complex, but gives the right answer.
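The difference between the two grep commands can be reproduced in Python's `re` module. This is a minimal sketch that assumes a made-up line standing in for the contents of problem.txt, which is not shown here.

```python
import re

# Hypothetical stand-in for one line of problem.txt.
line = 'say "hello" and then "goodbye"'

# Greedy: .* runs to the last quote on the line, giving one long match.
greedy = re.findall(r'".*"', line)

# Negated set: [^"]+ cannot cross a quote, so each quoted string matches separately.
negated = re.findall(r'"[^"]+"', line)

print(greedy)
print(negated)
```

The negated set is the Python counterpart of `grep -o '"[^"]\+"'`; `re.findall` returns only the matched parts, much as `-o` does.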
Range Statements
A range statement, e.g., [[:alpha:]], is the equivalent of the range [A-Za-z], and [[:punct:]] is the equivalent of '][!"#$%&'()*+,./:;<=>?@\^_`{|}~-].
......@@ -58,10 +58,10 @@ Adding the `w` option will print out the selection line to a new file.
sed 's/ELM/LUV/gw selection.txt' gattaca.txt
Quoting & Variables
Generally strong quotes are recommended for regular expressions. However, sed often wants to use weak quotes to include (for example) variable substitution.
Generally strong quotes are recommended for regular expressions. However, sed often wants to use weak quotes to include (for example) variable substitution.
Consider for example, a job run which searches for a hypothetical element and comes with the result of Unbihexium. This is the equivalent of:
......@@ -72,6 +72,12 @@ A file where the search term UnknownElement exists could be replaced with Unbihe
`sed "s/UnknownElement/$UnknownElement/g" filename`
Another issue is conducting substitutions with single quotes in the stream, in which case double-quotes are an appropriate tool.
sed "s/ones/one's/" <<< 'ones thing'
Another method is to replace each embedded single quote with the sequence: '\'', i.e., quote backslash quote quote, which closes the string, appends an escaped single quote and reopens the string.
The sed metacharacter `&` can be used as a variable for the matched text. For example, a list of telephone numbers could have the first two digits selected and then surrounded by parentheses for readability.
`sed 's/^[[:digit:]][[:digit:]]/(&) /' phonelist.txt`
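For comparison, a minimal Python sketch of the same substitution follows. Python's `re.sub` has no `&`; the whole match is referenced as `\g<0>` in the replacement. The phone numbers are invented for illustration and are not from phonelist.txt.

```python
import re

# Hypothetical phone numbers, standing in for lines of phonelist.txt.
numbers = ["0398765432", "0212345678"]

# \g<0> refers to the entire matched text, like & in sed.
formatted = [re.sub(r"^[0-9]{2}", r"(\g<0>) ", n) for n in numbers]

for entry in formatted:
    print(entry)
```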
......@@ -114,10 +120,6 @@ One can add new material to a file in such a manner with the insert option:
`sed '1,2 i\foo' file` or `sed '1,2i foo' file`
Select duplicate words in a line and remove.
`sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g' file`
Multiple Commands
Talking chamber as shewing an it minutes. Trees fully of blind do. Exquisite favourite at do extensive listening. Improve up
musical welcome he. Gay attended vicinity prepared now diverted. Esteems it ye sending reached as. Longer lively her design settle
tastes advice mrs off May indulgence difficulty ham can put especially. Bringing remember for
supplied her why was confined. Middleton principle did she procuring extensive believing add. Weather adapted prepare oh is calling.
Far advanced settling say finished raillery. Offered chiefly farther of my no colonel shyness. Inhabit hearing
perhaps on ye do no. It maids decay as there he. Smallest on suitable disposed do although blessing he juvenile in. Society or if excited forbade.
Here name off yet she long sold easy whom. Differed oh cheerful procured pleasure securing suitable in. Hold rich on an he oh fine.
Chapter ability shyness Inquietude simplicity terminated she compliment remarkably few her nay. The weeks are ham asked
jokes. Neglected perceived shy nay concluded. Not mile draw plan snug ext all. Houses latter an valley be indeed wished mere In my. Money doubt oh drawn every or an china
Java Regular Expressions
The java.util.regex Package
Java uses the java.util.regex package for regular expressions and pattern-matching.
The package consists of three main classes:
* Pattern Class. A Pattern object that is a compiled representation of a regular expression.
* Matcher Class. A Matcher object that interprets the pattern and performs match operations.
* PatternSyntaxException. An exception for noting syntax errors in a regular expression.
Typically Java regex will start with:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Metacharacters & Sequences
The following RegEx metacharacters are used in Java. They are very similar to those used in other languages.
. Matches any character
^ Matches the beginning of the line.
$ Matches the end of the line.
. Matches any single character except newline. Using the s (DOTALL) option allows it to match the newline as well.
[abc] Matches characters in a set (e.g., a or b or c)
[ab][cd] Matches characters in a set (e.g., a or b followed by c or d)
[.] Matches any single character in brackets.
[^.] Matches any single character not in brackets.
a|b Matches a or b
\A Beginning of the entire string.
\z End of an entire string.
\Z End of an entire string except allowable final line terminator.
re* Matches 0 or more occurrences of the preceding expression.
re+ Matches 1 or more of the previous expression.
re? Matches 0 or 1 occurrences of the previous expression.
re{ n} Matches exactly n number of occurrences of the preceding expression.
re{ n,} Matches n or more occurrences of the preceding expression.
re{ n, m} Matches at least n and at most m occurrences of the preceding expression.
(re) Groups regular expressions and remembers the matched text.
(?: re) Groups regular expressions without remembering the matched text.
(?> re) Matches the independent pattern without backtracking.
\w Matches word characters.
\W Matches nonword characters.
\s Matches whitespace. Equivalent to [ \t\n\x0B\f\r].
\S Matches nonwhitespace.
\d Matches digits. Equivalent to [0-9].
\D Matches nondigits.
\A Matches the beginning of the string.
\Z Matches the end of the string, not including the newline.
\z Matches the end of the string.
\G Matches the point where the last match finished.
\n Back-reference to capture group number "n".
\b Matches the word boundaries when outside brackets. Matches the backspace (0x08) when inside brackets.
\B Matches the nonword boundaries.
\n, \t, etc. Matches newlines, carriage returns, tabs, etc.
\Q Escape (quote) all characters up to \E.
\E Ends quoting begun with \Q.
* Matches zero or more elements.
+ Matches one or more elements. Equivalent to {1,}
? Matches none or one elements. Equivalent to {0,1}
{X} Matches X number of elements. e.g., \d{3}, matches three digits.
{X,Y} Matches between X and Y elements. e.g., \d{1,4} matches at least 1 and up to 4 digits.
*? Non-greedy quantifier. Use of ? after a quantifier means it attempts the smallest match first.
The start() and end() method
`javac $(pwd)/; java -cp . RegexMatches`
The start() and end() method example uses a word boundary to ensure that an entire word and not a substring is being counted (e.g., rattie).
Java Strings have built-in support for regexes through four methods: matches(), split(), replaceFirst(), and replaceAll().
import re
f = open('gattaca.txt', 'r')
print(object.start(), object.end())
print "Please enter your firstname and surname (two names only)";
chomp ($name = <STDIN>); # chomp removes only a trailing newline
if ($name =~ /^\s*(\S+)\s+(\S+)\s*$/) {
# Ignore preceding white space, capture non-whitespace, have at least one whitespace character, capture second non-white space, ignore trailing whitespace.
print "Hi $1. Your Surname is $2.";
# print non-white space group $1, print non-whitespace group $2
} else {
print "Sorry, I could not work out your name. I am just a little script and I am not very smart.";
print "\n";
import re
print(regexsearch)
Perl-Compatible Regular Expressions
Perl Compatible Regular Expressions (PCRE) is a C library that implements a regular expression engine that attempts to match the capabilities and syntax of Perl.
PCRE's syntax is more powerful and flexible than the POSIX regular expression flavors.
Some Features
1. Extended character classes. Like Perl, PCRE offers extension to POSIX classes so that `\d` is the equivalent to [[:digit:]].
2. Minimal matching. Like Perl, PCRE offers "ungreedy" or minimal matches. A `?` placed after any repetition quantifier indicates that the shortest match should be used.
3. Unicode matching.
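Minimal matching can be illustrated with a short sketch. Python's `re` module is not PCRE, but it shares the `?` suffix for ungreedy quantifiers, so it is used here as an assumption-laden stand-in; the sample string is invented.

```python
import re

# Hypothetical HTML-like sample string.
markup = "<b>bold</b>"

# Greedy: .+ takes as much as possible, so the match spans both tags.
greedy = re.search(r"<.+>", markup).group(0)

# Minimal: .+? takes as little as possible, so the match stops at the first >.
minimal = re.search(r"<.+?>", markup).group(0)

print(greedy)
print(minimal)
```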
Special Characters
Character Meaning
\n newline
\r carriage return
\t tab
\0 null character
\YYY octal character YYY
\xYY hexadecimal character YY
\cY control character Y (e.g., \cC)
Character Meaning
^ Start of a string
$ End of a string
\b Word boundary
\B Non-word boundary
(?=..) Positive lookahead
(?!..) Negative lookahead
(?<=..) Positive lookbehind
(?<!..) Negative lookbehind
\Q..\E Remove metacharacter meaning
Character Meaning
* 0 or more
+ 1 or more
? 0 or 1
{2} exactly 2
{2,5} between 2 and 5
{2,} 2 or more
PCRE uses character classes and POSIX classes e.g.,
Character Meaning
[ab-d] One character of a,b,c,d
[^ab-d] One character that is not a,b,c,d
\d One digit
\D One non-digit
\s One whitespace
\S One non-whitespace
Differences from Perl
Perl and PCRE are very close but not quite the same in the way they express and manage regular expressions. They are quite advanced! These are *some* of the differences:
Good news! Since PCRE 7.x and Perl 5.9.x the two projects are working together to maximise compatibility!
1. PCRE uses a subset of Perl's Unicode support and Perl escape sequences \p, \P, and \X are supported only if PCRE2 is built with Unicode support (the default).
2. PCRE uses atomicity for recursive matches and Perl does not. Also, (?!a){3} in PCRE does not assert that the next three characters are not a; rather, it asserts three times that the next character is not a. A more obscure example (taken from Wikipedia):
"<<!>!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ will match in Perl but not in PCRE.
3. The following escape sequences are supported in Perl, but not in PCRE: \F, \l, \L, \u, \U, and \N when followed by a character name.
4. PCRE supports \Q .. \E for escaping substrings, and characters are treated as literals. In Perl, `$` and `@` variable substitution occurs. e.g.,
Pattern PCRE2 matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
\Q\\E \ \\E
Taken and derived from: pcre2compat man page by Philip Hazel
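Python's `re` module does not support \Q .. \E, but `re.escape()` serves a similar purpose by escaping metacharacters in a string before it is used as a pattern. A minimal sketch, with an invented search string and haystack:

```python
import re

# Hypothetical literal text containing a metacharacter ($).
needle = "abc$xyz"
haystack = "say abc$xyz twice"

# Unescaped, $ is an end-of-string anchor, so the pattern cannot match here.
unescaped = re.search(needle, haystack)

# re.escape() backslash-escapes the metacharacters, so $ is matched literally.
escaped = re.search(re.escape(needle), haystack)

print(unescaped)
print(escaped.group(0))
```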
PCRE Tester
Test your code!
$filename = "hidden.txt";
$handle = fopen($filename, "r"); //open file in read mode
$contents = fread($handle, filesize($filename)); //read file
if (preg_match("/\.org/i", $contents)) // Marked as case-insensitive for illustrative purposes.
echo "Found an org\n\n";
else
echo "Did not find an org\n\n";
$contents = preg_replace("/\.org/i", "", $contents);
print $contents;
$contents = preg_split ("/(\.com | \.org)/", $contents);
print "$contents[0]\n";
print "$contents[1]\n";
print "$contents[2]\n";
// etc.
fclose($handle); //close file
# Basics of Perl
Basics of Perl
Simple variables always begin with a dollar sign, and can hold a number or text.
......@@ -6,22 +7,22 @@ Comments begin with # and continue for the rest of the line.
Variables can appear within a double-quoted string.
Perl has loops, conditionals etc.
# Perl special characters
Perl special characters
\t tab character
\n newline character
\r carriage return character
\s matches any “whitespace” character (space, tab, newline, etc)
\S anything not \s
\w [a-zA-Z0-9R] Useful as in \w+ to match a word.
\s matches any one whitespace character (space, tab, newline, etc)
\S matches any one not \s
\w matches any one word character; [a-zA-Z0-9_] Useful as in \w+ to match a word.
\W anything not \w , i.e., [^a-zA-Z0-9_]
\d [0-9] , a digit
\d [0-9] , any one digit
\D anything not \d , i.e., [^0-9]
# Perl Metacharacters
Perl Metacharacters
In Perl regular expressions, the following are metacharacters:
......@@ -42,6 +43,7 @@ In Perl regular expressions, the following are metacharacters:
| [^..] | Matches characters not in the brackets |
| | | Matches either, an "or" statement |
| \b | Matches a word boundary |
| \B | Matches where \b doesn't match |
| \w | Matches alphanumeric and underscore |
| \W | Matches non-alphanumeric |
| \s | Matches whitespace |
......@@ -50,22 +52,34 @@ In Perl regular expressions, the following are metacharacters:
| ^ | Matches beginning of a line or string |
| $ | Matches end of a line or string |
See perl -w
Word Boundaries
The metacharacter \b is like an anchor, matching a zero-length boundary before the first character in a word, after the last character in a word, or between two characters where one is a word character and the other is not.
When seeking to match whole words, one should use \bword\b ; an old method was /(^|\W)word(\W|$)/, which is roughly equivalent but also consumes the neighbouring non-word characters.
The position of the \b can alter the search term.
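How the position of \b changes what is matched can be checked with a small sketch. It is written in Python's `re` module, whose \b behaves the same way for ASCII words; the sample text is invented.

```python
import re

# Hypothetical sample text.
text = "rat brat rattie rat"

# Boundary on both sides: only the whole word "rat".
whole_word = re.findall(r"\brat\b", text)

# Boundary at the start only: also matches the start of "rattie".
word_start = re.findall(r"\brat", text)

# Boundary at the end only: also matches the end of "brat".
word_end = re.findall(r"rat\b", text)

print(len(whole_word), len(word_start), len(word_end))
```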
# Perl Matching RegEx
See perl -w
Perl Matching RegEx
A variable, for example '$searchterm' is linked, '=~', to a pattern match, for example 'm/^[0-9]+$/'.
The "/i" modifier, appended after the closing delimiter of the match operator, makes the test case-insensitive.
After a successful match, Perl provides the variables $1 , $2 , $3 , etc., which hold the text matched by their respective ( ) parenthesized subexpr essions in the regex. Sub-expressions can be nested.
After a successful match, Perl provides the variables $1 , $2 , $3 , etc., which hold the text matched by their respective ( ) parenthesized subexpressions in the regex. Sub-expressions can be nested.
The non-capturing parentheses "(?: )" will group without capturing. For example '(?:ab)+' will repeat the match "ab" without creating any separate sub-expressions.
Review the code for an example of the above.
Review the code in and for an example of the above.
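As a rough illustration of capturing and non-capturing groups, here is a sketch in Python's `re` module, where `group(1)` and `group(2)` play the role of Perl's $1 and $2. The name and the strings being matched are made up for the example.

```python
import re

# Hypothetical input with stray whitespace.
name = "  Ada   Lovelace "

# Two capturing groups: first name and surname.
m = re.match(r"^\s*(\S+)\s+(\S+)\s*$", name)
first, surname = m.group(1), m.group(2)
print(first, surname)

# (ab)+ is a capturing group, so it occupies group 1.
capturing = re.match(r"(ab)+(c)", "ababc").groups()

# (?:ab)+ groups without capturing, so (c) becomes group 1.
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()

print(capturing)
print(non_capturing)
```

Note that a repeated capturing group keeps only the text from its last repetition.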
# Repetition
Any expression (a single character, a marked sub-expression, or a character class) can be repeated with the *, +, ?, and {} operators.
......@@ -85,7 +99,8 @@ Expressions can also be repeated:
Don't use 'a(*)'!
# Perl Substitution RegEx
Perl Substitution RegEx
A variable, for example '$searchterm' is linked, '=~', to a substitution value, for example 's/regex/replacement/'.
......@@ -97,13 +112,12 @@ Perl has the capability to "lookaround" for patterns in a regex and insert repla
Consider also the population of Australia, 25323300 (2016 estimate). If one wanted to present this as comma-separated in groups of three (25,323,300), the following could be used: '$pop =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g;'
# Lookahead and lookbehind summary table:
Lookahead and lookbehind summary table
| Type | Regex | Success conditions |
| Positive Behind | (?<=..) | if regex matches to the left |
| Negative Behind | (?<!..) | if regex does not match to the left |
| Postivfe Ahead | (?=..) | if regex matches to the right |
| Positive Ahead | (?=..) | if regex matches to the right |
| Negative Ahead | (?!..) | if regex does not match to the right |
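The lookaround forms in the table can be tried out with a short sketch. Python's `re` module accepts the same lookaround syntax, so it is used here as a stand-in for Perl; the price string is invented for illustration.

```python
import re

# The population figure quoted above, as a string.
pop = "25323300"

# Insert a comma at every position that has a digit to the left and a
# whole number of three-digit groups to the right.
with_commas = re.sub(r"(?<=\d)(?=(\d\d\d)+(?!\d))", ",", pop)
print(with_commas)

# Hypothetical sample text for lookbehind.
prices = "cost $45 and 30"

# Positive lookbehind: digits preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
plain = re.findall(r"(?<!\$)\b\d+", prices)

print(dollars, plain)
```

The lookarounds match positions rather than characters, which is why the substitution inserts commas without removing any digits.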
PHP: Hypertext Preprocessor RegExes
PHP is a general-purpose programming language often used in database-driven websites. It has two types of regular expressions: POSIX-extended and Perl-compatible. The Perl-compatible regexes must be enclosed in delimiters, but are more powerful.
Note that the POSIX methods are deprecated as of PHP 5.3 and removed as of PHP 7.
String Functions
PHP has a number of string functions some of which can mimic some of the basic stream editing functions. Note that these are literal and do not have special metacharacters, apart from usual PHP syntax rules.
For example:
Replaces characters with specified characters. Count is optional for the number of replacements.
Example; simple.php
str_ireplace() acts the same but is case-insensitive in the search.
POSIX-type Metacharacters
. Matches any single character
+ Matches any string containing at least one of the preceding expression.
* Matches any string containing zero or more of the preceding expression.
? Matches any string containing zero or one of the preceding expression.
{N} Matches any string containing a sequence of N of the preceding expression.
{2,3} Matches any string containing a sequence of two or three of the preceding expression.
{2,} Matches any string containing a sequence of at least two of the preceding expression.
$ Matches any string with the preceding expression at the end.
^ Matches any string with the following expression at the beginning.
[0-9] Matches any digit 0 through 9.
[a-z] Matches any lower-case character from a to z.
[A-Z] Matches any upper-case character from A to Z.
[a-zA-Z] Matches any character from lowercase a through uppercase Z.
Range Statements
[[:alpha:]] Matches any alphabetic characters.
[[:digit:]] Matches any digits.
[[:alnum:]] Matches any alphanumeric characters.
[[:space:]] Matches any space.
Set and Quantifier Examples
p.p Matches any string containing p, followed by any character, in turn followed by another p.
^.{5}$ Matches any string containing five characters exactly.
<b>(.*)</b> Matches any string enclosed within <b> and </b>.
POSIX Methods
ereg() The ereg() function searches a string specified by string for a string specified by pattern, returning true if the pattern is found, and false otherwise.
Search, Case Insensitive
The eregi() function searches throughout a string specified by string for a string specified by pattern. The search is not case sensitive.
The ereg_replace() function searches for string specified by pattern and replaces pattern with replacement if found.
Replace, insensitive
The eregi_replace() function operates exactly like ereg_replace(), except that the search for pattern in string is not case sensitive.
The split() function will divide a string into various elements, the boundaries of each element based on the occurrence of pattern in string.
Split, insensitive
The spliti() function operates exactly in the same manner as its sibling split(), except that it is not case sensitive.
For a simple example, see posix.php
Perl-Compatible Regular Expressions
You should use these now.
The POSIX-style syntax can be used almost interchangeably with PCRE regular expression functions, including any of the quantifiers listed above.
PCRE Metacharacters and Modifiers
Metacharacter Meaning
. A single character
\s A whitespace character (space, tab, newline)
\S A non-whitespace character
\d Any digit (0-9)
\D Any non-digit
\w Any "word" character (a-z, A-Z, 0-9, _)
\W Any non-word character
[aeiou] Matches a single character in the set
[^aeiou] Matches a single character not in the set
(foo|bar|baz) "Or" statement; match any alternative.
Modifier Description
i Case insensitive match
m If the string has newline or carriage return characters, the ^ and $ operators match against a newline boundary, instead of a string boundary
o Evaluates the expression only once
s Use . to match a newline character
x Use white space in the expression
g Globally find all matches
cg Continue search even after a global match fails
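The effect of the i, m, and s modifiers can be sketched in Python, where they correspond to the flags `re.IGNORECASE`, `re.MULTILINE`, and `re.DOTALL`. This is an illustration in a different engine, not PHP itself, and the two-line sample text is invented.

```python
import re

# Hypothetical two-line sample text.
text = "First line\nsecond line"

# i: case-insensitive matching.
case_sensitive = re.search(r"first", text)
case_insensitive = re.search(r"first", text, re.IGNORECASE)

# m: ^ matches at the start of every line, not only the start of the string.
string_start = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# s: . also matches the newline character.
dot_default = re.search(r"line.second", text)
dot_all = re.search(r"line.second", text, re.DOTALL)

print(line_starts)
```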
PCRE Methods
Searches for a pattern, returning true if pattern exists, and false otherwise.
Match All
Matches all occurrences of pattern in string.
Searches for a pattern and replaces pattern with replacement if found. Unlike ereg_replace(), regular expressions can be used in the pattern and replacement parameters.