Discovering conserved DNA

Welcome to lecture 2: Feeling at home in *nix
IGERT Sponsored Bioinformatics Workshop Series
Michael Janis and Max Kopelevich, Ph.D.
Dept. of Chemistry & Biochemistry, UCLA

Last time
We covered a bit of material
Try to keep up with the reading; it's all in there!
How's it coming along?

BioKnoppix
Remote logins, navigation
Unix / Linux concepts?
General questions?

The CLI and YOU
Most of bioinformatics is accomplished through command-line tools
Command-line interaction is easily batched
Command-line interaction is easily integrated
Command-line interaction is a form of PROGRAMMING
It's therefore worthwhile to become familiar with your *nix environment in a non-graphical interface

Commands
In bioinformatics, we are mostly concerned with TEXT PROCESSING; the CLI is well suited for this type of work
Specific commands are used to perform functions in the shell
Each command is itself a program and takes command-line arguments
The syntax order is: program [-options] filename
For help on a specific command, type: man command; apropos topic; command --help

Some review of system tools
who
w
uname
pwd
find
top

Another example of a pipe
file -> Command 1 (cut) -> pipe (|) -> Command 2 (sort) -> STDOUT
>cut -d: -f1 < /etc/passwd | sort
The file /etc/passwd stores information about user accounts on the system
Let's get a sorted listing of all user names

Example: redirecting STDOUT
STDIN -> command or program -> STDOUT -> OUTPUT_FILE

>cut -d: -f1 < /etc/passwd | sort > output_file
>more output_file
Here > is the redirection operator

Process Control
Each specific job / command is called a process
Each process runs in a shell
BEFORE: prompt available
DURING: prompt NOT available
AFTER: prompt available

Control keys
CTRL-C -> stop current command
CTRL-D -> end of input

Two ways to monitor processes
top
- Lists all jobs
- Uses a table format
- Dynamically changes
ps
- man ps
- Static content
- Command options

What are you doing, Dave?

Background / Foreground
Commands running in the foreground prevent the prompt from being used until the command completes
Commands can also run in the BACKGROUND
Backgrounded commands DO NOT AFFECT the prompt

Two ways to background jobs
&
- Running a command with & automatically sends it to the background
- Backgrounded commands return the prompt
bg
- Once a command is run from the prompt, stop the command (CTRL-Z)
- Then background it with bg
- bg starts the command again
- bg returns the prompt for use
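A minimal sketch of backgrounding with &, assuming a bash shell. The function long_job and its result file are hypothetical stand-ins for a real analysis; sleep simply simulates work.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a long-running analysis: wait, then write a result.
workdir=$(mktemp -d)
long_job() {
    sleep 1
    echo "done" > "$workdir/result.txt"
}

# A trailing & sends the command to the background; the shell moves on at once.
long_job &
job_pid=$!            # $! holds the process ID of the most recent background job

echo "background job started with PID $job_pid"

# 'wait' blocks until that background job has finished and returns its status.
wait "$job_pid"
job_status=$?
cat "$workdir/result.txt"
```

At an interactive prompt, jobs lists the backgrounded commands and fg brings one back to the foreground.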

File System Navigation
Absolute filepaths begin with the root, /
Relative filepaths don't have a preceding slash; they begin from the current working directory (cwd)
What is the absolute path to cd from john to mary?
What is the relative path to cd from john to mary?
Once you are in mary, and your username is john, what are two ways to return to your home directory?

The society for anti-defamation of computer mouses opposes this slide
There's very little reason to leave the CLI
Most tasks can be written within the shell
The user-friendliness becomes self-limiting
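The path questions above can be tried safely in a scratch directory. This is only a sketch: it assumes john and mary are sibling directories under a home/ directory, a layout invented here for illustration.

```shell
#!/usr/bin/env bash
# Assumed layout (hypothetical): john and mary are siblings under home/.
root=$(mktemp -d)
mkdir -p "$root/home/john" "$root/home/mary"

cd "$root/home/john"

# Absolute path: spelled out in full from the top of the tree.
cd "$root/home/mary"
abs_result=$(pwd)

# Relative path: go up one level from john, then down into mary.
cd "$root/home/john"
cd ../mary
rel_result=$(pwd)

echo "$abs_result"
echo "$rel_result"
```

Both routes end in the same directory. For returning home, cd with no argument and cd ~ are the usual shortcuts.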

Let's take an example
Suppose you wanted to do some biological analysis, like motif searching through a database of biological sequences
What do you need to do this?
- You need to retrieve the sequences
- You need to describe the motif
- You need to search the sequences

I want to search for zinc-finger motifs genomically in yeast (S.c.)
- I'm going to need the genomic sequence for Saccharomyces cerevisiae (
- I'm going to need the motif that describes the zinc finger I'd like to search for (ProSite).
- I'm going to need to do this search many times across every chromosome.

A brief overview of some databases / biological information repositories

NCBI

Genome-specific databases (SGD)

SMD: The Stanford Microarray Database. Repository of microarray analysis from a wide variety.

PROSITE: Used to rapidly search your protein sequences for catalogued motifs.

SWISSPROT: SWISSPROT is a "one stop shop" for protein sequence information. Use it to extend your knowledge of your proteins.

PDB: The Protein Data Bank is the single worldwide archive of structural data of biological macromolecules. Structure implies function in general.

PFAM: This database is a collection of protein motifs.

PRODOM: PRODOM is similar to PFAM in that it is a set of curated protein domain families. However, the underlying computational engine is different.

BLOCKS: Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.

COG: COG stands for Clusters of Orthologous Groups of proteins. This is a tool for phylogenetic classification of proteins encoded in complete genomes. COGs were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages.

Retrieving data
You don't have to leave the CLI. Really.
If you need to do something, chances are there's a utility to do so
Debian is your friend (search packages FIRST!!!)

Introducing wget:
>wget /hypothetical_peptides/*.gz
Of course you can use ftp:
>ftp
- login anonymous; use your email address as passwd
- traverse filesystem like any Linux CLI
- bin, get, prompt, mget

A note about file archives
Most files will be compressed, usually with gzip (gunzip uncompresses them).
Most files will be agglomerative, using TAR.
Introducing gunzip:
>gunzip *.gz
Introducing tar (tape archive):
>tar xvf *.tar
(this works when *.tar matches a single archive; with several archives, run tar once per archive)
Or to create a tar:
>tar cvf output.tar *.*
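The archive commands above can be rehearsed as a round trip in a scratch directory. This is a sketch only; the file names and sequence contents are made up for illustration.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Two small hypothetical sequence files to bundle.
printf '>seq1\nMTEITAAMVKEL\n' > chr01.fsa
printf '>seq2\nMNYILLLLLIKL\n' > chr02.fsa

# c = create, v = verbose, f = archive file name; then compress the archive.
tar cvf peptides.tar chr01.fsa chr02.fsa
gzip peptides.tar                 # produces peptides.tar.gz

# Remove the originals, then reverse the two steps.
rm chr01.fsa chr02.fsa
gunzip peptides.tar.gz            # back to peptides.tar
tar xvf peptides.tar              # x = extract

restored=$(ls chr0*.fsa | wc -l | tr -d ' ')
echo "restored files: $restored"
```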

A brief note about the biological file format called FASTA
In bioinformatics, FASTA format is a file format used to exchange information between genetic sequence databases. Its format looks like this:

>SEQUENCE_1
;comment line 1 (optional)
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREK
GLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFV
ENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKE
AEE

It consists of a header line (beginning with a '>') which gives a name and/or a unique identifier for the sequence. Many different sequence databases use FASTA files.
After the header line and comments, one or more sequence lines may follow. Sequences may be protein sequences or DNA sequences; each line should be shorter than 80 characters, and sequences can contain gaps or alignment characters.
FASTA format files often have file extensions like .fa or .fsa
The simple format of FASTA files makes them easy to manipulate using text processing tools and scripting languages like Perl.
*From

ProSite motif

Describing the motif - GREP
GREP searches contents of a file or directory of files
Get Regex: uses regular expressions
File wildcards can be used like with ls
>grep 1sq ~/DATA/*.CEL -> array type used
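Because every FASTA record starts with a '>' header, grep alone can answer simple questions about a file. The sketch below uses a small made-up FASTA file; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# A tiny hypothetical FASTA file with two records.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREK
GLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFV
>SEQUENCE_2
MNYILLLLLIKLLIIINMKLIKIL
EOF

# Every record starts with a '>' header, so counting headers counts sequences.
n_records=$(grep -c '^>' "$fasta")

# -v inverts the match: keep only sequence lines, join them, count characters.
n_residues=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "records: $n_records"
echo "residues: $n_residues"
```

The pattern is quoted so the shell does not treat > as a redirection.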

We explored this last time (briefly!)

Regular expressions
A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings Mike, Mark, and Matt can be described by the pattern "M(ike|a(rk|tt))".
Alternatively, it is said that the pattern "M(ike|a(rk|tt))" matches each of the three strings. There are usually multiple different patterns describing any given set; "M(ike|ark|att)" describes the same one. Most formalisms provide the following operations to construct regular expressions.

Formalisms of regular expressions

alternation
A vertical bar separates alternatives. For example, "gray|grey" matches grey or gray.

grouping
Parentheses are used to define the scope and precedence of the operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but they both describe the set containing gray and grey.

quantification
A quantifier after a character or group specifies how often that preceding expression is allowed to occur. The most common quantifiers are ?, *, and +:
?  The question mark indicates that the preceding character may be present at most once. For example, "colou?r" matches color and colour.
*  The asterisk indicates that the preceding character may be present zero, one, or more times. For example, "0*42" matches 42, 042, 0042, etc.
+  The plus sign indicates that the preceding character must be present at least once. For example, "go+gle" matches the infinite set gogle, google, gooogle, etc. (but not ggle).

These constructions can be combined to form arbitrarily complex expressions, very much like one can construct arithmetical expressions from the numbers and the operations +, -, * and /.
*From
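The operations above can be checked directly with grep. This is a sketch: matches is a hypothetical helper defined here, not a standard command, and it tests whether a whole string matches a pattern.

```shell
#!/usr/bin/env bash
# matches PATTERN STRING -> exit status 0 if the whole STRING matches PATTERN.
# -E selects extended regular expressions, -x requires a whole-line match,
# -q suppresses output so only the exit status is used.
matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Alternation and grouping describe the same two-word set.
matches 'gray|grey' 'grey'  && echo "gray|grey matches grey"
matches 'gr(a|e)y'  'gray'  && echo "gr(a|e)y matches gray"

# Quantifiers: ? (zero or one), * (zero or more), + (one or more).
matches 'colou?r' 'color'   && echo "colou?r matches color"
matches '0*42'    '0042'    && echo "0*42 matches 0042"
matches 'go+gle'  'ggle'    || echo "go+gle does not match ggle"
```

Whole-line matching (-x) is used here so that each test is about the entire string; plain grep reports a line as soon as any part of it matches.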

The real world is fuzzy and complex
What if we just want to search for a string in the format of a phone number? E.g.
825 8901 (no area code)
213 487 0353 (area code)
Obviously we can't check for each possible phone number (some 10^10 possibilities makes for a very long set of statements). This is where regular expressions come in.
Regular expressions describe generalised patterns of strings instead of exact strings.
>grep -E '([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}' filename
(clearly this is a little more complex as an example)
The examples that follow quote each pattern so the shell passes it to grep unchanged, and use grep -E (extended regular expressions) wherever +, {} or () are needed.

Special characters (metacharacters)
. is a wildcard and matches any single character
>grep '.ed' filename
If file contains bed - will find
If file contains red - will find
If file contains head - will not find
If file contains edward - will find only when some character (a space, say) precedes it on the line; the . must match one character before ed

Special characters (metacharacters)
* means zero or more of the previous character.
>grep 'be*d' filename
If file contains bed - will find
If file contains red - will not find
If file contains beeeed - will find
If file contains bd - will find

Special characters (metacharacters)
+ means one or more of the previous character.
>grep -E 'be+d' filename
If file contains bed - will find
If file contains red - will not find
If file contains beeeed - will find
If file contains bd - will not find

Start and end of line
^ designates the start of the line, $ the end.
>grep 'bed' filename
If file contains bed - will find
If file contains bedbed - will find
If file contains xxxbedxxx - will find
>grep '^bed$' filename
If file contains bed on a line by itself - will find
If file contains bedbed - will not find
If file contains xxxbedxxx - will not find

Grouping with parentheses
Parentheses group characters
>grep -E '(bed)+' filename
If file contains bed - will find
If file contains bedbed - will find
If file contains beddd - will find as well, because the line still contains bed; anchor the pattern as '^(bed)+$' to reject it

Character classes
The square brackets are used to denote whole groups of characters
>grep '[brf]ed' filename
If file contains bed - will find
If file contains red - will find
If file contains led - will not find

Character classes (cont)
A hyphen designates a range:
>grep '[a-z]ed' filename
If file contains bed - will find
If file contains fed - will find
If file contains Bed - will NOT find (why not?)

Character class shortcuts
Some character classes are so common there are in-built shortcuts:
[0-9] = \d
[A-Za-z0-9] = \w
[\f\t\n\r ] = \s
(These shortcuts come from Perl. With GNU grep, \w and \s work as written and \w also matches the underscore; \d needs grep -P, or write [0-9] instead.)

Quantifying
Curly brackets quantify repeats more precisely than * (0+) or + (1+)
a{3,5} = three, four or five a's.
>grep -E 'la{3,5}d' filename
If file contains laaaad - will find
If file contains laaaaaaad - will not find
(the trailing d matters: without it, la{3,5} would still match the first part of laaaaaaad)

Referencing
Back-references match the substring previously matched by the nth parenthesized subexpression of the regular expression. The back-reference is denoted `\n', where n is a single digit
>grep -E '(a)\1' filename
If file contains laaaad - will find
If file contains lad - will not find

Back to our ProSite motif
We can use regular expressions to describe the motif
The motif is actually a REGULAR EXPRESSION!
>grep -n -E --color -B2 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa
chr04.peptides.20040928.fsa-4202->Annotated|04:1356055:1357359| frame 1; YDR448W/ADA2; Verified; this gene contains 1 exon
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL

Did it work?

Let's try this
Download the genomic DNA sequence from SGD
Search for any variant of the TATA box promoter

TATAAA
TATAAT
TATATT
TAATAA
TAATAT

More more more
Many MS tools allow for wildcard searching
The shell allows variables; interpolation; control structures
For example, attempt to find a palindrome of length 4 within genomic sequences (hint: use backreferences!)

Variables allow for persistence and control structures
>myVar=`grep -n -E --color 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa`
$ echo "$myVar"
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL

A better variable interpolation
The variable is allowed to change
We can set the variable to the ProSite pattern (single quotes keep the shell from interpreting any of its characters)
$ myVar='C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C'
$ echo "$myVar"
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
$ grep -n -E --color "$myVar" *.fsa
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL

Variables can be overwritten
$ function afun {
> for i in 1 2 3 4 5
> do
> echo $i
> echo "$myVar"
> done
> }
$ afun
1
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
2
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
3
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
4
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
5
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C

Functions
What if we wanted to search every ProSite pattern against our genomic database?
We'd have to repeatedly do our search
This is called a loop
We have to write this so the computer knows exactly what to repeat, how many times to repeat, and where to find the next ProSite pattern to match
We would store the what and where in VARIABLES
We would utilize a CONTROL STRUCTURE to handle the how

Control structures
All our programs so far have run from start to finish. Each line has been executed in turn.
What if we only want to run some lines some of the time?
This is where control structures come in.
Programming languages generally have a number of control structures.
Basic structures: if, while, for & foreach
There are others (e.g. unless)
for example
>afunction() {
  for i in 1 2 3 4 5
  do
    echo "Looping ... number $i"
  done
}
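Putting variables and control structures together, the sketch below searches several patterns against several files. Everything in it is illustrative: the files chrA.fsa and chrB.fsa, their contents, the file patterns.txt and the function search_all are made up, and GGG+ is simply a second pattern that happens to match nothing.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch"

# Toy stand-ins for the per-chromosome files (contents are made up).
printf '>chrA\nGGCTATAAAGGC\nCCGGCCGG\n' > chrA.fsa
printf '>chrB\nACGTACGT\nTTTATAATCC\n'   > chrB.fsa

# One pattern per line; TATAA[AT] covers the TATAAA and TATAAT variants.
printf '%s\n' 'TATAA[AT]' 'GGG+' > patterns.txt

# Outer loop: while reads the next pattern; inner loop: for walks the files.
search_all() {
    while read -r pattern
    do
        for f in *.fsa
        do
            # -c counts matching lines; -E enables extended regular expressions.
            hits=$(grep -c -E -- "$pattern" "$f")
            if [ "$hits" -gt 0 ]
            then
                echo "$pattern $f $hits"
            fi
        done
    done < patterns.txt
}

report=$(search_all)
echo "$report"
```

The "what" (the pattern) and the "where" (the file) live in variables, while the while, for and if structures decide how often and when each grep runs.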

Variables can be interpolated
The output of a command is substituted into the variable
It's like a pipe, but we are allowed to operate on the result
$ afun() {
> myvar=$(ls -1 *.fsa)
> for i in $myvar
> do
> echo $i
> done
> }
$ afun
chr01.fsa
chr01.peptides.20040928.fsa
chr02.peptides.20040928.fsa
chr03.peptides.20040928.fsa
chr04.peptides.20040928.fsa
chr05.peptides.20040928.fsa
chr06.peptides.20040928.fsa
chr07.peptides.20040928.fsa
chr08.peptides.20040928.fsa
chr09.peptides.20040928.fsa
chr10.peptides.20040928.fsa
chr11.peptides.20040928.fsa

The while control structure (combined with opening files)
The while control structure keeps looping while a given condition is satisfied

while and open files go together very well:
$ afun() {
> while read f
> do
> echo $f
> done
> }
$ afun < chrmt.peptides.20040928.fsa
>Notannotated|mt:385:459| frame 1
MNYILLLLLIKLLIIINMKLIKIL

Editors
Shell programming is like a batch file
Commands are linked together in a procedure
The procedure is accessed via a file
We need an editor that will allow us to construct that file
We'll use Emacs (or you can use vi, pico, ...)
- Comprehensive, extensible working environment
- Complete (arguable!) IDE
- Integration
- Extensible (elisp)

Emacs
Invoking Emacs is easy: emacs -nw filename
In many cases, Emacs will work out the mode appropriate for your file (.cpp, .pl, etc)
The mode allows Emacs to become sensitive to the task
There is a biomode for reverse complement, etc.
You can write your own!
Emacs has many tools
- Search, replace, cut, paste, mail
- File navigation, ftp, remote shells

The Emacs survival guide

Notation
Emacs uses the control key and escape key heavily. We write it like this:
C-x
Pronounced "Control-x". Hold down the Ctrl key (usually in the lower left corner of the keyboard) while pressing the x key. Both Ctrl and x must be down at the same time.
M-x
Pronounced "Meta-x". Press the Esc key (usually in the upper left corner of the keyboard), release it, then press the x key. Esc and x should not be down at the same time.
So C-x C-f means hold down the control key, then type x and then f while holding it down. (This is the command to load a file into emacs).

Typing
Just type. All the regular keys, arrow keys, delete, backspace, and page up/down keys should work. Alternatively, you can try these commands: C-f cursor forward, C-b cursor back, C-p previous line, C-n next line, M-v page up, C-v page down.

Exiting
Type C-x C-c. If you have any unsaved work, emacs will ask you if you want to save it. Type y.

Other commands
Most control or escape sequences are commands. Usually a prompt appears in the command line at the bottom of the window. Here are a few:
C-x C-f  Load file, prompt for filename
C-x C-s  Save file without exiting
C-x C-c  Exit, prompt to save files
C-s  Search forward, prompt for search string
C-r  Search backward, prompt for search string
C-h ?  Show help options, prompt for choice
C-h t  Start emacs tutorial
If you make a mistake or change your mind you can always escape:
C-g  Abandon command and resume typing

Command line editing
Learning the keybindings can be difficult
But it will increase your speed
Faster than using a mouse
Transferable! The keybindings for command line editing from Emacs are the default set of commands for line editing in the Bash shell!

Let's try it
Open up the file that we found contained the

ProSite motif
Open a second window
Go to the line that contains the motif (hint: use grep with -n!)
Copy and paste that line into a new file
Save and close that file

AWK is your pre-perl friend
Use to print a subset of fields
Default field delimiter is " " (white space)
Useful for grabbing a subset of fields
Useful for rearranging fields
field1 field2 field3 field4 . . .
$1     $2     $3     $4     . . .

Using AWK
(DELIM stands for the field delimiter passed to -F)
pipe | awk -F DELIM '{print $1}' |
awk -F DELIM '{print $1 $2}'
awk -F DELIM '{print $1 "\t" $2}'
\t = TAB
\n = newline

Overwrite versus Append
> OVERWRITE: delete and replace
>> APPEND: add to end of existing file
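A short sketch of the difference between > and >>, using a temporary file so nothing real is overwritten (the variable names are illustrative).

```shell
#!/usr/bin/env bash
log=$(mktemp)

echo "first run"  > "$log"    # > creates or overwrites: the file holds one line
echo "second run" > "$log"    # > again: the earlier line is replaced
echo "third run" >> "$log"    # >> appends: the file now holds two lines

n_lines=$(wc -l < "$log" | tr -d ' ')
cat "$log"
```

Only "second run" and "third run" survive, since the second > discarded the first line before the append added a new one.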

Example: microarray data tracking
>grep 1sq ~/DATA/*.CEL (gives array info)
>grep 1sq ~/DATA/*.CEL | awk '{print $12}' (gives array type only)
>grep 1sq ~/DATA/*.CEL | awk '{print $12}' > arrayTypes.txt (store results in file)
>ls ~/DATA/*.DAT | wc (gives a count)
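The same field selection can be tried without any array files. The sketch below uses made-up colon-delimited records in the style of /etc/passwd; the user names and paths are hypothetical.

```shell
#!/usr/bin/env bash
# Made-up colon-delimited records in the style of /etc/passwd.
records=$(mktemp)
printf '%s\n' 'john:x:1001:1001:John:/home/john:/bin/bash' \
              'mary:x:1002:1002:Mary:/home/mary:/bin/bash' > "$records"

# -F sets the field delimiter; $1 and $6 are the first and sixth fields.
# Printing $6 before $1 with "\t" between them rearranges the columns.
table=$(awk -F: '{print $6 "\t" $1}' "$records")
echo "$table"

# With no -F, awk splits each line on runs of white space.
second_word=$(echo "alpha beta gamma" | awk '{print $2}')
echo "$second_word"
```

The awk program is wrapped in single quotes so the shell leaves $1 and $6 for awk to interpret.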
