辅导案例-BIOL60201

BIOL60201 Project 1 Simon Hubbard Bioinformatics M.Sc Module: BIOL60201 Project 1: A Python Proteome Analysis Pipeline This project is designed to be completed by a group of 4 people, but 3 should still be able to manage it comfortably with some help. The project is divided into a series of Tasks, each of which will require some development of a simple Python program or will require some execution of the code supplied for groups smaller than 4. The Tasks will lead in to, or follow on from, another task, and hence it is essential that you think carefully about the formats your programs will be able to cope with, and that you draw up sensible specifications before you all start coding. Overview You will be given a genome sequence, from a bacterial organism, which has just been sequenced. An experimental group working on this organism wishes to carry out some proteomics experiments to identify proteins from the microbe. However, they need a set of proteins in a FASTA formatted database in order to search against with their mass spectra. They then want to know which is the best digestive enzyme to use, so that they get the best chance of identifying as many proteins as possible. In other words, they want to find the enzyme which produces the lowest number (on average) of peptides per protein. This is because this will be more diagnostic, since the peptides are rarer. However, their mass spectrometer can only measure peptides which have a mass (or strictly speaking, mass-to-charge ratio) between 1000 and 1500 Da and it has a maximum mass accuracy of 0.2 Da. Genome Nucleotide Fasta file l ti t fil Predicted protein ORFs Fasta file i t t i t fil Digested peptides (custom format) i t ti t f t Convert sequence to mass (m/z) t t / 1 2 3 Peptide mass data & stats ti t t t 4 Figure 1. Flow chart of proteome project The tasks are described in more detail below. Task 1 – An ORF finder You will need to write a Python program which can read in a Fasta format file containing nucleotide sequence (DNA sequence, containing ACGT) and find all the possible open reading frames (known as ORFs, beginning with ATG1, and ending with a stop codon, TGA, TAA, TAG). You can also assume that 1 Note: >90% of codons in bacterial genes start with ATG but around 10% don’t. However, we’ll assume they all do for the purposes of this exercise. BIOL60201 Project 1 Simon Hubbard the longest ORF will be used, and not have to worry about any internal ones. You will have to worry about all 6 open reading frames though, and predict ORFs in all 6. Your program should accept standard Fasta format as input, of the form: >Genome sequence AATTCAGTTACTTATTCCCCTTATTGGCAGTATTAACGCATAGGACTGCG ATAGACTTTACTAAGCTCAGTATCATTCCCTTATTGGCATAGTTACGCAT CCACCAATAGACTTTACTATTCCCTTCATGGCATATTTACGCAGTAGGAC ATAGACT And should output protein ORFs with the following fasta format: >orf_name frame length start MAKSKDFPVKADFAAHVAIQSEFAHGV >orf_name frame length start MHILDECCAKSKDFPDFAAHVAYIQSEFAA LLILPQWNSDAAHGV Orf_name should be a unique name for the ORF, which might include the name of the genome/organism, frame in which the ORF is found, and some number (ie. if the organism is called I.claudius you could call your ORFs something CLAUD_F1_0017 (for the 17th ORF in frame 1 in I.claudius). Output should be written to standard output. Please take note: this genome, like most, contains a few bases which were not called unambiguously. This means that they are given the letter ‘n’ instead of A, C, G or T. Obviously, these can not be formally translated in an amino acid in most cases. I suggest you ignore all ORFs that contain a codon with a ‘N’ in it. You should also think carefully about the minimum size of an ORF. I suggest you add an option which controls this, and I would use 50 amino acids as the minimum size, or perhaps even 100 amino acids (= 300 nucleotides). Bonus tasks: Deal with more than one sequence, and more than one file supplied on the command line as arguments Add some command line options in, such as the ability to write to a specified output file instead of standard out with –o filename, or restriction to just one frame with –f frame_number. Provide a short “usage” output if the user puts a wrong option or supplies incorrect arguments, or perhaps uses the –q option. Error checking: spot improper formats, unknown amino acids, and correctly deal with “gaps” (there normally won’t be any of course). Only taking longest ORF when the ones in different frames overlap – this is actually very tricky. BIOL60201 Project 1 Simon Hubbard Task 2 – A protein digester This python script should read in the output from task 1. ie. it should be able to cope with one or more protein sequences, and then digest them into smaller peptide sequences, reporting these back one per line. The script should be able to cope with a user-defined choice of digesting enzyme, from the following set: Trypsin: cuts at Lysine (Lys, K) or Arginine (Arg, R) unless the next amino acid is Proline (Pro, P). Endoproteinase Lys-C: cuts at Lysine (Lys, K) unless the next amino acid is Proline (Pro, P). Endoproteinase Arg-C: cuts at Arginine (Arg, R) unless the next amino acid is Proline (Pro, P). V8 proteinase (Glu-C): cuts at Glutamic acid (Glu, E) unless the next amino acid is Proline (Pro, P) The program should read in Fasta formatted protein sequence e.g.: >protein1 ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD And report the individual peptides after digestion (eg. For trypsin) >protein1 peptide 1 ALTAMCMNVWEITYHK >protein1 peptide 2 GSDVNR >protein1 peptide 3 R >protein1 peptide 4 ASFAQPPPQPPPPLLAIKPASDASD Bonus tasks Adapt your code so the program will also consider missed cleavages. These are instances when the enzyme misses out on a putative cutsite. You can allow up to 1 missed cleavage, for example, so that AHLKNMILQEWRMAALI Would produce AHLKNMILQEWR and NMILQEWRMAALI on top of the standard peptides AHLK NMILQEWR MAALI (ie. its inclusive, so that you calculate up to 1 missed cleavage). This should be supported by a command line option to allow the user to select the number of missed cleavages. Add some error trapping so the program spots “unusual” (B,O,U,J,Z) or unknown amino acids (X) Task 3 – A mass analyser In this task you should read in a standard Fasta format file and convert the sequence information into masses, or strictly speaking, mass-to-charge values. You will find the masses of the standard amino acids to 4 decimal places here (http://www.matrixscience.com/help/aa_help.html). The program should cope with both monoisotopic and average masses, and should calculate the mass-to-charge value as if they were all ions of +1 charge (Hence, you need to add 1 to all values, for 1 proton). You also need to add the mass of water to the mass of the summed amino acid values, as these are for a single residue, so you need to add the masses of the terminating groups (e.g. H at the N-terminus BIOL60201 Project 1 Simon Hubbard and OH at the C-terminus). These values are: 18.0106 for monoisotopic masses, and 18.0153 for average masses. Your output should be something of the form. Prot_name peptide mass-to-charge z p sequence ABC1_YEAST 1 1234.5678 1 0 ALTAMCMNVWEITYHK ABC1_YEAST 2 894.3019 1 0 SSLPMAR ABC1_YEAST 3 1031.6791 1 0 MNASFFAAQL Where “peptide” is the number of the peptide in the protein, z is the charge of the ion (+1 by default), and p is the number of missed cleavages (you could calculate this yourself, or perhaps get Task 2 to pass this information on ?). All of the options should be provided on the command line. Bonus tasks Consider some standard post-translational modifications, such as blocking of reduced cysteines, oxidation of methionines and phosphorylation of Ser, Tyr and Thr. The script should be able to report only N-terminal peptides (ie the first one in a protein) or jus
t the C-terminal ones (the last one in a protein), selectable from the command line as an option. Task 4 – the ion statistics calculator This task should read in the output from task 3 and calculate some statistics and export the results, to the screen or a file. It should be able to report the number of peptides in a given mass range. Ie. calculate the average number of m/z values found within a m/z range (or set of m/z bins) specified by the user on the command line. For the binned histogram data, the bin sizes should be user-definable and would allow you to model the mass accuracy of the instrument. It should also be able to selectively count ions if they match a certain sequence pattern (ie. only ions which contain a Cys, or ions which start with a Met). The utility of this simple script will be self-evident when you read what you have to do in the next section. The whole team should be involved to help design the question this tool needs to be able to answer. So think carefully (as a team) as to what makes an enzyme more (or less) useful for proteomics? Although you may end up plotting the data yourself with other tools, or possibly using Python Notebooks, the script itself should just calculate the basic numbers and export CSV or tab-delimited text (not plots or pictures) which could be used by other tools. Bonus tasks Adapt the script so that it can cope with mass accuracy supplied as parts per million (ppm) instead of Da. Overall aims The overall aims of this project are to use your scripts to calculate the most effective identification system for a proteomics experiment under given (user-specified) conditions. A more effective system will be obtained if the proteomics researchers have a smaller number of peptides in the given mass (or more strictly, m/z) range, as any peptides they do find are more likely to be completely diagnostic. They might even be totally unambiguous – ie. there is only one protein which has a certain peptide with a certain m/z under certain conditions (Enzyme = Arg-c, missed cleavages = 1, N- terminal amino acid = Ser etc). You should conduct some focussed investigations with your programs, to answer the specific question: BIOL60201 Project 1 Simon Hubbard Which is the best enzyme to use considering singly charged ions (+1), and m/z ratio between 1000 and 1500 Da, and up to 1 missed cleavage (ignore all post-translational modifications, additional sequence information etc.). As well as more general investigations to see how the number of peptides per protein is affected by changing the full range of options. Data The genome sequence you need to work with will be uploaded on to the CoCalc environment, as well as the Blackboard site, in the file genome.fasta Write-up The write-up for this project is intended to be short, with two components – a group report on the pipeline as a whole (see below) and the individual programs alongside a man page description of what they are and how to use them. This should therefore be a single page of description for each script, which will be a Unix “Man page” style description of the operation of the script, its aims, options and results. This should be followed by the actual script itself (in a fixed width font please, which can be quite small if you like). The script should have lots of comments to aid readability. Each person should prepare the “Man page” for their code, and the authors should be clearly marked. You should also prepare (as a group) a short report (3 pages max) on which is the most effective enzyme, and why, as well as any general observations about what makes proteome protein identification easier (or harder) based on your observations. Please submit all work via Blackboard. This should include all the individual reports (man page + script), as well as a single report per group. There should be two assignments on Blackboard to support this, one for each of you to submit individual work, and one to submit the group report. Assessment The marks for this work will be awarded as follows: 40 marks for each script awarded to the author – this must be your own work. 10 marks as a group mark for the report and general performance of the scripts. Each person has a mark out of 50 by summing the individual and group marks Deadline: 4:00 pm Thursday Nov 14th Project Submission Guideline and Requirements – please read carefully over the page BIOL60201 Project 1 Simon Hubbard Project Submission Guideline and Requirements – please read carefully Please read the following concerning your project submission. This describes the expectation on your code and how it will be assessed. 1) Submit the code and project report on Blackboard – I will make a submission area for individuals, and one for each group. Please submit the code/project in a text file or word doc so we can annotate them and return them to you for feedback. 2) Please also upload your code to CoCalc.com. You will each have been assigned a directory with your group name (i.e. Group 1 has a folder called “group01”), so you should be able to copy your code into this folder. It should be a standard alone piece of python code, please use the following naming convention: group1_task2.py. In total there will therefore be four pieces of code in the folder. If you like, you can also include some of the actual output you have generated when running it. This might be useful for us when we are checking we have run it correctly. 3) Please upload a separate “man page” document to CoCalc in your relevant group folder – one for each task. This could be a plain text file or a word doc, but please use a fixed width font like Courier, so it looks like a real one. 4) We will evaluate the code by opening up a terminal window on CoCalc and running the code from the command line. We will expect it to be python3 compliant and run something like: $ python3 group3_task2.py –m 1 –i input.fa –o results.out or something very similar (this example is obviously for task2). There should be command line options for the user and we will also expect that there is some error reporting and/or usage if no options are supplied, or incorrect/incompatible parameters are given. We would not expect there to be a need for direct user interaction (ie. Responding to questions at the command line) when the scripts are run and all input/parameters are dealt with by the program from command line input. This would in principle allow the whole pipeline to be automated, run in the background, or scheduled as a job – a useful feature in bioinformatics. 5) Marks are awarded for the whole pipeline, based on functionality, ease of use, appropriate results, warnings/error reporting and indeed whether we can generate output from task 4 that can be plotted/analysed further. For your information, this is mark scheme we will use: Individual scripts Code features Appropriate use of basic constructs (strings, lists, dicts, loops, conditionals, etc) …………………………. 10 Readability, comments, structure, modularity …………………………………………………………………………. 10 Functionality, few or no errors, options, warnings, error trapping ……………………………………………… 15 Man page: correct use, description of function, options, example……………………………………………….. 5 Total individual mark ……………………………………………………………………………………………………………. 40 Group report marking scheme (marks possible) Report (content, results, conclusions, presentation) ………………………………………………………… out of 5 Program suite functionality (does whole pipeline work? Options? etc) ……………………………….. out of 5 Total group mark ……………………………………………………………………………………………………….. out of 10 Overall mark ……………..
………………………………………………………………………………………………. out of 50

辅导案例-BIOL60201

Related

Previous Post辅导案例-IERG4210

Next Post辅导案例-CSCC11

Author admin