Parsing GenBank to FASTA with yield in Python (x, y)


So far I have written and documented my own function, but when I test the code I run into issues, and I do not really know whether it is correct. I have found some solutions that use Biopython or other libraries, but I really want to make this work with yield.

    def parse_GB_to_FASTA(lines):
        # Default label and sequence
        curr_label = None
        curr_seq = ""
        # True while reading sequence lines between ORIGIN and '//'
        in_seq = False
        for line in lines:
            # A line starting with ACCESSION begins the label of a record
            if line.startswith('ACCESSION'):
                # If a label is already set, output the previous record first
                if curr_label is not None:
                    yield curr_label, curr_seq
                    curr_seq = ""
                # Drop the 12-character keyword column and keep the number
                curr_label = '>' + line[12:].strip()
            # ORGANISM is a sub-keyword, indented by two spaces
            elif line.startswith('  ORGANISM') and curr_label is not None:
                # Add the organism name to the label line
                curr_label = curr_label + " " + line[12:].strip()
            # ORIGIN marks the start of the sequence region
            elif line.startswith('ORIGIN'):
                in_seq = True
            # '//' marks the end of the record
            elif line.startswith('//'):
                in_seq = False
            elif in_seq:
                # Keep only the letters: drops the position numbers and spaces
                curr_seq += ''.join(ch for ch in line if ch.isalpha()).upper()
        # When there are no more lines, yield the last label and sequence
        if curr_label is not None:
            yield curr_label, curr_seq

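Since the generator yields (label, sequence) tuples, producing FASTA text is a matter of consuming them. The following is a minimal sketch under stated assumptions: write_fasta is a hypothetical helper name, and the 60-column line width is an assumed convention, not something the question specifies.

```python
import io


def write_fasta(records, out, width=60):
    """Write (label, sequence) pairs to `out` as FASTA text.

    `records` is any iterable of (label, sequence) tuples, such as the
    output of a yield-based parser. `width` is an assumed line length.
    """
    for label, seq in records:
        # Make sure the header line starts with '>' exactly once
        header = label if label.startswith('>') else '>' + label
        out.write(header + '\n')
        # Wrap the sequence into fixed-width lines
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + '\n')


# Hand-made tuple standing in for the generator's output
buf = io.StringIO()
write_fasta([('>U49845 Saccharomyces cerevisiae', 'GATCCTCCAT' * 7)], buf)
print(buf.getvalue())
```

Because the helper accepts any iterable, the generator can be passed to it directly and records are written one at a time without holding the whole file in memory.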
I have often worked with GenBank files and found (years ago) that the Biopython parser was too brittle to make it through hundreds of thousands of records (at the time) without crashing on an unusual record.

I wrote a pure Python (2) function that returns the next complete record from an open file, reading it in 1k chunks and leaving the file pointer ready to get the next record. I tied this in with a simple iterator that uses the function, and with a GenBank record class that has a fasta(self) method to produce a FASTA version.

YMMV, but the function that gets the next record is given here and should be pluggable into any iterator scheme you want to use. As for converting to FASTA, you can use logic similar to your ACCESSION and ORIGIN handling above, or you can get the text of a section (such as ORIGIN) like this:

    import re

    # gbrText holds the text of one GenBank record
    sectionTitle = 'ORIGIN'
    searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle, gbrText,
                           re.MULTILINE | re.DOTALL)
    sectionText = searchRslt.groups(0)[0]

For sub-sections such as ORGANISM, the title must be left-padded with spaces to match the indentation in the file (2 spaces for ORGANISM, 5 for feature keys such as CDS).
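As a rough illustration of the section-based route, the sketch below pulls the ACCESSION and ORIGIN sections out of one record with the same kind of pattern and assembles a FASTA entry. The names get_section and record_to_fasta are hypothetical, and the sample record is a made-up fragment, so treat this as one possible arrangement rather than the method used in the answer.

```python
import re


def get_section(gbr_text, section_title):
    """Return the text of one top-level section of a record, or None."""
    # Start at the title and stop at the next line that begins with a
    # non-space character (the next keyword or the closing '//').
    pattern = r'^(%s.+?)^\S' % re.escape(section_title)
    match = re.search(pattern, gbr_text, re.MULTILINE | re.DOTALL)
    return match.group(1) if match else None


def record_to_fasta(gbr_text):
    """Build a FASTA string from the ACCESSION and ORIGIN sections."""
    accession = get_section(gbr_text, 'ACCESSION')
    origin = get_section(gbr_text, 'ORIGIN')
    if accession is None or origin is None:
        return None
    label = accession[len('ACCESSION'):].strip()
    # Drop the ORIGIN keyword line, then keep only the letters
    body = origin.split('\n', 1)[1] if '\n' in origin else ''
    seq = ''.join(ch for ch in body if ch.isalpha()).upper()
    return '>%s\n%s\n' % (label, seq)


# Made-up fragment for illustration only
sample = (
    "LOCUS       TEST0001\n"
    "ACCESSION   TEST0001\n"
    "ORIGIN      \n"
    "        1 gatcctccat atacaacggt\n"
    "//\n"
)
print(record_to_fasta(sample))
```

Returning None for a missing section keeps an unusual record from raising an exception, which matters when working through a large file.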

Here's my solution to the main problem:

    def getNextRecordFromOpenFile(fHandle):
        """Look in the file for the next GenBank record and return its text.

        Written for Python 2: file positions are treated as byte offsets.
        """
        cSize = 1024
        recFound = False
        recChunks = []
        try:
            # Step back one character so a '\n' just before LOCUS is re-read
            fHandle.seek(-1, 1)
        except IOError:
            pass
        sPos = fHandle.tell()
        while True:
            cPos = fHandle.tell()
            c = fHandle.read(cSize)
            if c == '':
                return None
            if not recFound:
                locusPos = c.find('\nLOCUS')
                if sPos == 0 and c.startswith('LOCUS'):
                    locusPos = 0
                elif locusPos == -1:
                    continue
                else:
                    # Skip the newline in front of LOCUS
                    locusPos += 1
                c = c[locusPos:]
                recFound = True
            else:
                locusPos = 0
            # The '\n//\n' terminator may be split across two chunks
            if recChunks and c.startswith('//\n') and recChunks[-1].endswith('\n'):
                eorEnd = 3
            elif recChunks and c.startswith('/\n') and recChunks[-1].endswith('\n/'):
                eorEnd = 2
            elif recChunks and c.startswith('\n') and recChunks[-1].endswith('\n//'):
                eorEnd = 1
            else:
                eorPos = c.find('\n//\n')
                eorEnd = eorPos + 4 if eorPos != -1 else -1
            if eorEnd == -1:
                recChunks.append(c)
            else:
                recChunks.append(c[:eorEnd])
                gbrText = ''.join(recChunks)
                # Leave the file pointer just past the end of this record
                fHandle.seek(cPos + locusPos + eorEnd)
                return gbrText
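The answer mentions wrapping the record reader in a simple iterator. The sketch below shows one way that idea could look as a Python 3 generator. It is not the author's code: iter_genbank_records and _find_locus are hypothetical names, and it swaps the seek-based repositioning for an in-memory buffer, assuming the handle is a text file positioned at the start of a line.

```python
import io


def _find_locus(buf):
    """Return the index of a LOCUS keyword at a line start, or -1."""
    if buf.startswith('LOCUS'):
        return 0
    pos = buf.find('\nLOCUS')
    return -1 if pos == -1 else pos + 1


def iter_genbank_records(handle, chunk_size=1024):
    """Yield the text of each complete GenBank record from an open file."""
    buf = ''
    while True:
        chunk = handle.read(chunk_size)
        at_eof = (chunk == '')
        buf += chunk
        if at_eof and buf and not buf.endswith('\n'):
            buf += '\n'  # lets a final '//' without a newline still match
        while True:
            start = _find_locus(buf)
            if start == -1:
                # No record start yet: keep only the last, possibly partial, line
                cut = buf.rfind('\n')
                if cut != -1:
                    buf = buf[cut:]
                break
            end = buf.find('\n//\n', start)
            if end == -1:
                # Record is incomplete: drop what precedes it and read more
                buf = buf[start:]
                break
            yield buf[start:end + 4]
            # Keep the trailing newline so the next '\nLOCUS' can match
            buf = buf[end + 3:]
        if at_eof:
            return


# Made-up one-record input for illustration
sample = io.StringIO("LOCUS       A1\nORIGIN\n        1 acgt\n//\n")
for record in iter_genbank_records(sample):
    print(record)
```

Buffering trades a little memory for simpler bookkeeping: there are no file offsets to compute, so the same code works on handles that do not support seeking.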
