
app fasta-cleaner

Transform fasta files by upper-casing all sequence characters and removing non-ACGT sequence characters

3 releases (stable)

1.0.1 Dec 12, 2024

#192 in Text processing


336 downloads per month

BSD-2-Clause

15KB
290 lines

# Fasta Cleaner

Cleans fasta files. Every run of newlines and carriage returns is replaced by a single newline. Record headers are left unchanged. Sequences are converted to upper case, and every character other than A, C, G or T is removed.
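The per-line rules above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's actual code: the function names `split_lines` and `clean_sequence_line` are hypothetical and chosen only for this example.

```rust
/// Hypothetical helper: split raw input into non-empty lines, treating any
/// run of `\n` and `\r` characters as a single line break.
fn split_lines(input: &str) -> Vec<&str> {
    input
        .split(|c: char| c == '\n' || c == '\r')
        .filter(|line| !line.is_empty())
        .collect()
}

/// Hypothetical helper: upper-case a sequence line and keep only A, C, G and T.
fn clean_sequence_line(line: &str) -> String {
    line.chars()
        .map(|c| c.to_ascii_uppercase())
        .filter(|c| matches!(c, 'A' | 'C' | 'G' | 'T'))
        .collect()
}
```

With these helpers, `clean_sequence_line("AACCcxXAA")` yields `AACCCAA`, and a line such as `.ef34` is reduced to an empty string. Header lines (those starting with `>`) would be passed through without calling `clean_sequence_line`.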

Although characters are removed, the line width of the input file is preserved. The width is guessed from the first sequence line of the input, and all subsequent sequence lines are re-wrapped to that width. Re-wrapping only moves line breaks; it never removes valid sequence characters.
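The re-wrapping behaviour can be sketched as a single pass that tracks the current output column. This is an illustrative sketch under stated assumptions, not the crate's implementation: the function name `clean_fasta` is hypothetical, and the line width is assumed to be the raw character count of the first sequence line (including characters that are later removed), which is what the example in this README suggests.

```rust
/// Hypothetical sketch of cleaning plus re-wrapping; not the crate's real API.
fn clean_fasta(input: &str) -> String {
    let mut output = String::new();
    // Assumption: guessed once, from the raw length of the first sequence line.
    let mut guessed_width: Option<usize> = None;
    // Number of characters already written on the current output line.
    let mut column = 0;

    // Any run of `\n` and `\r` counts as one line break.
    let lines = input
        .split(|c: char| c == '\n' || c == '\r')
        .filter(|line| !line.is_empty());

    for line in lines {
        if line.starts_with('>') {
            // Finish a partially filled sequence line before the header.
            if column > 0 {
                output.push('\n');
                column = 0;
            }
            // Headers are copied unchanged.
            output.push_str(line);
            output.push('\n');
            continue;
        }

        let width = *guessed_width.get_or_insert(line.chars().count());
        for c in line.chars().map(|c| c.to_ascii_uppercase()) {
            if matches!(c, 'A' | 'C' | 'G' | 'T') {
                output.push(c);
                column += 1;
                // Move the line break: start a new line once the width is reached.
                if column == width {
                    output.push('\n');
                    column = 0;
                }
            }
        }
    }

    // Terminate the last, possibly partial, sequence line.
    if column > 0 {
        output.push('\n');
    }
    output
}
```

Because valid characters are appended to one continuous stream and line breaks are inserted only when the column counter reaches the guessed width, removing invalid characters shifts the breaks but never drops a valid base.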

## Example

Input:

```
\r>WGCaC\n\nAACCcxXAA\naacc\n.ef34\nCGG\ntgtcgcgtagcgtgatcgtgtagtcgtag\r.\r>f\nTTT
```

Output:

```
>WGCaC\nAACCCAAAA\nCCCGGTGTC\nGCGTAGCGT\nGATCGTGTA\nGTCGTAG\n>f\nTTT\n
```

## Known Issues

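The line width is guessed from the first sequence line alone. If that line is shorter than the line width used in the rest of the input fasta file (for example, because the first record's sequence is shorter than one full line), all sequence lines in the output fasta file are wrapped at that shorter width.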

Dependencies

~2–9.5MB
~81K SLoC