
A recent paper in Cladistics, “An update on DNA barcoding: low species coverage and numerous unidentified sequences”, reviews the state of the global DNA barcoding effort. It should be a real eye-opener for everyone who loves NCBI GenBank and the openness of science, and especially for taxonomists.

DNA sequence-based identification of organisms started during the 1980s and is still an ongoing process. It rests on the following idea:

  1. A previously identified specimen or organism gets a portion of its DNA sequenced, and the sequence is made publicly accessible.
  2. Other researchers can then sequence their own samples and check them against the database to identify them, even if they lack taxonomic expertise.

However, this requires that the first researcher knows how to identify the specimen unambiguously.
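To make the two steps above concrete, here is a minimal Python sketch of identification by sequence similarity. It is only an illustration under stated assumptions: the reference sequences and species names are made up, the sequences are assumed to be already aligned and of equal length, and the 90% threshold is arbitrary. It is not BLAST, which uses a far more sophisticated local alignment search.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def identify(query: str, reference: dict, threshold: float = 90.0):
    """Return (best matching name, score), or (None, score) if nothing passes the threshold."""
    best_name, best_score = None, 0.0
    for name, seq in reference.items():
        score = percent_identity(query, seq)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Hypothetical reference database: names assigned by the FIRST researcher.
REFERENCE = {
    "Species A": "ACGTACGTACGTACGTACGT",
    "Species B": "ACGTTCGTACCTACGAACGT",
}

# The SECOND researcher's unidentified sample.
QUERY = "ACGTACGTACGTACGTACGA"

if __name__ == "__main__":
    print(identify(QUERY, REFERENCE))
```

Note that the answer can only be as good as the names in the reference: if the first researcher labelled the specimen wrongly, or only as “sp.”, the second researcher inherits that problem.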

The idea is old, but the name is new!

During the first half of the last decade, an international effort to “barcode” all organisms on Earth was started, based on the idea described above, which in turn rests on years of fine-tuning by biologists and computer scientists (who developed BLAST and similar applications).

These researchers propose that sequencing a 650 base pair region of mitochondrial DNA is enough to identify all animals, because of the peculiarities of that sequence. They claim to be the first to have developed the idea, ignoring the efforts of earlier researchers, and their followers say that they have a “father of DNA barcoding”. I agree that they were the first to propose the NAME, but I wonder how it could be their NOVEL idea when the original BLAST algorithm (proposed in 1990) and the idea of identification by sequence similarity already existed before this “barcoding” business.

Let’s come to the point

The paper published in Cladistics examines the claims of these “barcoders” and finds some problems. The authors ask:

  1. Has the project lived up to its initial promises? (the species coverage problem)
  2. Is it progressing scientifically? (taxonomy-wise, is it 100% right?)

Well, the answers are in the negative.

They find barcodes for ~60,000 “Metazoa” species in the NCBI database, which is well below the estimated 10-20 million species on Earth (some estimates are lower, but see the link). This is despite substantial government funding for the barcoding initiative. The paper says that the barcoding consortium received $80 million from the Canadian government, and we know of many other sources from which even small barcoders get tens of millions.

The authors searched the GenBank COI records for the keyword “barcoding” and removed every record carrying it. Only about 16,000 species dropped out of the list of 60,000 (these are species counts, not total COI records). This means that the remaining ~44,000 species were sequenced by general systematics projects, most probably not funded by any barcoding initiative.
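The filtering step can be sketched in a few lines of Python. This is my own toy illustration, not the authors' actual pipeline: the record layout (`species` and `keywords` fields) and the example records are hypothetical, and real GenBank records would first have to be parsed into this form.

```python
def species_lost_after_filter(records, keyword="barcoding"):
    """Species that vanish from the list once every record with the keyword is removed."""
    all_species = {r["species"] for r in records}
    # Keep only records whose keyword field does NOT mention the keyword.
    kept_species = {
        r["species"] for r in records if keyword not in r["keywords"].lower()
    }
    return all_species - kept_species


# Hypothetical COI records (field names are assumptions for this sketch).
RECORDS = [
    {"species": "Danio rerio", "keywords": "BARCODING"},
    {"species": "Danio rerio", "keywords": ""},          # also sequenced elsewhere
    {"species": "Puntius sp. 1", "keywords": "barcoding"},
    {"species": "Channa striata", "keywords": "phylogeny"},
]

if __name__ == "__main__":
    lost = species_lost_after_filter(RECORDS)
    print(sorted(lost))  # species known ONLY from keyword-tagged records
```

A species counts as “lost” only if every one of its records carries the keyword, which is why the count is over species and not over individual COI records.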

Fishes and birds were supposed to be completely barcoded by 2012, according to the initial proposal. However, the FISH-BOL website says that barcoding has been completed for only ~8,500 of the ~31,000 fish species in total. Only ~4,200 fish species are present in NCBI, so more than 4,000 species are held in closed access.

The second distressing finding is that there are many “unidentified species” in the NCBI records. Out of 571,997 COI records in NCBI, only 26% had proper names, that is, were identified to species level. That means a very high 74% were not identified to species level, so three quarters of the barcodes produced are useless and squander public money, right*?

The paper highlights a case where a single “Diptera sp.” entry in NCBI has 1000 sequences, produced by barcoding projects, with a genetic distance of 1% or less among them. What a waste of public money.
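For readers who prefer absolute numbers to percentages, the arithmetic can be checked with a short Python snippet. The inputs are the figures quoted above; the 26% is itself rounded, so the resulting counts are approximate.

```python
TOTAL_COI_RECORDS = 571_997   # COI records in NCBI, as quoted above
PERCENT_IDENTIFIED = 26       # identified to species level (rounded figure)

# Integer arithmetic avoids floating-point surprises.
identified = TOTAL_COI_RECORDS * PERCENT_IDENTIFIED // 100
unidentified = TOTAL_COI_RECORDS - identified

if __name__ == "__main__":
    print(f"identified to species:     ~{identified:,}")
    print(f"not identified to species: ~{unidentified:,}")
```

So roughly 149,000 records carry a species name and roughly 423,000 do not.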
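To show what “a genetic distance of 1% or less” means, here is a minimal Python sketch using the uncorrected p-distance (the proportion of sites that differ between two aligned sequences). This is an assumption on my part: the paper may have used a different distance measure, and the sequences below are invented for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance: fraction of differing sites in two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def all_within(seqs, max_distance: float = 0.01) -> bool:
    """True if every pair of sequences differs by at most max_distance."""
    return all(p_distance(a, b) <= max_distance for a, b in combinations(seqs, 2))


# Made-up 100 bp sequences.
SEQ_A = "ACGT" * 25
SEQ_B = "T" + SEQ_A[1:]        # 1 difference in 100 sites -> 1% distance
SEQ_C = "TTTTT" + SEQ_A[5:]    # 4 differences from SEQ_A -> 4% distance

if __name__ == "__main__":
    print(p_distance(SEQ_A, SEQ_B))
    print(all_within([SEQ_A, SEQ_B]))
    print(all_within([SEQ_A, SEQ_B, SEQ_C]))
```

Sequences this close to one another most likely come from the same species, so sequencing a thousand of them without ever naming the species adds very little.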

Readers of zoospooks are also requested to read the blog by Roderic M. Page to understand the problem of having sequences without proper scientific names in public databases, what such nameless sequences mean, and how they are detected. He is one of the biggest scientists in my field and I am just a budding blogger/scientist, so you will benefit more from reading his blog.

In short, DNA barcoding has performed below par, and the quest to barcode all species has failed, at least until now. The main problem could be that the barcoders did not have trained taxonomists in their ranks. They are against taxonomy based on morphological identification, so taxonomists distance themselves from barcoding, and barcoders know too little taxonomy to identify a specimen correctly to species level. If barcoders say that the “sp.” records in the databases represent cryptic diversity that they discovered, then why are there 1000 specimens separated by less than 1% genetic distance? I would also ask those people to read up on species delimitation methods.

To save itself, barcoding needs to:

  1. Include proper taxonomists (with proven credentials) in each and every project that is initiated, however small.
  2. Deposit photographs of ALL “barcoded” specimens on the project website and on individual researchers’ websites, with public access.
  3. Put all data in NCBI, or make BOLD open access.
  4. Avoid depositing unwanted sequences (those from unidentified species).
  5. Discourage the sequencing of unidentified specimens.

These are merely my suggestions, but for barcoding to be useful to the public, barcoders need to clean up a lot: (1) use proper expertise and (2) open up their data. Then they can try for another five years, and let’s see what changes from this initial five-year phase of the project. Regarding the title of this post: barcoding unidentified specimens and introducing errors into a precious database like NCBI should be discouraged. Barcoders should understand that although barcoding is a “people’s” choice technology, it carries certain responsibilities towards society and fellow scientists. I agree that it is very useful for cataloguing biodiversity, but it should be done in a better and more open manner, so that more people benefit and less human effort is lost. Also read my post on the new Pristolepis to see what happens when bad taxonomy and sequencing technology join forces.

(*This is my opinion and has nothing to do with the paper cited)


Shiyang Kwong, Amrita Srivathsan, Rudolf Meier (2012). An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics. DOI: 10.1111/j.1096-0031.2012.00408.x