MS Excel’s Science SNAFUs -- Bug or User Error?
Microsoft Excel causes problems in genetic research. That’s the claim of three researchers from an Australian institute, who discovered almost 20% of data sets contained errors introduced by Excel.
The problem is Excel being “clever”—guessing the type of data it’s being asked to import. Unfortunately, some gene names look like dates. And other common data looks like floating-point numbers.
But so what? You don’t do genetics research, but what can we learn? In today’s IT Newspro, we look at the implication for IT pros—and it’s not good.
What’s the craic? Aunty BBC’s anonymous scribblers say http://www.bbc.co.uk/news/technology-37176926:
Researchers…claim that the spreadsheet software automatically converts the names of certain genes into dates. [They] claimed the problem is present in “[20%] of papers” that collated data in Excel.
“Excel is able to display data and text in many different ways. Default settings are intended to work in most day-to-day scenarios,” a spokeswoman for [Microsoft said]. “Excel offers a wide range of options, which customers…can use to change the way their data is represented.”
The Excel gene renaming issue…was first cited by the scientific community back in 2004. … The problem has “increased at an annual rate of 15%” over the past five years.
Please, let’s not call it Excelgate, OK? So Simon Sharwood snarkily says Excel hell:
The Baker IDI Heart & Diabetes Institute [published] a paper titled Gene name errors are widespread. … It’s not hard to change the default format of Excel cells. … Much of the problem…is therefore between scientists’ ears, rather than…Excel.
The paper offers two workarounds. One is to use Google Sheets. … The authors also cooked up scripts to find Excel errors…on Sourceforge.
It’s not just mis-identifying gene names as dates. As noted by Peter Gothard—More ham-fisted fun with everyone’s favourite spreadsheet:
RIKEN identifiers are automatically converted…into floating point numbers. … For example from accession 2310009E13 to 2.31E+13.
The problem is obvious: you don’t mess with genetics. [It can lead] to Jurassic Park. … We could be looking at an army of giant Clippys…if we’re not too careful.
Bad for academia, but what about enterprise IT? A collection of commentators comment thuswise:
What surprises me most is that YYYY-MM-DD still isn’t the commonly used/accepted format. I work in a [European] company that does a lot of business with US companies. … DD-MM-YY v. MM-DD-YY format ****ups have resulted in problems [very] often. And even now there are many forms…where DD-MM-YY is the only accepted format.
Another example: exported lists of usernames and passwords. … The passwords were 8-character random hex strings. It worked fine 99.9% of the time – except for ones which happened to be like “123456e8” which Excel had munged into floating-point.
60,000 distinct 14-digit ID numbers…all coming up as 5.42342E14 and when reformatted as number…as 54234200000000. You mean that’s not what was desired?
In my…work I get spammed by Excel worksheets.
1. Excelsheets with minutes from meetings! (use word)
2. Excelsheets with Rack Diagrams (use Rackbuilder)
3. Excelsheets with asset inventories (use some sort of Database)
Don’t even get me started on phone numbers [in] international format—with a ‘+’ at the beginning. Excel helpfully [sees] that as a formula. … Clippy is dead, and come back to haunt us as a spreadsheet poltergeist.
So what do the researchers have to say for themselves? G’day, Mark Ziemann, Yotam Eren, and Assam El-Osta—Gene name errors are widespread:
Inadvertent gene symbol conversion is problematic. … Supplementary [Excel] files are an important resource…that are frequently reused. … There is no way to permanently deactivate automatic conversion…in MS Excel…LibreOffice Calc or Apache OpenOffice Calc. [But] Google Sheets did not convert any gene names to dates or numbers.
There are undoubtedly many more instances of…errors in journals outside of the 18 we screened. … Conversion errors…should be easy to avoid if researchers, reviewers, editorial staff and database curators remain vigilant.
Excel considered harmful. Sky falling. Film at 11. But “This isn’t an Excel error,” comments cschulz:
It is an operator error. … Any data submitted for publication should be double-checked … This study indicates lack of diligence [that] should have been caught ahead of publication.
It must be time for a “get off my lawn” riposte. Hear hearme0 out:
This problem increasing 15% is [a] result of a 15% increase in non-thinkers and dumb*****. … There is no excuse! … Sadly, that’s how the youth is today.
I do fault Excel itself because these errors are pervasive. They could have better structured the data imports…or asked users for more feedback or have the import do a pass over the entire datasets checking for outliers that may suggest a different type.
When a critical mass is “doing it wrong,” it becomes pointless…to point fingers at users. Tools are supposed to be useful.
More great links from Petri, IT Unity, Thurrott and abroad:
- Microsoft Delivers Another Bad Update, Breaks PowerShell with Latest Patch
- Filtering PowerShell with the Where Method
- Extracting text from an image in OneNote
- No, Laptops Using Intel’s Next Budget Processors Do Not Suffer from One Big Flaw
- Microsoft, It’s Time for a Reliable Computing Initiative
- Dropbox prompting users to reset passwords
- Previously: Microsoft Windows 10 vs. EFF Privacy Voice — Fight!
You have been reading IT Newspro by Richi Jennings, who curates the best bloggy bits, finest forums, and weirdest websites… so you don’t have to. Hatemail may be directed to @RiCHi or [email protected]. Ask your doctor before reading. Your mileage may vary. E&OE.
Main image credit: Christoph Bock (cc:by-sa)