Showing headlines posted by Bob_Mesibov
« Previous ( 1 2 3 4 5 6 7 8 ... 10 ) Next »There's data missing - please explain
What should data enterers do about missing data? The database software might only help a little. For example, MySQL users enter "NULL" whenever a data item doesn't have a known value. NULL gets special treatment in MySQL: it's ignored, or understood to equal FALSE. An empty string can also be a data item in MySQL, and it's treated differently from NULL. An empty string is known to be a blank, while NULL isn't known to be anything. But does blank really mean blank?
The curious world of check digits
In many standardised numerical codes, one or more digits are special. They're called check digits and they can be used to check that the code hasn't changed due to human or computer error. This post shows how they work and includes an AWK function for checking one kind of numerical code.
Batch triangulation on the command line
Nobody does triangulations anymore? Actually, some people do, and this code will help. It locates a point given the distances to two other, already located points.
CSV to table, table to CSV
The title of this post might sound a little strange. A data CSV is already a table, isn't it? With fields separated by commas? True, but the data ops demonstrated here aren't so simple. They're easily done, however, with GNU datamash and AWK.
The little museum and its data
A small room in the basement was cleared out to make room for a server rack and a desk and chair. The City hired an IT person to network the various museum computers, connect them to the Internet and build a museum website with nice words and pictures. The IT person did not actually work very hard. The servers all ran Linux and had years of uptime between cable renewals. When the IT person was in the server room, which was not often, she sat reading sysadmin blogs and learning new computer languages.
The Incrementing Fill-Down Error
WGS84 is the geodetic datum used in most GPS units. Updates to that datum aren't called WGS85, WGS86 or WGS87; they're all still WGS84. So why do I see those larger version numbers in my data audits? I blame spreadsheets and a particular kind of filling-down. Here's how to find this error on the command line.
Mojibake madness
This is the fifth blog post in a series about character encoding mishaps. The samples explained here are truly impressive gibberish.
A data checker's checklist
Comments welcome on this draft list of topics. A lot can go wrong when a "simple" data table is compiled in plain text, but I think I've listed most of the pitfalls.
Building a molar mass calculator
There are parsers available in various programming languages for breaking chemical formulas into elements and their proportions. Here I do the job with AWK and calculate the formula's mass per mole.
How to fix one2many data issues
The problem is that each entry in a certain field should have only one corresponding entry in a second field. Instead, the second field has the wrong entry or no entry at all. This post explains how to find and repair these errors on the command line.
Hunting Excel date twins
Excel files can sometimes contain the same record with dates four years and one day apart. These pseudo-duplicates are the result of Microsoft's choice of starting date for serial date numbers. The issue is explained in this post and I demonstrate one way to find Excel date pairs with command-line tools.
DIY primary/foreign key relationships, again
One table has a primary key field and the other has a foreign key field that should refer back to the first table. This isn't always the case, because the tables don't't always come from a database with referential integrity. Sometimes the two tables have their primary and foreign keys entered manually. This post describes a shell script that checks for this and related problems.
Four kinds of data anomalies
Datasets sometimes contain perfectly well-formed items that really don't belong with the other items in their field. These anomalous items are typically out of range, out of place, out of match or out of date, as illustrated below, but they can be detected with command-line tools.
A sunset surprise
If the days are getting shorter after the summer solstice, why are the sunsets getting later? Data graphics help with the explanation.
Converting a list to a presence/absence table
Structured and tidy data is great to work with, because on the command line you can reliably convert one structure into another. This post demonstrates how you can turn comma-separated lists into a presence/absence pivot table.
How to find the missing parts of a series
My wife has a data table with a unique serial number for each of its records. The table gets frequent edits and she wanted to check which of the numbers might be missing as a result of past deletions. Finding the missing numbers on the command line wasn't hard, but what if the series is more complicated than just plain numbers?
ASCII score bars and a gorblimey command
The problem is to build a string of asterisks and hyphens for a score out of 10. if the score is 6 the string is ******----. If the score is 4 the string is ****------. Three command-line solutions are demonstrated here.
Spreadsheet annoyance no. 3: quotes have priority
Spreadsheets confuse non-dates with dates, and automatically interpret certain number strings with 2 colons as [h]:mm:ss. Grrr. I've now found another example of spreadsheets doing unexpected things with plain-text data. By default, spreadsheet programs see quotes (") as string delimiters. If there's a quote at the start of a data item but none at the end, the results are... interesting.
Form text and placeholders
The three miscellaneous demonstrations in this post all have something to do with form text and placeholders. Form text is text saved as a template and variable text is added as needed. A good example is a form letter. The basic text of the letter is stored as a document, and the date, the addressee and the salutation are variable bits that get added before the letter is sent. Sometimes the missing text is marked with a placeholder like "{DATE}". Form text and placeholders in plain-text documents are easy to deal with on the command line, as shown below.
How to build a multi-file fields concordance
Fields in a data table can be specified with either a field name or a field number. Name and number don't have to be the same in different files, so that a "FirstName" field could be field 3 in table A and field 4 in table B. Here I show how AWK can be used to build a concordance table that lists the field numbers for the same field names in two or more tables.