Showing headlines posted by Bob_Mesibov
Dog and cat data
The Australian government's open data portal has a surprisingly large amount of data on dogs and cats. In this post I look at five of the datasets with command-line tools.
How to choose special characters, revisited
There's no euro symbol on my keyboard, but I can enter that character in any document or in my terminal with Ctrl + Shift + u + 20ac. I can do the same with "umlaut a" (00e4) and "cedilla c" (00e7) and the degree symbol (00b0) and... Wait! Who am I kidding? There's no way I can remember all those Unicode code points. For this reason I wrote a script for quick and easy retrieval of my most-often-used special characters from a GUI.
The trouble with Windows CRLF
Windows line endings are a pain in the ... terminal. They muck up the operations of AWK, comm, diff, echo, grep, join, paste, read, rev, sed and tr.
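One common workaround is to strip the carriage return from the end of each line before running any of those tools. The sketch below is not taken from the post; the function name strip_cr is hypothetical, and it assumes the only unwanted carriage returns are the ones at line ends.

```shell
# strip_cr (hypothetical name): remove a carriage return found at the
# end of each line, leaving Unix-style LF line endings.
# Reads the named files, or standard input if no file is given.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}

# Usage: strip_cr windows_table.tsv > unix_table.tsv
```

Carriage returns in the middle of a line are left alone, which is usually what you want when only the line endings are the problem.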
Data with bulges
Data analysis sometimes turns up unexpected "bulges" in the value of data items. Forensic auditors are trained to look for such things in business and banking accounts, because a bulge might be evidence of fraud or embezzlement. In the following three examples, bulges appear for more innocent reasons.
Two special data validations
All data validations are special cases. You can always identify data "of the wrong sort" that you want to exclude from data processing, but how do you define "right" and "wrong"? It depends! This post explains two "special" validations.
Data from dingbats: copying down
Copying down is easy in a spreadsheet, but it's also possible on the command line. In this post, copying down is used to repair a messy table.
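As a rough illustration of copying down (not the post's own code), the AWK sketch below fills each blank cell in the first column with the last non-blank value above it. It assumes a tab-separated table; the function name fill_down is hypothetical.

```shell
# fill_down (hypothetical name): copy the last non-empty value of
# field 1 into any following records where field 1 is empty.
# Assumes tab-separated input.
fill_down() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 != "" { last = $1 }      # remember the latest filled cell
         $1 == "" { $1 = last }      # fill a blank cell from above
         { print }' "$@"
}

# Usage: fill_down messy_table.tsv > repaired_table.tsv
```

A blank cell in the very first record stays blank, because there is nothing above it to copy.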
Fancy numbering of records
With the "nl" and "uuidgen" commands and AWK, you can number a list of records any way you like on the command line.
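For a flavour of what AWK alone can do here, the sketch below prepends a prefixed, zero-padded serial number to each record. It is only an illustration under my own assumptions: the function name number_records, the four-digit width and the tab separator are not from the post.

```shell
# number_records (hypothetical name): prepend PREFIX plus a 4-digit,
# zero-padded serial number and a tab to every record.
# Usage: number_records PREFIX [file...]
number_records() {
    prefix=$1
    shift
    awk -v prefix="$prefix" '{ printf "%s%04d\t%s\n", prefix, NR, $0 }' "$@"
}

# Usage: number_records REC list.txt
```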
Getting data out of Excel safely
Excel is perfectly OK for what it does, and millions of people happily use Excel every day. But when Excel data get exported for use in various other applications, sometimes Bad Things Happen.
Comparing fields across two tables
It's a shell-user's axiom: if you find yourself typing certain commands again and again, script them. This script saves me time when checking if the contents of a field have changed when data are moved from one data table to another.
Reformatting a list, cleverly
A recent Stack Overflow problem was solved with ingenious commands from two AWK experts. In this post I explain the solutions in detail.
Parsing scientific names
Scientific names like "Hoplatessara luxuriosa (Silvestri, 1895)" are much harder to parse than personal names, but "gnparser" can do the job on the command line.
Horizontal sorting within a field
There are two different ways to sort a field "horizontally" on the command line, but neither of them is simple.
Drugs on the command line
A publicly available dataset on registered drugs from the US Food and Drug Administration is a low-quality mess.
Changing the month format: a fairly general solution
The same month can have different but perfectly valid formats, like September, Sep, 9, 09 and ix. Conversions between formats are easier with a simple table of equivalents.
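The idea of a table of equivalents can be sketched with two parallel AWK arrays, one for month names and one for Roman numerals. This is an assumption-laden illustration rather than the post's solution: the function name month_to_number is hypothetical, and it only converts towards the two-digit form.

```shell
# month_to_number (hypothetical name): convert a month given as a full
# name, 3-letter abbreviation, number (9 or 09) or lower/upper-case
# Roman numeral into a two-digit month number.
# Exits with status 1 if the input is not recognised.
month_to_number() {
    awk -v m="$1" 'BEGIN {
        n = split("January February March April May June July August September October November December", full, " ")
        split("i ii iii iv v vi vii viii ix x xi xii", roman, " ")
        key = tolower(m)
        for (i = 1; i <= n; i++) {
            name = tolower(full[i])
            if (key == name || key == substr(name, 1, 3) ||
                key == roman[i] || key == (i "") || key == sprintf("%02d", i)) {
                printf "%02d\n", i
                exit
            }
        }
        exit 1
    }'
}

# Usage: month_to_number Sep
```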
Has the rainfall pattern in my hometown changed?
From 1916 to 2015 there were only minor ups and downs in the number and intensity of rainfall events. Interesting swings in event length might explain why older locals say "The rain's changed".
How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons?
On the command line, you can do sums like this either by looking just at the numbers or by ignoring the parts that aren't numbers, and those aren't quite the same thing.
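The two approaches might look something like the sketch below. Both function names are hypothetical, the commands are my guess at the general idea rather than the post's own, and both treat only runs of digits as numbers. Note that grep's -o option is widely available but is not required by POSIX.

```shell
# sum_by_extracting (hypothetical name): pull out each run of digits
# with grep -o, one per line, then add them up.
sum_by_extracting() {
    grep -oE '[0-9]+' | awk '{ total += $1 } END { print total + 0 }'
}

# sum_by_stripping (hypothetical name): turn everything that is not a
# digit or a newline into a space, then add up whatever fields remain.
sum_by_stripping() {
    tr -c '0-9\n' ' ' |
        awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

echo "5 apples, 3 oranges, 1 pear and 17 lemons" | sum_by_extracting
echo "5 apples, 3 oranges, 1 pear and 17 lemons" | sum_by_stripping
```

On this particular input both functions print 26.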
Putting information into a table from the table's filename
How to extract a date from a filename and add it to each record in the file. For example, you might grab the date part of a date-stamped filename and add it to the table records (assuming they don't have a date), so that the files can be combined for a time-series study.
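A minimal sketch of this task, using AWK's built-in FILENAME variable, is shown below. It rests on assumptions of mine, not the post's: the filename carries a date in YYYY-MM-DD form, the table is tab-separated, and the function name add_file_date and the placeholder "NA" are hypothetical.

```shell
# add_file_date (hypothetical name): prepend the YYYY-MM-DD date found
# in each file's name to every record of that file, tab-separated.
# Files whose names contain no such date get the placeholder "NA".
add_file_date() {
    awk 'BEGIN { OFS = "\t" }
         FNR == 1 {
             name = FILENAME
             sub(/.*\//, "", name)          # drop any directory part
             date = "NA"
             if (match(name, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/))
                 date = substr(name, RSTART, RLENGTH)
         }
         { print date, $0 }' "$@"
}

# Usage: add_file_date obs_2015-03-02.tsv obs_2015-03-03.tsv > combined.tsv
```

Because the date is recomputed at the first record of each file (FNR == 1), several date-stamped files can be processed in one call.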
Finding changepoints in a list, revisited
Use a simple AWK command to locate the places in a list where the value of a data item suddenly changes.
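One plausible reading of "changepoint" is a line whose value differs from the line before it. Under that assumption (the post's own command may differ), a sketch with the hypothetical function name changepoints could be:

```shell
# changepoints (hypothetical name): print the line number and value of
# the first line, and of every later line that differs from the line
# before it. Values are compared as strings.
changepoints() {
    awk 'NR == 1 || ($0 "") != (prev "") { print NR "\t" $0 }
         { prev = $0 }' "$@"
}

# Usage: changepoints list.txt
```

The first line is always reported, since it marks the start of the first run of values.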
Unwrap your fasta
FASTA is a plain-text file format for DNA sequences, but the sequences are often wrapped to a fixed line length. This post explains 3 Linux command-line methods for joining the sequence lines end-to-end.
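One widely used AWK approach to unwrapping is sketched below; it may or may not be among the three methods the post covers. The function name unwrap_fasta is hypothetical, and the sketch assumes header lines start with ">" and that the file has Unix line endings.

```shell
# unwrap_fasta (hypothetical name): print each header line unchanged
# and join the sequence lines that follow it into a single line.
unwrap_fasta() {
    awk '/^>/ { if (NR > 1) printf "\n"; print; next }   # header line
         { printf "%s", $0 }                              # sequence chunk
         END { if (NR > 0) printf "\n" }' "$@"
}

# Usage: unwrap_fasta wrapped.fasta > unwrapped.fasta
```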
Avoiding senior moments with command-line functions
The trick is to make the documentation available on the CLI. Also, how to get a "yes" or "no" answer from grep.
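For the grep part, a simple way to get a plain "yes" or "no" is to test grep's exit status with the -q (quiet) option. The wrapper below is a sketch with a hypothetical function name, has_match, and is not necessarily how the post does it.

```shell
# has_match (hypothetical name): print "yes" if PATTERN matches at
# least one line of the input, otherwise "no".
# Usage: has_match PATTERN [file...]
has_match() {
    pattern=$1
    shift
    if grep -q -e "$pattern" "$@"; then
        echo yes
    else
        echo no
    fi
}

# Usage: has_match 'bet' words.txt
```

With no file argument the function reads standard input, so it can sit at the end of a pipeline.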