Sorted-list or Needs List Operating Modes

This website runs in two basic modes. Sorted-list mode takes the bird names you enter and prints a taxonomically sorted list of them based on the taxonomy you selected. Needs list mode treats the names you enter as the birds you’ve seen and prints the names from the taxonomy that you haven’t yet seen. Select the operating mode by enabling or disabling the Make a Needs List option.

Except for the following, the checkbox options work the same in both modes. (These options are fully described later.)

Sorted-list mode ignores these options:

Include Subspecies
Include Other Taxa
Mark Instead of Exclude

Needs list mode ignores these options:

Keep Duplicate Lines (though duplicate unknown names still respond to this setting)
Use Taxonomy Names

Operations Common to Both Modes

I’ll first describe those things common to both operating modes.

Select Your Taxonomy

Select which taxonomy you want with the Select Taxonomy pull-down menu. The Taxonomy Info tab in the navigation bar at the top has information about each taxonomy. Each time you click the Sort button, I store the selected taxonomy as a cookie so the next time you load this page from scratch, I’ll default to that taxonomy.

Enter Your Bird Names

Either type or paste your list of bird names into the Input Names text field. Each name should appear on its own line. Blank lines are allowed but ignored as are leading or trailing spaces or tabs around the names. Duplicate names are allowed, and how they are treated depends on the operating mode and whether the Keep Duplicate Lines option is enabled or not. See below.

What Bird Names to Use

The text of each bird name you entered is matched against the text of the names in the taxonomy you selected. Here are the rules for name matching.

Matching is case insensitive.
The taxonomy files use plain apostrophes. If you enter a curly apostrophe, it is converted to a plain apostrophe before matching. If you enabled the Use Taxonomy Names option, then you’ll get plain apostrophes in the output even if you entered curly ones. If the option is disabled, then curly ones pass through to the output.
Even though the taxonomies are in English, they can contain characters not usually found in English. For example, ñ appears within Marañon Crescentchest in the eBird-Clements taxonomy. You must enter the ñ (but see Fuzzy Matching below).
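
The matching rules above can be sketched in a few lines of Python. This is only an illustration, not the site's actual code; the function name match_key and the two-entry taxonomy are made up for the example, and only the right-hand curly apostrophe is handled.

```python
def match_key(name):
    """Build a comparison key from a name (hypothetical helper)."""
    name = name.strip()                 # leading/trailing spaces or tabs are ignored
    name = name.replace("\u2019", "'")  # curly apostrophe -> plain apostrophe
    return name.casefold()              # matching is case insensitive


# Made-up two-entry taxonomy, indexed by comparison key.
taxonomy = ["Sora", "Rüppell's Chat"]
lookup = {match_key(entry): entry for entry in taxonomy}

print(lookup.get(match_key("  SORA ")))              # matches despite case and spaces
print(lookup.get(match_key("rüppell\u2019s chat")))  # curly apostrophe still matches
print(lookup.get(match_key("Ruppell's Chat")))       # missing ü: no exact match
```

Note that the accented character is not folded away, which is why the last lookup fails without fuzzy matching.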

If your taxonomy has entries which are not simple species names (e.g., eBird-Clements has Black x Eastern Phoebe (hybrid)), then you’ll have to enter the text as listed in the taxonomy. Again, fuzzy matching may help you.

CSV Data Support

The bird names you enter may contain CSV data. (CSV data is almost always exported from a spreadsheet or database and represents the data as a table of rows and columns but without any of the formatting or formulas. See this Wikipedia article for more details.) I require that your bird names appear in the first column (before the first comma). Any data after the first comma is passed to the output unchanged. If your CSV input data’s first row is a header row, then enable the First Row is a CSV Header option, and the first row is passed to the first output row unchanged.
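
Here is a minimal sketch of the "name before the first comma" rule. The helper name split_first_column is invented for this example, and I am assuming a plain split on the first comma with no special handling of quoted fields.

```python
def split_first_column(line):
    """Split an input line into (bird name, untouched remainder)."""
    name, comma, rest = line.partition(",")  # split at the first comma only
    return name.strip(), comma + rest        # remainder keeps its leading comma


print(split_first_column("Sora,heard only,2021"))
print(split_first_column("Sora"))
```

Keeping the comma with the remainder means the output line can be rebuilt by simply joining the (possibly replaced) name and the remainder.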

Click Sort

Once you have entered your bird names and enabled the options you want, click Sort. I try to match bird names you entered to those in the selected taxonomy (including fuzzy matches if enabled) and, based on the operating mode, write the results to the Output Names text field. Any names I cannot match go to the Unknown Names text field.

Including Taxonomy Numbers in the Output

You can include the taxonomy sequence number for each name in the Output Names by enabling the Include Taxonomy Numbers option. When enabled, I print a taxonomy number followed by a comma before each output line. If you’ve enabled the First Row is a CSV Header option, then I add Taxnum, before the CSV header to account for the extra column of sequence numbers I add.

The taxonomy number is just the line number of the bird’s entry in the selected taxonomy file and doesn’t follow any other taxonomy sequence numbering. Between different taxonomy files, the same species will almost certainly have different numbers.
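
As a rough illustration of how line-number-based taxonomy numbers could be attached, here is a sketch. The function number_output, the tiny taxonomy, and the choice of counting from 1 are assumptions for the example.

```python
def number_output(matched, taxonomy, header=None):
    """Prefix each output line with its taxonomy line number.

    matched  -- list of (taxonomy name, output line) pairs
    taxonomy -- names in file order; numbering from 1 is an assumption
    header   -- optional CSV header row
    """
    taxnum = {name: n for n, name in enumerate(taxonomy, start=1)}
    lines = [f"{taxnum[name]},{line}" for name, line in matched]
    if header is not None:
        lines.insert(0, "Taxnum," + header)  # extra column needs a heading
    return lines


print(number_output([("Sora", "Sora,seen")], ["Mallard", "Sora"], header="Name,Status"))
```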

Line Counts

Clicking on Sort updates counts below each text field.

For the Input Names field, this is the number of non-blank input lines. Note that it counts lines, not names. Any unknown names are counted as well, and this doesn’t depend on whether fuzzy matching is enabled. The only exception to counting input lines is if you’ve enabled the First Row is a CSV Header option, in which case the header row is not counted.

The counts for the Output Names field depend on the operating mode, so see each mode’s description below.

For the Unknown Names field, a simple count of lines in the field appears below the field.

Sorted-list Mode

In this mode, the input names are matched and sorted according to the selected taxonomy and sent to the output. Unmatched names go to the Unknown Names text field.

Enabling the Use Taxonomy Names option replaces your matching input names with the name from the taxonomy. This is mostly to fix upper-/lower-case irregularities. For example, with this option enabled, entering SORA,seen would give you Sora,seen. This option only affects the bird name and not any CSV data that follows the name, so SORA,SEEN gives you Sora,SEEN.
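
The basic flow of this mode might look like the sketch below. It is a simplification under several assumptions: the function name, the three-bird taxonomy, and its order are made up, and duplicate handling and fuzzy matching are left out entirely.

```python
def sorted_list(lines, taxonomy, use_taxonomy_names=False):
    """Return (output lines in taxonomic order, unknown lines)."""
    order = {name.casefold(): i for i, name in enumerate(taxonomy)}
    matched, unknown = [], []
    for line in lines:
        if not line.strip():
            continue                           # blank lines are ignored
        name, comma, rest = line.partition(",")
        key = name.strip().casefold()
        if key in order:
            shown = taxonomy[order[key]] if use_taxonomy_names else name.strip()
            matched.append((order[key], shown + comma + rest))  # CSV data untouched
        else:
            unknown.append(line.strip())
    matched.sort(key=lambda pair: pair[0])     # stable sort keeps input order on ties
    return [text for _, text in matched], unknown


taxonomy = ["Mallard", "Sora", "Bald Eagle"]   # made-up order for the example
print(sorted_list(["SORA,SEEN", "Bald Eagle", "", "Mallard", "Dodo"], taxonomy, True))
```

With the option enabled, SORA,SEEN comes out as Sora,SEEN, matching the example above.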

Fuzzy Matching

If you’ve enabled the Use Fuzzy Matching option, then when I can’t match an input name to any taxonomy name, I make a guess at what you really meant to type using close-enough or fuzzy matching. So if you meant to enter Sora but accidentally entered Sorra, I guess you meant to enter Sora. So you know which output names were fuzzily matched, I add your input in square brackets after the name from the taxonomy. Hence the typo above yields Sora [Sorra]. This works for CSV data as well. For example, Sora [Sorra],heard only.

Because fuzzy matching isn’t perfect, you need to review the fuzzy matches.

The fuzzy matching works by finding names in the taxonomy that are textually similar to what you entered. If no name is similar enough, then I don’t declare a fuzzy match, and the input name goes to the Unknown Names field. There’s a balance between too few correct fuzzy matches and too many incorrect ones. Unfortunately, there’s no additional information used, such as location, to make the guess better. There’s a section at the end of this page with more detail on how this works.

If the Use Taxonomy Names option is not enabled and there is a fuzzy match, I try to pass the case of your input to the output. For example, if you entered Sorra, you get Sora [Sorra], but if you entered SORRA, you get SORA [SORRA]. I never change the case of what you entered when your entry goes into square brackets. Hence, if the Use Taxonomy Names option is enabled, you get Sora [SORRA] for the second input.
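
One way to picture the case handling is the sketch below. The only case rule shown (an all-caps entry gives an all-caps name) is my guess from the examples above, and the function name format_fuzzy is hypothetical.

```python
def format_fuzzy(entered, taxonomy_name, use_taxonomy_names):
    """Format a fuzzy match as 'Name [what you typed]'."""
    shown = taxonomy_name
    if not use_taxonomy_names and entered.isupper():
        shown = taxonomy_name.upper()  # assumed rule: mirror an all-caps entry
    return f"{shown} [{entered}]"      # bracketed text is never re-cased


print(format_fuzzy("Sorra", "Sora", use_taxonomy_names=False))
print(format_fuzzy("SORRA", "Sora", use_taxonomy_names=False))
print(format_fuzzy("SORRA", "Sora", use_taxonomy_names=True))
```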

If the taxonomy name has accented characters which you don’t enter, fuzzy matching might fix that. For example, if you enter Ruppell's Chat, fuzzy matching finds the correct spelling of Rüppell's Chat, but if you also made too many other spelling errors, this fuzzy match might not be found or a different match may be made.

Because fuzzy matching is a lot more work, expect sorting with this option enabled to take more time.

Correcting Unknown Names

You can select each unknown name and use your browser’s search function to find it in your input and correct the name before sorting again.

It may be handy to switch to the Taxonomy Info page and view your taxonomy if you can’t figure out your misspelling. If you have several names to look up in the taxonomy, I’d open a second browser window dedicated to viewing the taxonomy.

Output Ordering

First, the names are sorted into taxonomic order for the taxonomy you selected. If there is more than one output line for a given species, then I make a best-effort attempt at retaining the order found in the input. However, there are certain edge cases where the ordering is not maintained. Below are some examples. Assume the Fuzzy Matching option is enabled. I show the Keep Duplicate Lines option since it affects the output as well.

Input	Keep Duplicate Lines	Output	Notes
Sora Sorra Sora Soraa Sora	don’t keep	Sora Sora [Sorra] Sora [Soraa]	The input order is preserved after the duplicate Sora entries are gathered into their first appearance.
Sorra Sora Sora Soraa Sora	don’t keep	Sora [Sorra] Sora Sora [Soraa]	Same as above.
SORRA Soora Sorra SORAA	don’t keep	SORA [SORRA] Sora [Sorra] Sora [Soora] SORA [SORAA]	SORRA and Sorra are considered the same due to case-insensitive matching, thus they get moved together in the output. Still, the all-caps SORRA comes out before the mixed-case Sorra which reflects their input order. The order of the other lines is preserved.
Sorra Sora Sora Soraa Sora	keep	Sora [Sorra] Sora Sora Sora Sora [Soraa]	All the Soras are grouped together at the location of their first appearance. The other lines remain in order.
Sora,1 Sora,2 Sora,1	keep	Sora,1 Sora,2 Sora,1	Unlike fuzzy matches, the same names with the same CSV data don’t group together.

For the unknown names, the order is what you entered if the Keep Duplicate Lines option is disabled. If it is enabled and there are duplicate unknown entries, they are grouped together.

More on Duplicates

For CSV data, detection of duplicate lines uses the entire input line and not just the bird name. Thus

  Sora,seen
  Sora,heard only

are not considered duplicates, but

  Sora,seen
  Sora,seen

are considered duplicates.

Here are more cases to consider for duplicate detection.

If you entered two names that differ only in upper or lower case, how these two appear in the output depends on the settings of the Keep Duplicate Lines and Use Taxonomy Names options as shown in the table below.

Input	Keep Duplicate Lines	Use Taxonomy Names	Output
Sora SORA	enabled or disabled	disabled	Sora SORA
Sora SORA	disabled	enabled	Sora
Sora SORA	enabled	enabled	Sora Sora

A fuzzy match and a non-fuzzy match are never considered duplicates. For example, if you enter Sora and Sorra, and Sorra is fuzzily matched to Sora, both still go to the output since the fuzzy version’s name is actually Sora [Sorra], and that string doesn’t match Sora. I also want to make it clear a fuzzy match occurred which might have matched to the wrong bird.
If you misspell the same bird two different ways, these are never considered duplicates. For example, if Soora and Sorra both appear in the input, both appear in the output regardless of the Keep Duplicate Lines and Use Taxonomy Names options.
If you misspell the same bird the same way, these two are considered duplicates. For example, if Sorra and Sorra both appear in the input, both appear in the output only if the Keep Duplicate Lines option is enabled.
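
All of the duplicate cases above are consistent with comparing the finished output lines as exact strings. The sketch below assumes that is how it works (I am inferring this from the examples, and drop_duplicates is a made-up name).

```python
def drop_duplicates(output_lines):
    """Remove exact repeats, keeping the first appearance of each line."""
    return list(dict.fromkeys(output_lines))  # dicts preserve insertion order


print(drop_duplicates(["Sora,seen", "Sora,heard only", "Sora,seen"]))
print(drop_duplicates(["Sora", "Sora [Sorra]", "Sora [Soora]", "Sora [Sorra]"]))
print(drop_duplicates(["Sora", "SORA"]))  # differ in case, so both stay
```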

Counting Output Lines

The count below the Output Names field has two parts. The first is the total number of lines in the field. The second is the number of those that are fuzzy matches.

If a CSV header row is present, it is not counted in the output.

Needs List Mode

Use this option to print a needs list. That is, print out the entire taxonomy except for the birds you entered. For example, you might enter a list of all the birds you have photographed, and your needs list has the remaining birds in the taxonomy you need to photograph.

There are too many odd corner cases that I didn’t want to handle to allow fuzzy matching while making a needs list, so any misspellings you enter always go to the Unknown Names area. I suggest you sort your names in sorted-list mode with fuzzy matching enabled to fix all misspellings before generating a needs list. If the Use Fuzzy Matching option is enabled, it is ignored while making a needs list.

Duplicates are ignored except for unknown names which do follow the Keep Duplicate Lines option’s setting.

The Include Subspecies option means you’re interested in tracking the subspecies you’ve seen. It’s kind of complicated, so the table below shows some examples with the Shikra which has African and Asian subspecies.

Input	Include Subspecies	Output	Notes
Shikra not entered	disabled	Shikra	You need to see a Shikra.
Shikra	disabled	nothing	By nothing in the output, I mean related to this bird. Other species may appear based on inputs relating to them.
Shikra (African)	disabled	nothing	Even though you entered a subspecies, the unqualified name does not appear in the output since if you’ve seen a subspecies, you’ve seen the unqualified species too. The Include Subspecies option only affects the output, so a name you enter with a subspecies isn’t an unknown name as long as it’s in the taxonomy.
Shikra not entered	enabled	Shikra Shikra (African) Shikra (Asian)	You haven’t seen any example of this bird, so you need all.
Shikra	enabled	Shikra (African) Shikra (Asian)	With the option enabled, I start putting subspecies in the output. Neither subspecies is in the input, so you need to see both.
Shikra Shikra (African)	enabled	Shikra (Asian)	You’ve seen the unqualified species and one subspecies, and there’s still a subspecies left to see.
Shikra (Asian)	enabled	Shikra (African)	As in the third entry of this table, seeing a subspecies means you’ve seen the unqualified name, so it’s not printed, but you still need to see one subspecies.
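
The table above can be reproduced with a small sketch. The Taxon record, its kind and parent fields, and the needs_list function are all invented for illustration; the real taxonomy files may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Taxon:
    name: str
    kind: str                     # "species", "subspecies", or "other" (assumed labels)
    parent: Optional[str] = None  # species name, for subspecies only


TAXONOMY = [
    Taxon("Shikra", "species"),
    Taxon("Shikra (African)", "subspecies", "Shikra"),
    Taxon("Shikra (Asian)", "subspecies", "Shikra"),
]


def needs_list(seen, taxonomy, include_subspecies=False, include_other=False):
    """Return the taxonomy names not yet seen, in taxonomy order."""
    seen_keys = {name.casefold() for name in seen}
    for taxon in taxonomy:
        # Seeing a subspecies counts as seeing the unqualified species too.
        if taxon.kind == "subspecies" and taxon.name.casefold() in seen_keys:
            seen_keys.add(taxon.parent.casefold())
    needed = []
    for taxon in taxonomy:
        if taxon.kind == "subspecies" and not include_subspecies:
            continue
        if taxon.kind == "other" and not include_other:
            continue
        if taxon.name.casefold() not in seen_keys:
            needed.append(taxon.name)
    return needed


print(needs_list(["Shikra (Asian)"], TAXONOMY, include_subspecies=True))
```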

The Include Other Taxa option is similar to the option just described above but affects other taxa by which I mean anything in the taxonomy that is not an unqualified bird name or a subspecies name. Many of these are found in the eBird-Clements taxonomy. A few examples:

  Snow x Canada Goose (hybrid)
  Northern Flicker (Yellow-shafted x Red-shafted)
  Brewster's Warbler (hybrid)
  Hepatic/Summer Tanager
  Dark-eyed Junco/Pine Warbler
  sparrow/warbler sp. (trilling song)
  Mallard (Domestic type)
  bird sp.

If you leave this option disabled, then you don’t get these other taxa in your output. If enabled, then you do.

If you leave both options disabled and enter nothing in the input field, then the output is just the selected taxonomy stripped of everything but unqualified species names. Might be useful.

If you enable the Mark Instead of Exclude option, then instead of excluding matching names, I mark them with a preceding hashmark (#, also known as an octothorp, which is a good bit of trivia).

The Include Taxonomy Numbers option works as described under Sorted-list Mode above.

The Use Taxonomy Names option is ignored.

If both the Mark Instead of Exclude and the Include Taxonomy Numbers options are enabled, then the hashmark is printed after the taxonomy sequence number and comma like so:

2751,#Sora

Counting Needs List Lines

Basically, the count for a needs list is how many names you still need to see and does not always match the number of names in the output field. For example, if you have the Mark Instead of Exclude option enabled, then a bird you have seen is in the output with a hashmark, but it’s not counted.
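
To show how marking, numbering, and counting fit together, here is a sketch under assumptions: the function name needs_lines, the tiny taxonomy, and numbering from 1 are all made up for the example.

```python
def needs_lines(taxonomy, seen, mark=False, numbers=False):
    """Return (output lines, count of names still needed).

    seen is a set of casefolded names already seen.
    """
    lines, needed = [], 0
    for num, name in enumerate(taxonomy, start=1):
        have = name.casefold() in seen
        if have and not mark:
            continue                   # excluded outright
        if not have:
            needed += 1                # marked lines are not counted
        text = ("#" if have else "") + name
        lines.append(f"{num},{text}" if numbers else text)  # hashmark follows the comma
    return lines, needed


print(needs_lines(["Mallard", "Sora", "Bald Eagle"], {"sora"}, mark=True, numbers=True))
```

In this run the output has three lines, but the count is 2 because the marked Sora is not something you still need.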

Possible Improvements

Here are some additional features I’ve thought of if folks express an interest. If you have comments on these or anything else, please use the Feedback form.

Allow common names in other languages.
Allow for custom taxonomies.
Keep a list of renamed birds as an option for fuzzy matching. Going back how far?
Does anybody want an alphabetic sort in addition to the existing taxonomic sort?

How Fuzzy Matching Works

Before it gets too complicated, the first things I try are some common errors that I found when looking at web-accessible bird lists such as mixing up Eurasian, European, and Common. I also check for ue entered in place of ü. Same for ae and oe.
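
These quick checks could be pictured as generating a few alternative spellings and looking each one up in the taxonomy. The sketch below is a guess at the idea only; the function name, the ä and ö pairings for ae and oe, and the first-word-only swap are my assumptions.

```python
# Assumed pairs; the text above only says ue, ae, and oe are checked.
UMLAUTS = (("ue", "ü"), ("ae", "ä"), ("oe", "ö"))
# Leading words that are commonly mixed up.
SWAPPABLE = ("Eurasian", "European", "Common")


def quick_candidates(entered):
    """Return alternative spellings to try before the heavier fuzzy search."""
    found = []
    for plain, accented in UMLAUTS:
        if plain in entered:
            found.append(entered.replace(plain, accented))
    first, _, rest = entered.partition(" ")
    if first in SWAPPABLE and rest:
        found.extend(f"{other} {rest}" for other in SWAPPABLE if other != first)
    return found


print(quick_candidates("Rueppell's Chat"))
print(quick_candidates("European Blackbird"))
```

A candidate only counts if it turns out to be an exact taxonomy name, so overly eager replacements are harmless.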

If none of those work, it’s time for the big gun of Levenshtein distance minimization. The Levenshtein distance between any two strings is the minimum number of single-character changes (insertions, deletions, or substitutions) required to turn one string into the other. For example, the distance between Sorra and Sora is 1 since we only have to delete the extra r. A more complicated case is changing RodyDuck to Ruddy Duck since we have to make 3 changes: replace the o with u, insert a second d, and insert the space.
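
For the curious, here is the textbook dynamic-programming way to compute a Levenshtein distance. It is a standard formulation, not necessarily the code this site runs.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))       # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete ca
                cur[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb) # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


print(levenshtein("Sorra", "Sora"))
print(levenshtein("RodyDuck", "Ruddy Duck"))
```

The work grows with the product of the two string lengths, which is why running it against every name in a taxonomy is slow.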

Any string can be converted to any other string with enough changes, such as Bald Eagle to Somali Ostrich, but that’s not very useful for our purposes, so there’s a limit to how many changes I allow. If there are too many changes, then I don’t declare a fuzzy match. Presently, the limit is 4 changes. I played around with this limit, and 4 changes provide a good balance between matching many misspelling examples I found on the web (often just a single typo) and not getting matches that were way off base. Since figuring the Levenshtein distance is computationally expensive, I don’t even attempt it if the entered name is too long or too short compared to the taxonomic name that might be a match.

What happens if the misspelled name is, say, 3 changes away from two different taxonomy names? There’s no good way for me to pick between the two names, so I use the first in taxonomic order. The most important thing is to show that the misspelled name is not in the taxonomy, and a fuzzy match is an extra perk. It’s nice if I get the name right, but the user still knows to fix the name.
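
Putting the limit, the length shortcut, and the tie rule together might look like the sketch below. Treat the details as assumptions: I take the limit to mean a distance of 4 or less, I use the length difference itself as the shortcut test, and I compare names without regard to case. The function names and sample taxonomy are made up.

```python
MAX_CHANGES = 4  # the limit described above, assumed to be inclusive


def levenshtein(a, b):
    """Standard edit distance (inserts, deletes, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_match(entered, taxonomy):
    """Return the closest taxonomy name within the limit, or None."""
    best_name, best_dist = None, MAX_CHANGES + 1
    target = entered.casefold()
    for name in taxonomy:                      # taxonomy is in taxonomic order
        # The distance can never be less than the length difference,
        # so hopeless candidates are skipped without the costly call.
        if abs(len(name) - len(entered)) > MAX_CHANGES:
            continue
        dist = levenshtein(target, name.casefold())
        if dist < best_dist:                   # strict '<' keeps the first on ties
            best_name, best_dist = name, dist
    return best_name


birds = ["Somali Ostrich", "Ruddy Duck", "Sora", "Bald Eagle"]  # made-up order
print(fuzzy_match("Sorra", birds))
print(fuzzy_match("RodyDuck", birds))
print(fuzzy_match("Bald Eagle", ["Somali Ostrich"]))
```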