How to convert any sequence-like format to 'generic-seq'#

A goal of crowsetta is to make it easier to share annotations for a dataset of animal vocalizations or other bioacoustics data. One way to achieve this is to convert the annotations to a single flat csv file, which is easy to share and work with, e.g., using the pandas library. For sequence-like annotations, this can be done by converting them to the 'generic-seq' format.

This how-to walks you through converting annotations to the 'generic-seq' format and then saving those annotations as a csv file. As its name suggests, 'generic-seq' is meant to be a generic sequence-like format that all other sequence-like formats can be converted to.

Workflow#

Here’s the general workflow. We’ll see a few different ways to achieve it below.

  1. Load annotations in your format

  2. Convert those to crowsetta.Annotation instances

  3. Make a crowsetta.formats.seq.GenericSeq from those Annotations

  4. Save to a csv file using the crowsetta.formats.seq.generic.GenericSeq.to_file method

This works because crowsetta represents a set of annotations in generic-seq format as a list of crowsetta.Annotation instances where each Annotation has a crowsetta.Sequence. Since all sequence-like formats have a to_annot method, they can all be converted to 'generic-seq'. In turn, this means that any sequence-like format can be converted to a flat .csv file, by creating a 'generic-seq' instance with the Annotations produced by calling to_annot and then calling the to_file method of the 'generic-seq' instance.
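To make the idea of a "flat" file concrete, here is a minimal sketch that uses only the Python standard library. It is not how crowsetta implements 'generic-seq': the `flatten` helper and the sample data are hypothetical, and plain dicts stand in for crowsetta.Annotation instances. The column names follow the DataFrames shown in the examples on this page, where each row is one segment and the `annotation` column records which annotation that segment belongs to.

```python
import csv
import io

# Hypothetical stand-in for a list of annotations: one dict per annotated file.
# crowsetta represents these as crowsetta.Annotation instances instead.
annots = [
    {
        "annot_path": "bird1.txt",
        "notated_path": "bird1.wav",
        "segments": [("a", 0.0, 0.5), ("b", 0.5, 1.25)],
    },
    {
        "annot_path": "bird2.txt",
        "notated_path": "bird2.wav",
        "segments": [("a", 0.1, 0.4)],
    },
]


def flatten(annots):
    """Turn a list of per-file annotations into one flat list of rows."""
    rows = []
    for annot_ind, annot in enumerate(annots):
        for label, onset_s, offset_s in annot["segments"]:
            rows.append(
                {
                    "label": label,
                    "onset_s": onset_s,
                    "offset_s": offset_s,
                    "notated_path": annot["notated_path"],
                    "annot_path": annot["annot_path"],
                    "sequence": 0,  # assumption: one sequence per annotation
                    "annotation": annot_ind,
                }
            )
    return rows


rows = flatten(annots)

# Write the rows as csv text to an in-memory buffer,
# so the sketch does not touch the file system.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because every row carries its own `annotation` index and paths, the per-file structure can be recovered from the single flat file.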

Converting a sequence-like format with a single annotation file per annotated file#

The first example we show is for possibly the most common case, where each annotated file has a single annotation file. This is likely to be the case if you are using apps like Praat or Audacity. An example of such a format is the Audacity standard label track format, exported to .txt files, that you would get if you were to annotate with region labels. This format is represented by the crowsetta.formats.seq.AudSeq class in crowsetta.
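For orientation, an exported Audacity label track is tab-separated text with one label per line: start time in seconds, end time in seconds, then the label text. The sketch below parses such text with the standard library only, to show what the format contains. It is an illustration, not a replacement for crowsetta.formats.seq.AudSeq. The sample text and the `parse_label_track` helper are made up for this example, and the sketch assumes plain region labels with no extra frequency-range lines.

```python
import csv
import io

# Made-up example of an exported label track: start, end, label, separated by tabs
label_track_text = (
    "0.000000\t0.350000\tSIL\n"
    "0.350000\t0.664000\tcall\n"
    "0.664000\t1.359000\tSIL\n"
)


def parse_label_track(text):
    """Parse label-track text into (label, onset_s, offset_s) tuples."""
    segments = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        onset_s, offset_s, label = float(row[0]), float(row[1]), row[2]
        segments.append((label, onset_s, offset_s))
    return segments


segments = parse_label_track(label_track_text)
for label, onset_s, offset_s in segments:
    print(f"{label}: {onset_s:.3f} s to {offset_s:.3f} s")
```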

As described above, all you need to do is load your sequence-like annotations with crowsetta, and then call the to_annot method to convert them to a crowsetta.Annotation instance. When working with a format where there’s one annotation file per annotated file, this does mean you need to load each file and convert it into a separate annotation instance. (Below we’ll see an example of a format where annotations for multiple files are contained in a single annotation file, and so we only need to call to_annot once after loading it to get a list of crowsetta.Annotations.) For this first example, where we have multiple annotation files, we use a loop to load each one and convert it to a crowsetta.Annotation instance.
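Both cases can be handled by one small helper. The sketch below is hypothetical and not part of crowsetta: `collect_annots` takes a list of paths and a loader callable, calls `to_annot` on each loaded object, and flattens the results into one list whether `to_annot` returns a single annotation or a list of them. So that it runs without any data files, the sketch exercises the helper with stand-in classes rather than real format classes.

```python
def collect_annots(annot_paths, from_file, **to_annot_kwargs):
    """Load each path with ``from_file``, call ``to_annot``, and return one flat list."""
    annots = []
    for annot_path in annot_paths:
        result = from_file(annot_path).to_annot(**to_annot_kwargs)
        if isinstance(result, list):
            annots.extend(result)  # format with many annotations per file
        else:
            annots.append(result)  # format with one annotation per file
    return annots


# Stand-ins for format classes, used only to exercise the helper
class FakeSingle:
    def __init__(self, path):
        self.path = path

    def to_annot(self):
        return f"annot for {self.path}"


class FakeMulti:
    def __init__(self, path):
        self.path = path

    def to_annot(self):
        return [f"annot {i} in {self.path}" for i in range(2)]


single = collect_annots(["a.txt", "b.txt"], FakeSingle)
multi = collect_annots(["all.xml"], FakeMulti)
print(single)
print(multi)
```

With crowsetta installed, the loader passed in would be a format class's `from_file` method, for example `crowsetta.formats.seq.AudSeq.from_file`.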

We use the same dataset we used in the Tutorial for this example, “Labeled songs of domestic canary M1-2016-spring (Serinus canaria)” by Giraudon et al., 2021, annotated with Audacity Labeltrack files.

cd ..

First we download and extract the dataset, if we haven’t already.

!curl --no-progress-meter -L 'https://zenodo.org/record/6521932/files/M1-2016-spring_audacity_annotations.zip?download=1' -o './data/M1-2016-spring_audacity_annotations.zip'
import shutil
shutil.unpack_archive('./data/M1-2016-spring_audacity_annotations.zip', './data/giraudon-et-al-2021')

Now we load the annotation files.

import pathlib
import crowsetta

audseq_paths = sorted(pathlib.Path('./data/giraudon-et-al-2021/audacity-annotations').glob('*.txt'))
# each file holds one annotation, so we build the list of ``Annotation``s "by hand", with one `to_annot` call per file
annots = []
for audseq_path in audseq_paths:
    annots.append(
        crowsetta.formats.seq.AudSeq.from_file(audseq_path).to_annot()
    )

print(
    f"Number of annotation instances from dataset: {len(annots)}"
) 
Number of annotation instances from dataset: 459

We create a set of annotations in the 'generic-seq' format by making an instance of the GenericSeq class, passing in our list of crowsetta.Annotation instances.

# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
Created 'generic-seq' from annotations
First five rows of annotations (converted to pandas.DataFrame)
   label  onset_s  offset_s  notated_path  annot_path                                         sequence  annotation
0  SIL    0.000    0.350     None          data/giraudon-et-al-2021/audacity-annotations/...  0         0
1  call   0.350    0.664     None          data/giraudon-et-al-2021/audacity-annotations/...  0         0
2  SIL    0.664    1.359     None          data/giraudon-et-al-2021/audacity-annotations/...  0         0
3  Z      1.359    2.412     None          data/giraudon-et-al-2021/audacity-annotations/...  0         0
4  SIL    2.412    2.488     None          data/giraudon-et-al-2021/audacity-annotations/...  0         0
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
Last five rows of annotations (converted to pandas.DataFrame)
       label  onset_s  offset_s  notated_path  annot_path                                         sequence  annotation
23303  SIL    23.283   23.379    None          data/giraudon-et-al-2021/audacity-annotations/...  0         458
23304  K      23.379   23.616    None          data/giraudon-et-al-2021/audacity-annotations/...  0         458
23305  V      23.616   25.127    None          data/giraudon-et-al-2021/audacity-annotations/...  0         458
23306  SIL    25.127   25.161    None          data/giraudon-et-al-2021/audacity-annotations/...  0         458
23307  O      25.161   25.402    None          data/giraudon-et-al-2021/audacity-annotations/...  0         458

Converting a sequence-like format with multiple annotations per file#

Some formats contain multiple annotations per file, and the to_annot method of the corresponding class will return multiple crowsetta.Annotation instances. To convert this format to 'generic-seq', just pass in those Annotations when creating an instance of 'generic-seq'. We demonstrate that here with the format of the Birdsong-Recognition dataset, using sample data built into the crowsetta package.

import crowsetta

crowsetta.data.extract_data_files()
import crowsetta

example = crowsetta.data.get('birdsong-recognition-dataset')
birdsongrec = crowsetta.formats.seq.BirdsongRec.from_file(example.annot_path)
# we pass a fake samplerate to suppress a warning about not finding .wav files
annots = birdsongrec.to_annot(samplerate=32000)
print(
    f"Number of annotation instances in example 'birdsong-recognition-dataset' file: {len(annots)}"
) 

# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
Number of annotation instances in example 'birdsong-recognition-dataset' file: 135
Created 'generic-seq' from annotations
First five rows of annotations (converted to pandas.DataFrame)
   label  onset_s  offset_s  onset_sample  offset_sample  notated_path                                       annot_path                                         sequence  annotation
0  0      1.070    1.154     34240         36928          /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         0
1  0      1.258    1.345     40256         43040          /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         0
2  0      1.467    1.555     46944         49760          /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         0
3  1      1.659    1.732     53088         55424          /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         0
4  0      1.814    1.878     58048         60096          /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         0
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
Last five rows of annotations (converted to pandas.DataFrame)
      label  onset_s  offset_s  onset_sample  offset_sample  notated_path                                       annot_path                                         sequence  annotation
7647  6      11.844   11.952    378992        382448         /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         134
7648  2      12.006   12.092    384176        386928         /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         134
7649  0      12.148   12.236    388752        391536         /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         134
7650  7      12.266   12.316    392528        394128         /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         134
7651  8      12.594   12.646    402992        404688         /home/docs/.local/share/crowsetta/5.0.2.post1/...  /home/docs/.local/share/crowsetta/5.0.2.post1/...  0         134

To save these as a csv file, you can either call the pandas.DataFrame.to_csv method directly, or you can equivalently call the GenericSeq method to_file.

df.to_csv('./data/birdsong-rec-pandas.csv', index=False)

generic.to_file('./data/birdsong-rec-generic-seq.csv')

Now you have seen two different ways to create a GenericSeq instance from a set of annotations, and then save them to a csv file so anyone can work with them!
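As one illustration of working with such a file without crowsetta, the sketch below reads csv text with the standard library and groups the rows by the `annotation` column, recovering one label sequence and a total labeled duration per annotation. The csv text is a made-up stand-in that uses a subset of the column names shown in the DataFrames above. In practice you would open the saved file instead of an in-memory string.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for the contents of a shared csv file
csv_text = """label,onset_s,offset_s,notated_path,annot_path,sequence,annotation
a,0.0,0.5,song_a.wav,annots.xml,0,0
b,0.5,1.25,song_a.wav,annots.xml,0,0
a,1.5,2.0,song_a.wav,annots.xml,0,0
c,0.25,0.5,song_b.wav,annots.xml,0,1
"""

labels_by_annot = defaultdict(list)
total_dur_by_annot = defaultdict(float)

for row in csv.DictReader(io.StringIO(csv_text)):
    annot_ind = int(row["annotation"])  # csv values are read as strings
    labels_by_annot[annot_ind].append(row["label"])
    total_dur_by_annot[annot_ind] += float(row["offset_s"]) - float(row["onset_s"])

for annot_ind, labels in labels_by_annot.items():
    print(annot_ind, labels, round(total_dur_by_annot[annot_ind], 3))
```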