How to convert any sequence-like format to `'generic-seq'`#

A goal of crowsetta is to make it easier to share annotations for a dataset of animal vocalizations or other bioacoustics data. One way to achieve this is to convert the annotations to a single flat csv file, which is easy to share and work with, e.g., using the pandas library. For sequence-like annotations, this can be done by converting them to the 'generic-seq' format.

This how-to walks you through converting annotations to the 'generic-seq' format and then saving those annotations as a csv file. As its name suggests, 'generic-seq' is meant to be a generic sequence-like format that all other sequence-like formats can be converted to.

Workflow#

Here’s the general workflow. We’ll see a few different ways to achieve it below.

1. Load annotations in your format.
2. Convert those to crowsetta.Annotation instances.
3. Make a crowsetta.formats.seq.GenericSeq from those Annotations.
4. Save to a csv file using the crowsetta.formats.seq.generic.GenericSeq.to_file method.

This works because crowsetta represents a set of annotations in generic-seq format as a list of crowsetta.Annotation instances where each Annotation has a crowsetta.Sequence. Since all sequence-like formats have a to_annot method, they can all be converted to 'generic-seq'. In turn, this means that any sequence-like format can be converted to a flat .csv file, by creating a 'generic-seq' instance with the Annotations produced by calling to_annot and then calling the to_file method of the 'generic-seq' instance.
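The idea behind this conversion can be illustrated without crowsetta itself. The sketch below is only an illustration under stated assumptions: `Segment` and `Annot` are hypothetical stand-in classes (they are not crowsetta's `Sequence` or `Annotation`), the values are made up, and the column names are borrowed from the tables shown later in this how-to. It shows how a list of per-file annotations, each holding a sequence of labeled segments, can be flattened into one row per segment with an integer `annotation` column that records which annotation each row came from.

```python
import csv
import io
from dataclasses import dataclass

# Column order for the flat table (names borrowed from the tables in this how-to)
FIELDNAMES = ["label", "onset_s", "offset_s", "annot_path", "annotation"]


@dataclass
class Segment:
    """One labeled segment. Hypothetical stand-in, not a crowsetta class."""
    label: str
    onset_s: float
    offset_s: float


@dataclass
class Annot:
    """Annotation for one file. Hypothetical stand-in for crowsetta.Annotation."""
    annot_path: str
    segments: list


def flatten(annots):
    """Return one dict per segment, tagged with the index of its annotation."""
    rows = []
    for annot_index, annot in enumerate(annots):
        for seg in annot.segments:
            rows.append({
                "label": seg.label,
                "onset_s": seg.onset_s,
                "offset_s": seg.offset_s,
                "annot_path": annot.annot_path,
                "annotation": annot_index,
            })
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows as csv text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example data: two annotation files, three segments in total
annots = [
    Annot("a.txt", [Segment("SIL", 0.0, 0.35), Segment("call", 0.35, 0.664)]),
    Annot("b.txt", [Segment("K", 0.1, 0.2)]),
]
rows = flatten(annots)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Because every row carries its `annotation` index, the flat table can later be split back into one sequence per annotated file.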

Converting a sequence-like format with a single annotation file per annotated file#

The first example we show is for possibly the most common case, where each annotated file has a single annotation file. This is likely to be the case if you are using apps like Praat or Audacity. An example of such a format is the Audacity standard label track format, exported to .txt files, that you would get if you were to annotate with region labels. This format is represented by the crowsetta.formats.seq.AudSeq class in crowsetta.
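To make the structure of such a file concrete, here is a minimal sketch of reading a label track by hand. It assumes the exported .txt file has one region label per line with three tab-separated fields (start time, end time, label), and that any line beginning with a backslash holds extra frequency information that can be skipped. The sample text and the function name `parse_label_track` are invented for this illustration; in practice, crowsetta.formats.seq.AudSeq does this loading for you.

```python
# Made-up sample text imitating an exported label track (not real data)
SAMPLE = "0.000000\t0.350000\tSIL\n0.350000\t0.664000\tcall\n"


def parse_label_track(text):
    """Parse label track text into (onset_s, offset_s, label) tuples.

    Assumes tab-separated start, end, and label fields on each line.
    """
    segments = []
    for line in text.splitlines():
        # Skip blank lines and lines assumed to carry frequency ranges
        if not line.strip() or line.startswith("\\"):
            continue
        fields = line.split("\t")
        onset_s = float(fields[0])
        offset_s = float(fields[1])
        # A region may have an empty label
        label = fields[2] if len(fields) > 2 else ""
        segments.append((onset_s, offset_s, label))
    return segments


segments = parse_label_track(SAMPLE)
print(segments)
```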

As described above, all you need to do is load your sequence-like annotations with crowsetta, and then call the to_annot method to convert them to a crowsetta.Annotation instance. When working with a format where there’s one annotation file per annotated file, this does mean you need to load each file and convert it into a separate annotation instance. (Below we’ll see an example of a format where annotations for multiple files are contained in a single annotation file, and so we only need to call to_annot once after loading it to get a list of crowsetta.Annotations.) For this first example, where we have multiple annotation files, we use a loop to load each one and convert it to a crowsetta.Annotation instance.

We use the same dataset we used in the Tutorial for this example, “Labeled songs of domestic canary M1-2016-spring (Serinus canaria)” by Giraudon et al., 2021, annotated with Audacity Labeltrack files.

cd ..

First we download and extract the dataset, if we haven’t already.

!curl --no-progress-meter -L 'https://zenodo.org/record/6521932/files/M1-2016-spring_audacity_annotations.zip?download=1' -o './data/M1-2016-spring_audacity_annotations.zip'

import shutil
shutil.unpack_archive('./data/M1-2016-spring_audacity_annotations.zip', './data/giraudon-et-al-2021')

Now we load the annotation files.

import pathlib
import crowsetta

audseq_paths = sorted(pathlib.Path('./data/giraudon-et-al-2021/audacity-annotations').glob('*.txt'))
# here a single `to_annot` call returns one `Annotation`, not a list,
# so we build the list of `Annotation`s "by hand", one file at a time
annots = []
for audseq_path in audseq_paths:
    annots.append(
        crowsetta.formats.seq.AudSeq.from_file(audseq_path).to_annot()
    )

print(
    f"Number of annotation instances from dataset: {len(annots)}"
) 

Number of annotation instances from dataset: 459

We create a set of annotations in the generic sequence format, by making an instance of the GenericSeq class, passing in our list of crowsetta.Annotation instances.

# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()

Created 'generic-seq' from annotations

First five rows of annotations (converted to pandas.DataFrame)

	label	onset_s	offset_s	notated_path	annot_path
0	SIL	0.000	0.350	None	data/giraudon-et-al-2021/audacity-annotations/...
1	call	0.350	0.664	None	data/giraudon-et-al-2021/audacity-annotations/...
2	SIL	0.664	1.359	None	data/giraudon-et-al-2021/audacity-annotations/...
3	Z	1.359	2.412	None	data/giraudon-et-al-2021/audacity-annotations/...
4	SIL	2.412	2.488	None	data/giraudon-et-al-2021/audacity-annotations/...

print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()

Last five rows of annotations (converted to pandas.DataFrame)

	label	onset_s	offset_s	notated_path	annot_path	annotation
23303	SIL	23.283	23.379	None	data/giraudon-et-al-2021/audacity-annotations/...	458
23304	K	23.379	23.616	None	data/giraudon-et-al-2021/audacity-annotations/...	458
23305	V	23.616	25.127	None	data/giraudon-et-al-2021/audacity-annotations/...	458
23306	SIL	25.127	25.161	None	data/giraudon-et-al-2021/audacity-annotations/...	458
23307	O	25.161	25.402	None	data/giraudon-et-al-2021/audacity-annotations/...	458

Converting a sequence-like format with multiple annotations per file#

Some formats contain multiple annotations per file, and the to_annot method of the corresponding class will return multiple crowsetta.Annotation instances. To convert this format to 'generic-seq', just pass in those Annotations when creating an instance of 'generic-seq'. We demonstrate that here with the format of the Birdsong-Recognition dataset, using sample data built into the crowsetta package.

import crowsetta

example = crowsetta.data.get('birdsong-recognition-dataset')
birdsongrec = crowsetta.formats.seq.BirdsongRec.from_file(example.annot_path)
# we pass a fake samplerate to suppress a warning about not finding .wav files
annots = birdsongrec.to_annot(samplerate=32000)
print(
    f"Number of annotation instances in example 'birdsong-recognition-dataset' file: {len(annots)}"
) 

# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()

Number of annotation instances in example 'birdsong-recognition-dataset' file: 135
Created 'generic-seq' from annotations

First five rows of annotations (converted to pandas.DataFrame)

	label	onset_s	offset_s	onset_sample	offset_sample	notated_path	annot_path
0	0	1.070	1.154	34240	36928	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...
1	0	1.258	1.345	40256	43040	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...
2	0	1.467	1.555	46944	49760	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...
3	1	1.659	1.732	53088	55424	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...
4	0	1.814	1.878	58048	60096	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...

print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()

Last five rows of annotations (converted to pandas.DataFrame)

	label	onset_s	offset_s	onset_sample	offset_sample	notated_path	annot_path	annotation
7647	6	11.844	11.952	378992	382448	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...	134
7648	2	12.006	12.092	384176	386928	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...	134
7649	0	12.148	12.236	388752	391536	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...	134
7650	7	12.266	12.316	392528	394128	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...	134
7651	8	12.594	12.646	402992	404688	/home/docs/.local/share/crowsetta/5.0.2.post1/...	/home/docs/.local/share/crowsetta/5.0.2.post1/...	134

To save these as a csv file, you can either call the pandas.DataFrame.to_csv method on the DataFrame, or you can call the GenericSeq method to_file.

df.to_csv('./data/birdsong-rec-pandas.csv', index=False)

generic.to_file('./data/birdsong-rec-generic-seq.csv')

Now you have seen two different ways to create a GenericSeq instance from a set of annotations, and then save them to a csv file so anyone can work with them!
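Anyone who receives such a csv file can work with it using only the Python standard library. The sketch below groups the labels by annotation to recover one label sequence per annotated file. It is an illustration under assumptions: the csv text is a made-up stand-in for a saved file, it assumes the file has the `label` and `annotation` columns shown in the tables above, and `labels_by_annotation` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for the contents of a saved csv file
CSV_TEXT = """label,onset_s,offset_s,notated_path,annot_path,annotation
SIL,0.0,0.35,None,a.txt,0
call,0.35,0.664,None,a.txt,0
K,0.1,0.2,None,b.txt,1
"""


def labels_by_annotation(csv_file):
    """Map each annotation index to its list of labels, in row order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[int(row["annotation"])].append(row["label"])
    return dict(grouped)


sequences = labels_by_annotation(io.StringIO(CSV_TEXT))
print(sequences)
```

To use this with a file on disk, pass an open file object instead of the `io.StringIO` wrapper, for example one obtained from `open(path, newline="")`.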
