How to convert any sequence-like format to 'generic-seq'
A goal of crowsetta is to make it easier to share annotations
for a dataset of animal vocalizations or other bioacoustics data.
One way to achieve this is to
convert the annotations to a single flat csv file,
which is easy to share and work with,
e.g., using the pandas library.
For sequence-like annotations,
this can be done by converting them to the 'generic-seq' format.
This how-to walks you through converting
annotations to the 'generic-seq' format and
then saving those annotations as a csv file.
As its name suggests,
'generic-seq' is meant to be a generic sequence-like format
that all other sequence-like formats can be converted to.
Workflow
Here’s the general workflow. We’ll see a few different ways to achieve it below.
1. Load annotations in your format.
2. Convert those to crowsetta.Annotation instances.
3. Make a crowsetta.formats.seq.GenericSeq from those Annotations.
4. Save to a csv file using the crowsetta.formats.seq.generic.GenericSeq.to_file method.
This works because crowsetta represents a set of annotations in generic-seq format
as a list of crowsetta.Annotation instances
where each Annotation has a crowsetta.Sequence.
Since all sequence-like formats have a to_annot
method, they can all be converted to 'generic-seq'.
In turn, this means that any sequence-like format
can be converted to a flat .csv file,
by creating a 'generic-seq' instance with the
Annotations produced by calling to_annot
and then calling the to_file method of
the 'generic-seq' instance.
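To make the flattening idea concrete, here is a minimal sketch that uses only the Python standard library. The `Segment` and `Annot` classes and the `flatten` function are hypothetical stand-ins written for this illustration; they are not crowsetta classes and not how crowsetta implements the conversion. The column names are taken from the DataFrame output shown later in this how-to, and the assumption of one sequence per annotation is ours.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one labeled segment of a sequence."""
    label: str
    onset_s: float
    offset_s: float


@dataclass
class Annot:
    """Hypothetical stand-in for an annotation (not crowsetta.Annotation)."""
    annot_path: str
    notated_path: str
    seq: list  # list of Segment


def flatten(annots):
    """Turn a list of annotations into flat rows, one row per segment."""
    rows = []
    for annot_ind, annot in enumerate(annots):
        for seg in annot.seq:
            rows.append({
                "label": seg.label,
                "onset_s": seg.onset_s,
                "offset_s": seg.offset_s,
                "notated_path": annot.notated_path,
                "annot_path": annot.annot_path,
                "sequence": 0,  # assumption: a single sequence per annotation
                "annotation": annot_ind,  # which annotation the row came from
            })
    return rows


# made-up file names and segments, for illustration only
annots = [
    Annot("a.txt", "a.wav", [Segment("SIL", 0.0, 0.35), Segment("call", 0.35, 0.664)]),
    Annot("b.txt", "b.wav", [Segment("K", 0.1, 0.3)]),
]
rows = flatten(annots)

# write the flat rows as csv text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The `annotation` column is what lets a single flat table hold segments from many annotated files: every row records which annotation it belongs to.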
Converting a sequence-like format with a single annotation file per annotated file
The first example covers what is possibly the most common case,
where each annotated file has a single annotation file.
This is likely to be the case if you are using apps like Praat or Audacity.
An example of such a format is the Audacity
standard label track format,
exported to .txt files, that you would get if you were to annotate with
region labels.
This format is represented by the
crowsetta.formats.seq.AudSeq
class in crowsetta.
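For orientation, a label track exported from Audacity with region labels is a tab-separated text file with one label per line: start time, end time, label. The sketch below parses such text with plain Python. It is only an illustration of the file layout, not the parser that crowsetta.formats.seq.AudSeq uses; the `parse_label_track` function and the sample text are hypothetical, and skipping lines that begin with a backslash (which Audacity can add for frequency ranges) is an assumption about files you might encounter.

```python
# Hypothetical label track contents: start time, end time, label (tab-separated)
label_track_text = (
    "0.000000\t0.350000\tSIL\n"
    "0.350000\t0.664000\tcall\n"
    "0.664000\t1.359000\tSIL\n"
)


def parse_label_track(text):
    """Parse tab-separated label track text into (onset_s, offset_s, label) tuples."""
    segments = []
    for line in text.splitlines():
        # skip blank lines and, by assumption, extra lines starting with a backslash
        if not line.strip() or line.startswith("\\"):
            continue
        start, end, label = line.split("\t", maxsplit=2)
        segments.append((float(start), float(end), label))
    return segments


segments = parse_label_track(label_track_text)
for onset_s, offset_s, label in segments:
    print(f"{label}: {onset_s:.3f} to {offset_s:.3f} s")
```

In practice you would let crowsetta load these files for you, as shown in the rest of this section.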
As described above,
all you need to do is load your sequence-like annotations
with crowsetta,
and then call the to_annot method
to convert them to a crowsetta.Annotation instance.
When working with a format
where there’s one annotation file per annotated file,
this does mean you need to load each file
and convert it into a separate annotation instance.
(Below we’ll see an example of a format
where annotations for multiple files
are contained in a single annotation file,
and so we only need to call to_annot once
after loading it to get a list of
crowsetta.Annotations.)
For this first example,
where we have multiple annotation files,
we use a loop to load each one and convert it to a
crowsetta.Annotation instance.
We use the same dataset we used in the Tutorial for this example, “Labeled songs of domestic canary M1-2016-spring (Serinus canaria)” by Giraudon et al., 2021, annotated with Audacity Labeltrack files.
cd ..
First we download and extract the dataset, if we haven’t already.
!curl --no-progress-meter -L 'https://zenodo.org/record/6521932/files/M1-2016-spring_audacity_annotations.zip?download=1' -o './data/M1-2016-spring_audacity_annotations.zip'
import shutil
shutil.unpack_archive('./data/M1-2016-spring_audacity_annotations.zip', './data/giraudon-et-al-2021')
Now we load the annotation files.
import pathlib
import crowsetta
audseq_paths = sorted(pathlib.Path('./data/giraudon-et-al-2021/audacity-annotations').glob('*.txt'))
# we make the list of ``Annotation``s "by hand" instead of getting it from a `to_annot` call
annots = []
for audseq_path in audseq_paths:
    annots.append(
        crowsetta.formats.seq.AudSeq.from_file(audseq_path).to_annot()
    )
print(
    f"Number of annotation instances from dataset: {len(annots)}"
)
Number of annotation instances from dataset: 459
We create a set of annotations in the 'generic-seq' format by making an instance of the GenericSeq class and passing in our list of crowsetta.Annotation instances.
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
Created 'generic-seq' from annotations
First five rows of annotations (converted to pandas.DataFrame)
| | label | onset_s | offset_s | notated_path | annot_path | sequence | annotation |
|---|---|---|---|---|---|---|---|
| 0 | SIL | 0.000 | 0.350 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 1 | call | 0.350 | 0.664 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 2 | SIL | 0.664 | 1.359 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 3 | Z | 1.359 | 2.412 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 4 | SIL | 2.412 | 2.488 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
Last five rows of annotations (converted to pandas.DataFrame)
| | label | onset_s | offset_s | notated_path | annot_path | sequence | annotation |
|---|---|---|---|---|---|---|---|
| 23303 | SIL | 23.283 | 23.379 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23304 | K | 23.379 | 23.616 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23305 | V | 23.616 | 25.127 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23306 | SIL | 25.127 | 25.161 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23307 | O | 25.161 | 25.402 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
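Once the annotations are in a flat table like this, simple per-file summaries need only a few lines of code. The sketch below uses the standard library `csv` module on a small hand-written sample built from the ten rows shown above. The sample csv text is an assumption modeled on the table, not actual output of `to_file`: the `annot_path` value is kept truncated exactly as displayed, the `notated_path` field is left empty, and treating the `SIL` label as silence is our reading of the labels.

```python
import csv
import io
from collections import defaultdict

HEADER = "label,onset_s,offset_s,notated_path,annot_path,sequence,annotation"
# truncated in the output above, so left truncated here
ANNOT_PATH = "data/giraudon-et-al-2021/audacity-annotations/..."
# (label, onset_s, offset_s, annotation) copied from the head and tail rows above
SAMPLE = [
    ("SIL", "0.000", "0.350", 0),
    ("call", "0.350", "0.664", 0),
    ("SIL", "0.664", "1.359", 0),
    ("Z", "1.359", "2.412", 0),
    ("SIL", "2.412", "2.488", 0),
    ("SIL", "23.283", "23.379", 458),
    ("K", "23.379", "23.616", 458),
    ("V", "23.616", "25.127", 458),
    ("SIL", "25.127", "25.161", 458),
    ("O", "25.161", "25.402", 458),
]
csv_text = "\n".join(
    [HEADER]
    + [f"{label},{on},{off},,{ANNOT_PATH},0,{annot}" for label, on, off, annot in SAMPLE]
)

# sum the duration of every segment not labeled "SIL", per annotation
labeled_duration = defaultdict(float)
for row in csv.DictReader(io.StringIO(csv_text)):
    if row["label"] == "SIL":  # assumption: "SIL" marks silence
        continue
    duration = float(row["offset_s"]) - float(row["onset_s"])
    labeled_duration[int(row["annotation"])] += duration

for annot_ind, duration in sorted(labeled_duration.items()):
    print(f"annotation {annot_ind}: {duration:.3f} s of non-silence segments")
```

The same grouping could be done with a pandas `groupby` on the `annotation` column of the DataFrame; the standard library version is used here only to keep the sketch free of dependencies.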
Converting a sequence-like format with multiple annotations per file
Some formats contain multiple annotations per file,
and the to_annot method of the corresponding class
will return multiple crowsetta.Annotation instances.
To convert this format to 'generic-seq',
just pass in those Annotations when
creating an instance of 'generic-seq'.
We demonstrate that here with the format of the Birdsong-Recognition dataset,
using sample data built into the crowsetta package.
import crowsetta
example = crowsetta.data.get('birdsong-recognition-dataset')
birdsongrec = crowsetta.formats.seq.BirdsongRec.from_file(example.annot_path)
# we pass a fake samplerate to suppress a warning about not finding .wav files
annots = birdsongrec.to_annot(samplerate=32000)
print(
    f"Number of annotation instances in example 'birdsong-recognition-dataset' file: {len(annots)}"
)
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
To save these as a csv file, you can either call the pandas.DataFrame.to_csv method directly, or you can equivalently call the GenericSeq method to_file.
df.to_csv('./data/birdsong-rec-pandas.csv', index=False)
generic.to_file('./data/birdsong-rec-generic-seq.csv')
Now you have seen two different ways to create a GenericSeq instance from a set of annotations, and then save them to a csv file so anyone can work with them!