How to convert any sequence-like format to 'generic-seq'#
A goal of crowsetta is to make it easier to share annotations
for a dataset of animal vocalizations or other bioacoustics data.
One way to achieve this is to
convert the annotations to a single flat csv file,
which is easy to share and work with,
e.g., using the pandas library.
For sequence-like annotations,
this can be done by converting them to the 'generic-seq' format.
This how-to walks you through converting annotations to the 'generic-seq' format
and then saving those annotations as a csv file.
As its name suggests,
'generic-seq' is meant to be a generic sequence-like format
that all other sequence-like formats can be converted to.
Workflow#
Here’s the general workflow. We’ll see a few different ways to achieve it below.
1. Load annotations in your format
2. Convert those to crowsetta.Annotation instances
3. Make a crowsetta.formats.seq.GenericSeq from those Annotations
4. Save to a csv file using the crowsetta.formats.seq.generic.GenericSeq.to_file method
This works because crowsetta represents a set of annotations in 'generic-seq' format
as a list of crowsetta.Annotation instances,
where each Annotation has a crowsetta.Sequence.
Since all sequence-like formats have a to_annot method,
they can all be converted to 'generic-seq'.
In turn, this means that any sequence-like format can be converted to a flat .csv file
by creating a 'generic-seq' instance with the Annotations produced by calling to_annot,
and then calling the to_file method of the 'generic-seq' instance.
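To make "flat" concrete, here is a minimal sketch of the idea in plain Python. The `ToySegment` and `ToyAnnotation` dataclasses and the `flatten` function are hypothetical stand-ins written for this illustration, not crowsetta's classes or its implementation. The column names mirror the ones in the DataFrames shown later in this how-to: each segment becomes one row, and the `annotation` column records which annotation the row came from.

```python
from dataclasses import dataclass


@dataclass
class ToySegment:
    """Hypothetical stand-in for one labeled segment."""
    label: str
    onset_s: float
    offset_s: float


@dataclass
class ToyAnnotation:
    """Hypothetical stand-in: one annotation holding one sequence of segments."""
    annot_path: str
    segments: list[ToySegment]


def flatten(annots: list[ToyAnnotation]) -> list[dict]:
    """Turn nested annotations into flat rows, one row per segment."""
    rows = []
    for annot_ind, annot in enumerate(annots):
        for segment in annot.segments:
            rows.append(
                {
                    "label": segment.label,
                    "onset_s": segment.onset_s,
                    "offset_s": segment.offset_s,
                    "annot_path": annot.annot_path,
                    "sequence": 0,  # this sketch assumes one sequence per annotation
                    "annotation": annot_ind,
                }
            )
    return rows


annots = [
    ToyAnnotation("a.txt", [ToySegment("SIL", 0.0, 0.35), ToySegment("call", 0.35, 0.664)]),
    ToyAnnotation("b.txt", [ToySegment("K", 23.379, 23.616)]),
]
rows = flatten(annots)
print(len(rows))
```

Every row carries enough information to be read on its own, which is what makes the resulting table easy to share as a single csv file.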
Converting a sequence-like format with a single annotation file per annotated file#
The first example we show is for possibly the most common case,
where each annotated file has a single annotation file.
This is likely to be the case if you are using apps like Praat or Audacity.
An example of such a format is the Audacity standard label track format,
exported to .txt files, that you would get if you were to annotate with region labels.
This format is represented by the crowsetta.formats.seq.AudSeq class in crowsetta.
As described above,
all you need to do is load your sequence-like annotations with crowsetta,
and then call the to_annot method to convert them to a crowsetta.Annotation instance.
When working with a format where there’s one annotation file per annotated file,
this does mean you need to load each file
and convert it into a separate annotation instance.
(Below we’ll see an example of a format
where annotations for multiple files are contained in a single annotation file,
and so we only need to call to_annot once after loading it
to get a list of crowsetta.Annotations.)
For this first example, where we have multiple annotation files,
we use a loop to load each one and convert it to a crowsetta.Annotation instance.
We use the same dataset we used in the Tutorial for this example, “Labeled songs of domestic canary M1-2016-spring (Serinus canaria)” by Giraudon et al., 2021, annotated with Audacity Labeltrack files.
cd ..
First we download and extract the dataset, if we haven’t already.
!curl --no-progress-meter -L 'https://zenodo.org/record/6521932/files/M1-2016-spring_audacity_annotations.zip?download=1' -o './data/M1-2016-spring_audacity_annotations.zip'
import shutil
shutil.unpack_archive('./data/M1-2016-spring_audacity_annotations.zip', './data/giraudon-et-al-2021')
Now we load the annotation files.
import pathlib
import crowsetta
audseq_paths = sorted(pathlib.Path('./data/giraudon-et-al-2021/audacity-annotations').glob('*.txt'))
# we make the list of ``Annotation``s "by hand" instead of getting it from a `to_annot` call
annots = []
for audseq_path in audseq_paths:
    annots.append(
        crowsetta.formats.seq.AudSeq.from_file(audseq_path).to_annot()
    )
print(
    f"Number of annotation instances from dataset: {len(annots)}"
)
Number of annotation instances from dataset: 459
We create a set of annotations in the 'generic-seq' format by making an instance of the GenericSeq class, passing in our list of crowsetta.Annotation instances.
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
Created 'generic-seq' from annotations
First five rows of annotations (converted to pandas.DataFrame)
| | label | onset_s | offset_s | notated_path | annot_path | sequence | annotation |
|---|---|---|---|---|---|---|---|
| 0 | SIL | 0.000 | 0.350 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 1 | call | 0.350 | 0.664 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 2 | SIL | 0.664 | 1.359 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 3 | Z | 1.359 | 2.412 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
| 4 | SIL | 2.412 | 2.488 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 0 |
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
Last five rows of annotations (converted to pandas.DataFrame)
| | label | onset_s | offset_s | notated_path | annot_path | sequence | annotation |
|---|---|---|---|---|---|---|---|
| 23303 | SIL | 23.283 | 23.379 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23304 | K | 23.379 | 23.616 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23305 | V | 23.616 | 25.127 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23306 | SIL | 25.127 | 25.161 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
| 23307 | O | 25.161 | 25.402 | None | data/giraudon-et-al-2021/audacity-annotations/... | 0 | 458 |
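Once the annotations are in a flat table, ordinary pandas operations apply. The following is a minimal sketch, not part of crowsetta: it builds a small DataFrame by hand from a few of the rows shown above (in practice you would use the `df` returned by `generic.to_df()`), then computes segment durations and groups by the `annotation` and `label` columns. The `duration_s` column name is made up for this illustration.

```python
import pandas as pd

# A few rows copied from the tables above; in practice, use `df = generic.to_df()`.
df = pd.DataFrame(
    {
        "label": ["SIL", "call", "SIL", "K", "V"],
        "onset_s": [0.000, 0.350, 0.664, 23.379, 23.616],
        "offset_s": [0.350, 0.664, 1.359, 23.616, 25.127],
        "annotation": [0, 0, 0, 458, 458],
    }
)

# Duration of each segment, in seconds
df["duration_s"] = df["offset_s"] - df["onset_s"]

# Number of segments in each annotation (that is, in each annotated file)
segments_per_annot = df.groupby("annotation").size()

# Total time per label, across all annotations
time_per_label = df.groupby("label")["duration_s"].sum()

print(segments_per_annot)
print(time_per_label)
```

Because the `annotation` column identifies the source annotation for every row, per-file statistics only need a `groupby`.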
Converting a sequence-like format with multiple annotations per file#
Some formats contain multiple annotations per file,
and the to_annot method of the corresponding class
will return multiple crowsetta.Annotation instances.
To convert this format to 'generic-seq',
just pass in those Annotations when creating an instance of 'generic-seq'.
We demonstrate that here with the format of the Birdsong-Recognition dataset,
using sample data built into the crowsetta package.
import crowsetta
crowsetta.data.extract_data_files()
example = crowsetta.data.get('birdsong-recognition-dataset')
birdsongrec = crowsetta.formats.seq.BirdsongRec.from_file(example.annot_path)
# we pass a fake samplerate to suppress a warning about not finding .wav files
annots = birdsongrec.to_annot(samplerate=32000)
print(
    f"Number of annotation instances in example 'birdsong-recognition-dataset' file: {len(annots)}"
)
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
Number of annotation instances in example 'birdsong-recognition-dataset' file: 135
Created 'generic-seq' from annotations
First five rows of annotations (converted to pandas.DataFrame)
| | label | onset_s | offset_s | onset_sample | offset_sample | notated_path | annot_path | sequence | annotation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.070 | 1.154 | 34240 | 36928 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 0 |
| 1 | 0 | 1.258 | 1.345 | 40256 | 43040 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 0 |
| 2 | 0 | 1.467 | 1.555 | 46944 | 49760 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 0 |
| 3 | 1 | 1.659 | 1.732 | 53088 | 55424 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 0 |
| 4 | 0 | 1.814 | 1.878 | 58048 | 60096 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 0 |
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
Last five rows of annotations (converted to pandas.DataFrame)
| | label | onset_s | offset_s | onset_sample | offset_sample | notated_path | annot_path | sequence | annotation |
|---|---|---|---|---|---|---|---|---|---|
| 7647 | 6 | 11.844 | 11.952 | 378992 | 382448 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 134 |
| 7648 | 2 | 12.006 | 12.092 | 384176 | 386928 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 134 |
| 7649 | 0 | 12.148 | 12.236 | 388752 | 391536 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 134 |
| 7650 | 7 | 12.266 | 12.316 | 392528 | 394128 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 134 |
| 7651 | 8 | 12.594 | 12.646 | 402992 | 404688 | /home/docs/.local/share/crowsetta/5.0.2.post1/... | /home/docs/.local/share/crowsetta/5.0.2.post1/... | 0 | 134 |
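The `onset_sample` / `offset_sample` columns and the `onset_s` / `offset_s` columns describe the same times in different units, related by the sampling rate. The check below is a small sketch using the first two rows shown above and the `samplerate=32000` passed to `to_annot`. It assumes that times in seconds are sample numbers divided by the sampling rate and rounded to three decimal places; the rounding is an assumption inferred from the displayed values, not something stated in this how-to.

```python
SAMPLERATE = 32000  # the value passed to `to_annot` above

# (onset_sample, offset_sample, onset_s, offset_s), copied from the first two rows
rows = [
    (34240, 36928, 1.070, 1.154),
    (40256, 43040, 1.258, 1.345),
]

for onset_sample, offset_sample, onset_s, offset_s in rows:
    # convert sample numbers to seconds and compare with the table
    assert round(onset_sample / SAMPLERATE, 3) == onset_s
    assert round(offset_sample / SAMPLERATE, 3) == offset_s

print("seconds = samples / samplerate holds for these rows")
```

If that relationship holds in general, the values in seconds are only meaningful when the samplerate matches the audio files, so it is worth passing the real sampling rate when you know it.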
To save these as a csv file, you can either call the pandas.DataFrame.to_csv method directly, or you can equivalently call the GenericSeq method to_file.
df.to_csv('./data/birdsong-rec-pandas.csv', index=False)
generic.to_file('./data/birdsong-rec-generic-seq.csv')
Now you have seen two different ways to create a GenericSeq instance from a set of annotations, and then save them to a csv file so anyone can work with them!
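Because the result is a plain csv file, it can also be read without crowsetta or pandas. The sketch below uses only the Python standard library to group rows back into one list of segments per annotation. It assumes the csv has the column names shown in the DataFrames above (`label`, `onset_s`, `offset_s`, `annotation`); the helper name `segments_by_annotation` and the inline sample text are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def segments_by_annotation(csv_file):
    """Group rows of a flat annotation csv by the 'annotation' column.

    Returns a dict mapping annotation index to a list of
    (label, onset_s, offset_s) tuples, in file order.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        segment = (row["label"], float(row["onset_s"]), float(row["offset_s"]))
        grouped[int(row["annotation"])].append(segment)
    return dict(grouped)


# Inline sample standing in for a saved file; with a real file, use
# `with open(csv_path, newline="") as f: segments_by_annotation(f)`
sample = io.StringIO(
    "label,onset_s,offset_s,sequence,annotation\n"
    "SIL,0.0,0.35,0,0\n"
    "call,0.35,0.664,0,0\n"
    "K,23.379,23.616,0,458\n"
)
grouped = segments_by_annotation(sample)
print({annot: len(segs) for annot, segs in grouped.items()})
```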