How to use crowsetta with your own annotation format#
This section shows you how to use crowsetta for working with your own annotation format for vocalizations (or some other format not currently built into the library).
Steps to using crowsetta with your own annotation format#
Below we’ll walk through a case study for using crowsetta with your annotation format. Here’s an outline of the steps we’ll go through:
1. get your annotations into some variables in Python (maybe you already wrote code to do this)
2. write a class that represents your format
3. register your new format by calling crowsetta.register_format
4. use your format with the crowsetta.Transcriber class to work with your annotations
If writing a class to represent your format sounds difficult, don’t worry. We show you how and crowsetta gives you tools that make it easy. If you’ve written some Python code to work with your annotations before then you’re probably already halfway done.
In this walkthrough we show you how to write a sequence-like format. By sequence-like, we mean a format that can be represented as a sequence of segments, where each segment has a start time, stop time, and label. Many common annotation formats for animal vocalizations are sequence-like.
Note
Sequence-like formats are one of the two types in crowsetta.
The other is bounding-box formats, which annotate sound events with boxes, as the name implies.
The steps for working with your own bounding box-like format would be exactly the same,
except below where we talk about crowsetta.Sequences, you should replace that with crowsetta.Bboxes.
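To make "sequence-like" concrete, here is a minimal sketch that uses only the standard library. The names `ToySegment` and `ToySequence` are hypothetical stand-ins made up for this illustration; they are not crowsetta's classes, which are introduced later in this walkthrough. The numbers are the first three segments of the example annotation file used below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySegment:
    """One annotated segment: a label plus start and stop times in seconds."""
    label: str
    onset_s: float
    offset_s: float


@dataclass(frozen=True)
class ToySequence:
    """An ordered series of segments, e.g. all calls annotated in one audio file."""
    segments: tuple

    @property
    def labels(self):
        # collect the label of every segment, in order
        return [seg.label for seg in self.segments]


toy_seq = ToySequence(segments=(
    ToySegment(label='1', onset_s=0.00297619, offset_s=0.14150433),
    ToySegment(label='1', onset_s=0.279125, offset_s=0.504625),
    ToySegment(label='5', onset_s=0.55564729, offset_s=0.59629167),
))
print(toy_seq.labels)
```

Any format whose annotations can be reduced to this shape (label, start, stop, repeated) counts as sequence-like for the purposes of this section.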
Case Study: the BatLAB format#
Let’s say you work in the Schumacher lab, studying bat vocalizations.
The lab research specialist, Alfred, has spent years writing an
application in LabVIEW to capture bat calls, called SoNAR (“Sound
and Neural Activity Recorder”). Alfred has also written a GUI in MATLAB
called BatLAB that lets you interactively annotate audio files
containing the bats’ calls, and saves the annotations in .mat
(MATLAB data) files.
You’ve started to work with Python to analyze your data, because you
like the data science and machine learning libraries. However, you find
yourself writing the same code over and over again to unpack the
annotations from the .mat files made by BatLAB. Every time you
use the code for a new analysis, you have to modify it slightly. The
code has some weird, hard-to-read lines to deal with the complicated
MATLAB structs created by BatLAB and how they load into
Python. The code also has several repetitive steps to deal with the
idiosyncrasies of how SoNAR and BatLAB save data: unit
conversion, data types, etcetera. You can’t change BatLAB or
SoNAR though, because that’s Alfred’s job, and everyone else’s code
that was written ten years ago (and still works!) expects those
idiosyncrasies.
You know that it’s a good idea to turn the code you wrote into a
function (because you took part in a Software
Carpentry workshop and then you read this paper).
You figure out which bits of the code will be common to all your
projects and you make that into a function, called parse_batlab_mat.
At first you just copy and paste it into all your projects. Then you
decide you also want to save everyone else in your lab the effort of
writing the same code, so you put the script on your lab’s GitHub page.
This is a step in the right direction, although parse_batlab_mat
gives you back a Python list of dicts, and you end up typing a
lot of things like:
labels = annot_list[0]['seg_types']
onsets = annot_list[0]['seg_start_times']
offsets = annot_list[0]['seg_end_times']
Typing all those very similar ['keys'] in particular gets kind of
annoying and makes you wonder if you should spend your vacation learning
how to use one of those hacker text editors like vim.
But before you can worry about that, you get back reviews of your paper
in PLOS Comp. Bio. called “Pidgeon Bat: Emergence of Dialects in
Colonies of Multiple Bat Species”. Reviewer #3 doesn’t buy your
conclusions (and you are pretty sure from the way they write that it is
Oswald Cobblepot, professor emeritus of ethology at Metropolitan
University of Fruitville, Florida, and author of the seminal review from
1982, “Bat Calls: A Completely Innate Behavior Encoded Genetically”).
You want to share your data with the world, mainly to mollify reviewer
#3. The problem is that this reviewer (if he is who you think he is)
only knows how to write Fortran code and is definitely not going to
figure out how to copy and use your function parse_batlab_mat so he
can run your analysis scripts and reproduce your figures for himself.
What you really want is to share your data and write your code in a way
that doesn’t depend on anyone knowing anything about BatLAB or SoNAR
and how those programs save data and annotations. This is
where crowsetta comes to your rescue.
Okay, now that we’ve set up some background for our case study, let’s go through the steps we outlined above.
1. get your annotation into some variables in Python#
Let’s look at this complicated data structure that we have our
annotation in. For this tutorial you’ll need the
file bat1_annotation.mat that you should be able to download
from this link or by going to vocalpy/crowsetta.
The BatLAB GUI saves annotation into these annotation.mat
files, with two variables in each mat file:
filenames: a vector where each element is the name of an audio file
annotations: a struct that has a record for each element in filenames, and that record is the annotation corresponding to the audio file with the same index in filenames
The following snippet will let you load and inspect the data:
from scipy.io import loadmat

bat1_annotation = loadmat('bat1_annotation.mat')
print('Variables in .mat file:',
      [var for var in list(bat1_annotation.keys())
       if not var.startswith('__')], '\n'
      )
print(f"First 3 filenames:\n{bat1_annotation['filenames'][:,:3]}\n")
print(f"First annotation:\n{bat1_annotation['annotations'][:,0]}")
Variables in .mat file: ['filenames', 'annotations']
First 3 filenames:
[[array(['lbr3009_0005_2017_04_27_06_14_46.wav'], dtype='<U36')
array(['lbr3009_0006_2017_04_27_06_14_57.wav'], dtype='<U36')
array(['lbr3009_0007_2017_04_27_06_15_07.wav'], dtype='<U36')]]
First annotation:
[array([[(array([[0.00297619, 0.279125 , 0.55564729, 0.62654167, 0.68429167,
0.73929167, 0.79429167, 0.85020833, 0.906125 , 0.96479167,
1.02345833, 1.07754167, 1.128875 , 1.19579167, 1.25354167]]), array([[0.14150433, 0.504625 , 0.59629167, 0.64945833, 0.70445833,
0.75945833, 0.83004167, 0.884125 , 0.94095833, 1.013375 ,
1.06654167, 1.11156764, 1.17654167, 1.23154167, 1.29020833]]), array([[1, 1, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]], dtype=uint8), array([[48000]]))]],
dtype=[('segFileStartTimes', 'O'), ('segFileEndTimes', 'O'), ('segType', 'O'), ('fs', 'O')]) ]
Below is the code you wrote to unpack the .mat files. Like we said
above, the code has some weird, hard-to-read lines to deal with the way
that the complicated MATLAB structs created by BatLAB load
into Python, such as calling tolist() just to unpack an array, and
some logic to make sure the labels get loaded correctly into a numpy
array. And the code has several repetitive steps to deal with the
idiosyncrasies of SoNAR and BatLAB, like converting the start
and stop times of the calls from seconds back to sample numbers
(by multiplying by the sampling rate, which is given in Hertz) so you can find
those times in the raw audio files.
# %load parsebat.py
import numpy as np
from scipy.io import loadmat


def parse_batlab_mat(mat_file):
    """parse batlab annotation.mat file"""
    mat = loadmat(mat_file, squeeze_me=True)
    annot_list = []
    # annotation structure loads as a Python dictionary with two keys
    # one maps to a list of filenames,
    # and the other to a Numpy array where each element is the annotation
    # corresponding to the filename at the same index in the list.
    # We can iterate over both by using the zip() function.
    for filename, annotation in zip(mat['filenames'], mat['annotations']):
        # below, .tolist() does not actually create a list,
        # instead gets ndarray out of a zero-dimensional ndarray of dtype=object.
        # This is just weirdness that results from loading complicated data
        # structure in .mat file.
        seg_start_times = annotation['segFileStartTimes'].tolist()
        seg_end_times = annotation['segFileEndTimes'].tolist()
        seg_types = annotation['segType'].tolist()
        if isinstance(seg_types, int):
            # this happens when there's only one syllable in the file
            # with only one corresponding label
            seg_types = np.asarray([seg_types])  # so make it a one-element array
        elif isinstance(seg_types, np.ndarray):
            # this should happen whenever there's more than one label
            pass
        else:
            # something unexpected happened
            raise ValueError("Unable to load labels from {}, because "
                             "the segType parsed as type {} which is "
                             "not recognized.".format(filename,
                                                      type(seg_types)))
        samp_freq = annotation['fs'].tolist()
        # despite the "_Hz" suffix, these are times converted to sample numbers:
        # seconds multiplied by the sampling rate (in Hz), then rounded
        seg_start_times_Hz = np.round(seg_start_times * samp_freq).astype(int)
        seg_end_times_Hz = np.round(seg_end_times * samp_freq).astype(int)
        annot_dict = {
            'audio_file': filename,
            'seg_types': seg_types,
            'seg_start_times': seg_start_times,
            'seg_end_times': seg_end_times,
            'seg_start_times_Hz': seg_start_times_Hz,
            'seg_end_times_Hz': seg_end_times_Hz,
        }
        annot_list.append(annot_dict)
    return annot_list
When it runs on a file, you end up with an annot_list where each
item in the list is an annot_dict that contains the annotations for
a file, like this:
annot_dict = {
    'seg_types': array([1, 1, 5, 2, ...]),
    'seg_start_times': array([0.00297619, 0.279125, 0.55564729,... ]),
    ...  # end times, start and end times in sample numbers (the keys ending in "_Hz")
}
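The unit conversion in parse_batlab_mat is just a multiplication by the sampling rate followed by rounding. Here is a minimal sketch with plain Python floats instead of numpy arrays, using the first three start times and the `fs` value from the annotation printed above. The variable names are chosen for this example only.

```python
fs = 48000  # sampling rate in Hz, from the 'fs' field of the annotation
onsets_s = [0.00297619, 0.279125, 0.55564729]  # segment start times in seconds

# seconds * (samples / second) = samples; round to the nearest whole sample
onset_samples = [round(t * fs) for t in onsets_s]
print(onset_samples)

# going back from samples to seconds loses only the rounding error,
# which is at most half a sample period
recovered_s = [n / fs for n in onset_samples]
max_error_s = max(abs(a - b) for a, b in zip(onsets_s, recovered_s))
print(max_error_s <= 0.5 / fs)
```

Doing this conversion in one place, inside the code that loads the format, is what saves you from repeating it in every analysis script.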
Again, as we said above, you turned your code into a function to make it easier to use across projects:
import numpy as np
from scipy.io import loadmat


def parse_batlab_mat(mat_file):
    """parse batlab annotation.mat file"""
    # code from above
    return annot_list
As we’ll see in a moment, all you need to do is take this code you
already wrote, and instead of returning your list of dicts,
you return a list of Sequences.
2. write a class that represents your format#
Since we are dealing with a sequence-like format,
what we ultimately want to do is convert the format into a crowsetta.Sequence,
so that it acts like all the other sequence-like formats.
So before we even worry about writing a class,
let’s see how to make some Sequences.
Use one of the Sequence “factory functions” to conveniently turn annotations in your format into Sequences#
First, to get the Sequence, we’ll use a “factory function”.
That means it’s a function built into the Sequence class that gives
us back a new instance of a Sequence. One such factory function is
Sequence.from_keyword. Here’s an example of using it with our parsebat
code from above:
from parsebat import parse_batlab_mat
from crowsetta.sequence import Sequence

# you, using the function you already wrote
annot_list = parse_batlab_mat(mat_file='bat1_annotation.mat')
# you have annotation from one file in an "annot_dict"
annot_dict = annot_list[0]
a_sequence = Sequence.from_keyword(labels=annot_dict['seg_types'],
                                   onsets_s=annot_dict['seg_start_times'],
                                   offsets_s=annot_dict['seg_end_times'],
                                   onset_samples=annot_dict['seg_start_times_Hz'],
                                   offset_samples=annot_dict['seg_end_times_Hz'])
print("a_sequence:\n", a_sequence)
a_sequence:
<Sequence with 15 segments>
Okay, now that we saw how to make a crowsetta.Sequence from our annotations
loaded into Python and numpy data types, let’s actually start writing our class.
Start to write the class#
Note
If the idea of writing a class is completely new to you, we suggest reading up on that first.
A very good place to start would be Think Python by Allen Downey, in particular the chapter on Classes and objects.
The first thing we want to do is just sketch out a class
that will represent the annotations loaded into good old-fashioned Python data types.
That way when we make a new instance of this class,
it will contain the annotations we loaded from a single file.
To make a class that holds our data, we use the attrs library.
Then we later add a couple methods to this class that do things
like turn the annotations into crowsetta.Sequences and crowsetta.Annotations.
To start writing the class, copy one of the existing classes in crowsetta, by looking at its code.
Here’s a stub of a Batlab class that we wrote by
copying the 'yarden' format and changing a couple things; we’ll explain what we changed below.
# %load -r 1-10,14-48 batlab.py
import pathlib
from typing import ClassVar

import attr
import numpy as np
import scipy.io

from crowsetta import Sequence, Annotation
from crowsetta.typing import PathLike
import crowsetta


@crowsetta.interface.SeqLike.register
@attr.define
class Batlab:
    """Example custom annotation format"""
    name: ClassVar[str] = 'batlab'
    ext: ClassVar[str] = '.mat'

    annotations: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
    audio_paths: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
    annot_path: pathlib.Path = attr.field(converter=pathlib.Path)

    @classmethod
    def from_file(cls,
                  annot_path: PathLike):
        """load BatLAB annotations from .mat file

        Parameters
        ----------
        annot_path : str, pathlib.Path
        """
        annot_path = pathlib.Path(annot_path)
        crowsetta.validation.validate_ext(annot_path, extension=cls.ext)

        annot_mat = scipy.io.loadmat(annot_path, squeeze_me=True)
        audio_paths = annot_mat['filenames']
        annotations = annot_mat['annotations']
        if len(audio_paths) != len(annotations):
            # use annot_path here: the name mat_path is not defined in this method
            raise ValueError(
                f'list of filenames and list of annotations in {annot_path} do not have the same length'
            )

        return cls(annotations=annotations,
                   audio_paths=audio_paths,
                   annot_path=annot_path)
This might seem overwhelming at first, but we only changed a few things.
To tell crowsetta how to handle our format,
we need to change exactly two of the class attributes:
(1) the name and (2) the ext.
The name attribute is the shorthand string name that we use to refer to our format,
for example when we call crowsetta.formats.by_name or we make a new Transcriber,
passing in this name as the format argument
(like so: scribe = crowsetta.Transcriber(format='name')).
The ext attribute tells crowsetta what a valid file extension is for this annotation format:
is it a .mat file or a .csv file? Can it be both ('txt', 'csv')?
We then use this attribute in other places in the class,
like when we write a from_file method, to validate the file name that gets passed into that method.
Re-writing the from_file method is the other thing we changed.
Notice that this from_file method is basically
the few lines from our parse_batlab_mat function,
where we unpack the filenames and the annotations from the .mat file.
We don’t loop through them to put them in a Python dict though.
Instead we pass them to the class, so that they become the attributes
audio_paths and annotations of the new instance.
You don’t have to completely understand what’s going on here;
basically we are writing our own “factory function” (like the one we used for Sequences above)
that gives us a new instance of our Batlab class
that will have the specific annotations from a file loaded into it.
(Writing a factory method is consistent with advice from the attrs docs).
To achieve this we use the @classmethod decorator and pass in cls
as the first argument (by convention)
that we can then call to create a new instance.
(To learn more about classmethods see https://realpython.com/instance-class-and-static-methods-demystified/).
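If the @classmethod factory pattern is new to you, here is a stripped-down sketch that uses only the standard library. `ToyFormat` and its CSV layout are hypothetical, invented for this illustration, and the extension check is a hand-rolled stand-in rather than crowsetta's own validation function.

```python
import csv
import pathlib
import tempfile
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class ToyFormat:
    """Hypothetical format class: holds the rows loaded from one annotation file."""
    ext: ClassVar[str] = '.csv'  # class attribute, shared by every instance

    rows: list
    annot_path: pathlib.Path

    @classmethod
    def from_file(cls, annot_path):
        """Factory method: validate the path, load the file, return a new instance."""
        annot_path = pathlib.Path(annot_path)
        if annot_path.suffix != cls.ext:
            raise ValueError(f"expected a '{cls.ext}' file, got: {annot_path.name}")
        with annot_path.open(newline='') as fp:
            rows = list(csv.DictReader(fp))
        # `cls` is the class itself, so calling it creates a new instance
        return cls(rows=rows, annot_path=annot_path)


# write a tiny annotation file to a temporary directory so the example is self-contained
tmp_dir = pathlib.Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'toy_annotation.csv'
csv_path.write_text('label,onset_s,offset_s\n1,0.003,0.142\n5,0.556,0.596\n')

toy = ToyFormat.from_file(csv_path)
print(len(toy.rows), toy.rows[0]['label'])
```

The point of the pattern is that callers never need to know how the file is parsed: they hand over a path and get back an object with the data already loaded.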
Now write a to_seq method that converts the annotations to crowsetta.Sequences#
Again, you pretty much already wrote this. Just take your
parse_batlab_mat function from above and change a couple lines.
First, you’re going to return a list of sequences instead of your
annot_list from before. You probably want to make that explicit in
your function.
# %load -r 51-96 batlab.py
    def to_seq(self):
        """unpack BatLAB annotation into list of Sequence objects

        example of a function that unpacks annotation from
        a complicated data structure and returns the necessary
        data as a Sequence object

        Returns
        -------
        seqs : list
            of Sequence objects
        """
        seqs = []
        # annotation structure loads as a Python dictionary with two keys
        # one maps to a list of filenames,
        # and the other to a Numpy array where each element is the annotation
        # corresponding to the filename at the same index in the list.
        # We can iterate over both by using the zip() function.
        for filename, annotation in zip(self.audio_paths, self.annotations):
            # below, .tolist() does not actually create a list,
            # instead gets ndarray out of a zero-dimensional ndarray of dtype=object.
            # This is just weirdness that results from loading complicated data
            # structure in .mat file.
            onsets_s = annotation['segFileStartTimes'].tolist()
            offsets_s = annotation['segFileEndTimes'].tolist()
            labels = annotation['segType'].tolist()
            if isinstance(labels, int):
                # this happens when there's only one syllable in the file
                # with only one corresponding label
                labels = np.asarray([labels])  # so make it a one-element array
            elif isinstance(labels, np.ndarray):
                # this should happen whenever there's more than one label
                pass
            else:
                # something unexpected happened
                raise ValueError("Unable to load labels from {}, because "
                                 "the segType parsed as type {} which is "
                                 "not recognized.".format(filename,
                                                          type(labels)))
            samp_freq = annotation['fs'].tolist()
            seq = Sequence.from_keyword(labels=labels,
                                        onsets_s=onsets_s,
                                        offsets_s=offsets_s)
            seqs.append(seq)
        return seqs
Then at the end of your main loop, instead of making your
annot_dict, you’ll make a new Sequence from each file using the
from_keyword factory function, append the new Sequence to your
seqs list, and then finally return that list of Sequences.
# %load -r 92-95 batlab.py
            seq = Sequence.from_keyword(labels=labels,
                                        onsets_s=onsets_s,
                                        offsets_s=offsets_s)
            seqs.append(seq)
If this still feels too wordy and repetitive for you, you can put
segFileStartTimes, segFileEndTimes, et al., into a Python
dict with keys corresponding to the parameters for
Sequence.from_keyword:
annot_dict = {
    'onsets_s': annotation['segFileStartTimes'].tolist(),
    'offsets_s': annotation['segFileEndTimes'].tolist(),
    'labels': labels,
}
Note here that you only have to specify the onsets and offsets of segments either in seconds or in sample number (but you can specify both if you want).
Then use another factory function, Sequence.from_dict, to
create the Sequence.
seq_list.append(Sequence.from_dict(annot_dict))
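The reason the dictionary keys have to match the keyword parameters is that a from_dict style factory can simply unpack the dictionary with `**`. The sketch below shows that pattern with a hypothetical `ToySeq` class written for this example. We are assuming, not asserting, that crowsetta's own from_dict works along the same lines.

```python
class ToySeq:
    """Hypothetical stand-in for a sequence class; not crowsetta's Sequence."""

    def __init__(self, labels, onsets_s, offsets_s):
        self.labels = list(labels)
        self.onsets_s = list(onsets_s)
        self.offsets_s = list(offsets_s)

    @classmethod
    def from_keyword(cls, labels, onsets_s, offsets_s):
        """Factory that checks the arguments line up before building an instance."""
        if not (len(labels) == len(onsets_s) == len(offsets_s)):
            raise ValueError('labels, onsets_s, and offsets_s must have the same length')
        return cls(labels, onsets_s, offsets_s)

    @classmethod
    def from_dict(cls, seq_dict):
        """Factory that forwards the dict's items as keyword arguments."""
        return cls.from_keyword(**seq_dict)


good_dict = {
    'labels': ['1', '1', '5'],
    'onsets_s': [0.003, 0.279, 0.556],
    'offsets_s': [0.142, 0.505, 0.596],
}
toy_seq = ToySeq.from_dict(good_dict)
print(toy_seq.labels)

# a key that is not a parameter of from_keyword is rejected by Python itself
bad_dict = dict(good_dict, note='not a parameter')
try:
    ToySeq.from_dict(bad_dict)
    rejected = False
except TypeError as err:
    rejected = True
    print('rejected:', err)
```

With a factory written this way, any extra key in the dictionary raises a TypeError, so it pays to keep the dictionary limited to the documented parameters.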
Now you have a Batlab class with a to_seq method
that takes the annotations loaded from a file and returns Sequences.
You want to put this in a file that ends with .py, e.g., batlab.py
(otherwise known as a Python module).
To see the entire example, check out the batlab.py file
(and compare it with parsebat.py).
3. Register your new format by calling crowsetta.register_format#
To make it so that crowsetta knows about your format,
you call the crowsetta.register_format function
and pass in the class you have written.
After you do this, you should see that the shorthand string name
that you defined for the class appears in the list
returned when you call crowsetta.formats.as_list().
import crowsetta
import batlab
crowsetta.register_format(batlab.Batlab)
crowsetta.formats.as_list()
formats_list = crowsetta.formats.as_list()
assert batlab.Batlab.name in formats_list  # no AssertionError, because `batlab` is in the list
Use crowsetta.register_format as a decorator#
Instead of calling crowsetta.register_format “manually”,
you can use it as a decorator,
which causes your class to get registered automatically when you import the module.
To use crowsetta.register_format as a decorator,
we write it with the @ symbol at the top of our class:
@crowsetta.formats.register_format
@crowsetta.interface.SeqLike.register
@attr.define
class Batlab:
    """Example custom annotation format"""
    name: ClassVar[str] = 'batlab'
    ext: ClassVar[str] = '.mat'
(If you’re unfamiliar with decorators, check out this primer: https://realpython.com/primer-on-python-decorators/)
This works because class decorators are executed when the class definition runs,
which happens when the module is imported.
Notice we also used other decorators: another from crowsetta
that registers our class as sequence-like,
and one from attrs that helps us easily define a class.
You need to make sure that register_format is the outermost decorator,
because decorators are applied “inside-out”: the one closest to the class runs first,
so the outermost one receives the finished class.
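To see the “inside-out” order in action, here is a small standard-library sketch. The `define` and `register_format` functions below are toy decorators written for this example; they are not the attrs or crowsetta implementations, and the `FORMATS` dictionary is a made-up registry.

```python
applied = []   # records the order in which the decorators run
FORMATS = {}   # toy registry that maps a format name to its class


def define(cls):
    """Toy class-building decorator that returns a *new* class object."""
    applied.append('define')
    return type(cls.__name__, (cls,), {'is_defined': True})


def register_format(cls):
    """Toy registering decorator: stores whatever class it receives."""
    applied.append('register_format')
    FORMATS[cls.name] = cls
    return cls


@register_format   # outermost: runs last, receives the class returned by `define`
@define            # innermost: runs first
class ToyBatlab:
    name = 'toybatlab'


print(applied)
print(FORMATS['toybatlab'] is ToyBatlab)
```

Because the registering decorator runs last, the class stored in the registry is the same object that the name `ToyBatlab` ends up bound to. If the two decorators were swapped, the registry would hold the class as it looked before `define` replaced it.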
Here’s just the top few lines from an updated version of our class
where we apply crowsetta.register_format as a decorator.
# %load -r 1-22 batlab.py
import pathlib
from typing import ClassVar

import attr
import numpy as np
import scipy.io

from crowsetta import Sequence, Annotation
from crowsetta.typing import PathLike
import crowsetta


@crowsetta.formats.register_format
@crowsetta.interface.SeqLike.register
@attr.define
class Batlab:
    """Example custom annotation format"""
    name: ClassVar[str] = 'batlab'
    ext: ClassVar[str] = '.mat'

    annotations: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
    audio_paths: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
4. Use your format with the crowsetta.Transcriber class to work with your annotations#
If you have worked with Crowsetta already, or gone through the
tutorial, you know that we can work with a Transcriber that does the
work of making Sequences of Segments from annotation files
for us. We create a new instance of a Transcriber by writing
something like this:
scribe = crowsetta.Transcriber(format='name')
You will do the same thing here:
scribe = crowsetta.Transcriber(format='batlab')
Here’s what it looks like to do all of that in a few lines of code:
import crowsetta
import batlab # gets registered automatically when we import, because of decorator
scribe = crowsetta.Transcriber(format='batlab')
seq_list = scribe.from_file('bat1_annotation.mat').to_seq()
And now, just like you do with the built-in formats, you get back a list
of Sequences from your format:
print(f'First item in seq_list: {seq_list[0]}')
print(f'First segment in first sequence:\n{seq_list[0].segments[0]}')
First item in seq_list: <Sequence with 15 segments>
First segment in first sequence:
Segment(label='1', onset_s=0.0029761904761904934, offset_s=0.14150432900432905, onset_sample=None, offset_sample=None)
Summary#
Now you have seen in detail the process of working with your own
annotation format in Crowsetta. Here’s a review of the steps, with
some code snippets worked in to tie it all together:
1. get your annotations into some variables in Python (maybe you already wrote code to do this)
2. write a class that represents your format
3. register your new format by calling crowsetta.register_format
4. use your format with the crowsetta.Transcriber class to work with your annotations
Steps 1-3 will give you something like this in a file named something
like myformat.py:
import pathlib
from typing import ClassVar

import attr

from crowsetta import Sequence, Annotation
from crowsetta.typing import PathLike
import crowsetta


@crowsetta.formats.register_format
@crowsetta.interface.SeqLike.register
@attr.define
class MyFormat:
    """Example custom annotation format"""
    name: ClassVar[str] = 'myformat'
    ext: ClassVar[str] = '.csv'
    ...

    @classmethod
    def from_file(cls, annot_path):
        ...
        return cls(annotations, annot_path)
    ...

    def to_seq(self):
        seq_list = []
        for annotation in self.annotations:
            # load annotation into some Python variables, e.g. a dictionary
            annot_dict = magic_annotation_unpacking_function(annotation)
            seq = Sequence.from_dict(annot_dict)
            seq_list.append(seq)
        return seq_list

    def to_annot(self):
        seqs = self.to_seq()
        annots = []
        for seq in seqs:
            annots.append(Annotation(seq=seq, annot_path=self.annot_path))
        return annots
and then as in step 4, you will be able to
make a Transcriber that knows to use this class when you tell
it you want to turn your annotation files into Sequences or Annotations.
scribe = crowsetta.Transcriber(format='myformat')
myformat = scribe.from_file('my-annotations.csv')
seq_list = myformat.to_seq()