parlai.core.teachers¶
This module provides a set of teachers that deal with dialog.
FixedDialogTeacher(Teacher)
Base class for teachers in tasks that have fixed dialog - i.e., dialog that is not dynamically generated but rather is pulled from set examples. However, the class can be extended to all tasks involving fixed data. Implements much of the basic functionality of these teachers, including observe(), act(), and next_example().
DialogTeacher(FixedDialogTeacher)
Base teacher class for doing dialog specifically with fixed chat logs.
ParlAIDialogTeacher(FixedDialogTeacher)
Teacher class that provides access to data in the ParlAI Dialog format. See the class description for more details.
ConversationTeacher(DialogTeacher)
Teacher class that provides access to data in the Conversations format. See the class description for more details.
FbDeprecatedDialogTeacher(DialogTeacher)
Teacher class that provides access to data in the Facebook Dialog format. See the class description for more details. This class is deprecated.
This module also includes DataLoader, a threadpool data loader for FixedDialogTeacher, and DialogData/StreamDialogData, data structures for accessing textual dialog data and utilized by DialogTeacher.
- class parlai.core.teachers.DataLoader(opt)[source]¶
Bases: Thread
A worker thread that provides a threadpool for data loading.
A teacher may submit a request to the loader, which will return the appropriate data.
To submit a request, a teacher should call request_load.
- __init__(opt)[source]¶
This constructor should always be called with keyword arguments. Arguments are:
group should be None; reserved for future extension when a ThreadGroup class is implemented.
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
name is the thread name. By default, a unique name is constructed of the form “Thread-N” where N is a small decimal number.
args is the argument tuple for the target invocation. Defaults to ().
kwargs is a dictionary of keyword arguments for the target invocation. Defaults to {}.
If a subclass overrides the constructor, it must make sure to invoke the base class constructor (Thread.__init__()) before doing anything else to the thread.
- request_load(receive_fn, load_fn, args)[source]¶
Queue a request for loading.
- Parameters
receive_fn – a receive function (for receiving the data)
load_fn – a load function (for loading the data)
args – arguments for the load function. args can be either a dictionary of arguments for a function, or a list of positional arguments
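The request flow described above can be sketched with the standard library alone. This is a minimal illustration, not ParlAI's implementation: `LoaderSketch` and `add` are hypothetical names, and the use of `ThreadPoolExecutor` is an assumption made to keep the example self-contained. It shows `args` being accepted either as a dictionary of keyword arguments or as a list of positional arguments, with the receive function called on the finished future.

```python
from concurrent.futures import ThreadPoolExecutor


class LoaderSketch:
    """Hypothetical stand-in for a threadpool data loader (not ParlAI's class)."""

    def __init__(self, num_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=num_workers)

    def request_load(self, receive_fn, load_fn, args):
        # args may be a dict of keyword arguments or a list of positional ones.
        if isinstance(args, dict):
            future = self._pool.submit(load_fn, **args)
        else:
            future = self._pool.submit(load_fn, *args)
        # receive_fn is called with the finished Future.
        future.add_done_callback(receive_fn)
        return future

    def shutdown(self):
        # Block until every queued load (and its callback) has finished.
        self._pool.shutdown(wait=True)


def add(a, b):
    # Hypothetical load function used only for this illustration.
    return a + b


received = []
loader = LoaderSketch()
loader.request_load(lambda f: received.append(f.result()), add, [2, 3])
loader.request_load(lambda f: received.append(f.result()), add, {"a": 10, "b": 20})
loader.shutdown()
```

The order in which results arrive is not guaranteed, which is why a receiving function usually places results on a queue rather than assuming a position.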
- class parlai.core.teachers.Teacher(opt: Opt, shared=None)[source]¶
Bases: Agent
Basic Teacher agent that keeps track of how many times it’s received messages.
Teachers provide the report() method to get back metrics.
- num_examples()[source]¶
Return the number of examples (e.g. individual utterances) in the dataset.
Default implementation returns None, indicating an unknown number.
- num_episodes()[source]¶
Return the number of episodes (e.g. conversations) in the dataset.
Default implementation returns None, indicating an unknown number.
In addition to default Agent shared parameters, share metrics.
- class parlai.core.teachers.FixedDialogTeacher(opt, shared=None)[source]¶
Bases: Teacher
A teacher agent for all teachers involved in tasks with fixed data.
This class provides the following functionality for its subclasses:
Resets a teacher
Provides an observe method
Computes and retrieves the next episode index for a teacher
Provides a threadpool option for loading data (especially useful for large data, e.g. images)
In order to take advantage of the first few features, all a subclass has to implement is three functions: num_episodes, num_examples, and get (which returns a specific example from a specific episode).
To utilize the DataLoader for threadpool loading, a teacher should implement the submit_load_request function to send a load request to the DataLoader by calling self.data_loader.request_load with the appropriate arguments (receive_fn, load_fn, args). The DataLoader then returns the data to the teacher’s data_queue, which the teacher can poll in its act method.
The following is an example of the DataLoader usage in the VQA-V1 teacher.
In the teacher’s init function, the teacher calls its submit_load_request function to preload an image.
The submit_load_request function gets the next episode_idx, and computes the image path for the load request.
At the end of submit_load_request, the teacher calls self.data_loader.request_load with three args:
self.receive_data - the function that the DataLoader calls to return the loaded object
self.image_loader.load - the function used to load the image from the image path
[img_path] - a list of arguments for the load function, which in this case is the path of the image.
In the teacher’s act function, the teacher loads the data from its data queue.
At the end of the act function, the teacher calls submit_load_request to preload an image for the next example.
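The preloading steps above can be sketched as follows. This is a hedged, standard-library illustration rather than the VQA-V1 teacher itself: `PreloadingTeacherSketch` and `fake_image_load` are hypothetical names, and a single-worker `ThreadPoolExecutor` stands in for the DataLoader.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def fake_image_load(path):
    # Hypothetical stand-in for an image loader; returns a placeholder string.
    return f"pixels<{path}>"


class PreloadingTeacherSketch:
    """Illustrative only: preload the next item while the current one is used."""

    def __init__(self, image_paths):
        self.image_paths = image_paths
        self.episode_idx = -1
        self.data_queue = queue.Queue()
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Preload the first image during construction.
        self.submit_load_request()

    def next_episode_idx(self):
        self.episode_idx = (self.episode_idx + 1) % len(self.image_paths)
        return self.episode_idx

    def submit_load_request(self):
        # Compute the path for the next episode and queue the load.
        img_path = self.image_paths[self.next_episode_idx()]
        future = self._pool.submit(fake_image_load, img_path)
        future.add_done_callback(self.receive_data)

    def receive_data(self, future):
        # Called with the finished Future; hand the result to act().
        self.data_queue.put(future.result())

    def act(self):
        # Take the preloaded item, then immediately preload the next one.
        image = self.data_queue.get(timeout=5)
        self.submit_load_request()
        return {"image": image}

    def shutdown(self):
        self._pool.shutdown(wait=True)


teacher = PreloadingTeacherSketch(["a.jpg", "b.jpg"])
first = teacher.act()
second = teacher.act()
teacher.shutdown()
```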
To see this in action, take a look at this teacher in tasks.vqa_v1.agents.
- submit_load_request()[source]¶
Submit a load request.
An agent should implement this method to submit requests to the data loader. At the end of this method, the agent should call self.data_loader.request_load() with the appropriate args.
By default, this method does nothing.
- receive_data(future: Future)[source]¶
Receive data from the data loader.
- Parameters
future – result from the load request.
Share the data and dataloader.
- next_episode_idx(num_eps=None, loop=None)[source]¶
Return the next episode index.
- Parameters
num_eps – default None uses num_episodes value.
loop – default None loops during training but not evaluation.
- next_example()[source]¶
Return the next example.
If there are multiple examples in the same episode, returns the next one in that episode. If that episode is over, gets a new episode index and returns the first example of that episode.
- get(episode_idx, entry_idx=0)[source]¶
Get the specified episode and the specified entry in that episode.
Children must override this method in order to inherit the next_example method.
- Parameters
episode_idx – which episode to return examples from
entry_idx – which example to return from the episode. Many datasets have only single-entry episodes, so this defaults to zero.
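To illustrate how `num_episodes`, `num_examples`, and `get` can be enough to drive `next_example`, here is a minimal standalone sketch. It does not subclass any ParlAI class: `FixedDataSketch` is a hypothetical name, and the always-looping episode order is an assumption chosen for brevity.

```python
class FixedDataSketch:
    """Illustrative only: episode/entry bookkeeping over in-memory data."""

    def __init__(self, episodes):
        # episodes is a list of episodes; each episode is a list of dicts.
        self.episodes = episodes
        self.episode_idx = -1
        self.entry_idx = 0
        self.episode_done = True

    def num_episodes(self):
        return len(self.episodes)

    def num_examples(self):
        return sum(len(episode) for episode in self.episodes)

    def get(self, episode_idx, entry_idx=0):
        episode = self.episodes[episode_idx]
        example = dict(episode[entry_idx])
        # Mark the last entry so the caller knows when to switch episodes.
        example["episode_done"] = entry_idx == len(episode) - 1
        return example

    def next_example(self):
        if self.episode_done:
            # Previous episode finished: move to the next one (looping).
            self.episode_idx = (self.episode_idx + 1) % self.num_episodes()
            self.entry_idx = 0
        else:
            self.entry_idx += 1
        example = self.get(self.episode_idx, self.entry_idx)
        self.episode_done = example["episode_done"]
        return example


data = FixedDataSketch([[{"text": "a1"}, {"text": "a2"}], [{"text": "b1"}]])
texts = [data.next_example()["text"] for _ in range(4)]
```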
- custom_evaluation(teacher_action: Message, labels: Optional[Tuple[str]], model_response: Message) → None[source]¶
A method designated for hooking custom evaluations into teachers.
Generally, a user will want to use self.metrics.add to record any specialized metrics that only make sense for this one dataset.
- Parameters
teacher_action – The message last sent from this teacher.
labels – The previous correct labels, if there were any.
model_response – The raw response from the model. Generally you want to rely on the text field, but others may be necessary in specific situations.
- get_orig_action() → Message[source]¶
Get the unprocessed action and reset if needed.
This function will return the raw action from self.next_example(), before the self.last_act and self.lastY attributes have been defined based on this action for metrics or custom evaluations. This is so that wrapper teachers can modify the raw action first, such as to change the contents of its ‘text’ and ‘label’ fields, without the action becoming out of sync with self.last_act and self.lastY.
- class parlai.core.teachers.DialogTeacher(opt, shared=None)[source]¶
Bases: FixedDialogTeacher
A base teacher class for doing dialog with fixed chat logs.
This class provides a set of basic functionality:
uses data class to store and query text data
generates action tables to send to the student agent from the data
In order to subclass this class, you must implement setup_data() in your class, which reads your data file as an iterator.
- abstract setup_data(datafile: str)[source]¶
The core method which the user should override.
Yields the data, one message at a time, as well as markers indicating new episodes.
- Parameters
datafile (str) – If the initializer set a ‘datafile’ field within the initialization, this will be provided here. Otherwise, datafile will be the fold: either “train”, “valid”, or “test”.
- Returns
Yields pairs (message, new_episode) containing a Message object and whether the message marks the beginning of a totally new episode.
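A generator with the shape described above might look like the following sketch. The input format is an assumption made purely for illustration (one `text<TAB>label` pair per line, with a blank line separating episodes), plain dicts stand in for Message objects, and `setup_data_sketch` is a hypothetical name.

```python
def setup_data_sketch(lines):
    """Yield (message, new_episode) pairs from an iterable of text lines."""
    new_episode = True
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumed convention: a blank line ends the current episode.
            new_episode = True
            continue
        text, label = line.split("\t")
        yield {"text": text, "labels": [label]}, new_episode
        new_episode = False


pairs = list(
    setup_data_sketch(["hi\thello", "how are you?\tfine", "", "bye\tsee you"])
)
```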
Share the data.
- class parlai.core.teachers.DialogData(opt, data_loader=None, cands=None, shared=None, **kwargs)[source]¶
Bases: object
Provides a data structure for accessing textual dialog data.
This can be used whenever the dialog data is a fixed log of chats (i.e., not a simulator setting). The logs can include dialog text and possibly supervised labels, candidate labels and rewards.
All these are stored in this internal data format which is used by the DialogTeacher class.
- Parameters
opt – options to initialize the class
data_loader – an iterable with each call returning a tuple in the form ((x, y, r, c, i), new_episode?) where the x and new_episode fields are mandatory and other fields may be omitted or None.
cands – can be set to provide a list of candidate labels for every example in this dataset, which the agent can choose from (the correct answer should be in this set).
random – tells the data class whether or not to visit episodes sequentially or randomly when returning examples to the caller.
The contents of the ((x, y, r, c, i), new_episode?) tuples returned by the data loader are as follows:
x (str) is a query and possibly context
y (iter) is an iterable of label(s) for that query
r (str) is the str reward for getting that query correct
c (iter) is an iterable of label candidates that the student can choose from
i (str) is a str path to an image on disk, which will be loaded by the data class at request-time. It should always point to the raw image file.
new_episode? (bool) is a boolean value specifying whether that example is the start of a new episode. If you don’t use episodes, set this to True every time.
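As a rough illustration of the tuple format, the sketch below groups a stream of `((x, y, r, c, i), new_episode?)` items into episodes and pads omitted trailing fields with None. `build_episodes` and `count_examples` are hypothetical helpers written for this example and are not part of the documented class.

```python
def build_episodes(data_loader):
    """Group ((x, y, r, c, i), new_episode) items into a list of episodes."""
    episodes = []
    for entry, new_episode in data_loader:
        # Pad omitted trailing fields so every entry is (x, y, r, c, i).
        entry = tuple(entry) + (None,) * (5 - len(entry))
        if new_episode or not episodes:
            episodes.append([])
        episodes[-1].append(entry)
    return episodes


def count_examples(episodes):
    # Each episode has at least one entry, but might have many more.
    return sum(len(episode) for episode in episodes)


stream = [
    (("Where is the milk?", ["kitchen"], "1", ["hallway", "kitchen"]), True),
    (("Where is the milk?", ["hallway"]), False),
    (("Hi",), True),
]
episodes = build_episodes(stream)
```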
Share the data.
- num_examples()[source]¶
Return total number of entries available.
Each episode has at least one entry, but might have many more.
- class parlai.core.teachers.StreamDialogData(opt, data_loader=None, cands=None, shared=None, **kwargs)[source]¶
Bases: DialogData
Provides a data structure for streaming textual dialog data.
This can be used whenever the dialog data follows the format described in DialogData but cannot fit entirely into memory.
Additional keyword-argument cycle defines if the stream should restart from the beginning after an epoch is finished (defaults to True).
- Parameters
opt – options to initialize the class
data_loader – an iterable with each call returning a tuple in the form ((x, y, r, c, i), new_episode?) where the x and new_episode fields are mandatory and other fields may be omitted or None.
cands – can be set to provide a list of candidate labels for every example in this dataset, which the agent can choose from (the correct answer should be in this set).
random – tells the data class whether or not to visit episodes sequentially or randomly when returning examples to the caller.
cycle – (default True) whether to restart at beginning when end of stream reached without reset being called.
Share the stream.
- load_length()[source]¶
Calculate the length of the dataset and cache it in a file.
Note that this can take some time for large datasets. Episode and entry indexes cannot be specified during streaming.
- class parlai.core.teachers.FbDeprecatedDialogTeacher(opt, shared=None)[source]¶
Bases: DialogTeacher
This module provides access to data in the Facebook Dialog format.
Subclasses DialogTeacher for functionality and provides an implementation of setup_data() which iterates over datasets in the “fbdialog” format. If your data is in the format below, use this class to handle file parsing for you.
The way FB Dialog data is set up is as follows:
1 Sam went to the kitchen.
2 Pat gave Sam the milk.
3 Where is the milk?<TAB>kitchen<TAB>1<TAB>hallway|kitchen|bathroom
4 Sam went to the hallway.
5 Pat went to the bathroom.
6 Where is the milk?<TAB>hallway<TAB>1<TAB>hallway|kitchen|bathroom
Lines 1-6 represent a single episode, with two different examples: the first example is lines 1-3, and the second is lines 4-6.
Lines 1,2,4, and 5 represent contextual information.
Lines 3 and 6 contain a query, a label, a reward for getting the question correct, and three label candidates.
Since both of these examples are part of the same episode, the information provided in the first example is relevant to the query in the second example and therefore the agent must remember the first example in order to do well.
In general dialog in this format can contain any speech, not just QA pairs:
1 Hi how's it going?<TAB>It's going great. What's new?
2 Well I'm working on a new project at work.<TAB>Oh me too!
3 Oh cool!<TAB>Tell me about yours.
etc.
Note that dialogs are interpreted as being one-way. For example, consider this dialog:
1 X1 Y1
2 X2 Y2
3 X3 Y3
A set of examples X1 => Y1, X2 => Y2, and X3 => Y3 will be generated. However, Y1 => X2 and Y2 => X3 are not created as separate examples by default. This makes sense for some data (we don’t need to train on the idea that “kitchen” should be followed by “Sam went to the hallway…” above), but for other datasets it may be helpful to add additional examples in the reverse direction (“Oh cool!” is a response to “Oh me too!” above).
Share the data and candidates.
- load_cands(path)[source]¶
Load a global fixed set of candidates.
The candidates will be provided by the teacher for every example (the true labels for a specific example are also added to this set, so that it’s possible to get the right answer).
- setup_data(path)[source]¶
Read data in the fbdialog format.
Returns
((x,y,r,c), new_episode?)
tuples.x
represents a query,y
represents the labels,r
represents any reward, andc
represents any label_candidates.The example above will be translated into the following tuples:
x: 'Sam went to the kitchen\nPat gave Sam the milk\nWhere is the milk?' y: ['kitchen'] r: '1' c: ['hallway', 'kitchen', 'bathroom'] new_episode = True (this is the first example in the episode)
x: 'Sam went to the hallway\\nPat went to the bathroom\\nWhere is the milk?' y: ['hallway'] r: '1' c: ['hallway', 'kitchen', 'bathroom'] new_episode = False (this is the second example in the episode)
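The translation into tuples can be sketched with a small standalone parser. This is an illustration under stated assumptions, not the class's actual code: `parse_fbdialog` is a hypothetical name, a line numbered 1 is assumed to start a new episode, `<TAB>` is assumed to denote a literal tab character, labels and candidates are assumed to be separated by `|`, and the text is kept verbatim (including trailing punctuation).

```python
def parse_fbdialog(lines):
    """Yield ((x, y, r, c), new_episode) tuples from numbered dialog lines."""
    context = []
    new_episode = True
    for raw in lines:
        raw = raw.rstrip("\n")
        if not raw:
            continue
        number, _, rest = raw.partition(" ")
        if int(number) == 1:
            # Assumption: numbering restarts at 1 for each new episode.
            context = []
            new_episode = True
        fields = rest.split("\t")
        if len(fields) == 1:
            # No tab: a context-only line that precedes the next query.
            context.append(fields[0])
            continue
        x = "\n".join(context + [fields[0]])
        y = fields[1].split("|") if fields[1] else None
        r = fields[2] if len(fields) > 2 and fields[2] else None
        c = fields[3].split("|") if len(fields) > 3 and fields[3] else None
        yield (x, y, r, c), new_episode
        context = []
        new_episode = False


sample = [
    "1 Sam went to the kitchen.",
    "2 Pat gave Sam the milk.",
    "3 Where is the milk?\tkitchen\t1\thallway|kitchen|bathroom",
    "4 Sam went to the hallway.",
    "5 Pat went to the bathroom.",
    "6 Where is the milk?\thallway\t1\thallway|kitchen|bathroom",
]
parsed = list(parse_fbdialog(sample))
```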
- class parlai.core.teachers.ParlAIDialogTeacher(opt, shared=None)[source]¶
Bases: FixedDialogTeacher
This module provides access to data in the ParlAI Text Dialog format.
Subclasses FixedDialogTeacher for functionality and provides an implementation of setup_data() which iterates over datasets in the “ParlAI text” format. If your data is in the format below, use this class to handle file parsing for you.
The way the data is set up is as follows:
text:Sam went to the kitchen. <NEWL> Pat gave Sam the milk. <NEWL> Where is the milk? <TAB> labels:kitchen <TAB> reward:1 <TAB> label_candidates:hallway|kitchen|bathroom
text:Sam went to the hallway. <NEWL> Pat went to the bathroom. <NEWL> Where is the milk? <TAB> labels:hallway <TAB> reward:1 <TAB> label_candidates:hallway|kitchen|bathroom <TAB> episode_done:True
Lines 1-2 represent a single episode, with a different example on each line. Each line contains a query, a label, a reward for getting the question correct, and three label candidates.
Since both of these examples are part of the same episode, the information provided in the first example is relevant to the query in the second example and therefore the agent must remember the first example in order to do well.
In general, dialog in this format can contain any speech, not just QA pairs:
text:Hi how's it going?<TAB>labels:It's going great. What's new?
text:Well I'm working on a new project at work.<TAB>labels:Oh me too!
text:Oh cool!<TAB>labels:Tell me about yours.
etc.
Note that dialogs are interpreted as being one-way. For example, consider this dialog:
1 X1 Y1
2 X2 Y2
3 X3 Y3
A set of examples X1 => Y1, X2 => Y2, and X3 => Y3 will be generated. However, Y1 => X2 and Y2 => X3 are not created as separate examples by default. This makes sense for some data (we don’t need to train on the idea that “kitchen” should be followed by “Sam went to the hallway…” above), but for other datasets it may be helpful to add additional examples in the reverse direction (“Oh cool!” is a response to “Oh me too!” above).
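A line in this key:value layout can be turned into a message dict with a few lines of code. The sketch below is an assumption-laden illustration rather than the library's parser: `parse_parlai_line` is a hypothetical name, `<TAB>` is assumed to denote a literal tab, multiple labels and candidates are assumed to be separated by `|`, and `<NEWL>` handling is left out.

```python
def parse_parlai_line(line):
    """Parse one 'key:value<TAB>key:value' line into a dict."""
    message = {}
    for field in line.rstrip("\n").split("\t"):
        if not field.strip():
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = field.partition(":")
        key, value = key.strip(), value.strip()
        if key in ("labels", "label_candidates"):
            message[key] = value.split("|")
        elif key == "episode_done":
            message[key] = value == "True"
        else:
            message[key] = value
    # Assumption: an example continues the episode unless marked as done.
    message.setdefault("episode_done", False)
    return message


line = (
    "text:Where is the milk?\tlabels:hallway\treward:1\t"
    "label_candidates:hallway|kitchen|bathroom\tepisode_done:True"
)
message = parse_parlai_line(line)
```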
Share the episodes.
- class parlai.core.teachers.YamlTeacher(opt, shared=None)[source]¶
Bases: DialogTeacher
Teacher which loads data generated by parlai.utils.testing.AutoTeacherTest.
- setup_data(datafile)[source]¶
The core method which the user should override.
Yields the data, one message at a time, as well as markers indicating new episodes.
- Parameters
datafile (str) – If the initializer set a ‘datafile’ field within the initialization, this will be provided here. Otherwise, datafile will be the fold: either “train”, “valid”, or “test”.
- Returns
Yields pairs (message, new_episode) containing a Message object and whether the message marks the beginning of a totally new episode.
- class parlai.core.teachers.ConversationTeacher(opt, shared=None)[source]¶
Bases: DialogTeacher
This module provides access to data in the Conversations format.
Subclasses
DialogTeacher
for functionality and provides an implementation of setup_data() which iterates over datasets in the “Conversations” format. If your data is in the format below, use this class to handle file parsing for you.
The data should be set up so that each dialogue instance (or, episode) occupies one line of valid JSON. The way the data is set up is as follows:
{ "dialog": [ [ {"id": "partner1", "text": "hello!"}, {"id": "partner2", "text": "hi back!"} ] ] }
NOTE: If the data is not on one line per dialogue, it will not load. Further, note that by default, dialogs are interpreted as being one-way. For example, consider this dialog (note that the data below is not on one line):
{
  "dialog":[ [
    {"id":"modelx", "text": X1},
    {"id":"modely", "text": Y1},
    {"id":"modelx", "text": X2},
    {"id":"modely", "text": Y2},
    {"id":"modelx", "text": X3},
    {"id":"modely", "text": Y3},
  ] ]
}
(Note: we use line breaks for readability above, but this data will not load as stated, it must be on one line.)
A set of examples X1 => Y1, X2 => Y2, and X3 => Y3 will be generated, forming one episode. However, Y1 => X2 and Y2 => X3 are not created as separate examples by default. To change this behavior, you can set opt['label_turns'] or the --label-turns flag. The default value is ‘secondspeaker’ (i.e., the second speaker’s utterances are used as labels), but ‘firstspeaker’ and ‘both’ are also options. In the case of ‘both’, two episodes are generated for each conversation.
- setup_data(path)[source]¶
The core method which the user should override.
Yields the data, one message at a time, as well as markers indicating new episodes.
- Parameters
datafile (str) – If the initializer set a ‘datafile’ field within the initialization, this will be provided here. Otherwise, datafile will be the fold: either “train”, “valid”, or “test”.
- Returns
Yields pairs (message, new_episode) containing a Message object and whether the message marks the beginning of a totally new episode.
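The default one-way pairing (X1 => Y1, X2 => Y2, X3 => Y3) can be sketched as below. This covers only the default ‘secondspeaker’ behavior described above; `conversation_to_examples` is a hypothetical helper, plain dicts stand in for Message objects, and flattening the nested "dialog" lists before pairing is an assumption made so the sketch works on the sample shown.

```python
import json


def conversation_to_examples(json_line):
    """Turn one JSON conversation line into (message, new_episode) pairs."""
    conversation = json.loads(json_line)
    # Flatten the nested "dialog" lists into a single sequence of turns.
    turns = [turn for group in conversation["dialog"] for turn in group]
    examples = []
    # Pair each first-speaker turn with the second-speaker reply after it.
    for i, (x, y) in enumerate(zip(turns[0::2], turns[1::2])):
        message = {"text": x["text"], "labels": [y["text"]]}
        examples.append((message, i == 0))
    return examples


line = json.dumps(
    {
        "dialog": [
            [
                {"id": "modelx", "text": "X1"},
                {"id": "modely", "text": "Y1"},
                {"id": "modelx", "text": "X2"},
                {"id": "modely", "text": "Y2"},
                {"id": "modelx", "text": "X3"},
                {"id": "modely", "text": "Y3"},
            ]
        ]
    }
)
examples = conversation_to_examples(line)
```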
- class parlai.core.teachers.AbstractImageTeacher(opt, shared=None)[source]¶
Bases: FixedDialogTeacher
Abstract class to allow easier creation of image + dialogue tasks.
This class handles creating image features via ImageLoader if applicable (resnet, resnext variants) or loading existing image features from a dict path as per get_image_features_path().
Important methods and properties (override in subclass if needed):
get_data_path(): where data file is found (default: <datapath>/<task>)
get_image_path(): where images found (default: <datapath>/<task>/images)
get_image_features_path(): dict of image features (default: <datapath>/<task>/image_features)
@property image_id_key: which key in data file objects represents image_id
@property text_key: which key in data file objects represents text
Note: Assumes data files are named <dt>.json
@abstractmethod image_id_to_image_path() must be implemented in subclass
Example with the key defaults (but the keys can be customized):
obs = { 'text': <caption>, 'image': <image features if specified else image> }
- get_available_image_mode_names()[source]¶
Available image model names.
resnet and resnext variants available from the ImageLoader. resnext101_XXXXX_wsl is the open-sourced FB AI model (960m images, 1.5k hashtags, finetuned on ImageNet).
- property image_id_key¶
Which key in the input data dict objects uniquely identifies each image.
Common image keys are “image_id” or “image_num”. May be implemented by subclass.
- property text_key¶
Which key in the input data dict objects identifies the text.
Common keys are “text” or “comment”. May be implemented by subclass.
- abstract image_id_to_image_path(image_id)[source]¶
Get the path of the image on disk.
Must be implemented by subclass.
- get_image_path(opt)[source]¶
Return the path to the data directory and to the image directory.
Is based on opt fields: task, datatype (train, valid, test), datapath.
Subclass can override this.
- get_image_features_path(task, image_model_name, dt)[source]¶
Image features for the dataset images are stored here.
Can be overridden in subclass to use custom paths. Image features can be manually copied into this directory or in the case of ImageLoader eligible models, they will be built and stored here if not already there.
- is_image_mode_buildable(model_name)[source]¶
Is buildable if features can be calculated by ImageLoader.
Users may wish to compute features for the dataset offline and use in the model, in which case, the image model should return False and get_image_features() should be overridden in subclass.
- load_data(data_path, opt)[source]¶
Load the data file, which is the index to the images and text.
It is often a .json file named <datatype>.json (e.g., train.json). Stores in self.data.
Can be overridden by subclass.
- setup_image_features(data_path)[source]¶
Load text and image data.
The image features all live in dicts by default in <data_path>/image_features/ but get_image_features_path() above can be overridden by subclass to put them elsewhere.
In the (very odd) case that the resnet or resnext dicts (models buildable using ImageLoader) are not found, we build them.
- get_image_features(example)[source]¶
Get image features for example.
Can be overridden in subclass for different behavior. For large datasets, it may be more appropriate to use the ImageLoader.load() method to load image features (as this is essentially streaming the features from disk, so that we do not have to load a large image feature dict in memory). #TODO Could be the default option if we are using -dt train:stream
- get(episode_idx, entry_idx=0)[source]¶
Override this in subclass if your data should be handled in a different format.
Share the data and dataloader.
- class parlai.core.teachers.MultiTaskTeacher(opt: Opt, shared=None)[source]¶
Bases: Teacher
MultiTaskTeacher which teaches multiple tasks.
Creates a teacher that is actually a set of teachers each based on a task string – each of these teachers will get called in turn, either randomly or in order. They are all in the same world (they are the same agent switching tasks).
The task string format is described for the create_task_agents() function above.
Shares this teacher by sharing each subtask.
- class parlai.core.teachers.ChunkTeacher(opt, shared=None)[source]¶
Bases: FixedDialogTeacher, ABC
Useful for loading large amounts of data.
Data is separated into chunks and loaded one chunk at a time. Loads the data off of the main thread.
- abstract get_num_samples(opt: Opt) → Tuple[int, int][source]¶
[Abstract] Return the number of samples.
Returns a tuple of (num_examples, num_episodes) based on the data split.
- abstract get_fold_chunks(opt: Opt) → List[int][source]¶
[Abstract] Return a list of chunk IDs (integer).
Given the datatype (train/test/valid), return the list of chunk IDs that correspond to that split.
- get_buffersize()[source]¶
Size of buffer.
Override this in your child class to change the buffer size.
Share the data and dataloader.
- next_episode_idx()[source]¶
Return the next episode index.
- Parameters
num_eps – default None uses num_episodes value.
loop – default None loops during training but not evaluation.
- receive_data(future)[source]¶
Receive loaded data and place it onto the sample queue.
- Parameters
future – A Future object which will return a value from a call to get_chunk()
- abstract load_from_chunk(chunk_idx: int) → List[ChunkOutput][source]¶
[Abstract] Given the chunk index, load examples from that chunk.
Return a list of tuples. The function create_message will take these tuples to form the Message object that is returned by the teacher.
- abstract create_message(queue_output: ChunkOutput, entry_idx=0) → Message[source]¶
[Abstract] Given the tuple output of the queue, return an act.
May depend on entry index if queue output is a multi-turn episode.
- next_example()[source]¶
Return the next example.
If there are multiple examples in the same episode, returns the next one in that episode. If that episode is over, gets a new episode index and returns the first example of that episode.
- get(episode_idx, entry_idx=0)[source]¶
Get the specified episode and the specified entry in that episode.
Children must override this method in order to inherit the next_example method.
- Parameters
episode_idx – which episode to return examples from
entry_idx – which example to return from the episode. Many datasets have only single-entry episodes, so this defaults to zero.
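How the abstract chunk methods fit together can be sketched without ParlAI. The classes below are hypothetical (`ChunkedSourceSketch`, `ToyChunks`), `opt` is a plain dict, and loading happens synchronously on the calling thread, which is a simplification of the off-main-thread loading described above.

```python
from abc import ABC, abstractmethod


class ChunkedSourceSketch(ABC):
    """Illustrative only: walk the chunks of a split and emit messages."""

    @abstractmethod
    def get_fold_chunks(self, opt):
        """Return the list of chunk IDs for the split named in opt."""

    @abstractmethod
    def load_from_chunk(self, chunk_idx):
        """Return a list of tuples loaded from the given chunk."""

    @abstractmethod
    def create_message(self, queue_output, entry_idx=0):
        """Turn one loaded tuple into a message dict."""

    def iter_messages(self, opt):
        # One chunk is held in memory at a time.
        for chunk_idx in self.get_fold_chunks(opt):
            for queue_output in self.load_from_chunk(chunk_idx):
                yield self.create_message(queue_output)


class ToyChunks(ChunkedSourceSketch):
    # Hypothetical in-memory "chunks"; a real source would read from disk.
    CHUNKS = {
        0: [("hi", "hello")],
        1: [("bye", "see you"), ("thanks", "welcome")],
        2: [("test", "ok")],
    }

    def get_fold_chunks(self, opt):
        return [0, 1] if opt["datatype"] == "train" else [2]

    def load_from_chunk(self, chunk_idx):
        return self.CHUNKS[chunk_idx]

    def create_message(self, queue_output, entry_idx=0):
        text, label = queue_output
        return {"text": text, "labels": [label], "episode_done": True}


train_texts = [m["text"] for m in ToyChunks().iter_messages({"datatype": "train"})]
valid_texts = [m["text"] for m in ToyChunks().iter_messages({"datatype": "valid"})]
```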
- parlai.core.teachers.create_task_agent_from_taskname(opt: Opt)[source]¶
Create task agent(s) assuming the input task_dir:teacher_class.
e.g. def_string is a shorthand path like babi:Task1k:1 or #babi or a complete path like parlai.tasks.babi.agents:Task1kTeacher:1, which essentially performs from parlai.tasks.babi import Task1kTeacher with the parameter 1 in opt['task'] to be used by the class Task1kTeacher.
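The relationship between the shorthand and complete forms can be illustrated with a small string-handling sketch. `resolve_task_string` is a hypothetical helper, not the library function: it only expands names, performs no import, does not handle the `#babi` form, and assumes that a shorthand class name gains a "Teacher" suffix, as the babi example suggests.

```python
def resolve_task_string(task):
    """Expand 'task_dir:teacher_class[:params]' into (module, class, params)."""
    parts = task.split(":")
    if len(parts) < 2:
        raise ValueError("this sketch expects at least task_dir:teacher_class")
    if "." in parts[0]:
        # Already a complete module path.
        module_name = parts[0]
    else:
        # Assumed expansion of the shorthand task directory.
        module_name = f"parlai.tasks.{parts[0]}.agents"
    class_name = parts[1]
    if not class_name.endswith("Teacher"):
        class_name += "Teacher"
    # Remaining parts are left for the teacher class to interpret.
    return module_name, class_name, parts[2:]


short_form = resolve_task_string("babi:Task1k:1")
full_form = resolve_task_string("parlai.tasks.babi.agents:Task1kTeacher:1")
```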