parlai.core.build_data

Utilities for downloading and building data.

These can be replaced if your particular file system does not support them.

class parlai.core.build_data.DownloadableFile(url, file_name, hashcode, zipped=True, from_google=False)

Bases: object

A class used to abstract any file that has to be downloaded online.

Any task that needs to download a file must define a list named RESOURCES whose elements are objects of this class.

This class provides the following functionality:

Download a file from a URL / Google Drive
Untar the file if zipped
Checksum for the downloaded file
Send HEAD request to validate URL or Google Drive link

An object of this class needs to be created with:

url <string> : URL or Google Drive id to download from
file_name <string> : Name under which the downloaded file is saved
hashcode <string> : SHA256 hashcode of the downloaded file
zipped <boolean> : False if the file is not compressed
from_google <boolean> : True if the file is from Google Drive

__init__(url, file_name, hashcode, zipped=True, from_google=False)

checksum(dpath)

Verify the SHA256 checksum of the downloaded file against hashcode.

Parameters: dpath – path to the downloaded file.
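The checksum step can be sketched with the standard library alone. This is a minimal illustration, not ParlAI's implementation: it assumes the file is read in blocks and its SHA256 hex digest is compared with the expected value. The names `sha256_of_file` and `verify` are hypothetical and are not part of `parlai.core.build_data`.

```python
import hashlib


def sha256_of_file(file_path, block_size=65536):
    """Return the SHA256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read fixed-size blocks so large downloads need not fit in memory.
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(file_path, expected_hashcode):
    """Raise AssertionError if the file's digest differs from the expected one."""
    actual = sha256_of_file(file_path)
    if actual != expected_hashcode:
        raise AssertionError(
            f"Checksum mismatch for {file_path}: "
            f"expected {expected_hashcode}, got {actual}"
        )
```

The same helper can be used to compute the value to pass as `hashcode` when a new resource is added.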

check_header()

Performs a HEAD request to check whether the URL or Google Drive ID is live.

parlai.core.build_data.built(path, version_string=None)

Check if ‘.built’ flag has been set for that task.

If a version_string is provided, this has to match, or the version is regarded as not built.

parlai.core.build_data.mark_done(path, version_string=None)

Mark this path as prebuilt.

Marks the path as done by adding a ‘.built’ file with the current timestamp plus a version description string if specified.

Parameters

path (str) – The file path to mark as built.
version_string (str) – The version of this dataset.
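The flag-file mechanism described for built and mark_done can be sketched as follows. This is a minimal illustration under stated assumptions, not ParlAI's code: it assumes the timestamp is written on the first line of the '.built' file and the version string, if any, on the second. The names `mark_done_sketch` and `built_sketch` are hypothetical.

```python
import datetime
import os
import tempfile


def mark_done_sketch(path, version_string=None):
    """Write a '.built' file holding a timestamp and an optional version line."""
    with open(os.path.join(path, ".built"), "w") as f:
        f.write(str(datetime.datetime.today()))
        if version_string:
            f.write("\n" + version_string)


def built_sketch(path, version_string=None):
    """Return True if '.built' exists and, when a version is given, matches it."""
    flag = os.path.join(path, ".built")
    if not os.path.isfile(flag):
        return False
    if version_string is None:
        return True
    with open(flag) as f:
        lines = f.read().split("\n")
    # Assumed layout: the version, when present, is on the second line.
    return len(lines) > 1 and lines[1] == version_string


with tempfile.TemporaryDirectory() as dpath:
    before = built_sketch(dpath)                 # no flag file yet
    mark_done_sketch(dpath, version_string="v1.0")
    same_version = built_sketch(dpath, "v1.0")   # versions match
    other_version = built_sketch(dpath, "v2.0")  # mismatch counts as not built
```

Treating a version mismatch as "not built" lets a task force a rebuild simply by bumping its version string.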

parlai.core.build_data.download(url, path, fname, redownload=False, num_retries=5)

Download file using requests.

If redownload is False (the default), the file is not downloaded again when it is already present.
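The skip-if-present and retry behavior can be sketched without a network connection. In this illustration the HTTP request is replaced by `fetch`, a hypothetical callable that returns the file's bytes; the real function uses requests, and its retry and error handling may differ from what is assumed here.

```python
import os
import tempfile


def download_sketch(fetch, path, fname, redownload=False, num_retries=5):
    """Fetch a file into path/fname unless it already exists.

    `fetch` is a hypothetical stand-in for the HTTP request.
    Returns True if the file was written, False if it was skipped.
    """
    outfile = os.path.join(path, fname)
    if os.path.isfile(outfile) and not redownload:
        return False  # already present, nothing downloaded
    last_error = None
    for _ in range(num_retries):
        try:
            data = fetch()
        except OSError as err:
            last_error = err  # treat as transient and try again
            continue
        with open(outfile, "wb") as f:
            f.write(data)
        return True
    raise RuntimeError(
        f"Download failed after {num_retries} attempts"
    ) from last_error


calls = []


def flaky_fetch():
    """Simulated source that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise OSError("simulated connection error")
    return b"payload"


with tempfile.TemporaryDirectory() as dpath:
    first = download_sketch(flaky_fetch, dpath, "data.tar.gz")
    second = download_sketch(flaky_fetch, dpath, "data.tar.gz")
```

The second call returns without invoking `fetch` at all, because the file from the first call is already on disk and `redownload` is left at False.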

parlai.core.build_data.make_dir(path)

Make the directory and any nonexistent parent directories (mkdir -p).

parlai.core.build_data.remove_dir(path)

Remove the given directory, if it exists.

parlai.core.build_data.untar(path, fname, delete=True, flatten_tar=False)

Unpack the given archive file to the same directory.

Parameters

path (str) – The folder containing the archive. The extracted contents are written to this folder as well.
fname (str) – The filename of the archive file.
delete (bool) – If true, the archive will be deleted after extraction.
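The unpack-then-delete behavior can be sketched with the standard library. This is a minimal illustration, assuming `shutil.unpack_archive` is an acceptable way to extract the archive; `untar_sketch` is a hypothetical name, and the flatten_tar option is not modeled.

```python
import os
import shutil
import tarfile
import tempfile


def untar_sketch(path, fname, delete=True):
    """Unpack path/fname into path, optionally removing the archive afterwards."""
    fullpath = os.path.join(path, fname)
    shutil.unpack_archive(fullpath, path)
    if delete:
        os.remove(fullpath)


with tempfile.TemporaryDirectory() as dpath:
    # Build a small .tar.gz so the example runs offline.
    member = os.path.join(dpath, "train.txt")
    with open(member, "w") as f:
        f.write("hello\n")
    with tarfile.open(os.path.join(dpath, "data.tar.gz"), "w:gz") as tar:
        tar.add(member, arcname="train.txt")
    os.remove(member)

    untar_sketch(dpath, "data.tar.gz", delete=True)
    remaining = sorted(os.listdir(dpath))
```

After the call, the folder holds only the extracted file, since the archive is removed when `delete` is true.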

parlai.core.build_data.ungzip(path, fname, deleteGZip=True)

Decompresses the given gzip-compressed file into the same directory.

Parameters

path (str) – The folder containing the archive. The decompressed file is written to this folder as well.
fname (str) – The filename of the archive file.
deleteGZip (bool) – If true, the compressed file will be deleted after extraction.
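Decompressing a single gzip file differs from unpacking a tar archive, so a separate sketch may help. It assumes the output name is the input name with the '.gz' suffix removed; `ungzip_sketch` and its `delete_gzip` parameter are hypothetical names used only for this illustration.

```python
import gzip
import os
import shutil
import tempfile


def ungzip_sketch(path, fname, delete_gzip=True):
    """Decompress path/fname next to the original and return the output path."""
    gz_path = os.path.join(path, fname)
    # Assumed naming rule: 'a.txt.gz' -> 'a.txt'; otherwise append '.out'.
    out_path = gz_path[:-3] if gz_path.endswith(".gz") else gz_path + ".out"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files are fine
    if delete_gzip:
        os.remove(gz_path)
    return out_path


with tempfile.TemporaryDirectory() as dpath:
    # Create a small gzip file so the example runs offline.
    with gzip.open(os.path.join(dpath, "dialog.txt.gz"), "wb") as f:
        f.write(b"hello\n")
    out = ungzip_sketch(dpath, "dialog.txt.gz")
    with open(out, "rb") as f:
        content = f.read()
    remaining = os.listdir(dpath)
```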

parlai.core.build_data.download_from_google_drive(gd_id, destination)

Use the requests package to download a file from Google Drive.

parlai.core.build_data.download_models(opt, fnames, model_folder, version='v1.0', path='aws', use_model_type=False, flatten_tar=False)

Download models into the ParlAI model zoo from a URL.

Parameters

fnames – list of filenames to download
model_folder – models will be downloaded into models/model_folder/model_type
path – URL for downloading models; defaults to downloading from AWS
use_model_type – whether models are categorized by type in AWS

parlai.core.build_data.modelzoo_path(datapath, path)

Map pretrained model filenames to their paths on disk.

If path starts with ‘models:’, it is remapped to the model zoo path within the data directory (default is ParlAI/data/models). Models are downloaded from the model zoo if they are not already present.
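The prefix remapping on its own can be sketched as below. This is a simplified illustration that assumes the remainder of the path is joined under `<datapath>/models`; it deliberately leaves out the download step, and `remap_zoo_path` is a hypothetical name.

```python
import os


def remap_zoo_path(datapath, path):
    """Rewrite a 'models:' path so that it lives under <datapath>/models.

    Paths without the prefix (and None) are returned unchanged.
    Downloading missing models is not part of this sketch.
    """
    if path is None or not path.startswith("models:"):
        return path
    relative = path[len("models:"):]
    return os.path.join(datapath, "models", relative)


zoo_file = remap_zoo_path("data", "models:some_task/some_model")
local_file = remap_zoo_path("data", "/tmp/my_model")  # returned as-is
```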

parlai.core.build_data.download_multiprocess(urls, path, num_processes=32, chunk_size=100, dest_filenames=None, error_path=None)

Download items in parallel (e.g. for an image + dialogue task).

WARNING: may have issues with OS X.

Parameters

urls – Array of urls to download
path – directory to save items in
num_processes – number of processes to use
chunk_size – chunk size to use
dest_filenames – optional array of filenames, the same length as urls. Images will be saved as path + dest_filename
error_path – where to save error logs

Returns

An array of (destination filename, HTTP status code, error message if any) tuples. Note that on failure, the file may not actually be created.
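Because a failed item still produces a tuple, callers usually need to separate successes from failures before using the files. The sketch below shows one way to do that. The sample list is written by hand for illustration and is not real output, and both treating status 200 as success and the representation of an absent error message are assumptions.

```python
def split_results(results):
    """Separate (filename, status, error) tuples into successes and failures."""
    succeeded, failed = [], []
    for dest_filename, status_code, error in results:
        # `not error` accepts either None or an empty string for "no error".
        if status_code == 200 and not error:
            succeeded.append(dest_filename)
        else:
            failed.append((dest_filename, status_code, error))
    return succeeded, failed


# Hypothetical return value, written by hand for illustration.
sample = [
    ("img_0.jpg", 200, None),
    ("img_1.jpg", 404, "Not Found"),
]
ok, bad = split_results(sample)
```

Since a failed download may leave no file behind, checking the returned tuples is safer than assuming every destination filename exists on disk.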