Standard Ingest Procedure

Introduction

Use the Standard Ingest Service to ingest images and records into the FamilySearch Infinity System by uploading them and their associated metadata into a Transfer S3 Bucket managed by FamilySearch.

Access to the bucket is controlled by an AWS IAM account. The credentials for the IAM account are provided to you by FamilySearch. Each account restricts access to objects named with a specific prefix assigned to that account, so providers can access only their own files and information in the Transfer S3 Bucket.

Here’s how it works:

  1. Obtain your AWS bucket account name, login credentials, and project ID from FamilySearch.
  2. Organize the files you want to ingest in a folder on your computer and create any new files you will need for the metadata.
  3. Upload your files to S3, making sure to follow the naming conventions.
  4. Monitor the ingest status file for the specific ingest in S3. Pay particular attention to processing status, error messages, and notification that the ingest is complete.

The ingest files are organized by file and directory name. Make sure you follow the specified path and naming conventions provided below so that the files can be recognized and ingested.

  • If you use a third-party S3 access tool such as Cloudberry to upload your files, you can stage the entire directory structure on your workstation before uploading it to S3.
  • During the upload process, Cloudberry assigns each uploaded file a key name matching the file's full path on your workstation.
  • If you write your own upload process, your system can rename the files to follow the naming conventions during the upload.

Directory layout requirements for the S3 bucket

Base Subdirectory

Images and metadata files are uploaded by Groups (sometimes called Folders). The naming of objects, including their paths and file names, is important. We use the following pattern to create the base subdirectory for all image, record, and metadata files uploaded for a specific group. Note that image and record files are stored in an "artifacts" subdirectory located within the base directory.

The base subdirectory is composed of the following components:

/[Provider Name]/[FamilySearch project id]/[Capture Id]/

  • Provider Name: The root directory under which all information is uploaded for a specific provider. This name is also your IAM account name and must be set up by FamilySearch before data uploading can begin.
  • FamilySearch project id: The FamilySearch project id defined for this project.
  • Capture Id: A globally unique Capture Id created by the provider specifically for this group of images.
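The base subdirectory can be assembled programmatically. The sketch below mirrors the documented pattern; the provider name, project id, and capture id values are hypothetical examples:

```python
def base_prefix(provider_name: str, project_id: str, capture_id: str) -> str:
    """Build the base S3 subdirectory for one group's files, following the
    /[Provider Name]/[FamilySearch project id]/[Capture Id]/ pattern."""
    return f"/{provider_name}/{project_id}/{capture_id}/"

# Hypothetical example values:
prefix = base_prefix("AcmeImaging", "PRJ-12345", "CAP-0001")
print(prefix)
# → /AcmeImaging/PRJ-12345/CAP-0001/
```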

Metadata for groups to be ingested can be provided in one of two formats, METS or CSV. Depending on the format chosen, the metadata files in the base directory will differ.

CSV formatted metadata files

When using the CSV format to specify metadata, you need to provide two metadata files: one contains the group metadata and the other contains the artifact metadata. There are two possible types of group metadata files, depending on the type of artifacts being ingested, and one type of artifact metadata file.

  • GroupMetadataImage.csv: used to specify group metadata for image ingests
  • GroupMetadataRecord.csv: used to specify group metadata for record ingests
  • ArtifactMetadata.csv: used to specify artifact metadata for either image or record ingests

Example metadata file names and locations in S3:

  • /[Provider Name]/[FamilySearch project id]/[Capture group id]/GroupMetadataImage.csv
  • /[Provider Name]/[FamilySearch project id]/[Capture group id]/ArtifactMetadata.csv

Note: The CSV files must use a comma as the field separator. When a field value contains a comma or a quote ("), the entire field must be enclosed in quotes, and any embedded quotes must be escaped by doubling them (""). So the string
Fred "Bud" Jones, Jr.

would appear as

"Fred ""Bud"" Jones, Jr."

Details of the CSV metadata files are included below.

Image and Record files

Image and record files are written to an "artifacts" subdirectory located in the group base subdirectory as follows:

/[Provider Name]/[FamilySearch project id]/[Capture group id]/artifacts/

Example image file names and locations in S3:

  • /[Provider Name]/[FamilySearch project id]/[Capture group id]/artifacts/Image001.tiff
  • /[Provider Name]/[FamilySearch project id]/[Capture group id]/artifacts/Image002.tiff

The directory layout should look like the following:

/[Provider Name]/
    [FamilySearch project id]/
        [Capture group id]/
            GroupMetadataImage.csv
            ArtifactMetadata.csv
            artifacts/
                Image001.tiff
                Image002.tiff

Group CSV metadata file specifications

Optional fields can be left blank

The CSV file must have a header to indicate column order and contents. The header must use the exact names from the table below. Columns for optional fields can be omitted from the file.

GroupMetadataImage.csv

Field / Header Title | Sample | Description | Requirements | Required | METS Key
Rework | TRUE | Indicates that this is rework of previously submitted material. NOTE: If you are ingesting using the same Capture Group Id and same Project ID as a previous ingest, you must set this to TRUE or you will receive an error. | TRUE or FALSE | Yes | COM_REWORK
Title | Michigan Births 1850-1875 | The title of the book/folder (GRMS listing title if no title is available) | | Yes | MODS_TITLE
Place | Wayne, Michigan, United States | The location(s) of the events recorded on the image (or other artifact). Multiple values allowed, separated by a bar | | Yes | MODS_PLACE_TERM
Start Date | 1850 | The start date of the events on the image (or other artifact) | | Yes | MODS_CREATED_START_DATE
End Date | 1875 | The end date of the events on the image (or other artifact) | | Yes | MODS_CREATED_END_DATE
Record Type | Birth certificate | The record type(s) of the image (or other artifact). Multiple values allowed, separated by a bar. Can be an official record type string or, preferably, the concept ID. | | Yes | COM_RECORD_TYPE
Language | English | The language(s) used on the image (or other artifact). Multiple values allowed, separated by a bar | | Yes | MODS_LANGUAGE_TERM
Record Custodian | Michigan State Archive | The organization/entity that has custody of the physical artifact | | Yes | METS_HDR_AGENT_CUSTODIAN
Custodian Reference ID | 45-3453-345 | The ID used by the record custodian (required if an ID is available; needed for delivery) | | Yes |
Capture ID | 999-999-99999 | The ID used by the capturing organization (used for rework) | Must be unique for each group ingested for a provider. The Capture ID and provider name are combined during the ingest process to create a global ID; the total length of the combined values must not exceed 64 characters. | Yes | COM_GROUP_CAPTURE_ID
Total Artifacts | 1078 | The number of images (or other artifacts) in the group | Integer number | Yes | COM_TOTAL_ARTIFACTS
Capture Operator Name | SmithJane | Username of the vendor associated with the Project ID | Must be a valid FamilySearch Username | Yes | COM_OPERATOR_NAME
Capture Operator ID | cis.user.MMM9-TGFQ | CIS ID associated with the Username assigned to the Project ID (this should match the username provided in the Capture Operator Name field) | Must be a valid FamilySearch account ID (CIS ID) or be left blank. Max length: 18 characters (CIS ID on backend) | Yes | COM_OPERATOR_NUMBER
Volume | 3 | The identifier for a volume of a series of books/folders all under the same title (optional) | | No |
Capture Date | 2019-02-13 | The date this group was captured (optional) | | No | MODS_CAPTURED_DATE
Digitizing Entity | FamilySearch | The name of the entity that did the digital creation/capture of the group of artifacts (optional) | | No | METS_HDR_AGENT_CREATOR
Artifact Type | NBX | The type of artifacts in this group | Must match the values in the ArtifactMetadata.csv "Artifact Type" field | No |

Currently only one group per metadata file is supported. As a result, the group metadata file only contains two rows: a header row and one data row.

Ensure that the CSV file is saved as UTF-8, comma delimited. Field values that contain a comma must have the whole value surrounded by double quotes. If you review your CSV file in Notepad or TextEdit, you should see something similar to the following:

Rework,Title,Place,Start Date,End Date,Record Type,Language,Record Custodian,Custodian Reference ID,Capture ID,Total Artifacts,Capture Operator Name,Capture Operator ID
TRUE,Michigan Births 1850-1875,"Wayne, Michigan, United States",1850,1875,Birth certificate,English,Michigan State Archive,45-3453-345,999-999-99999,1078,SmithJane,cis.user.MMM9-TGFQ
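One way to produce such a file is with Python's csv module. For brevity this sketch writes only a few of the required columns; a real file must include every required column from the table above, and the values shown are the table's sample values:

```python
import csv

# Column order follows the GroupMetadataImage.csv specification;
# only a subset of the required columns is shown here.
header = ["Rework", "Title", "Place", "Start Date", "End Date"]
row = {
    "Rework": "TRUE",
    "Title": "Michigan Births 1850-1875",
    "Place": "Wayne, Michigan, United States",  # comma → quoted automatically
    "Start Date": "1850",
    "End Date": "1875",
}

# newline="" prevents extra blank lines; encoding="utf-8" matches the spec.
with open("GroupMetadataImage.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writeheader()
    writer.writerow(row)  # exactly one data row per file
```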

Artifact CSV Metadata File Specifications

The CSV file must have a header to indicate column order and contents. The header must use the exact names from the table below. Columns for optional fields can be omitted from the file.

Please note: The order of the entries in this file will be maintained when the images are published, so ensure that each row is in the order you want it to appear (e.g., book cover, title page, page 1, page 2).

ArtifactMetadata.csv (one row for each artifact)


Field / Header Title | Sample | Description | Requirements | Required | METS Key
Filename | Image0004.jpg | The filename for the artifact. Note: the extension must match the actual image file type. | | Yes | mets:FLocat-href
File Size | 3765432 | Size of the artifact file in bytes | Integer number | Yes | mets:file-SIZE
Artifact Type | Image | The type of artifact (Image, Audio, Text, ...); optional, defaults to Image | | Yes | mets:file-MIMETYPE
Capture ID | 5b17447b-e972-403c-987f-ee4346784839 | The ID used by the capturing organization (used for rework) | Each artifact included in the CSV file must have its own unique Capture ID | Yes | fscommon:artifactCaptureId
Hash Algorithm | MD5 | The hash algorithm used for computing the checksum | | Yes | mix:messageDigestAlgorithm
Hash | d4f7beaa9828bb62b58f9497dc3778cc | The actual hash value | Must be a valid MD5 hash for the file | Yes | mets:file-CHECKSUM, mix:messageDigest
Image Width | 5192 | The width of the image in pixels | Integer number | No | mix:imageWidth
Image Height | 2834 | The height of the image in pixels | Integer number | No | mix:imageHeight

We recommend verifying that your MD5 hash is correct on one or two images. You can compare your hash using this free online MD5 hash generator: https://emn178.github.io/online-tools/md5_checksum.html (Note: we are not sponsored by or affiliated with this website).
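The hash can also be computed locally. Here is a minimal sketch using Python's standard hashlib, reading the file in chunks so large images do not need to fit in memory:

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the lowercase hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example with a small throwaway file:
with open("example.bin", "wb") as f:
    f.write(b"hello")
print(md5_of_file("example.bin"))
# → 5d41402abc4b2a76b9719d911017c592
```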

Image Requirements

TIFF Images

TIFF images must be 1-channel 8-bit greyscale images, or 3-channel 24-bit color or greyscale images.

Record CSV Metadata File Specifications

The CSV file must have a header to indicate column order and contents. The header must use the exact names from the table below.

GroupMetadataRecord.csv


Field | Sample | Description | METS Key
Template | UUID | The ID of the template being used for records ingest (records) | fscommon:template
Flat File Flavor | Ancestry | The identifier of the template mapping standard (records) | fscommon:flavor
Data Format | flat-file | The type of record being ingested, i.e. flat-file, gedcom, or gedcomx | fscommon:dataFormat
Rework | TRUE | Indicates that this is rework of previously submitted material | COM_REWORK

Currently only one group per file is supported, so the group metadata file contains only two rows: a header row and a data row.

Uploading objects

Login credentials limit access to a root directory (S3 prefix) in the S3 Transfer Bucket. Bucket Name, Account Name, Access Key and Secret Key will be provided by FamilySearch. For security reasons, Access Key and Secret Key will be changed periodically.

FamilySearch has separate S3 buckets for its Development, Test, and Production environments. All three bucket names will be provided to you, and access to all three is allowed with your account. The Development and Test environments may be used for testing; however, any tests must be coordinated with FamilySearch to ensure a project is set up to receive your ingest. If a receiving project is not set up in advance, the ingest will fail.

The group metadata file (GroupMetadataImage.csv or GroupMetadataRecord.csv) serves as the completion trigger and should be the last file uploaded for a specific group.
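Because the group metadata file is the completion trigger, a custom uploader should sort it to the end of its upload queue. A small sketch of that ordering logic (the example keys are hypothetical but follow the naming conventions above):

```python
TRIGGER_FILES = {"GroupMetadataImage.csv", "GroupMetadataRecord.csv"}

def order_uploads(keys):
    """Order S3 keys so the group metadata file (completion trigger) goes last.

    sorted() is stable, so the relative order of all other files is kept.
    """
    return sorted(keys, key=lambda k: k.rsplit("/", 1)[-1] in TRIGGER_FILES)

keys = [
    "AcmeImaging/PRJ-12345/CAP-0001/GroupMetadataImage.csv",
    "AcmeImaging/PRJ-12345/CAP-0001/artifacts/Image001.tiff",
    "AcmeImaging/PRJ-12345/CAP-0001/ArtifactMetadata.csv",
]
print(order_uploads(keys)[-1])
# → AcmeImaging/PRJ-12345/CAP-0001/GroupMetadataImage.csv
```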

Multiple options are available for uploading the objects into S3 from your computer. For example:

  1. A third-party application such as Cloudberry can be used.
  2. Amazon provides an SDK with APIs that allow you to write files to S3.
  3. Standard HTML POST calls can be made to upload files.
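As a sketch of option 2, the following uses the AWS SDK for Python (boto3) to upload one artifact under the expected key. The bucket, provider, and ID values are hypothetical placeholders; the credentials come from your FamilySearch-issued access and secret keys, and whether a leading slash belongs in the object key may depend on your tooling (it is omitted here):

```python
def artifact_key(provider: str, project_id: str, capture_id: str,
                 filename: str) -> str:
    """Build the S3 object key for one artifact file per the layout above."""
    return f"{provider}/{project_id}/{capture_id}/artifacts/{filename}"

def upload_artifact(bucket: str, provider: str, project_id: str,
                    capture_id: str, local_path: str, filename: str) -> None:
    # boto3 reads credentials from the environment or ~/.aws/credentials;
    # imported here so the key-building helper works without the SDK installed.
    import boto3
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket,
                   artifact_key(provider, project_id, capture_id, filename))

# Hypothetical usage:
# upload_artifact("fs-transfer-bucket", "AcmeImaging", "PRJ-12345",
#                 "CAP-0001", "/scans/Image001.tiff", "Image001.tiff")
```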

Cloudberry

Cloudberry provides a client application for MS Windows that makes it easy to copy files from your computer to the FamilySearch S3 Transfer Bucket. It offers both a free and a paid version. The main difference lies in their upload capabilities:

  • The free version uploads one file at a time.
  • The paid version uses multiple threads to upload several files simultaneously.

Depending on your internet speed, this difference can have a big impact on how long it takes to upload a large number of files.
To get started with Cloudberry:

  1. Download and install the application on your computer.
  2. Register your Amazon S3 account within the Cloudberry application using the account information provided by FamilySearch.
    1. To do so, open Cloudberry and navigate to "File > Amazon S3."
    2. Enter the Access key and Secret Key provided, and click "Test Connection" to verify, then OK to save.

Once you have added the Amazon S3 account, you can select it from the "Source" dropdown shown on the right half of the screen.

Next, enter the S3 bucket name and account name in the folder field just below the Source dropdown.

  • Use the same naming format that you would for a folder: bucketName/accountName. FamilySearch will provide you with this information.

After completing these steps, you'll see the contents of the root S3 directory for your account on the right side of the screen. Note that because you entered your account name as part of the folder string, the initial page will display files and directories in your bucket/accountName/ directory.
Due to account restrictions, you won't be able to view anything higher than this directory. Be aware that when following this procedure with Cloudberry or a similar tool, you don't need to add your accountName to the path of copied files since that is the current directory displayed on the right side.

  • The accountName root is automatically included. Project ID directories should be copied to this root directory.

On the left side of the screen, you can browse folders on your computer.
To ingest, navigate to the appropriate folder location on both screens and copy the files and/or directories to be ingested into the destination folder by selecting, dragging, and dropping them.

  • If you have multiple groups under a Project Name, you can copy at the Project Name level by dragging the project folder from your computer to the root directory in the S3 view.
  • If you've already set up the project directory on S3, you can copy the group level (Capture Group ID) directories into the project directory.

AWS APIs for programmatic uploading to the S3 bucket

Instructions for using the AWS SDK to access S3 can be found in the AWS SDK documentation.


Viewing Ingest Errors and Status

The status for groups submitted for ingesting will be returned in JSON files located in the same S3 bucket used to upload images and metadata.

The status files will be named as follows:

/[Provider Name]/[Project ID]/Status/[Status Level]/[Capture Group ID].json

"Status Level" indicates the current ingest status of the specific group identified by Capture Group ID. Status Level will be one of the following values:

  • inProgress
  • Issue - indicates that an ingest error has occurred. The JSON status file will contain the error details.
  • Completed

The JSON files will contain additional details about the ingest and current status.
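A provider-side monitor can poll for these status files. The sketch below builds the status-file key and, as a guarded example, checks for it with boto3; the bucket and ID values are hypothetical:

```python
def status_key(provider: str, project_id: str, level: str,
               capture_group_id: str) -> str:
    """Build the status-file key:
    [Provider Name]/[Project ID]/Status/[Status Level]/[Capture Group ID].json
    """
    return f"{provider}/{project_id}/Status/{level}/{capture_group_id}.json"

def check_status(bucket: str, provider: str, project_id: str,
                 capture_group_id: str):
    """Return the first status level whose JSON file exists, else None."""
    import boto3  # imported here so status_key works without the SDK installed
    s3 = boto3.client("s3")
    for level in ("Completed", "Issue", "inProgress"):
        key = status_key(provider, project_id, level, capture_group_id)
        if s3.list_objects_v2(Bucket=bucket, Prefix=key).get("KeyCount", 0):
            return level
    return None

# Hypothetical usage:
# check_status("fs-transfer-bucket", "AcmeImaging", "PRJ-12345", "CAP-0001")
```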