Product Transaction Delivery Through S3

For partners who do not support sending data via the Yahoo Conversion API (CAPI), Yahoo offers the ability to upload Transaction Data via S3.

Important

This solution powers Yahoo In-Flight Outcomes (IFO) only and is not for standard conversion tracking.

Set up the delivery of data files to designated AWS S3 locations for Yahoo to download.

Architecture and Data Flow

Onboarding Requirements from Data Provider

Please provide the following so that Yahoo can estimate the load of the data you plan to send:

  1. Upload frequency: assume daily.

  2. Number of files per batch upload and their size after compression (.bz2 preferred).

  3. Projected data volume per day.

High-Level Guidelines: Privacy, Security and Performance

The following are best practices for sharing data via S3:

  1. Do not send data for opted out users.

  2. Do not send duplicate user events.

    1. If duplication cannot be avoided because you integrate with multiple data onboarding endpoints (such as S3, the Conversion API, and the DOT Pixel), send an extra string field, “event_id”, with each event on all endpoints so that the Yahoo system can use it as a deduplication key.

  3. Protect the S3 bucket/location with a security policy, for example (see the sketch after this list):

    1. Disable public access.

    2. Enforce HTTPS (TLS 1.2 or above) connections.

    3. Enable server-side encryption (SSE-S3) with cipher key rotation at least every 12 months.

  4. Group files by feed type and day/hour, and upload them to folders named by feed type and date/time. Refer to Directory Layout and File Format.

  5. Limit the file size to around 1 GB (after .bz2 compression).

    1. For small data sets, limit the number of files to under 5 per hour.

  6. Avoid many small files.

  7. Support credential rotation: annually or as needed.

  8. Secrets (e.g., credentials) should be delivered in encrypted form; the receiver of the credentials provides a GPG public key for this purpose. Refer to Sharing Secrets with External Partners using GPG.
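
As an illustration of item 3, the following is a minimal boto3 sketch that applies these protections to a bucket. The bucket name is hypothetical, and credential/key rotation is assumed to be handled separately.

    import json
    import boto3

    BUCKET = "partner-transactions-feed"  # hypothetical bucket name

    s3 = boto3.client("s3")

    # Disable all public access to the bucket.
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # Enable default server-side encryption (SSE-S3).
    s3.put_bucket_encryption(
        Bucket=BUCKET,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )

    # Deny any request that does not arrive over HTTPS.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))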

Directory Layout and File Format

s3://<bucket name>/<3p-m>/<feed_n>/yyyyMMdd

_manifest -- 0 bytes; upload completion marker; upload this after all other files are uploaded

<file_1.csv.bz2> -- data file; TAB ('\t') delimited

<file_2.csv.bz2>

...

<file_n.csv.bz2>
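
For example, an upload that follows this layout might look like the following boto3 sketch; the bucket, partner, and feed names are hypothetical. The _manifest file is uploaded last, since it is the completion marker.

    import boto3

    s3 = boto3.client("s3")

    bucket = "partner-transactions-feed"        # hypothetical bucket name
    prefix = "acme-match/transactions/20240101" # <3p-m>/<feed_n>/yyyyMMdd, names hypothetical

    # Upload the schema, metadata, and data files first ...
    for name in [".schema", ".meta", "file_1.csv.bz2", "file_2.csv.bz2"]:
        s3.upload_file(name, bucket, f"{prefix}/{name}")

    # ... then upload _manifest last, as the completion marker.
    s3.upload_file("_manifest", bucket, f"{prefix}/_manifest")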

File Format and Contents of “_manifest”

File sizes are given in bytes.

<.schema size><SPACE><.schema> -- <SPACE> delimited

<.meta size><SPACE><.meta>

<file_1 size><SPACE><file_1.csv.bz2>

<file_2 size><SPACE><file_2.csv.bz2>

...

<file_n size><SPACE><file_n.csv.bz2>
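
A minimal sketch of generating the manifest, assuming the batch files sit in the current directory and that every line takes the "<size><SPACE><name>" form shown above:

    import os

    # Hypothetical batch: schema and metadata files plus the compressed data files.
    batch = [".schema", ".meta", "file_1.csv.bz2", "file_2.csv.bz2"]

    # Write one "<size><SPACE><name>" line per file; sizes are in bytes.
    with open("_manifest", "w") as manifest:
        for name in batch:
            manifest.write(f"{os.path.getsize(name)} {name}\n")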

File Format and Contents of “<.schema>”

The schema file, <.schema>, can be in either of the following two formats.

  1. A Pig header file, ”.pig_header”, in CSV column header format with extended data type support:

    1. <column_header:data_type>[,<column_header:data_type>]*

    2. Supported data types: http://pig.apache.org/docs/latest/basic.html#data-types

    3. https://pig.apache.org/docs/r0.17.0/api/constant-values.html#org.apache.pig.data.DataType

  2. The standard Apache Pig schema file, “.pig_schema”, in JSON file format.

    Note

    The names, data types, and order of the attributes in the “.schema” file must match the columns in the data files.
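
For illustration, a “.pig_header” for a feed with four hypothetical columns might look like this:

    event_id:chararray,event_time:datetime,transaction_amount:double,item_count:int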

Pig Data Types

Simple Type in .pig_header | Constant Value in .pig_schema | Description | Example
-------------------------- | ----------------------------- | ----------- | -------
INT        | 10 | Signed 32-bit integer | 10
LONG       | 15 | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
FLOAT      | 20 | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
DOUBLE     | 25 | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
CHARARRAY  | 55 | Character array (string) in Unicode UTF-8 format | hello world
BYTEARRAY  | 50 | Byte array (blob) |
BOOLEAN    | 5  | Boolean | true/false (case insensitive)
DATETIME   | 30 | Datetime | 1970-01-01T00:00:00.000+00:00
BIGINTEGER | 65 | Java BigInteger | 200000000000
BIGDECIMAL | 70 | Java BigDecimal | 33.45678332

Contacts for Operational Alerts

<provider-specific-name>_s3_feed_alerts@<provider-specific-name>.com

Sharing Secrets with External Partners using GPG

Background

This document describes how company A, the secret sender, can share secrets with company B, the secret recipient, over the internet securely with GPG encryption.

In a nutshell,

  1. Company B uses GPG to generate a key pair: a public key, b_armor.pub, and a private key, b_armor.

    1. Company B must provide a password, password_B, when generating the key pair.

    2. Company B will need this password to decrypt the encrypted secret from company A.

  2. Company B emails b_armor.pub to company A.

  3. Company A uses b_armor.pub to encrypt the secret, and emails the encrypted blob to company B.

  4. Company B uses b_armor (and password_B when prompted) to decrypt the encrypted blob.

Detailed Steps

  1. Company A requests that company B generate (if not yet done) and export its GPG public key. Company B follows the instructions here: https://kb.iu.edu/d/awio

    1. Generate a key; assume company B’s email address is “b_poc@b.com”.

      1. gpg --gen-key

        1. Choose “(1) RSA and RSA (default)”

        2. Key length: 2048

        3. Expire in 2 weeks: 2w

        4. Full Name

        5. Email address: b_poc@b.com

        6. Comment: GPG key w/ company A

        7. Enter passphrase: <password_B> (company B will need this to decrypt the secret)

    2. gpg -o b_armor.pub -a --export b_poc@b.com

    3. Company B emails “b_armor.pub” as an attachment to company A.

  2. Company A imports company B’s public key

    1. gpg --import “b_armor.pub”

  3. Company A puts the secret in “my_secret.txt” and encrypts it using b_armor.pub, identified as “b_poc@b.com”.

    1. gpg -o my_secret_armor.txt -a -e -r b_poc@b.com my_secret.txt

  4. Company A emails “my_secret_armor.txt” as an attachment to company B.
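
  5. Company B decrypts “my_secret_armor.txt” with its private key, entering password_B when prompted. A typical command for this step (the output filename is illustrative):

    1. gpg -o my_secret.txt -d my_secret_armor.txt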