Product Transaction Delivery Through S3

For partners who do not support sending data via the Yahoo Conversion API (CAPI), Yahoo offers the ability to upload Transaction Data via S3.

Important

This solution powers Yahoo In-Flight Outcomes (IFO) only and is not for standard conversion tracking.

Set up the delivery of data files to designated AWS S3 locations for Yahoo to download.

Architecture and Data Flow

Onboarding Requirements from Data Provider

Please provide the following so that Yahoo can estimate the load of the data you plan to send:

  1. Upload frequency: assume daily.

  2. Number of files per batch upload and their size after compression (.bz2 preferred).

  3. Projected data volume per day.

High-Level Guidelines: Privacy, Security and Performance

The following are best practices for sharing data via S3:

  1. Do not send data for opted out users.

  2. Do not send duplicate user events.

    1. If duplication cannot be avoided because you integrate with multiple data onboarding endpoints (such as S3, the Conversion API, and the DOT Pixel), send an extra string field, “event_id”, with each event on all endpoints so that the Yahoo system can use it as a deduplication key.

  3. Protect the S3 bucket/location with a security policy, for example (see the sketch after this list):

    1. Disable public access.

    2. Enforce HTTPS (TLS 1.2 or above) connections.

    3. Enable server-side encryption (SSE-S3) with cipher key rotation at least every 12 months.

  4. Group files by feed type and day/hour, and upload them to folders named by feed type and date/time. Refer to Directory Layout and File Format.

  5. Limit the file size to around 1 GB (after .bz2 compression).

    1. For small data sets, limit the number of files to under 5 per hour.

  6. Avoid many small files.

  7. Support credential rotation: annually or as needed.

  8. Secrets (e.g., credentials) should be delivered in encrypted form; the receiver of the credentials provides a GPG public key for this purpose. Refer to Sharing Secrets with External Partners using GPG.
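
As an illustration of item 3, the following is a minimal boto3 sketch that applies these protections to a bucket. The bucket name is hypothetical, and credential/key rotation is assumed to be handled separately.

    import json
    import boto3

    BUCKET = "partner-transactions-feed"  # hypothetical bucket name

    s3 = boto3.client("s3")

    # Disable all public access to the bucket.
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # Enable default server-side encryption (SSE-S3).
    s3.put_bucket_encryption(
        Bucket=BUCKET,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )

    # Deny any request that does not arrive over HTTPS.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))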

Directory Layout and File Format

s3://<bucket name>/<3p-m>/<feed_n>/yyyyMMdd

_manifest -- 0 bytes; upload completion marker; upload this after all other files are uploaded

<file_1.csv.bz2> -- data file; TAB ('\t') delimited

<file_2.csv.bz2>

...

<file_n.csv.bz2>
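
For example, an upload that follows this layout might look like the following boto3 sketch; the bucket, partner, and feed names are hypothetical. The _manifest file is uploaded last, since it is the completion marker.

    import boto3

    s3 = boto3.client("s3")

    bucket = "partner-transactions-feed"        # hypothetical bucket name
    prefix = "acme-match/transactions/20240101" # <3p-m>/<feed_n>/yyyyMMdd, names hypothetical

    # Upload the schema, metadata, and data files first ...
    for name in [".schema", ".meta", "file_1.csv.bz2", "file_2.csv.bz2"]:
        s3.upload_file(name, bucket, f"{prefix}/{name}")

    # ... then upload _manifest last, as the completion marker.
    s3.upload_file("_manifest", bucket, f"{prefix}/_manifest")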

File Format and Contents of “_manifest”

File sizes are given in bytes.

<.schema size><SPACE><.schema> -- <SPACE> delimited

<.meta size><SPACE><.meta>

<file_1 size><SPACE><file_1.csv.bz2>

<file_2 size><SPACE><file_2.csv.bz2>

...

<file_n size><SPACE><file_n.csv.bz2>
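
A minimal sketch of generating the manifest, assuming the batch files sit in the current directory and that every line takes the "<size><SPACE><name>" form shown above:

    import os

    # Hypothetical batch: schema and metadata files plus the compressed data files.
    batch = [".schema", ".meta", "file_1.csv.bz2", "file_2.csv.bz2"]

    # Write one "<size><SPACE><name>" line per file; sizes are in bytes.
    with open("_manifest", "w") as manifest:
        for name in batch:
            manifest.write(f"{os.path.getsize(name)} {name}\n")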

File Format and Contents of “<.schema>”

The schema file, <.schema>, can be in either of the following two formats.

  1. A Pig header file, ”.pig_header”, in CSV column header format with extended data type support:

    1. <column_header:data_type>[,<column_header:data_type>]*

    2. Supported data types: http://pig.apache.org/docs/latest/basic.html#data-types

    3. https://pig.apache.org/docs/r0.17.0/api/constant-values.html#org.apache.pig.data.DataType

  2. The standard Apache Pig schema file, “.pig_schema”, in JSON file format.

    Note

    The names, data types, and order of the attributes in the “.schema” file must match the columns in the data files.
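
For illustration, a “.pig_header” for a feed with four hypothetical columns might look like this:

    event_id:chararray,event_time:datetime,transaction_amount:double,item_count:int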

Pig Data Types

Simple Type in .pig_header | Constant Value in .pig_schema | Description | Example
-------------------------- | ----------------------------- | ----------- | -------
INT        | 10 | Signed 32-bit integer | 10
LONG       | 15 | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
FLOAT      | 20 | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
DOUBLE     | 25 | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
CHARARRAY  | 55 | Character array (string) in Unicode UTF-8 format | hello world
BYTEARRAY  | 50 | Byte array (blob) |
BOOLEAN    | 5  | Boolean | true/false (case insensitive)
DATETIME   | 30 | Datetime | 1970-01-01T00:00:00.000+00:00
BIGINTEGER | 65 | Java BigInteger | 200000000000
BIGDECIMAL | 70 | Java BigDecimal | 33.45678332

Contacts for Operational Alerts

<provider-specific-name>_s3_feed_alerts@<provider-specific-name>.com

Sharing Secrets with External Partners using GPG

Background

This document describes how company A, the secret sender, can share secrets with company B, the secret recipient, over the internet securely with GPG encryption.

In a nutshell,

  1. Company B uses GPG to generate a key pair: a public key, b_armor.pub, and a private key, b_armor.

    1. Company B must provide a password, password_B, when generating the key pair.

    2. Company B will need this password to decrypt the encrypted secret from company A.

  2. Company B emails b_armor.pub to company A.

  3. Company A uses b_armor.pub to encrypt the secret, and emails the encrypted blob to company B.

  4. Company B uses b_armor (and password_B when prompted) to decrypt the encrypted blob.

Detailed Steps

  1. Company A requests that company B generate (if not yet done) and export its GPG public key. Company B follows the instructions here: https://kb.iu.edu/d/awio

    1. Generate a key; assume company B’s email address is “b_poc@b.com”.

      1. gpg --gen-key

        1. Choose “(1) RSA and RSA (default)”

        2. Key length: 2048

        3. Expire in 2 weeks: 2w

        4. Full Name

        5. Email address: b_poc@b.com

        6. Comment: GPG key w/ company A

        7. Enter passphrase: <password_B> (company B will need this to decrypt the secret)

    2. gpg -o b_armor.pub -a --export b_poc@b.com

    3. Company B emails “b_armor.pub” as an attachment to company A.

  2. Company A imports company B’s public key

    1. gpg --import “b_armor.pub”

  3. Company A puts the secret in “my_secret.txt” and encrypts it using b_armor.pub, identified as “b_poc@b.com”.

    1. gpg -o my_secret_armor.txt -a -e -r b_poc@b.com my_secret.txt

  4. Company A emails “my_secret_armor.txt” as an attachment to company B.
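
  5. Company B decrypts “my_secret_armor.txt” with its private key, entering password_B when prompted. A typical command for this step (the output filename is illustrative):

    1. gpg -o my_secret.txt -d my_secret_armor.txt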