Product Transaction Delivery Through S3
For partners who do not support sending data via the Yahoo Conversion API (CAPI), Yahoo offers the ability to upload Transaction Data via S3.
Important
This solution powers Yahoo In-Flight Outcomes (IFO) only and is not for standard conversion tracking.
Set up the delivery of data files to designated AWS S3 locations for Yahoo to download.
Architecture and Data Flow
Onboarding Requirements from Data Provider
Please provide the following so that Yahoo can estimate the load of the data you plan to send:
Upload frequency: assume daily.
Number of files per batch upload and their size after compression (.bz2 preferred).
Projected data volume per day.
High-Level Guidelines: Privacy, Security and Performance
The following are best practices for sharing data via S3:
Do not send data for opted-out users.
Do not send duplicate user events.
If duplication cannot be avoided because you integrate with multiple data onboarding endpoints (such as S3, the Conversion API, and the DOT pixel), send an extra string field, "event_id", with each event on all endpoints so that Yahoo's systems can use it as a deduplication key.
Protect the S3 bucket/location with a security policy, for example (see the sketch after this list):
Disable public access.
Enforce HTTPS (TLS 1.2 or above) connections.
Enable server-side encryption (SSE-S3) with cipher key rotation at least every 12 months.
Group files by feed type and day/hour, and upload them to folders named by feed type and date/time. Refer to Directory Layout and File Format.
Limit the file size to around 1 GB (after .bz2 compression).
For small data sets, limit the number of files to under 5 per hour.
Avoid many small files.
Support credential rotation: annually or as needed.
Deliver secrets (e.g., credentials) in encrypted form, using a GPG public key provided by the receiver of the credentials. Refer to Sharing Secrets with External Partners using GPG.
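To illustrate the bucket-protection items above, here is a minimal sketch using the AWS CLI; the bucket name example-feed-bucket is hypothetical, and equivalent settings can be applied through the S3 console or infrastructure-as-code tooling:

    # Block all public access to the bucket (bucket name is a placeholder).
    aws s3api put-public-access-block \
        --bucket example-feed-bucket \
        --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

    # Turn on default server-side encryption with S3-managed keys (SSE-S3).
    aws s3api put-bucket-encryption \
        --bucket example-feed-bucket \
        --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

    # Deny any request that does not arrive over HTTPS; a minimum TLS version
    # (e.g., 1.2) can additionally be enforced with the s3:TlsVersion condition key.
    aws s3api put-bucket-policy --bucket example-feed-bucket --policy '{
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": ["arn:aws:s3:::example-feed-bucket",
                     "arn:aws:s3:::example-feed-bucket/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}}
      }]
    }'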
Directory Layout and File Format
s3://<bucket name>/<3p-m>/<feed_n>/yyyyMMdd
_manifest -- upload-completion marker listing the other files and their sizes (see next section); upload this after all other files are uploaded
<.schema> -- schema file; see "File format and Contents of <.schema>"
<.meta> -- metadata file
<file_1.csv.bz2> -- data file; TAB ('\t') delimited
<file_2.csv.bz2>
...
<file_n.csv.bz2>
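A minimal upload sketch using the AWS CLI, with hypothetical bucket and folder names (substitute the values agreed during onboarding):

    # Hypothetical destination following <bucket name>/<3p-m>/<feed_n>/yyyyMMdd.
    DEST="s3://example-feed-bucket/example-provider/transactions/20240101"

    # Compress the data files (.bz2 preferred).
    bzip2 file_1.csv file_2.csv

    # Upload schema, meta, and data files first.
    aws s3 cp .schema "$DEST/.schema"
    aws s3 cp .meta "$DEST/.meta"
    aws s3 cp file_1.csv.bz2 "$DEST/file_1.csv.bz2"
    aws s3 cp file_2.csv.bz2 "$DEST/file_2.csv.bz2"

    # Upload _manifest last: it marks the batch as complete (contents described below).
    aws s3 cp _manifest "$DEST/_manifest"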
File format and Contents of “_manifest”
The unit of the file size is "byte". Each line is <SPACE> delimited:
<.schema size><SPACE><.schema>
<.meta size><SPACE><.meta>
<file_1 size><SPACE><file_1.csv.bz2>
<file_2 size><SPACE><file_2.csv.bz2>
...
<file_n size><SPACE><file_n.csv.bz2>
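A minimal shell sketch that writes "_manifest" in this format, assuming the schema, meta, and data files sit in the current directory:

    # One line per file: "<size in bytes><SPACE><file name>",
    # ordered as .schema, .meta, then the data files.
    for f in .schema .meta file_*.csv.bz2; do
        printf '%d %s\n' "$(wc -c < "$f")" "$f"
    done > _manifest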
File format and Contents of “<.schema>”
The schema file, <.schema>, can be in either of the following two formats:
A Pig header file, ".pig_header", in CSV column-header format with extended data-type support (an example follows the note below):
<column_header:data_type>[,<column_header:data_type>]{*}
Supported data types: http://pig.apache.org/docs/latest/basic.html#data-types and https://pig.apache.org/docs/r0.17.0/api/constant-values.html#org.apache.pig.data.DataType
The standard Apache Pig schema file, ".pig_schema", in JSON format (a sketch follows the data-type table below).
Note
The names, data types, and order of the attributes in the ".schema" file should match the columns in the data files.
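For example, a ".pig_header" for a hypothetical transaction feed (the column names are illustrative, not a required schema) could look like:

    event_id:chararray,event_time:datetime,product_id:chararray,quantity:int,amount:double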
Pig Data Types
Simple Types in .pig_header | Constant Value in .pig_schema | Description | Example
---|---|---|---
int | 10 | Signed 32-bit integer | 10
long | 15 | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
float | 20 | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 25 | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | 55 | Character array (string) in Unicode UTF-8 format | hello world
bytearray | 50 | Byte array (blob) | 
boolean | 5 | Boolean | true/false (case insensitive)
datetime | 30 | Datetime | 1970-01-01T00:00:00.000+00:00
biginteger | 65 | Java BigInteger | 200000000000
bigdecimal | 70 | Java BigDecimal | 33.45678332
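A hedged sketch of a matching ".pig_schema" JSON file for two hypothetical columns, using the constant values from the table above (the exact JSON envelope can vary by Pig version; consult the linked Pig documentation):

    {"fields": [{"name": "event_id", "type": 55, "description": null, "schema": null},
                {"name": "amount",   "type": 25, "description": null, "schema": null}],
     "version": 0, "sortKeys": [], "sortKeyOrders": []}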
Contacts for Operational Alerts
<provider-specific-name>_s3_feed_alerts@<provider-specific-name>.com
Sharing Secrets with External Partners using GPG
Background
This document describes how company A, the secret sender, can share secrets with company B, the secret recipient, over the internet securely with GPG encryption.
In a nutshell,
Company B uses GPG to generate a key pair: a public key, b_armor.pub, and a private key, b_armor.
Company B provides a password, password_B, when generating the key pair.
Company B will need this password to decrypt the encrypted secret from company A.
Company B emails b_armor.pub to company A.
Company A uses b_armor.pub to encrypt the secret and emails the encrypted blob to company B.
Company B uses b_armor (entering password_B when prompted) to decrypt the encrypted blob.
Detailed Steps
Company A requests that company B generate (if they have not yet) and export their GPG public key. Company B follows the instructions here: https://kb.iu.edu/d/awio
Generate a key (assume company B's email address is [email protected]):
gpg --gen-key
Choose "(1) RSA and RSA (default)"
Key length: 2048
Key expiration: 2w (expires in 2 weeks)
Full name: <company B's name>
Email address: [email protected]
Comment: GPG key w/ company A
Enter passphrase: <password_B> (company B will need this to decrypt the secret)
Export the public key in ASCII-armored format:
gpg -o b_armor.pub -a --export [email protected]
Company B emails b_armor.pub as an attachment to company A.
Company A imports company B's public key:
gpg --import b_armor.pub
Company A puts the secrets in my_secret.txt and encrypts it using b_armor.pub, identified by [email protected]:
gpg -o my_secret_armor.txt -a -e -r [email protected] my_secret.txt
Company A emails my_secret_armor.txt as an attachment to company B.
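Completing the flow from the overview, company B decrypts the blob with its private key, entering password_B at the passphrase prompt:

    gpg -o my_secret.txt -d my_secret_armor.txt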