Data export in Parquet format
Overview
Apache Parquet is an open-source data format for efficient data storage and retrieval. Sitefinity Insight enables you to export marketing data in Parquet format to easily integrate your data into your analytical solutions and perform fast queries over vast amounts of data from different sources.
PREREQUISITES: To export Parquet files, you must have a paid Sitefinity Insight subscription. To upgrade, contact Sitefinity Sales.
Support for generating Parquet files is turned off by default. To use it, you must enable it per data center. After you enable this feature for a particular data center, Sitefinity Insight automatically generates Parquet files containing all the data in that data center once per day. You use the Sitefinity Insight API to download the generated Parquet files. For more information, see Work with the Sitefinity Insight API.
For more information about enabling Parquet file generation, see Data Exports.
NOTE: Sitefinity Insight may throttle the download of Parquet files. To avoid throttling, check the value of the LastDataUpdatedOn property, as described below, and download the refreshed files only if they are newer than the ones you already have.
RECOMMENDATION: Because Sitefinity Insight generates refreshed files at most once per day, we recommend downloading the generated Parquet files once per day to avoid throttling.
Handle data deletion requested by GDPR
When a user submits a data deletion request, Sitefinity Insight ensures that the deleted personal information is not present in the exported files generated after the request is handled.
IMPORTANT: If you have copies of the exported files in any external systems, you are responsible for deleting the data in these systems.
Get the Parquet-encoded visitors' data
In this procedure, you learn how to get visitors’ data from your Sitefinity Insight data center, encoded in Parquet format, and how to authenticate and authorize the API calls.
Perform the following:
- Choose the Sitefinity Insight API server depending on the region where your Sitefinity Insight account is provisioned.
For more information about available Insight regional deployments, see Sitefinity Insight deployment options » Sitefinity CMS and Sitefinity Insight deployments.
This tutorial assumes the US deployment: https://api.insight.sitefinity.com
- Generate an Access key.
For more information, see Connect your sites to Sitefinity Insight » Access keys.
IMPORTANT: After you close the window, you will not be able to see this key again. Make sure you have a copy of the key in a secure place.
- Obtain an ephemeral access token to use in the Authorization header when performing subsequent API calls.
To do this, follow the procedure in Work with the Sitefinity Insight API » Authorization.
- Call the GET /exports/tracked-data API.
For more information, see API browser.
- Check the value of the returned LastDataUpdatedOn property.
If you have already downloaded the Parquet files for the date specified in this property, you do not need to download them again. To avoid throttling, skip the next steps.
- Get the list of files to download by reading the File property of the ContactsFile, MappingsFile, and InteractionFiles properties.
- For each file you get in the previous step, call the GET /v3/data-centers/{apiKey}/exports/tracked-data/download?file={File} API.
Replace the placeholders {apiKey} and {File} in the template above.
The result of the API call is an octet stream with the content of the respective file.
For more information, see API browser.
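The download steps above can be sketched in Python as follows. This is a minimal sketch, not an official client. It assumes that the response of the tracked-data call is JSON that you have already parsed into a dictionary, that InteractionFiles is a list of objects that each carry a File property, and that the Authorization header value is whatever the Authorization procedure referenced above produces. All function names are hypothetical.

```python
from typing import List, Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

# US deployment, as assumed in this tutorial.
API_BASE = "https://api.insight.sitefinity.com"


def needs_download(manifest: dict, last_downloaded: Optional[str]) -> bool:
    """Return True when LastDataUpdatedOn differs from the value saved
    after the previous download (None means nothing was downloaded yet)."""
    return manifest.get("LastDataUpdatedOn") != last_downloaded


def files_to_download(manifest: dict) -> List[str]:
    """Collect the File values of ContactsFile, MappingsFile, and
    InteractionFiles. Assumption: InteractionFiles is a list of objects."""
    names = []
    for key in ("ContactsFile", "MappingsFile"):
        entry = manifest.get(key) or {}
        if entry.get("File"):
            names.append(entry["File"])
    for entry in manifest.get("InteractionFiles") or []:
        if entry.get("File"):
            names.append(entry["File"])
    return names


def download_url(api_key: str, file_name: str, base: str = API_BASE) -> str:
    """Fill the {apiKey} and {File} placeholders of the download template."""
    return (
        f"{base}/v3/data-centers/{quote(api_key, safe='')}"
        f"/exports/tracked-data/download?file={quote(file_name, safe='')}"
    )


def fetch(url: str, authorization: str) -> bytes:
    """Perform the GET call and return the octet stream as bytes.
    Assumption: `authorization` is the complete Authorization header value."""
    request = Request(url, headers={"Authorization": authorization})
    with urlopen(request) as response:
        return response.read()
```

A daily job could call needs_download first and stop when it returns False, then pass each name from files_to_download through download_url and fetch, and finally save the new LastDataUpdatedOn value for the next run.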
Parquet data files schemas
Sitefinity Insight provides the following types of Parquet-formatted data files:
- Contacts’ demographic details
- Contacts’ interactions with your web properties
- Mappings of multiple tracking cookies to a single contact
Contacts demographic information
ContactsFile is a Parquet file containing demographic data about contacts. It has the following schema:
Id
The identifier of the contact.
Contains the same value as the VisitorId column in the other data types.
KnownSince
The date when the visitor provided an email for the first time, thus becoming a contact.
FirstVisitOn
The date when the contact visited the website for the first time. This date can be before the KnownSince date when the person first visits a page anonymously and then provides their email.
LastVisitOn
The date when the contact was last active on the website.
- Any additional contact properties, whether defined by you or predefined in Insight (Email, Company, Phone, JobTitle, Country, FirstName, LastName, Address, and Birthday).
For more information, see Configure contact properties.
Interactions information
An interactions file is a Parquet file containing contacts’ interactions for a single day. It has the following schema:
Predicate
The action that the visitor performed.
For example, Visit, Submit Form, and Login.
Object
The object the visitor interacted with.
For example, if the Predicate is Visit, the Object is the URL of the visited page, and if the Predicate is Submit Form, the Object is the name of the form.
SubjectId
Identifies a set of interactions that belong to the same visitor.
It is a more compact numerical representation of the Subject value stored in the sf-data-intell-subject cookie. A single visitor may have more than one SubjectId.
To map all the SubjectId values to the same contact, you use the Mapping file, as explained in the section Subject mapping information below.
Timestamp
The time and date when the interaction happened.
DocumentTitle
Optional. If the Predicate is Visit, this is the document title of the visited page as displayed in the browser.
Taxonomies
The tags and the categories this object is annotated with at the time of the interaction.
For more information, see Add categories and tags to content items.
ReceivedOn
The date when Sitefinity Insight received the interaction in the data center.
The Timestamp and ReceivedOn properties of the interaction could have different dates, for example, when interactions for past events are reported to Sitefinity Insight.
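To illustrate the difference between Timestamp and ReceivedOn, the following minimal sketch flags interactions whose two dates differ. It assumes that you have already loaded the interaction rows into Python dictionaries with datetime values. The sample rows and the function name are hypothetical.

```python
from datetime import datetime


def dates_differ(interaction: dict) -> bool:
    """True when the interaction happened and was received on
    different calendar dates."""
    return interaction["Timestamp"].date() != interaction["ReceivedOn"].date()


# Hypothetical sample rows, for illustration only.
rows = [
    {"Predicate": "Visit",
     "Timestamp": datetime(2024, 3, 1, 23, 50),
     "ReceivedOn": datetime(2024, 3, 1, 23, 51)},
    {"Predicate": "Login",
     "Timestamp": datetime(2024, 2, 20, 9, 0),
     "ReceivedOn": datetime(2024, 3, 1, 8, 0)},
]

# Only the second row was reported on a different date than it happened.
late = [row["Predicate"] for row in rows if dates_differ(row)]
print(late)
```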
Subject mapping information
To track a visitor’s behavior in browsers, Sitefinity Insight creates and stores a SubjectId in a cookie named sf-data-intell-subject. Because a person can use multiple devices and browsers, or clear the browser cookies, it is possible that the same person is represented by multiple SubjectId values.
When a contact uses multiple devices, each device is associated with a distinct SubjectId. But if the person has provided their email on each device via a form submission, or if the person has logged in with their account on each device, each SubjectId is associated with the same VisitorId.
Interactions before a cookie clean-up are associated with one SubjectId, and interactions after the cookie clean-up are associated with another SubjectId. But if the person has logged in or provided their email in other ways, these two distinct SubjectId values are associated with the same VisitorId.
EXAMPLE: Consider that your data center has a contact, Helena Smith, with the e-mail hsmith@example.com.
When you look in the ContactsFile Parquet file, you can find that Helena’s Id is, for example, 1234.
When you look at the MappingFile Parquet file, you can find that the Id 1234 maps to multiple SubjectId values, for example, 100001 and 200123.
Finally, you can open the different interaction files, look for records with SubjectId equal to 100001 or 200123, and use these records to build a customer journey for the contact hsmith@example.com that spans the multiple devices Helena uses.
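The lookup in this example can be sketched in Python as follows. This is a minimal sketch that assumes the rows of the three file types are already loaded into lists of dictionaries keyed by the column names of the schemas in this article. The sample rows, URLs, and the function name are hypothetical.

```python
from datetime import datetime


def build_journey(contact_id, mappings, interactions):
    """Return the interactions of one contact, ordered by Timestamp."""
    # All SubjectId values that the mapping rows associate with the contact.
    subject_ids = {m["SubjectId"] for m in mappings if m["VisitorId"] == contact_id}
    journey = [i for i in interactions if i["SubjectId"] in subject_ids]
    return sorted(journey, key=lambda i: i["Timestamp"])


# Hypothetical sample rows, for illustration only.
contacts = [{"Id": 1234, "Email": "hsmith@example.com"}]
mappings = [
    {"SubjectId": 100001, "VisitorId": 1234},
    {"SubjectId": 200123, "VisitorId": 1234},
    {"SubjectId": 300000, "VisitorId": 999},
]
interactions = [
    {"SubjectId": 200123, "Predicate": "Submit Form", "Object": "Contact us",
     "Timestamp": datetime(2024, 3, 2, 9, 30)},
    {"SubjectId": 300000, "Predicate": "Visit", "Object": "https://www.example.com/",
     "Timestamp": datetime(2024, 3, 1, 12, 0)},
    {"SubjectId": 100001, "Predicate": "Visit", "Object": "https://www.example.com/pricing",
     "Timestamp": datetime(2024, 3, 1, 18, 15)},
]

# Find Helena's Id by email, then collect her interactions across devices.
helena_id = next(c["Id"] for c in contacts if c["Email"] == "hsmith@example.com")
journey = build_journey(helena_id, mappings, interactions)
```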
The MappingFile represents the association between SubjectId and VisitorId values and has the following schema:
SubjectId
Identifies interactions made by the same subject, as described above. The same column is available in the interactions files.
VisitorId
Identifies a single visitor or contact. A single visitor can have multiple SubjectId values. The same column is available in the Contacts file and is named Id.
You can join the interactions and the mapping files on the SubjectId column and work with the VisitorId column as the identifier of the person who did the interactions. This enables you to analyze journeys that happen on multiple devices or in different browsing sessions, but were identified by Sitefinity Insight as belonging to the same visitor.
To annotate the interactions data with the demographic data located in the ContactsFile, you also join on the VisitorId column, which holds the same value as the Id column of the contacts.
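The joins described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the rows are already loaded into lists of dictionaries, each SubjectId maps to a single VisitorId, and an interaction whose SubjectId has no mapping row is kept with VisitorId set to None. The sample rows and the function name are hypothetical.

```python
def annotate(interactions, mappings, contacts):
    """Add VisitorId and the contact's Country to each interaction."""
    # Lookup tables that play the role of the two join keys.
    visitor_by_subject = {m["SubjectId"]: m["VisitorId"] for m in mappings}
    contact_by_id = {c["Id"]: c for c in contacts}

    annotated = []
    for row in interactions:
        visitor_id = visitor_by_subject.get(row["SubjectId"])
        contact = contact_by_id.get(visitor_id, {})
        annotated.append(
            {**row, "VisitorId": visitor_id, "Country": contact.get("Country")}
        )
    return annotated


# Hypothetical sample rows, for illustration only.
contacts = [{"Id": 1234, "Country": "Bulgaria"}]
mappings = [
    {"SubjectId": 100001, "VisitorId": 1234},
    {"SubjectId": 200123, "VisitorId": 1234},
]
interactions = [
    {"SubjectId": 100001, "Predicate": "Visit", "Object": "https://www.example.com/pricing"},
    {"SubjectId": 200123, "Predicate": "Submit Form", "Object": "Contact us"},
    {"SubjectId": 555555, "Predicate": "Visit", "Object": "https://www.example.com/"},
]

result = annotate(interactions, mappings, contacts)
```

Grouping the annotated rows by VisitorId instead of SubjectId then gives one journey per person rather than one per device or browser.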
For more information, see SubjectMapping.