Data export in Parquet format

Overview

Apache Parquet is an open-source data format for efficient data storage and retrieval. Sitefinity Insight enables you to export marketing data in Parquet format to easily integrate your data into your analytical solutions and perform fast queries over vast amounts of data from different sources.

PREREQUISITES: To export Parquet files, you must have a paid Sitefinity Insight subscription. To upgrade, contact Sitefinity Sales.

Support for generating Parquet files is turned off by default. To use it, you must enable it per data center. After you enable this feature for a particular data center, Sitefinity Insight automatically generates Parquet files containing all the data in that data center once per day. Use the Sitefinity Insight API to download the generated Parquet files. For more information, see Work with the Sitefinity Insight API.

For more information about enabling Parquet file generation, see Data Exports.

NOTE: Sitefinity Insight may throttle the download of Parquet files. To avoid throttling, check the value of the LastDataUpdatedOn property, as described below, and download the refreshed files only if they are newer than the ones you already have.

RECOMMENDATION: Because Sitefinity Insight generates refreshed files at most once per day, we recommend downloading the generated Parquet files once per day to avoid throttling.

Handle data deletion requests under GDPR

When a user submits a data deletion request, Sitefinity Insight ensures that the deleted personal information is not present in the exported files generated after the request is handled.

IMPORTANT: If you have copies of the exported files in any external systems, you are responsible for deleting the data in these systems.

Get the Parquet-encoded visitors' data

In this procedure, you learn how to get visitors’ data from your Sitefinity Insight data center, encoded in Parquet format, and how to authenticate and authorize the API calls.

Perform the following:

  1. Choose the Sitefinity Insight API server depending on the region where your Sitefinity Insight account is provisioned.
    For more information about available Insight regional deployments, see Sitefinity Insight deployment options » Sitefinity CMS and Sitefinity Insight deployments.
    This tutorial assumes the US deployment - https://api.insight.sitefinity.com.
  2. Generate an Access key.
    For more information, see Connect your sites to Sitefinity Insight » Access keys.
    IMPORTANT: After you close the window, you will not be able to see this key again. Make sure you have a copy of the key in a secure place.
  3. Obtain an ephemeral access token to use in the Authorization header when performing subsequent API calls.
    To do this, follow the procedure in Work with the Sitefinity Insight API » Authorization.
  4. Call the GET /exports/tracked-data API.
    For more information, see API browser.
  5. Check the value of the returned LastDataUpdatedOn property.
    If you have already downloaded the Parquet files for the date specified in this property, you do not need to download the files again. To avoid throttling, skip the next steps.
  6. Get the list of files to download by reading the File property of the ContactsFile, MappingsFile, and InteractionFiles properties.
  7. For each file you get in Step 6, call the GET /v3/data-centers/{apiKey}/exports/tracked-data/download?file={File} API.
    Replace the placeholders {apiKey} and {File} in the template above.
    The result of the API call is an octet stream with the content of the respective file.
    For more information, see API browser.
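
The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference client, and it rests on several assumptions that are not stated in this article: the metadata call in Step 4 returns JSON, ContactsFile and MappingsFile are single objects while InteractionFiles is a list, and the value of the Authorization header is whatever you obtained in Step 3. The function names and the metadata_url parameter are hypothetical. Take the full URL for Step 4 from the API browser.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.insight.sitefinity.com"  # US deployment, see Step 1


def needs_download(metadata, last_seen):
    """Step 5: True when LastDataUpdatedOn differs from the value stored after the previous download."""
    return metadata.get("LastDataUpdatedOn") != last_seen


def files_to_download(metadata):
    """Step 6: collect the File values of ContactsFile, MappingsFile, and InteractionFiles."""
    # Assumption: ContactsFile and MappingsFile are single objects and
    # InteractionFiles is a list of objects, each with a File property.
    entries = [metadata.get("ContactsFile"), metadata.get("MappingsFile")]
    entries.extend(metadata.get("InteractionFiles") or [])
    return [entry["File"] for entry in entries if entry and entry.get("File")]


def download_url(api_key, file_name):
    """Step 7: fill in the {apiKey} and {File} placeholders, URL-escaping both."""
    return "{}/v3/data-centers/{}/exports/tracked-data/download?{}".format(
        BASE_URL,
        urllib.parse.quote(api_key, safe=""),
        urllib.parse.urlencode({"file": file_name}),
    )


def fetch(url, authorization):
    """GET a URL, sending the Authorization header value obtained in Step 3."""
    request = urllib.request.Request(url, headers={"Authorization": authorization})
    with urllib.request.urlopen(request) as response:
        return response.read()


def sync(metadata_url, api_key, authorization, last_seen):
    """Run Steps 4 to 7; return the new LastDataUpdatedOn value and the downloaded bytes per file."""
    # Assumption: the Step 4 response body is JSON.
    metadata = json.loads(fetch(metadata_url, authorization))
    if not needs_download(metadata, last_seen):
        return last_seen, {}  # nothing new, so skip the downloads to avoid throttling
    files = {
        name: fetch(download_url(api_key, name), authorization)
        for name in files_to_download(metadata)
    }
    return metadata.get("LastDataUpdatedOn"), files
```

Persist the returned LastDataUpdatedOn value between runs and pass it as last_seen on the next run. The sketch keeps every file in memory, so for large exports you would probably write each response to disk instead.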

Parquet data file schemas

Sitefinity Insight provides the following types of Parquet-formatted data files:

  • Contacts’ demographic details
  • Contacts’ interactions with your web properties
  • Mappings of multiple tracking cookies to a single contact

Contacts’ demographic information

ContactsFile is a Parquet file containing demographic data about contacts. It has the following schema:

  • Id
    The identifier of the contact.
    Contains the same value as the VisitorId column in the other data types.
  • KnownSince
    The date when the visitor provided an email for the first time, thus becoming a contact.
  • FirstVisitOn
    The date when the contact visited the website for the first time. This date can be before the KnownSince date when the person visits a page first anonymously and then provides their email.
  • LastVisitOn
    The date when the contact was last active on the website.
  • Any additional contact properties, whether defined by you or predefined in Insight (Email, Company, Phone, JobTitle, Country, FirstName, LastName, Address, and Birthday).
    For more information, see Configure contact properties.

Interactions information

An interactions file is a Parquet file containing contacts’ interactions for a single day. It has the following schema:

  • Predicate
    The action that the visitor performed.
    For example, Visit, Submit Form, and Login.
  • Object
    The object the visitor interacted with.
    For example, if the Predicate is Visit, the Object is the URL of the visited page, and if the Predicate is Submit Form, the Object is the name of the form.
  • SubjectId
    Identifies a set of interactions that belong to the same visitor.
    It is a more compact numerical representation of the Subject value stored in the sf-data-intell-subject cookie. A single visitor may have more than one SubjectId.
    To map all the SubjectIds to the same contact, use the mapping file, as explained in the Subject mapping information section below.
  • Timestamp
    The time and date when the interaction happened.
  • DocumentTitle
    Optional. If the Predicate is Visit, this is the document title of the visited page as displayed in the browser.
  • Taxonomies
    The tags and the categories this object is annotated with at the time of the interaction.
    For more information, see Add categories and tags to content items.
  • ReceivedOn
    The date when Sitefinity Insight received the interaction in the data center.
    The Timestamp and ReceivedOn properties of the interaction could have different dates - for example, when interactions for past events are reported to Sitefinity Insight.
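
As an illustration of the difference between Timestamp and ReceivedOn, the following sketch flags interactions that were reported after the day on which they happened. It assumes that the rows are already loaded from an interactions file by a Parquet reader of your choice and that both columns arrive as datetime values. The sample rows and the reported_late helper are made up for this example.

```python
from datetime import datetime


def reported_late(interaction):
    """True when the interaction was received on a later calendar day than it happened."""
    return interaction["ReceivedOn"].date() > interaction["Timestamp"].date()


# Made-up sample rows that use the column names from the schema above.
interactions = [
    {"Predicate": "Visit", "Object": "https://www.example.com/pricing",
     "SubjectId": 100001,
     "Timestamp": datetime(2024, 3, 1, 9, 30),
     "ReceivedOn": datetime(2024, 3, 1, 9, 30)},
    {"Predicate": "Submit Form", "Object": "Contact us",
     "SubjectId": 100001,
     "Timestamp": datetime(2024, 2, 20, 14, 0),
     "ReceivedOn": datetime(2024, 3, 1, 10, 0)},
]

# Keep only the interactions that describe past events.
late = [row for row in interactions if reported_late(row)]
print([row["Object"] for row in late])
```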

Subject mapping information

To track visitors’ behavior in browsers, Sitefinity Insight creates and stores a SubjectId in a cookie named sf-data-intell-subject. Because a person can use multiple devices and browsers, or clear the browser cookies, it is possible that the same person is represented by multiple SubjectId values.

When a contact uses multiple devices, each device is associated with a distinct SubjectId. But if the person has provided their email on each device via a form submission, or if the person has logged in with their account on each device, each SubjectId is associated with the same VisitorId.

Similarly, when a person clears their browser cookies, interactions before the cookie clean-up are associated with one SubjectId, and interactions after the clean-up are associated with another SubjectId. But if the person has logged in or provided their email in other ways, these two distinct SubjectId values are associated with the same VisitorId.

EXAMPLE: Consider that your data center has the contact Helena Smith with the email hsmith@example.com.
When you look in the ContactsFile Parquet file, you can find that Helena’s Id is, for example, 1234.
When you look at the MappingsFile Parquet file, you can find that the Id 1234 maps to multiple SubjectIds, for example, 100001 and 200123.
Finally, you can open the different interaction files, look for records with SubjectId equal to 100001 or 200123, and use these records to build a customer journey for the contact hsmith@example.com that spans the multiple devices used by Helena.
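
The lookup in this example can be sketched in Python as follows. This is a minimal sketch that assumes the rows of the three file types are already loaded into lists of dictionaries by a Parquet reader of your choice. The journey_for_email function and all sample rows except Helena's identifiers are made up for illustration.

```python
from datetime import datetime

# Made-up sample rows that mirror the example above.
contacts = [{"Id": 1234, "Email": "hsmith@example.com"}]
mappings = [
    {"SubjectId": 100001, "VisitorId": 1234},
    {"SubjectId": 200123, "VisitorId": 1234},
    {"SubjectId": 300500, "VisitorId": 5678},
]
interactions = [
    {"SubjectId": 200123, "Predicate": "Submit Form", "Object": "Newsletter",
     "Timestamp": datetime(2024, 3, 2, 10, 15)},
    {"SubjectId": 300500, "Predicate": "Visit", "Object": "https://www.example.com/blog",
     "Timestamp": datetime(2024, 3, 1, 12, 0)},
    {"SubjectId": 100001, "Predicate": "Visit", "Object": "https://www.example.com/",
     "Timestamp": datetime(2024, 3, 1, 8, 0)},
]


def journey_for_email(email, contacts, mappings, interactions):
    """Return the interactions of the contact with this email, oldest first."""
    # Contacts file: email -> Id.
    contact_ids = {c["Id"] for c in contacts if c.get("Email") == email}
    # Mapping file: VisitorId -> every SubjectId that belongs to it.
    subject_ids = {m["SubjectId"] for m in mappings if m["VisitorId"] in contact_ids}
    # Interactions files: keep the rows recorded under any of those SubjectIds.
    rows = [i for i in interactions if i["SubjectId"] in subject_ids]
    return sorted(rows, key=lambda i: i["Timestamp"])


journey = journey_for_email("hsmith@example.com", contacts, mappings, interactions)
for step in journey:
    print(step["Timestamp"], step["SubjectId"], step["Predicate"], step["Object"])
```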

The MappingsFile represents the association between SubjectIds and VisitorIds and has the following schema:

  • SubjectId
    Identifies interactions made by the same subject, as described above. The same column is available in the interactions files.
  • VisitorId
    Identifies a single visitor or contact. A single visitor can have multiple SubjectIds. The same column is available in the Contacts file and is named Id.

You can join the interactions files and the mapping file on the SubjectId column and work with the VisitorId column as the identifier of the person who performed the interactions. This enables you to analyze journeys that span multiple devices or browsing sessions but that Sitefinity Insight identified as belonging to the same visitor.

To annotate the interactions data with the demographic data in the ContactsFile, join the two data sets by matching the VisitorId column to the Id column of the ContactsFile.
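
A minimal sketch of this join in plain Python follows. It assumes that the rows are already loaded into lists of dictionaries and that each SubjectId maps to exactly one VisitorId. The annotate_interactions function and its properties parameter are hypothetical names chosen for this example.

```python
def annotate_interactions(interactions, mappings, contacts, properties=("Email",)):
    """Add VisitorId and the chosen contact properties to every interaction row."""
    # Assumption: each SubjectId appears once in the mapping file.
    visitor_by_subject = {m["SubjectId"]: m["VisitorId"] for m in mappings}
    contact_by_id = {c["Id"]: c for c in contacts}
    annotated = []
    for row in interactions:
        # SubjectId -> VisitorId; None when the subject has no mapping row.
        visitor_id = visitor_by_subject.get(row["SubjectId"])
        # VisitorId -> contact row (the Id column of the contacts file).
        contact = contact_by_id.get(visitor_id, {})
        extra = {name: contact.get(name) for name in properties}
        annotated.append({**row, "VisitorId": visitor_id, **extra})
    return annotated


# Made-up sample rows: one mapped interaction and one without a mapping.
rows = annotate_interactions(
    interactions=[
        {"SubjectId": 100001, "Predicate": "Visit"},
        {"SubjectId": 999, "Predicate": "Login"},
    ],
    mappings=[{"SubjectId": 100001, "VisitorId": 1234}],
    contacts=[{"Id": 1234, "Email": "hsmith@example.com"}],
)
for row in rows:
    print(row)
```

The sketch behaves like a left join: interactions without a matching mapping row are kept, with VisitorId and the contact properties set to None, so that no interaction is silently dropped.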

For more information, see SubjectMapping.
