All too often, personally identifiable information makes its way into Google Analytics, breaching Google's terms of service and ultimately putting your data (and reputation) at risk. Given the heated debates and actions around privacy today, it's a bad time for an organization to overreach, even if it's accidental.
PII, or Personally Identifiable Information, is information that could potentially be used to identify a unique individual. Full names, email addresses, social security numbers, and even precise location information are some, but not all, examples. In all likelihood, your website handles this information in some way, and that means there's the potential for Google Analytics to collect it.
This information could wind up in a variety of places. Typically, however, we've found the following to be most common:
- Custom Dimensions. This is a big no, and implies you knew what you were doing! You can still do what you need to do here, but not like this. We'll talk more about that below.
- Page Path. There are many ways PII creeps into page data. Often, it'll make its way into query parameters when users submit forms (/submit?user=john.doe). Other times, it'll be used within URI segments to identify what content to serve (/user/john-doe/).
- Page Title. If your users log in, it's not uncommon for configuration pages to include a user's name or other PII in the Page Title (My Website - User - John Doe), which is then automatically pulled into Google Analytics on every hit.
- Event Category, Action, and Label. Websites that generate events, especially auto-events like listeners on all clicks, could automatically capture PII that's part of button text and other HTML elements.
Having PII reach your Google Analytics data does not necessarily imply you, as a website owner, collected this data intentionally. For example, website search fields produce valuable information about your users' intent, and are often part of a baseline implementation. If a user decides to search for their social security number instead of, say, a city name, that unfortunately becomes your problem.
Strategies for Better PII Management
There are both proactive and reactive steps the best businesses take to make sure they're respecting their users' privacy and abiding by Google's terms of service.
First, collect only the information you need to collect – be as strict as is reasonable to provide value for your users. If, for example, you're offering a mailing list, do you really need to collect a user's name, or will an email address suffice? Erring on the side of caution reduces future problem solving.
Second, audit your data frequently. From what we've seen, it's probable PII will reach your dataset eventually, and the best thing you can do is to handle cases as soon as possible. Frequent audits ensure cases are caught and handled in days or weeks rather than months or years.
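As a rough illustration of what an audit pass might look like, here is a minimal Python sketch that scans exported dimension values (page paths, page titles, event labels) for strings that look like PII. The pattern set, the `audit_values` name, and the sample rows are all our own assumptions for illustration; they are not a Google Analytics feature, and the patterns would need tuning for the kinds of PII your site actually handles.

```python
import re
from collections import Counter

# Illustrative patterns only (US-style formats); extend or replace as needed.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def audit_values(values):
    """Return (flagged, counts).

    flagged pairs each suspicious value with the pattern names it matched;
    counts tallies how many values matched each pattern.
    """
    flagged = []
    counts = Counter()
    for value in values:
        matched = [
            name for name, pattern in PII_PATTERNS.items()
            if pattern.search(value)
        ]
        if matched:
            flagged.append((value, matched))
            counts.update(matched)
    return flagged, counts


# Hypothetical sample of exported page paths and event labels.
sample = [
    "/submit?user=john.doe@example.com",
    "/search?q=123-45-6789",
    "/products/shoes",
    "Call 555-123-4567",
]
flagged, counts = audit_values(sample)
for value, matched in flagged:
    print(f"{', '.join(matched)}: {value}")
```

Run against a regular export, a report like this gives you a short list of values to investigate rather than proof of a problem. Expect some false positives, such as order numbers that happen to look like phone numbers.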
Third, ensure your developers understand how to handle PII. Whether it's part of the interview process, part of onboarding, or ongoing training efforts, developers should not only understand best practices around security certificates, passing data via the URL, and other standard development practices, but also the impact of their solutions on web analytics.
Finally, build logic around your implementation, whether natively or through a tag management solution, to strip out PII before it has a chance to make it into your Google Analytics hits. Phone numbers, email addresses, and other PII that you may handle can likely be targeted and proactively removed with a regular expression.
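To make the idea of stripping PII before the hit concrete, here is a small sketch, again in Python for readability; the same logic ports to JavaScript inside a tag manager. Everything named here (`redact_url`, `redact_text`, the `BLOCKED_PARAMS` list, the `REDACTED` placeholder) is hypothetical and chosen for illustration. It assumes you know which query parameters are risky and that emails and US-style phone numbers are the main free-text concern.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Hypothetical parameter names whose values are always dropped.
BLOCKED_PARAMS = {"email", "user", "name", "phone"}
PLACEHOLDER = "REDACTED"


def redact_text(text):
    """Replace anything that looks like an email or phone number."""
    text = EMAIL_RE.sub(PLACEHOLDER, text)
    return PHONE_RE.sub(PLACEHOLDER, text)


def redact_url(url):
    """Redact blocked query parameters and PII-like values in a URL."""
    parts = urlsplit(url)
    cleaned = []
    # parse_qsl decodes percent-encoding, so "jane%40example.com" is caught too.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key.lower() in BLOCKED_PARAMS:
            cleaned.append((key, PLACEHOLDER))
        else:
            cleaned.append((key, redact_text(value)))
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        redact_text(parts.path),
        urlencode(cleaned),
        parts.fragment,
    ))


print(redact_url("/submit?user=john.doe&page=2"))
print(redact_url("/search?q=jane%40example.com"))
print(redact_text("Call 555-123-4567 now"))
```

Pattern matching like this will not catch a name embedded in a path such as /user/john-doe/. Those cases need route-specific rules, which is one more reason to involve your developers.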
How to Capture PII Correctly
A common misconception is that website owners cannot leverage the PII they do have to create better data sets. Thankfully, Google provides guidelines for using the information you have to build more accurate datasets for better decision-making.
Take, for example, the User ID functionality. If you're unfamiliar, this Google Analytics feature takes an ID you provide and uses it, rather than the device-specific Client ID alone, to recognize a returning user, creating a far more accurate data set over time.
Isn't a User ID personally identifiable? Maybe, maybe not. But to be on the safe side, pass Google your User ID as a salted SHA-256 hash rather than the raw value. SHA-256 is a one-way hash, not encryption, so if someone with malicious intent were to capture these values, they couldn't simply be reversed into anything identifiable.
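A minimal sketch of the salted hash, using Python's standard `hashlib` module. The `hash_user_id` name, the `salt:id` layout, and the placeholder salt are our own assumptions; what matters is that the salt stays secret and stays the same, so a given user always maps to the same hashed value.

```python
import hashlib


def hash_user_id(user_id: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest of user_id.

    The salt is prepended so that the same ID hashed by another site
    (with a different salt) produces a different value.
    """
    payload = f"{salt}:{user_id}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Placeholder only: load the real salt from configuration, never from page source.
SALT = "replace-with-a-secret-value"

hashed = hash_user_id("12345", SALT)
print(len(hashed))                           # 64 hex characters
print(hashed == hash_user_id("12345", SALT)) # True: stable for the same user
```

Do the hashing server-side where you can. If the salt leaks, short or sequential IDs can be recovered by simply hashing every candidate, so treat the salt like any other secret.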
Lastly, one other common misconception is that building view filters or excluding query parameters is a satisfactory way to handle PII in your data. This is false! By the time filters are run and query parameters are excluded, your data is already stored on Google's servers. Any additional formatting is exclusively for your analytical benefit.