
Simo Ahava podcast: All data is corrupt. What to do about it?

In this week’s episode of Growth Interviews, we invite you to join our podcast conversation with Simo Ahava, partner and Senior Data Advocate at 8-bit-sheep, helping with strategy, transformation, enterprise innovation, tech, user experience and metrics.

Simo Ahava has also been a Google Developer Expert for Google Analytics and Google Tag Manager since 2014, working to improve the entire “life cycle” of data collection, processing, and reporting. He writes on his blog mostly about web analytics, but also about digital marketing (more specifically SEO), web development and working with the Google Cloud Platform.

He believes in the power of data and technology. He is also a strong advocate for altruistic knowledge sharing.

Welcome to Growth Interviews!

Welcome to Growth Interviews, the fun, stimulating and engaging series of conversations driven by digital business growth. Our mission is to provide valuable insights from the eCommerce arena, and each episode is a fascinating quest into the best-kept business secrets and money-making strategies of an insightful world-class expert.

In today’s episode, Simo Ahava has shared with us some incredible insights about the usage of corrupt data in marketing and what to do about it. Is it good? Is it bad for business? Is it normal? He answered all these questions and much more in this fascinating data-driven interview.

Here are the biggest takeaways:

  • The most probable misuse of data in marketing – 07:45
  • What is fixable and not in the misuse of corrupt data – 20:09
  • Should web tracking be blocked? – 34:19
  • Anomaly detection – the next best thing in machine learning – 41:15

Listen and subscribe to our podcast! You can find us on: Podcast.co, Spotify, Apple Podcasts, Google Podcast, Overcast, Acast, TuneIn, Pocket Casts, Breaker, Stitcher

Google’s most attractive feature

Drawing on his history as a Google Developer Expert, Simo Ahava pointed out that “Google was always a very clear choice for [him].” How so? First of all, Google Analytics being a free tool was a clear advantage and a perfect incentive for those who wanted to get into analytics just as a “passion project”.

Secondly, being able to play with the technology and its GTM (Google Tag Manager) extension over a long and fruitful period encouraged Simo not to give up on the Google tools, even when competitors started to appear on the market.

Simo feels Google has always been the best solution for him because he never felt pressured to be “a generalist” tool-wise, even though at conferences and workshops he talks about generalist topics such as building organizations around data.

The data-crunching capabilities of the Google Cloud Platform also amaze him. There is unexplored potential there, as people are gradually finding cloud computing cost-effective for even the most mundane automation tasks.

The most probable misuse of data in marketing

Simo is a strong believer that the tools people use are not the issue to look at. On the contrary, the problem lies within the people that use them. “They think good data is a tool selection problem or a platform selection problem,” he argues. The discussions that arise in organizations focus mainly on how to use the tool better, treating data and analytics as a project where they just need to find the right combination of tactics and platforms. However…

“Data is very much a reflection of how our organizations are built and how communication structures flow within the organization.”

Data will always reflect the holes and data discrepancies in places where people should have communicated properly, from IT to marketing, to recruitment, to business development and so on. “It will always leak in data from the CRM, from your user databases, from the applications, from other things that the users do.”

The biggest mistake Simo sees here is when people treat Google Analytics as canon, as if it reflected exactly what visitors are doing and reported sales data just like the sales engine does. This could not be further from the truth. Another issue is the current hot topic of privacy and data breaches, which arises when data is collected and opt-in and opt-out mechanisms are introduced. The concern here is accidentally collecting harmful data from visitors and transferring it to Google and Google Analytics. Imagine the scale of the brand damage and the steps needed to resolve the situation! Yet, as Simo assures us, this happens even at bigger companies.

Other risks mentioned around this topic include accidentally deploying malware such as password loggers and credit card skimmers through GTM, by adding JavaScript you were not aware would act as such.

What is fixable and not in the misuse of corrupt data

“Will you start fixing that corrupt data, at the same time risk breaking all your processes? And I know that this sounds really weird but in some cases, all data is corrupt. Let’s start with that. That’s a fact. There’s no way that data will reflect reality.”

Using two examples, Simo Ahava described one unfixable situation and one fixable situation that arise when corrupt data is used in companies.

What is unfixable, yet not problematic? Transactions in analytics being 10%, 15% or 20% off from what the sales engine actually records. Given that marketing campaigns are already adjusted to this kind of gap, the only possible solution could be to search through the advertising data, since this is the most unreliable and compromised type of data. Be efficient: log every known source of corrupted data in the system, make a note of the unfixable ones, and introduce them as variables in your analysis.

What is fixable, yet the reverse can become unethical? Fixable would be a situation where Google Analytics displays 30% more transactions than there are in reality. This points to an implementation mistake, not to something about the visit context. A typical cause is revisits to the “thank you” page, which send the same transaction again. In other words, the same transaction ID is recorded more than once, so those transactions are duplicated in reports and analyses.
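The duplicate “thank you” page case can be illustrated with a minimal sketch that keeps only the first hit for each transaction ID. The `TransactionHit` record and its fields are hypothetical and only stand in for whatever your export of purchase hits looks like; this is not how Google Analytics itself handles duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionHit:
    # Hypothetical record for one purchase hit exported from analytics.
    transaction_id: str
    revenue: float
    timestamp: int  # seconds since epoch


def deduplicate_transactions(hits):
    """Keep only the earliest hit for each transaction ID."""
    first_seen = {}
    for hit in sorted(hits, key=lambda h: h.timestamp):
        # setdefault stores the hit only if the ID has not been seen yet.
        first_seen.setdefault(hit.transaction_id, hit)
    return list(first_seen.values())


# Made-up sample data: T-1001 is sent twice because the page was reloaded.
hits = [
    TransactionHit("T-1001", 59.90, 100),
    TransactionHit("T-1002", 120.00, 160),
    TransactionHit("T-1001", 59.90, 900),
]
unique_hits = deduplicate_transactions(hits)
print(len(hits), "hits,", len(unique_hits), "unique transactions")
```

Cleaning the data after the fact is only a stopgap; the more durable fix is to stop the “thank you” page from sending the same transaction again.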

If the reverse were true (Google Analytics reporting 30% fewer transactions than in reality), fixing it could become an ethical issue, depending on the cause:

  • Answer #1: An implementation mistake. Some browsers are incompatible with the JavaScript code you’re trying to use, so you can simply fix the code.
  • Answer #2: Ad blockers or modern browsers that block user scripts. Should you act upon this case? Most probably not. This is a feature that lets users block Google Analytics. If they block transactions, they are most likely blocking everything else about their sessions as well. So this doesn’t mean that all the other metrics are inflated; everything about this particular user is simply missing from the data set. Should you reintroduce this data? Considering that the user wants to block, they probably should be allowed to do so. Therefore, this is an ethics-related situation.

Simo concludes that “you should start fixing what’s fixable and then labeling the stuff that’s unfixable and not trying to fix that.”

About browsers & co and where to find more info

Simo Ahava created cookiestatus.com out of a wish for a single resource where people working in analytics can learn about news such as the latest iteration of Safari’s tracking prevention mechanism. The website is an open-source information portal that documents the various tracking protection mechanisms implemented by the major browsers and browser engines, such as Chrome, Edge, Firefox or Safari.

“[cookiestatus.com] is a purely collaborative effort and I hope that it’s very useful. The target audience would most likely be multidisciplinary teams comprising developers, analysts and marketers because it explains how things like Google Analytics cookies no longer function as we would expect them to. And if there is something that we can do about it.”

Simo also confirms that browsers are becoming much stricter. For instance, even though Chrome does not have tracking prevention mechanisms like other browsers (Safari, Edge, Brave), engineers at Chrome are very interested in protecting the privacy of the user.

People in the analytics domain cannot avoid suffering from the restrictions that browsers place on the technologies used to collect useful data, even if they deal only with first-party analytics, do not leak data, and follow the GDPR rules on opt-in and consent.

Should web tracking be blocked?

“If you create server-side proxies for analytics data, obviously you’ll have better quality data because you’ll be able to avoid ad blockers”, but is this ethical for users who want to block tracking? Again, probably not. 

“The worst-case scenario is that all tracking should be blocked. I think that’s my opinion. I know many people disagree with it. I think many people use the brick and mortar store analogy where if a user enters the store, this would mean that the shopkeeper would have to look away so that they don’t see the user ID or something silly like that.”

Even big players cannot realistically support economic interests if they are sacrificing user autonomy and user independence, together with the user’s right to their own data. Legal frameworks themselves will start acting and imposing huge fines in this case.

However, Simo believes that browsers are pushing ad tech vendors to cooperate with first parties, since browsers are communicating that tracking is the first party’s decision and that there is nothing browsers can do about it. He gives the example of Facebook, which convinces site owners to run its tracking scripts as first-party scripts: Facebook offers a proxy script to install on your web server, and you then start routing Facebook hits through that proxy. As a result, Facebook is no longer blocked and keeps doing cross-site tracking with no interruptions from the browsers.

E-commerce 2020 growth ideas

  1. Audit the whole pipeline of e-commerce data collection. Start from how the sales engine collects that information from the successful purchase, how credit card validations are done, what happens on the “thank you” page, how data is collected to an analytics system and what kind of metadata is being set with the data.
  2. Educate people working with that data and introduce the paradigm to them. Make sure that everybody understands what e-commerce analytics is, what product analytics is, and what kind of things can and cannot be measured.
  3. Turn it into an Agile process in which the implementation is set up and then re-evaluated periodically. At least once a month, make sure that all the metadata that needs to be collected is indeed collected. Make sure that the data discrepancies are holding at a steady rate. Make sure that all changes are predictable. Make sure that there are rollback mechanisms and enforce a common understanding of what can be done with e-commerce data.
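The monthly check that discrepancies are “holding at a steady rate” could be sketched as follows. The function names, the 5-percentage-point tolerance and the sample counts are all assumptions made for illustration; none of them come from the interview.

```python
def discrepancy_rate(analytics_count, sales_engine_count):
    """Fraction by which analytics undercounts the sales engine.

    The result is negative when analytics reports more than the sales engine.
    """
    if sales_engine_count == 0:
        raise ValueError("sales engine reported no transactions")
    return (sales_engine_count - analytics_count) / sales_engine_count


def is_steady(rates, tolerance=0.05):
    """True if every period's rate stays within `tolerance` of the mean rate."""
    mean = sum(rates) / len(rates)
    return all(abs(rate - mean) <= tolerance for rate in rates)


# Made-up (analytics, sales engine) transaction counts for four months.
monthly_counts = [(850, 1000), (880, 1040), (790, 930), (600, 1010)]
rates = [discrepancy_rate(a, s) for a, s in monthly_counts]

print("first three months steady:", is_steady(rates[:3]))
print("all four months steady:", is_steady(rates))
```

In this sample the first three months sit at roughly 15%, which matches the kind of unfixable but stable gap described earlier, while the fourth month breaks the pattern and would be worth investigating.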

Anomaly detection – the next best thing in machine learning

Apart from being a useful tool in case of repetitive tasks, one of the most interesting applications of machine learning is, in Simo’s opinion, anomaly detection, being able to identify broken traffic and corrupt data without human interaction.
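To give a feel for what flagging broken traffic automatically can look like, here is a minimal sketch. It uses a simple statistical baseline (a modified z-score built on the median and the median absolute deviation) rather than a trained machine learning model, and the function name, the 3.5 threshold and the sample numbers are assumptions for illustration only.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return the indexes of values whose modified z-score exceeds `threshold`."""
    center = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - center) / mad) > threshold
    ]


# Made-up daily session counts: one tracking outage and one traffic spike.
daily_sessions = [1020, 980, 1005, 995, 1010, 40, 990, 1000, 1015, 3200]
print(find_anomalies(daily_sessions))
```

The median-based score is used here because a plain mean and standard deviation would themselves be distorted by the very outliers the check is looking for.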

“Machine learning is greatly used by browsers’ tracking protection mechanisms. They’re trying to identify possibly hazardous sources that are doing cross-site tracking. Machine learning algorithms are identifying these domains that might be participating in some kind of tracking that’s hazardous to user privacy. Those are really interesting applications. And obviously, once those mechanisms are introduced, then hackers will introduce counter mechanisms, also relying on machine learning – an interesting feedback loop!”

Conclusion

Simo Ahava is an oasis of intelligent insights, some of the most interesting ones being shared in this very interview: Google’s most intriguing feature, what is fixable and what is not in the pool of corrupt data (that is unavoidable, by the way!), browsers and the ethical question of whether or not to allow users to block web tracking, three amazing e-commerce growth ideas that you cannot miss this year and a little something about anomaly detection. It’s a full package of shared knowledge!

We hope you enjoyed our podcast interview with Simo Ahava!
For more valuable insights, make sure you come back to check out our next Growth Interviews as well.

