Marco Giordano

@GiordMarco96

27 Tweets Apr 03, 2023
Handling larger websites is annoying.
Especially if you have 10K+ articles, it's challenging, right?
But don't worry, Analytics can do this and much more.
A short guide to analyzing large websites 🧵
First of all, this is often not Big Data.
Before you get your hopes up, it's simply handling larger datasets.
In most cases, your laptop is completely fine.
The edge cases where you need better solutions still exist though.
The first thing to remember is that you don't really need to analyze every detail.
Some sections are more important or carry more weight than others.
Always start by studying the website at a higher level; you'll save a lot of time and resources.
If it's a publisher and you clearly have separate categories, start by analyzing them separately.
You can crawl in batches, which is more efficient overall.
There is no reason to do one super big crawl; I recommend divide et impera.
For the actual crawling, there are cases where Screaming Frog is still a viable option.
@screamingfrog has an excellent guide that explains how to do it.
The database storage option is a must-have in such cases.
screamingfrog.co.uk
For websites that are even bigger (50K+ pages), you may need alternative solutions.
Other than Lumar and Botify, you can try to create your crawler too.
Scrapy is the best library out there, although a little bit complex at first.
I crawled a super large beauty website as an exercise some time ago using only #Python.
Instead of using Scrapy directly, I used advertools, which is built on top of it.
The speed is amazing and you can pretty much do what you want.
advertools.readthedocs.io
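To make this concrete, here is a minimal crawl sketch with advertools; the seed URL and output file name are placeholders, and follow_links just tells it to keep following internal links from the section you point it at.

```python
import advertools as adv
import pandas as pd

# Crawl one section of the site with advertools (built on Scrapy).
# The seed URL is a placeholder; point it at one section/category at a time.
adv.crawl(
    url_list="https://example.com/category/",
    output_file="category_crawl.jl",  # advertools writes JSON lines
    follow_links=True,                # keep following internal links from the seed
)

# Load the crawl output for analysis.
crawl_df = pd.read_json("category_crawl.jl", lines=True)
print(crawl_df[["url", "status", "title"]].head())
```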
You can also use Screaming Frog on the cloud or have your custom functions.
These cases are rarer but they happen.
I don't think many of you will actually use a cloud solution for crawling though.
The tool is just a means; what matters is what you do with the data.
Speaking of which, you also need a reliable way to get GSC data somewhere safe.
The 1-button integration with BigQuery finally solves this issue for you.
However, Search Console data tends to be quite sampled and inaccurate for larger websites.
The best solution to fight this issue is to create multiple properties.
This way, you get less sampling and more accurate info.
Consider this a must for many big websites.
Google Analytics suffers from sampling too, and a similar workaround applies there.
What matters is that you can merge this data later.
There isn't much else you can do at this point.
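As a rough sketch, this is how you could pull the bulk export back out with Python; the project ID is a placeholder, and the dataset/table/column names follow the standard GSC export (searchconsole.searchdata_url_impression), so double-check them against your own export before running anything.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

# Aggregate clicks/impressions per page from the GSC bulk export.
# Dataset/table/column names follow the standard export; verify yours first.
query = """
SELECT
  url,
  SUM(clicks) AS clicks,
  SUM(impressions) AS impressions
FROM `your-gcp-project.searchconsole.searchdata_url_impression`
WHERE data_date BETWEEN '2023-01-01' AND '2023-03-31'
GROUP BY url
ORDER BY clicks DESC
"""

gsc_df = client.query(query).to_dataframe()  # requires pandas + db-dtypes
print(gsc_df.head())
```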
For analyzing data, Google Sheets isn't enough.
The row limit and the lack of computational power make it unsuitable.
This is where you finally switch to SQL/Python/R.
And those limits aren't the only reason to switch.
BigQuery allows you to use SQL for pretty much anything.
I don't think this is the best option either, as you'll usually want that data out of BigQuery anyway.
Vertex AI is a nice solution, but your laptop is enough in many cases.
SQL is good to tackle those cases where the data is too big and queries are more efficient.
In SEO, those cases are quite rare and you won't have to get your hands that dirty.
That's why I recommend Python/R over SQL.
The straightforward choice is pandas/tidyverse, the go-to data libraries for Python/R.
The problem is they are not enough for larger datasets.
You need faster alternatives, such as polars or data.table.
They are worth it if you want more power.
Spreadsheets completely fail to keep up at this point.
Even the most basic operation can be daunting, depending on the number of rows and columns you have.
This is super common once you have GSC data broken down by date, query, page, and country.
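Here is a minimal polars sketch for exactly that kind of file; the file name and column names (page, clicks, impressions) are assumptions based on a typical GSC export, and group_by requires a recent polars release (older versions call it groupby).

```python
import polars as pl

# Lazily scan a large GSC export; nothing is loaded until .collect().
lazy_gsc = pl.scan_csv("gsc_export.csv")  # placeholder file name

# Aggregate clicks and impressions per page without reading the whole file into memory.
top_pages = (
    lazy_gsc
    .group_by("page")  # assumed column names: page, clicks, impressions
    .agg(
        pl.col("clicks").sum(),
        pl.col("impressions").sum(),
    )
    .sort("clicks", descending=True)
    .collect()
)

print(top_pages.head(10))
```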
Once you are able to get your data, you need to clean it.
I always recommend removing the following (see the sketch right after this list):
- branded queries
- foreign queries
- pages with # (ToC) or any parameter
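A minimal pandas sketch of those filters, assuming query, page, and country columns; the brand terms and target markets are placeholders you'd replace with your own.

```python
import pandas as pd

df = pd.read_csv("gsc_export.csv")  # placeholder file; assumes query, page, country columns

brand_terms = ["acme", "acme shop"]  # placeholder branded terms
brand_pattern = "|".join(brand_terms)

cleaned = df[
    ~df["query"].str.contains(brand_pattern, case=False, na=False)  # drop branded queries
    & ~df["page"].str.contains(r"[#?]", regex=True, na=False)       # drop ToC fragments and parameters
]

# "Foreign" queries are harder to detect from text alone; a rough proxy is filtering by country.
cleaned = cleaned[cleaned["country"].isin(["usa", "gbr"])]  # placeholder target markets
```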
If you were able to categorize your pages, great: the analysis will be a piece of cake.
If not, it can become a big problem.
First, ask whether there is already a content plan with labeled data.
If not, keep reading.
Screaming Frog allows for Custom Extraction, but pretty much any crawler can do this.
You can scrape the site's existing categories if you identify the correct HTML element.
This only works if those categories are good and actually what you were looking for.
In all other cases, you can try custom rules based on URL patterns.
If a page contains this pattern, then put it into this group.
It only works if you have expressive URLs though!
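A small sketch of such URL rules in Python; the patterns and labels are placeholders for whatever your site structure actually looks like.

```python
import pandas as pd

# Map URL patterns to page groups; patterns and labels are placeholders for your own structure.
RULES = [
    ("/blog/", "Blog"),
    ("/product/", "Product"),
    ("/category/", "Category listing"),
]

def categorize(url: str) -> str:
    for pattern, label in RULES:
        if pattern in url:
            return label
    return "Other"

df = pd.read_csv("gsc_export.csv")  # placeholder file with a "page" column
df["section"] = df["page"].apply(categorize)
print(df["section"].value_counts())
```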
Once you finish, you can split your queries/pages into groups depending on their performance.
I like to create custom bands to see how I can group pages.
What are the most interesting queries?
How can I spot weak pages?
And so on...
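Here's one way to build those bands with pandas; the thresholds and labels are arbitrary placeholders you'd tune per site.

```python
import pandas as pd

df = pd.read_csv("gsc_export.csv")  # placeholder file with "page" and "clicks" columns

# Total clicks per page, then bucket pages into custom performance bands.
page_perf = df.groupby("page", as_index=False)["clicks"].sum()

bins = [-1, 0, 10, 100, 1000, float("inf")]  # placeholder thresholds
labels = ["No clicks", "1-10", "11-100", "101-1000", "1000+"]
page_perf["band"] = pd.cut(page_perf["clicks"], bins=bins, labels=labels)

print(page_perf["band"].value_counts().sort_index())
```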
Plotting comes next, as you want to tell stories or get a quick overview of your website.
Barplots and heatmaps are mostly fine; no need for cringy pie charts.
It's that simple at this point, yes.
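A quick barplot sketch with matplotlib, assuming you already have a section label and clicks per row; the file name is a placeholder.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("gsc_labeled.csv")  # placeholder file with "section" and "clicks" columns

# Clicks per site section as a simple barplot.
clicks_by_section = df.groupby("section")["clicks"].sum().sort_values(ascending=False)

clicks_by_section.plot(kind="bar")
plt.ylabel("Clicks")
plt.title("Clicks by site section")
plt.tight_layout()
plt.show()
```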
For the insights/actions, please remember that a large website is not the same as your blog.
You need stronger prioritization of certain areas of the website instead of trying to do it all.
It's naive to think you can treat it like a niche website.
For Ecommerce, the rules are a little bit different but the idea is the same.
Focus on areas, crawl by sections, and find the sweet spot.
It's more challenging for sure but you'll get used to it.
If you want to learn more, you can consult my Analytics ebook:
marcogiordano96.gumroad.com
Follow me for threads, tips, and case studies (coming soon) about SEO, content, and Python/data.
If you liked this thread, consider liking and retweeting it! 🧵
I offer:
- Content audits for publishers and B2C content
- Consultancies and freelancing for publishers and B2C content
- Training and mentorship for data to any business/agency
bookk.me
