One of my projects has a bunch of XML Sitemaps. When I QA these, I want to download live copies to use in my analysis. It’s pretty impossible to do this at scale without a tool. There are crawlers and other tools that can do this for you, but I’m old school, and I like touching the raw data with my grimy fingers. This tutorial is how I bilk-prepare sitemaps for auditing.
Downloading XML Sitemaps in Bulk
This tutorial assumes that the sitemaps you want are neatly organized in a Sitemap Index. If not, do that first. The second prerequisite is to have a bulk-download browser extension. I use one called Simple Mass Downloader for Chrome. Once that’s all straightened out, navigate to the sitemap index URL and do the following.
- Open Simple Mass Downloader and click Load Page Links
- Click File name to bulk-select the XML sitemaps
- Click the plus sign at the bottom to add files to the passive queue
At this point, it might be helpful to update your browser’s download settings. Typically, I keep my browser set to the default download directory, and I have to approve every download. When I download sitemaps, I change the default download folder to a custom destination and untick Ask where to save each file before downloading. Otherwise, I have to approve every single download which sucks.
Don’t forget to undo this settin after your download is done.
- Click Start Selected and all of the files are dumped into your default downloads folder.
Loading Bulk Sitemaps into Excel
If you don’t have Power Query Editor for Excel, you’ll need to download and install it. If you don’t have a PC, you’re not out of luck. Run Windows on a virtual machine and do all of this fancy stuff in there.
- Click Get Data > From File > From Folder
- Browse to the folder where you downloaded the XML sitemaps and click OK.
- The next screen shows structured data for every file in your chosen folder. Click Edit.
- Click the two down-arrows next to the Content column to combine all of the sitemap data into one set.
- In the Combine Files screen, click url to load the data into the preview window. If everything looks good, click OK.
- Now all of my sitemaps are parsed and laid-out in columns. I can do additional cleaning and structuring in here, but this is good for now. Click Close & Load > Close & Load To…
- In the Import Data screen, choose Only Create Connection and Add this to the Data Model.
This step stores the data instead of laying it out in one sheet. Excel limits the number of lines to just over 1 million, and some of my projects have a lot more URLs than that.
Now all of my sitemap data is parsed and ready to pivot into manageable chunks.