Developer Corner – the Document Report

16 Nov 2020 | Developer Corner

Michael Carter

What is the Document Report?

We scan your full site to find all your PDFs and MS Office documents that can be accessed from your webpages. Typically, we find thousands of documents, so all the information is compiled into a spreadsheet and emailed to you.

Why a spreadsheet? When we asked Sitemorse Clients how they address issues with website documents, they told us that PDFs are their most common type of document and that PDF issues almost always involve a different part of the organisation from the web team. It might be staff who used MS Word to save as PDF, or it might be third parties who create PDFs using Adobe InDesign. In most cases, the Client just wanted a list of the PDF issues, which they then turned into a spreadsheet. We decided to cut straight to the chase and email the spreadsheet directly.

Although most Clients are concerned with PDFs, we can find the MS Office documents at the same time, so we include those in the report as well. Often Clients do not realise how many MS Office documents they have on a site. Some Clients do not know they have these documents on the site at all, and occasionally the Client response is, “Oh my goodness! What’s that spreadsheet doing on the website?”

The report runs every quarter. This reflects the fact that the process of changing PDFs and MS Office documents used on the website tends to take longer than updates to web pages.

The spreadsheet is divided into several tabs. Here is an explanation of the tabs.

Summary tab

This tab shows summary information including the site assessed and the date of the assessment. It contains the totals for each of the individual tabs and some extra information about PDFs that are not tagged and not linearised. This tab also lists any warnings, for example, if any information has been truncated.

Tagging helps make PDFs accessible, so it is worth knowing whether all your PDFs are tagged. Adobe Acrobat's built-in accessibility checker looks for it as well.

Linearisation is a way to optimise a PDF for web download. It is not wrong to have unlinearised PDFs, but it is useful to know how many of yours are not optimised for web use.
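If you want a rough local check of your own, here is a minimal sketch. It is not how Sitemorse performs the test; it only relies on the fact that a linearised PDF declares a /Linearized entry within the first 1024 bytes of the file. The function names and the folder argument are our own illustration.

```python
from pathlib import Path


def looks_linearised(pdf_bytes: bytes) -> bool:
    """Heuristic: a linearised PDF declares /Linearized near the start of the file."""
    head = pdf_bytes[:1024]
    return b"%PDF-" in head and b"/Linearized" in head


def count_unlinearised(folder: str) -> int:
    """Count the PDFs in a local folder that do not look linearised."""
    return sum(
        1
        for path in Path(folder).glob("*.pdf")
        if not looks_linearised(path.read_bytes())
    )
```

This is a heuristic only. A file that was linearised and later edited can still carry the marker, so treat the result as a hint rather than a verdict.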

This tab also contains a link to the underlying Sitemorse assessment report used to compile the spreadsheet information so you can examine more detail there if you wish.

HELP tab

This tab has a brief description of the other tabs but contains less information than you will find in this blog.

PDFs tab

This tab lists every PDF found, a SMARTVIEW link to each PDF found, and all the items that link to this PDF.

It is worth noting that this is a list of all PDFs, regardless of whether we found any issues. Often you need to know the location of every PDF on your site and the URL of every item that links to it. In Sitemorse reports, we usually just list items that have failed some test and need to be improved. We do this so you are not overwhelmed with data. This tab is different. It might contain a great deal of data if your site has many PDFs. In testing the PDF report with Sitemorse Clients, we often found this tab had five or six thousand rows. All the PDFs are here so you can use MS Excel to sort them, search them, or whatever takes your fancy.

Broken links tab

This tab lists every PDF that contains a broken link, a SMARTVIEW link to each PDF, and each broken link in that PDF.

You will be familiar with the links section of a Sitemorse report that shows which items contain broken links. Here we have a list of just the PDFs that contain broken links, one PDF per row. For each row, there will be at least one column that contains the URL of the link that is broken. There might be many columns with URLs if a PDF contains multiple broken links.
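One row per PDF with a variable number of URL columns is easy to read, but awkward to filter or feed into another system. As a sketch, the following turns such rows into one row per broken link. The column positions are an assumption for illustration (PDF URL first, SMARTVIEW link second, broken links after that), so check them against your own spreadsheet.

```python
from typing import Iterable, Iterator


def to_long_format(rows: Iterable[list[str]]) -> Iterator[tuple[str, str]]:
    """Yield one (pdf_url, broken_link) pair per broken link.

    Assumed layout (hypothetical): column 0 is the PDF URL, column 1 is the
    SMARTVIEW link, and every later non-empty cell is a broken link URL.
    """
    for row in rows:
        if not row:
            continue  # skip completely empty rows
        pdf_url = row[0]
        for cell in row[2:]:
            if cell.strip():
                yield pdf_url, cell.strip()


# Made-up rows, shaped like an export of the tab without its header row.
sample = [
    ["https://example.com/a.pdf", "smartview-1", "https://example.com/x", "https://example.com/y"],
    ["https://example.com/b.pdf", "smartview-2", "https://example.com/z", ""],
]
for pdf_url, broken_link in to_long_format(sample):
    print(pdf_url, broken_link)
```

The long format gives you one line per fix to make, which is usually what the person editing the PDF needs.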

Bad emails tab

This tab lists every PDF that contains an email address that is badly formed or does not work, a SMARTVIEW link to each PDF, and each bad email address in that PDF.

Remember, we check how the email address is formatted (does it have the @ in the right place, etc.) and, equally important, whether the email server recognises the address. That means we send a message to the email server to ask whether it recognises the email address. PDFs are not updated as often as web pages, so it is quite common for an email address to be in a PDF that was published a few years ago. Since then, the recipient has left, and the address has been removed from the email server. The email address in the PDF is perfectly well formatted, but it will never work because the address no longer exists.

Each PDF with a bad email address is on a separate row, with one column for each email address that is not correct.
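The two kinds of check can be sketched as follows. This is an illustration only, not the Sitemorse implementation: the pattern is deliberately simple, and the mail host, the sender address and the smtp_factory parameter (there so the function can be tried without a network) are all our own assumptions. The sketch uses Python's standard smtplib and asks the server about the recipient without sending a message.

```python
import re
import smtplib

# Deliberately simple pattern: exactly one "@", no spaces, a dot in the domain.
_ADDRESS = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_well_formed(address: str) -> bool:
    """Format check only: is the @ in a sensible place?"""
    return _ADDRESS.match(address) is not None


def server_accepts(address: str, mail_host: str, smtp_factory=smtplib.SMTP) -> bool:
    """Ask mail_host whether it would accept mail for address.

    mail_host is assumed to be the mail server for the address's domain;
    finding it (an MX lookup) is outside this sketch.
    """
    with smtp_factory(mail_host, 25, timeout=10) as smtp:
        smtp.helo()
        smtp.mail("checker@example.com")  # hypothetical sender address
        code, _reply = smtp.rcpt(address)
    return 200 <= code < 300  # 2xx means the server accepted the recipient
```

Bear in mind that some mail servers accept every recipient at this stage, and others refuse probes like this, so a check of this kind cannot be completely reliable.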

Accessibility tab

This tab lists every PDF and the number of WCAG 2.1 tests it failed in the Sitemorse automated tests. It includes PDFs that passed the tests. The second column is a link to SMARTVIEW, and the remaining columns represent each WCAG 2.1 PDF technique.

Here we list all the WCAG 2.1 PDF techniques (tests), that is PDF1 through PDF23. Some of these tests are manual so we cannot test them, but we have a full list for completeness.

We have included all PDFs here so you can easily see which have passed the automated checks. This results in a very large amount of information; however, this is the ideal dataset to export to other systems, create charts, etc.

Duplicates tab

This tab lists all PDFs that were identical (the same binary file) but were served from different URLs by your webserver. It is most likely that you have multiple copies of the same PDF on your site. If there are several duplicates of the same file, these are listed in separate columns.

With many PDFs in circulation within an organisation, it is easy to add the PDF to your website and links to that PDF without realising that other groups in your organisation have had the same idea. This results in multiple copies of the same PDF being served from your webserver. This is not a problem as such until you need to change the PDF. Then you need to find all the copies and that is made more difficult if you did not know there were multiple copies in the first place.

If you are updating your PDFs to improve the accessibility, this can be a time-consuming and costly exercise. The last thing you want is to find out you have wasted resources applying the same changes to a load of copies.
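If you are curious how identical files can be spotted, one common approach is to hash the content of each file and group the URLs by hash. The sketch below uses SHA-256 and takes a mapping of URL to downloaded bytes; both choices are ours for illustration, and we are not claiming this is exactly what happens behind the report.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents: dict[str, bytes]) -> list[list[str]]:
    """Group URLs whose content is byte-for-byte identical.

    documents maps each URL to the bytes served from it. Only groups with
    more than one URL are returned.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for url, content in documents.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]
```

Note that this only catches exact copies. A PDF that has been re-saved, even with no visible change, will normally produce a different hash.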

Missing tab

This tab lists the URLs we found on your site that linked to PDFs, but the link was broken. It contains both links to PDFs we could not find on your site and links to external PDFs from your site.

Here we are showing you broken links but just broken links to PDFs rather than all broken links to all items. In other words, this is a list of PDFs that should exist, but we cannot find them. This list is for both PDFs that are on your website and PDFs you link to on external sites. You can sort this list by URL in MS Excel to find just the ones that should be available on your site if that is something you need to address internally. The broken links to external PDFs can be given to the web editors to update.
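If you would rather split the list outside MS Excel, here is a small sketch that separates the missing PDFs on your own site from the external ones by comparing host names. The function name and the site_host parameter are our own, and the URLs in the usage example are made up.

```python
from urllib.parse import urlsplit


def split_missing(urls: list[str], site_host: str) -> tuple[list[str], list[str]]:
    """Return (internal, external) URL lists, each sorted.

    A URL counts as internal when its host name equals site_host,
    ignoring case. Subdomains are treated as external in this sketch.
    """
    internal: list[str] = []
    external: list[str] = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        (internal if host == site_host.lower() else external).append(url)
    return sorted(internal), sorted(external)


internal, external = split_missing(
    ["https://www.example.com/docs/a.pdf", "https://other.example.org/b.pdf"],
    "www.example.com",
)
print(internal)
print(external)
```

The first list is the one to address internally; the second can go to the web editors.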

MS Office tab

This is a list of all MS Word, MS Excel and MS PowerPoint files we found on your site and the pages they are linked from.

MS Office broken links tab

This tab lists every MS Office document that contains a broken link and each broken link in that document.