SEO: Automated XML & HTML sitemap generation #220

Open
audi5 opened this issue Sep 29, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

audi5 commented Sep 29, 2021

Site indexation (the discoverability of content) is an important SEO impact area: content has to be crawled to show up in search results, and XML sitemaps are a good centralized way to make that happen. We also need an HTML sitemap to manage site structure, clean up content, etc.

We need an XML and HTML sitemap generator for Adobe.com, HelpX, Acrobat and other sites currently on the Dexter Platform, consolidating URLs from all AEM versions.

That will help the SEO and Authoring teams see all currently published pages, analyze the site structure, and familiarize themselves with every page on the site.

For XML format that is needed for SEO purposes, we need: http://adobe-consulting-services.github.io/acs-aem-commons/features/simple-sitemap.html

Please also provide the capability to add a URL manually in case the page is not hosted on AEM.

We need an XML format for SEO and an HTML format for Design.
HTML format – production publish instance URLs
URLs outside of AEM 6.0 need to be manually added to the list

Need to figure out a way to show only public URLs rather than full path URLs (/sitemap.xml)
Can the sitemaps be split by geo – Yes
http://www.adobe.com/robots.txt
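
To illustrate the geo split, here is a minimal sketch of a sitemap index that points at one sitemap per geo. It assumes a `<origin>/<geo>/sitemap.xml` layout and uses the standard sitemaps.org namespace; the function name `build_sitemap_index` and the geo list are hypothetical, not part of any existing implementation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(origin: str, geos: list[str]) -> str:
    """Return a sitemap index with one <sitemap> entry per geo.

    The <origin>/<geo>/sitemap.xml layout is an assumption for this sketch.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for geo in sorted(geos):
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{origin}/{geo}/sitemap.xml"
    return ET.tostring(root, encoding="unicode")


# Hypothetical geo list, for illustration only.
index_xml = build_sitemap_index("https://www.adobe.com", ["uk", "ca"])
print(index_xml)
```

An index like this (or the individual per-geo files) could then be referenced from robots.txt via `Sitemap:` lines.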

Current sitemap URLs on Adobe.com:
https://www.stage.adobe.com/content/acom/us/en.sitemap.html?allowfullpath=true
http://www.stage.adobe.com/content/acom/us/en.sitemap.xml?allowfullpath=true

Need to validate that we can generate regular XML Sitemap files with AEM that are designed to improve site indexation.

Requirements:
Auto-generation of the XML sitemap files
One sitemap file per geo - www.adobe.com/ca/sitemap.xml, www.adobe.com/uk/sitemap.xml
Auto-publish new URLs into a sitemap file (within approx. 1 hour, the cache refresh time).
Auto-removal of non-canonical URLs (3XX and 404 URLs should be wiped out from XML sitemaps).
Provide a way for authors to override the page URL value that should show up in the sitemap instead of an absolute path,
e.g. the URL that should show up for the home page should be www.env.adobe.com instead of www.env.adobe.com/index.html.
For such pages, the current workaround is to modify the XML manually; it would be nice to have a field in which authors can specify the URL that should appear in the XML.
Separate implementations of the "Remove from sitemap" checkbox for HTML and XML sitemaps
Automatically exclude non-HTML: https://www.adobe.com/1, https://www.adobe.com/1/creative-2015-07-20-mascha
Enforce removal of pages with meta robots noindex, e.g. http://www.adobe.com/confirmation.html, https://www.adobe.com/search.html
Include rewrite paths per the canonical tag, e.g. http://www.adobe.com/leaders.html (per canonical), not http://www.adobe.com/about-adobe/leaders.html (the actual resolving URL)
List URLs in alphabetical order
Possible to verify DNS to exclude pages that 404 on live site? https://www.adobe.com/qa_test_020.html
Sitemap generated should be http but also be able to be generated on https.
Sitemap generated should also take floodgated content that's available for visitors into account.
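
The filtering rules in the requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Page` record and all of its field names are hypothetical, and the HTTP status is taken as an input rather than checked against the live site.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    """Hypothetical page record; every field name is an assumption."""
    url: str                            # actual resolving URL
    status: int = 200                   # HTTP status on the live site
    content_type: str = "text/html"
    robots: str = ""                    # content of the meta robots tag
    canonical: Optional[str] = None     # URL from the canonical tag
    sitemap_url: Optional[str] = None   # author-provided override
    exclude: bool = False               # "Remove from sitemap" checkbox


def sitemap_urls(pages: list[Page]) -> list[str]:
    """Apply the exclusion rules and return URLs in alphabetical order."""
    urls = set()
    for page in pages:
        if page.exclude or page.status != 200:
            continue  # author opt-out, 3XX redirects, 404s
        if not page.content_type.startswith("text/html"):
            continue  # non-HTML resources
        if "noindex" in page.robots.lower():
            continue  # meta robots noindex
        # Author override wins, then the canonical tag, then the real URL.
        urls.add(page.sitemap_url or page.canonical or page.url)
    return sorted(urls)


def build_sitemap(pages: list[Page]) -> str:
    """Serialize the filtered URLs as a sitemaps.org <urlset>."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sitemap_urls(pages):
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


# Example input modelled on the URLs mentioned in the requirements.
pages = [
    Page("https://www.adobe.com/index.html",
         sitemap_url="https://www.adobe.com/"),
    Page("https://www.adobe.com/about-adobe/leaders.html",
         canonical="https://www.adobe.com/leaders.html"),
    Page("https://www.adobe.com/confirmation.html", robots="noindex"),
    Page("https://www.adobe.com/qa_test_020.html", status=404),
]
print(build_sitemap(pages))
```

Here only the home page override and the canonical `leaders.html` URL end up in the output; the noindex page and the 404 page are dropped.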

Acceptance Criteria:
Sitemaps are generated / refreshed on the fly, when the page is accessed.
verify new pages get added to author sitemaps when created
verify pages get updated on author sitemaps when moved or renamed
verify timestamps get properly updated in the XML sitemap when a page is updated/activated (author & publish)
verify pages are added to publish sitemaps when activated and cache flushed
verify name updates make it to publish sitemaps
verify deactivated pages are removed from publish sitemaps
Pages can be excluded from the sitemap via page property (or a similar place in helix)
verify new pages are included in sitemaps by default
verify that authors can remove pages from sitemap
verify that page can be de-activated, and removed from publish sitemap
verify child pages are also removed from sitemap (config is inherited; overriding inheritance was NOT tested)
verify that fragments folder can be excluded from the sitemap at the folder level, and properly inherited
verify that the config is available on the FW, Lobby, Lobby tab, and fragment templates
verify the HTML sitemap has meta: (???)
Verify that the sitemaps are Floodgate-aware and also consider floodgated content that's visible to end users.
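
The inherited exclusion described in the criteria above (a page or folder is excluded, and everything beneath it follows) could be checked with something like the sketch below. The helper name `is_excluded` and the example paths are hypothetical, and overriding the inherited setting on a child is deliberately not modelled.

```python
from pathlib import PurePosixPath


def is_excluded(path: str, excluded_roots: set[str]) -> bool:
    """Return True if the path, or any of its ancestors, is excluded.

    The match is on whole path segments, so excluding /fragments
    does not affect a sibling such as /fragments-archive.
    """
    page = PurePosixPath(path)
    return any(str(p) in excluded_roots for p in (page, *page.parents))


# Hypothetical folder-level exclusion, for illustration only.
excluded = {"/fragments"}
print(is_excluded("/fragments/nav/header", excluded))  # child inherits
print(is_excluded("/products/photoshop", excluded))    # unaffected
```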

audi5 added the enhancement label Sep 29, 2021
Contributor

rofe commented Sep 29, 2021

@dominique-pfister do you see anything missing in the current implementation?

audi5 changed the title from "SEO: Automated XML sitemap generation" to "SEO: Automated XML & HTML sitemap generation" Sep 29, 2021
Contributor

dominique-pfister commented Sep 30, 2021

> @dominique-pfister do you see anything missing in the current implementation?

Looking at the list of Requirements above, most are already built-in or can be done by using a separate helix-sitemap sheet in the index. The following, though, are not available:

  • Possible to verify DNS to exclude pages that 404 on live site? https://www.adobe.com/qa_test_020.html
  • Sitemap generated should be http but also be able to be generated on https.
  • Sitemap generated should also take floodgated content that's available for visitors into account.

And we don't generate HTML sitemaps (yet)

Contributor

rofe commented Oct 1, 2021

> Sitemap generated should be http

I don't think we should do anything other than https. It's 2021 😉

3 participants