Case study

How GDS used data to improve content and user journeys on GOV.UK

The data science team and developers from GDS used machine learning, network analysis and semantic vectors to improve content and user journeys on GOV.UK.

Summary

Project timing - May to August 2019

Team - Data science team and developers from the Government Digital Service (GDS)

Tools - machine learning, network analysis and creating semantic vectors using Google's universal sentence encoder

Objective

The GDS team working on this project wanted to improve navigation on GOV.UK so it was easier for users to find what they needed. The team aimed to:

  • reduce duplicate or overlapping pieces of guidance
  • group related content with links

To do this the team used data to:

  • identify similar or duplicate GOV.UK content
  • use an automated process to generate related content links for most GOV.UK pages

Background

GOV.UK is the main website for government. It enables users to interact with government and find services, guidance and news from all government departments. There are over 400,000 unique pieces of content on GOV.UK, and statistics showed that, over a 6-month period, 700 pieces of content were published per week.

This amount of content, and high publishing rate, leads to duplicate content and makes it hard for users to find what they need.

GOV.UK provides different types of navigational help, including:

  • a search function
  • pages that show a list of nested web pages
  • related links, which are shown to the right-hand side of content on a page

Before this project, only around 2% of GOV.UK's content (about 8,000 pages) had related links.

Investigating and trying solutions

To improve users' experience, we investigated:

  • semantic vectors to group content and find similar or duplicate entries
  • network analysis to align how GOV.UK pages are connected with hyperlinks (the structural network) with the way users travel between pages (the user-generated functional network)
  • machine learning to auto-generate links on GOV.UK pages

Semantic vectors

To reduce the amount of similar or duplicate content on GOV.UK we first needed to find this content. We wrote a script to plot the content with similarities and duplications.

After testing various tools we chose Google's universal sentence encoder to turn content into semantic vectors. By using the universal sentence encoder we could tell publishers where their content overlapped with pre-existing content. Publishers could use the information at the publication stage to prevent duplicate content going live, or retrospectively to clean up existing duplicate content.
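
As an illustration of how overlapping content can be found once pages are turned into vectors, here is a minimal sketch. It is not the team's code. It assumes cosine similarity as the comparison measure, and the page paths, the toy 3-dimensional vectors and the 0.9 threshold are invented for the example. Real sentence-encoder vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_overlaps(vectors, threshold=0.9):
    """Return (page_a, page_b, similarity) for pairs at or above threshold."""
    pages = sorted(vectors)
    overlaps = []
    for i, page_a in enumerate(pages):
        for page_b in pages[i + 1:]:
            score = cosine_similarity(vectors[page_a], vectors[page_b])
            if score >= threshold:
                overlaps.append((page_a, page_b, score))
    # Most similar pairs first, so the likeliest duplicates are reviewed first.
    return sorted(overlaps, key=lambda item: item[2], reverse=True)


# Hypothetical pages with toy vectors standing in for real embeddings.
example_vectors = {
    "/apply-passport": [0.9, 0.1, 0.0],
    "/renew-passport": [0.85, 0.15, 0.05],
    "/register-to-vote": [0.0, 0.2, 0.95],
}
overlaps = find_overlaps(example_vectors, threshold=0.9)
for page_a, page_b, score in overlaps:
    print(f"{page_a} overlaps with {page_b} (similarity {score:.2f})")
```

Comparing every pair of pages is fine for a small example, but for hundreds of thousands of pages an approximate nearest-neighbour index would usually be needed.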

We could also use semantic vectors to make decisions on changes to GOV.UK's taxonomy. The taxonomy is structured into topics and publishers tag content with the most appropriate topics.

Network analysis

We wrote code to discover and assess:

  • whether all content areas of GOV.UK are accessible and have links
  • individual pages' position and significance within the overall network

The team identified hub pages which have more connections than others. To investigate these, we calculated network properties, including:

  • network density
  • connectedness
  • link distribution
  • centrality measures
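
The properties listed above can be sketched in a few lines of plain Python. This is a hedged illustration, not the team's code: the page paths and links are invented, in-degree centrality stands in for the wider family of centrality measures, and the graph is assumed to have no self-links.

```python
from collections import deque


def build_graph(links):
    """Build a directed adjacency map from (source, target) pairs."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    return graph


def density(graph):
    """Share of all possible directed links that actually exist."""
    n = len(graph)
    if n < 2:
        return 0.0
    edges = sum(len(targets) for targets in graph.values())
    return edges / (n * (n - 1))


def in_degree_centrality(graph):
    """Incoming links per page, divided by the number of other pages."""
    n = len(graph)
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    if n < 2:
        return {page: 0.0 for page in graph}
    return {page: count / (n - 1) for page, count in counts.items()}


def is_weakly_connected(graph):
    """True if every page can be reached when link direction is ignored."""
    if not graph:
        return True
    undirected = {page: set(targets) for page, targets in graph.items()}
    for page, targets in graph.items():
        for target in targets:
            undirected[target].add(page)
    start = next(iter(undirected))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbour in undirected[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(undirected)


# Hypothetical hyperlinks between four pages.
example_links = [
    ("/browse", "/tax"),
    ("/browse", "/benefits"),
    ("/tax", "/self-assessment"),
    ("/benefits", "/tax"),
]
graph = build_graph(example_links)
centrality = in_degree_centrality(graph)
hub = max(centrality, key=centrality.get)
print(f"density={density(graph):.2f}, connected={is_weakly_connected(graph)}, hub={hub}")
```

Pages with unusually high centrality are candidates for the hub pages described above.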

The team created a tool using Python to automatically extract user journeys from BigQuery, the database that stores our Google Analytics data. We aggregated these journeys for a specific time period to learn how users interact with the site. The result of these journeys was a functional network, or map of connected content.
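
To show what aggregating journeys into a functional network can look like, here is a minimal sketch. It assumes each journey has already been extracted as an ordered list of page paths. The journeys below are invented and hard-coded, and nothing here queries BigQuery.

```python
from collections import Counter


def build_functional_network(journeys):
    """Count page-to-page transitions across all journeys.

    Returns a Counter mapping (source, target) to the number of times
    users moved directly from source to target.
    """
    edge_weights = Counter()
    for journey in journeys:
        for source, target in zip(journey, journey[1:]):
            if source != target:  # ignore refreshes of the same page
                edge_weights[(source, target)] += 1
    return edge_weights


# Hypothetical journeys for one time period.
example_journeys = [
    ["/browse", "/tax", "/self-assessment"],
    ["/browse", "/tax", "/tax", "/self-assessment"],
    ["/benefits", "/tax"],
]
network = build_functional_network(example_journeys)
for (source, target), count in network.most_common():
    print(f"{source} -> {target}: {count}")
```

The edge weights record how often users actually move between two pages, which is what makes this network comparable with the hyperlink structure.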

We compared the functional network to how GOV.UK was structured, and used the results to inform changes to that structure and improve navigation for users.

Machine learning and A/B testing

When publishers create a new piece of content they add links to other related pages. Navigational links that facilitate browsing are automatically linked to the new content item. This creates GOV.UK's structural network, which consists of approximately 350,000 links. We wrote code to check the structure of GOV.UK content and automatically generate related content links.

After considering different algorithms, we decided to implement the node2vec algorithm to create links and display them on the right-hand side of GOV.UK pages. These would help people find the content they needed more easily.
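
The link generation relies on node2vec, which learns a vector for each page from biased random walks over the network. The sketch below covers only the walk step, under stated assumptions: the graph is a small invented undirected adjacency map, and the values of the return parameter `p` and in-out parameter `q` are arbitrary. Training vectors from the walks and ranking candidate links are left out, and this is not the team's implementation.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """Generate one biased random walk of up to `length` pages.

    Unnormalised step weights, relative to the previously visited page:
      1/p to step back to it, 1 to move to one of its neighbours,
      1/q to move to a page further away from it.
    """
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(graph[current])
        if not neighbours:
            break  # dead end, stop early
        if len(walk) == 1:
            # No previous page yet, so the first step is uniform.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)
            elif candidate in graph[previous]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected network: each link is listed in both directions.
example_graph = {
    "/browse": {"/tax", "/benefits"},
    "/tax": {"/browse", "/benefits", "/self-assessment"},
    "/benefits": {"/browse", "/tax"},
    "/self-assessment": {"/tax"},
}
walk = node2vec_walk(example_graph, "/browse", length=6, p=2.0, q=0.5,
                     rng=random.Random(42))
print(" -> ".join(walk))
```

A low `q` pushes walks outwards across the network, while a low `p` keeps them close to where they started. Many such walks are then treated like sentences when training the page vectors.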

We tested the user journeys through the links using 2 null hypotheses:

  1. There is no significant difference in the proportion of journeys using at least one related link. If this is true, then people are not clicking our links, probably because they do not find them useful or interesting.

  2. There is no significant difference in the proportion of journeys that use internal search. If this is true, then we did not provide people with relevant links to click on so they still use internal search to navigate.

We used an A/B test to randomly assign users to one of two possible versions of the site. Half of the users were directed to the control site (A), where only a small percentage of pages had links, which had been added by publishers. The other half were directed to the version of the site (B) with the algorithmically generated related links. To speed up our analysis, we wrote code that allowed us to run routine analysis as soon as each experiment completed.

The results showed that we could reject both of our null hypotheses. Rejecting the first implied that users found the related links interesting or relevant enough to click them. User journeys showed an improvement for more than 10,000 users per day. Rejecting the second implied a potential reduction in internal search of about 20% to 40%.
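
One common way to test a null hypothesis of this kind is a two-proportion z-test. The sketch below assumes that test for illustration only, as the case study does not say which statistical test was used. The journey counts are invented and are not the project's results.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Under H0 both groups share one proportion, estimated by pooling.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: journeys using at least one related link in each variant.
z, p_value = two_proportion_z_test(
    successes_a=200, total_a=10_000,   # control site (A)
    successes_b=500, total_b=10_000,   # generated links (B)
)
if p_value < 0.05:
    print(f"z={z:.2f}: reject the null hypothesis at the 5% level")
else:
    print(f"z={z:.2f}: no significant difference detected")
```

The same function could be applied to the second hypothesis by counting journeys that use internal search instead.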

Applying our solutions

We used Jupyter notebooks for rapid development and prototyping, which allowed us to write and document code side by side.

To automate the process of generating related links, we decided to modularise the code from the Jupyter notebooks and write it in object-oriented Python. By using a combination of user journey information and the structure of GOV.UK we could keep adapting the generated links.

To maintain a high quality of related links on 188体育, we added a number of steps to our process, including:

  • excluding a number of pages that should not have links coming from or going to them
  • applying a confidence threshold to the links generated by node2vec which only allowed through the most relevant links
  • generating a spreadsheet of the top 200 most popular pages so that our content designers could check the generated links and make changes where necessary
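
The first two quality steps above can be sketched as a simple filter. This is an illustration under assumptions, not the production code: the candidate links, scores, excluded page and threshold value are invented, and the `max_links` cap per page is an extra parameter added for the example.

```python
def filter_links(candidates, excluded_pages, threshold, max_links=5):
    """Keep the strongest links per source page that pass the quality checks.

    candidates: iterable of (source, target, score) tuples, where a higher
    score means the model is more confident the pages are related.
    """
    kept = {}
    ranked = sorted(candidates, key=lambda item: item[2], reverse=True)
    for source, target, score in ranked:
        # Skip pages that should not have links coming from or going to them.
        if source in excluded_pages or target in excluded_pages:
            continue
        # Skip low-confidence links and links from a page to itself.
        if score < threshold or source == target:
            continue
        links = kept.setdefault(source, [])
        if len(links) < max_links and target not in links:
            links.append(target)
    return kept


# Hypothetical candidate links with confidence scores.
example_candidates = [
    ("/tax", "/self-assessment", 0.92),
    ("/tax", "/benefits", 0.41),
    ("/tax", "/help/cookies", 0.95),
    ("/benefits", "/tax", 0.80),
]
links = filter_links(
    example_candidates,
    excluded_pages={"/help/cookies"},
    threshold=0.5,
)
for source, targets in links.items():
    print(source, "->", targets)
```

Links that survive the filter could then be exported for the manual check on the most popular pages.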

Results

We've been running the process to generate links every 3 weeks. These links are automatically displayed on GOV.UK when publishers have not set their own links, which ensures that users have a way to find the content they're looking for.

We're continuously iterating and refining the related links process, and monitoring results to make sure users' experiences are improving. In the future, we aim to bring the link generation online and use it as part of the publishing process. All of our code is publicly available.

Automatically generated links will not replace hand-curated links as they will not have the same considered context as those created by a subject matter expert.

Prior to this work, the vast majority of GOV.UK pages did not have curated recommended links. We automated that process, improving thousands of user journeys per day.

Updates to this page

Published 20 December 2019