How to scrape the top 5 headlines from BBC using Puppeteer & Node.js

Ildana Ruzybayeva
2 min read · Aug 3, 2020

Web scraping is a method of extracting data from websites. In the JavaScript world, Puppeteer is one of the most commonly used tools for this kind of task. In this tutorial, we will use Puppeteer to scrape the top 5 headlines from BBC, one of the world’s leading news websites. All in less than 35 lines of code.

Let’s start by creating a folder and a server.js file. Then we can initialize the project and install the only 2 dependencies we need: express and puppeteer.

npm init -y
npm i express puppeteer

Set up our server from the Express boilerplate. Then go to http://localhost:3000/ and make sure you see the “Hello World!” message.
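For reference, here is a minimal sketch of what such a starter could look like in server.js. It is not necessarily the exact boilerplate used in this tutorial: the `createApp` helper, the `--serve` flag and the lazily evaluated `require` in the default parameter are assumptions added so the file can be loaded without opening a port.

```javascript
// server.js
const PORT = 3000;

// Builds the Express app. `express` is only required when no stand-in is
// passed, which is an illustrative choice and not part of the Express API.
function createApp(express = require('express')) {
  const app = express();
  // Answer requests to the root path with a plain-text greeting.
  app.get('/', (req, res) => res.send('Hello World!'));
  return app;
}

// Only start listening when asked to: node server.js --serve
if (process.argv.includes('--serve')) {
  createApp().listen(PORT, () => {
    console.log(`Listening at http://localhost:${PORT}`);
  });
}
```

Running `node server.js --serve` and opening http://localhost:3000/ should then show the greeting.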

Since every Puppeteer call is asynchronous and takes a while to complete, we will first wrap our code in an async function called getData, which accepts our target URL.

Now we have a six-step process that we want Puppeteer to go through within our getData function:

1. Launch the browser

2. Open a new tab

3. Go to our URL

4. Scrape

5. Send data to us

6. Close

Steps one to three are very straightforward:
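A minimal sketch of those three steps, using Puppeteer's `launch`, `newPage` and `goto` calls. The optional second parameter is an assumption added for illustration, so the function can be tried with a stand-in instead of a real browser. Returning the browser and page is also just a placeholder until steps 4 to 6 are filled in.

```javascript
// `puppeteer` defaults to the installed package; the parameter itself is hypothetical.
async function getData(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch(); // 1. launch the (headless) browser
  const page = await browser.newPage();     // 2. open a new tab
  await page.goto(url);                     // 3. go to our URL

  // Steps 4 to 6 (scrape, send the data back, close) are added later.
  // Until then the browser stays open, so the caller has to close it.
  return { browser, page };
}
```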

Step 4 is where most of the work lies. A Puppeteer page has a method called evaluate, which runs a callback function in the context of the page, so inside it you can specify which DOM elements you want the data from. After inspecting the BBC page, we find that the articles are simply list items with the class name ‘media-list__item’. Selecting all of them returns a node list, which we want to transform into an array and save in a variable like this:
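As a rough sketch, the selection could look like this. The function names `countNewsItems` and `countItemsOnPage` are made up for illustration. Note that whatever the callback returns has to be serializable, so this sketch sends back the number of items rather than the DOM nodes themselves.

```javascript
// Runs inside the browser page, where `document` is available.
function countNewsItems() {
  // querySelectorAll returns a NodeList; Array.from turns it into a real array.
  const listOfAllNews = Array.from(document.querySelectorAll('.media-list__item'));
  return listOfAllNews.length;
}

// `page` is the Puppeteer page opened in steps 1 to 3.
async function countItemsOnPage(page) {
  return page.evaluate(countNewsItems);
}
```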

Now in order to access the actual titles, we need to dig deeper within each element of the array in the following way:
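The exact path to the title depends on the page markup. The sketch here assumes each list item contains an element with the class `media__title`; that selector is an assumption and should be checked against the live page in the browser's inspector. `getTitle` is a hypothetical helper name.

```javascript
// Returns the trimmed headline text of one list item, or null if none is found.
function getTitle(item) {
  const titleElement = item.querySelector('.media__title'); // assumed selector
  return titleElement ? titleElement.textContent.trim() : null;
}
```

Inside the evaluate callback this would be applied to each element of the array, for example `getTitle(listOfAllNews[0])`.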

Next, we want to loop through the listOfAllNews array and push as many titles as we wish into a new array. We should also make sure we do not push titles that are already there. The whole code for step 4 looks like this:
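One way to sketch that loop is shown here as a plain function, so it can be read on its own; in the real scraper the same logic would sit inside the evaluate callback. The name `collectTitles`, the `limit` parameter and the `.media__title` selector are assumptions.

```javascript
// Collects up to `limit` unique, non-empty titles from the list items.
function collectTitles(listOfAllNews, limit = 5) {
  const titles = [];
  for (const item of listOfAllNews) {
    if (titles.length >= limit) break;                 // we already have enough
    const el = item.querySelector('.media__title');    // assumed selector
    const title = el ? el.textContent.trim() : '';
    // Skip items without a title and titles we have already stored.
    if (title && !titles.includes(title)) titles.push(title);
  }
  return titles;
}
```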

Finally, add steps 5 and 6 inside our getData function and you should be all set. The final code looks like this:
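Putting the six steps together, a sketch of the complete server.js might look as follows. It runs a little longer than the 35 lines mentioned earlier because of the comments and error handling. The target URL, the `.media__title` selector, the `createApp` helper, the `--serve` flag and the optional stand-in parameters are all assumptions made for this illustration.

```javascript
// server.js
const TARGET_URL = 'https://www.bbc.com'; // assumed target URL
const PORT = 3000;

// Step 4. Runs inside the page, so it may only use browser globals and its arguments.
function scrapeTitles(limit) {
  const listOfAllNews = Array.from(document.querySelectorAll('.media-list__item'));
  const titles = [];
  for (const item of listOfAllNews) {
    if (titles.length >= limit) break;
    const el = item.querySelector('.media__title'); // assumed selector
    const title = el ? el.textContent.trim() : '';
    if (title && !titles.includes(title)) titles.push(title);
  }
  return titles;
}

async function getData(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();      // 1. launch the browser
  try {
    const page = await browser.newPage();        // 2. open a new tab
    await page.goto(url);                        // 3. go to our URL
    return await page.evaluate(scrapeTitles, 5); // 4 + 5. scrape and send data to us
  } finally {
    await browser.close();                       // 6. close, even if scraping failed
  }
}

function createApp(express = require('express'), scrape = getData) {
  const app = express();
  app.get('/', async (req, res) => {
    try {
      res.json(await scrape(TARGET_URL));
    } catch (err) {
      res.status(500).send('Scraping failed: ' + err.message);
    }
  });
  return app;
}

// Only start listening when asked to: node server.js --serve
if (process.argv.includes('--serve')) {
  createApp().listen(PORT, () => console.log(`Listening at http://localhost:${PORT}`));
}
```

Closing the browser in a `finally` block is a design choice of this sketch: it keeps a failed scrape from leaving a Chromium process running.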

Now when we go to http://localhost:3000/ we should get an array of 5 strings, which are our titles.

Hope you found this useful. You can browse through the whole code in this repo and read more about Puppeteer here.

Cheers,

Ildana

Ildana Ruzybayeva

Central Asian soul living & reflecting on my life in Sweden. Full-time JavaScript developer & Machine Learning enthusiast.