| 1 | +# Extracting Data with the `extract` Method |
| 2 | + |
| 3 | +The `extract` method in Cheerio allows you to extract data from an HTML document |
| 4 | +and store it in an object. The method takes a `map` object as a parameter, where |
| 5 | +the keys are the names of the properties to be created on the object, and the |
| 6 | +values are the selectors or descriptors to be used to extract the values. |
| 7 | + |
| 8 | +To use the `extract` method, you first need to import the library and load an |
| 9 | +HTML document. For example: |
| 10 | + |
| 11 | +```js |
| 12 | +import * as cheerio from 'cheerio'; |
| 13 | + |
| 14 | +const $ = cheerio.load(` |
| 15 | +<ul> |
| 16 | + <li>One</li> |
| 17 | + <li>Two</li> |
| 18 | + <li class="blue sel">Three</li> |
| 19 | + <li class="red">Four</li> |
| 20 | +</ul>`); |
| 21 | +``` |
| 22 | + |
| 23 | +Once you have loaded the document, you can use the `extract` method on the |
| 24 | +loaded object to extract data from the document. |
| 25 | + |
| 26 | +Here are some examples of how to use the `extract` method: |
| 27 | + |
| 28 | +```js |
| 29 | +// Extract the text content of the first .red element |
| 30 | +const data = $.extract({ |
| 31 | + red: '.red', |
| 32 | +}); |
| 33 | +``` |
| 34 | + |
| 35 | +This will return an object with a `red` property, whose value is the text |
| 36 | +content of the first `.red` element. |
| 37 | + |
| 38 | +To extract the text content of all `.red` elements, you can wrap the selector in |
| 39 | +an array: |
| 40 | + |
| 41 | +```js |
| 42 | +// Extract the text content of all .red elements |
| 43 | +const data = $.extract({ |
| 44 | + red: ['.red'], |
| 45 | +}); |
| 46 | +``` |
| 47 | + |
| 48 | +This will return an object with a `red` property, whose value is an array of the |
| 49 | +text content of all `.red` elements. |
| 50 | + |
| 51 | +To be more specific about what you'd like to extract, you can pass an object |
| 52 | +with a `selector` and a `value` property. For example, to extract the text |
| 53 | +content of the first `.red` element and the `href` attribute of the first `a` |
| 54 | +element: |
| 55 | + |
| 56 | +```js |
| 57 | +const data = $.extract({ |
| 58 | + red: '.red', |
| 59 | + links: { |
| 60 | + selector: 'a', |
| 61 | + value: 'href', |
| 62 | + }, |
| 63 | +}); |
| 64 | +``` |
| 65 | + |
| 66 | +The `value` property specifies the name of the property to extract from the |
| 67 | +selected element. In this case, we are extracting the `href` property of the |
| 68 | +first `a` element. This uses Cheerio's |
| 69 | +[`prop` method](/docs/api/classes/Cheerio#prop) under the hood. |
| 70 | + |
| 71 | +`value` defaults to `textContent`, which extracts the text content of the |
| 72 | +element. |
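
The forms shown so far (a bare selector, a selector wrapped in an array, and an object with `selector` and `value`) can be read as variations of a single descriptor shape. The sketch below models that reading in plain JavaScript. `normalizeDescriptor` is a hypothetical helper written for this illustration; it is not part of Cheerio and is not claimed to match Cheerio's internals.

```javascript
// Hypothetical helper: expands the shorthand forms into one explicit shape.
// It models the rules described above; it is not Cheerio's own code.
function normalizeDescriptor(descriptor) {
  // An array wrapper means "all matches" instead of "first match only".
  const all = Array.isArray(descriptor);
  const inner = all ? descriptor[0] : descriptor;
  // A bare string is a selector; an object carries `selector` and `value`.
  // When `value` is omitted, it falls back to 'textContent'.
  const { selector, value = 'textContent' } =
    typeof inner === 'string' ? { selector: inner } : inner;
  return { selector, value, all };
}

const first = normalizeDescriptor('.red');
const every = normalizeDescriptor(['.red']);
const link = normalizeDescriptor({ selector: 'a', value: 'href' });

console.log(first); // { selector: '.red', value: 'textContent', all: false }
console.log(every); // { selector: '.red', value: 'textContent', all: true }
console.log(link); // { selector: 'a', value: 'href', all: false }
```

Seen this way, the array wrapper only changes how many elements are matched, while `value` only changes what is read from each match.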
| 73 | + |
| 74 | +As an attribute with special logic inside the `prop` method, `href`s will be |
| 75 | +resolved relative to the document's URL. The document's URL will be set |
| 76 | +automatically when using `fromURL` to load the document. Otherwise, use the |
| 77 | +`baseURI` option to specify the document's URL. |
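
To get a feel for what resolving an `href` against the document's URL produces, the sketch below applies the standard relative-URL rules using the built-in `URL` class, without involving Cheerio. The `resolveHref` helper, the document URL, and the `href` values are all made up for illustration.

```javascript
// Hypothetical document URL, chosen only for illustration.
const documentUrl = 'https://example.com/docs/intro/';

// Hypothetical helper: resolve an href against a base URL with the
// standard URL class (available globally in Node.js and browsers).
function resolveHref(href, base) {
  return new URL(href, base).href;
}

const relative = resolveHref('../api/', documentUrl);
const rootRelative = resolveHref('/blog', documentUrl);
const absolute = resolveHref('https://cheerio.js.org/', documentUrl);

console.log(relative); // https://example.com/docs/api/
console.log(rootRelative); // https://example.com/blog
console.log(absolute); // https://cheerio.js.org/
```

Relative and root-relative values pick up the document's origin, while an `href` that is already absolute is left as it is.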
| 78 | + |
| 79 | +There are many props available here; have a look at the |
| 80 | +[`prop` method](/docs/api/classes/Cheerio#prop) for details. For example, to |
| 81 | +extract the `outerHTML` of all `.red` elements: |
| 82 | + |
| 83 | +```js |
| 84 | +const data = $.extract({ |
| 85 | + red: [ |
| 86 | + { |
| 87 | + selector: '.red', |
| 88 | + value: 'outerHTML', |
| 89 | + }, |
| 90 | + ], |
| 91 | +}); |
| 92 | +``` |
| 93 | + |
| 94 | +You can also extract data from multiple nested elements by specifying an object |
| 95 | +as the `value`. For example, to extract the text content of all `.red` elements |
| 96 | +and the first `.blue` element in the first `ul` element, and the text content of |
| 97 | +all `.sel` elements in the second `ul` element: |
| 98 | + |
| 99 | +```js |
| 100 | +const data = $.extract({ |
| 101 | + ul1: { |
| 102 | + selector: 'ul:first', |
| 103 | + value: { |
| 104 | + red: ['.red'], |
| 105 | + blue: '.blue', |
| 106 | + }, |
| 107 | + }, |
| 108 | + ul2: { |
| 109 | + selector: 'ul:eq(1)', |
| 110 | + value: { |
| 111 | + sel: ['.sel'], |
| 112 | + }, |
| 113 | + }, |
| 114 | +}); |
| 115 | +``` |
| 116 | + |
| 117 | +This returns an object with `ul1` and `ul2` properties. `ul1` is an object |
| 118 | +with a `red` property (an array of the text content of all `.red` elements in |
| 119 | +the first `ul` element) and a `blue` property (the text content of the first |
| 120 | +`.blue` element there). `ul2` is an object with a `sel` property, an array of |
| 121 | +the text content of all `.sel` elements in the second `ul` element. |
| 122 | + |
| 123 | +Finally, you can pass a function as the `value` property. The function will be |
| 124 | +called with each of the selected elements, and the `key` of the property: |
| 125 | + |
| 126 | +```js |
| 127 | +const data = $.extract({ |
| 128 | + links: [ |
| 129 | + { |
| 130 | + selector: 'a', |
| 131 | + value: (el, key) => { |
| 132 | + const href = $(el).attr('href'); |
| 133 | + return `${key}=${href}`; |
| 134 | + }, |
| 135 | + }, |
| 136 | + ], |
| 137 | +}); |
| 138 | +``` |
| 139 | + |
| 140 | +This will extract the `href` attribute of all `a` elements and return a string |
| 141 | +in the form `links=href_value` for each element, where `href_value` is the value |
| 142 | +of the `href` attribute. The returned object will have a `links` property whose |
| 143 | +value is an array of these strings. |
| 144 | + |
| 145 | +## Putting it all together |
| 146 | + |
| 147 | +Let's fetch the Cheerio releases page on GitHub and extract the name, date, |
| 148 | +and release notes of each release listed there: |
| 149 | + |
| 150 | +```js |
| 151 | +import * as cheerio from 'cheerio'; |
| 152 | + |
| 153 | +const $ = await cheerio.fromURL( |
| 154 | + 'https://github.com/cheeriojs/cheerio/releases' |
| 155 | +); |
| 156 | + |
| 157 | +const data = $.extract({ |
| 158 | + releases: [ |
| 159 | + { |
| 160 | + // First, we select individual release sections. |
| 161 | + selector: 'section', |
| 162 | + // Then, we extract the release date, name, and notes from each section. |
| 163 | + value: { |
| 164 | + // Selectors are executed within the context of the selected element. |
| 165 | + name: 'h2', |
| 166 | + date: { |
| 167 | + selector: 'relative-time', |
| 168 | + // The actual date of the release is stored in the `datetime` attribute. |
| 169 | + value: 'datetime', |
| 170 | + }, |
| 171 | + notes: { |
| 172 | + selector: '.markdown-body', |
| 173 | + // We are looking for the HTML content of the element. |
| 174 | + value: 'innerHTML', |
| 175 | + }, |
| 176 | + }, |
| 177 | + }, |
| 178 | + ], |
| 179 | +}); |
| 180 | +``` |
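
The extracted `data` is a plain object, so it can be processed like any other JavaScript value. As a rough sketch, the code below picks the most recent release from a `releases` array. The entries are placeholders invented for this example, not real Cheerio releases, and the sketch assumes that a selector matching nothing leaves its property `undefined`.

```javascript
// Placeholder data in the shape produced by the extraction above.
// Names, dates, and notes are invented for illustration only.
const data = {
  releases: [
    { name: 'example-release-a', date: '2023-01-15T10:00:00Z', notes: '<p>First</p>' },
    { name: 'example-release-b', date: '2023-03-02T08:30:00Z', notes: '<p>Second</p>' },
    // Assumption: a section where a selector matched nothing has undefined values.
    { name: 'example-release-c', date: undefined, notes: undefined },
  ],
};

const dated = data.releases
  // Skip entries without a usable date string.
  .filter((release) => typeof release.date === 'string')
  // Parse the ISO 8601 timestamp into a Date object.
  .map((release) => ({ ...release, date: new Date(release.date) }))
  // Sort newest first.
  .sort((a, b) => b.date - a.date);

const latest = dated[0];

console.log(latest.name); // example-release-b
console.log(latest.date.toISOString()); // 2023-03-02T08:30:00.000Z
```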