Commit 976b087

Add extract guide
1 parent bf3fdd0 commit 976b087

1 file changed

+180
-0
lines changed

website/docs/basics/06-extract.md

+180
@@ -0,0 +1,180 @@
# Extracting Data with the `extract` Method

The `extract` method in Cheerio allows you to extract data from an HTML document
and store it in an object. The method takes a `map` object as a parameter, where
the keys are the names of the properties to be created on the object, and the
values are the selectors or descriptors to be used to extract the values.

To use the `extract` method, you first need to import the library and load an
HTML document. For example:

```js
import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <ul>
    <li>One</li>
    <li>Two</li>
    <li class="blue sel">Three</li>
    <li class="red">Four</li>
  </ul>`);
```
22+
23+
Once you have loaded the document, you can use the `extract` method on the
24+
loaded object to extract data from the document.
25+
26+
Here are some examples of how to use the `extract` method:
27+
28+
```js
29+
// Extract the text content of the first .red element
30+
const data = $.extract({
31+
red: '.red',
32+
});
33+
```
34+
35+
This will return an object with a `red` property, whose value is the text
36+
content of the first `.red` element.
37+
38+
To extract the text content of all `.red` elements, you can wrap the selector in
39+
an array:
40+
41+
```js
42+
// Extract the text content of all .red elements
43+
const data = $.extract({
44+
red: ['.red'],
45+
});
46+
```
47+
48+
This will return an object with a `red` property, whose value is an array of the
49+
text content of all `.red` elements.

To be more specific about what you'd like to extract, you can pass an object
with a `selector` and a `value` property. For example, to extract the text
content of the first `.red` element and the `href` attribute of the first `a`
element:

```js
const data = $.extract({
  red: '.red',
  links: {
    selector: 'a',
    value: 'href',
  },
});
```

The `value` property specifies the name of the property to extract from the
selected element. In this case, we are extracting the `href` of the first `a`
element. This uses Cheerio's
[`prop` method](/docs/api/classes/Cheerio#prop) under the hood.

`value` defaults to `textContent`, which extracts the text content of the
element.

Because the `prop` method has special logic for `href`, the value will be
resolved relative to the document's URL. The document's URL is set
automatically when using `fromURL` to load the document. Otherwise, use the
`baseURI` option to specify the document's URL.
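
The resolution step can be pictured with the standard `URL` class. This is only a sketch of standard URL resolution and does not call Cheerio; the document URL and the `href` values below are made up for illustration.

```js
// Hypothetical document URL, standing in for the page that was loaded.
const documentUrl = 'https://example.com/docs/basics/';

// Hypothetical `href` values as they might appear in the markup.
const hrefs = ['intro', '/api', '../advanced', 'https://other.example/x'];

// Resolve each value against the document URL.
const resolved = hrefs.map((href) => new URL(href, documentUrl).href);

console.log(resolved);
// [
//   'https://example.com/docs/basics/intro',
//   'https://example.com/api',
//   'https://example.com/docs/advanced',
//   'https://other.example/x'
// ]
```

Relative values are joined onto the document URL, while values that are already absolute come back unchanged.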

There are many props available here; have a look at the
[`prop` method](/docs/api/classes/Cheerio#prop) for details. For example, to
extract the `outerHTML` of all `.red` elements:

```js
const data = $.extract({
  red: [
    {
      selector: '.red',
      value: 'outerHTML',
    },
  ],
});
```

You can also extract data from multiple nested elements by specifying an object
as the `value`. For example, to extract the text content of all `.red` elements
and the first `.blue` element in the first `ul` element, and the text content of
all `.sel` elements in the second `ul` element:

```js
const data = $.extract({
  ul1: {
    selector: 'ul:first',
    value: {
      red: ['.red'],
      blue: '.blue',
    },
  },
  ul2: {
    // `:eq` is zero-based, so `:eq(1)` selects the second `ul`.
    selector: 'ul:eq(1)',
    value: {
      sel: ['.sel'],
    },
  },
});
```

This will return an object with `ul1` and `ul2` properties. The `ul1` property
will be an object with a `red` property, whose value is an array of the text
content of all `.red` elements in the first `ul` element, and a `blue` property,
whose value is the text content of the first `.blue` element in that list.
The `ul2` property will be an object with a `sel` property, whose value is an
array of the text content of all `.sel` elements in the second `ul` element.

Finally, you can pass a function as the `value` property. The function will be
called with each of the selected elements, and the `key` of the property:

```js
const data = $.extract({
  links: [
    {
      selector: 'a',
      value: (el, key) => {
        const href = $(el).attr('href');
        return `${key}=${href}`;
      },
    },
  ],
});
```

This will extract the `href` attribute of all `a` elements and return a string
in the form `links=href_value` for each element, where `href_value` is the value
of the `href` attribute. The returned object will have a `links` property whose
value is an array of these strings.

## Putting it all together

Let's fetch the Cheerio releases page from GitHub and extract the name, date,
and notes of each release listed on it:

```js
import * as cheerio from 'cheerio';

const $ = await cheerio.fromURL(
  'https://github.com/cheeriojs/cheerio/releases'
);

const data = $.extract({
  releases: [
    {
      // First, we select individual release sections.
      selector: 'section',
      // Then, we extract the release date, name, and notes from each section.
      value: {
        // Selectors are executed within the context of the selected element.
        name: 'h2',
        date: {
          selector: 'relative-time',
          // The actual date of the release is stored in the `datetime` attribute.
          value: 'datetime',
        },
        notes: {
          selector: '.markdown-body',
          // We are looking for the HTML content of the element.
          value: 'innerHTML',
        },
      },
    },
  ],
});
```
