Metadata
Average length of listings in the four subcorpora:
- e05p: 43 tokens
- e17p: 49 tokens
- e17x: 177 tokens
- e18v: 97 tokens
The length of the listing varies depending on the category.
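As a rough illustration of how average length per category could be measured, the sketch below counts tokens by whitespace splitting. The tokenisation used for the figures above is not documented here, so whitespace splitting is an assumption; the function name `average_length_by_category` and the toy listings are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def average_length_by_category(listings):
    """Return the mean token count per category.

    `listings` is an iterable of (category, text) pairs. Tokens are
    approximated by whitespace splitting (an assumption, not the
    corpus's documented tokenisation).
    """
    lengths = defaultdict(list)
    for category, text in listings:
        lengths[category].append(len(text.split()))
    return {category: mean(counts) for category, counts in lengths.items()}


# Invented toy listings, not taken from the corpus.
sample = [
    ("maison", "table en bois massif"),
    ("maison", "lampe de chevet"),
    ("loisir", "vélo de course en très bon état"),
]
averages = average_length_by_category(sample)
```

A different tokeniser (for example one that splits off punctuation) would give somewhat higher counts, so results from this sketch are not directly comparable with the averages reported above.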
The XML file records the following metadata for each listing: a unique ID, the year and month in which it was collected, and the category it belongs to.
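The per-listing metadata could be read along the following lines. This is only a sketch: the element name `listing` and the attribute names `id`, `year`, `month` and `category` are hypothetical, as is the sample XML, and they should be adjusted to the actual schema of the corpus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus file may use different
# element and attribute names.
SAMPLE_XML = """<corpus>
  <listing id="e05p-0001" year="2005" month="03" category="maison">table en bois</listing>
  <listing id="e05p-0002" year="2005" month="04" category="loisir">vélo de course</listing>
</corpus>"""


def read_metadata(xml_text):
    """Collect ID, collection year/month and category for each listing."""
    root = ET.fromstring(xml_text)
    records = []
    for listing in root.iter("listing"):
        records.append({
            "id": listing.get("id"),
            "year": int(listing.get("year")),
            "month": int(listing.get("month")),
            "category": listing.get("category"),
        })
    return records


records = read_metadata(SAMPLE_XML)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.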
Distribution of categories in the first three subcorpora (e05p, e17c, e17p):
Category | Distribution |
---|---|
maison | 41 |
voiture et moto | 21 |
vêtements | 122 |
PC et téléphone | 20 |
enfant | 14 |
collections | 39 |
loisir | 41 |
Additional metadata
Some subcorpora have additional metadata, listed below:
- e18v: number of ratings the user has
- e05p: ‘svo’ – 0/1, indicating whether the listing contains at least one well-formed sentence with a subject, verb and object
- e17p: ‘text’ – Y/N, indicating whether the listing resembles running text with sentences and punctuation
- e17p: the content of each listing is divided into two types, ‘inf’ and ‘ad’ – ‘inf’ covers information that is either copy-pasted or consists of numerical details (e.g. dimensions), ‘ad’ covers everything written by the user