Input
This custom-built automation lets you scrape data from any given website.
All you need to provide is:
- website (e.g. https://www.producthunt.com/)
Note: with the generic scraper, you can use several inputs in one configuration only if they all come from the same website (e.g. https://www.doctolib.fr/dermatologue/paris, https://www.doctolib.fr/dermatologue/saint-germain-en-laye, etc.)
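The same-website rule above can be checked before you save a configuration. The sketch below is only an illustration: same_website is a hypothetical helper, not part of the automation, and it assumes that "same website" means "same hostname".

```python
from urllib.parse import urlparse


def same_website(urls):
    """Return True when every URL shares one hostname.

    Hypothetical helper: assumes "same website" means "same hostname".
    """
    hosts = {urlparse(url).hostname for url in urls}
    # An empty list or a URL without a hostname is treated as invalid.
    return len(hosts) == 1 and None not in hosts


inputs = [
    "https://www.doctolib.fr/dermatologue/paris",
    "https://www.doctolib.fr/dermatologue/saint-germain-en-laye",
]
print(same_website(inputs))
```

Adding https://www.producthunt.com/ to this list would make the check fail, since it points to a different website.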
Parameters
You can extract up to 1000 items per input.
Select the extraction mode that suits you best:
- Single result: select it if you want just one result (e.g. extract one product from a page). Requires neither a Pagination Path nor an Item Anchor Path
- Multiple results: select it if you want several items (e.g. extract data from a search results page). Does not require a Pagination Path but requires an Item Anchor Path
- Multiple results with pagination: select it if you want several items even if the results are paginated. Requires a Pagination Path and an Item Anchor Path
- Multiple results with infinite scrolling: select it if your input has infinite scrolling. The robot will scroll until no new items are found or the requested quantity is reached. Does not require a Pagination Path
- Multiple results with infinite scrolling on button press: select it if your input has infinite scrolling but you need to press a button to load more items. The robot will scroll until no new items are found, the requested quantity is reached, or the "load more" button cannot be found
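The stop conditions of the two infinite scrolling modes can be sketched as a simple loop. This is a simulation under stated assumptions, not the robot's actual code: collect_items and load_next_batch are hypothetical names, and load_next_batch stands in for one scroll or one button press, returning None when the "load more" button cannot be found.

```python
def collect_items(load_next_batch, max_items=1000):
    """Collect items until one of the documented stop conditions is met.

    load_next_batch is a hypothetical callable: it returns the items
    visible after one scroll (or button press), or None when the
    "load more" button cannot be found.
    """
    seen = []
    while len(seen) < max_items:      # stop: the quantity is enough
        batch = load_next_batch()
        if batch is None:             # stop: button cannot be found
            break
        added = 0
        for item in batch:
            if item not in seen:
                seen.append(item)
                added += 1
        if added == 0:                # stop: no new items were found
            break
    return seen[:max_items]


# Simulated page: the third scroll brings nothing new.
batches = iter([["a", "b"], ["b", "c"], ["c"]])
result = collect_items(lambda: next(batches, None), max_items=10)
print(result)  # ['a', 'b', 'c']
```

The default of 1000 mirrors the documented limit of 1000 items per input.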
Dismiss popup:
This is optional. To dismiss a banner or a popup, enter a CSS selector such as #onetrust-reject-all-handler or [name="accept"]. If there's no popup to dismiss, leave it empty.
Item Anchor Path:
It's mandatory if you want multiple results. Enter a CSS selector that matches each item, such as a class like .card or an element like li.
Configuring Selectors:
- Format: removes every occurrence of the text you enter. If you enter "Company: ", each occurrence of "Company: " in the data is deleted (e.g. "Company: Google" -> "Google")
- Key: the name you want to give to your data (e.g. if "Path" holds a CSS selector that points to a title and you put "Name" in "Key", the column where your data appears will be called "Name"). We recommend key names without spaces, such as company_name, title, company_address, etc.
- List: if the path you submitted matches a series of tags or a list, check this box to extract them all. Your data will appear in one cell, with "," between values
- Path: the CSS selector of the data you want to extract
- Type: where the data is located ("innerText" by default, which extracts the text you see). It can also be "href" to extract a URL or "src" to extract an image URL
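To make the Format and List options concrete, here is a minimal sketch of how they could transform the values matched by a Path. apply_selector is a hypothetical function written for illustration. Returning only the first match when List is unchecked is an assumption, not documented behavior.

```python
def apply_selector(raw_values, fmt="", as_list=False):
    """Apply the Format and List options to already-extracted values.

    Hypothetical helper: raw_values are the strings matched by the Path.
    """
    # Format: delete every occurrence of the entered text.
    cleaned = [value.replace(fmt, "") if fmt else value for value in raw_values]
    if as_list:
        # List: all matches end up in one cell, separated by ",".
        return ",".join(cleaned)
    # Assumption: without List, only the first match is kept.
    return cleaned[0] if cleaned else ""


print(apply_selector(["Company: Google"], fmt="Company: "))  # Google
print(apply_selector(["Paris", "Lyon"], as_list=True))       # Paris,Lyon
```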
Pagination Path:
If you selected "Multiple results with pagination", enter a CSS selector such as button[class*="result--more"]. It should point to a "next page" or "load more" button.
Example with Doctolib
As an input, we provide https://www.doctolib.fr/dermatologue/paris.
For the parameters:
Dismiss popup: there is no popup to dismiss, so we leave it empty.
Extraction mode: we want to get multiple results even if there is pagination. We're choosing "multiple results with pagination".
Item Anchor Path: we open the browser's "inspect" option on our webpage and pick the CSS selector of the element that contains each item we want to extract: .dl-search-result
Selectors:
We want to scrape the practitioner's name:
Key: name
The CSS selector of the data we want to extract:
Path: .dl-search-result-title
We want the visible text, so we keep the default type, innerText:
Type: innerText
These selectors allowed us to extract all the practitioners' names present on the results page.
Do the same for each piece of information you want to extract.
Pagination Path: you need to enter the CSS selector that points to the next page of results. In our case it's the "Suivant" button, so we select the appropriate selector: body.profiles.index .next-previous-links .next
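For reference, the full Doctolib setup can be summarized as a single structure. The values come from the example above, but the field names (inputs, dismiss_popup, extraction_mode, and so on) are hypothetical labels chosen for this sketch; the automation's own form may name or store them differently.

```python
# Hypothetical summary of the Doctolib example; field names are illustrative.
doctolib_config = {
    "inputs": ["https://www.doctolib.fr/dermatologue/paris"],
    "dismiss_popup": "",  # no popup to dismiss, so left empty
    "extraction_mode": "Multiple results with pagination",
    "item_anchor_path": ".dl-search-result",
    "selectors": [
        {"key": "name", "path": ".dl-search-result-title", "type": "innerText"},
    ],
    "pagination_path": "body.profiles.index .next-previous-links .next",
}

# Each selector key becomes a column name in the output.
columns = [selector["key"] for selector in doctolib_config["selectors"]]
print(columns)  # ['name']
```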