geotribu_scraper.pipelines module

Custom pipelines.

See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

class geotribu_scraper.pipelines.CustomImagesPipeline(store_uri, download_func=None, settings=None)[source]

Bases: ImagesPipeline

Customize how images are downloaded. Stores images into a subfolder named full under the path defined by the IMAGES_STORE setting. Inherits from ImagesPipeline, the generic images pipeline from Scrapy. See: <https://doc.scrapy.org/en/latest/topics/media-pipeline.html?#using-the-images-pipeline>
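To use a pipeline like this one, it has to be registered in the project's settings.py. A minimal sketch, assuming the dotted path above; ITEM_PIPELINES and IMAGES_STORE are standard Scrapy settings, and the priority value is an arbitrary example:

```python
# settings.py (sketch) -- enable the custom images pipeline.
# The priority value (1) is illustrative; lower runs earlier.
ITEM_PIPELINES = {
    "geotribu_scraper.pipelines.CustomImagesPipeline": 1,
}

# Downloaded images end up under <IMAGES_STORE>/full/, as described above.
IMAGES_STORE = "./images"
```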

file_path(request, response=None, info=None)[source]

Output image path.

Parameters
  • request (Request) – request for the image being downloaded

  • response (Response) – response of the image download. Defaults to None, optional

  • info (SpiderInfo) – pipeline info about the running spider. Defaults to None, optional

Returns

path and filename

Return type

str
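The returned path follows the convention documented above (a full/ subfolder under IMAGES_STORE). A minimal sketch of what such a path looks like, assuming Scrapy's usual scheme of naming files after the SHA-1 of the request URL; the .jpg suffix is illustrative:

```python
import hashlib

def image_file_path(url: str) -> str:
    """Sketch of the kind of path file_path() returns: a filename
    derived from the SHA-1 hash of the image URL, stored under the
    'full/' subfolder mentioned above."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"
```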

get_media_requests(item, info)[source]

Build image download requests from the item, passing the item metadata along in the request meta.

Parameters
  • item (Item) – scraped item whose images should be downloaded

  • info (SpiderInfo) – pipeline info about the running spider
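The pattern behind this override can be sketched as follows. This is an assumption about the item layout (an image_urls field is the Scrapy convention), with a stand-in class replacing scrapy.Request so the sketch stays self-contained:

```python
# Minimal stand-in for scrapy.Request so the sketch is self-contained;
# the real pipeline yields scrapy.Request objects instead.
class FakeRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

def build_media_requests(item: dict):
    """Yield one download request per image URL found on the item,
    carrying the item itself in the request meta -- the usual way to
    retrieve item metadata later, e.g. in file_path()."""
    for url in item.get("image_urls", []):
        yield FakeRequest(url, meta={"item": item})
```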

thumb_path(request, thumb_id, response=None, info=None)[source]

Output thumbnail path.

Parameters
  • request (Request) – request for the image being downloaded

  • thumb_id (str) – identifier of the thumbnail size

  • response (Response) – response of the image download. Defaults to None, optional

  • info (SpiderInfo) – pipeline info about the running spider. Defaults to None, optional

Returns

path and filename

Return type

str
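As with file_path(), the result can be sketched under the assumption that Scrapy's usual thumbnail layout is kept (thumbs/<size name>/ with SHA-1 derived filenames; the .jpg suffix is illustrative):

```python
import hashlib

def thumbnail_path(url: str, thumb_id: str) -> str:
    """Sketch of a thumb_path() result: thumbnails are stored per
    size name under 'thumbs/', named after the SHA-1 of the URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"thumbs/{thumb_id}/{digest}.jpg"
```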

class geotribu_scraper.pipelines.JsonWriterPipeline[source]

Bases: object

close_spider(spider)[source]

open_spider(spider)[source]

process_item(item, spider)[source]
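These three hooks follow the standard Scrapy pipeline lifecycle. A sketch modelled on the classic Scrapy documentation example (the filename and item layout are assumptions; the real pipeline's behaviour may differ): open a file when the spider starts, write one JSON line per item, close the file at the end.

```python
import json

class JsonLinesWriter:
    """Sketch of a JSON-writing pipeline: one JSON object per line."""

    def open_spider(self, spider):
        # Called once when the spider starts; open the output file.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes; flush and close.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item as one JSON line, then pass it along.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```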

class geotribu_scraper.pipelines.ScrapyCrawlerPipeline[source]

Bases: object

MAPPING_REDIRECTIONS: list = []

static check_url(url)[source]

Return type

bool
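The actual check performed by check_url is not documented here; a hypothetical stand-in that only validates the URL's shape could look like this:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Hypothetical stand-in for check_url(): accepts a URL only if it
    has an http(s) scheme and a network location. The real method may
    apply different (e.g. reachability) checks."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```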

close_spider(spider)[source]

This method is called when the spider is closed.

Parameters

spider (Spider) – the spider that was closed

process_content(in_md_str)[source]

Check image links in the content and try to replace broken paths using a mapping dict (stored in settings).

Parameters

in_md_str (str) – markdown content

Returns

markdown content with image paths replaced

Return type

str
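The replacement idea can be sketched as a dict lookup over the markdown content. The mapping entries and the markdown sample below are made-up examples, not the actual settings:

```python
# Hypothetical mapping of broken legacy paths to their fixed URLs;
# in the real pipeline this dict lives in the Scrapy settings.
BROKEN_PATHS_MAPPING = {
    "/sites/default/files/logo.png": "https://cdn.geotribu.fr/img/logo.png",
}

def fix_image_paths(in_md_str: str) -> str:
    """Replace every known broken image path in the markdown content."""
    for broken, fixed in BROKEN_PATHS_MAPPING.items():
        in_md_str = in_md_str.replace(broken, fixed)
    return in_md_str
```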

process_item(item, spider)[source]

Process each item output by a spider. It performs these steps:

  1. Extract the date, handling different formats

  2. Use it to format the output filename

  3. Convert the content into a markdown file, handling different cases

Parameters
  • item (GeoRdpItem) – output item to process

  • spider (Spider) – Scrapy spider in use

Returns

the item, passed through

Return type

Item
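Steps 1 and 2 above can be sketched as follows. The candidate date formats and the filename pattern are assumptions for illustration, not the pipeline's actual ones:

```python
from datetime import datetime

# Hypothetical list of date formats the scraped content might use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_item_date(raw_date: str) -> datetime:
    """Try each known format in turn until one parses (step 1)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw_date, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw_date}")

def output_filename(raw_date: str, slug: str) -> str:
    """Use the parsed date to build the output filename (step 2)."""
    return f"{parse_item_date(raw_date):%Y-%m-%d}_{slug}.md"
```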

static title_builder(raw_title, append_year_at_end=True, item_date_clean=None)[source]

Handy method to build a clean title.

Parameters
  • raw_title (str) – scraped title

  • append_year_at_end (bool) – option to append the year at the end of the title. Defaults to True, optional

  • item_date_clean (Union[datetime, str]) – cleaned date of the item

Returns

clean title ready to be written into a markdown file

Return type

str
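A minimal sketch of such a helper, assuming whitespace normalization as the cleaning step (the real method's cleaning rules are not documented here):

```python
from datetime import datetime
from typing import Optional, Union

def build_title(raw_title: str, append_year_at_end: bool = True,
                item_date_clean: Optional[Union[datetime, str]] = None) -> str:
    """Sketch of a title_builder-like helper: collapse stray whitespace
    and, optionally, append the item's year at the end."""
    title = " ".join(raw_title.split())
    if append_year_at_end and item_date_clean is not None:
        # Accept either a datetime or an ISO-style date string.
        year = (item_date_clean.year if isinstance(item_date_clean, datetime)
                else str(item_date_clean)[:4])
        title = f"{title} {year}"
    return title
```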

yaml_frontmatter_as_str(author, category, in_date, introduction, legacy_content_node, tags, title)[source]

Build and return YAML FrontMatter.

Parameters
  • author (str) – author name

  • category (str) – content category

  • in_date (datetime) – content date

  • introduction (str) – content introduction

  • legacy_content_node (int) – legacy content node

  • tags (list) – list of content keywords

  • title (str) – content title

Returns

YAML frontmatter ready to be written

Return type

str
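The frontmatter assembly can be sketched by building the YAML block line by line. The exact keys and their order in the real method are assumptions:

```python
from datetime import datetime

def frontmatter_as_str(author: str, category: str, in_date: datetime,
                       introduction: str, legacy_content_node: int,
                       tags: list, title: str) -> str:
    """Sketch of building a YAML FrontMatter block by hand, delimited
    by '---' lines as markdown frontmatter requires."""
    lines = [
        "---",
        f"title: {title}",
        f"authors: {author}",
        f"categories: {category}",
        f"date: {in_date:%Y-%m-%d}",
        f"description: {introduction}",
        f"legacy: {legacy_content_node}",
        "tags:",
        *[f"  - {tag}" for tag in tags],
        "---",
        "",
    ]
    return "\n".join(lines)
```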