Extract structured data with Ptyhon Scrapy

scrapy commands

  • scrapy bench: run a quick benchmarking test for how Scrapy runs on your hardware (memory and disk availability)
    • setup a local HTTP server and crawls it at maximum possible speed
  • scrapy fetch “website_url”: download the html file and write its contents to stdout
  • scrapy settings: gets the config settings for scrapy shows project specific settings if this is called inside a project
  • scrapy view “website_url”: open the url in the browser the way Scrapy will “see” it.
    • dynamic content is not rendered
  • scrapy shell “website_url”: interactive shell where you quickly prototype what you want to debug without building a Spider
    • crawler
    • item
    • request
    • response
    • settings

Scrapy architecture

  • scrapy engine: controls data flow and triggers all events