Beautiful Soup, a Crawler's Power Tool: Traversing the Document
Beautiful Soup is a Python library that extracts data from HTML or XML files. It provides simple methods that take care of the tedious work of navigating, searching and modifying a document. Because it is easy to use, Beautiful Soup will save you a lot of working time.

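Beautiful Soup is published on PyPI under the name beautifulsoup4 and can be installed from the command line, for example with pip install beautifulsoup4. One installation method is enough.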

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, such as lxml and html5lib. If no parser is specified when the BeautifulSoup object is created, Beautiful Soup picks the best parser installed on your machine to parse the document. You can also specify the parser manually. The lxml parser is recommended here: it is powerful, convenient and fast, and it is the only parser Beautiful Soup supports for XML.
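As a minimal sketch of choosing the parser by hand, the second argument of the BeautifulSoup constructor names it. The markup string is made up for illustration, and "html.parser" is used so that the example runs without any third-party parser installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the article.
markup = "<p>Hello, <b>soup</b></p>"

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(markup, "html.parser")
first_bold = soup.b.string
print(first_bold)  # soup

# With lxml installed, the parser name would be "lxml" for HTML
# or "xml" for XML documents.
```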

The lxml parser can also be installed from the command line, for example with pip install lxml. One installation method is enough.

Beautiful Soup is very simple to use. You only need to pass in a file handle or a piece of text to get a parsed document object. With this object, we can do whatever we need to do with the document. Most of the text comes from pages fetched by a crawler, so Beautiful Soup pairs well with the Requests library.
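The two ways of building a document object can be sketched as follows. The markup is invented for illustration, and io.StringIO stands in for a real file opened with open() so that the example needs no file on disk.

```python
import io

from bs4 import BeautifulSoup

# 1. Pass a piece of text directly.
text = "<html><head><title>From text</title></head><body></body></html>"
soup_from_text = BeautifulSoup(text, "html.parser")

# 2. Pass an open file handle; io.StringIO is a stand-in for open("page.html").
handle = io.StringIO("<html><head><title>From file</title></head></html>")
soup_from_file = BeautifulSoup(handle, "html.parser")

print(soup_from_text.title.string)  # From text
print(soup_from_file.title.string)  # From file

# In a crawler, the text would typically be requests.get(url).text,
# where url is whatever page is being fetched (needs network access).
```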

Beautiful Soup transforms a complex HTML document into a complex tree structure. Each node is a Python object, and all objects fall into four types: Tag, NavigableString, BeautifulSoup and Comment.

Tag is an HTML tag, such as div or p. It is also the object we use the most.

NavigableString is the text inside a tag; the name literally means a string that can be navigated.

BeautifulSoup represents the whole content of a document and can be treated like a Tag.

Comment is a special NavigableString, and its output does not include the comment markers.
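A short sketch of the four object types, using a one-line document invented for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- a note --></p>", "html.parser")

tag = soup.p               # the <p> tag
text = tag.contents[0]     # the text inside it
comment = tag.contents[1]  # the HTML comment

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# A Comment is a kind of NavigableString, and its text
# excludes the <!-- and --> markers.
print(isinstance(comment, NavigableString))  # True
print(repr(str(comment)))                    # ' a note '
```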

To keep things moving, let's first define a string of HTML text. All the following examples are based on this text.
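The text could look like the sketch below. This html_doc is a hypothetical stand-in, not the author's original sample; it only contains the tags the article talks about (html, head, title, body, div and p).

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not the author's original text.
html_doc = """<html><head><title>Sample page</title></head>
<body>
<div id="main"><p class="first">One</p>
<p class="second">Two</p></div>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.string)  # Sample page
```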

Tags have two very important attributes, name and attrs. The name is the name of the tag, and attrs holds the attributes of the tag. You can modify both the name and the attributes of a tag. Note that such a modification directly changes the BeautifulSoup object.
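A minimal sketch of reading and changing a tag's name and attributes, on a small document made up for illustration. The last line shows that the change is visible in the soup itself.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = '<div id="main"><p class="first">One</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.p
print(tag.name)   # p
print(tag.attrs)  # {'class': ['first']}

# Rename the tag, overwrite one attribute and add another.
tag.name = "span"
tag["class"] = "renamed"
tag["data-note"] = "added"

# The soup object reflects the modification.
print(soup.div)  # <div id="main"><span class="renamed" data-note="added">One</span></div>
```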

You can get a tag directly with dot-attribute access, such as soup.p, but this only returns the first tag with that name. You can also chain dot attributes to reach a deeper tag.
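For instance, with a hypothetical sample document that contains two p tags, dot access returns only the first one:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not the author's original text.
html_doc = """<html><head><title>Sample page</title></head>
<body>
<div id="main"><p class="first">One</p>
<p class="second">Two</p></div>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Only the first <p> is returned, even though there are two.
print(soup.p)  # <p class="first">One</p>

# Dot attributes can be chained to go deeper.
print(soup.head.title)         # <title>Sample page</title>
print(soup.body.div.p.string)  # One
```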

If you want to get all the tags with a given name, you can use the find_all(tag_name) method.
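A short sketch of find_all on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = '<div><p class="first">One</p>\n<p class="second">Two</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag, in document order.
paragraphs = soup.find_all("p")
print(len(paragraphs))                       # 2
print([p.get_text() for p in paragraphs])    # ['One', 'Two']
```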

The .contents attribute returns a tag's child nodes as a list. This is very useful, because it means a specific child can be accessed by index. We can also iterate over the same child nodes with the .children generator.
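A sketch of both attributes on a made-up fragment. Note that the line break between the two p tags is itself a child node.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = '<div><p class="first">One</p>\n<p class="second">Two</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

# .contents is a list, so children can be reached by index.
print(len(div.contents))  # 3
print(div.contents[0])    # <p class="first">One</p>
print(div.contents[2])    # <p class="second">Two</p>

# .children walks the same nodes lazily.
child_types = [type(child).__name__ for child in div.children]
print(child_types)  # ['Tag', 'NavigableString', 'Tag']
```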

.children only yields the direct child nodes of a tag, not its deeper descendants. For those, use .descendants.
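The difference can be sketched on a made-up fragment: the text inside each p tag is a descendant of div, but not a direct child.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = '<div><p class="first">One</p>\n<p class="second">Two</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

direct = [str(node) for node in soup.div.children]
everything = [str(node) for node in soup.div.descendants]

print(len(direct))      # 3: p, line break, p
print(len(everything))  # 5: p, 'One', line break, p, 'Two'
```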

The .parent attribute returns the parent node of a tag. The parent of title is head, the parent of html is the BeautifulSoup object, and the parent of the BeautifulSoup object is None.
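These three cases can be checked with a small sketch (the document is made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = "<html><head><title>Sample page</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.parent.name)  # head
print(soup.html.parent is soup)  # True: the parent of <html> is the soup object
print(soup.parent)             # None
```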

We can also get all the ancestors of a given tag through .parents.
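A sketch of walking up from a p tag to the top of a made-up document. The last name, [document], is the name Beautiful Soup gives to the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = "<html><body><div><p>One</p></div></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# .parents yields each ancestor in turn, from nearest to farthest.
ancestor_names = [parent.name for parent in soup.p.parents]
print(ancestor_names)  # ['div', 'body', 'html', '[document]']
```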

The next and previous nodes at the same level are available through .next_sibling and .previous_sibling.

You may wonder why next_sibling has to be called twice to reach the next tag. Is there a bug in this attribute? In fact, the first next_sibling of the first p is the line break between the two p tags, and the same rule applies to previous_sibling.
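The line-break behaviour can be sketched on a made-up fragment in which two p tags are separated by a newline:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = '<div><p class="first">One</p>\n<p class="second">Two</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")
first_p, second_p = soup.find_all("p")

# The first sibling is the line break, not the second <p>.
print(repr(str(first_p.next_sibling)))    # '\n'
print(first_p.next_sibling.next_sibling)  # <p class="second">Two</p>

# The same holds in the other direction.
print(repr(str(second_p.previous_sibling)))       # '\n'
print(second_p.previous_sibling.previous_sibling)  # <p class="first">One</p>
```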

We can also iterate over all sibling nodes with the .next_siblings and .previous_siblings attributes. Prefixing each output line makes it easier to see that the first previous_sibling of div is a line break.
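A sketch with prefixed output, on a hypothetical sample document in which div is preceded by a line break inside body:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not the author's original text.
html_doc = """<html><head><title>Sample page</title></head>
<body>
<div id="main"><p class="first">One</p>
<p class="second">Two</p></div>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Everything after the first <p> inside the div.
for sibling in soup.p.next_siblings:
    print("next:", repr(str(sibling)))
# next: '\n'
# next: '<p class="second">Two</p>'

# Everything before the div inside body: just the line break.
for sibling in soup.div.previous_siblings:
    print("previous:", repr(str(sibling)))
# previous: '\n'
```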

The objects parsed immediately after or before a given tag are available through .next_element and .previous_element. Note that these differ from sibling nodes: siblings are child nodes with the same parent, whereas the next or previous element is determined by the order in which the document was parsed.

For example, in our text html_doc, the sibling node of head is body (ignoring line breaks), because they share the parent node html, but the next element of head is title. That is, soup.head.next_sibling is body, while soup.head.next_element is title.

Note also that the next parsed object after title is not the following tag but the text inside the title tag, because HTML is parsed by opening the title tag, then parsing its content, and finally closing the title tag.
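The contrast between siblings and parse order can be sketched as follows. The document is made up for illustration and written on one line, so that no line breaks sit between head and body.

```python
from bs4 import BeautifulSoup

# Illustrative markup without line breaks, not the author's original sample.
html_doc = "<html><head><title>Sample page</title></head><body><p>One</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Same parent (html), so body is the sibling of head.
print(soup.head.next_sibling.name)  # body

# In parse order, the tag opened right after <head> is <title>.
print(soup.head.next_element.name)  # title

# After <title> opens, its text is parsed before anything else.
title_text = soup.title.next_element
print(title_text)                    # Sample page
print(title_text.next_element.name)  # body
```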

In addition, we can traverse the document tree in parse order with .next_elements and .previous_elements. Line breaks also take a place in the parsing order, just as they do when iterating over sibling nodes.
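A sketch of both iterators on a made-up fragment. islice is used only to keep the backward walk short; it is not part of Beautiful Soup.

```python
from itertools import islice

from bs4 import BeautifulSoup

# Illustrative markup, not the author's original sample.
html_doc = "<div><p>One</p>\n<p>Two</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
first_p, second_p = soup.find_all("p")

# Forward from the first <p>: its text, the line break, the second <p>, its text.
forward = [str(node) for node in first_p.next_elements]
print(forward)  # ['One', '\n', '<p>Two</p>', 'Two']

# Backward from the second <p>, first three objects only.
backward = [str(node) for node in islice(second_p.previous_elements, 3)]
print(backward)  # ['\n', 'One', '<p>One</p>']
```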

This article introduced where Beautiful Soup is useful and the basic operations on document tree nodes. There may seem to be many things to remember, but they follow rules. Naming is one example: the attribute that iterates over sibling nodes or elements is simply the plural form of the attribute that returns a single node, as with .next_sibling and .next_siblings.

HTML and XML documents have a recursively nested structure, which makes them troublesome to work with. Mastering the basic node operations in this article helps you write crawler programs more efficiently.