Class: HTMLReader

Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the inner text of the tag. All other tags are removed, and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
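
The stripping rules above can be illustrated with a small, self-contained sketch. It is only a regex-based approximation written for this page: HTMLReader itself delegates to the string-strip-html library, the function name stripHtmlSketch is hypothetical, and the a[href] URL extraction is deliberately left out.

```typescript
// Illustrative only: a rough approximation of the documented behavior,
// not the library's implementation. `stripHtmlSketch` is a hypothetical name.
function stripHtmlSketch(html: string): string {
  // 1. Drop head, script, style, and xml elements together with their contents.
  const withoutBlocks = html.replace(
    /<(head|script|style|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi,
    " ",
  );
  // 2. Drop every remaining tag but keep its inner text.
  const withoutTags = withoutBlocks.replace(/<\/?[a-zA-Z][^>]*>/g, " ");
  // 3. Collapse whitespace. Entities such as &amp; are left encoded.
  return withoutTags.replace(/\s+/g, " ").trim();
}

const sample =
  "<html><head><title>T</title></head><body><h1>Hello</h1>" +
  "<script>var x = 1;</script><p>Fish &amp; chips</p></body></html>";
const stripped = stripHtmlSketch(sample);
console.log(stripped); // Hello Fish &amp; chips
```

Note that the title and the script body disappear entirely, while the heading and paragraph text survive with the entity still encoded.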

Implements

BaseReader

Constructors

constructor

• new HTMLReader(): HTMLReader

Returns

HTMLReader

Methods

getOptions

▸ getOptions(): Object

Returns the configuration options that are passed to the string-strip-html library.

Returns

Object

An object of options for the underlying library

Name	Type
`skipHtmlDecoding`	`boolean`
`stripTogetherWithTheirContents`	`string`[]

See

https://codsen.com/os/string-strip-html/examples

Defined in

packages/core/src/readers/HTMLReader.ts:48
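
As a rough sketch of what such an options object might look like: the field names come from the table above, but the values shown here are assumptions inferred from the class description, not copied from the library source. The interface name StripOptionsSketch and the extra "nav" entry are hypothetical.

```typescript
// Field names follow the documented return type; the interface name is hypothetical.
interface StripOptionsSketch {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
}

// Assumed defaults, inferred from the class description (not verified).
const assumedDefaults: StripOptionsSketch = {
  skipHtmlDecoding: true, // entities such as &amp; stay encoded
  stripTogetherWithTheirContents: ["head", "script", "style", "xml"],
};

// Hypothetical customization: also drop <nav> elements with their contents.
const customOptions: StripOptionsSketch = {
  ...assumedDefaults,
  stripTogetherWithTheirContents: [
    ...assumedDefaults.stripTogetherWithTheirContents,
    "nav",
  ],
};
console.log(customOptions);
```

An object of this shape could then be handed to parseContent as its options argument, since that parameter is documented as the options for the underlying library.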

loadData

▸ loadData(file, fs?): Promise<Document<Metadata>[]>

Public entry point for this reader. Required by the BaseReader interface.

Parameters

Name	Type	Default value	Description
`file`	`string`	`undefined`	Path/name of the file to be loaded.
`fs`	`GenericFileSystem`	`defaultFS`	fs wrapper interface for getting the file content.

Returns

Promise<Document<Metadata>[]>

A Promise that eventually yields zero or one Document parsed from the HTML content of the specified file.

Implementation of

BaseReader.loadData

Defined in

packages/core/src/readers/HTMLReader.ts:21
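
Because the fs parameter is injectable, the reader can in principle be exercised without touching the disk. The sketch below shows that pattern with stand-in types only: FsSketch, DocumentSketch, makeMemoryFs, and loadDataSketch are all hypothetical names, and the real GenericFileSystem and Document types may carry more members than assumed here.

```typescript
// Hypothetical stand-ins for the library's GenericFileSystem and Document types.
interface FsSketch {
  readFile(path: string): Promise<string>;
}
interface DocumentSketch {
  id: string;
  text: string;
}

// In-memory file system: maps file names to their contents.
function makeMemoryFs(files: Record<string, string>): FsSketch {
  return {
    async readFile(path: string): Promise<string> {
      const content = files[path];
      if (content === undefined) throw new Error(`No such file: ${path}`);
      return content;
    },
  };
}

// Assumed flow: read the file through fs, strip the tags, wrap the text in a document.
async function loadDataSketch(file: string, fs: FsSketch): Promise<DocumentSketch[]> {
  const html = await fs.readFile(file);
  // Crude tag removal for illustration; the real reader uses string-strip-html.
  const text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  return [{ id: file, text }];
}

const memoryFs = makeMemoryFs({ "page.html": "<p>Hello <b>world</b></p>" });
const docsPromise = loadDataSketch("page.html", memoryFs);
docsPromise.then((docs) => console.log(docs));
```

Passing the file system as an argument keeps the parsing logic independent of where the HTML is stored, which is what makes the optional fs parameter useful in tests.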

parseContent

▸ parseContent(html, options?): Promise<string>

Wraps the call to the string-strip-html library.

Parameters

Name	Type	Description
`html`	`string`	Raw HTML content to be parsed.
`options`	`any`	An object of options for the underlying library

Returns

Promise<string>

The HTML content, stripped of unwanted tags and attributes

See

getOptions

Defined in

packages/core/src/readers/HTMLReader.ts:38
