Convert document to XML

General

In this article we are going to create an XML file from a text file.

We use a .txt file as the source to convert to an XML file, but other document types can be used as well, for example to convert a PDF to XML.

Image

Configuration

This article assumes that you know how to create and configure a File Processor channel.

  1. Create a new channel and call it TXT to XML.
  2. Input: Select Local/Network and specify the path where the .TXT files are located.
    In our case we use path: E:\FP\Winking\Ricoh\in
  3. Input Filter: We only want to process our .TXT files, therefore we apply an Input Filter.
    1. Click the Add button and add a Property-Filter Type.
    2. Configure the property: File name, Match Regex with regular expression: (?i).*[.]txt$
      The regular expression accepts only file names with a .txt extension, regardless of letter case.
  4. Output: Select Local/Network and specify the path where the resulting .XML files should be saved.
    In our case we use path: E:\FP\Winking\Ricoh\result
  5. Post-Action: We want to delete the original input file after successful processing. Use Method-Type with option Delete (input file).
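
The effect of the Input Filter expression can be checked outside File Processor. The following is a minimal Python sketch, assuming the filter is applied to the bare file name and that Python's regex dialect behaves like the one used by File Processor for this simple pattern. The helper name passes_filter is hypothetical and only used for illustration.

```python
import re

# Same pattern as in the Input Filter step.
# (?i) makes the match case-insensitive, [.] is a literal dot, $ anchors the end.
TXT_FILTER = re.compile(r"(?i).*[.]txt$")


def passes_filter(file_name: str) -> bool:
    """Return True if the file name ends in .txt (any letter case)."""
    return TXT_FILTER.match(file_name) is not None


# Try a few file names to see which ones the filter would let through.
for name in ["books.txt", "BOOKS.TXT", "books.txt.bak", "books.xml"]:
    print(name, passes_filter(name))
```

Only the first two names pass, because the extension must be the last part of the file name.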

Now we have configured a basic channel which moves files with a .txt extension from one folder to another. As you may have noticed, we did not add a Converter in the steps above.
In the following steps we will configure the conversion to an XML file.

Configuring the XML conversion

Our goal is to convert a text file to an XML file. Therefore we will use a converter. These steps will guide you through configuring such a converter. Your actual implementation might differ from this example.

Our text file looks like this:

Title: Schaum's Outline of Signals and Systems
Author: Hwei Hsu
ISBN10: 0070306419
Pages: 470

Title: WPF 4 Unleashed
Author: Adam Nathan
ISBN10: 0672331195
Pages: 825

...

Our XML structure will look like this:

<?xml version="1.0" encoding="Windows-1252"?>
<Books>
  <Book>
    <Title />
    <Author />
    <ISBN />
    <Pages />
  </Book>
  <Book>
    <Title />
    ...
  </Book>
  ...
</Books>

Now we will create the schema for this XML structure:

  1. In the Channel Options, go to the Conversion-tab.
  2. Click Add Converter.
  3. Select Add to xml.
  4. In the Schema-section, give the root-element a name: Books.

    Image

  5. Click the Add-button, next to the root-element and choose Container element (repeating) and click Add.
    We have a repeating container because we have multiple books.

    Image
    Image

  6. Select the newly created element and on the right side in the properties panel, give the repeating container a name: Book.
  7. Click the button next to the Start repeating group: label and disable the checkbox.

    Image

  8. Next we will add the child elements for a Book-element.
  9. Click the Add-button next to the Book-element, select String and click Add.

    Image
  10. Select the newly added child element and, on the right side, fill in Title next to the Name: label.

    Image

  11. Click on the button next to the Recognition: label to configure the recognition.
    Here we will use recognition and define which text we should take from the .txt file.
  12. Enable the Enable recognition for: Title checkbox.
  13. Select Property: Content.
  14. For Content Filter: use Label.
  15. Label: Whole word and fill in Title:
  16. Value position: Right
  17. Value type: Everything
  18. Click the Add-button at the bottom and select Trim. Change the value Start to Both.
  19. Click OK to close the recognition dialog.
    Image
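
The recognition settings above (Label, Value position: Right, Value type: Everything, Trim: Both) can be pictured with a small Python sketch. This is only an illustration of the idea, not how File Processor implements recognition. The function name value_right_of_label is hypothetical, and the Whole word option is not modelled here.

```python
from typing import Optional


def value_right_of_label(line: str, label: str) -> Optional[str]:
    """Return the text to the right of `label` on this line, trimmed on both sides.

    Returns None when the label does not occur on the line.
    """
    position = line.find(label)
    if position == -1:
        return None
    # Value position: Right, Value type: Everything -> take the rest of the line.
    value = line[position + len(label):]
    # Trim: Both -> remove leading and trailing whitespace.
    return value.strip()


print(value_right_of_label("Title: WPF 4 Unleashed", "Title:"))
```

Without the trim step the value would keep the space that follows the colon in the text file.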

So far we have added the Title element.

Image

If we now test our XML-converter by putting our text file in the input folder and starting the File Processor, the resulting file will have this content if you configured everything correctly:

<?xml version="1.0" encoding="Windows-1252"?>
<Books>
  <Book>
    <Title>Schaum's Outline of Signals and Systems</Title>
  </Book>
  <Book>
    <Title>WPF 4 Unleashed</Title>
  </Book>
  <Book>
    <Title>Mastering Serial Communications</Title>
  </Book>
</Books>

Now you can add the other elements, such as Author, ISBN and Pages, to complete your XML.
After adding the other elements your schema should look similar to this:

Image

Test Files

We used a text file (.txt) called books.txt with the following content:

Title: Schaum's Outline of Signals and Systems
Author: Hwei Hsu
ISBN10: 0070306419
Pages: 470

Title: WPF 4 Unleashed
Author: Adam Nathan
ISBN10: 0672331195
Pages: 825

Title: Mastering Serial Communications
Author: Peter W. Gofton
ISBN10: 0895881802
Pages: 289
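
To cross-check the XML you expect from the completed schema, the same test file can be converted with a short standalone script. The sketch below uses Python's xml.etree.ElementTree and does not involve File Processor. It assumes that a blank line separates two books and that the ISBN10: label maps to the ISBN element; the function name books_to_xml is hypothetical. It does not write the XML declaration with the Windows-1252 encoding.

```python
import xml.etree.ElementTree as ET

# Label in the text file -> element name in the XML (ISBN10: becomes <ISBN>).
LABELS = {"Title:": "Title", "Author:": "Author", "ISBN10:": "ISBN", "Pages:": "Pages"}


def books_to_xml(text: str) -> ET.Element:
    """Build a <Books> tree with one <Book> per blank-line-separated block."""
    root = ET.Element("Books")
    book = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            book = None  # assumption: a blank line ends the current book
            continue
        for label, element_name in LABELS.items():
            if line.startswith(label):
                if book is None:
                    book = ET.SubElement(root, "Book")
                # Take everything right of the label and trim both sides.
                ET.SubElement(book, element_name).text = line[len(label):].strip()
                break
    return root


SAMPLE = """\
Title: Schaum's Outline of Signals and Systems
Author: Hwei Hsu
ISBN10: 0070306419
Pages: 470

Title: WPF 4 Unleashed
Author: Adam Nathan
ISBN10: 0672331195
Pages: 825

Title: Mastering Serial Communications
Author: Peter W. Gofton
ISBN10: 0895881802
Pages: 289
"""

root = books_to_xml(SAMPLE)
ET.indent(root)  # pretty-print; requires Python 3.9 or later
print(ET.tostring(root, encoding="unicode"))
```

The values are kept as text, so the leading zero of an ISBN such as 0070306419 is preserved.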