Multi-Value Fields and Buffers

I need to update group membership in Active Directory. This is done by adding a DN to the group’s multi-valued ‘member’ attribute.

The problem is that when reading a multi-valued attribute from, or writing one to, Active Directory using the LDAPReader and LDAPWriter, the entire multi-valued attribute ends up in a single field on a single record. Like this…

member=“cn=user1,dc=domain,dc=com|cn=user2,dc=domain,dc=com|cn=user3,dc=domain,dc=com|…”

…so when I want to add a user to a group, I…

  1. Read the existing group record.
  2. Use the Normalize component to split the multiple values stored in the ‘member’ attribute into separate records.
  3. Add in the new records.
  4. Use Denormalize to stick all of the values back into a single record.
  5. Update the attribute.

…the problem is, I may have 10,000 group members, which gives me 25,000+ characters in the string that makes up the field (a rough sketch of steps 2–4 is below).
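For reference, outside of CloverETL the manipulation itself is trivial. This is only a plain-Java sketch of what steps 2–4 amount to, not the actual Normalize/Denormalize transforms, and the names and values are purely illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MemberListDemo {
    public static void main(String[] args) {
        // The pipe-delimited 'member' value as it comes out of LDAPReader.
        String member = "cn=user1,dc=domain,dc=com|cn=user2,dc=domain,dc=com";

        // Step 2: split the single field into individual DNs.
        List<String> dns = new ArrayList<>(Arrays.asList(member.split("\\|")));

        // Step 3: add the new member.
        dns.add("cn=user3,dc=domain,dc=com");

        // Step 4: join everything back into one field for LDAPWriter.
        String updated = String.join("|", dns);
        System.out.println(updated);
    }
}
```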

CloverETL keeps running out of buffer memory, so I’ve updated the defaultProperties file to increase the memory…

Record.MAX_RECORD_SIZE = 1024000
DEFAULT_INTERNAL_IO_BUFFER_SIZE = 2048000

…which doesn’t feel right, considering the comment for MAX_RECORD_SIZE states…

“…keep it under 64K”

So I have two questions…

…is it okay to just keep increasing the buffer sizes until I can fit everything?

…is there a more memory-efficient way to read and write large multi-valued attributes from and to LDAP?

Thanks…

The MAX_RECORD_SIZE constant is used in various places by the CloverETL engine during transformation execution. It may be set to values much higher than 64k; you just need to be aware that memory consumption will grow accordingly. Each edge (in the typical case) allocates a buffer of 4x MAX_RECORD_SIZE for data processing, and most components allocate an extra 1x MAX_RECORD_SIZE buffer for their internal purposes.
Thus a simple graph with 3 components and 2 edges may end up allocating 1+1+1+4+4 (11) times MAX_RECORD_SIZE of memory, which is not a problem on most of today’s systems. Only once you start running transformations with 50+ components does it actually begin to play a role.
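As an illustration of that estimate (just the multipliers above, nothing CloverETL-specific), a back-of-the-envelope calculation in Java:

```java
// Rough buffer-memory estimate using the rule of thumb above:
// ~1x MAX_RECORD_SIZE per component, ~4x MAX_RECORD_SIZE per edge.
public class BufferEstimate {
    public static void main(String[] args) {
        long maxRecordSize = 1_024_000L; // value from defaultProperties in the question
        int components = 3;
        int edges = 2;

        long bytes = (components * 1L + edges * 4L) * maxRecordSize;
        // (3 + 8) x MAX_RECORD_SIZE = 11 x 1,024,000 bytes, roughly 10.7 MB
        System.out.printf("Estimated buffer memory: %.1f MB%n", bytes / (1024.0 * 1024.0));
    }
}
```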

One option is to use a BYTE or CBYTE (compressed byte) type field instead of STRING. These are also capable of storing strings and tend to occupy much less memory, so a lower MAX_RECORD_SIZE can be used to accommodate your long string values. When converting a string to bytes, the DEFAULT_CHARSET_ENCODER is used - typically set to ISO-8859-1. You may need to set it to UTF-8 if your strings contain characters outside of the default codepage.
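To show why the encoder choice matters, here is a minimal plain-Java sketch (not the CloverETL API) of what happens to a DN containing a character outside ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // A DN containing a character that ISO-8859-1 cannot represent.
        String dn = "cn=Łukasz,dc=domain,dc=com";

        byte[] latin1 = dn.getBytes(StandardCharsets.ISO_8859_1); // 'Ł' is replaced with '?' (lossy)
        byte[] utf8   = dn.getBytes(StandardCharsets.UTF_8);      // 'Ł' encoded as two bytes (lossless)

        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // cn=?ukasz,...
        System.out.println(new String(utf8, StandardCharsets.UTF_8));        // cn=Łukasz,...
    }
}
```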